YfinanceValidation

Validates ticker symbols and OHLCV data quality using yfinance test downloads and Great Expectations.

Component Separation

This class handles validation only. For ticker lists, see YfinanceTickers; for data scraping, see YfinancePipeline.

Validation Checks

| Category | Checks | Purpose |
| --- | --- | --- |
| Schema | Required columns (Open, High, Low, Close, Volume) | Ensure complete data structure |
| Nulls | Zero NaN/null tolerance | Catch incomplete data |
| Price Logic | High ≥ Low, Open/Close within the High/Low range, all prices ≥ $0.01 | Detect bad data or API errors |
| Data Quality | Std dev ≥ 0.01 (Close and Volume), min 10 rows, unique dates | Catch constant values and duplicates |
| Volume | Non-negative (0 allowed) | Validate trading activity |
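
The price and volume rules in the table can be illustrated without Great Expectations. The following is a minimal sketch in plain pandas, not the class's implementation; the helper name `price_logic_violations` and the sample data are hypothetical, and the threshold mirrors `MIN_PRICE_VALUE`.

```python
import pandas as pd

MIN_PRICE_VALUE = 0.01  # same threshold as the class constant


def price_logic_violations(df: pd.DataFrame) -> pd.DataFrame:
    """Return the rows that break any price-logic, price-bound, or volume rule."""
    prices = df[["Open", "High", "Low", "Close"]]
    bad = (
        (df["High"] < df["Low"])
        | (df["Open"] > df["High"]) | (df["Open"] < df["Low"])
        | (df["Close"] > df["High"]) | (df["Close"] < df["Low"])
        | (prices < MIN_PRICE_VALUE).any(axis=1)
        | (df["Volume"] < 0)
    )
    return df[bad]


# Made-up sample: the second row has a High below its Open.
sample = pd.DataFrame(
    {
        "Open": [10.0, 11.0, 12.0],
        "High": [10.5, 10.8, 12.4],
        "Low": [9.8, 10.6, 11.9],
        "Close": [10.2, 10.7, 12.1],
        "Volume": [1000, 0, 1500],  # zero volume is allowed
    }
)
violations = price_logic_violations(sample)
print(violations.index.tolist())
```

Returning the offending rows rather than a single boolean makes it easier to see which dates a data provider got wrong.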

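How Many Checks?

Counting the expectations built in validate_ohlcv: Schema (1) + Nulls (5) + Price bounds (4) + Volume (1) + Price logic (5: High ≥ Low plus four Open/Close range checks) + Data quality (3: two standard deviation checks and one row count) + Unique dates (1) = 20 checks. The unique-dates check is added only when the DataFrame has a Date column, so the total is 19 without one.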

How It Works

Runs up to 20 Great Expectations checks (19 when the DataFrame has no Date column) to catch API glitches, incomplete data, and bad values before they hit the database. Checks cover schema completeness, null values, price logic (High >= Low, Open/Close within bounds), data quality (stddev >= 0.01, min 10 rows, unique dates), and reasonable values (prices >= $0.01, volume >= 0).

Returns a dict with the overall verdict, the pass/fail counts, and the names of the failed expectations. If the Great Expectations framework itself fails, raises an exception for Airflow to retry.
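
A caller has to decide what to do with the returned dict. The sketch below shows one way to turn a failed result into an exception; `OHLCVValidationError` and `require_valid` are hypothetical names that are not part of the class, and the example result uses made-up values.

```python
from typing import Any, Dict


class OHLCVValidationError(ValueError):
    """Hypothetical error type for data that fails one or more checks."""


def require_valid(result: Dict[str, Any], ticker: str) -> None:
    """Raise if the validation result reports any failed check."""
    if not result["valid"]:
        names = ", ".join(result["failed_checks"])
        raise OHLCVValidationError(
            f"{ticker}: {result['failed']}/{result['total_checks']} checks failed ({names})"
        )


# Example result shaped like the dict described above (values are made up).
example = {
    "valid": False,
    "total_checks": 20,
    "passed": 19,
    "failed": 1,
    "failed_checks": ["ExpectColumnValuesToNotBeNull"],
}

try:
    require_valid(example, "AAPL")
except OHLCVValidationError as err:
    message = str(err)
    print(message)
```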

Validation Failures

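Framework errors inside validate_ohlcv raise an exception, so they never pass silently; Airflow then retries the task with exponential backoff. Data-quality failures are reported through return values instead (valid is False in the result dict, or False from validate_ticker), so the calling task must check them and raise if the run should fail.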
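
Retry behavior is configured on the Airflow side, not in this class. As a rough sketch, the dict below uses the standard Airflow 2.x operator arguments `retries`, `retry_delay`, `retry_exponential_backoff`, and `max_retry_delay`; the values are illustrative assumptions, not the project's actual settings.

```python
from datetime import timedelta

# Hypothetical retry settings for the task that calls the validator.
# The keys are Airflow 2.x operator arguments; the values are assumptions.
default_args = {
    "retries": 3,                              # attempts after the first failure
    "retry_delay": timedelta(minutes=1),       # base delay between attempts
    "retry_exponential_backoff": True,         # lengthen the delay on each retry
    "max_retry_delay": timedelta(minutes=30),  # upper bound on the delay
}

print(default_args["retries"])
```

A dict like this is typically passed as `default_args` when the DAG is defined, so every task in it inherits the same retry policy.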

Constants

| Constant | Value | Purpose |
| --- | --- | --- |
| TICKER_VALIDATION_TEST_DAYS | 21 | Calendar days for ticker test |
| MIN_TRADING_DAYS_FOR_VALIDATION | 10 | Minimum trading days required |
| MIN_OHLCV_ROWS_FOR_VALIDATION | 10 | Minimum rows for validation |
| MIN_PRICE_VALUE | 0.01 | Minimum valid price |
| MIN_STDDEV_VALUE | 0.01 | Minimum standard deviation |

Ticker Validation

Each ticker is validated by downloading 21 calendar days of data (about 15 trading days). A ticker is valid only if the download returns at least 10 rows, one per trading day. This catches delisted stocks, bad symbols, and API issues before bulk downloading.
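
To screen a whole list before a bulk download, the boolean result can drive a simple split. This is a minimal sketch: `partition_tickers` and `fake_check` are hypothetical helpers, and the stand-in check replaces the real network call so the example runs offline.

```python
from typing import Callable, Iterable, List, Tuple


def partition_tickers(
    tickers: Iterable[str], is_valid: Callable[[str], bool]
) -> Tuple[List[str], List[str]]:
    """Split tickers into (valid, rejected) using the supplied check."""
    valid: List[str] = []
    rejected: List[str] = []
    for ticker in tickers:
        (valid if is_valid(ticker) else rejected).append(ticker)
    return valid, rejected


# Stand-in check so the sketch runs offline; in the pipeline this would be
# YfinanceValidation().validate_ticker, which performs a real download.
def fake_check(ticker: str) -> bool:
    return ticker != "DELISTED"


valid, rejected = partition_tickers(["AAPL", "DELISTED", "MSFT"], fake_check)
print(valid, rejected)
```

Keeping the rejected list makes it possible to log or report dropped symbols instead of discarding them silently.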

Why 21 Days?

Any 21 consecutive calendar days contain exactly 15 weekdays. Requiring only 10 of them leaves room for market holidays and for newly listed stocks with a short history.
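
The weekday arithmetic can be checked directly. The sketch below counts Monday to Friday dates in a 21-day window; `weekdays_in_window` is a hypothetical helper, and the half-open window assumes the download treats its end date as exclusive. Market holidays are not modeled.

```python
from datetime import date, timedelta


def weekdays_in_window(end: date, calendar_days: int = 21) -> int:
    """Count Monday-Friday dates in [end - calendar_days, end)."""
    start = end - timedelta(days=calendar_days)
    return sum(
        1
        for offset in range(calendar_days)
        if (start + timedelta(days=offset)).weekday() < 5
    )


# Try every possible weekday for the end date: the count never changes,
# because 21 days always span exactly three full weeks.
counts = {weekdays_in_window(date(2024, 1, 1) + timedelta(days=i)) for i in range(7)}
print(counts)  # {15}
```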

API Reference

Validation for ticker symbols and OHLCV data quality

- Ticker validation with test downloads
- OHLCV validation with Great Expectations

Source code in data_pipeline/sec_data_pipeline/yfinance/yfinance_validation.py
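# Imports used by the class (the module's own import block is not shown in this listing)
from datetime import date, timedelta
from typing import Any, Dict

import great_expectations as gx
import pandas as pd
import yfinance as yf


class YfinanceValidation: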
    """
    Validation for ticker symbols and OHLCV data quality
    - Ticker validation with test downloads
    - OHLCV validation with Great Expectations
    """

    # Validation constants
    TICKER_VALIDATION_TEST_DAYS = 21  # Calendar days to test ticker validity
    MIN_TRADING_DAYS_FOR_VALIDATION = 10  # Minimum trading days required for valid ticker
    MIN_OHLCV_ROWS_FOR_VALIDATION = 10  # Minimum rows required for OHLCV validation
    MIN_PRICE_VALUE = 0.01  # Minimum valid price (prevents zero/negative prices)
    MIN_STDDEV_VALUE = 0.01  # Minimum standard deviation (detects constant values)

    def validate_ticker(self, ticker: str, test_days: int = 21) -> bool:
        """
        Validates a ticker by test-downloading recent data and checking if yfinance
        returns enough trading days. Catches delisted stocks, bad symbols, and API
        issues before bulk downloading.

        Downloads 21 calendar days (about 15 trading days) and checks if we got
        at least 10 trading days back. Threshold accounts for weekends, holidays,
        and newly listed stocks.

        Args:
            ticker: Symbol to validate
            test_days: Calendar days to test (default 21)

        Returns:
            True if yfinance returned >=10 trading days, False otherwise
        """

        # Calculate date range
        end_date = date.today()
        start_date = end_date - timedelta(days=test_days)

        # Download single ticker with explicit auto_adjust
        data = yf.download(
            tickers=ticker,
            start=start_date,
            end=end_date,
            progress=False,
            threads=False,
            auto_adjust=True  # Explicitly set to avoid warning
        )

        # Valid only if the download returned enough trading days
        # (len() of an empty DataFrame is 0, so no separate empty check is needed)
        return data is not None and len(data) >= self.MIN_TRADING_DAYS_FOR_VALIDATION

    def validate_ohlcv(self, df: pd.DataFrame, ticker: str) -> Dict[str, Any]:
        """
        Runs up to 20 Great Expectations checks on OHLCV data to catch API glitches,
        incomplete data, and bad values before they hit the database.

        Checks cover schema completeness, null values, price logic (High >= Low,
        Open/Close within bounds), data quality (stddev >= 0.01, min 10 rows,
        unique dates), and reasonable values (prices >= $0.01, volume >= 0).
        The unique-date check runs only when df has a Date column.

        Args:
            df: DataFrame with Open, High, Low, Close, Volume columns
            ticker: Symbol for error messages

        Returns:
            Dict with validation results:
                - valid (bool): True if all checks passed
                - total_checks (int): 20, or 19 when df has no Date column
                - passed (int): Checks that passed
                - failed (int): Checks that failed
                - failed_checks (list): Names of failed expectations

        Raises:
            Exception: If Great Expectations framework fails
        """
        # Note: Great Expectations type stubs are incomplete, so static type
        # checkers may report false positives in this method.

        # Suppress GX progress bars and warnings
        import warnings
        import logging
        import os
        import sys

        # Caution: this silences warnings process-wide, not only for this call
        warnings.filterwarnings('ignore')

        # Disable the GX loggers
        logging.getLogger('great_expectations').disabled = True
        logging.getLogger('great_expectations.core').disabled = True
        logging.getLogger('great_expectations.data_context').disabled = True

        os.environ['GX_ANALYTICS_ENABLED'] = 'False'

        # Redirect stderr to suppress tqdm progress bars; the finally block
        # below restores stderr and closes the devnull handle on every path
        old_stderr = sys.stderr
        devnull = open(os.devnull, 'w')
        sys.stderr = devnull

        try:
            context = gx.get_context()
            data_source = context.data_sources.add_pandas(name=f"{ticker}_source")
            data_asset = data_source.add_dataframe_asset(name=f"{ticker}_asset")
            batch_def = data_asset.add_batch_definition_whole_dataframe(f"{ticker}_batch")

            # Comprehensive expectations for backtest-quality data
            expectations = [
                # 1. Schema validation - required columns exist (order doesn't matter)
                # Note: if exact_match is left at its default (True, as far as we know),
                # extra columns such as Date make this check fail
                gx.expectations.ExpectTableColumnsToMatchSet(
                    column_set={"Open", "High", "Low", "Close", "Volume"}
                ),

                # 2. Null/NaN validation - zero tolerance
                gx.expectations.ExpectColumnValuesToNotBeNull(column="Open"),
                gx.expectations.ExpectColumnValuesToNotBeNull(column="High"),
                gx.expectations.ExpectColumnValuesToNotBeNull(column="Low"),
                gx.expectations.ExpectColumnValuesToNotBeNull(column="Close"),
                gx.expectations.ExpectColumnValuesToNotBeNull(column="Volume"),

                # 3. Positive price validation - every price >= MIN_PRICE_VALUE
                gx.expectations.ExpectColumnValuesToBeBetween(column="Open", min_value=self.MIN_PRICE_VALUE),
                gx.expectations.ExpectColumnValuesToBeBetween(column="High", min_value=self.MIN_PRICE_VALUE),
                gx.expectations.ExpectColumnValuesToBeBetween(column="Low", min_value=self.MIN_PRICE_VALUE),
                gx.expectations.ExpectColumnValuesToBeBetween(column="Close", min_value=self.MIN_PRICE_VALUE),

                # 4. Volume validation - non-negative only (0 volume is valid)
                gx.expectations.ExpectColumnValuesToBeBetween(column="Volume", min_value=0),

                # 5. Price logic validation - High >= Low
                gx.expectations.ExpectColumnPairValuesAToBeGreaterThanB(
                    column_A="High", column_B="Low", or_equal=True
                ),

                # 6. Open/Close within High/Low range
                gx.expectations.ExpectColumnPairValuesAToBeGreaterThanB(
                    column_A="High", column_B="Open", or_equal=True
                ),
                gx.expectations.ExpectColumnPairValuesAToBeGreaterThanB(
                    column_A="Open", column_B="Low", or_equal=True
                ),
                gx.expectations.ExpectColumnPairValuesAToBeGreaterThanB(
                    column_A="High", column_B="Close", or_equal=True
                ),
                gx.expectations.ExpectColumnPairValuesAToBeGreaterThanB(
                    column_A="Close", column_B="Low", or_equal=True
                ),

                # 7. Data variability - no constant values (stddev >= MIN_STDDEV_VALUE)
                gx.expectations.ExpectColumnStdevToBeBetween(column="Close", min_value=self.MIN_STDDEV_VALUE),
                gx.expectations.ExpectColumnStdevToBeBetween(column="Volume", min_value=self.MIN_STDDEV_VALUE),

                # 8. Minimum row count - at least 10 trading days
                gx.expectations.ExpectTableRowCountToBeBetween(min_value=self.MIN_OHLCV_ROWS_FOR_VALIDATION),
            ]

            # 9. Unique dates - no duplicate timestamps (only when Date is a column)
            if "Date" in df.columns:
                expectations.append(gx.expectations.ExpectColumnValuesToBeUnique(column="Date"))

            # Run all validations and collect results
            batch = batch_def.get_batch(batch_parameters={"dataframe": df})

            results = []
            failed_checks = []

            for expectation in expectations:
                result = batch.validate(expectation)
                results.append(result)

                if not result.success:
                    # Record the expectation class name for logging
                    failed_checks.append(type(expectation).__name__)

            passed = sum(1 for r in results if r.success)
            failed = len(results) - passed

            return {
                'valid': failed == 0,
                'total_checks': len(results),
                'passed': passed,
                'failed': failed,
                'failed_checks': failed_checks
            }

        except Exception as e:
            # If validation setup fails, raise exception for Airflow to retry
            raise Exception(f"OHLCV validation framework failed for {ticker}: {e}") from e

        finally:
            # Restore stderr and release the devnull handle
            sys.stderr = old_stderr
            devnull.close()

validate_ohlcv(df, ticker)

Runs up to 20 Great Expectations checks on OHLCV data to catch API glitches, incomplete data, and bad values before they hit the database.

Checks cover schema completeness, null values, price logic (High >= Low, Open/Close within bounds), data quality (stddev >= 0.01, min 10 rows, unique dates), and reasonable values (prices >= $0.01, volume >= 0). The unique-date check runs only when df has a Date column.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| df | DataFrame | DataFrame with Open, High, Low, Close, Volume columns | required |
| ticker | str | Symbol for error messages | required |

Returns:

| Type | Description |
| --- | --- |
| Dict[str, Any] | Dict with validation results: valid (bool), True if all checks passed; total_checks (int), 20, or 19 when df has no Date column; passed (int), checks that passed; failed (int), checks that failed; failed_checks (list), names of failed expectations |

Raises:

| Type | Description |
| --- | --- |
| Exception | If Great Expectations framework fails |

validate_ticker(ticker, test_days=21)

Validates a ticker by test-downloading recent data and checking if yfinance returns enough trading days. Catches delisted stocks, bad symbols, and API issues before bulk downloading.

Downloads 21 calendar days (about 15 trading days) and checks if we got at least 10 trading days back. Threshold accounts for weekends, holidays, and newly listed stocks.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| ticker | str | Symbol to validate | required |
| test_days | int | Calendar days to test (default 21) | 21 |

Returns:

| Type | Description |
| --- | --- |
| bool | True if yfinance returned >=10 trading days, False otherwise |
