Skip to content

Historical Parallel DAG

Bulk historical data loader. Scrapes ticker lists, validates them, downloads multi-year OHLCV data, and stores everything in the database.

What It Does

Gets ticker lists from public sources (Wikipedia, iShares, NASDAQ), validates each one by test-downloading recent data, then downloads the full historical range you specify. Everything runs in parallel - up to 20 tickers processing at once.

Good for initial database setup or backfilling new ticker lists.

Configuration

Parameter Options Default Description
years_back 1-10 10 Years of historical data to download
ticker_source sp500, russell3000, nasdaq, all sp500 Which ticker list to use
ticker_limit 0-5000 0 Limit tickers (0 = no limit)

Trigger: Manual via Airflow UI

Task Flow

%%{init: {'theme':'base', 'themeVariables': { 'fontSize':'14px'}}}%%
graph LR
    A[get_tickers] --> B[convert_to_list]
    B --> C[validate_ticker<br/>PARALLEL 20x]
    C --> D[filter_valid]
    D --> E[register_security<br/>PARALLEL 20x]
    E --> F[download_ohlcv<br/>PARALLEL 20x]
    F --> G[validate_ohlcv<br/>PARALLEL 20x]
    G --> H[insert_ohlcv<br/>PARALLEL 20x]
    H --> I[download_metadata<br/>PARALLEL 20x]
    I --> J[insert_metadata<br/>PARALLEL 20x]
    J --> K[generate_report]

    style A fill:#f0f0f0,stroke:#333,color:#000
    style B fill:#f0f0f0,stroke:#333,color:#000
    style C fill:#e1f5ff,stroke:#333,color:#000
    style D fill:#f0f0f0,stroke:#333,color:#000
    style E fill:#e1f5ff,stroke:#333,color:#000
    style F fill:#e1f5ff,stroke:#333,color:#000
    style G fill:#e1f5ff,stroke:#333,color:#000
    style H fill:#e1f5ff,stroke:#333,color:#000
    style I fill:#e1f5ff,stroke:#333,color:#000
    style J fill:#e1f5ff,stroke:#333,color:#000
    style K fill:#d4edda,stroke:#333,color:#000

Parallelization

  • Max concurrent: 20 tasks at once (max_active_tis_per_dag)
  • Independence: Each ticker processes separately
  • Failure handling: Failed tickers don't block others (trigger_rule: 'all_done')

Retry Strategy

Task Retries Delay Why
validate_ticker 2 1 min Network flakiness
register_security 2 1 min DB connection issues
download_ohlcv 3 2 min yfinance rate limits
validate_ohlcv 1 1 min Validation rarely fails temporarily
insert_ohlcv 3 1 min DB under load
download_metadata 3 2 min Network issues
insert_metadata 3 1 min DB under load

All use exponential backoff.

Example Run

# Parameters
ticker_source = 'sp500'
years_back = 1
ticker_limit = 5

# Result
5 tickers × 1 year  2 minutes
# Parameters
ticker_source = 'sp500'
years_back = 10
ticker_limit = 0

# Result
~500 tickers × 10 years  2-3 hours
# Parameters
ticker_source = 'all'
years_back = 10
ticker_limit = 0

# Result
~6000 tickers × 10 years  12-18 hours