This project is a training lab to build a real Airflow ETL pipeline.
The business case:
- Extract daily historical crypto market snapshots from CoinGecko.
- Transform nested API payloads into a clean tabular shape.
- Load data into Postgres with idempotent upsert logic so DAG reruns do not create duplicates.
Target coins in scope:
- bitcoin
- ethereum
- solana
What you will learn:
- How to run Airflow locally using the Astro CLI.
- How to run an additional database service alongside Airflow with Docker Compose override.
- How to configure Airflow connections for external systems.
- How to design an ETL DAG with extraction, transformation, and loading tasks.
- How to write rerunnable pipelines using ON CONFLICT upsert.
- How to troubleshoot common local development issues.
Key project files:
- dags/crypto_history_etl.py: Main training DAG.
- dags/exampledag.py: Default example DAG from Astro init.
- docker-compose.override.yml: Adds target Postgres and remaps metadata Postgres port.
- .env: Defines runtime environment variables, including target Airflow connection URI.
- requirements.txt: Python dependencies and Airflow providers.
Two Postgres databases run locally:
- Airflow metadata Postgres
- Target analytics Postgres used by this ETL pipeline
Port mapping in docker-compose.override.yml:
- Airflow metadata Postgres: host 5434 -> container 5432
- Target DB (target-db): host 5433 -> container 5432
Why this matters:
- Host port 5432 is commonly already in use.
- Splitting ports avoids conflicts and keeps both databases available.
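The port split above can be sketched as a compose override. This is an illustrative fragment, not the repo's exact file: the service names, image tag, and the `!override` tag (mentioned in troubleshooting) are assumptions to check against the actual docker-compose.override.yml.

```yaml
services:
  # Airflow metadata Postgres: remap away from host port 5432.
  # The !override tag replaces the default port list instead of appending to it.
  postgres:
    ports: !override
      - "5434:5432"
  # Dedicated analytics Postgres used as the ETL target.
  target-db:
    image: postgres:16
    environment:
      POSTGRES_USER: admin
      POSTGRES_PASSWORD: password
      POSTGRES_DB: crypto_db
    ports:
      - "5433:5432"
```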
Install and verify:
- Docker Desktop
- Astro CLI
Then check the Astro CLI:

```
astro version
```

From the project root:

```
astro dev start
```

The Airflow UI will be available at http://localhost:8080 (the Astro CLI default).
If Airflow is already running and you changed config or dependencies:

```
astro dev restart
```

This project already contains docker-compose.override.yml with the target-db service.
Expected target DB credentials:
- user: admin
- password: password
- database: crypto_db
- host port from your machine: 5433
This project sets the connection via .env:
AIRFLOW_CONN_TARGET_DB_CONN=postgresql://admin:password@host.docker.internal:5433/crypto_db
Why host.docker.internal:
- Tasks run inside Airflow containers.
- localhost inside container is not your host machine.
- host.docker.internal allows container-to-host access on Docker Desktop.
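As a sanity check, the connection URI from .env can be decomposed with the standard library to confirm each field matches the target-db credentials above (a minimal sketch; the URI string is copied from the .env line):

```python
from urllib.parse import urlsplit

# Connection URI as defined in .env (AIRFLOW_CONN_TARGET_DB_CONN).
uri = "postgresql://admin:password@host.docker.internal:5433/crypto_db"

parts = urlsplit(uri)
print(parts.scheme)    # postgresql
print(parts.username)  # admin
print(parts.hostname)  # host.docker.internal
print(parts.port)      # 5433
print(parts.path)      # /crypto_db
```

Note that the host is host.docker.internal, not target-db: tasks connect from inside the Airflow containers back out through the host's published port 5433.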
Optional validation command:

```
astro dev run connections get target_db_conn
```

The pipeline DAG is dags/crypto_history_etl.py.
DAG id:
- intern_crypto_history_pipeline
Flow:
- start_pipeline
- create_table
- process_crypto task group
- end_pipeline
Inside process_crypto, each coin has:
- extract_coin
- transform_coin
- load_coin
Key ETL behaviors:
- Extract: pulls historical daily data from CoinGecko.
- Transform: isolates date, coin, price_usd, market_cap_usd, volume_usd.
- Load: uses PostgresHook and upsert to prevent duplicates.
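The transform step can be sketched as a plain function over the shape of CoinGecko's market_chart payload (`prices`, `market_caps`, `total_volumes` as lists of `[timestamp_ms, value]` pairs). The function name and structure here are illustrative, not the DAG's actual code:

```python
from datetime import datetime, timezone

def transform_coin(coin: str, payload: dict) -> list[dict]:
    """Flatten a market_chart-style payload into one row per day."""
    rows = []
    for (ts_ms, price), (_, cap), (_, vol) in zip(
        payload["prices"], payload["market_caps"], payload["total_volumes"]
    ):
        # Timestamps arrive in milliseconds; keep only the UTC calendar date.
        day = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc).date()
        rows.append({
            "date": day.isoformat(),
            "coin": coin,
            "price_usd": price,
            "market_cap_usd": cap,
            "volume_usd": vol,
        })
    return rows

# Tiny synthetic payload (one day) to show the output shape.
sample = {
    "prices": [[1704067200000, 42000.0]],
    "market_caps": [[1704067200000, 8.2e11]],
    "total_volumes": [[1704067200000, 1.6e10]],
}
print(transform_coin("bitcoin", sample))
```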
In Airflow UI:
- Open DAGs.
- Find intern_crypto_history_pipeline.
- Unpause DAG.
- Trigger DAG.
- Open Graph view and task logs.
CLI checks:

```
astro dev run dags list
astro dev run dags list-import-errors
```

After a successful run, the following table is created in the target DB:
- crypto_historical
Columns:
- date
- coin
- price_usd
- market_cap_usd
- volume_usd
Primary key:
- (date, coin)
This key powers idempotent upsert behavior.
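The idempotency claim is easy to demonstrate with the same ON CONFLICT pattern. SQLite (3.24+) accepts Postgres-style upsert syntax, so this sketch uses an in-memory SQLite database as a stand-in for the real target Postgres; the table and column names match the ones listed above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE crypto_historical (
        date TEXT NOT NULL,
        coin TEXT NOT NULL,
        price_usd REAL,
        market_cap_usd REAL,
        volume_usd REAL,
        PRIMARY KEY (date, coin)
    )
""")

# Upsert keyed on (date, coin): a rerun updates the row instead of duplicating it.
upsert = """
    INSERT INTO crypto_historical (date, coin, price_usd, market_cap_usd, volume_usd)
    VALUES (?, ?, ?, ?, ?)
    ON CONFLICT (date, coin) DO UPDATE SET
        price_usd = excluded.price_usd,
        market_cap_usd = excluded.market_cap_usd,
        volume_usd = excluded.volume_usd
"""

conn.execute(upsert, ("2024-01-01", "bitcoin", 42000.0, 8.2e11, 1.6e10))
conn.execute(upsert, ("2024-01-01", "bitcoin", 43000.0, 8.3e11, 1.7e10))  # DAG rerun with revised figures

count = conn.execute("SELECT COUNT(*) FROM crypto_historical").fetchone()[0]
price = conn.execute("SELECT price_usd FROM crypto_historical").fetchone()[0]
print(count, price)  # 1 43000.0 -> one row, updated in place
```

In the DAG itself the same statement would run through PostgresHook against crypto_db rather than SQLite.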
Current Python dependencies in requirements.txt:
- requests
- pandas
- apache-airflow-providers-postgres
Notes:
- The DAG currently relies on requests and Postgres provider.
- pandas is available for future transforms.
Symptom:
- Docker bind error on localhost:5432.
Fix:
- Keep metadata Postgres remapped to 5434 and target DB on 5433 in docker-compose.override.yml.
Symptom:
- YAML warning for !override.
Fix:
- Workspace setting in .vscode/settings.json includes yaml.customTags for !override sequence.
Symptom:
- airflow.providers.postgres.operators.postgres import fails.
Fix:
- DAG uses SQLExecuteQueryOperator from common SQL provider in dags/crypto_history_etl.py.
Symptom:
- Parsing error in newer Airflow runtime.
Fix:
- Use schedule instead of schedule_interval in dags/crypto_history_etl.py.
Symptom:
- Connection fails resolving target-db from Airflow task.
Fix:
- Use target_db_conn mapped to host.docker.internal:5433 via .env.
```
astro dev start
astro dev restart
astro dev run dags list
astro dev run dags list-import-errors
astro dev run connections get target_db_conn
```

By the end of this lab, interns should be able to:
- Build and run a production-style ETL DAG locally.
- Store daily crypto history in Postgres reliably.
- Re-run pipelines safely without duplicate rows.
- Diagnose and resolve common Airflow local dev issues.