Step-by-Step Setup Guide
1. Install Prerequisites
Make sure these are installed:
- Docker Desktop (Windows/Mac) OR Docker Engine (Linux)
- Docker Compose (already included in Docker Desktop)
- Git (optional, but helpful)
Verify installation:
docker compose version
2. Create Project Folder
mkdir data-engineering-pipeline
cd data-engineering-pipeline
3. Add Files
docker-compose.yml
Create docker-compose.yml inside data-engineering-pipeline/ with the content I provided earlier.
PostgreSQL Init Script
Create a postgres/ folder and add an init.sql file inside it:
mkdir -p postgres
nano postgres/init.sql
Paste:
CREATE TABLE weather_data (
    id SERIAL PRIMARY KEY,
    city VARCHAR(50),
    temperature FLOAT,
    humidity FLOAT,
    timestamp BIGINT
);
Airflow DAGs Folder
Create the airflow/dags/ folder:
mkdir -p airflow/dags
Inside it, create three files:
- weather_pipeline.py (Airflow DAG)
- fetch_weather.py (data ingestion)
- process_weather.py (processing + insert to PostgreSQL)
Paste the scripts I gave you earlier into these files.
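To give a sense of how the three files fit together, here is a minimal sketch of weather_pipeline.py. It assumes an Airflow 2.x image and that fetch_weather.py and process_weather.py each expose a main() function; the DAG ID, schedule, and imports are illustrative, so adapt them to the scripts you actually pasted.

```python
"""weather_pipeline.py -- illustrative Airflow DAG skeleton (not the full script)."""
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Airflow adds the dags/ folder to sys.path, so sibling modules import directly.
import fetch_weather
import process_weather

with DAG(
    dag_id="weather_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@hourly",  # illustrative; pick whatever cadence you need
    catchup=False,
) as dag:
    fetch_task = PythonOperator(
        task_id="fetch_weather",
        python_callable=fetch_weather.main,      # pulls from OpenWeather, pushes to Kafka
    )
    process_task = PythonOperator(
        task_id="process_weather",
        python_callable=process_weather.main,    # consumes Kafka, inserts into PostgreSQL
    )

    fetch_task >> process_task  # run ingestion before processing
```

The `>>` on the last line simply tells Airflow to run the ingestion task before the processing task.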
4. Start Services
Run:
docker compose up -d
This will start:
- PostgreSQL (5432)
- Kafka (9092) + ZooKeeper
- Airflow (8080)
- Superset (8088)
Check running containers:
docker ps
5. Access UIs
Airflow → http://localhost:8080
- Default login: airflow / airflow (or create your own user with airflow users create)
Superset → http://localhost:8088
- Login: admin / admin (as defined in docker-compose.yml)
6. Trigger the Pipeline
- Go to the Airflow UI → enable the weather_pipeline DAG.
- It will:
  - Run fetch_weather.py → pull data from the OpenWeather API → send it to Kafka.
  - Run process_weather.py → consume the Kafka data → insert it into PostgreSQL (a minimal sketch follows below).
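For reference, here is a hedged sketch of what the Kafka-to-PostgreSQL side (process_weather.py) can look like, using kafka-python and psycopg2. The topic name "weather", the kafka:9092 bootstrap address, and the main() entry point are assumptions rather than values taken from the compose file; the table and connection details match the ones used elsewhere in this guide.

```python
"""process_weather.py -- illustrative consumer/loader sketch (not the full script)."""
import json

import psycopg2
from kafka import KafkaConsumer


def main() -> None:
    consumer = KafkaConsumer(
        "weather",                       # topic name (assumed)
        bootstrap_servers="kafka:9092",  # Kafka service name from docker-compose (assumed)
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
        auto_offset_reset="earliest",
        consumer_timeout_ms=10_000,      # stop iterating when no new messages arrive
    )

    conn = psycopg2.connect(
        host="postgres", dbname="weatherdb",
        user="postgres", password="postgres",
    )
    with conn, conn.cursor() as cur:
        for message in consumer:
            record = message.value
            cur.execute(
                "INSERT INTO weather_data (city, temperature, humidity, timestamp) "
                "VALUES (%s, %s, %s, %s)",
                (
                    record["city"],
                    record["temperature"],
                    record["humidity"],
                    record["timestamp"],
                ),
            )
    conn.close()


if __name__ == "__main__":
    main()
```

`consumer_timeout_ms` makes the loop exit once no new messages arrive, so the Airflow task finishes instead of blocking forever.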
7. Build Dashboard in Superset
- Log in at http://localhost:8088.
- Connect PostgreSQL:
  - Host: postgres
  - Port: 5432
  - Database: weatherdb
  - User: postgres
  - Password: postgres
  (or use the single SQLAlchemy URI shown after this list)
- Create a new dataset from the weather_data table.
- Build charts:
  - Line chart: temperature trend over time.
  - Bar chart: average humidity per city.
  - Alerts: filter by thresholds.
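If your Superset version asks for a single SQLAlchemy URI instead of separate connection fields, the values above combine into the line below (assuming the defaults used in this guide; adjust if your docker-compose.yml differs):
postgresql://postgres:postgres@postgres:5432/weatherdb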
8. Stopping the Pipeline
To stop services:
docker compose down
To stop & remove all data volumes (start fresh):
docker compose down -v
Now your local machine runs a production-style data engineering pipeline with open-source tools.