Build and Run a Data Engineering Pipeline on Your Local System

Step-by-Step Setup Guide

1. Install Prerequisites

Make sure these are installed:

  • Docker (Engine or Desktop)
  • Docker Compose v2

Verify installation:

docker --version
docker compose version

2. Create Project Folder

mkdir data-engineering-pipeline
cd data-engineering-pipeline

3. Add Files

docker-compose.yml

Create docker-compose.yml inside data-engineering-pipeline/ defining the PostgreSQL, Zookeeper, Kafka, Airflow, and Superset services, as sketched below.
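A minimal sketch, assuming recent image tags, throwaway postgres/postgres credentials, and Airflow's standalone mode; adapt it to your own compose file:

services:
  postgres:
    image: postgres:15
    environment:
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: postgres
      POSTGRES_DB: weatherdb
    ports:
      - "5432:5432"
    volumes:
      - ./postgres/init.sql:/docker-entrypoint-initdb.d/init.sql
      - pgdata:/var/lib/postgresql/data

  zookeeper:
    image: confluentinc/cp-zookeeper:7.5.0
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181

  kafka:
    image: confluentinc/cp-kafka:7.5.0
    depends_on: [zookeeper]
    ports:
      - "9092:9092"
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:9092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1

  airflow:
    image: apache/airflow:2.9.2
    command: standalone
    depends_on: [postgres, kafka]
    ports:
      - "8080:8080"
    environment:
      # Convenience feature of the official Airflow image: installs the
      # client libraries the DAG scripts below need at container start.
      _PIP_ADDITIONAL_REQUIREMENTS: "kafka-python requests psycopg2-binary"
    volumes:
      - ./airflow/dags:/opt/airflow/dags

  superset:
    image: apache/superset:3.1.0
    ports:
      - "8088:8088"
    environment:
      SUPERSET_SECRET_KEY: "change-me"   # any non-empty secret for local use

volumes:
  pgdata: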

PostgreSQL Init Script

Make a postgres/ folder and add an init.sql file:

mkdir postgres
nano postgres/init.sql

Paste:

CREATE TABLE IF NOT EXISTS weather_data (
    id SERIAL PRIMARY KEY,
    city VARCHAR(50),
    temperature FLOAT,   -- degrees Celsius when the API is queried with units=metric
    humidity FLOAT,      -- relative humidity, percent
    timestamp BIGINT     -- Unix epoch seconds (OpenWeather's dt field)
);

Airflow DAGs Folder

Make airflow/dags/ folder:

mkdir -p airflow/dags

Inside it, create three files:

  • weather_pipeline.py (Airflow DAG)
  • fetch_weather.py (data ingestion)
  • process_weather.py (processing + insert to PostgreSQL)

Minimal sketches of all three scripts follow, along with the resulting project layout.
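fetch_weather.py — a minimal sketch using the kafka-python client. The weather_raw topic name, the kafka:9092 broker address, the example city list, and the OPENWEATHER_API_KEY environment variable are all assumptions:

import json
import os

import requests
from kafka import KafkaProducer

API_KEY = os.environ["OPENWEATHER_API_KEY"]  # assumed environment variable
CITIES = ["London", "New York", "Tokyo"]     # example cities
TOPIC = "weather_raw"                        # assumed Kafka topic name

def fetch_weather():
    """Pull current weather for each city and publish the raw JSON to Kafka."""
    producer = KafkaProducer(
        bootstrap_servers="kafka:9092",  # service name from docker-compose.yml
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    for city in CITIES:
        resp = requests.get(
            "https://api.openweathermap.org/data/2.5/weather",
            params={"q": city, "appid": API_KEY, "units": "metric"},
            timeout=10,
        )
        resp.raise_for_status()
        producer.send(TOPIC, resp.json())
    producer.flush()

if __name__ == "__main__":
    fetch_weather()

process_weather.py — a minimal sketch using psycopg2; the connection details match the Superset settings in step 7, and the topic name matches the assumption above:

import json

import psycopg2
from kafka import KafkaConsumer

def process_weather():
    """Drain the Kafka topic and insert one row per message into PostgreSQL."""
    consumer = KafkaConsumer(
        "weather_raw",                   # assumed topic name (matches fetch_weather.py)
        bootstrap_servers="kafka:9092",
        group_id="weather_processor",    # commit offsets so reruns skip old messages
        auto_offset_reset="earliest",
        consumer_timeout_ms=10000,       # stop iterating once the topic is drained
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )
    conn = psycopg2.connect(
        host="postgres", port=5432,
        dbname="weatherdb", user="postgres", password="postgres",
    )
    with conn, conn.cursor() as cur:     # commits on clean exit
        for msg in consumer:
            record = msg.value
            cur.execute(
                "INSERT INTO weather_data (city, temperature, humidity, timestamp) "
                "VALUES (%s, %s, %s, %s)",
                (record["name"],
                 record["main"]["temp"],
                 record["main"]["humidity"],
                 record["dt"]),
            )
    conn.close()

if __name__ == "__main__":
    process_weather()

weather_pipeline.py — a minimal Airflow 2.x DAG wiring the two steps together; the hourly schedule and start date are assumptions:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Both modules sit next to this file in airflow/dags/, which Airflow
# puts on the import path.
from fetch_weather import fetch_weather
from process_weather import process_weather

with DAG(
    dag_id="weather_pipeline",
    start_date=datetime(2024, 1, 1),   # assumed start date
    schedule_interval="@hourly",       # assumed schedule
    catchup=False,
) as dag:
    fetch = PythonOperator(task_id="fetch_weather", python_callable=fetch_weather)
    process = PythonOperator(task_id="process_weather", python_callable=process_weather)
    fetch >> process

With everything in place, the project should look like this:

data-engineering-pipeline/
├── docker-compose.yml
├── postgres/
│   └── init.sql
└── airflow/
    └── dags/
        ├── weather_pipeline.py
        ├── fetch_weather.py
        └── process_weather.py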

4. Start Services

Run:

docker compose up -d

This will start:

  • PostgreSQL (5432)
  • Kafka (9092) + Zookeeper (2181)
  • Airflow (8080)
  • Superset (8088)

Check running containers:

docker ps

5. Access UIs

  • Airflow: http://localhost:8080
  • Superset: http://localhost:8088

If Airflow runs in standalone mode, it generates an admin password at first start; check the container logs with docker compose logs airflow.
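If the Superset UI has no login yet, a typical first-time initialization uses the standard Superset CLI inside the container (the admin/admin credentials here are placeholders):

docker compose exec superset superset db upgrade
docker compose exec superset superset fab create-admin \
    --username admin --firstname Admin --lastname User \
    --email admin@example.com --password admin
docker compose exec superset superset init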

6. Trigger the Pipeline

  1. Go to the Airflow UI → enable the weather_pipeline DAG.
  2. On each run, it will:
    • Run fetch_weather.py → Pull data from OpenWeather API → Send to Kafka.
    • Run process_weather.py → Consume Kafka data → Insert into PostgreSQL.
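To confirm rows are landing, query PostgreSQL directly from the host:

docker compose exec postgres psql -U postgres -d weatherdb -c "SELECT COUNT(*) FROM weather_data;"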

7. Build Dashboard in Superset

  1. Log in at http://localhost:8088.
  2. Connect PostgreSQL (the fields below map to a single SQLAlchemy URI, shown after the list):

  • Host: postgres
  • Port: 5432
  • Database: weatherdb
  • User: postgres
  • Password: postgres
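In Superset, these fields collapse into one SQLAlchemy URI you can paste into the connection form:

postgresql://postgres:postgres@postgres:5432/weatherdb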

  3. Create a new dataset from the weather_data table.
  4. Build charts:

  • Line chart: temperature trend over time.
  • Bar chart: average humidity per city.
  • Alerts: filter by thresholds.
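For reference, the bar chart reduces to a query you can test first in Superset's SQL Lab:

-- Average humidity per city (the bar chart). For the line chart, use
-- to_timestamp(timestamp) to turn the epoch seconds into a datetime.
SELECT city, AVG(humidity) AS avg_humidity
FROM weather_data
GROUP BY city
ORDER BY avg_humidity DESC;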

8. Stopping the Pipeline

To stop services:

docker compose down

To stop & remove all data volumes (start fresh):

docker compose down -v

Now your local machine runs a production-style data engineering pipeline with open-source tools.
