Data Engineering Live Project Example with Open-Source Tools

Project Goal

Build a real-time data pipeline that collects data from a public API, processes it, stores it in a database, and visualizes insights.

We’ll use only free/open-source tools.

Project Use Case: Real-Time Weather Data Pipeline

1. Data Source (Ingestion)

  • Tool: Python (free & open-source)
  • What: Use the OpenWeatherMap free API to fetch live weather data (temperature, humidity, etc.) for different cities.
  • How: Write a Python script that pulls JSON data every 10 minutes (see the sketch below).
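
A minimal sketch of the ingestion script, assuming a free OpenWeatherMap API key and a small city list (both are placeholders here):

```python
import time

import requests  # pip install requests

# Placeholders: supply your own free OpenWeatherMap API key and city list.
API_KEY = "YOUR_OPENWEATHERMAP_API_KEY"
CITIES = ["London", "Delhi", "New York"]
URL = "https://api.openweathermap.org/data/2.5/weather"

def fetch_weather(city: str) -> dict:
    """Fetch the current weather for one city and return the raw JSON payload."""
    resp = requests.get(URL, params={"q": city, "appid": API_KEY}, timeout=10)
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    while True:
        for city in CITIES:
            data = fetch_weather(city)
            # Temperatures come back in Kelvin by default; conversion happens in the Spark step.
            print(city, data["main"]["temp"], data["main"]["humidity"])
        time.sleep(600)  # poll every 10 minutes
```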

2. Message Queue / Stream Processing

  • Tool: Apache Kafka (free & open-source)
  • What: Push the API data into a Kafka topic for streaming.
  • Why: Kafka allows scalable ingestion and decouples data producers from consumers.
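
A sketch of the producer side using the kafka-python client, assuming a local broker on localhost:9092 and a topic named weather_raw (both names are our placeholders, not requirements of the project):

```python
import json

from kafka import KafkaProducer  # pip install kafka-python

# Assumes a local Kafka broker on localhost:9092 and a topic named "weather_raw".
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def publish_weather(city: str, record: dict) -> None:
    """Send one raw weather reading to Kafka, keyed by city for partitioning."""
    producer.send("weather_raw", key=city, value=record)

# Example: plug into the ingestion loop from step 1
# publish_weather("London", fetch_weather("London"))
# producer.flush()
```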

3. Data Transformation (ETL/ELT)

  • Tool: Apache Spark (free & open-source)
  • What: Consume Kafka messages and clean the data (e.g., convert temperature units, filter missing values).
  • Output: Structured data (timestamp, city, temperature, humidity).
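
One way to express this step with Spark Structured Streaming (PySpark). It assumes the weather_raw topic from step 2 and the spark-sql-kafka connector on the classpath; the console sink is only for illustration, and a real job would write each micro-batch to PostgreSQL (for example via foreachBatch and the JDBC writer):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, current_timestamp, from_json
from pyspark.sql.types import DoubleType, StringType, StructField, StructType

# Requires the Kafka connector, e.g.:
# spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:<your Spark version> weather_etl.py
spark = SparkSession.builder.appName("weather_etl").getOrCreate()

# Only the fields we need from the OpenWeatherMap payload.
schema = StructType([
    StructField("name", StringType()),              # city name
    StructField("main", StructType([
        StructField("temp", DoubleType()),          # Kelvin in the raw API response
        StructField("humidity", DoubleType()),
    ])),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", "weather_raw")
       .load())

clean = (raw.selectExpr("CAST(value AS STRING) AS json")
         .select(from_json(col("json"), schema).alias("w"))
         .select(
             current_timestamp().alias("ts"),
             col("w.name").alias("city"),
             (col("w.main.temp") - 273.15).alias("temperature_c"),  # Kelvin -> Celsius
             col("w.main.humidity").alias("humidity"),
         )
         .filter(col("temperature_c").isNotNull()))  # drop readings with missing temperature

# Console sink for illustration only.
query = clean.writeStream.format("console").outputMode("append").start()
query.awaitTermination()
```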

4. Data Storage

  • Tool: PostgreSQL (free & open-source relational database)
  • What: Store the cleaned weather data in a fact table with city-level granularity.
  • Why: PostgreSQL is reliable, widely used, and integrates well with BI tools.
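
A possible fact-table layout, created here from Python with psycopg2; the database name, credentials, and column names are placeholders you would adapt:

```python
import psycopg2  # pip install psycopg2-binary

# Assumes a local PostgreSQL instance with a "weather" database;
# the credentials and table/column names below are placeholders.
DDL = """
CREATE TABLE IF NOT EXISTS weather_fact (
    ts            TIMESTAMPTZ      NOT NULL,
    city          TEXT             NOT NULL,
    temperature_c DOUBLE PRECISION,
    humidity      DOUBLE PRECISION,
    PRIMARY KEY (ts, city)
);
"""

with psycopg2.connect(host="localhost", dbname="weather",
                      user="postgres", password="postgres") as conn:
    with conn.cursor() as cur:
        cur.execute(DDL)  # the connection context manager commits on success
```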

5. Data Orchestration

  • Tool: Apache Airflow (free & open-source)
  • What: Schedule and monitor the pipeline:
    • Step 1: API ingestion → Kafka
    • Step 2: Spark job → PostgreSQL
  • Why: Airflow schedules both steps, retries failures, and provides monitoring, so the pipeline runs without manual intervention.
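
A minimal DAG sketch for these two steps. The file paths and the --once flag are hypothetical and assume the scripts from the earlier steps live under /opt/pipeline:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical paths: adjust to wherever you keep the ingestion and Spark scripts.
with DAG(
    dag_id="weather_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval=timedelta(minutes=10),
    catchup=False,
) as dag:
    api_to_kafka = BashOperator(
        task_id="api_to_kafka",
        # "--once" assumes the ingestion script has a single-poll mode.
        bash_command="python /opt/pipeline/ingest_weather.py --once",
    )
    spark_to_postgres = BashOperator(
        task_id="spark_to_postgres",
        bash_command="spark-submit /opt/pipeline/weather_etl.py",
    )
    api_to_kafka >> spark_to_postgres
```

In a fully streaming setup the Spark job would run continuously and Airflow would mainly trigger ingestion and monitor health; the batch-style trigger above simply mirrors the two-step layout described here.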

6. Data Visualization (Dashboard)

  • Tool: Apache Superset (free & open-source BI tool)
  • What: Create real-time dashboards:
    • Average temperature by city
    • Humidity trends over time
    • Alerts for extreme weather

Pipeline Architecture

OpenWeather API → Python Script → Kafka → Spark (ETL) → PostgreSQL → Superset (Dashboard)
                  ↑ Airflow (Orchestration) schedules the ingestion and Spark steps

Possible Extensions

  • Add real-time alerts with Kafka consumers + email/Slack integration.
  • Store raw data in a data lake (like MinIO, a free S3-compatible object store).
  • Use Docker to containerize all components for easy deployment.
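
For the first extension, a sketch of an alerting consumer; the Slack webhook URL and the temperature threshold are placeholders:

```python
import json

import requests
from kafka import KafkaConsumer  # pip install kafka-python

# Placeholders: the webhook URL comes from a Slack "incoming webhook" you create,
# and the threshold is arbitrary for illustration.
SLACK_WEBHOOK = "https://hooks.slack.com/services/XXX/YYY/ZZZ"
TEMP_THRESHOLD_C = 40.0

consumer = KafkaConsumer(
    "weather_raw",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    reading = message.value
    temp_k = reading.get("main", {}).get("temp")
    if temp_k is None:
        continue
    temp_c = temp_k - 273.15  # OpenWeatherMap returns Kelvin by default
    if temp_c >= TEMP_THRESHOLD_C:
        requests.post(SLACK_WEBHOOK, json={
            "text": f"Extreme heat alert: {reading.get('name')} is at {temp_c:.1f} °C",
        })
```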

All tools used are free/open-source:

  • Python (data ingestion)
  • Kafka (streaming)
  • Spark (processing)
  • PostgreSQL (database)
  • Airflow (orchestration)
  • Superset (dashboard)
