1. Project Overview:
Goal and Purpose:
The primary goal of this project is to showcase my technical proficiency in data engineering and business intelligence (BI), with a particular focus on modern ELT (Extract, Load, Transform) practices, dimensional modeling, and the creation of insightful visualizations. This project serves as a practical demonstration of my ability to deliver an end-to-end analytics solution.
Scope:
The project includes the following key activities:
- Web Scrapping: Scrapping data from Taiwan Real Estate Transaction Price Registration System
- Extract and Load: Stored Scraped data in structured storage.
- Data Transformation: Cleaning, modeling, and aggregating the data using ELT workflows to prepare it for analytical querying.
- Semantic Dataset: Designing a clear, easy-to-query dimensional model and a semantic (metrics) layer to facilitate meaningful analytics.
- Dashboard Design and Implementation: Creating interactive dashboards guided by simulated stakeholder interactions (via ChatGPT), ensuring relevance, clarity, and alignment with realistic business requirements and user scenarios.
2. Data Sources and Infrastructure
Data Sources
- city_code.csv
- Manually collected through google search and wikipedia
- Taiwan Real Estate Transaction Price Registration System
- land, building, and parking space transactions
- Pre-sale house (TBD)
Tools and Technologies
-
Programming Language:
-
Cloud Storage:
-
Data Warehouse:
-
Extract and Load:
-
Transformation:
-
CI/CD:
-
Semantic Layer (Data Mart):
-
Visualization:
For the source code, please visit the github repo
Infrastructure Overview
Web scraping → Raw data storage in Google Cloud Storage → Data ingestion and loading into BigQuery using Meltano → Data transformation and modeling in BigQuery using dbt → Semantic dataset creation in Power BI → Interactive dashboards and analytics in Power BI.
3. Data Models & Transformations
Data Mart & Data Modeling Description:
This project uses a dimensional modeling approach in data marts optimized for analytical queries. The schema shown as the follows:
erDiagram
"MODEL.HOUSE_AND_RENTING.DIM_BUILDING_TYPE" {
string building_type_id
string building_type
string building_type_origin
timestamp last_batched_at
}
"MODEL.HOUSE_AND_RENTING.DIM_CITY_DISTRICT" {
string city_county
string city_id
string city_district_surrogate_key
string district_township
timestamp last_batched_at
}
"MODEL.HOUSE_AND_RENTING.DIM_DATE" {
date date_day
date prior_date_day
date next_date_day
date prior_year_date_day
date prior_year_over_year_date_day
int64 day_of_week
int64 day_of_week_iso
string day_of_week_name
string day_of_week_name_short
int64 day_of_month
int64 day_of_year
date week_start_date
date week_end_date
date prior_year_week_start_date
date prior_year_week_end_date
int64 week_of_year
date iso_week_start_date
date iso_week_end_date
date prior_year_iso_week_start_date
date prior_year_iso_week_end_date
int64 iso_week_of_year
int64 prior_year_week_of_year
int64 prior_year_iso_week_of_year
int64 month_of_year
string month_name
string month_name_short
date month_start_date
date month_end_date
date prior_year_month_start_date
date prior_year_month_end_date
int64 quarter_of_year
date quarter_start_date
date quarter_end_date
int64 year_number
date year_start_date
date year_end_date
string year_month_number
}
"MODEL.HOUSE_AND_RENTING.DIM_LAYOUT" {
string number_of_bedrooms_categorized
int64 building_number_of_bedrooms
int64 building_number_of_living_rooms
int64 building_number_of_bathrooms
string building_number_of_partitions
string layout_id
timestamp last_batched_at
}
"MODEL.HOUSE_AND_RENTING.DIM_TRANSACTION_TYPE" {
string transaction_type
}
"MODEL.HOUSE_AND_RENTING.FCT_REAL_ESTATE_TRANSACTION" {
string city_district_surrogate_key
string layout_id
string building_type_id
string transaction_id
date transaction_date
string transaction_type
string address_land_slot
date construction_completion_date
int64 number_of_lands
int64 number_of_buildings
int64 number_of_parking_spaces
float64 land_total_tranferred_area_square_meter
float64 building_transferred_area_square_meter
float64 parking_transferred_area_square_meter
float64 main_building_area_square_meter
float64 auxiliary_building_area_square_meter
float64 balcony_area_square_meter
float64 total_price_ntd
float64 price_per_square_meter
float64 parking_total_price_ntd
float64 building_price_per_square_meter
float64 net_private_building_price_per_square_meter
float64 net_exclusive_area_price_per_square_meter
timestamp _sdc_batched_at
}
"MODEL.HOUSE_AND_RENTING.FCT_REAL_ESTATE_TRANSACTION" }|--|| "MODEL.HOUSE_AND_RENTING.DIM_BUILDING_TYPE": building_type_id
"MODEL.HOUSE_AND_RENTING.FCT_REAL_ESTATE_TRANSACTION" }|--|| "MODEL.HOUSE_AND_RENTING.DIM_CITY_DISTRICT": city_district_surrogate_key
"MODEL.HOUSE_AND_RENTING.FCT_REAL_ESTATE_TRANSACTION" }|--|| "MODEL.HOUSE_AND_RENTING.DIM_LAYOUT": layout_id
"MODEL.HOUSE_AND_RENTING.FCT_REAL_ESTATE_TRANSACTION" }|--|| "MODEL.HOUSE_AND_RENTING.DIM_DATE": date_day--transaction_date
"MODEL.HOUSE_AND_RENTING.FCT_REAL_ESTATE_TRANSACTION" }|--|| "MODEL.HOUSE_AND_RENTING.DIM_TRANSACTION_TYPE": transaction_type
Transformation Logic:
Please refer to dbt transformation logic
4. Data Quality & Testing
Tests & Validation
Data quality is maintained through automated tests defined within dbt models, including:
Source Layer:
- Uniqueness tests: Ensuring transaction IDs are unique.
- Non-null checks: Ensuring critical fields (e.g., transaction dates, price values) contain no null values.
- Pattern Checks: Ensuring critical fields (e.g., date, district) have the correct pattern
Data Mart Layer:
- Uniqueness tests: Ensuring transaction IDs are unique.
- Non-null checks: Ensuring critical fields (e.g., transaction dates, price values) contain no null values.
- Referential integrity checks: Confirming dimension keys in fact tables have valid dimension records.
These tests run automatically during each dbt deployment as part of CI/CD pipelines (GitHub Actions), ensuring data reliability and preventing issues from reaching the BI layer.
5. Semantic Dataset, BI & Visualization
Semantic Dataset & Setup
A semantic model was developed within Power BI, connected directly to the well-transformed tables in Google BigQuery. This model defines key measures, fact, and dimension tables, and their relationships—enabling self-service analytics and consistent reporting logic across all visuals.
Key features include:
- Galaxy schema design with clearly defined fact and dimension tables.
- Centralized definitions for KPIs such as average price, price per square meter, and transaction counts.
- Built-in hierarchies for drill-downs (e.g., region → district).
Notes:
- All data transformations are performed using dbt.
- Minimal business logic is handled in Power BI’s Power Query, keeping the semantic layer clean and focused.
-
- Data refresh schedule: Dashboards are refreshed regularly on the first Monday after the 1st, 11th, and 21st of each month. This aligns with the update frequency of the Taiwan Real Estate Transaction Price Registration System and considers limitations of the Power BI Free version (which lacks automatic scheduled refresh on cloud datasets).
BI Tool & Setup
The dashboards were built using Power BI and connected directly to the semantic dataset, ensuring that KPIs across different reports / dashboards share with the same definitions.
DAG(TBD)
Dashboard Catalog
- Market Analysis for Investor (Click to view the dashboard)
- Market Analysis for Homebuyer (TBD)
6. Future Enhancement
- A better schema design for dimension, especially building type and transaction types
- Ingest pre-sales house data
- Ingest rental Data
- Martet Analysis for HomeBuyer