In the early days of data and analytics, traditional data warehouses built on relational (RDBMS) databases served as the backbone of companies' analytical needs, providing a structured environment for storing and processing large volumes of structured data. However, as the volume and variety of data grew, encompassing semi-structured data like JSON files and unstructured data such as images and videos, traditional data warehouses began to show their limitations. The sheer volume of this data, especially for ML use cases, also strained them. This is where the concept of the Data Lakehouse emerged, combining the best of both worlds: a Data Lakehouse integrates the flexibility and scalability of data lakes, which can accommodate all types of data, with the robust performance and analytics capabilities of data warehouses. In this post, I explain the Data Lakehouse implementation we built for one of our Financial Marketplace customers.
To build our Data Lakehouse, we implemented a medallion architecture, which organizes data into three distinct layers: bronze, silver, and gold. In this setup, Amazon S3 serves as the storage for both the bronze and silver layers. The bronze layer ingests raw, unprocessed data, accommodating a variety of formats and structures.
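As a rough illustration of what bronze ingestion looks like, the sketch below lands source files unchanged under a date-partitioned S3 prefix. The bucket, prefix, and source-system names are placeholders rather than the actual layout of this implementation.

```python
import boto3
from datetime import datetime, timezone

# Placeholder bucket and source-system names; the real bronze layout is environment-specific.
BRONZE_BUCKET = "lakehouse-bronze"
SOURCE_SYSTEM = "loan-applications"

def land_raw_file(local_path: str, file_name: str) -> str:
    """Land a raw, unprocessed file in the bronze layer, partitioned by ingestion date."""
    ingestion_date = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    key = f"bronze/{SOURCE_SYSTEM}/ingestion_date={ingestion_date}/{file_name}"
    boto3.client("s3").upload_file(local_path, BRONZE_BUCKET, key)
    return f"s3://{BRONZE_BUCKET}/{key}"

# Example: land a raw JSON export exactly as received from the source system.
# land_raw_file("/tmp/applications_2024-06-01.json", "applications_2024-06-01.json")
```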
The silver layer refines the data, transforming it into a more structured format suitable for analysis. Here we leveraged our custom ETL framework, DRIFT, which uses AWS Glue to process high-volume datasets efficiently.
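DRIFT's internals are not shown here, but a Glue job it orchestrates could look roughly like the sketch below. The S3 paths, column names, and cleansing rules are illustrative assumptions (the real rules come from DRIFT's configuration), and for brevity this writes plain Parquet to a staging location; the Hudi write used for the actual silver tables is sketched further down.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql import functions as F

# Standard Glue job bootstrapping.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Assumed locations; in practice these come from DRIFT's configuration.
bronze_path = "s3://lakehouse-bronze/bronze/loan-applications/"
staging_path = "s3://lakehouse-silver/staging/loan_applications/"

# Read raw JSON, apply light cleansing and typing, and write a structured Parquet dataset.
raw_df = spark.read.json(bronze_path)
silver_df = (
    raw_df
    .dropDuplicates(["application_id"])                              # assumed business key
    .withColumn("application_date", F.to_date("application_date"))
    .withColumn("loan_amount", F.col("loan_amount").cast("decimal(18,2)"))
)
silver_df.write.mode("overwrite").partitionBy("application_date").parquet(staging_path)

job.commit()
```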
For the gold layer, we opted for Amazon Redshift, utilizing Redshift Spectrum to seamlessly integrate the refined data from the silver layer. This allows us to perform advanced analytics and complex queries across both structured and semi-structured data efficiently.
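As a hedged sketch of that integration, the snippet below registers the silver tables catalogued in the AWS Glue Data Catalog as an external schema so Redshift Spectrum can query them in place on S3. The cluster identifier, database, user, and IAM role ARN are placeholders.

```python
import boto3

redshift_data = boto3.client("redshift-data")

# Map the silver-layer tables registered in the Glue Data Catalog into Redshift as an
# external schema, so Spectrum can query them in place on S3. All identifiers and the
# IAM role ARN below are placeholders.
create_external_schema = """
CREATE EXTERNAL SCHEMA IF NOT EXISTS silver
FROM DATA CATALOG
DATABASE 'silver_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-spectrum-role'
"""

redshift_data.execute_statement(
    ClusterIdentifier="analytics-cluster",
    Database="analytics",
    DbUser="etl_user",
    Sql=create_external_schema,
)
```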
We adopted the Apache Hudi table format with Parquet files to enhance data management and performance. Hudi provides capabilities such as incremental data processing, ACID transaction support, and time travel.
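A minimal sketch of writing a silver table in Hudi format with Spark is shown below, assuming the Hudi libraries are available on the cluster (for example via Glue's data lake format support). The table name, keys, and paths are assumptions for illustration.

```python
from pyspark.sql import SparkSession

# Hudi recommends the Kryo serializer; the Hudi Spark bundle must be on the classpath.
spark = (
    SparkSession.builder
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

# Illustrative Hudi options; table name, keys, and paths are assumptions for this sketch.
hudi_options = {
    "hoodie.table.name": "loan_applications",
    "hoodie.datasource.write.recordkey.field": "application_id",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.partitionpath.field": "application_date",
    "hoodie.datasource.write.operation": "upsert",  # incremental upserts instead of full reloads
}

staged_df = spark.read.parquet("s3://lakehouse-silver/staging/loan_applications/")
(
    staged_df.write.format("hudi")
    .options(**hudi_options)
    .mode("append")  # append mode is how Hudi upserts are issued
    .save("s3://lakehouse-silver/silver/loan_applications/")
)
```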
The Parquet file format enables efficient storage and retrieval of large datasets and is supported by most big data technologies.
DRIFT is a configurable ETL framework developed at Bajaj Technology Services in Python for quickly creating new data pipelines that load data into the Lakehouse.
Setting up jobs in the Data Lakehouse using the DRIFT framework involves simple configuration changes across four layers.
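The exact configuration schema is internal to DRIFT, so the snippet below is purely a hypothetical illustration of the idea: a new pipeline is described declaratively (source, target storage, keys, schedule) rather than coded by hand.

```python
# Hypothetical, simplified DRIFT-style pipeline configuration; the real framework's
# schema and layer names are internal and differ from this sketch.
pipeline_config = {
    "pipeline_name": "loan_applications_daily",
    "source": {
        "type": "s3_json",
        "path": "s3://lakehouse-bronze/bronze/loan-applications/",
    },
    "target": {
        "format": "hudi",
        "path": "s3://lakehouse-silver/silver/loan_applications/",
        "record_key": "application_id",
        "precombine_key": "updated_at",
    },
    "schedule": "cron(0 2 * * ? *)",  # daily at 02:00 UTC, for illustration only
}
```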
The gold layer is built on Amazon Redshift as a combination of views on top of Redshift external tables (via Redshift Spectrum) and physical summary/aggregated tables in Redshift. The aggregated summary tables store frequently queried and frequently aggregated data drawn from multiple base gold tables.
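A hedged sketch of how those two pieces fit together is shown below, with schema, table, and column names assumed for illustration: a late-binding view exposes a Spectrum external table directly, while a physical table materializes a frequently queried aggregate.

```python
import boto3

redshift_data = boto3.client("redshift-data")

# Gold-layer objects: a late-binding view over a Spectrum external table (views on
# external tables require WITH NO SCHEMA BINDING), plus a physical summary table for a
# frequently used aggregate. Schema, table, and column names are assumptions;
# cluster identifiers are placeholders.
gold_view = """
CREATE OR REPLACE VIEW gold.loan_applications_v AS
SELECT application_id, application_date, loan_amount, status
FROM silver.loan_applications
WITH NO SCHEMA BINDING
"""

gold_summary = """
CREATE TABLE gold.daily_loan_summary AS
SELECT application_date,
       COUNT(*)         AS applications,
       SUM(loan_amount) AS total_loan_amount
FROM silver.loan_applications
GROUP BY application_date
"""

redshift_data.batch_execute_statement(
    ClusterIdentifier="analytics-cluster",
    Database="analytics",
    DbUser="etl_user",
    Sqls=[gold_view, gold_summary],
)
```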
The gold layer is used for data consumption, whether for analytics or for developing BI reports in Tableau.
Here are some sample use cases we built on the Data Lakehouse that require integration between application data and large external data sets:
Our Data Lakehouse implementation for Bajaj Markets leverages AWS services and our custom ETL framework, DRIFT, to integrate application data in the data warehouse with data from external sources (as well as large and unstructured internal data sets), ensuring seamless access for advanced analytics.