Delta Live Tables on Databricks (ETL)
A short and simple explanation of Databricks Delta Live Tables (DLTs)
What is Delta Live Tables?
I've had the opportunity to work extensively with Delta Live Tables (DLTs) on multiple projects. Simply put, Delta Live Tables is a powerful technology that serves as an ETL framework for building end-to-end data pipelines. However, when I first encountered it, I found its complexity a bit overwhelming.
In this article, I aim to focus on the core concepts to provide you with a clear and simple understanding.
Why not simply a workflow?
Delta Live Tables simplifies building data pipelines by offering flexibility in how you process data. You can use SQL or PySpark to transform the data according to your requirements. You can also work with multiple datasets and even materialize them (more on that later) if necessary, which gives you valuable visibility into your pipeline.
It's worth noting that Delta Live Tables is not limited to complex pipelines. In one project, we used it to stream only a handful of tables, and from there we queried the data (including aggregations and joins) directly into a Power BI dashboard. We essentially consumed the streaming tables, which you can see in your catalog, and did everything else with SQL queries feeding Power BI. The data sources are not limited to tables either; they can also be files such as Parquet or CSV.
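To illustrate, here is a minimal sketch of the kind of downstream query you can run against a published streaming table; the column name productCategory is an assumption for illustration:
-- minimal sketch: aggregate a published streaming table downstream
-- (productCategory is a hypothetical column name)
SELECT productCategory, count(*) AS total_calls
FROM myCatalog.schema.loading_cus_calls
GROUP BY productCategory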
Different types of datasets within a DLT pipeline
Now let’s get back to the datasets mentioned earlier. Delta Live Tables supports three dataset types.
Streaming tables: Optimal when fast data processing and low latency are desired. A streaming table also does not recompute all source data with each update, unlike a simple workflow where every run essentially recomputes all of the data.
The syntax can be as easy as it gets:
-- ingest new rows incrementally from the source table
CREATE OR REFRESH STREAMING TABLE loading_cus_calls AS
SELECT *
FROM STREAM(myCatalog.schema.tbl_customer_calls)
Be aware that you cannot use aggregations or joins in a streaming table. However, you can add quality checks, comments, and constraints (depending on the DLT edition you choose), but that’s beyond the scope of this article. More on that another day.
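Still, as a quick taste, here is a minimal sketch of such a constraint (an expectation) applied to the streaming table above; the customerId column is an assumption for illustration:
-- minimal sketch: drop rows that fail the expectation
-- (customerId is a hypothetical column name)
CREATE OR REFRESH STREAMING TABLE loading_cus_calls (
  CONSTRAINT valid_customer EXPECT (customerId IS NOT NULL) ON VIOLATION DROP ROW
)
COMMENT "Customer calls ingested as a stream"
AS SELECT *
FROM STREAM(myCatalog.schema.tbl_customer_calls)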
Materialized views (or live tables): The data within the view is materialized and can be inspected. You can query these views in your catalog and verify the data along the way. In this step, you can apply your typical transformations, joins, and aggregations (the join key in the example below is an assumption):
CREATE OR REFRESH LIVE TABLE agg_cus_calls AS
SELECT calls.customerId, prod.productCategory, count(*) AS call_count
FROM myCatalog.schema.tbl_customer_calls AS calls
LEFT JOIN myCatalog.schema.tbl_product AS prod
  ON calls.productId = prod.productId -- join key assumed for illustration
GROUP BY calls.customerId, prod.productCategory
Views: The data within the view is not materialized and is therefore not published to your catalog.
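For completeness, here is a minimal sketch of a view that lives only inside the pipeline; it references the streaming table from earlier, and the filter column customerId is a hypothetical assumption:
-- minimal sketch: an intermediate view, not published to the catalog
-- (customerId is a hypothetical column name)
CREATE TEMPORARY LIVE VIEW cleaned_cus_calls AS
SELECT *
FROM LIVE.loading_cus_calls
WHERE customerId IS NOT NULL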
There is a ton you can do with Delta Live Tables, but I hope this gave you a basic idea of how to start off your projects with this useful technology.