Data Quality with Databricks Delta Live Tables (ETL)
Why Databricks Delta Live Tables Enhances Your Data Quality
Recently, I published an article in which I briefly explained how to start and schedule a Delta Live Tables pipeline. I promised to explore the features of Delta Live Tables one by one and explain what each one can actually do for you. Yesterday, I had some “nice” experiences in the office regarding data quality, so I thought it would be worth writing about.

Data quality is becoming increasingly important. Nearly every mid-sized company has its own data, but the quality of that data is crucial. Whether you are a data scientist or a data analyst, working with accurate data is fundamental to effective analytics and insights. We base our decisions on data by asking the right questions, yet even the right questions assume that the underlying data is accurate. If the data is wrong, you will most likely make the wrong decisions for your company. I think I have convinced you of just how crucial this boring work is. Now let’s get to it.
I have expectations
In Delta Live Tables, you can use expectations (not exceptions) to set rules for your data quality. These expectations ensure that the data in your tables meets the standards you set. You can easily apply them to your queries using Python decorators or SQL constraint clauses. Last time, when I showed you how to start a pipeline, I did not use any expectations, because they are completely optional.
An expectation basically consists of two things:
You define a condition that is checked against your data, for instance customerId IS NOT NULL.
Based on the result (true or false), you declare an action that should happen. There are multiple actions you can define, and that’s what makes it interesting (a Python sketch of all three follows the list below):
Warn: Invalid records are written to the target, and failures are reported as a metric for the dataset. If you don’t specify an action, warn is the default.
Drop: Invalid records are removed before writing to the target, and failures are reported as metrics for the dataset.
Fail: Invalid records stop the update from succeeding, requiring manual intervention before re-processing.
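Since expectations can also be applied with Python decorators, here is a minimal sketch of what the three actions look like in Python. The table names, the source table tbl_product, and the column orderCount are just placeholders; the SQL equivalents follow in the next section.

import dlt

# In a Delta Live Tables notebook, `spark` is provided by the runtime.

# Warn (default): invalid records are kept, failures are reported as a metric.
@dlt.table(comment="Products with a basic data quality check")
@dlt.expect("valid_order_count", "orderCount > 0")
def products_checked():
    return spark.read.table("tbl_product")

# Drop: invalid records are removed before writing to the target.
@dlt.table(comment="Products with invalid rows dropped")
@dlt.expect_or_drop("valid_order_count", "orderCount > 0")
def products_cleaned():
    return spark.read.table("tbl_product")

# Fail: a single invalid record stops the update from succeeding.
@dlt.table(comment="Products where invalid rows fail the update")
@dlt.expect_or_fail("valid_order_count", "orderCount > 0")
def products_strict():
    return spark.read.table("tbl_product")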
Some examples
Now, let’s start with a simple expectation that checks the data against our standard. In this case we don’t define an action, so the default warn action applies: the pipeline is not interrupted, and the invalid records are still written to the target; they are only reported as a metric.
CREATE OR REFRESH { STREAMING TABLE | LIVE TABLE } table_name (
  CONSTRAINT expectation_name EXPECT (orderCount > 0)
) AS
SELECT * FROM tbl_product
However, we have more options than just a warning. Let’s have a quick look at how we can define an action explicitly.
CREATE OR REFRESH { STREAMING TABLE | LIVE TABLE } table_name (
  CONSTRAINT expectation_name EXPECT (orderCount > 0)
  [ON VIOLATION { FAIL UPDATE | DROP ROW }]
) AS
SELECT * FROM tbl_product
When to use what?
Drop Rows:
Minor Data Quality Issues: For example, if a dataset includes non-critical metadata fields with occasional missing values, dropping these rows makes sure the main analysis remains unaffected.
High-Volume Data: In IoT sensor data streams, occasional corrupted readings can be dropped without impacting overall trend analysis.
Real-Time Applications: For a live analytics dashboard, dropping rows with minor errors ensures the data flow remains uninterrupted and timely.
Fail the Pipeline:
Critical Data Quality: In financial transaction processing, any invalid records could lead to incorrect financial reporting, necessitating a pipeline failure to ensure accuracy.
Regulatory Compliance: Healthcare data must be accurate for legal reasons. Any deviation or error would require halting the pipeline to maintain compliance.
Batch Processing Integrity: For monthly sales reports, even a single invalid record can distort the entire dataset's metrics, so failing the pipeline ensures data integrity.
Warn:
Monitoring Data Quality: When integrating a new data source, setting the action to warn lets you log and monitor invalid records without stopping the data flow, which is useful during initial development and testing phases (a sketch of how to inspect these metrics follows below).
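By the way, the metrics that expectations report end up in the pipeline’s event log, so you can inspect them yourself. Below is a rough sketch in Python, assuming your pipeline has a storage location configured and the event log lives under its system/events folder; the path is a placeholder, so check where your pipeline actually stores its event log.

# Rough sketch: read the pipeline event log and pull out the expectation metrics.
# The path is a placeholder; point it at your pipeline's storage location.
event_log = spark.read.format("delta").load("dbfs:/pipelines/<storage-location>/system/events")

expectation_metrics = (
    event_log
    .where("event_type = 'flow_progress'")
    .selectExpr(
        "timestamp",
        "origin.flow_name AS flow_name",
        "get_json_object(details, '$.flow_progress.data_quality.expectations') AS expectations",
    )
    .where("expectations IS NOT NULL")
)

expectation_metrics.show(truncate=False)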
In conclusion, leveraging expectations in Databricks Delta Live Tables offers a powerful and flexible approach to ensuring data quality in your data pipelines. Unlike simple WHERE clauses, expectations not only let you handle invalid and faulty records but also give you great visibility into data quality issues. They make your code more maintainable and robust.
For any data-driven organization looking to maintain high standards of data integrity, adopting expectations in Delta Live Tables is a highly recommended practice.