Mastering Continuous Jobs in Databricks
Job management in Databricks: using the Jobs API to start, stop, and restart your jobs
In the fast-changing world of data engineering, managing streaming jobs well matters more and more. As data pipelines grow more complex and the demand for real-time processing increases, being able to control these jobs becomes a key skill. Today, we’ll look at a powerful toolset within Databricks that every data engineer should know: managing streaming jobs through the Jobs API.
Why Streaming Jobs Matter
Streaming jobs are key for real-time data processing. They let businesses process data as it arrives and react to events the moment they happen. This is critical for use cases like fraud detection, real-time analytics, and IoT data processing.
One of my own projects was exactly that: a fraud detection case. It required a reaction within seconds, which meant a regular job on a scheduler would not cut it.
But wait — why not just schedule a job to run every few minutes?
Scheduled Jobs vs. Streaming Jobs
Scheduled jobs run at set times. While this works for many tasks, it doesn’t fit well for real-time data needs. Here’s why:
Latency: Scheduled jobs have delays. If a job runs every minute, any event happening right after the job runs will wait almost a minute to be processed. This delay can be a problem for time-sensitive tasks like fraud detection.
Resource Use: Additionally, job clusters on Databricks need time to start, and they shut down between runs to save cost. If a job runs every 15 minutes, the cluster terminates before the next run is triggered, so every run pays the startup cost again. Streaming jobs use resources efficiently by handling data as it arrives, while scheduled jobs may need to start resources repeatedly, which can be less efficient and more costly.
You can run your Databricks job periodically with the Scheduled trigger type or ensure there’s always an active run of the job with the Continuous trigger type.
With scheduled jobs, you can set them to run at specific times: every minute, hourly, daily, weekly, or monthly. With continuous jobs, Databricks makes sure there is always one active run of the job. A new run starts right after the previous one finishes, whether it succeeds or fails, ensuring the job is always running if needed.
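For reference, the Continuous trigger type is simply a continuous block in the job settings. Below is a minimal sketch of what a continuous job definition could look like when sent to the Jobs API; the job name, notebook path, and cluster id are placeholders:
# Sketch of a job definition with a continuous trigger (placeholder values).
continuous_job_settings = {
    "name": "streaming-fraud-detection",
    "tasks": [
        {
            "task_key": "ingest",
            "notebook_task": {"notebook_path": "/Repos/demo/streaming_notebook"},
            "existing_cluster_id": "1234-567890-abcde123",
        }
    ],
    # This block tells Databricks to always keep one active run of the job.
    "continuous": {"pause_status": "UNPAUSED"},
}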
Continuous Jobs Need External API Calls
Managing continuous jobs (streaming jobs) is more complex than managing scheduled ones. Because they run all the time, processing data as it streams in, you start, stop, and manage them through external API calls rather than through a schedule. Managing these jobs well keeps them running smoothly, and this is where the Databricks Jobs API helps.
Below is a step-by-step guide you can use to manage your Databricks streaming jobs. The script lets you start, pause, or restart jobs in just five steps (more or less).
Step 1: Import Necessary Modules
First, we import essential Python modules:
import requests
import json

These modules help us handle HTTP requests and JSON data, which are important for talking to the Databricks Jobs API.
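The later snippets also assume an auth_header dictionary carrying a bearer token. One way to build it inside a Databricks notebook is to read a personal access token from a secret scope; this is only a sketch, and the scope and key names are placeholders:
# Sketch: build the Authorization header from a personal access token.
# "my-scope" and "databricks-pat" are placeholder secret scope and key names.
token = dbutils.secrets.get(scope="my-scope", key="databricks-pat")
auth_header = {"Authorization": f"Bearer {token}"}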
Step 2: Generic POST Request Function
This function sends POST requests to Databricks endpoints:
def generic_post_same_workspace(headers, body, relative_path):
    # Build the full endpoint URL from the current workspace URL and send the POST request.
    url = f"https://{spark.conf.get('spark.databricks.workspaceUrl')}{relative_path}"
    response = requests.post(url, json=body, headers=headers)
    return response.status_code, response.json()

Step 3: Functions to Manage Streaming Jobs
These functions handle pausing, starting, and stopping streaming jobs:
def pause_streaming_job(job_id, auth_header):
    # Pause the continuous trigger so no new runs are started.
    relative_path = "/api/2.1/jobs/update"
    body = {"job_id": job_id, "new_settings": {"continuous": {"pause_status": "PAUSED"}}}
    return generic_post_same_workspace(auth_header, body, relative_path)

def start_streaming_job(job_id, auth_header):
    # Unpause the continuous trigger so Databricks keeps an active run going.
    relative_path = "/api/2.1/jobs/update"
    body = {"job_id": job_id, "new_settings": {"continuous": {"pause_status": "UNPAUSED"}}}
    return generic_post_same_workspace(auth_header, body, relative_path)

def stop_current_runs(job_id, auth_header):
    # Cancel all active runs of the job.
    relative_path = "/api/2.1/jobs/runs/cancel-all"
    body = {"job_id": job_id}
    return generic_post_same_workspace(auth_header, body, relative_path)

Step 4: Orchestrate Job Management
This function runs the start, pause, or restart actions for each job:
def run_start_stop(action, job_id, auth_header):
    # Run the requested action for the given job and print the last API response.
    if action == "start":
        status, response = start_streaming_job(job_id, auth_header)
    elif action == "stop":
        status, response = pause_streaming_job(job_id, auth_header)
        status, response = stop_current_runs(job_id, auth_header)
    elif action == "restart":
        status, response = pause_streaming_job(job_id, auth_header)
        status, response = stop_current_runs(job_id, auth_header)
        status, response = start_streaming_job(job_id, auth_header)
    else:
        print("No valid action")
        return
    print(f"{job_id}: Status: {status}, Response: {response}")

The "restart" action is crucial for maintaining the reliability and performance of streaming jobs. For example, if you have a streaming job processing real-time sensor data for anomaly detection, a failure in the job could disrupt the entire monitoring system. Using the "restart" action, you can quickly stop and start the job, ensuring continuous data processing with minimal downtime.
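If you want to confirm that a pause or restart actually took effect, you can read the job settings back with the Jobs API Get endpoint. This helper is just a sketch and not part of the original script; it reuses the same auth_header and workspace URL lookup as above:
def get_pause_status(job_id, auth_header):
    # Sketch: fetch the job settings and return the continuous pause status.
    url = f"https://{spark.conf.get('spark.databricks.workspaceUrl')}/api/2.1/jobs/get"
    response = requests.get(url, params={"job_id": job_id}, headers=auth_header)
    settings = response.json().get("settings", {})
    return settings.get("continuous", {}).get("pause_status")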
Step 5: Execute the Script
Finally, we run the script. Pass the job you want to manage (through its job_id) and the action you want to perform (start, stop, or restart):
run_start_stop("stop", 123456789, auth_header)

This script automates routine job management tasks, reducing the chance of mistakes and letting you respond quickly to changes by pausing, starting, or restarting jobs as needed. You can easily scale your data processing without the job management complexities, and the script is highly customizable to fit your specific needs.
Additionally, you can modify the code to use a list and loop through the job ids that need to start or stop at certain times. For example, in my case, I needed to start and stop jobs at 9 AM and 9 PM. I wrote this script, set a scheduler for those times, and had the script check the current time and either start or stop the pipelines accordingly. While there are other solutions, this one works perfectly for me; a sketch of that variation is shown below.
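Here is a minimal sketch of that variation. It assumes the run_start_stop function and auth_header from above; the JOB_IDS list is a placeholder, and the 9 AM / 9 PM cut-over simply mirrors my case:
from datetime import datetime

# Placeholder job ids; replace them with the jobs you want to manage.
JOB_IDS = [123456789, 987654321]

# This notebook is triggered by a scheduled job at 9 AM and 9 PM.
# Between 9 AM and 9 PM we start the pipelines, otherwise we stop them.
current_hour = datetime.now().hour
action = "start" if 9 <= current_hour < 21 else "stop"

for job_id in JOB_IDS:
    run_start_stop(action, job_id, auth_header)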
Conclusion
In this blog post, we have explored the importance of streaming jobs for real-time data processing and how they differ from scheduled jobs. We walked through a script that automates the management of streaming jobs — allowing you to start, pause, or restart them with ease. This script helps ensure your data pipelines remain efficient and scalable.
I hope you found this guide helpful. If you have any questions or feedback, please drop a comment below. Don't forget to like this post if you found it useful — your support helps me create more content like this. Happy coding!


