
Accelerate Data Engineering with Databricks’ Spark Declarative Pipelines in Apache Spark 4.0

Data & AI Insights Collective · Jun 19, 2025

Bringing Declarative Pipelines to Apache Spark

When you’re juggling batch jobs, streaming ingests, backfills, retries, and dependency chains, it’s all too easy to end up knee-deep in boilerplate “glue code.” That’s exactly why Databricks has handed over Spark Declarative Pipelines—previously known as Delta Live Tables—to the Apache Spark™ open-source project. Instead of wrestling with orchestration details, you simply declare what your end tables and transformations should look like, and Spark figures out how to execute them, optimizing for performance and resilience all the way.

Why Declarative Pipelines Matter

In traditional ETL you spend hours wiring up incremental loads, checkpoint locations, retry logic, and DAG ordering—work that every data team ends up reinventing. With declarative pipelines you:

  • Drop the boilerplate: No more manual read(), write(), and dependency wiring (the sketch after this list shows the glue code this replaces).
  • Scale reliably: Automatic checkpointing and retry logic keep streaming jobs humming.
  • Stay consistent: A shared standard across teams means fewer “works on my cluster” surprises.
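
To make the contrast concrete, here is a minimal sketch of the glue code those bullets describe, written with standard PySpark Structured Streaming APIs. The broker, topic, paths, and checkpoint location are illustrative placeholders, not part of any real pipeline.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Imperative style: every piece of plumbing is spelled out by hand.
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # illustrative broker
    .option("subscribe", "events")                      # illustrative topic
    .load()
)

query = (
    events.writeStream
    .format("delta")
    .option("checkpointLocation", "s3://path/to/checkpoints/events")  # manual checkpoint
    .outputMode("append")
    .start("s3://path/to/tables/events")                # manual sink path
)
query.awaitTermination()

# In the declarative model, this same ingestion becomes a decorated function that
# simply returns the DataFrame; checkpointing, retries, and ordering are handled
# by the engine (see the full example further below).

Multiply this by every table, add retry wrappers and a scheduler entry to get the ordering right, and you have the boilerplate the declarative model removes.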

Key Features at a Glance

| Feature | What It Does | Why It Matters |
| --- | --- | --- |
| Declarative APIs | Use @dlt.table (Python) or SQL table definitions (e.g. CREATE OR REFRESH STREAMING TABLE) to define tables & views | Focus on business logic, not plumbing |
| Unified Batch & Streaming | One API for both historic backfills and real-time ingestion | Simplifies development and maintenance |
| Automatic Orchestration | Spark infers execution order, parallelism, and backfill strategy | No external scheduler required |
| Checkpointing & Retries | Built-in support for saving state and retrying failed stages | Robust pipelines, even under heavy load |
| Execution Transparency | Inspect the underlying Spark plan for every stage | Tune performance without black-box abstractions (see the sketch after this table) |
| SQL & Python Support | Define pipelines in your language of choice | Low barrier to entry; integrates with existing codebases |
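
To see the “Execution Transparency” row in action: the query behind each declarative table is an ordinary Spark DataFrame, so you can inspect its plan with the usual tooling before (or after) wiring it into a pipeline. A minimal sketch, assuming an illustrative Delta path and column name:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The same DataFrame a declarative table function would return...
users_df = spark.read.format("delta").load("s3://path/to/users")  # illustrative path
summary_df = users_df.groupBy("country").count()                  # illustrative column

# ...can be inspected like any other Spark query; nothing is hidden behind the abstraction.
summary_df.explain(mode="extended")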

How It Works in Practice

Below is a real-world Python example using the current Databricks Lakeflow (Delta Live Tables) API. It mirrors what you’ll see in the upcoming Apache Spark release—just a change of namespace from dlt to the Spark-native module when it lands in open source:

import dlt  # import the Databricks declarative module
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

@dlt.table(
    name="users",                                   # final table name
    comment="All users loaded from Delta Lake"
)
def users():
    return spark.read.format("delta") \
        .load("s3://path/to/users")                 # batch read

@dlt.table(
    name="events",                                  # streaming source
    comment="Raw events from Kafka topic"
)
def events():
    return spark.readStream \
        .format("kafka") \
        .option("kafka.bootstrap.servers", "broker:9092") \
        .option("subscribe", "events") \
        .load()                                     # streaming read

@dlt.table(
    name="user_event_summary",
    comment="Aggregated event counts per user"
)
def user_event_summary():
    return (
        dlt.read_stream("events")                   # pull in the streaming table
        .join(dlt.read("users"), "user_id")
        .groupBy("user_id")
        .count()                                    # group-by aggregation
    )

Notice there’s no explicit .write() or manual DAG management—Spark’s engine handles that for you.
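
Dependencies work the same way as the pipeline grows: reading another pipeline table with dlt.read is what draws the next edge in the DAG. Here is a small, hypothetical extension of the example above (the top_users table name and the limit are illustrative):

import dlt

@dlt.table(
    name="top_users",
    comment="Users with the most events, derived from user_event_summary"
)
def top_users():
    # Referencing user_event_summary via dlt.read is all Spark needs in order to
    # schedule this table after the aggregation; no scheduler entry is required.
    return (
        dlt.read("user_event_summary")
        .orderBy("count", ascending=False)  # "count" comes from groupBy().count() above
        .limit(100)
    )

Spark now runs top_users after user_event_summary automatically, which is the “Automatic Orchestration” row from the table above in practice.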

Who Stands to Gain

  • Data Engineers reclaim hours every week by cutting out “glue code.”
  • Analytics Teams get consistent pipelines that simplify testing, lineage, and CI/CD.
  • Platform Operators standardize deployment, monitoring, and governance across every workload.

Getting Started

  1. Upgrade to the Apache Spark 4.x release that includes Declarative Pipelines once it ships; until then, the same API is available on Databricks as Lakeflow/Delta Live Tables.
  2. Explore the Lakeflow/Delta Live Tables docs to try out Python and SQL examples.
  3. Contribute: Follow the open JIRA proposal and community discussion to shape the API’s future.

By moving from imperative scripts to a declarative model, Spark Declarative Pipelines empowers you to declare intents, not steps—so you can ship reliable data products faster, and spend your time on insights instead of infrastructure. Give it a spin today, and see how much cleaner your ETL can become.

Tecyfy Takeaway

Declarative pipelines aren’t just a new feature—they’re a new way of thinking. Define your tables and transformations in a few lines, let Spark optimize execution under the hood, and voilà: faster development, stronger reliability, and full transparency. That’s modern data engineering, simplified.
