Spark Just Got Real: Benchmarking the New Real-Time Mode Against Micro-Batch
Benchmark of the new feature in spark 4.1 which changes completely how spark deal with streaming data
Spark Just Got Real: Benchmarking the New Real-Time Mode Against Micro-Batch
How Spark 4.1's realTime trigger crashes the latency floor — and how to measure it yourself with Kafka and PySpark
By António Rosa · May 7, 2026 · 8 min read
If you've ever stood up a Spark Structured Streaming job and watched the end-to-end latency hover stubbornly around 800 ms — no matter how aggressively you set processingTime — you know the drill. Below a certain point, the micro-batch tax kicks in: planning, scheduling, log writes, and state checkpointing eat any time you save by shrinking the trigger interval. So you'd cap your SLA at "a couple of seconds," ship it, and quietly recommend Flink to anyone who asked for tens of milliseconds.
That conversation just changed. Spark 4.1, released in December 2025, introduced Real-Time Mode (RTM) — a new trigger type that processes events as they arrive, with internal Databricks benchmarks showing p99 latencies from a few milliseconds up to ~300 ms depending on the transformation. And the migration is, almost embarrassingly, a one-line change.
This piece walks through what RTM actually changes under the hood, the exact code diff, and a runnable Kafka + PySpark benchmark you can use to measure the gap on your own hardware. The full project — docker-compose, producer, two consumer scripts, and a latency analyzer — is at the end of the article.
The Micro-Batch Tax
Spark Structured Streaming's whole identity has been "treat a stream as a sequence of small batches." Every trigger fires the same machinery that runs your nightly ETL: the driver queries the source for new offsets, plans the job, schedules stages in topological order, executes them, writes results, and finally commits a checkpoint to durable storage. Each of those steps is cheap on its own. Stacked together, they form a fixed cost per batch.
Two of those costs hurt the most when you try to go fast:
Sequential stage scheduling. Stages with shuffle boundaries — groupBy, joins, aggregations — wait for all upstream tasks to finish before the next stage starts. The reducer can't read shuffle files until every mapper has flushed. With small batches, that synchronization barrier dominates.
Per-batch durability. Before and after every micro-batch, Spark writes commit logs to object storage and uploads state store snapshots. These guarantee exactly-once semantics, but each round-trip to S3 or ADLS easily costs hundreds of milliseconds.
The net effect: shrink your trigger below ~500 ms and latency stops improving. Shrink it further and it actually gets worse, because you're now paying the fixed cost more often without amortizing it across enough records.
What Real-Time Mode Actually Changes
The Databricks engineering team's framing is precise: RTM doesn't throw out the micro-batch architecture, it generalizes it. The micro-batch becomes a checkpoint interval (5 minutes by default), and inside that interval, three things behave differently.
Longer epochs with continuous data flow. Instead of bounded "process offsets 1001–5000" batches, RTM runs a single long batch and pulls fresh records as they appear. Checkpointing and state snapshots happen at epoch boundaries — every 5 minutes by default — so their fixed cost is amortized over millions of records instead of a few hundred.
Concurrent stages. This is the big one. The driver schedules every stage of the query at once, and reducers start consuming shuffle output the moment any mapper produces it. There's no "wait for stage 1 to finish before launching stage 2." The flip side is a hard requirement: you need at least as many task slots as the total number of tasks across all stages. If your cluster can't reserve them all simultaneously, RTM won't start.
Streaming shuffle. Classic Spark shuffle is disk-based: mappers write files locally, the shuffle service indexes them, reducers fetch them. RTM bypasses disk and pipes shuffle data through memory between executors. Less buffering, lower per-record overhead, no waiting for files to close.
The trade-off is honest. Longer epochs mean less frequent checkpoints, which means more replay work on failure. Concurrent scheduling means dedicated slots, which means higher idle cost. RTM is a latency optimization that you pay for in resource reservation and recovery time. For an ETL pipeline that's fine running every 30 seconds, micro-batch is still the cheaper, simpler answer.
The One-Line Diff
Here is a vanilla Kafka-to-Kafka streaming query in micro-batch mode:
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr
spark = SparkSession.builder.appName("microbatch").getOrCreate()
events = (
spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "kafka:9092")
.option("subscribe", "events")
.load()
.selectExpr("CAST(value AS STRING) as value")
.select(expr("from_json(value, 'event_id STRING, event_time LONG, payload STRING')").alias("e"))
.select("e.*")
)
(
events.writeStream
.format("kafka")
.option("kafka.bootstrap.servers", "kafka:9092")
.option("topic", "results")
.option("checkpointLocation", "/tmp/ckpt-microbatch")
.outputMode("append")
.trigger(processingTime="500 milliseconds") # micro-batch
.start()
.awaitTermination()
)And here is the same query in real-time mode, on Databricks (DBR 16.4 LTS+):
# Set on the cluster (Spark Config tab), or in code:
spark.conf.set("spark.databricks.streaming.realTimeMode.enabled", "true")
(
events.writeStream
.format("kafka")
.option("kafka.bootstrap.servers", "kafka:9092")
.option("topic", "results")
.option("checkpointLocation", "/tmp/ckpt-realtime")
.outputMode("update") # required for RTM
.trigger(realTime="5 minutes") # checkpoint interval, not batch interval
.start()
.awaitTermination()
)Two real differences. The trigger keyword changes from processingTime to realTime, and the value you pass is the checkpoint interval, not the batch interval. Output mode must be update. That's it. Same DataFrame API, same UDFs, same Kafka connector.
One footnote for open-source Spark 4.1.0
The Java/Scala RealTimeTrigger class ships in the OSS Apache Spark 4.1.0 jars, but PySpark's DataStreamWriter.trigger() doesn't expose realTime= as a kwarg yet — that wrapper currently lives in Databricks Runtime only. Until the upstream PR lands, you can call the Java side directly via py4j; it's the same trigger, just constructed from Python:
jvm = spark._sc._jvm
RealTimeTrigger = jvm.org.apache.spark.sql.execution.streaming.RealTimeTrigger
j_trigger = RealTimeTrigger.apply("5 minutes") # Scala: RealTimeTrigger.apply()
writer = (
out.writeStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("topic", "results")
.option("checkpointLocation", "/tmp/ckpt-realtime")
.outputMode("update")
)
writer._jwrite = writer._jwrite.trigger(j_trigger)
query = writer.start()In Scala the equivalent is the one-liner the Databricks docs show:
import org.apache.spark.sql.execution.streaming.RealTimeTrigger
.trigger(RealTimeTrigger.apply("5 minutes")) // or .apply() for the 5-min defaultThe Benchmark
To measure the gap end-to-end I built a small harness that publishes timestamped events to a Kafka topic, runs both consumer variants, and computes percentile latency from a results topic. The full project lives in this article's companion repo (see the link at the end), but the latency probe is short enough to show inline:
# producer.py — stamp each event with epoch-ms event_time
import json, time, uuid
from kafka import KafkaProducer
producer = KafkaProducer(
bootstrap_servers="localhost:9092",
value_serializer=lambda v: json.dumps(v).encode(),
linger_ms=0,
acks=1,
)
target_rps = 2000
interval = 1.0 / target_rps
while True:
event = {
"event_id": str(uuid.uuid4()),
"event_time": int(time.time() * 1000),
"payload": "x" * 128,
}
producer.send("events", event)
time.sleep(interval)The consumer's job is just to forward each record to a results topic, preserving event_time. Latency is the wall-clock difference between the moment the analyzer reads the record from results and the original event_time baked in by the producer:
# analyze_latency.py — measure end-to-end p50/p95/p99
import json, time, statistics
from kafka import KafkaConsumer
consumer = KafkaConsumer(
"results",
bootstrap_servers="localhost:9092",
value_deserializer=lambda b: json.loads(b.decode()),
auto_offset_reset="latest",
)
samples = []
deadline = time.time() + 60 # collect for 60 seconds
for msg in consumer:
now_ms = int(time.time() * 1000)
samples.append(now_ms - msg.value["event_time"])
if time.time() > deadline:
break
samples.sort()
def pct(p): return samples[int(len(samples) * p) - 1]
print(f"n={len(samples)} p50={pct(0.5)}ms p95={pct(0.95)}ms p99={pct(0.99)}ms")Run it once with consumer_microbatch.py, again with consumer_realtime.py, and you have a direct comparison.
What I Actually Measured
Same query, same Kafka, same hardware (a single Apple Silicon laptop running Spark 4.1.0 in local mode against Kafka in Docker). Producer pinned at 2,000 events/s, 60 s of measurement after a 15 s warmup, end-to-end latency = analyzer_read_time − producer_event_time:
Three things to read off this table:
-
Median latency drops from a quarter-second to two milliseconds. That's the headline RTM number, and it shows up exactly where the architecture predicts: micro-batch's floor is the cost of one trigger cycle, RTM has no such cycle.
-
The p95 stays in the single digits (6 ms vs 457 ms). The win isn't a median trick — the bulk of the distribution is sub-10 ms.
-
Throughput is identical. Both runs processed ~120k samples in the 60 s window. RTM didn't cut latency by sacrificing events/s at this rate.
The tail closes somewhat — p99 is "only" 11× better — because on a single node the long-tail noise is dominated by JVM GC pauses and a single Kafka broker, neither of which RTM can fix. Databricks' published numbers on tuned clusters reach into single-digit-millisecond p99; the shape (a clean order-of-magnitude reduction at every percentile) is what holds across hardware tiers.
A few gotchas worth knowing before you wire RTM into anything that matters:
-
The cluster must have enough task slots to schedule all stages concurrently. A query with three shuffle boundaries and 200 shuffle partitions wants 600+ slots reserved up front. On a small cluster this means you simply can't run RTM, or you have to reduce
spark.sql.shuffle.partitions. -
Output mode is restricted to
update. If your sink expectsappendsemantics — say, an Iceberg table that doesn't tolerate updates — you'll need a different sink or a different mode. -
Failure recovery replays from the last checkpoint, which by default is 5 minutes back. For an ETL job that's nothing; for a sub-second SLA pipeline that's a long time of duplicate processing. Tune the trigger interval against your recovery budget.
-
applyInPandasWithStateisn't supported in RTM yet.transformWithStateis, but with semantic differences:handleInputRowsis invoked per row rather than per key per batch.
Run it yourself
The full harness — docker-compose for Kafka + a browser UI, producer, both consumer scripts, the latency analyzer, and a one-shot run_benchmark.sh — is on GitHub:
👉 https://github.com/armrosadev1991/spark-rtm-benchmark
docker compose up -d # Kafka on :9092, Kafka UI on :8080
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
pip install --upgrade 'kafka-python>=2.1.0'
export JAVA_HOME=/opt/homebrew/opt/openjdk@21/libexec/openjdk.jdk/Contents/Home
./run_benchmark.shOpen http://localhost:8080 while it runs to see the events and results topics filling up live.
References
- 7 Minutes to Understand the New Spark Streaming Feature that Changes Everything — Vu Trinh, Modern Data 101
- Introducing Real-Time Mode in Apache Spark™ Structured Streaming — Databricks Blog
- Breaking the Microbatch Barrier: The Architecture of Apache Spark Real-Time Mode — Databricks Blog
- Real-time mode in Structured Streaming — Databricks documentation
Tags: Apache Spark · Spark Streaming · Spark · Databricks · Azure Databricks
Comments
No comments yet. Be the first to leave one below.
Leave a comment