Databricks "Best Practices" That Are Actually Outdated in 2026
Stop following 2023 advice in 2026. Half of what's on Stack Overflow and old blog posts is fighting the platform instead of using it.
Databricks "Best Practices" That Are Actually Outdated in 2026
Stop following 2023 advice in 2026. Half of what's on Stack Overflow and old blog posts is fighting the platform instead of using it.
If you've spent the last few years on Databricks, you've probably internalised a set of rules that everyone repeats in code reviews:
- Always partition by date.
- "Schedule
OPTIMIZEandVACUUMnightly." - "Tune
spark.sql.shuffle.partitionsfor your workload." - "Call
.repartition(n)before writing." - "
Z-ORDERon the columns you query most." - "Don't use Serverless SQL, it's too expensive."
Each one of these was good advice at some point. Most of them are no longer the right default in 2026.
Databricks has spent the last two years quietly shipping features that automate exactly the things engineers used to tune by hand: Liquid Clustering, Predictive Optimization, AQE auto-coalescing, Optimized Writes, Automatic Liquid Clustering, and a much more competitive Serverless SQL. Predictive Optimization is now enabled by default for all new Unity Catalog managed tables, workspaces, and accounts. The platform is no longer a Spark distribution you tune — it's an opinionated lakehouse that increasingly tunes itself.
The pattern is simple: Databricks is automating what you used to do manually. The job of a senior data engineer is shifting from "squeeze performance out of low-level knobs" to "design the right table, then get out of the way."
Here are the six rules I see most often in production codebases that are actively making things worse in 2026.
Myth 1 — "Partition your Delta tables by date"
For years, the default ETL recipe on every Bronze and Silver table looked like this:
CREATE TABLE events (
event_id STRING,
user_id STRING,
event_ts TIMESTAMP,
event_date DATE,
payload STRING
)
USING DELTA
PARTITIONED BY (event_date);It made sense in 2020. It usually doesn't in 2026.
Why it's outdated. Hive-style partitioning has a list of well-documented failure modes that most teams just live with: skew on low-volume days, tiny files on high-cardinality partition keys, the inability to change the key without rewriting the entire table, and the eternal "do I partition by day, week, or month?" debate. Concurrent writes within the same partition aren't even possible without extra work.
What to do instead — Liquid Clustering. Liquid Clustering is GA on Delta tables since Databricks Runtime 15.2, and Databricks now explicitly recommends it for all new tables, including streaming tables and materialized views.
CREATE TABLE events (
event_id STRING,
user_id STRING,
event_ts TIMESTAMP,
payload STRING
)
USING DELTA
CLUSTER BY (event_ts, user_id);Or, on Unity Catalog managed tables on DBR 15.4 LTS+:
CREATE TABLE events (...) CLUSTER BY AUTO;CLUSTER BY AUTO lets the platform observe your query workload and pick the keys for you, evolving them over time as patterns change. Customers reporting on the public preview saw queries up to 10× faster on gold tables with no manual tuning.
The nuance no carousel will tell you. Liquid Clustering is not a free lunch:
- It cannot coexist with partitioning or
ZORDER. Migrating an existing partitioned table is a one-way decision. - It is incremental on write. To re-cluster historical data after enabling LC on an existing table, you need
OPTIMIZE FULL(DBR 16.0+). - Metadata-only queries (
SELECT DISTINCT partition_col) can be slower on LC tables than on partitioned ones, because LC has to scan file statistics rather than list folders. - For tables under ~1 TB, you usually don't need either partitioning or clustering. Ingestion-time clustering plus data skipping handles it.
The rule of thumb I use: default to CLUSTER BY (or CLUSTER BY AUTO on UC managed) for any new table over a few hundred GB, and only reach for partitioning in the rare cases where you need physical isolation for compliance, retention, or partition-level overwrites.
Myth 2 — "Schedule OPTIMIZE + VACUUM as cron jobs"
Walk into any data platform team and you'll find a notebook called nightly_maintenance that loops through every Delta table and runs OPTIMIZE followed by VACUUM. It costs real money, sometimes runs for hours, and almost nobody knows whether it's actually needed.
Why it's outdated. Predictive Optimization (PO) decides which tables benefit from maintenance, when to run it, and at what retention. Per the docs, it's enabled by default for accounts created on or after November 11, 2024, and existing accounts have been getting it gradually since May 2025 with completion targeted around April 2026. The Databricks engineering team's own year-in-review post reports exabytes vacuumed and hundreds of petabytes compacted in 2025 alone — at scale that no team's cron job can match.
What to do instead. For Unity Catalog managed tables, just enable PO at the account, catalog, or schema level and let it run:
ALTER CATALOG production ENABLE PREDICTIVE OPTIMIZATION;Then verify what's actually happening via the system table:
SELECT
metric_name,
operation_type,
table_name,
metric_value,
usage_unit,
event_time
FROM system.storage.predictive_optimization_operations_history
WHERE event_time >= current_date() - INTERVAL 7 DAYS
ORDER BY event_time DESC;PO runs ANALYZE, OPTIMIZE, and VACUUM on serverless compute (billed on a serverless jobs SKU), so you trade your nightly classic-cluster spend for usage-driven serverless cost.
The nuance. PO has real edges to be aware of:
- It only applies to Unity Catalog managed tables. External tables, Hive metastore tables, and Delta Sharing recipients are not covered. If you still have legacy externals, your cron job stays.
- It will not run
VACUUMif you've configureddelta.deletedFileRetentionDurationbelow 7 days. Set retention before enabling PO. - It needs query traffic to learn from. Cold tables that nobody queries won't get optimised.
- Regional availability is not universal — check your workspace.
If your environment is mostly Hive metastore + external tables, this myth doesn't fully apply yet. But the migration path is clear: move to UC managed tables and let PO do its job.
Myth 3 — "Tune spark.sql.shuffle.partitions manually"
The classic Spark interview question. Twenty different blog posts will give you twenty different formulas — total_data_size / 128MB, cores * 2, cores * 3, "set it to 200 and pray."
Why it's outdated. Adaptive Query Execution (AQE) has been on by default in Databricks since DBR 7.3 LTS. It coalesces small post-shuffle partitions, splits skewed ones, and even switches sort-merge joins to broadcast hash joins at runtime when one side turns out to fit in memory. In Databricks' own TPC-DS benchmarks, AQE delivered up to 8× speedup on the most-improved queries, and most of those gains came from dynamic partition coalescing — exactly the thing you used to manually tune.
The harder truth: a static spark.sql.shuffle.partitions = 800 cannot beat AQE because AQE has runtime statistics that the optimiser doesn't see at plan time. Manual tuning frequently makes things worse, not better, particularly when data volumes vary across stages or queries.
What to do instead. Leave it alone. If you really want to give AQE more headroom on a specific job, raise the initial number so AQE has more buckets to coalesce:
# Don't do this - it caps AQE's freedom
spark.conf.set("spark.sql.shuffle.partitions", 200)
# Better — give AQE a generous starting point, let it coalesce
spark.conf.set("spark.sql.adaptive.enabled", "true") # already true by default
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "128MB")The exception that justifies manual intervention: persistent data skew on a specific join key that AQE's skew handling doesn't fully resolve, or pathological broadcast misses where you genuinely know better than the optimiser. In those cases, use hints (/*+ BROADCAST(t) */, /*+ SKEW('t', 'col') */) instead of fighting the global config.
Myth 4 — "Avoid Serverless SQL — too expensive"
This is the one I hear most often from teams whose Databricks bill is dominated by classic SQL warehouses sitting half-idle waiting for someone to run a query.
Why it's outdated. The per-DBU rate of Serverless SQL is higher, yes. But for bursty, intermittent, or BI-style workloads the total cost can be lower because:
- Startup is 2–6 seconds vs roughly 4 minutes for Pro/Classic warehouses.
- There's no idle cluster cost — you don't pay for the warehouse keeping itself warm "just in case."
- Intelligent Workload Management (IWM) uses ML to predict query cost and scale rapidly to match demand, instead of the reactive "wait 5 minutes in queue, then upscale" logic in classic warehouses.
- DBU pricing on Serverless includes the underlying compute, so you're not double-paying VMs in your cloud account.
Independent benchmarks consistently show Serverless winning on cold-start and bursty workloads. Pro/Classic only beats Serverless on sustained, continuously warm workloads where the cache stays hot — a pattern that's much rarer in practice than people assume.
What to do instead. Don't take a position. Benchmark. Pick three representative workloads:
- A bursty BI dashboard that fires queries throughout the day.
- A scheduled hourly ETL job.
- A sustained ad-hoc analyst session.
Run each on Serverless and Classic for two weeks. Use the system.billing.usage and system.compute.warehouse_events tables to compare DBU consumption and wall-clock query latency. The answer for each workload type will surprise you in at least one direction.
The nuance. Serverless autoscaling can scale out aggressively under load, which is great for performance but can produce cost spikes that are harder to predict. Set max_num_clusters deliberately, and configure spending limits at the budget level.
Myth 5 — "Call .repartition(n) before every write"
Open any Spark codebase from 2019–2022 and you'll find this everywhere:
(df
.repartition(200)
.write
.format("delta")
.mode("append")
.saveAsTable("silver.events"))The intent was good: control the file count, avoid the small-files problem, get nicely sized Parquet files. The problem is that this advice is now actively wrong.
Why it's outdated. Two things have changed:
- Optimized Writes rebalances data into well-sized files automatically before writing.
- Auto Compaction kicks in synchronously after a write to compact small files.
Both are always enabled for MERGE, UPDATE, and DELETE operations on DBR 10.4 LTS and above — you cannot disable them. For Unity Catalog managed tables on SQL warehouses or DBR 11.3 LTS+, Databricks autotunes the target file size based on table size.
A manual .repartition(n) before a write does three things:
- Forces a full shuffle of the data, which is expensive.
- Overrides whatever file size the platform would have chosen based on table size and write pattern.
- Creates exactly
nfiles regardless of data volume — fine whennis right, terrible when it isn't (and it usually isn't).
What to do instead. Trust Optimized Writes. For most workloads, this is enough:
(df
.write
.format("delta")
.mode("append")
.saveAsTable("silver.events"))If you want explicit control:
ALTER TABLE silver.events
SET TBLPROPERTIES (
'delta.autoOptimize.optimizeWrite' = 'true',
'delta.autoOptimize.autoCompact' = 'true',
'delta.targetFileSize' = '256mb'
);The nuance. There are still legitimate cases for manual repartitioning:
- You're bucketing by a known downstream join key and want files aligned to it.
- You're writing to a partitioned external table that downstream Athena/Trino queries will read, and you want predictable file counts per partition.
- You're producing files for a non-Spark consumer (a
.csvexport, a parquet feed for an external system) where exact file counts matter.
Outside of those, .repartition(n) is almost always cargo-culted code from a 2020 tutorial.
Myth 6 — "Z-ORDER on your most queried columns"
The carousel calls this one out, but it deserves more depth because the migration is genuinely tricky.
Why it's outdated. ZORDER was the only way to get multi-column data skipping on Delta tables for a long time. It works, but it has serious operational costs:
- It's not incremental — every
OPTIMIZE ZORDER BY (…)rewrites all touched files from scratch, with significant write amplification. - It runs on a schedule, so newly written data isn't clustered until the next run.
- It cannot coexist with Liquid Clustering.
- Choosing the right
ZORDERcolumns requires knowing your query patterns up front and committing to them.
What to do instead. Liquid Clustering, again. It does multi-column clustering natively, applies incrementally on write (no full rewrite needed), and — when you turn on CLUSTER BY AUTO with Predictive Optimization — Databricks evolves the clustering keys as your query patterns evolve.
The migration path. This is where it gets practical:
-- 1. Drop ZORDER strategy from your maintenance code (no DDL needed,
-- ZORDER is a property of OPTIMIZE, not the table).
-- 2. Apply liquid clustering. Note: incompatible with existing partitioning.
ALTER TABLE silver.events
CLUSTER BY (event_ts, user_id);
-- 3. To re-organize historical data once, run OPTIMIZE FULL (DBR 16.0+).
OPTIMIZE silver.events FULL;
-- 4. From this point, regular OPTIMIZE handles incremental clustering.
-- Or, if the table is UC managed and PO is enabled, do nothing —
-- Predictive Optimization runs OPTIMIZE for you.If the table is partitioned, you'll need to recreate it without partitions before you can apply Liquid Clustering. Plan that migration carefully, especially for large tables.
The pattern: Databricks is automating what you used to do manually
If you re-read the six myths, the throughline is obvious. Every one of these moves the same direction: from a knob the engineer turns, to a behaviour the platform owns. That's not Databricks-specific — it's the same pattern you see with managed query optimisers in Snowflake and BigQuery.
The implication for our job is real. The differentiator is no longer "I know the right shuffle partition count for a 200GB join." It's:
- Designing the right table model (clustering keys, type system, schema evolution rules) up front.
- Knowing which automated features apply to your table type (UC managed vs external vs Hive) and which don't.
- Reading system tables (
system.storage.predictive_optimization_operations_history,system.billing.usage,system.lakeflow.*) to verify the platform is actually doing what you assume it is. - Knowing the exceptions — the workloads that genuinely need manual intervention — and being deliberate about them, not reflexive.
Stop fighting the platform. Save this article. The next time someone in your team writes PARTITIONED BY (event_date) on a brand-new Delta table, point them here.
Comments
What a useful article this is, a cheatsheet for using Databricks the best way, Thank u
Leave a comment