DuckDB Won By Refusing to Scale Out

In 2018, Hannes Mühleisen and Mark Raasveldt made a decision that reviewers told them was career suicide.
They built an analytical database that would never scale out across multiple nodes. No distributed architecture. No cluster management. Just single-node processing at a time when the entire industry had decided that distributed systems were the only path forward.
But as Data Talks on the Rocks host Michael Driscoll explores in this week's conversation with DuckDB co-creator Hannes Mühleisen, sometimes the bravest technical decision is refusing to solve Google's problems when you're building for everyone else.
The Front Door Problem Nobody Fixed
Setting up Postgres from scratch is miserable. You're editing opaque config files. Asking your admin for installation permissions. Some database systems still run protocols that predate TCP/IP.
Then, years ago, Hannes discovered MongoDB. Downloaded a binary. Made it executable. Boom. You could just start working.
The irony? MongoDB was deeply flawed back then. No schema. No transactions. No persistence guarantees.
But the experience was revelatory. "The experience doesn't have to suck," Hannes realized.
That insight became DuckDB's founding premise. Take the orthodoxy of databases (transactions, schemas, state-of-the-art execution engines) and combine it with the ease of use that MongoDB and SQLite demonstrated.
The storefront matters as much as what's behind it.
The Hill They Were Willing to Die On
"The fundamental hill that we were willing to die on was to bet on single node and scale up instead of scale out."
Paper reviewers literally wrote "if it doesn't scale out, it's useless" in their feedback.
The entire industry had decided that distributed systems were necessary. Google operated at Google scale. Therefore everyone needed distributed architectures.
Except 99% of companies aren't Google. And Google has very smart people who generally don't need outside help solving their infrastructure problems.
A company called Tailscale, founded by ex-Google engineers, decided to solve problems for that 99% instead. They're doing quite well.
DuckDB made the same bet but with data. Focus ruthlessly on single-node performance. Accept that this means giving up certain use cases, but dominate the ones that remain.
This constraint changes everything about implementation. When you know you're in one byte-addressable namespace, your aggregation operators can just point at hash tables. Three threads building hash tables can reference each other's memory directly. Try doing that across a thousand distributed nodes.
Distributed engines pay a massive algorithmic penalty: every operator must be capable of running across hundreds of machines.
When people compare Spark running on a single node versus DuckDB on the same node, they typically see a 10x difference. Not because DuckDB engineers are smarter, but because DuckDB can make assumptions Spark cannot.
Many DuckDB jobs now complete faster than a Spark cluster can start up.
Why Hardware Finally Caught Up
When Hannes got his first M1 laptop, he was stunned. The performance felt impossible.
Hardware has shifted the balance between distributed and single-node systems in ways that haven't fully registered with most teams. A Spark cluster doesn't start earning its complexity until roughly 100 nodes. Anything under that probably fits on a single powerful machine.
Google accidentally admitted this in a TensorFlow paper. Something like 99% of their machine learning jobs process less than 10 gigabytes of input data. Manageable on one machine.
Redshift published similar statistics recently. Most jobs run on surprisingly small datasets.
This reality encouraged DuckDB to double down on their uncompromising single-node design. The constraint became a massive competitive advantage.
There's also a language advantage. DuckDB and ClickHouse are written in C++; most competitors run on the JVM. When cache-efficient algorithms matter, compiled code wins decisively.
SQL That Doesn't Make You Hate Yourself
"GROUP BY ALL" solved a problem so obvious that every major database copied it within weeks.
The SQL engine knows which columns you're aggregating and which you're grouping on. Having users type this out manually just creates errors. DuckDB made it optional. Snowflake copied it. BigQuery copied it. Databricks copied it.
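Here's the difference in a minimal sketch (the sales table and its columns are hypothetical):

    -- The old way: spell out every grouping column by hand
    SELECT region, product, SUM(revenue) AS total_revenue
    FROM sales
    GROUP BY region, product;

    -- The DuckDB way: the engine already knows which columns aren't aggregates
    SELECT region, product, SUM(revenue) AS total_revenue
    FROM sales
    GROUP BY ALL;

Add a column to the first query and forget to update its GROUP BY clause, and you get an error. The second query just keeps working.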
Hannes finds this hilarious. Watching Snowflake implement your features two weeks after you blog about them feels validating.
The broader philosophy: SQL is often the first language people use to interact with data. Analysts and data engineers write queries before building dashboards. They're not machines. They shouldn't have to tolerate seventies-era rigidity.
DuckDB now lets you put aliases before expressions with a colon. "SELECT a: 42" means select 42 with the alias 'a'. This cleans up SQL dramatically. Human readability matters.
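A rough before-and-after, with made-up table and column names:

    -- Trailing aliases: the name you care about comes last
    SELECT 42 AS answer, upper(city) AS city_upper FROM trips;

    -- Prefix aliases: the name comes first, where your eye lands
    SELECT answer: 42, city_upper: upper(city) FROM trips;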
The key insight: these improvements are for humans, not machines. Generated SQL can write out aggregation columns fine. Machines don't make mistakes. Humans do.
The 1.0 Release and Long-Term Thinking
DuckDB called their 1.0 release "Snow Duck." By coincidence, it shipped the same week as Snowflake's summit, and people worried DuckDB had been acquired.
The 1.0 designation carried specific meaning. Before this, every release changed the storage format. Users had to reload data from backup constantly.
With 1.0 came a commitment. This storage format will remain readable by every future DuckDB version for decades.
This matters because databases are special. When Hannes wrote DuckDB's Parquet reader, he discovered they needed workarounds for Hadoop bugs from 10 years ago. Those files still exist. The sins of the past visit themselves on the future indefinitely.
Eventually, Hannes envisions a world where nobody cares which DuckDB version they're running. Like SQLite. It should just work.
The Database Can Be Anywhere
Reading a Parquet file from S3 in DuckDB requires one line: "SELECT * FROM 's3://bucket/file.parquet'".
This simple capability unlocked a use case people immediately loved. Data trapped in cloud storage suddenly became queryable without loading it anywhere first.
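In practice it looks roughly like this; the bucket and file names are placeholders, and credential setup is omitted:

    -- httpfs teaches DuckDB to read over HTTP and S3
    -- (recent DuckDB versions autoload it on first use)
    INSTALL httpfs;
    LOAD httpfs;

    -- Query the file where it lives; nothing is imported first
    SELECT *
    FROM 's3://bucket/file.parquet'
    LIMIT 10;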
But DuckDB breaks something more fundamental. The traditional assumption that databases live in one place.
You can put DuckDB in Lambda functions. On devices in the field. In phones. On storage nodes themselves. A major cloud provider uses it as an S3 Select alternative.
This architectural shift hasn't fully registered yet. The database doesn't have to be a server somewhere. It can be wherever computation happens.
When asked how he stays inspired after years of grinding on this project, Hannes's answer is simple: the team enjoys making crazy new things work that nobody has thought to do in a database engine, then watching how the world reacts.
When all the major cloud warehouses adopt your "stupid SQL feature" in two months, that's pretty motivating.
If you want to understand how refusing to solve the wrong problems leads to solving the right ones better than anyone else, this conversation is essential.
Watch the full episode on Data Talks on the Rocks.
Ready for faster dashboards?
Try for free today.
