Data Talks on the Rocks 6 - Simon Späti

Data Talks on the Rocks is a series of interviews with thought leaders and founders discussing the latest trends in data and analytics.
Data Talks on the Rocks 1 features:
- Edo Liberty, founder & CEO of Pinecone
- Erik Bernhardsson, founder & CEO of Modal Labs
- Katrin Ribant, founder & CEO of Ask-Y, and co-founder of Datorama
Data Talks on the Rocks 2 features:
- Guillermo Rauch, founder & CEO of Vercel
- Ryan Blue, founder & CEO of Tabular, which was recently acquired by Databricks
Data Talks on the Rocks 3 features:
- Lloyd Tabb, creator of Malloy and co-founder of Looker
Data Talks on the Rocks 4 features:
- Alexey Milovidov, co-founder & CTO of ClickHouse
Data Talks on the Rocks 5 features:
- Hannes Mühleisen, creator of DuckDB
Data Talks on the Rocks 7 features:
- Kishore Gopalakrishna, co-founder & CEO of StarTree
Data Talks on the Rocks 8 features:
- Toby Mao, founder of Tobiko (creators of SQLMesh and SQLGlot)
- Jordan Tigani, co-founder of MotherDuck
- Yury Izrailevsky, co-founder of ClickHouse
- Kishore Gopalakrishna, founder of StarTree
Data Talks on the Rocks 8 features:
- Joe Reis, co-author of Fundamentals of Data Engineering
Data Talks on the Rocks 9 features:
- Matthaus Krzykowski, co-founder & CEO of dltHub
Data Talks on the Rocks 10 features:
- Wes McKinney, creator of Pandas & Arrow
Recently, I've been collaborating with Simon Späti on a number of essays. For the sixth round of Data Talks on the Rocks, I interview Simon and we dive deep into the following topics.
- The journey from being a data engineer to technical author.
- The hype behind Bluesky - beyond the growing community, it has rich data that is open and available.
- The job to be done for data folks: are we analysts, data developers, or data engineers?
- The latest trends in the data space - object storage, schema evolution, data modeling, and declarative data stacks.
A conversation with Simon Späti on open social data, DuckDB’s rise, the declarative data stack, and why data modeling still matters more than ever. Check out the video interview.
For all the churn in data, some things change fast and some things do not change at all.
The tools change. The abstractions change. The architectures change. Every few years, the industry invents a new “stack,” a new category, or a new doctrine for how analytics should work.
But underneath the churn, the hard problems remain stubbornly familiar: how to ingest messy data reliably, how to manage schema evolution, how to model business concepts clearly, and how to make systems that stay usable as complexity grows.
That tension sat at the center of a recent conversation between Rill founder Michael Driscoll and data engineer, author, and technical writer Simon Späti on Data Talks on the Rocks. Their discussion ranged from Bluesky’s open social graph to DuckDB, object storage, orchestration, declarative systems, and the enduring importance of data modeling.
What emerged was a grounded view of the modern data stack: less hype, more judgment, and a renewed appreciation for open formats, portable compute, and systems that can adapt to change.
A career that spans multiple eras of data
Simon’s path into data started long before “data engineer” became a standard title.
He began in Switzerland through a dual apprenticeship system that combined formal study with hands-on work, first with Oracle and SAP, then later in banking systems built around traditional ETL, operational data stores, and long-term data warehouses. That background matters because it gives him a frame of reference that stretches across several generations of tooling.
He has worked through the eras of BI engineering, SQL-heavy warehouse development, “big data,” cloud tooling, and today’s more composable ecosystem. Along the way, he moved from database work into writing, consulting, technical education, and now authorship, including his ongoing book Data Engineering Design Patterns.
That long view gives Simon a useful skepticism about labels.
Ask what we should call the people who build modern analytics systems, and his answer is: it depends.
Titles like analyst, BI engineer, and data engineer function less as precise categories than as local dialects. They vary by geography, by company maturity, and by team structure. In some organizations, a data engineer mostly writes Python and manages infrastructure. In others, a BI engineer is deeply embedded in business logic and metric design. In still others, analysts are effectively doing production transformation work.
The point is not to win the taxonomy debate. The point is to recognize that modern data work often spans the full stack, from source systems to dashboards, and the boundaries between roles are blurrier than most job titles suggest.
Why Bluesky matters to data people
Bluesky has suddenly become interesting to the data community not just because it feels like a new social home, but because it exposes something rare: a rich, open, queryable stream of social data.
For practitioners used to working behind increasingly closed APIs and expensive platform boundaries, Bluesky feels like a throwback to an earlier internet — but with a modern architecture.
As Simon pointed out, Bluesky is compelling on two levels.
First, it recreates the social dynamics that made Twitter useful for technical communities: short-form discussion, public discovery, lightweight sharing, and a strong sense of ambient conversation.
Second, and more interestingly for data practitioners, it is built on an open protocol. The AT Protocol makes identity and data more portable. Multiple applications can be built on top of the same underlying network. And importantly, the data itself is accessible enough to invite experimentation.
That openness changes the relationship between platform and practitioner. Instead of being locked out of the underlying stream, data people can inspect it, query it, model it, and build on top of it.
In a world where most large platforms increasingly meter access, Bluesky feels almost anomalous: a live, open event stream with enough scale and richness to serve as a real-world analytics playground.
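Bluesky's Jetstream firehose delivers events as JSON over a websocket, which makes this kind of experimentation concrete. As a minimal sketch, here is how one such event might be flattened into a row for landing storage; the field names (`did`, `time_us`, `kind`, `commit.collection`, `commit.operation`) are assumed from the Jetstream commit-event shape and should be verified against the live stream:

```python
import json

def parse_jetstream_event(raw: str) -> dict:
    """Flatten one Jetstream JSON event into a row for landing storage."""
    event = json.loads(raw)
    commit = event.get("commit") or {}
    return {
        "did": event.get("did"),
        "time_us": event.get("time_us"),
        "kind": event.get("kind"),
        "collection": commit.get("collection"),
        "operation": commit.get("operation"),
        # Keep the nested record as a JSON string so drift in the payload
        # does not break the landing table.
        "record_json": json.dumps(commit.get("record", {})),
    }

sample = (
    '{"did": "did:plc:abc", "time_us": 1700000000000000, "kind": "commit", '
    '"commit": {"operation": "create", "collection": "app.bsky.feed.post", '
    '"record": {"text": "hello"}}}'
)
row = parse_jetstream_event(sample)
print(row["collection"])  # app.bsky.feed.post
```

Keeping the nested record as raw JSON is one defensive choice among several; a stricter pipeline might validate it against a registered schema instead.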
DuckDB and the rise of local-first analytics
If Bluesky offers a compelling source of open data, DuckDB represents an equally important shift in how that data can be processed.
For Simon, DuckDB felt immediately familiar in the best way: like a long-awaited answer to a problem data teams have had for years. Earlier generations of high-performance analytics required either heavyweight OLAP systems or tightly coupled enterprise tools that were fast but cumbersome to manage, hard to version, and difficult to integrate into modern workflows.
DuckDB changed that equation.
Its appeal is not just speed, though it is fast. Its appeal is that it delivers analytical power in a form factor that is radically lightweight: a single binary, easy to install, easy to embed, and capable of running close to the developer or the workload.
That unlocks a wide range of production use cases.
Sometimes DuckDB acts as a local analytical engine for interactive exploration. Sometimes it becomes a “zero-copy” compute layer that reads directly from open files in object storage without requiring data movement into a warehouse. Sometimes it serves as a fast intermediary inside pipelines, dramatically accelerating data transfers or test workflows. And sometimes it powers browser-based or edge-native analytics where data never needs to leave the local environment.
This is one reason DuckDB has become such a natural fit for the broader shift toward object storage and open table formats. If data lives in Parquet files on S3, teams increasingly want the freedom to query it directly, without paying a premium to copy it into a proprietary system first. DuckDB fits that model neatly: compute is portable, storage remains open, and the boundary between “development workflow” and “production workflow” gets much thinner.
That does not mean centralized warehouses disappear. It means teams have more options — and can choose the cheapest, fastest, or simplest execution model for a given job.
Open formats are changing the center of gravity
A major theme in the conversation was the growing importance of open formats and object storage as the foundation of modern data systems.
That shift matters because it separates the long-term value of the data from the short-term choice of compute engine.
When data is stored in a proprietary warehouse format, the vendor controls not just execution but access patterns, interoperability, and economics. When data is stored in open formats, teams gain leverage. They can experiment with different engines, share data more easily, and reduce lock-in.
This is one of the deeper structural changes underway in data infrastructure. The center of gravity is moving away from “the warehouse as the system” and toward a looser architecture in which storage, compute, transformation, and serving can be composed more flexibly.
Of course, openness creates new problems too.
A data lake can become a data swamp. Easier access can lead to more copies, more one-off workflows, and more ambiguity about source of truth. Open systems are not automatically well-governed systems.
That is why the old disciplines — modeling, contracts, lineage, catalogs, and shared definitions — matter just as much in this new world, if not more.
Orchestration is only one piece of the pipeline
When people talk about modern data stacks, they often collapse a whole data lifecycle into a few well-known tool categories: ingestion, orchestration, transformation, BI.
But Simon’s point was that real systems are more demanding than that taxonomy suggests.
Take something like the Bluesky Jetstream firehose. Grabbing a sample of events is easy. Building a durable system that continuously ingests, lands, normalizes, tracks, and serves that stream is a different challenge entirely.
You need somewhere reliable for the process to run. You need cheap landing storage. You need a strategy for schema drift and nested JSON. You need downstream transformations that can adapt when the source changes. You may also need lineage, documentation, discoverability, testing, and some form of semantic or business logic layer.
That is why orchestration alone is not enough. A production data system is not just scheduled code. It is a set of interconnected design decisions about reliability, change management, and how teams understand what their data actually means.
One of the most durable problems here is schema evolution.
Source schemas change. JSON payloads gain or lose fields. Event structures evolve. Naming conventions drift. And even after decades of tooling progress, this remains a central challenge of data engineering. Older systems managed it through release processes and tightly controlled warehouse development. Newer systems approach it with more flexible schemas, registries, contracts, or adaptive modeling techniques.
But the core truth has not changed: if your upstream changes and your system cannot absorb that change gracefully, the downstream breaks.
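One simple way to absorb upstream change is to separate the fields you depend on from everything else. The sketch below, with a hypothetical contract of expected fields, routes unknown fields into a catch-all column and turns missing ones into nulls, so the pipeline degrades instead of breaking:

```python
import json

# Hypothetical contract: the fields downstream models actually depend on.
EXPECTED_FIELDS = {"did", "time_us", "kind"}

def normalize(raw: str) -> dict:
    """Split an event into known columns plus an 'extras' catch-all.

    New upstream fields land in extras instead of failing the pipeline;
    missing expected fields become None so downstream code can decide.
    """
    event = json.loads(raw)
    row = {field: event.get(field) for field in EXPECTED_FIELDS}
    row["extras"] = {k: v for k, v in event.items() if k not in EXPECTED_FIELDS}
    return row

# Upstream added a field ("langs") and dropped one ("kind"):
row = normalize('{"did": "did:plc:abc", "time_us": 1, "langs": ["en"]}')
print(row["kind"], row["extras"])  # None {'langs': ['en']}
```

This is only one strategy; schema registries and data contracts make the same idea explicit and enforceable across teams.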
The promise of the declarative data stack
One way to reduce that fragility is to make systems more declarative.
This is a concept Simon has written about extensively, and it is worth taking seriously because it offers a coherent response to one of the biggest problems in modern data: too many moving parts, each with its own configuration model, runtime behavior, and assumptions.
In an imperative system, teams define exactly how something should happen. That provides flexibility, but it also creates complexity. Every edge case, every dependency, every transition between systems becomes a problem for humans to manage directly.
In a declarative system, teams define the desired state — what they want — and let the underlying engine determine how to achieve it.
SQL is the classic example. You specify the result you want, not the query plan. Kubernetes extends the same idea to infrastructure: define the state of the system, and the platform reconciles toward it.
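The reconciliation idea can be illustrated with a toy example: declare the set of tables you want, diff it against what exists, and let the engine derive the actions. All names here are illustrative:

```python
def reconcile(desired: set, current: set) -> list:
    """Diff desired state against current state and return the actions
    needed to converge, instead of hand-writing each migration step."""
    actions = [("create", t) for t in sorted(desired - current)]
    actions += [("drop", t) for t in sorted(current - desired)]
    return actions

desired = {"events_raw", "events_clean", "daily_counts"}
current = {"events_raw", "stale_tmp"}
print(reconcile(desired, current))
# [('create', 'daily_counts'), ('create', 'events_clean'), ('drop', 'stale_tmp')]
```

A real system would also diff column definitions and handle in-flight state, but the shape is the same: humans declare the end state, the engine plans the steps.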
The idea behind a declarative data stack is to apply that same philosophy more broadly across analytics systems.
Imagine defining transformation logic, serving logic, and semantic intent in a way that is portable across engines. Imagine data workflows where the contract between ingestion, orchestration, transformation, and BI is clearer because each system operates against a shared declaration rather than bespoke glue code.
That vision is not fully realized yet. And Simon is careful not to oversell it. Imperative workflows are often perfectly reasonable, especially in smaller systems. But as complexity grows, declarative approaches become increasingly valuable because they make systems easier to reason about, easier to integrate, and easier to recover when something changes.
The goal is not abstraction for its own sake. The goal is fewer brittle handoffs and a cleaner way to manage complexity across the stack.
Data modeling is still the heart of analytics
For all the excitement around new infrastructure, one of the strongest themes in the conversation was also the oldest: data modeling still matters.
It matters because businesses do not run on raw event streams. They run on definitions.
What counts as a user? What counts as revenue? What does profit include? Which table reflects the official customer entity? At what grain should a metric be computed? Where should it be materialized? How fresh does it need to be?
These are not minor implementation details. They are the substance of analytics.
The dream of dumping semi-structured data into a lake and handling everything “later” continues to collide with the same reality it always has: without well-designed models, downstream systems inherit ambiguity, duplication, and inconsistency.
Simon highlighted a few especially important evolutions here.
One is the rise of metrics definitions as shared, declarative assets. Rather than burying KPI logic inside multiple BI tools or dashboards, more teams are trying to define metrics centrally and expose them through consistent interfaces. That helps reduce drift and gives different consumers — dashboards, APIs, spreadsheets, applications — a common semantic foundation.
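A minimal sketch of that idea, with an entirely hypothetical metric and event shape: the KPI logic lives in one declarative spec, and every consumer evaluates the same definition rather than re-implementing it per dashboard:

```python
# A single, central metric definition (names are illustrative):
ACTIVE_USERS = {
    "name": "active_users",
    "grain": "day",
    "filter": lambda e: e["kind"] == "commit",
    "key": lambda e: e["did"],
}

def evaluate(metric: dict, events: list) -> int:
    """Every consumer (dashboard, API, notebook) calls the same definition,
    so 'active users' cannot drift between tools."""
    matching = [e for e in events if metric["filter"](e)]
    return len({metric["key"](e) for e in matching})

events = [
    {"did": "did:plc:a", "kind": "commit"},
    {"did": "did:plc:a", "kind": "commit"},
    {"did": "did:plc:b", "kind": "identity"},
]
print(evaluate(ACTIVE_USERS, events))  # 1
```

Production semantic layers express the same contract in SQL or YAML rather than Python, but the principle is identical: one definition, many consumers.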
Another is the growing awareness that materialization is itself a modeling decision. Caching can happen in many places: raw landed files, transformed tables, aggregate tables, cube-like serving layers, semantic caches, or application-facing indexes. Deciding where to persist and where to compute on demand is not just a performance optimization. It is part of the model.
And then there is the conceptual layer: the actual work of understanding the business well enough to decide what should exist in the first place.
That work cannot be automated away. Good modeling sits at the intersection of technical possibility and business meaning. It requires conversation, context, tradeoffs, and judgment.
The most important skill may be judgment
Toward the end of the discussion, Michael made an observation that feels especially relevant right now: one of the most important qualities in a data practitioner is not just technical skill, but judgment.
That means the ability to work in ambiguity. To enter systems with missing documentation, half-explained schemas, legacy tables, and tribal knowledge scattered across teams. To ask the right questions. To recognize what matters. To make sensible tradeoffs before all the information is available.
In many organizations, data work still begins in a fog of war. You are handed database access, a few cryptic column names, and maybe the name of the one person who understands what a field really means. From there, you are expected to build something trustworthy.
That has always required more than coding ability.
And it may become even more important in the age of AI. As Simon noted, tools that generate SQL, code, transformations, or dashboard logic raise the premium on human evaluation rather than reducing it. When machines can produce more technical output, humans matter even more as curators of correctness, relevance, and meaning.
AI can help accelerate execution. It cannot replace taste.
What’s actually changing
So what is actually changing in the modern data stack?
Open protocols and open formats are creating new possibilities for portability and interoperability. Local-first analytical engines like DuckDB are lowering the cost and friction of working with data. Declarative approaches are offering a path through growing complexity. And teams are starting to rethink where compute belongs, where metrics should live, and how tightly coupled their systems need to be.
But the fundamentals remain remarkably stable.
Schema evolution still breaks pipelines. Data modeling still determines whether analytics are useful. Business definitions still matter more than tool categories. And good systems still depend on people who can navigate uncertainty with sound judgment.
The modern data stack may look different from the stacks that came before it. But the best data work still comes down to the same thing: creating clarity from messy, changing, real-world systems.