Data Talks on the Rocks

Data Talks on the Rocks 5 - Hannes Mühleisen, DuckDB

Author: Michael Driscoll
Date: October 29, 2024
Data Talks on the Rocks is a series of interviews from thought leaders and founders discussing the latest trends in data and analytics.

Data Talks on the Rocks 1 features: 

  • Edo Liberty, founder & CEO of Pinecone
  • Erik Bernhardsson, founder & CEO of Modal Labs
  • Katrin Ribant, founder & CEO of Ask-Y, and the former founder of Dataroma

Data Talks on the Rocks 2 features: 

  • Guillermo Rauch, founder & CEO of Vercel
  • Ryan Blue, founder & CEO of Tabular which was recently acquired by Databricks 

Data Talks on the Rocks 3 features:

  • Lloyd Tabb, creator of Malloy, and the former founder of Looker

Data Talks on the Rocks 4 features:

  • Alexey Milovidov, co-founder & CTO of ClickHouse

Data Talks on the Rocks 6 features:

  • Simon Späti, technical author & data engineer

Choosing DuckDB has been one of the easiest design decisions for our team. It’s a natural fit for both the user experience requirements of auto-profiling and interactive modeling in Rill Developer, and exploratory dashboards in Rill Cloud. For our fifth installment of Data Talks on the Rocks, I had the pleasure of interviewing Hannes Mühleisen, creator of DuckDB. In this hour long conversation without a single mention of AI (see full video here), we discuss the following topics.

  • The origin story of DuckDB to the big 1.0 release. I even brought up a research paper Hannes wrote back in 2016!  
  • Putting aside the portability, embeddability and in-process nature, why DuckDB has outpaced many other database offerings, and become such a favorite among devs.
  • The fundamental hill DuckDB is willing to die on -- single node and scale up instead of scale out.

I’ve noted some of my favorite highlights below:

Hannes (05:24) What we set out to do with DuckDB is to somehow bring the, I'd say, the orthodoxy of databases, perhaps, where there is some features that we think are important, like transactions, like schemas, like state of the art, execution engines, together with sort of this ease of use ... how can we make something that people will actually use and not hate.
Hannes (31:14) So we decided to start from scratch. It was a really hard thing to do because we had something that was working right? And you're gonna be at the same point that you were, like, five years down the road or something like that. That is a hard decision to make, right? And I can tell you why we felt confident to make this decision, and it is a very basic human reason for it, because I had just gotten tenure so they couldn't fire me for doing crazy things. And Mark, I should mention, Mark Raasveldt, my then Ph.D. student who started DuckDB together with me, he had finished all his papers for his Ph.D., but still had time left on his contract. So that meant that we were both kind of in this sort of like, “Hey, they can't do anything to us. We can just disappear and do some [stuff].”  And we actually didn't tell anyone for the longest time, I think for almost a year, we didn't tell anyone what we were doing.
Hannes (36:40) I think the fundamental hill that we were willing to die on six years ago and turned out not to have died on, was to bet on a single node and scale up instead of scale out. And I think that's the fundamental difference to most of the systems that you mentioned, and that’s, I think, something that we had this gut feeling that the scale out was not required. And it was really a hard thing to say back then, because we, I have actual review comments from papers we wrote where they said, “If it doesn't scale out, it's useless”, something to that effect, right? And, that was the common wisdom back then. Everybody was like, if it doesn't scale to Google scale. And, I mean, yeah, it's not, it's not useful, right?
Hannes (57:13) I think people have a hatred of SQL and I can see where it comes from. It comes from, like, the trailing comma kind of nonsense. And like, over rigidity, overarching rigidity. And people don't realize that in analytics, SQL is actually often the first language that you use to interact with data. Yes, there's tons of generated SQL, but if you think about, an analyst or a data engineer sitting there, they're not going to start building a dashboard from this data set they just found, they're going to write SQL queries and be like, “What on earth is this?” So that's why I think we have sort of realized that we are trying to make a language for humans, and not humans from the 1970s, either, but humans from 2024.
Hannes (1:11:48) One of the things that I think is interesting about DuckDB is that it breaks up the traditional sort of place where the database lives completely. In the past, it was clear, here's your database server, the OLAP cube warehouse lives on the same [server]. And what we're seeing with DuckDB is this crazy explosion of sort of creativity where you can put a data engine, you can put it in a lambda, you can put it in your device out in the field, and we have actually customers that are doing that. There's a big cloud provider that's basically making S3 select alternative with DuckDB, because they're like, “Yeah, we can put this on our storage, it's fine. It doesn't cost us anything.”

Michael 

I am excited today for our latest and greatest interview for Data Talks on the Rocks, sponsored by Rill Data, to have Hannes Mühleisen, creator of DuckDB, with me. Hannes, welcome to the show. 

Hannes (00:18)

Thanks Mike for having me, it’s a pleasure.

Michael (00:21)

So Hannes, the database ecosystem is really one of the most incredibly crowded spaces out there. In the last few years, we've seen new categories of databases emerging, like vector databases. Since the first commit six years ago in 2018, DuckDB has emerged as one of the fastest growing database technologies in the world. And you told me, when we spoke about this interview, that there are over 1 million visitors a month to the website, terabytes and terabytes of downloads of the DuckDB binaries, and as far as I can tell, tens of thousands, if not hundreds of thousands, of users of DuckDB around the globe for analytics use cases. I guess what most of us know about DuckDB is that it's embeddable, it's ergonomic. We love the SQL ergonomics, and it's fast. But that alone doesn't really explain it. To the people out there wondering why DuckDB is so darn popular among developers, I'd love to hear, in the words of the creator, why do you think DuckDB has kind of outpaced all the other database offerings, and become such an incredible favorite among devs?

Hannes (00:45)

I think that's a great question, and obviously in one aspect, I'm incredibly grateful for people to actually care. I think our expectation when we first started making DuckDB was that nobody would care, because that's kind of the default when you make anything. But there was an aspect of sort of thinking about this, about the development, I think people call it the DevEx. I call it user experience. For me, database users are users. We had observed some very, sort of, varying user experiences with data management systems in general, right? And there are many reasons for it. One of them is that some of these platforms come from the 70s and have essentially never changed. There are some systems that still run protocols that predate TCP, and it just shows in the ecosystem. I don't know if you have ever tried running anything with DB2? 

Michael (02:54)

Years ago.

Hannes (02:58)

So not to sort of stomp on IBM in particular, but that is wild. Right there you're, like, setting environment variables, you're sourcing bash configuration scripts, and then you get a bunch of binaries magically appearing in your path, and those are the ones that you use to talk to the database. And it's all wild, honestly, and that's interesting, because it makes these things look really old, but it doesn't necessarily mean they're old. It just means their storefront may be looking old. For some of these systems, there is modern, new technology in the engine, but you would never think so. And these old sort of interactions kind of made it hard, or made it frustrating, to use these systems. I don't know if you've ever tried to set up Postgres from scratch, but it's not fun. Even with Postgres, you'll find yourself editing some opaque config files, and you have to ask your admin to be able to install the thing in the first place. And then you're in this world of …

Michael (04:07)

Which is probably why Amazon's RDS is one of their most incredibly popular offerings. Because you don't have to set up Postgres yourself. They just [give you] an easy button and take a margin on that. 

Hannes (04:21)

Yeah, and I totally get it. One thing that really shocked me when I looked at MongoDB the first time, many years ago, was that they had this thing where there was a binary on the internet. You downloaded a binary, okay, and then you sort of made it executable, and you executed it, and boom. There it was. It had a web interface up and running, or rather a REST API up and running, and you could just start and go. And for me, that was sort of a revelation, actually, like the experience doesn't have to suck. And the irony, of course, is that MongoDB is, in many ways, a very flawed system, right? Now they have all these things, but back then there was no schema, there were no transactions, there was no persistence, there was nothing. It was just like put your data in and hope for the best. I think that's fair to say, but the experience was great. And so what we set out to do with DuckDB is to somehow bring the, I'd say, the orthodoxy of databases, perhaps, where there is some features that we think are important, like transactions, like schemas, like state of the art, execution engines, together with sort of this ease of use that we saw with systems like Mongo. SQLite, too, is extremely famous for just being unobtrusive. And I think that was one of the starting thoughts. It's like, how can we make something that people will actually use and not hate.

Michael (06:03)

Yes, wow. So you talked about the old storefront on DB2. It's sort of bringing the classical orthodoxy of databases, combining with the modernity of user or dev experience, right? So having a new storefront on a, frankly, a new classical back end. I guess I would say classical, sort of database orthodoxy. Let me ask, though, it's not just that the developer experience is fantastic. Again, as someone who's used DuckDB, I echo how lovely it is: the experience of downloading it and having it kind of running instantaneously, even, of course, we can talk a bit, maybe later, about WASM as an even more direct path to getting DuckDB. But as far as what you get after you go through that front door. What do you think are the things that, because there is SQLite out there, granted Postgres is maybe not the easiest to get running, but there is Postgres in the cloud, and lots of folks, in fact many companies are building on top of Postgres in the cloud. What do you think in terms of the capabilities of DuckDB differentiated after you get through that front door and have hopefully a joyful first install experience?

Hannes (07:26)  

I think that's fair, because that's, again, what did frustrate me with MongoDB: you had this kind of shiny front, and then behind that there was, like, a bit of cardboard maybe. But in DuckDB we are really doing it properly. So we have a state of the art query execution engine, which is really based on decades of research here in Amsterdam. There's this database architectures research group here at the CWI, the National Research Institute, where I have been a senior researcher for many years, and still am, and DuckDB came out of our research there. So we were quite well connected to the latest in database research, like, how do you build an execution engine for analytics? And the idea of vectorized query processing was something that originated out of CWI, from the research of Marcin Zukowski [co-founder of Snowflake], Peter [Boncz], and Niels [Nes] way back when. And so we took that and we put it into DuckDB, and then we took ideas from Munich, the TU Munich, which is another hotbed of database research, where systems like Hyper, which is now in Tableau, and Umbra, also very innovative systems, come from. And we took ideas like the morsel-driven parallelism, and put them into DuckDB. And we of course kept innovating ourselves, because we are database researchers, we can push the envelope, no problem. And then, the challenge is, really, to take the state of the art stuff and still manage to fit it in the sort of shiny package without burdening the user, right, without saying, “Oh, but you know, this only works on the latest Intel CPU”, or “this only works, you know, if you have that kind of, I don't know, disk”. One of the things we spend an enormous amount of time on is, like, making things work on ancient Linux versions. My nemesis is CentOS 6, right? Because it's one of these things that is super easy and [what] many open source projects do: it's fine to say “Yeah, we use C++ 27” or whatever the most recent one was, and then that just automatically means that you're losing, I don't know, half the installed base of Linux out there, because there's just no way, no realistic way, that these people can ever kind of run these things, right? So we spend an enormous amount of time on that, and that sometimes really clashes with doing the latest and greatest and shiniest, or taking advantage of the latest and greatest and craziest APIs that Linux happens to have added in version 15.4 or whatever the latest is. I'm just exaggerating to make the point. 

So that's an interesting sort of conflict, if you want: how do you keep the simplicity intact and still make sure it is a state of the art engine? And we have some interesting tricks there, like, sometimes the compiler saves our butt. For example, say you want to do a byte swap, like you have a 64 bit number and just want to turn it around, okay? Like network byte order swapping in the olden days. So you can use an intrinsic, a built-in byte swap (bswap is what I think it's called), and that works, but that's not available on every compiler. That's something that works in GCC and maybe LLVM. And so we go out and spend the time and go, okay, is there a C algorithm or C++ algorithm that we can write down, that the compiler will recognize, so that it optimizes it to the single instruction that you get with the built-in? And then we have to look at the compiler and blah, blah, blah, and make sure that that's actually the assembly being generated. And that's, of course, a lot of work that we only do so we can keep the experience nice. 
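The byte swap Hannes describes can be written with nothing but shifts and masks. The sketch below shows that algorithm in Python purely as an illustration; the function names are made up for this example, and it is not DuckDB's code. In C or C++ the same pattern is the kind of thing a compiler may recognize and collapse into a single instruction, but, as Hannes says, that has to be checked in the generated assembly for each compiler.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF  # keep values inside 64 bits (Python ints are unbounded)


def bswap64(x: int) -> int:
    """Reverse the byte order of a 64-bit unsigned integer using shifts and masks."""
    x &= MASK64
    # Step 1: swap each pair of adjacent bytes.
    x = ((x & 0x00FF00FF00FF00FF) << 8) | ((x >> 8) & 0x00FF00FF00FF00FF)
    # Step 2: swap each pair of adjacent 16-bit words.
    x = ((x & 0x0000FFFF0000FFFF) << 16) | ((x >> 16) & 0x0000FFFF0000FFFF)
    # Step 3: swap the two 32-bit halves.
    return ((x << 32) | (x >> 32)) & MASK64


def bswap64_reference(x: int) -> int:
    """Reference version: serialize as little-endian, read back as big-endian."""
    return int.from_bytes((x & MASK64).to_bytes(8, "little"), "big")


sample = 0x0102030405060708
swapped = bswap64(sample)
```

Swapping twice returns the original value, which makes for a cheap sanity check alongside the reference implementation.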

Another thing that we do is that we don't have dependencies, right. DuckDB does not have dependencies, so you don't have to install Boost or God knows what to get this thing running. There's no compiler dependencies. There's no runtime dependencies. We do use other people's stuff, but then we have to ship it, right. So then we go into these libraries and we actually patch them so they are portable and they are compatible with every single compiler that we get our hands on. It is an interesting challenge. But I think so far, we've managed, and I think what I'm really proud of is that we've also managed to really innovate in the engine space as well. Even if our premise and part of our motivation is to keep the user experience great and to keep the portability great, we've still managed to innovate with things like out of core processing, where the intermediates within queries can become bigger than memory and good things keep happening, and that kind of stuff. So that's an interesting sort of combo. I think that's the space of challenges that we inhabit, I would say.

Michael (12:59)

So the portability and that kind of universality of being able to run DuckDB everywhere, people maybe take that for granted, that it just works. But there's that saying that “when technology is sort of indistinguishable from magic at times, but it just kind of works”, right? There's a lot of magic behind the scenes to make that happen, and it feels like that's really important, because we've seen in other areas where if something works in 95% of places, or even 98%, it's still not good enough to just rely on and depend on. And it's similar in the database space. I think when you've got 95% adherence to SQL, that's not enough, right? Because suddenly all the tools that depend on SQL cannot really operate if one in 20 queries kind of fails. I think one thing you talked about earlier, when we were kind of chatting about this interview, was how excited you were about the libraries, the community extensions, where folks are starting to make plugins for DuckDB and then redistribute those plugins. How does that complement or potentially conflict with what you just said about DuckDB, that it ships with no dependencies, and it can run everywhere, even on CentOS, the bane of your existence, CentOS 6?

Hannes (14:33)

Yeah, I think this is an excellent question. So the community extensions. I mean, let me start with extensions. The whole reason we have extensions, initially, was to actually mitigate a dependency. We want to be able to read from HTTPS resources; among others, S3 can be served, and is often served, over HTTPS. And to do HTTPS in a sort of responsible way, you kind of need OpenSSL or something comparable. And you really don't want to statically link OpenSSL, because A, it's gigantic, and B, there's a ton of data files lying around on the user's computer that influence its behavior, like the certificates it trusts. Otherwise you have to do your own and ship your own root certificates, and it's all big. So then that means you have to link to it, and that is not great, because, as I just said, we don't really want to depend on that. And there are also wonderful incompatibilities between versions, yada yada yada.

So in order to mitigate that, we came up with this plugin concept for DuckDB. We say, okay, the httpfs extension is actually a plugin that can depend on OpenSSL, but then DuckDB itself doesn't, right? And then DuckDB itself, if anyone wants to talk to an HTTPS resource, will install the plugin, and then you have it, and then the dependency is mitigated. We don't punish everybody that doesn't want this. And over the years, this has grown, and it has actually been quite successful internally, because this concept of extensions allows us to keep the core compact, and free of evil things. And in extensions you can basically do more things. And there's several sets of extensions. There are some extensions that are in the source tree that we build, that we maintain. There are some extensions that are not in the tree that we also maintain. Together, they're considered the core extensions. 
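The plugin idea above can be sketched in a few lines of Python. This is only an illustration of the pattern and not how DuckDB implements extensions: the `Core` class, the `EXTENSIONS` registry, and the handler are all hypothetical names, and Python's standard `ssl` module stands in for the heavy dependency. The point is that the core never imports the dependency itself; only loading the extension does.

```python
import importlib


class Core:
    """A tiny 'core' that knows nothing about TLS until an extension is loaded."""

    def __init__(self):
        self.handlers = {}  # URL scheme -> function that would open the resource
        self.loaded = []    # names of extensions loaded so far

    def load_extension(self, name):
        # Look the extension up in the (hypothetical) registry and let it
        # register itself with this core instance.
        EXTENSIONS[name](self)
        self.loaded.append(name)

    def can_open(self, url):
        scheme = url.split("://", 1)[0]
        return scheme in self.handlers


def _register_httpfs(core):
    # The dependency is imported here, at extension load time,
    # not when the core module itself is imported.
    tls = importlib.import_module("ssl")
    core.handlers["https"] = lambda url: (tls.__name__, url)


# Hypothetical registry: extension name -> registration function.
EXTENSIONS = {"httpfs": _register_httpfs}

core = Core()
before = core.can_open("https://example.org/data.csv")
core.load_extension("httpfs")
after = core.can_open("https://example.org/data.csv")
```

Users who never touch an HTTPS resource never pay for the dependency, which is the "we don't punish everybody that doesn't want this" part of the design.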

And then this year, we've actually opened up this extension ecosystem to anyone. You could do this before, but extensions needed to be signed, so there were some issues with trust. This year, we actually opened this up, so now everybody can make what we call community extensions, and there is a central repository that we maintain. We also do the builds and the distribution of these binaries. So basically, you don't have to deal with that. You just register your extension. It's a bit like, I don't know if you know how Homebrew works, like you have a descriptor file somewhere, and then the CI will grab that and start building things and deploying things. And then the cool thing is, from within DuckDB, you can just say, install something from community, and it will just work, right? And that, I think, is something that allows us to, kind of, focus on what is important, because there's other projects out there that bundle every library under the sun and become monstrous as a result. And very hard to build indeed. And we don't have that problem, because we can say, okay, if you can build core DuckDB, the extensions, they come as binaries. You can build them from source, but actually you don't have to, because we build binaries for like 20 platforms or whatever, and it's a bit like pip. I think there's not a single person out there that knows how to build every single Python package, because every maintainer has their own thing, and they don't talk to each other. But in this case, yeah, we build all these things, and you can just use them, and that also just sort of localizes expertise, right? Say there's somebody that's building an Excel reading extension. Okay, I don't know anything about Excel, but that person presumably does, and then they can go nuts on their Excel extension. And we don't even see it in that sense. I mean, we're glad that it happens. And of course, we see the extensions that have been published, but we don't have to review the code or anything like that. It's just a thing that happens.

Michael (19:13)  

You made the analogy to Homebrew, which is fascinating, and sort of thinking of DuckDB, not just as a database, but as a platform that others can build on. I do also know that for the maintainer of Homebrew, the burden of maintaining that ecosystem, I think, has been quite high at times. So, of course, I imagine that you're being very thoughtful in terms of how you do this, because, as they say, “no good deed goes unpunished”, and as the community and the ecosystem grows, the maintenance burden, the testing burden of that library of plugins could increase. 

I do want to almost go back in time a bit. We started out just talking about where DuckDB is today again, a million [monthly] visitors onto the website. Great things are happening. We'll talk a little bit about the 1.0 release, I think that was pretty significant in terms of the architecture and the improvements made. 

But going back in time, you've been a researcher at CWI for many years, no surprise that DuckDB, nothing is ever sui generis, [so] no surprise that DuckDB came out of a very fertile soil of folks who've been thinking about vectorized processing, column oriented databases for a long time. I still am curious, because I was looking through some of your papers, and I just want to read you a quote from a paper of yours from 2016 and maybe kind of ask the question of, at what point did you and colleagues say “Yes, we should do this. We should actually commit to building another database.” Because that decision is not easy, you're wise enough. Some people, this is almost by accident. They say “we do these things not because they're easy, but because we thought they were going to be easy.” But in your case, you really understood deeply the multi year slog that is building a new database. 

And I'll read from your paper that you published in 2016, about vectorized UDFs and column stores, and what you wrote at the end of that paper for future directions was,

“When creating these MonetDB Python UDFs, we try to make it as easy as possible for data scientists to make and use user defined functions. However, they still have to write user defined functions and use SQL queries to use them if they want to execute their code in the database. They would prefer to just write simple Python or R scripts and not have to deal with database interaction.” 

So your conclusion in this paper seemed to be that the path to sort of democratizing vectorized functions was actually through things like Python and R and MATLAB. And the goal was to meet these, these really analysts, these data scientists, where they were. What changed and led you to, instead of trying to put database UDFs inside of Python, it feels like it's flipped, and now people are putting Python UDFs into your new platform, DuckDB. 

So same question there, I’d love to hear your thoughts on: how did you go from this paper to a year later deciding we're going to build DuckDB?

Hannes (23:05)

This is amazing. I mean, Mike, I'm impressed that you would look at my 2016 paper and look at the conclusions and confront me with something I haven't thought about in eight years. 

But I think it's a very valid question, and I think that we maybe just start with what brought us sort of to this world of the data analysts. We actually talked to them. So we had this, we have this long standing collaboration with the R community. I particularly have a long standing collaboration with the R community. We still work to this day. We work with Posit on something called duckplyr. I will say a little bit more about that in a second. But we have a good relationship with this community, and we had been talking, and people had sort of been using databases, and they didn't really like it.  And then they said, “Okay, we can start translating some statistical algorithms to SQL, and yes, that's if we have to, we will do it.” And then there were some things that were difficult, like R has seven different definitions for a median, for example, and you have to support all of them if you want to run somebody else's script or not, some of them. So we thought, okay, the first thing we can do with sort of a medium amount of complexity is do these UDFs, these user defined functions.  Right where you say, let's open up Pandora's box, the holy sort of shrine of the database, and just allow, like random, unwashed Python code to run in it, which is traditionally not done. Even the big systems out there, they usually use something called a UDF server because they don't trust these third party languages. And they run this in a separate process, and they pipe input output.

Michael (24:58)

The inefficiencies associated with that are massive, right? Because you’re checking on every row, you're actually doing invocation row by row. So these UDFs  tend to be incredibly non-performant. 

Hannes (25:13)

Yeah. It's incredibly non-performant, and it's also just not the model that you have if you do statistics, where you're used to doing columnar things. Like in MATLAB, you’re used to working with matrices, not with scalar values. And languages like Python and R have quite high baseline interpreter startup costs. So if you do that for every row and you have, like, the network IO on top of it, it's not great. So we came up with this idea of vectorized UDFs to kind of amortize that [cost], and that works really well with things like R and Python, because you have columnar interactions: base R has this already, and Python has NumPy arrays, of course.
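The amortization argument can be made concrete by counting function invocations. The sketch below is an illustration under simple assumptions, not the MonetDB or DuckDB UDF interface: the function names and the chunk size of 1024 are invented for the example, and a plain Python list stands in for what would be a NumPy array or an R vector in practice.

```python
from itertools import islice

calls = {"row": 0, "batch": 0}  # how often each style of UDF gets invoked


def add_tax_row(price):
    """Scalar UDF: the engine calls this once per row."""
    calls["row"] += 1
    return price * 1.2


def add_tax_batch(prices):
    """Vectorized UDF: the engine calls this once per chunk of values."""
    calls["batch"] += 1
    return [p * 1.2 for p in prices]


def chunks(values, size):
    """Yield consecutive lists of at most `size` values."""
    it = iter(values)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


prices = [float(i) for i in range(10_000)]

# Row at a time: one call (and one round of call overhead) per value.
row_result = [add_tax_row(p) for p in prices]

# Vectorized: the same work, but the per-call overhead is paid once per chunk.
batch_result = [v for chunk in chunks(prices, 1024) for v in add_tax_batch(chunk)]
```

Both styles produce the same values; what changes is how many times the boundary between engine and user code is crossed, and that boundary is where interpreter and IO overhead accumulates.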

So that was the first approach, but we brought this then back to the R people and the Python people, and they were not loving it so much. So then that's when we started experimenting. I mean, of course, the paper was published in 2016. I think the work was done maybe a year or two earlier. 

That's when we really started experimenting with something called MonetDBLite. MonetDB is another database system that was built at the database architectures group here [at CWI]. It's still around. But we started sort of refactoring it into something that can run in process, because we realized that we needed to copy the SQLite model for two reasons. And the SQLite model is, of course, that it's an in-process database. We used to call it embedded. I now call it in-process, because when people think embedded, they think like small controllers or something like that. It's not that.

Michael (26:54)

They think sometimes almost physical, “physically embedded inside” like embedded systems.

Hannes (27:00)

Exactly, microcontroller. It's not that. So we started experimenting with sort of in-process. We realized we needed to go to the in-process deployment model. Because we looked at SQLite, we said, “Yes, this is it.” You don't have to sort of run a service that you have to maintain. If you want to run a Python script, you really don't want to configure, like, credentials or anything like that. You want the script to be able to run through. We even had some crazy R code to, in the background, download and install and run and start up and launch and configure a server, just to connect to it. And then that went wrong sometimes, and it was a giant nightmare. And so this in-process deployment model fixes that, because your database engine runs within the process, and all is great. 
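For readers who have not used an in-process database, the SQLite model Hannes refers to looks like this from a script's point of view. The example uses Python's built-in `sqlite3` module because it ships with the standard library; the table and column names are invented for illustration, and DuckDB itself is not shown here.

```python
import sqlite3

# No server to start, no port, no credentials: the engine runs inside this process.
con = sqlite3.connect(":memory:")

con.execute("CREATE TABLE readings (sensor TEXT, value REAL)")
con.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [("a", 1.0), ("a", 3.0), ("b", 10.0)],
)

# Query results come straight back as ordinary Python objects.
rows = con.execute(
    "SELECT sensor, AVG(value) FROM readings GROUP BY sensor ORDER BY sensor"
).fetchall()

con.close()
```

The script runs from top to bottom on its own, with nothing to download, launch, or configure in the background first.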

You have this other really wonderful advantage that things that are already in the memory of the host process are accessible to the database, which is really relevant in things like R and Python, because you have data frames sitting there that you can access right away, and query results can become data frames again. 

Michael (28:07)

I would also just interject that, along the same lines as all of the value of running in-process, I think there's another concentric circle of value that you create when DuckDB is portable and running on the very machine that the data is living on. It's the same point: in-process, you get access to all the things in the process. When you run DuckDB locally, you're also inside the security perimeter of the laptop or the server that you're running on, and you don't have to think about network egress or, more importantly, network credentials, right, all the security. So when you bring that DuckDB system closer and closer to the things that it interacts with, you are able to kind of leapfrog all these gates and barriers that naturally exist between systems. 

Hannes (28:59)

Yeah, absolutely. So we started with this rebuild, with this refactoring of MonetDB to kind of do that and run in process. And we built something called MonetDBLite. But then we realized that that was not going to work out. I mean, we built it, people got excited and really liked it. And so what that really did is gave us confidence to say okay…

Michael (29:54)

By the way, what year was that? 

Hannes (29:57)

I mean, MonetDBLite. I think there was a very early prototype already in 2013 or 2014, or something like that. MonetDBLite as a project, you’d have to check the CRAN history of the package: 2014, ’15, or ’16, something like that. 

And it became clear that refactoring a database system that was designed for, like, client server stuff to be an in-process system isn't gonna work, really, because there are things that client server database systems do that they cannot do in-process. For example, you cannot change your working directory. Sorry, but the other thing that's running in that process also has a working directory, and there's only one for the process, and if you change it, that process is going to be unhappy. 
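The working-directory problem is easy to reproduce. In the sketch below, a hypothetical engine startup routine behaves the way a standalone server reasonably might and changes into its data directory; the function names are invented for illustration. Because the working directory belongs to the whole process, the host's relative paths silently start pointing somewhere else.

```python
import os
import tempfile


def badly_behaved_engine_startup(data_dir):
    """Hypothetical server-style startup that assumes it owns the process."""
    os.chdir(data_dir)  # changes the working directory for everything in the process


def host_resolves_relative_path():
    """The host application resolves a relative file name against the current directory."""
    return os.path.realpath("input.csv")


original_cwd = os.getcwd()
with tempfile.TemporaryDirectory() as data_dir:
    before = host_resolves_relative_path()
    try:
        badly_behaved_engine_startup(data_dir)
        after = host_resolves_relative_path()
    finally:
        # Restore the working directory so the rest of the script is unaffected
        # and the temporary directory can be removed.
        os.chdir(original_cwd)
```

Here `before` and `after` refer to different files even though the host code did not change, which is the kind of surprise an in-process engine has to avoid causing.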

You cannot change things like the locale, right? You cannot just abort and pray and hope that your daemon restarts you, because if you abort, everything is gone, and the user is going to be very unhappy with you. So there are some things like that, and that's really fundamental. 

There are also some things like, some database systems love to override signal handlers, right? Well, the problem is that Python also likes to override signal handlers. Then you have this fight over who gets the signal handlers.

And so we needed to rethink some aspects of data management systems, to be like a good citizen in somebody else's process, and that turned out to be such a large amount of changes that we couldn't really do it. And then, of course, another aspect was that MonetDBLite's execution engine wasn't state of the art anymore at that point, and we knew better with the vectorized paradigm. 

And so we decided to start from scratch. 

It was a really hard thing to do because we had something that was working right? And you're gonna be at the same point that you were, like, five years down the road or something like that. That is a hard decision to make, right? And I can tell you why we felt confident to make this decision, and it is a very basic human reason for it, because I had just gotten tenure so they couldn't fire me for doing crazy things. And Mark, I should mention, Mark Raasveldt, my then Ph.D. student who started DuckDB together with me, he had finished all his papers for his Ph.D., but still had time left on his contract. So that meant that we were both kind of in this sort of like, “Hey, they can't do anything to us. We can just disappear and do some [stuff].” And we actually didn't tell anyone for the longest time, I think for almost a year, we didn't tell anyone what we were doing. We would show up to group meetings and say, “Yeah, we were working on query parsing”, and they would look at us like, “Well, why on earth are you looking at query parsing”, right?

Michael (32:30)

Why do you both look so exhausted, as if you've been coding until three or four in the morning, right? 

Hannes (32:36)

Yeah. I mean, it was an absolute grind the first couple of years, right? You have a gigantic sort of feature set of SQL, you mentioned it earlier, it is a giant feature space. You need to somehow cover it, and you have two people. I mean, eventually some people started joining, but it's still a fairly small team, and we have to, kind of, focus, which is also why we love this extension model, right? Because it means that our team can focus on what is important, namely, making the core engine work well, and the rest is something that other people can do.

Michael (33:13)

There's a quote, various versions of this quote, but one I recently read was that databases are an overnight success story after 10 years. In your case, it does feel like the journey has been faster than many databases. I mean, just by way of comparison. And maybe, we'll wade into that, which is that there's a lot of other database engines out there now. Clearly the in-process nature of DuckDB is a massive differentiator. Great to hear you going into depth about that. But as the community of users, community of companies that are leveraging DuckDB expands, the advantages of that in-process, the use cases expand to places where people are saying, “Hey, I am willing to pay the premium of having an out-of-process system that can handle more scale, if it provides a set of benefits.” And so there are a set of engines out there that, I would say, people do look at when they're thinking about powering analytical applications. There's Apache Druid, which, of course, I know pretty well. There's Apache Pinot, there's ClickHouse. There's new engines out there. I think Hydrolix is a new company that's recently emerged on the scene. There's StarRocks, which is essentially a fork of Apache Doris. As a database expert, I'm sure you pay some level of attention to these various systems here and there, but I guess I would ask, maybe having read some of the white papers of those systems, which, by the way, almost each of those I mentioned, I think, is more than a decade old, so they've been at it for a while. I think StarRocks is probably the youngest, or maybe Hydrolix is fairly new. But putting aside that portability, embeddability and in-process, and maybe thinking about looking forward, and I would actually also add that obviously, at Rill, we are significant users of DuckDB. The place where we sometimes, I think, start to hit the limits of what's possible with DuckDB is scale.

And so, I guess I would just ask, maybe that's one of the most important axes. How do you compare or position DuckDB against some of these other OLAP engines that people think of as vectorized? And aside from the portability and embeddability and in-process piece, which is a clear differentiator, what are some of the other pieces? And then maybe, specifically, [since] that's a general question. Specifically, how do you think or what do you aspire for DuckDB? Just DuckDB, we can think of MotherDuck as a cousin of DuckDB. Yeah, right, commercial cousin. How do you think about scale, how do you think about positioning against all these other OLAP engines out there?

Hannes (36:38)

I think this is an excellent question. And I think the fundamental hill that we were willing to die on six years ago and turned out not to have died on, was to bet on a single node and scale up instead of scale out. And I think that's the fundamental difference to most of the systems that you mentioned, and that’s, I think, something that we had this gut feeling that the scale out was not required.  And it was really a hard thing to say back then, because we, I have actual review comments from papers we wrote where they said, “If it doesn't scale out, it's useless”, something to that effect, right? And, that was the common wisdom back then. Everybody was like, if it doesn't scale to Google scale. And, I mean, yeah, it's not, it's not useful, right? 

Michael (37:29)

If you don't have Google scale, you can't sell to that one customer Google. The one customer in the world that has Google scale is Google, right?

Hannes (37:38)

Yeah. And there's actually a great talk from Mike Stonebraker about how database people have started solving the whales' problems, when the whales, companies like Google, have very smart people and are generally not interested in outside help to solve these problems, right? And there's an interesting joke about this, I think there's an inside joke about, “we are Google and everybody else is ‘tail scale’”, right? Because they are.

And then there's somebody who made a company called Tailscale, ex-Google people who were like, you know what? Let's just solve the problems for the 99% of people who are not Google, and we'll be rich. And turns out that works.

So we kind of did the same thing, but with data, because we realized that the whole field was chasing whales, and I see some of the distributed OLAP systems also kind of falling in this trap, or sort of following this trend, right? Following this trend of “it needs to be distributed, it needs to be scaled out, it needs to be massive”. And I think those were valid assumptions when these systems were made 10 years ago, if you wanted to make the safe bet.

I think our bet was the risky one, to say, “No, no, no, no, single node”, and it's what DuckDB as a project is really focused on. Our implementations of operators, for example, do assume that we are living in one byte-addressable namespace with the RAM, and that's something that you just can't do in a distributed system.

For example, if you take something like Druid that needs to run an aggregation across 1000 nodes, or Spark that needs to run an aggregation across 1000 nodes, their way of thinking about the aggregation operator is necessarily different from our way of thinking about the aggregation operator. 

And the differences are that we can basically say, hey, you know these, these three threads, they're building a hash table, and now they can just point at these hash tables with pointers and it will work, which is something that's unthinkable in a distributed engine. 
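The contrast described here can be sketched in Python. This is an illustration of the idea only and not DuckDB's actual aggregation operator: the function names are invented for the example, `Counter` stands in for a hash table, and `pickle` stands in for sending data over a network. Python threads also do not run this kind of work truly in parallel, so the sketch shows the data flow and says nothing about speed.

```python
import pickle
import threading
from collections import Counter


def build_partial(rows, partials, slot):
    """Each thread aggregates its slice into its own hash table."""
    table = Counter()
    for key, value in rows:
        table[key] += value
    partials[slot] = table          # publish a reference, not a copy


def aggregate_shared_memory(rows, n_threads=3):
    """Single node: partial tables are merged by following references in RAM."""
    partials = [None] * n_threads
    threads = [
        threading.Thread(target=build_partial, args=(rows[i::n_threads], partials, i))
        for i in range(n_threads)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    total = Counter()
    for table in partials:
        total.update(table)         # reads the other threads' tables directly
    return dict(total)


def aggregate_distributed_style(rows, n_nodes=3):
    """Same answer, but each partial table is serialized as if it crossed a network."""
    total = Counter()
    for i in range(n_nodes):
        local = Counter()
        for key, value in rows[i::n_nodes]:
            local[key] += value
        wire_bytes = pickle.dumps(local)        # stand-in for a network transfer
        total.update(pickle.loads(wire_bytes))  # the receiver rebuilds the table
    return dict(total)


rows = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
print(aggregate_shared_memory(rows) == aggregate_distributed_style(rows))
```

Both functions return the same totals. The difference is that the second one has to turn every partial hash table into bytes and back, which is the step a single-node engine can skip.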

And the people that have distributed engines, or engines that can be distributed, they have to pay a huge price in terms of algorithmic performance, because everything, every operator they build, needs to be capable of running on 100 machines. And we don't have that, and that gives us a crazy edge. For example, if you compare things like Spark running on a single node versus DuckDB running on the same node, you typically see some like 10x difference in performance.  And it's not because we know more.  Maybe we do.  But it's because we can make assumptions that they cannot make. And I think that's and I would also differentiate…

Michael (40:42)

The quote I heard from Michael Eastham of Tecton at one of the DuckCon events was that they had replaced aspects of their Spark pipelines with DuckDB. And the quote was, “Many DuckDB jobs now complete faster than a Spark cluster can start up.”

So anyway, it does seem that people are basically, there does seem to be a group of folks that are migrating Spark jobs that fit inside the envelope of a single node to DuckDB with significant performance improvement. 

Hannes (41:24)

I think there is something like a minimum viable Spark cluster size, right? And that has, and I think it's around 100 nodes. I think anything under that, you can probably burn off on a single node. And that's, I think, interesting, because that has shifted. And I think the reason it has shifted is because of the hardware developments, right? The hardware has become so much faster. You know, the disk has become so much faster around the CPUs. I mean, when I got my first M1 laptop, I was completely blown away with it. I mean, I'm sure you were too, right. It's like, “What, no, this cannot be that it's this fast.”  And that has really shifted the balance, but somehow this hasn't fully arrived in people's heads yet. 

And I would also want to make another distinction: I think ClickHouse and us are C++, and the others are, I think, mostly Java, or part of it Java, right? And I think there is this argument, this flame war going on on Twitter, how the Java people say, “No, Java can be fast”, and so on and so forth. But we have seen that when it's really coming down to sort of cache-efficient algorithms, it's really difficult to get this done in a garbage-collected language. Yes, you can work around that. Yes, it's hard. And then I think that engines like us, like ClickHouse, like Polars, are kind of showing that compiled code is just able to perform much better in these scenarios. And that matters, because if you can make a single node algorithm two times faster, you're pushing the envelope of the minimal viable Spark cluster further out, right? If you get 2x faster, then suddenly it's 200 nodes [for the minimal viable Spark cluster], and that's interesting to see how that boundary shifts.

And what really shocked me, so I have two more stories on this. One was that at some point, Google published a paper on TF [TensorFlow] data. I don't know if you remember this paper, but it basically admitted that, what was that, like, 99% of their machine learning jobs at Google have, like, less than 10 gigabytes of input data. I don't know what the number was. So it was like, aha.

Michael (43:48)

When you say TF data, that's TensorFlow, right? 99% of TensorFlow jobs are 10 gigs or less, right?

Hannes (44:00)

Something like that. Don't quote me, I'm not entirely sure about the numbers, but it was something manageable, and that was something that really encouraged us to continue on our sort of hill-dying experiment. And then just, I think, a couple of months ago, this Redshift paper came out that basically shows the same thing. The jobs that people run on Redshift, their input data is fairly small. And, yeah, I'm curious.

I would be really curious to see Databricks numbers, actually, because that would, I don't think they have said anything about that yet, but they are one of the companies that, if they would make this kind of statistics available, it would be great. 

Now but I think that has really encouraged us in going for this sort of, I would say, uncompromising, single node design and development approach to production. And I think, I don't think anyone else has quite, you know, gone that crazy on that one.

Michael (44:55)

I’m always looking for quotes. Goethe has a quote that says “in constraints, the master shows him or herself”, right? That constraints are what allow you to innovate. And so that constraint of we are going to focus on just solving for a single node clearly opens up other avenues that these other folks couldn't pursue. And then, of course, the world catches up. Those are bets that take many years to pay off.

One thing we were chatting about, we mentioned in the introduction here was that the 1.0 release was something that came out a few months ago. That was certainly a landmark in terms of decimal numbering, but clearly symbolic. No one, everyone always, I think, has a debate in open source communities of, “Hey, when are we going to call this thing 1.0 right”? I think there’s a lot that goes into that decision to label it 1.0 and of course, there's stuff underneath that decision. So I'd love to just hear in your words, and you and Mark, obviously, now a team of folks decided to make this 1.0, tell me a little bit about why this was the time to call it 1.0, what was in that release, and how's it been going since? Because it can always be a little bit of nail biting as that goes out into the world.

Hannes (46:33)

Yeah, I think it was a pretty big one for us. So I have a fun anecdote on it. So we called it Snow Duck. I don't know if you remember. Our versions have duck species as code names, and there is no such thing as a snow duck, so we invented it, it's anas nivis. We made up a Latin species name for it. 

Michael (46:53)

Not yet. Who knows, we've yet to discover a snow duck, and you'll have actually pre-named it.

Hannes (47:00)

We will clone one, but it was kind of funny, because we accidentally released it the week of the Snowflake Summit, and people were concerned we had been bought. So that was a bit of an accident.  And the reason we named it Snow Duck was because we are harkening back to Snow Leopard, the OS X release a few years back where they didn't do features, but only fixed bugs, because that was kind of what 1.0 was about. But our sort of rationale, on the 1.0 was really, we want to get the storage format sort of not finalized, but backwards compatible. 

So basically, in the previous sort of years, DuckDB has a storage format where you can save tables, you can load tables, it's very exciting, you can do transactions. But we had changed that format a lot, like, basically every sort of dot release we would say, like, “Hey, you need to restore your data from backup because we have a new storage format. Unlucky you.” It was also one of the reasons why we always called our previous releases preview releases, because we said, hey, we are not comfortable with you storing your data in this thing for good. But at some point we realized that our sort of benchmark, if you want, for the 1.0 was that we commit to a storage format and that we keep the storage format alive for I don't know how many decades it's going to be.

Michael (48:34)

When you say alive you mean stable, meaning, maybe not, it's certainly going to change, but the contract, or the commitment that DuckDB is making is that any future version of DuckDB will be able to read any DuckDB storage format from 1.0 on.

Hannes (48:57)

Yeah, exactly. And actually, the funny thing is, because we did that change in the 0.9 release, then we observed it in the wild for, I think, almost half a year, and then we're like, okay, and then what? And so that built confidence in us that this is the right storage format.

So the 1.0 release is actually under the hood is like, 0.94 or something, because we just changed the label, because we wanted the 1.0 something to be something that's long term, that's stable, that people can use, and they can use it five years from now, without any sort of major issues. 

Of course, newer versions are going to be better. Yes, more features, faster, better optimizer. You know, you have it, but if your use case is working with 1.0 we want your use case to still be working with 1.0 in five years. 

And I think we manage that because I've not seen any sort of tragic issues, let's say with 1.0. 

But I think it's also that databases are special in terms of numbering of versions, because people, it's such fundamental technology that people have rightfully I think, some level of expectation on what stability means, what long term means. 

And these data files, something that became clear to me, because I wrote the DuckDB’s parquet reader, something that became clear to me, when writing the parquet reader, is that we were dealing with workarounds for bugs in Hadoop from like 10 years ago. But these files, of course, still existed, right? So that meant that, yeah whenever we commit to this, we will have to basically make workarounds to the issues from [the past], anyways.

Michael (50:47)

Sins of the past, right, are visited on the future.

Hannes (50:55)

I think we have made changes so far, but they're backwards compatible. And if there's, like, new things, like new compression algorithms, then we put them behind a feature flag. And that's another thing that we've changed, after we did the 1.0 and, it all went well, and we were very happy, and we had a party, and everything. Now we are much more careful about breaking changes to things like SQL, right? Like if we change behavior of SQL, like a function changes its parameters, or whatever, we will probably now put that behind the feature flag, just because we need to, we just don't want to break anyone's use case. You should be able to upgrade, and people still do, although it's slowing down. We can still see a bunch as we see the extension downloads for DuckDB versions, and we still see a bunch of 0.9 things out there, and that's great. 

I think at some point my vision on this is actually, you should not have to care about the DuckDB version. Like you're not caring about which SQLite version you have. They're all fine, right? So you should not have to care. Is this 1.0, 1.15? Who cares, right? It should still work. It should still be reasonable. And I think we're going to get there, that in five years or so, there's going to be a sort of myriad of DuckDB versions running in production out there. And, yeah, and that's okay.

Michael (52:27)

It's the nature, probably, of, I know you used the word in-process, but because of this portability, because of the fact that at Rill we embed, we bundle DuckDB in with our distribution, and lots of other folks do as well. But you can imagine that when that happens, a lot of folks may not upgrade. You know, they have things that are working. They don't want upgrades. Any of us who have gone through upgrading, back in the day, running my Linux laptop, it was like, put aside a day and a half to do an upgrade to your Linux laptop, because things would always break with very little benefit. So I think you'll probably see a lot, I imagine that the distribution of versions will start to shift towards the current stable versions, and people won't be [upgrading].

One thing you talked a little bit about is making the SQL dialect stable. One thing we haven't talked about, but I'd love to hear, is kind of the why behind DuckDB’s choices around its own SQL dialect. I'll speak as a user: one example that I think those of us in the world of analytics just fell in love with at first sight was the GROUP BY ALL clause, and there's a number of other ergonomic improvements that DuckDB has made to the dialect. The ability to exclude and include columns. I assume that not all these ideas are de novo. DuckDB may be inspired by other tools out there, but that taste, we talk about developer experience, those of us who spend so much time in SQL, having an ergonomic SQL dialect where you can have a trailing comma right at the end, all these little things, these little polishes, kind of add up to something that is just what makes people happy to use DuckDB.

What inspired those changes? And I think one of the reasons people don't change the SQL dialect very often is they think, “Well, I don't want to sort of veer from the ANSI standard and create something over here.” I think there was a quote I read on Twitter recently that said, “No one wants to learn your garbage query language.” There's always the risk of [going] too far astray, and now you're not really in the standard SQL canon, to the extent that there is even ANSI, it's not standard, let's face it. So, yeah, I'd love to just hear your thoughts on what inspired this uniquely ergonomic SQL dialect of DuckDB?

Hannes (55:25)

Yeah, I think, I think that's really interesting. I think that came from this unique combination of the people using the database also using it for stuff. And I think whenever we run into something that's annoying, we kind of make a mental note of it. Of course, we also listen to people's complaints, and we do get a lot of issues on our issue tracker every day. So when the same sort of footgun has been fired 20 times, we think, hey, maybe something can be done.

For example, you mentioned GROUP BY ALL. It is absolutely clear to a SQL engine, at the point of parsing and binding the query to the schema, which of the columns you are aggregating and which of the columns you're grouping on. And so having the user type this out again is just a source of errors. And then we get to deal with those errors in our issue reports. Again, the same is true for ordering, right? You can say ORDER BY ALL, and it just means, like, order by the first column all the way back to the last column. And hey, it's totally obvious what to do in that case.
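To make the expansion concrete: in DuckDB a query such as `SELECT city, product, SUM(amount) FROM sales GROUP BY ALL ORDER BY ALL` needs no explicit column lists. The Python sketch below shows only the rewrite that this implies, under stated assumptions. The helper names are invented, the aggregate flags are written by hand (a real binder derives them), and SQLite is used purely as a stand-in engine to run the expanded query, because it ships with Python.

```python
import sqlite3

# (expression, is_aggregate) pairs for: SELECT city, product, SUM(amount) FROM sales
# A real binder works out is_aggregate itself; it is written by hand here.
SELECT_ITEMS = [("city", False), ("product", False), ("SUM(amount)", True)]


def expand_group_by_all(items):
    """GROUP BY ALL: group on every select expression that is not an aggregate."""
    return [expr for expr, is_aggregate in items if not is_aggregate]


def expand_order_by_all(items):
    """ORDER BY ALL: order by every output column, first to last, by position."""
    return [str(position) for position in range(1, len(items) + 1)]


def build_explicit_query(items, table):
    """Spell out the column lists that GROUP BY ALL / ORDER BY ALL leave implicit."""
    select_list = ", ".join(expr for expr, _ in items)
    group_by = ", ".join(expand_group_by_all(items))
    order_by = ", ".join(expand_order_by_all(items))
    return f"SELECT {select_list} FROM {table} GROUP BY {group_by} ORDER BY {order_by}"


def run_demo():
    """Run the expanded query on a tiny in-memory table."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE sales (city TEXT, product TEXT, amount INTEGER)")
    con.executemany(
        "INSERT INTO sales VALUES (?, ?, ?)",
        [("Berlin", "duck", 3), ("Amsterdam", "duck", 10),
         ("Amsterdam", "goose", 7), ("Amsterdam", "duck", 5)],
    )
    rows = con.execute(build_explicit_query(SELECT_ITEMS, "sales")).fetchall()
    con.close()
    return rows


print(build_explicit_query(SELECT_ITEMS, "sales"))
print(run_demo())
```

Everything the helper functions compute is information the engine already has once the select list is bound, which is why asking the user to repeat it mostly creates room for mistakes.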

And GROUP BY ALL, it’s funny, because I think everyone out there from the big players has copied it at this point.

Michael (56:44)

Absolutely right. Snowflake has it. ClickHouse has it.

Hannes (56:48)

Databricks has it. Everyone has it. And it's hilarious, because we should have patented it. But, no, we absolutely love this, that we make something and Snowflake goes and copies us, like, two weeks later. I think they're watching our, I mean, there's somebody who's watching our blog posts on that, and I think that's just great.

And we really want, I think people have a hatred of SQL and I can see where it comes from. It comes from, like, the trailing comma kind of nonsense. And like, over rigidity, overarching rigidity.

And people don't realize that in analytics, SQL is actually often the first language that you use to interact with data. Yes, there's tons of generated SQL, but if you think about, an analyst or a data engineer sitting there, they're not going to start building a dashboard from this data set they just found, they're going to write SQL queries and be like, “What on earth is this?” 

So that's why I think we have sort of realized that we are trying to make a language for humans, and not humans from the 1970s, either, but humans from 2024. So we also have things like if the listeners are interested, we have a friendlier SQL with DuckDB blog post series. I think we're now at number three. They list all these things that we do. And there's a ton of this. Just today, I actually was working on another one. The pull request is open, so if people want to look at it, it's there.

So in SQL, your aliases for things are behind the thing, right? And it can be very annoying, because the things can be different lengths, and then you can never really see what the aliases are, and then you have to sort of mentally parse these strings to find what the alias is. So now we're putting the alias, you can put the alias in front with a colon, so you can say “select a: 42”, and that means that you select 42 with the alias a, and that actually cleans up your SQL very nicely, because you can now put all your aliases on the left. You control the length of those. You can even make them align or whatever. And that also works in FROM clauses for, like, table aliases, function call aliases, blah, blah, blah. And that's a change that we can make in the parser. Maybe something on the parser. So we started with the Postgres parser, which is this ungodly sort of mess of yuck.

Michael (59:24)

I think we've looked at it internally, but I think it seems like this is a place where everyone starts, right? They start with the Postgres parser, because it’s the natural place to start.

Hannes (59:35)

Yeah, but they usually don't touch it. So we have taken this thing apart and put it back together like 15 times. And it is a Bison parser, which is Yacc, which is a technology from the 60s. And that's still intact, like we haven't broken anything. And I think that's what I tell to people that say, “Oh, but you're changing the world.” No, we're not. Like, you can still do the ANSI star stuff if you insist, right? But you can also do this other thing, and the other things are kind of meant for people to enjoy interactively. They're not meant for, like, tools to necessarily generate, right? If you have a tool that generates aggregations, it can write out the aggregation columns, fine.

Michael (1:00:22)

Machines don’t make mistakes, it’s humans that make mistakes. It's humans who care about indentation, formatting, readability.

Hannes (1:00:33)

But that's kind of funny. Also on the parsers. We've actually written a research paper on a new parser for DuckDB that's going to be a blog post soon again as well. And it's accepted for publication at the CIDR Conference next January. So this is the thing, right? Sometimes we can even push the envelope, and we managed to push the envelope on the parser, so we're going to actually replace that parser in the near future to be even more flexible with what we can do. 

Michael (1:01:03)

So just to play that back, the idea is, well, you started with the Postgres parser and taken that apart and put it back together several times, [and] extended it. The goal is that soon, that DuckDB will have its own from the ground up parser written. It will be compatible, it'll be able to parse every line, every SQL line of DuckDB that's ever been written from 1.0 on.

Hannes (1:01:40)

Yeah. The reason we are doing it is because there were things we cannot do, and Yacc, the language that Bison uses to describe parsers, or the abstraction, is just limited.

For the parser connoisseurs out there, there's a one-token lookahead. So you have one token of lookahead, which means that there's just things that you cannot do, and there's already constructs in SQL, like ORDER BY, which is, as you can probably see, two tokens. And bonus points, people can put comments in between. Those are kind of challenging for these very limited abstractions.

And so I think it's time for an overhaul. People have been using Yacc parsers for the last 50 years, and there's no reason. Computers have changed, so that's why we are revisiting it.
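One general workaround for a one-token-lookahead grammar is to put a small filter between the lexer and the parser that drops comments and glues multi-word constructs into a single token, so the grammar only ever has to look at one token. The Python sketch below illustrates that idea only. It is not DuckDB's or Postgres's parser code, and the token names and function names are made up for the example.

```python
import re

# Verbose regex: one named group per token class.
TOKEN_RE = re.compile(r"""
    (?P<comment>/\*.*?\*/|--[^\n]*)     # block comment or line comment
  | (?P<word>[A-Za-z_][A-Za-z_0-9]*)    # keyword or identifier
  | (?P<number>\d+)                     # integer literal
  | (?P<punct>[,*()=<>;])               # single-character punctuation
  | (?P<space>\s+)                      # whitespace
""", re.VERBOSE | re.DOTALL)


def tokenize(sql):
    """Split SQL text into tokens, dropping whitespace and comments."""
    tokens = []
    pos = 0
    while pos < len(sql):
        match = TOKEN_RE.match(sql, pos)
        if match is None:
            raise ValueError(f"unexpected character at {pos}: {sql[pos]!r}")
        pos = match.end()
        if match.lastgroup in ("comment", "space"):
            continue                    # comments may sit between ORDER and BY
        text = match.group()
        # Words are upper-cased so keywords compare easily (a simplification).
        tokens.append(text.upper() if match.lastgroup == "word" else text)
    return tokens


def merge_two_word_keywords(tokens):
    """Glue ORDER BY / GROUP BY into one token for a one-token-lookahead grammar."""
    merged = []
    i = 0
    while i < len(tokens):
        is_pair = (
            tokens[i] in ("ORDER", "GROUP")
            and i + 1 < len(tokens)
            and tokens[i + 1] == "BY"
        )
        if is_pair:
            merged.append(tokens[i] + "_BY")
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


print(merge_two_word_keywords(tokenize("SELECT a FROM t ORDER /* note */ BY a")))
```

The cost of this approach is that every new multi-word construct needs its own special case in the filter, which is one way to read the remark that the abstraction is limited.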

Michael (1:02:39)

I know we're definitely coming up on time. We could probably go for many hours. I'm going to sort of take us to maybe the last question or area to explore here. You talked about writing the parquet reader for DuckDB. And speaking of ergonomics, I think one of the hallmark use cases that I've seen for DuckDB, I've experienced it personally, I've seen others talk about it on social media, Reddit, X, Twitter, etc., is the ability to just select from a parquet file with just a simple ‘select all from s3://<parquet.file>’.

That's been an incredible use case, and one that people love using DuckDB for. There's some, there's a lot to unpack in that simple statement of writing a select, writing SQL against parquet files. I think what it opens up is this bigger trend or what it highlights is this bigger trend that we're seeing of the convergence of data lake architectures and conventional database architectures. We see with Databricks and Snowflake, the two big gorillas on the commercial stage right now of analytics infrastructure. And recently, Databricks paid $2 billion, it's reported, to acquire Tabular, right, and the guys behind Iceberg. Clearly they felt, and of course, Snowflake has continued to invest in their support for external tables backed, essentially by collections of parquet files with some metadata on top, aka Iceberg. 

What does this rise of data lake architectures mean for DuckDB? How is it influencing your roadmap? I know that DuckDB is playing a role here. But how do you see DuckDB in this sort of emerging data lake architecture of the future?

Hannes (1:05:11)

Yeah, I think it's super interesting. As you said, we have had this very convenient user experience to deal with parquet files on S3. And we really benefited from the data lake sort of movement there, because stuff that was formerly locked up now just sat on some storage folder. The lake house formats, as they call them, they are interesting in a sense that previously walled gardens, at least on paper, are opening up. Although I have my doubts, actually, on to what degree that will happen. I think actually, I see a sort of resurgence of control with the schemas, because the schemas are now actually controlling things like access credentials and handing them out. And so you have to actually talk to, you know, Snowflake again, to get the coordinates for the thing or whatever. I think it's super interesting.

We are working together with Databricks on support for Delta Lake, so they are working with us to make that better in DuckDB. And I think DuckDB has a big role to play here, simply, even if you only consider the read path as a sort of unobtrusive way of interacting with these files. 

But that is regrettably kind of predicated on how the access to these schemas is going to be handled by these big players. Because what I can see happening soon, is that while, in theory, all these things are going to be sitting on S3 to actually meaningfully interact with them, you need to talk to some API somewhere. There's going to be a gatekeeper in front of that API, and that's going to be, I don't know what you used to call a DBA in your company. 

So I have a bit of mixed feelings. On the one hand, I think, yeah, we're going to support, I mean, we do support reading Iceberg and I think at this point to a bigger degree Delta Lake files from DuckDB. And we will continue to work on this for sure. And I think it's a chance for us to access things that were previously locked up in Snowflake, for example.

But at the same time, the paranoid aspect of me also sees, I don't know, like some doors closing again, simply because we are possibly not gonna see the wild west of an S3 bucket with just random files on it that we got a lot of sort of happy users from, as you said. That people could just point DuckDB at a parquet file, and that would be fine. If that in the future is hidden behind an API, that's something that worries me a little bit. I mean, we are working on integrating with those APIs as well, but again, there is going to be a gatekeeper. And whether every data scientist in an organization is going to have just blanket credentials for everything, I have reasons to doubt that.

So we'll see. I think it's interesting to see the big ticket sort of sale that happened. It has certainly gotten a bunch of people excited. I think we had this kind of thing before, we had, like, ORC and parquet, and for some time there was controversy, but eventually it went away. I think with Databricks having bought Tabular, there's a chance of these things [Delta Lake and Iceberg] just eventually merging together. I think that's the most logical outcome.

Michael (1:08:53)

When you say merging together, what two things do you see merging together?

Hannes (1:09:00)

Iceberg and Delta Lake, these formats just basically merging together. They're also the catalogs. That's just the most logical outcome. Because I think Ali [Ghodsi] has said this very well, “If you have choice, what you will end up with is people not choosing any of them.” And I think that's not helping anyone in that particular space. For us, we are in a bit of an outside perspective, we just want to read these things.

Michael (1:09:29)

It does seem like, of course, this is the pendulum swing that always happens. Platforms start out as open, right, as open as they can be, and they attract a vast number of folks. They tout their openness. They say, “Don't get locked in on Snowflake, use our open formats.” And then, of course, over time, that openness starts to ossify, or there are some barriers that are put around it. And, to your observation, the data itself is not useful if you don't have access to metadata that describes what that [data] is.

One area where I still feel there's a lot of opportunity, in fact, I think untapped opportunity, is in data that's living on the edge. And that's an area where DuckDB could obviously be valuable. If you think about it, to your point around monolithic systems scaling up versus scaling out, we're not just putting chips in everything. We're increasingly putting solid state drives everywhere, whether it's a Tesla vehicle driving or a point of sale system. SQLite has been embedded in a lot of these edge systems for a long time doing OLTP. But it seems natural that rather than shipping data from every edge device to the cloud into an S3 bucket, there's more and more data that's going to be living resident on these edge devices. And there's MinIO, which I think is doing pretty well. So anyway, hopefully that will help swing that pendulum a little bit away from these closed formats that require APIs.

Hannes (1:11:28)

That's a good way of thinking about it, I think. But one of the things that I think is interesting about DuckDB is that it completely breaks up the traditional place where the database lives. In the past, it was clear: here's your database server, the OLAP cube warehouse lives on the same [server]. And what we're seeing with DuckDB is this crazy explosion of creativity where you can put a data engine in a lambda, you can put it in your device out in the field, and we actually have customers that are doing that. There's a big cloud provider that's basically making an S3 Select alternative with DuckDB, because they're like, “Yeah, we can put this on our storage, it's fine. It doesn't cost us anything.”

So I think that's what people haven't fully grokked yet, perhaps, and I'm really excited about seeing steps towards it every time it happens. There's this fundamental architectural difference between the database living in one place and the database being anywhere, that is, anywhere they have compute. It can be on your phone. And, I mean, we are in Europe; people get very excited about autonomy over their data and keeping everything on the device. Yeah, no, why not? I think that's an excellent approach, and I would love to see that more.

Michael (1:13:07)

Well, maybe it brings us full circle to the observation, the insight that you started with in that paper you wrote years ago, which is that you were, in fact, trying to bring a database, through these UDFs, into other environments, in that case data science tools like R and Python and maybe even MATLAB eventually. The idea was, instead of moving the data to those databases, actually bringing some of the database capabilities into a UDF. But maybe UDFs were not enough. And so you said, okay, maybe we'll embed the entire database into Python, not just this user-defined function.

Hannes (1:13:57)

Yeah, and we see some funny people running DuckDB in, like, a Spark UDF on Databricks. And I'm like [laugh].

Michael (1:14:05)

There's a name for that in the United States. I'm sure you've heard some version, it's called a turducken. When you cook a turkey and then you put a duck inside the turkey, that's called a turducken. So there's certainly many versions of turduckens out there. Spark-duckin perhaps would be the term there.

Now for my last question and I'm going to let you get to your busy Friday evening ahead.

You remain a prolific contributor to the DuckDB code base, Mark, obviously, as well. And it's been many years. And this is a question for those devs out there who are thinking about, or are somewhere on their journey of, building an open source tool with a growing group of users.

You talked about how the first couple of years really were a grind to get the first versions of DuckDB built.

How do you stay inspired and creative when working on a long term, complex project like DuckDB after this many years? What are the sources of inspiration for you when you open your laptop and dive into working on this new parser, for instance, that you're going to release in a couple of months?

Hannes (1:15:30)

Yeah, it's an interesting question. Honestly, it's hard for me to say because I generally like databases. I know it's weird. It's something I enjoy. And it's not as crazy anymore as in, let's say, the first couple of days [of building DuckDB]. I actually don't have a computer at home anymore, which I can recommend to all the computer people out there. And I come to the office. We have an in-person office here in Amsterdam where most of the development team is. We come to the office, we talk to each other, we look at the night's load of issues that have popped up, and we do get inspired, actually, by what people are doing and what people are reporting, and we just basically look at that.

And we have, of course, our partners, like Rill, who we interact with more closely. And then, of course, there's a ton of requests coming out of there. And then we think about those, and we think about the best ways of doing that. But we ourselves, just from our own intrinsic motivation, also start a lot of projects that you wouldn't do if you were just trying to sell a database engine.

And I think the joy that we have building it is reflected in the user experience, I hope, at least. For example, with these friendly SQL features, we just want people to enjoy working with data. We want people to confidently interact with data and not be like [groans]. So I think really the motivation is that we do this because we think it's the best thing to do. We enjoy it and we enjoy making things work. We enjoy making new, crazy things work that nobody has thought about doing in a database engine, and then seeing how the world reacts. It is quite a strong motivator to see so much adoption of our ideas. Like you mentioned earlier, if all the whales adopt your stupid SQL feature in, like, two months, that's pretty motivating, I would say, right?

Michael (1:17:50)

Well, clearly, the joy is felt by your users. It's great to know that the team is also enjoying the act of building this incredible technology. Thank you for creating DuckDB and bringing it to all of us who have been joyfully adopting it over the years. The millions of folks who are reading your website and docs every month are obviously taking some joy in that process. And thanks for taking an hour on a Friday evening, more than an hour, to speak. We look forward to continuing the collaboration in the open with DuckDB, DuckDB Labs and the great community that you and Mark and the team have built. So Hannes, thanks for your time. Thanks for your creation. And we look forward to seeing more great things in the coming months and years.

Hannes (1:18:41)

Thank you so much for having me, Mike.
