Beyond Snowflake: The Rise of Real-Time Databases

If you have worked in the field long enough, you have used a data warehouse, and you have probably heard of OLAP (Online Analytical Processing) systems. Lately, both capabilities have merged into cloud data warehouses such as Redshift, BigQuery, or Snowflake, and the lines between OLAP, cubes, and real-time streaming have blurred.
The question is: why would you need an OLAP system, when do you need a cloud data warehouse, and is there perhaps a better solution for low-latency dashboards or customer-facing apps? Can't we just use a fast database?
This article will explore how real-time databases can serve us with critical analytical requirements. We will explore the differences between a cloud DWH, an OLAP cube, and a real-time database. We will also examine the categories of today's analytics solutions and how to choose the right one. Finally, we will discuss the history of real-time analytics and how one would migrate to such a database.
Beyond Cloud Data Warehouses
Usually, people start with a cloud data warehouse or a SaaS solution that handles all the compute and storage. If you are more adventurous, you can build one yourself as a Lakehouse (an extension of a data lake). Or, if you have the team and the skills, you can assemble one from what the open-source data engineering landscape provides.
Either way, you have requirements such as business-facing BI dashboards, reports, or a data app that exposes the measures and insights of your data platform to business users. The next step is typically improving speed. Dashboards grow larger, and users query non-optimized data that is either too slow to calculate (sometimes requiring an overnight ETL load) or too laggy to filter and click through (above one second).
If you go with a cloud data warehouse vendor, the next concern is cost. Whether it's Google BigQuery, Snowflake, Azure Fabric, Oracle, or any other, the price will be high because every query, whether for production or development, runs on expensive cloud compute. Additionally, you get locked in with that vendor. In return, you get a lot of simplicity: a simpler architecture and peace of mind. Complexity such as running Kubernetes and orchestration is abstracted behind managed services within the vendor's platform. This abstraction is super convenient if that service is well built, well supported, and actually solves your requirements; quite the opposite if it isn't.
As always, this heavily depends on the scope and level of functionality you need, and the choice is not easy to make.
Cloud Data Warehouse Losing Its Ground
The cost-performance dilemma is an important point. With Snowflake, for example, you pay per query, which gets expensive quickly, especially once your analytics starts to pay off and spreads into more departments. Budgeting and predicting costs become more complex.
This rising cost is one reason cloud data warehouses are getting bad press and losing ground. But what else should we use? What are the alternatives?
What if we pre-aggregate data so that queries get cheaper? Then we don't pay for the compute of running each query, but "only" for the cache that keeps the aggregations in memory. What if we leveraged column-oriented designs better to get fast filters and drill-downs? Aren't we coming full circle back to OLAP systems? Let's see if we can answer these questions.
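To make the pre-aggregation idea concrete before moving on, here is a minimal sketch using DuckDB (one of the real-time engines discussed later). The table and column names are invented for illustration; the point is only that dashboards hitting a small, pre-aggregated table avoid paying for a full scan of the raw data on every click.

```python
import duckdb

con = duckdb.connect()  # in-memory database

# A raw fact table standing in for event-level data (hypothetical schema).
con.execute("""
    CREATE TABLE events AS
    SELECT
        (random() * 1000000)::BIGINT               AS user_id,
        DATE '2024-01-01' + (random() * 365)::INT  AS event_date,
        (random() * 100)::DECIMAL(10, 2)           AS revenue
    FROM range(5000000)
""")

# Pre-aggregate once...
con.execute("""
    CREATE TABLE daily_revenue AS
    SELECT event_date, count(*) AS events, sum(revenue) AS revenue
    FROM events
    GROUP BY event_date
""")

# ...so every dashboard query hits ~365 rows instead of 5 million raw ones.
print(con.execute(
    "SELECT * FROM daily_revenue ORDER BY revenue DESC LIMIT 5"
).fetchall())
```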
The Evolution of Analytical Systems
Let's look at where we come from and why history has chosen analytical systems such as OLAP cubes and cloud data warehouses.
The OLAP Pioneer
Early in my career, I worked with OLAP cubes in Oracle and SAP. But the one that stuck with me was SQL Server Analysis Services (SSAS) from Microsoft. During my consultancy years, every client used SSAS. It was the magic ingredient for extremely fast and responsive dashboards; we could hide complex calculations inside the cube and pre-calculate them.
Additionally, Microsoft built an Excel connection. That integration meant the world's biggest BI tool, Microsoft Excel, could query the same pre-aggregated, well-defined KPIs within seconds. Users no longer needed to recreate business calculations repeatedly in siloed Excel sheets; they were defined centrally once, and everyone could profit from them.
Obviously, there were downsides. SSAS cubes created a bottleneck around a central BI team. Changes were not made in minutes; in most cases it took a long time to change, test, and redeploy a cube. Still, the advantages largely outweighed the downsides.
To this day, SSAS, mostly in its tabular form on the Azure cloud, is used to power complex and heavy Power BI dashboards. But many modern OLAP systems started to grow too. They were open-source and code-first, in contrast to SSAS, which is visual-only, making it hard to collaborate, automate, or generate new cubes or metrics.
OLAP cubes were revolutionary due to their pre-calculated aggregations and slice-and-dice capabilities on the fly. However, the mentioned limitations, such as rigid schema and long processing times, eventually led to their decline.
A Brief History of OLAP and Related Analytics
A brief look at the evolution of OLAP systems shows that shortly after Edgar F. Codd coined the term OLAP, the first MOLAP systems (such as Essbase) were created. In 1998, with the first release of Microsoft OLAP Services alongside SQL Server 7.0, OLAP became more widely known, and by the early 2000s it was used by businesses around the world. SSAS revolutionized business intelligence by making multidimensional analysis accessible to a broader audience. Its integration with Excel, the world's most widely used analytical tool, was particularly transformative.
In 2005, the C-Store project introduced columnar rather than row-based storage. This innovation sparked a new wave of analytical database designs that would eventually transform the OLAP landscape.
Vectorized query execution, pioneered by systems like MonetDB and later adopted by ClickHouse (developed at Yandex in 2010), processes data in column "chunks" rather than individual values. This approach takes full advantage of CPU cache hierarchies, pipelining, and SIMD instructions, delivering order-of-magnitude performance improvements. In 2012, Apache Druid was created at Metamarkets, and in 2013, Apache Pinot was developed at LinkedIn. ClickHouse became open-source in 2016, and in 2017, Baidu open-sourced Apache Doris. Around 2018, StarRocks was forked from Doris.
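To get a feel for why chunked, vectorized execution matters, here is a small, self-contained Python comparison: summing one "column" value by value versus in a single vectorized pass with NumPy (a stand-in for the SIMD-friendly inner loops these engines implement in C++). The speedup below mixes interpreter overhead with cache and SIMD effects, but the shape of the result is the one columnar engines exploit.

```python
import time
import numpy as np

values = np.random.rand(10_000_000)  # one "column" of 10 million floats

# Row-at-a-time: one value per iteration, poor cache and SIMD utilization.
start = time.perf_counter()
total = 0.0
for v in values:
    total += v
row_time = time.perf_counter() - start

# Vectorized: the whole column is processed in tight, cache-friendly loops.
start = time.perf_counter()
total_vectorized = values.sum()
vec_time = time.perf_counter() - start

print(f"row-at-a-time: {row_time:.2f}s, vectorized: {vec_time:.4f}s")
```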
These major releases led to today's growth of cloud data warehouses and hybrid OLAP warehouse systems, especially as the decoupling of storage and compute became standard.

Looking at the key elements of such analytical systems, we see that storage innovation, query processing, scalability, and aggregation methods all play a big part.

Have We Come Full Circle?
While the technical evolution of analytical systems appears to be a linear progression, it is cyclical, addressing the same fundamental challenge: balancing performance with flexibility.
Early OLAP systems achieved incredible speed through pre-aggregation but sacrificed flexibility. Modern systems return to OLAP principles with dynamic rather than static aggregation, enabled by hardware advances that early OLAP architects could only dream of.
Categories of Analytical Databases
Let's now focus on today, see what different categories of analytical databases there are, and see if we can find differences between cloud data warehouses and OLAP cubes.
Mike Driscoll said in our discussion: "All of these databases belong in the category of analytical databases—they all leverage column-oriented designs and are fast for aggregations and filters, and are considered OLAP databases".
Mike further shared that the significant difference between these analytical databases is the trade-off between cost (per TB stored) and performance (per TB scanned). I agree, and this is what we've seen happening lately. On one side, cloud data warehouses like Snowflake use decoupled cloud storage + compute architectures, offering lower cost but also lower performance.
Cloud data warehouses introduce higher latencies per query since there's no guarantee that the data is in memory; it may need to be fetched or scanned from internal storage or, even slower, from a data lake on object storage. Notably, none of these offerings are open source; they are pure software as a service (SaaS).
Conversely, real-time databases like ClickHouse, Pinot, Druid, and DuckDB achieve higher performance by co-locating compute and storage. These systems keep frequently accessed data and aggregations in memory, enabling extremely fast query performance, similar to what early OLAP solutions such as SSAS did best.
Different Categories of Analytical Databases
These discoveries lead us to two main categories of analytical databases:
- Cloud data warehouses (Snowflake, Redshift, BigQuery, Azure Fabric, Firebolt)
- Real-time databases (ClickHouse, Pinot, Druid, DuckDB, StarRocks, Doris, Kylin)
If we categorize real-time databases further, the key trade-off among them is scale vs. complexity.
Pinot and Druid are highly scalable and designed from the ground up as distributed systems. ClickHouse was initially architected as a single-node system but has since added support for horizontal scaling in its OSS and commercial cloud offering. DuckDB is intentionally non-distributed and has a single-node architecture (though MotherDuck is working on a cloud version).
At the end of the day, we engineers need to choose an analytics database technology based on these trade-offs between cost, performance, scale, and complexity, which only our business use cases can ultimately determine. Next, let's explore related technologies and how to choose the right one for us.
Related Categories (Streaming, Vector, Federation)
Some related technologies are excluded from the comparison in this article but are still essential to mention.
There are many streaming technologies out there. In a sense, real-time databases combine streaming and cloud data warehousing, updating on and handling events. Estuary categorized them into buckets such as capture, transport, operation, and analytics on one axis, and open-source, hybrid, and managed (closed-source) on the other:
(Image: Estuary's categorization of data streaming technologies)
This overview is interesting because it mixes technologies: some handle only compute, as discussed, others do compute + storage, and others are hybrids. Let's see further how we can choose the right path for us.
However, two more related categories do similar things to what is discussed here. The first is vector databases (Pinecone, Qdrant, Weaviate), which are powerful backend databases for Large Language Models (LLMs). These are mostly a mix of compute and storage but are optimized for AI, offering a simple and straightforward way for AI engineers to store and query their embedded data.
The second is data virtualization or federation (Dremio, Trino, Apache Arrow). It's a middle ground between cloud DWHs and real-time databases, where we virtually join different data sources into a logical entity without moving or copying data around or pre-aggregating, though caches exist here too.
Combining Streaming and Batch is Not New
This reminds me of Delta Lake tables, or open table formats in general, when we tried to use them as a sink for both stream and batch jobs, unifying batch and streaming sources and eliminating the need for a Kappa architecture. They are worth remembering for additional features such as built-in CDC, streaming capabilities, and data sharing. This is well explained in Beyond Lambda: Introducing Delta Architecture or through code examples.
Single-Node Processors
Also related, check out single-node processors such as Polars, Pandas, Apache Arrow, DataFusion and DuckDB.
The Real-Time Databases
So, what are real-time databases?
Real-time databases give us extremely fast response times, which time-critical analytics needs. They enable real-time updates through direct streaming ingestion, similar to streaming solutions, and they offer a more cost-friendly approach for analytical, query-heavy workloads.
Real-time databases combine the best of traditional OLAP systems with modern analytical capabilities. Some of the benefits of real-time databases:
- Sub-second query response times enable interactive dashboards and analytics experiences.
- Columnar storage optimization dramatically speeds up aggregations and filtering by reading only the needed columns, reducing I/O bottlenecks (see the sketch after this list).
- Cost efficiency is achieved through co-located compute and storage architecture that eliminates expensive data movement between layers. Co-located compute also allows direct access to data without network transfer delays, reducing query latency.
- Real-time data ingestion supports streaming and batch processing, enabling fast insights from fresh data without separate pipelines.
- Vectorized processing leverages modern CPU capabilities to process data in chunks rather than row-by-row, delivering performance gains.
- Lower operational costs for query-heavy analytical workloads, since results are pre-calculated and you are not charged for each query.
- More flexibility through eliminating the rigidity of traditional cubes while maintaining their performance benefits for analytical queries.
- Open-source foundations provide vendor independence and community-driven innovation, reducing lock-in risks compared to proprietary solutions.
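As a minimal sketch of the columnar-storage point above: with a columnar file format, an aggregation over two columns only reads those columns from disk. DuckDB is used again here, and the file name and schema are made up for illustration.

```python
import duckdb

con = duckdb.connect()

# Write a wide table to Parquet, a columnar format (hypothetical schema).
con.execute("""
    COPY (
        SELECT
            i                                   AS order_id,
            'customer_' || (i % 1000)::VARCHAR  AS customer,
            (i % 50)                            AS country_id,
            (random() * 500)::DOUBLE            AS amount
        FROM range(2000000) t(i)
    ) TO 'orders.parquet' (FORMAT parquet)
""")

# Only country_id and amount are read from the file; order_id and customer
# are never touched, which is where the I/O savings come from.
print(con.execute("""
    SELECT country_id, sum(amount) AS revenue
    FROM 'orders.parquet'
    GROUP BY country_id
    ORDER BY revenue DESC
    LIMIT 3
""").fetchall())
```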
More details, such as the differences between the system architectures of ClickHouse, Druid, Pinot, Doris, and StarRocks, are shown in the illustration by Alireza Sadeghi:

Real-time databases serve analytical data in sub-second time but come with additional overhead and some disadvantages; let's look at how to choose the right tool for the job.
Choosing Your Path
So, how do you ensure the best choice for you?
Choosing the approach and solutions to replace a cloud data warehouse for low-latency analytics solutions requires careful consideration of specific needs, constraints, and objectives. Let's explore how to make this decision strategically.
For example, if you need a lot of joins or data cleansing (e.g., complex data pipelines), note that an OLAP cube isn't optimized for joins; pre-joining or cleansing the data beforehand is something cloud data warehouses are much better suited for.
Also, adding an additional layer to an existing cloud data warehouse might be more complex and cost more initially. Still, it is worth the investment if customer-facing analytics can then handle all required queries in sub-seconds instead of multiple seconds or minutes.
But is it more costly? Do we need an additional layer, or can we use a cheap data lake and serve our real-time database directly from it? What is the ROI of potentially migrating away from a cloud data warehouse?
Analytical Database Selection Guide
Below is a high-level overview of what we've discussed here and a guide to help you choose the right analytical solution.
Categories are not strictly defined
The lines between categories are blurred. OLAP systems now add warehouse features, while data warehouses add OLAP-like capabilities. This convergence reflects a constant underlying business need: delivering accurate insights quickly and cost-effectively.
Hybrid Approaches
Many modern architectures combine multiple approaches to leverage their respective strengths.
For example, Data Warehouse + OLAP pairs warehouse capabilities for complex transformations and joins with OLAP for serving analytical queries, which is ideal when you need both complex data manipulation and fast query responses. The Data Lake + OLAP pattern stores raw data in a lake while materializing aggregates in an OLAP engine, which is best suited when data volume is high but query patterns are predictable. Finally, Federation + Materialization allows querying data in place when possible while materializing frequently accessed paths, which is particularly effective when data is distributed across multiple systems.
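A rough sketch of the Data Lake + OLAP pattern, with DuckDB standing in as the serving engine: raw Parquet files stay in object storage, and only the small aggregate the dashboards need is materialized locally. The bucket path and schema are placeholders, and S3 credentials are assumed to be configured elsewhere.

```python
import duckdb

con = duckdb.connect("serving.duckdb")  # local OLAP-style serving layer
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")  # enables s3:// paths (credentials assumed configured)

# Raw event files live in the lake (placeholder bucket/path); only the
# aggregate needed by the dashboards is materialized in the serving layer.
con.execute("""
    CREATE OR REPLACE TABLE daily_kpis AS
    SELECT
        date_trunc('day', event_time) AS day,
        country,
        count(*)     AS events,
        sum(revenue) AS revenue
    FROM read_parquet('s3://my-data-lake/events/*.parquet')
    GROUP BY ALL
""")

# Dashboards now query the small, local aggregate instead of the lake.
print(con.sql("SELECT * FROM daily_kpis ORDER BY day DESC LIMIT 7"))
```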
Understanding When to Use Real-Time Databases
Three key reasons one would choose a real-time OLAP solution are stated below.
- You need sub-second queries - if your use case demands response times under 1-2 seconds:
  - IoT applications with millions of events
  - AdTech platforms requiring immediate campaign optimization
  - User-facing dashboards where performance impacts adoption
  - Financial systems needing real-time market insights
- You're running query-heavy workloads - when query volume is high:
  - Cloud data warehouses with per-query pricing become expensive
  - Pre-aggregating common patterns can dramatically reduce costs (a sketch follows this list)
  - Predictable pricing becomes easier with compute-based vs. query-based billing
- You want zero-disk architecture benefits - real-time database systems can:
  - Use S3/object storage for persistence (like ClickHouse's S3-backed MergeTree)
  - Keep hot data in memory for performance
  - Scale storage and compute independently
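To make the pre-aggregation point from the list concrete, here is a hedged sketch of how it often looks in ClickHouse, using the clickhouse-connect Python client. The host, table names, and columns are invented, and any S3-backed storage policy for zero-disk persistence is assumed to be configured server-side.

```python
import clickhouse_connect

# Assumes a reachable ClickHouse server; host and credentials are placeholders.
client = clickhouse_connect.get_client(host="localhost", username="default")

# Raw events land here via streaming ingestion (Kafka, HTTP inserts, ...).
client.command("""
    CREATE TABLE IF NOT EXISTS events (
        ts       DateTime,
        country  LowCardinality(String),
        revenue  Decimal(10, 2)
    )
    ENGINE = MergeTree
    ORDER BY (country, ts)
""")

# A materialized view keeps a rolled-up daily table up to date on every
# insert, so dashboards hit the small aggregate instead of raw events.
client.command("""
    CREATE MATERIALIZED VIEW IF NOT EXISTS events_daily
    ENGINE = SummingMergeTree
    ORDER BY (day, country)
    AS SELECT
        toDate(ts)   AS day,
        country,
        count()      AS events,
        sum(revenue) AS revenue
    FROM events
    GROUP BY day, country
""")

# Re-aggregate on read, since SummingMergeTree merges rows only eventually.
result = client.query(
    "SELECT day, country, sum(events), sum(revenue) "
    "FROM events_daily GROUP BY day, country ORDER BY day DESC LIMIT 5"
)
print(result.result_rows)
```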
Zero-Disk vs. Zero-ETL
If you are confused about zero-ETL or wonder whether the two are related: they address different challenges. "Zero-disk" uses cloud object storage instead of dedicated disks, while "zero-ETL" eliminates traditional data movement pipelines. Real-time databases often use zero-disk for cost optimization, while zero-ETL focuses on federation capabilities.
Additional Characteristics and Consideration
Furthermore, we can look at data characteristics, query patterns, technical constraints, or ergonomic factors we may have.
Data characteristics include volume and growth rate, update frequency (batch vs. streaming), and structure and complexity. Then analyze your query patterns: do you have mostly predefined or ad-hoc queries, do you need complex joins, and how many concurrent queries have to run? These determine the acceptable response time.
Do you have technical constraints? For example, does your team have the expertise? Do you need to run on-prem or in the cloud? What integrations do you need with the other tools in your data stack? Also consider security and compliance.
Developer-friendliness and other ergonomic factors play a role too. Can you quickly deploy and integrate with your DevOps tooling? How predictable are the costs? And what is the business impact if insights are delayed?
Migration Strategy
Let's say you have a cloud data warehouse and have discovered, with the above guide, that a real-time database would suit your needs better.
Migration Guide
There are many migration guides. For example, the ClickHouse guide covers migrating from another DBMS, from ETL/ELT tools, from object storage, or from the cloud data warehouse Redshift.
Migrating to real-time databases requires thoughtful schema optimization: denormalize where appropriate, choose sort keys based on query patterns, and use compact data types to improve compression and performance.
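As a rough illustration of those three points, here is what an optimized ClickHouse target table might look like compared to a naive lift-and-shift of a warehouse schema. The schema is hypothetical and the host is a placeholder; treat it as a sketch of the idea, not a recommended design.

```python
import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost")  # placeholder host

# Denormalized fact table: the dimension attributes the dashboards filter on
# (country, plan) are folded in, so serving queries avoid joins entirely.
client.command("""
    CREATE TABLE IF NOT EXISTS page_views (
        ts          DateTime CODEC(Delta, ZSTD),  -- compact and compresses well
        user_id     UInt64,                       -- instead of a wide String key
        country     LowCardinality(String),       -- few distinct values
        plan        LowCardinality(String),
        duration_ms UInt32                        -- instead of a generic Int64
    )
    ENGINE = MergeTree
    -- Sort key chosen from the dominant query pattern:
    -- "metrics per country over a time range".
    ORDER BY (country, toStartOfHour(ts), user_id)
""")
```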
Your migration approach depends mainly on your infrastructure constraints and data volume. PUSH methods using ETL tools like Airbyte and dlt offer transformation capabilities, PULL methods rely on direct database connections, and PIVOT approaches using cloud storage work best for large-scale migrations with parallel processing.
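Here is a hedged sketch of the PIVOT approach: the warehouse exports a table to Parquet files in object storage, and ClickHouse ingests them in parallel via its s3 table function. The bucket, credentials, and column list are placeholders, and the target table is assumed to exist already (for instance, the page_views table sketched above).

```python
import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost")  # placeholder host

# Step 1 (in the warehouse, not shown): UNLOAD/EXPORT the table to
# s3://migration-bucket/page_views/*.parquet.

# Step 2: ClickHouse pulls the exported files directly from object storage.
client.command("""
    INSERT INTO page_views
    SELECT ts, user_id, country, plan, duration_ms
    FROM s3(
        'https://migration-bucket.s3.amazonaws.com/page_views/*.parquet',
        'AWS_KEY_ID', 'AWS_SECRET',  -- placeholder credentials
        'Parquet'
    )
""")

# Step 3: spot-check the load before pointing any dashboards at it.
print(client.query("SELECT count() FROM page_views").result_rows)
```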
For a successful transition, implement a phased migration starting with a single data mart while running systems in parallel to validate query results and performance. Establish proper monitoring, backup procedures, and scaling strategies for your new environment. Consider operational factors like data freshness requirements and query patterns to ensure your real-time database delivers the expected improvements.
Common Pitfalls and How to Avoid Them
Underestimating schema optimization is perhaps the most common mistake. While most real-time databases can handle any schema, performance varies dramatically depending on the design. Start by profiling your existing queries to identify access patterns before migration.
Another frequent pitfall is attempting to migrate all workloads simultaneously; instead, prioritize use cases where real-time performance delivers immediate value. Be cautious about replicating your warehouse's data modeling patterns, as real-time databases often perform best with different join, aggregation, and partitioning approaches. Finally, don't neglect data validation: implement reconciliation processes to verify data consistency between systems during parallel operation phases.
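A minimal reconciliation sketch for the parallel-run phase: compare a row count and a revenue total for the same table and day on both systems. Two DuckDB connections stand in for the old warehouse and the new real-time database; in practice each side would use its own client, and the table and columns are invented.

```python
import duckdb

def checksum(con, table: str, day: str):
    """Row count and revenue total for one day -- cheap to compare across systems."""
    return con.execute(
        f"SELECT count(*), round(sum(revenue), 2) FROM {table} "
        f"WHERE event_date = DATE '{day}'"
    ).fetchone()

# Stand-ins for the old warehouse and the new real-time database.
warehouse = duckdb.connect()
realtime = duckdb.connect()
for con in (warehouse, realtime):
    con.execute("""
        CREATE TABLE sales AS
        SELECT DATE '2024-06-01' AS event_date, 10.0 + i AS revenue
        FROM range(100) t(i)
    """)

old = checksum(warehouse, "sales", "2024-06-01")
new = checksum(realtime, "sales", "2024-06-01")
assert old == new, f"mismatch: warehouse={old}, real-time={new}"
print("reconciliation ok:", old)
```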
ROI and Cost Considerations
Real-time databases deliver ROI through multiple avenues—direct cost savings on query-heavy workloads, substantial performance improvements with sub-second query times, and increased business agility by encouraging more data exploration without penalty.
Organizations often report significant reductions in their analytics spending after migration, particularly those with high query volumes or dashboards that previously required expensive compute resources. Track metrics like dashboard load times, user engagement with analytical tools, and the business impact of faster decision-making to quantify your migration's value.
Unlike cloud data warehouses that charge per query or compute time, real-time databases often use a resource-based pricing model with more predictable costs. This shift can dramatically reduce expenses for analytical workloads since you're not paying for each operation, though initial infrastructure setup may require investment.
The engineering trade-offs include potentially higher complexity in deployment and maintenance. Still, reduced query expenses and performance gains can typically offset these costs. Plus, real-time databases ship with increasingly good onboarding scripts and documentation, as seen above.
A good exercise is to conduct a three-month total cost comparison between your current warehouse and the proposed real-time database to establish a clear financial baseline for your migration decision, including the benefits you'll gain from either solution.
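As a back-of-the-envelope aid for that exercise, a tiny model like the one below makes the per-query vs. resource-based trade-off explicit. Every number is a placeholder to be replaced with your own bills and vendor quotes; nothing here reflects actual pricing.

```python
# Hypothetical inputs -- replace with your own usage data and vendor quotes.
queries_per_month = 500_000
warehouse_cost_per_query = 0.012       # observed average $/query on the warehouse
realtime_fixed_monthly = 3_000.0       # cluster or managed-service cost per month
realtime_migration_one_off = 15_000.0  # one-off engineering effort for the migration

months = 3
warehouse_total = queries_per_month * warehouse_cost_per_query * months
realtime_total = realtime_fixed_monthly * months + realtime_migration_one_off

print(f"3-month warehouse cost: ${warehouse_total:,.0f}")
print(f"3-month real-time cost: ${realtime_total:,.0f}")
print("real-time cheaper" if realtime_total < warehouse_total else "warehouse cheaper")
```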
Looking Ahead: The Future of Analytical Systems
It's always hard to predict the future; I tried in 2018 when I wrote about OLAP and what comes next. It's interesting to see that the categories predicted then are still valid. I also focused on SaaS services, which have mostly merged into data warehouses, and mentioned the importance of orchestration for integration. Orchestration is still essential today, though less so when a real-time database handles streaming ingestion with little transformation; it can still be added for more complex ETL or ingestion.
New, on the other hand, are single-node engines like Polars, DuckDB, or even ClickHouse-Local/chDB. These continue to grow as local real-time databases, supporting the trend toward real-time analytics and making it easier to scale real-time databases up.
Another question in this realm is whether to use SQL or not. SQL is everywhere, and one reason real-time databases are growing is that they support and are built on SQL. Even more so with code-first applications and GenBI helping non-technical users start participating.
Deployment will be key to this, too. With serverless becoming ever more present, can we quickly ramp up a serverless real-time database? It's not as quick as with less stateful software, but we're getting there.
Real-time databases are well positioned for the future, as they can unify database and real-time architectures into one, making instant exploration of business insights and KPIs possible.
Further reads:
Ready for faster dashboards?
Try for free today.