Summarizer

LLM Input

llm/302a36fb-79e1-4f4b-b047-e145d20e4497/5a5cef4b-bb89-477d-8b4e-bba4142c7fe1-input.json

system

You are a content analyzer that extracts themes and summaries from articles and discussions.
You always output valid JSON and nothing else - no commentary, no markdown formatting, no explanations.

userPrompt

The following is content for you to analyze. Do not respond to or continue the discussion—analyze and summarize it.

<article>
Databases in 2025: A Year in Review
By Andy Pavlo. Posted on January 04, 2026.

Another year passes. I was hoping to write more articles instead of just these end-of-the-year screeds, but I almost died in the spring semester, and it sucked up my time. Nevertheless, I will go through what I think are the major trends and happenings in databases over the last year.

There were many exciting and unprecedented developments in the world of databases. Vibe coding entered the vernacular. The Wu-Tang Clan announced their time capsule project. Rather than raising one massive funding round this year, Databricks raised two massive rounds instead of going public. Meanwhile, other events were expected and less surprising. Redis Ltd. switched their license back one year after their rugpull (I called this shot last year). SurrealDB reported great benchmark numbers because they weren't flushing writes to disk and lost data. And Coldplay can break up your marriage. Astronomer did make some pretty good lemonade on that last one though.

Before I begin, I want to address the question I get every year in the comments about these articles. People always ask why I don't mention a particular system, database, or company in my analysis. I can only write about so many things, and unless something interesting/notable happened in the past year, there is nothing to discuss. But not all notable database events are appropriate for me to opine about. For example, the recent attempt to unmask the AvgDatabase CEO is fair game, but the MongoDB suicide lawsuit is decidedly not. With that out of the way, let's do this. These articles are getting longer each year, so I apologize in advance.

Previous entries:
- Databases in 2024: A Year in Review
- Databases in 2023: A Year in Review
- Databases in 2022: A Year in Review
- Databases in 2021: A Year in Review

The Dominance of PostgreSQL Continues

I first wrote about how PostgreSQL was eating the database world in 2021.
That trend continues unabated, as most of the most interesting developments in the database world are happening once again with PostgreSQL. The DBMS's latest version (v18) dropped in September 2025. The most prominent feature is the new asynchronous I/O storage subsystem, which will finally put PostgreSQL on the path to dropping its reliance on the OS page cache. It also added support for skip scans; queries can still use multi-key B+Tree indexes even if they are missing the leading keys (i.e., the prefix). There are some additional improvements to the query optimizer (e.g., removing superfluous self-joins).

Savvy database connoisseurs will be quick to point out that these are not groundbreaking features and that other DBMSs have had them for years. PostgreSQL is the only major DBMS still relying on the OS page cache. And Oracle has supported skip scans since 2002 (v9i)! You may wonder, therefore, why I am claiming that the hottest action in databases for 2025 happened with PostgreSQL. The reason is that most of the database energy and activity is going into PostgreSQL companies, offerings, projects, and derivative systems.

Acquisitions + Releases: In the last year, the hottest data start-up (Databricks) paid $1b for a PostgreSQL DBaaS company (Neon). Next, one of the biggest database companies in the world (Snowflake) paid $250m for another PostgreSQL DBaaS company (CrunchyData). Then, one of the biggest tech companies on the planet (Microsoft) launched a new PostgreSQL DBaaS (HorizonDB). Neon and HorizonDB follow Amazon Aurora's original high-level architecture from the 2010s, with a single primary node separating compute and storage. For now, Snowflake's PostgreSQL DBaaS uses the same core architecture as standard PostgreSQL because they built on Crunchy Bridge.

Distributed PostgreSQL: All of the services I listed above are single-primary node architectures.
That is, applications send writes to a primary node, which then sends those changes to secondary replicas. But in 2025, there were two announcements of new projects to create scale-out (i.e., horizontal partitioning) services for PostgreSQL. In June 2025, Supabase announced that it had hired Sugu, the Vitess co-creator and former PlanetScale co-founder/CTO, to lead the Multigres project to create sharding middleware for PostgreSQL, similar to how Vitess shards MySQL. Sugu left PlanetScale in 2023 and had to lie back in the cut for two years. He is now likely clear of any legal issues and can make things happen at Supabase. You know it is a big deal when a database engineer joins a company and the announcement focuses more on the person than the system. The co-founder/CTO of SingleStore joined Microsoft in 2024 to lead HorizonDB, but Microsoft (incorrectly) did not make a big deal about it. Sugu joining Supabase is like Ol' Dirty Bastard (RIP) getting out on parole after two years and then announcing a new record deal on the first day of his release.

One month after the Multigres news dropped, PlanetScale announced its own Vitess-for-PostgreSQL project, Neki. PlanetScale launched its initial PostgreSQL DBaaS in March 2025, but the core architecture is single-node stock PostgreSQL with PgBouncer.

Update 2026-01-05: I was reminded via private email that PgDog is also another open-source middleware system that seeks to support horizontal sharding in PostgreSQL. I had mentally compartmentalized PgDog in the same category as a connection-pooling proxy (PgBouncer), but it is actually a competitor to Multigres and Neki.

Commercial Landscape: With Microsoft's introduction of HorizonDB in 2025, all major cloud vendors now have serious projects for their own PostgreSQL offerings. Amazon has offered Aurora PostgreSQL since 2017. Google put out AlloyDB in 2022. ServiceNow launched its RaptorDB service in 2024, based on its 2021 acquisition of Swarm64.
Even the old flip-phone IBM has had its cloud version of PostgreSQL since 2018. Oracle released its PostgreSQL service in 2023, though there is a rumor that its in-house PostgreSQL team was collateral damage in its MySQL OCI layoffs in September 2025.

There are still a few independent (ISV) PostgreSQL DBaaS companies. Supabase is likely the largest of these by the number of instances. Others include YugabyteDB, TigerData (née Timescale), PlanetScale, Xata, PgEdge, and Nile. Xata built its original architecture on Amazon Aurora, but this year, it announced it is switching to its own infrastructure. ParadeDB has yet to announce its hosted service. Tembo dropped its hosted PostgreSQL offering in 2025 to pivot to a coding agent that can do some database tuning. Hydra and PostgresML went bust in 2025 (see Deaths section), so they're out of the game. Other systems provide a Postgres-compatible front-end, but the back-end systems are not derived from PostgreSQL (e.g., CockroachDB, CedarDB, Google Spanner). There are also hosting companies that offer PostgreSQL DBaaS alongside other systems, such as Aiven and Tessell.

Andy's Take: It is not clear who the next major buyer will be after Databricks and Snowflake bought PostgreSQL companies. Again, every major tech company already has a Postgres offering. EnterpriseDB is the oldest PostgreSQL ISV, but missed out on the two most significant PostgreSQL acquisitions in the last five years. But they can ride Bain Capital's jock for a while, or hope that HPE buys them even though that partnership is from eight years ago. The PostgreSQL M&A playfield is reminiscent of OLAP acquisitions in the late 2000s, when Vertica was the last one waiting at the bus stop after AsterData, Greenplum, and DATAllegro were acquired.

The development of the three competing distributed PostgreSQL projects (Multigres, Neki, PgDog) is welcome news.
These projects are not the first time somebody has attempted this: Greenplum, ParAccel, and Citus have been around for two decades for OLAP workloads. Citus supports OLTP workloads, but they started in 2010 focused on analytics. For OLTP, 15 years ago, the NTT RiTaDB project joined forces with GridSQL to create Postgres-XC. Developers from Postgres-XC founded StormDB, which Translattice later acquired in 2013. Postgres-X2 was an attempt to modernize XC, but the developers abandoned that effort. Translattice open-sourced StormDB as Postgres-XL, but the project has been dormant since 2018. YugabyteDB came out in 2016 and is probably the most widely deployed sharded PostgreSQL system (and remains open-source!), but it is a hard fork, so it is only compatible with PostgreSQL v15. Amazon announced its own sharded PostgreSQL (Aurora Limitless) in 2024, but it is closed source.

I know Microsoft bought Citus in 2019, but it is hard to keep track of what they were doing before HorizonDB because of their confusing product names. Citus was rebranded as Azure Database for PostgreSQL Hyperscale in 2019 and was then renamed to Azure Cosmos DB for PostgreSQL in 2022. But then there is Azure Database for PostgreSQL with Elastic Clusters, which also uses Citus, but it is not the same as the Citus-powered Azure Cosmos DB for PostgreSQL. Microsoft discontinued Azure PostgreSQL Single Server in 2023, but kept Azure PostgreSQL Flexible Server. That is a lot of Azure this and Azure that. It is sort of like how Amazon could not resist adding "Aurora" to DSQL's name. Either way, at least Microsoft was smart enough to keep the name of their new system to just "Azure HorizonDB" (for now).

The PlanetScale squad has no love for the other side and is known to throw hands at Neon and Timescale. Database companies popping off at each other is nothing new (see Yugabyte vs. CockroachDB or Databricks vs. Snowflake).
I suspect we will see more of this in the future as the PostgreSQL wars heat up. I suggest that these smaller companies call out the big cloud vendors and keep each other's name out of their mouths.

MCP For Every Database!

If 2023 was the year every DBMS added a vector index, then 2025 was the year that every DBMS added support for Anthropic's Model Context Protocol (MCP). MCP is a standardized client-server JSON-RPC interface that lets LLMs interact with external tools and data sources without requiring custom glue code. An MCP server acts as middleware in front of a DBMS and exposes a listing of the tools, data, and actions it provides. An MCP client (e.g., an LLM host such as Claude or ChatGPT) discovers and uses these tools to extend its models' capabilities by sending requests to the server. In the case of databases, the MCP server converts these requests into the appropriate database query (e.g., SQL) or administrative command. In other words, MCP is the middleman who keeps the bricks counted and the cream straight, so the database and LLMs trust each other enough to do business.

Anthropic announced MCP in November 2024, but it really took off in March 2025 when OpenAI announced it would support MCP in its ecosystem. Over the next few months, every DBMS vendor released MCP servers for all system categories: OLAP (e.g., ClickHouse, Snowflake, Firebolt, Yellowbrick), SQL (e.g., YugabyteDB, Oracle, PlanetScale), and NoSQL (e.g., MongoDB, Neo4j, Redis). Since there is no official Postgres MCP server, every Postgres DBaaS has released its own (e.g., Timescale, Supabase, Xata). The cloud vendors released multi-database MCP servers that can talk to any of their managed database services (e.g., Amazon, Microsoft, Google). Allowing a single gateway to talk to heterogeneous databases is almost, but not quite, a holy-grail federated database.
As far as I know, each request in these MCP servers targets only a single database at a time, so the application is responsible for performing joins across sources. Beyond the official vendor MCP implementations, there are hundreds of rando MCP server implementations for nearly every DBMS. Some of them attempt to support multiple systems (e.g., DBHub, DB MCP Server). DBHub put out a good overview of PostgreSQL MCP servers.

An interesting feature that has proven helpful for agents is database branching. Although not specific to MCP servers, branching allows agents to test database changes quickly without affecting production applications. Neon reported in July 2025 that agents create 80% of their databases. Neon was designed from the beginning to support branching (Nikita showed me an early demo when the system was still called "Zenith"), whereas other systems have added branching support later. See Xata's recent comparison article on database branching.

Andy's Take: On one hand, I'm happy that there is now a standard for exposing databases to more applications. But nobody should trust an application with unfettered database access, whether it is via MCP or the system's regular API. And it remains good practice to grant only minimal privileges to accounts. Restricting accounts is especially important with unmonitored agents that may start going wild all up in your database. This means that lazy practices like giving admin privileges to every account or using the same account for every service are going to get wrecked when the LLM starts popping off. Of course, if your company leaves its database open to the world while you cause the stock price of one of the wealthiest companies to drop by $600b, then rogue MCP requests are not your top concern.

From my cursory examination of a few MCP server implementations, they are simple proxies that translate the MCP JSON requests into database queries.
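To make concrete what these translate-and-forward proxies do, here is a deliberately naive Python sketch: it takes an MCP-style `tools/call` JSON-RPC request, runs the SQL it carries against a database, and applies only the kind of syntactic guardrails some servers bolt on (a read-only prefix check and a row cap). The tool name, thresholds, and schema are all invented for illustration; real MCP servers implement the full protocol handshake and tool discovery.

```python
import json
import sqlite3

READ_ONLY_PREFIXES = ("select", "with")  # crude allowlist, in the spirit of a read-only mode
MAX_ROWS = 100                           # cap returned records per request

def handle_mcp_request(conn, request: dict) -> dict:
    """Translate an MCP-style 'tools/call' JSON-RPC request into a SQL query.

    This is a toy: it does no introspection of what the query *means*,
    only string-level checks on what it starts with.
    """
    params = request.get("params", {})
    if request.get("method") != "tools/call" or params.get("name") != "query":
        return {"jsonrpc": "2.0", "id": request.get("id"),
                "error": {"code": -32601, "message": "unknown method or tool"}}

    sql = params.get("arguments", {}).get("sql", "")
    if not sql.lstrip().lower().startswith(READ_ONLY_PREFIXES):
        return {"jsonrpc": "2.0", "id": request.get("id"),
                "error": {"code": -32000, "message": "only read-only queries allowed"}}

    rows = conn.execute(sql).fetchmany(MAX_ROWS)
    return {"jsonrpc": "2.0", "id": request.get("id"),
            "result": {"content": [{"type": "text", "text": json.dumps(rows)}]}}

# Example: an agent asking for data vs. an agent trying to modify it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'alice')")

ok = handle_mcp_request(conn, {"jsonrpc": "2.0", "id": 1, "method": "tools/call",
                               "params": {"name": "query",
                                          "arguments": {"sql": "SELECT * FROM users"}}})
blocked = handle_mcp_request(conn, {"jsonrpc": "2.0", "id": 2, "method": "tools/call",
                                    "params": {"name": "query",
                                               "arguments": {"sql": "DROP TABLE users"}}})
```

Note that both checks are string-level: the proxy has no idea whether a syntactically valid SELECT is appropriate or ruinously expensive.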
There is no deep introspection to understand what the request aims to do and whether it is appropriate. Somebody is going to try to order 18,000 water cups in your application, and you need to make sure it doesn't crash your database. Some MCP servers have basic protection mechanisms (e.g., ClickHouse only allows read-only queries). DBHub provides a few additional protections, such as capping the number of returned records per request and implementing query timeouts. Supabase's documentation offers best-practice guidelines for MCP agents, but they rely on humans to follow them. And of course, if you rely on humans to do the right thing, bad things will happen. Enterprise DBMSs already have automated guardrails and other safety mechanisms that open-source systems lack, and thus, they are better prepared for an agentic ecosystem. For example, IBM Guardium and Oracle Database Firewall identify and block anomalous queries. I am not trying to shill for these big tech companies. I know we will see more examples in the future of agents ruining lives, like accidentally dropping databases. Combining MCP servers with proxies (e.g., connection pooling) is an excellent opportunity to introduce automated protection mechanisms.

MongoDB, Inc. v. FerretDB Inc.

MongoDB has been the NoSQL stalwart for two decades now. FerretDB was launched in 2021 by Percona's top brass to provide a middleware proxy that converts MongoDB queries into SQL for a PostgreSQL backend. This proxy allows MongoDB applications to switch over to PostgreSQL without rewriting queries. They coexisted for a few years before MongoDB sent FerretDB a cease-and-desist letter in 2023, alleging that FerretDB infringes MongoDB's patents, copyrights, and trademarks, and that it violates MongoDB's license for its documentation and wire protocol specification. This letter became public in May 2025 when MongoDB went nuclear on FerretDB by filing a federal lawsuit over these issues.
Part of their beef is that FerretDB is out on the street, claiming they have a "drop-in replacement" for MongoDB without authorization. MongoDB's court filing has all the standard complaints about (1) misleading developers, (2) diluting trademarks, and (3) damaging their reputation. The story is further complicated by Microsoft's announcement that it donated its MongoDB-compatible DocumentDB to the Linux Foundation. The project website mentions that DocumentDB is compatible with the MongoDB drivers and that it aims to "build a MongoDB compatible open source document database". Other major database vendors, such as Amazon and Yugabyte, are also involved in the project. From a cursory glance, this language seems similar to what MongoDB is accusing FerretDB of doing.

Andy's Take: I could not find an example of a database company suing another one for replicating their API. The closest is Oracle suing Google for using a clean-room copy of the Java API in Android. The Supreme Court ultimately ruled in favor of Google on fair-use grounds, and the case affected how re-implementation is treated legally. I don't know how the lawsuit will play out if it ever goes to trial. A jury of random people off the street may not be able to comprehend the specifics of MongoDB's wire protocol, but they are definitely going to understand that the original name of FerretDB was MangoDB. It is going to be challenging to convince a jury that you were not trying to divert customers when you changed one letter in the other company's name. Never mind that it is not even an original name: there is already a parody DBMS called MangoDB that writes everything to /dev/null as a joke.

And while we are on the topic of database system naming, Microsoft's choice of "DocumentDB" is unfortunate. There are already Amazon DocumentDB (which, by the way, is also compatible with MongoDB, but Amazon probably pays for that), InterSystems DocDB, and Yugabyte DocDB.
Microsoft's original name for "Cosmos DB" was also DocumentDB back in 2016. Lastly, MongoDB's court filing claims they "pioneered the development of 'non-relational' databases". This statement is incorrect. The first general-purpose DBMSs were non-relational because the relational model had not yet been invented. General Electric's Integrated Data Store (1964) used a network data model, and IBM's Information Management System (1966) used a hierarchical data model. MongoDB is also not the first document DBMS. That title goes to the object-oriented DBMSs from the late 1980s (e.g., Versant) or the XML DBMSs from the 2000s (e.g., MarkLogic). MongoDB is the most successful of these approaches by a massive margin (except maybe IMS).

File Format Battleground

File formats are an area of data systems that have been mostly dormant for the last decade. In 2011, Meta released a column-oriented format for Hadoop called RCFile. Two years later, Meta refined RCFile and announced the PAX-based ORC (Optimized Record Columnar File) format. A month after ORC's release, Twitter and Cloudera released the first version of Parquet. Nearly 15 years later, Parquet is the dominant open-source file format. In 2025, there were five new open-source file formats released vying to dethrone Parquet:
- FastLanes (CWI)
- F3 (CMU + Tsinghua)
- Vortex (SpiralDB)
- AnyBlox (the Germans)
- Amudai (Microsoft)

These new formats joined the other formats released in 2024:
- Nimble (Meta)
- Lance (LanceDB)
- TsFile (IoTDB)

SpiralDB made the biggest splash this year with their announcement of donating Vortex to the Linux Foundation and the establishment of their multi-organization steering committee. Microsoft quietly killed off Amudai (or at least closed-sourced it) at some point at the end of 2025. The other projects (FastLanes, F3, AnyBlox) are academic prototypes. AnyBlox won the VLDB Best Paper award this year. This fresh competition has lit a fire in the Parquet developer community to modernize its features.
See this in-depth technical analysis of the columnar file format landscape by the Parquet PMC Chair (Julien Le Dem).

Andy's Take: The main problem with Parquet is not inherent in the format itself. The specification can and has evolved. Nobody expected organizations to rewrite petabytes of legacy files to update them to the latest Parquet version. The problem is that there are so many implementations of reader/writer libraries in different languages, each supporting a distinct subset of the specification. Our analysis of Parquet files in the wild found that 94% of them use only v1 features from 2013, even though their creation timestamps are after 2020. This lowest common denominator means that if someone creates a Parquet file using v2 features, it is unclear whether a system will have the correct version to read it.

I worked on the F3 file format with brilliant people at Tsinghua (Xinyu Zeng, Ruijun Meng, Huanchen Zhang), CMU (Martin Prammer, Jignesh Patel), and Wes McKinney. Our focus is on solving this interoperability problem by providing both native decoders as shared objects (Rust crates) and embedded WASM versions of those decoders in the file. If somebody creates a new encoding and the DBMS does not have a native implementation, it can still read data using the WASM version by passing Arrow buffers. Each decoder targets a single column, allowing a DBMS to use a mix of native and WASM decoders for a single file. AnyBlox takes a different approach, generating a single WASM program to decode the entire file.

I don't know who will win the file format war. The next battle is likely to be over GPU support. SpiralDB seems to be making the right moves, but Parquet's ubiquity will be challenging to overcome. I also didn't even discuss how DuckLake seeks to upend Iceberg... Of course, when this topic comes up, somebody always posts this xkcd comic on competing standards. I've seen it before. You don't need to email it to me again.
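As a toy illustration of the per-column dispatch idea described above (not F3's actual layout; the encodings, registry, and `PortableDecoder` stand-in are all invented), a reader prefers a native decoder when it recognizes a column's encoding and otherwise falls back to a decoder carried by the file itself, with a plain Python callable standing in for the embedded WASM module:

```python
from typing import Callable, Dict

# Native decoders this hypothetical engine ships with (think fast Rust crates).
NATIVE_DECODERS: Dict[str, Callable[[list], list]] = {
    "plain": lambda values: values,
    "rle":   lambda runs: [v for v, n in runs for _ in range(n)],
}

class PortableDecoder:
    """Stand-in for a decoder embedded in the file itself.

    In F3 this would be a WASM module exchanging Arrow buffers; here it is
    an ordinary Python callable so the sketch stays self-contained.
    """
    def __init__(self, fn: Callable[[list], list]):
        self.fn = fn
    def decode(self, data: list) -> list:
        return self.fn(data)

def read_column(encoding: str, data: list, embedded: Dict[str, PortableDecoder]) -> list:
    # Prefer the native implementation; fall back to the file's own decoder.
    if encoding in NATIVE_DECODERS:
        return NATIVE_DECODERS[encoding](data)
    return embedded[encoding].decode(data)

# A file with two columns: one in a well-known encoding, one in a novel
# "delta" encoding the engine has never heard of -- but the file carries
# its own decoder for it, so the data is still readable.
embedded = {"delta": PortableDecoder(
    lambda deltas: [sum(deltas[:i + 1]) for i in range(len(deltas))])}

col_a = read_column("rle", [(7, 3), (9, 2)], embedded)   # native path: [7, 7, 7, 9, 9]
col_b = read_column("delta", [10, 1, 1, 1], embedded)    # fallback path: [10, 11, 12, 13]
```

Because dispatch happens per column, a single file can mix fast native paths with slower portable fallbacks; AnyBlox's one-WASM-program-per-file design trades that granularity for simplicity.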
Random Happenings

Databases are big money. Let's go through them all!

Acquisitions: Lots of movement on the block. Pinecone replaced its CEO in September to prepare for an acquisition, but I have not heard anything else about it. Here are the ones that did happen:

- DataStax → IBM: The Cassandra stalwart got picked up by IBM at the beginning of the year for an estimated $3b.
- Quickwit → DataDog: The leading company behind the Lucene replacement, Tantivy, a full-text search engine, was acquired at the beginning of the year. The good news is that Tantivy development continues unabated.
- SDF → dbt: This acquisition was a solid pick-up for dbt as part of their Fusion announcement this year. It allows them to perform more rigorous SQL analysis in their DAGs.
- Voyage.ai → MongoDB: Mongo picked up an early-stage AI company to expand its RAG capabilities in its cloud offering. One of my best students joined Voyage one week before the announcement. He thought he was going against the "family" by not signing with a database company, only to end up at one.
- Neon → Databricks: Apparently, there was a bidding war for this PostgreSQL company, but Databricks paid a mouthwatering $1b for it. Neon still exists today as a standalone service, but Databricks quickly rebranded it in its ecosystem as Lakebase.
- CrunchyData → Snowflake: You know Snowflake could not let Databricks get all the excitement during the summer, so they paid $250m for the 13-year-old PostgreSQL company CrunchyData. Crunchy had picked up top ex-Citus talent in recent years and was expanding its DBaaS offering before Snowflake wrote them a check. Snowflake announced the public preview of its Postgres service in December 2025.
- Informatica → Salesforce: The 1990s old-school ETL company Informatica got picked up by Salesforce for $8b. This is after they went public in 1999, reverted to PE in 2015, and went public again in 2021.
- Couchbase → Private Equity: To be honest, I never understood how Couchbase went public in 2021. I guess they were riding on MongoDB's coattails? Couchbase did interesting work a few years ago by incorporating components from the AsterixDB project at UC Irvine.
- Tecton → Databricks: Tecton provides Databricks with additional tooling to build agents. Another one of my former students was at the company and is now at Databricks.
- Tobiko Data → Fivetran: This team is behind two useful tools: SQLMesh and SQLglot. The former is the only viable open-source contender to dbt (see below for their pending merger with Fivetran). SQLglot is a handy SQL parser/deparser that supports a heuristic-based query optimizer. The combination of this in Fivetran and SDF with dbt makes for an interesting technology play in this space in the coming years.
- SingleStore → Private Equity: The PE firm buying SingleStore (Vector Capital) has prior experience in managing a database company. They previously purchased the XML database company MarkLogic in 2020 and flipped it to Progress in 2023.
- Codership → MariaDB: After getting bought by PE in 2024, the MariaDB Corporation went on a buying spree this year. First up is the company behind the Galera Cluster scale-out middleware for MariaDB. See my 2023 overview of the MariaDB dumpster fire.
- SkySQL → MariaDB: And then we have the second MariaDB acquisition. Just so everyone is clear, the original commercial company backing MariaDB was called "SkySQL Corporation" in 2010, but it changed its name to "MariaDB Corporation" in 2014. Then in 2020, the MariaDB Corporation released a MariaDB DBaaS called SkySQL. But because they were hemorrhaging cash, the MariaDB Corporation spun SkySQL Inc. out as an independent company in 2023. And now, in 2025, MariaDB Corporation has come full circle by buying back SkySQL Inc. I did not have this move on my database bingo card this year.
- Crystal DBA → Temporal: The automated database optimization tool company heads off to Temporal to automatically optimize their databases! I'm happy to hear that Crystal's founder and Berkeley database group alumnus Johann Schleier-Smith is doing well there.
- HeavyDB → Nvidia: This system (formerly OmniSci, formerly MapD) was one of the first GPU-accelerated databases, launched in 2013. I couldn't find an official announcement of their closing, aside from an M&A firm listing the successful deal. And then we had a meeting with Nvidia to discuss potential database research collaborations, and some HeavyDB friends showed up.
- DGraph → Istari Digital: Dgraph was previously acquired by Hypermode in 2023. It looks like Istari just bought Dgraph and not the rest of Hypermode (or they ditched it). I still haven't met anybody who is actively using Dgraph.
- DataChat → Mews: This was one of the first "chat with your database" companies, out of the University of Wisconsin and founded by now-CMU-DB professor Jignesh Patel. But they were bought by a European hotel management SaaS. Take that to mean what you think it means.
- Datometry → Snowflake: Datometry has been working on the perilous problem of automatically converting legacy SQL dialects (e.g., Teradata) to newer OLAP systems for several years. Snowflake picked them up to expand their migration tooling. See Datometry's 2020 CMU-DB tech talk for more info.
- LibreChat → ClickHouse: Like Snowflake buying Datometry, ClickHouse's acquisition here is a good example of improving the developer experience for high-performance commodity OLAP engines.
- Mooncake → Databricks: After buying Neon, Databricks bought Mooncake to enable PostgreSQL to read/write Apache Iceberg data. See their November 2025 CMU-DB talk for more info.
- Confluent → IBM: This is the archetype of how to make a company out of a grassroots open-source project. Kafka was originally developed at LinkedIn in 2011. Confluent was then spun out as a separate startup in 2014. They went IPO seven years later in 2021. Then IBM wrote a big check to take it over. Like with DataStax, it remains to be seen whether IBM will do to Confluent what IBM normally does with acquired companies, or whether they will be able to remain autonomous like Red Hat.
- Gel → Vercel: Formerly EdgeDB, they provided a DSL on top of PostgreSQL that got picked up by Vercel at the end of the year.
- Kuzu → ???: The embedded graph DBMS out of the University of Waterloo was acquired by an unnamed company in 2025. The KuzuDB company then announced it was abandoning the open-source project. The LadybugDB project is an attempt at maintaining a fork of the Kuzu code.

Mergers: Unexpected news dropped in October 2025 when Fivetran and dbt Labs announced they were merging to form a single company. The last merger I can think of in the database space was the 2019 merger between Cloudera and Hortonworks. But that deal was just weak keys getting stepped on in a kitchen: two companies that were struggling to find market relevance with Hadoop merged into a single company to try to find it (spoiler: they did not). The MariaDB Corporation merger with Angel Pond Holdings Corporation in 2022 via a SPAC technically counts too, but that deal was so MariaDB could backdoor their way to an IPO. And it didn't end well for investors. The Fivetran + dbt merger is different (and better) than these two. They are two complementary technology companies combining to become an ETL juggernaut, preparing for a legit IPO in the near future.

Funding: Unless I missed them or they weren't announced, there were not as many early-stage funding rounds for database startups. The buzz around vector databases has muted, and VCs are only writing checks for LLM companies.
- Databricks - $4b Series L
- Databricks - $1b Series K
- ClickHouse - $350m Series C
- Supabase - $200m Series D
- Timescale - $110m Series C
- Supabase - $100m Series E
- Astronomer - $93m Series D
- Tessell - $60m Series B
- LanceDB - $30m Series A
- Convex - $24m Series B
- SpiralDB - $22m Series A
- ParadeDB - $12m Series A
- CedarDB - $5.9m Seed
- TopK - $5.5m Seed
- Columnar - $4m Seed
- SereneDB - $2.1m Pre-Seed
- Starburst - Undisclosed?
- TurboPuffer - Undisclosed?

Name Changes: A new category in my yearly write-up is database companies changing the name of their company or system.

- HarperDB → Harper: The JSON database company dropped the "DB" suffix from its name to emphasize its positioning as a platform for database-backed applications, similar to Convex and Heroku. I like the Harper people. Their 2021 CMU-DB tech talk presented the worst DBMS idea I have ever heard. Thankfully, they ditched that once they realized how bad it was and switched to LMDB.
- EdgeDB → Gel: This was a smart move because the name "Edge" conveys that it is a database for edge devices or services (e.g., Fly.io). But I'm not sure "Gel" conveys the project's higher-level goals. See their 2025 CMU-DB talk on Gel's query language (still called EdgeQL) from a CMU Ph.D. alum.
- Timescale → TigerData: This is a rare occurrence of a database company renaming itself to distinguish itself from its main database product. It is usually companies renaming themselves to be the name of the database (e.g., "Relational Software, Inc." to "Oracle Systems Corporation", "10gen, Inc." to "MongoDB, Inc."). But it makes sense for the company to try to shed the perception of being a specialized time-series DBMS instead of an improved version of PostgreSQL for general applications, since the former is a much smaller market segment than the latter.

Deaths: In full disclosure, I was a technical advisor for two of these failed startups. My success rate as an advisor is terrible at this point.
I was also an advisor for Splice Machine , but they closed shop in 2021. In my defense, I only talk with these companies about technical ideas, not business strategies. And I did tell Fauna they should add SQL support, but they did not take my advice. Fauna An interesting distributed DBMS based on Dan Abadi's research for deterministic concurrency control . They provided strongly consistent transactions right when the NoSQL fade was waning, and Spanner made transactions cool again. But they had a proprietary query language and made big bets on GraphQL. PostgresML The idea seemed obvious: enable people to run ML/AI operations inside of their PostgreSQL DBMS. The challenge was to convince people to migrate their existing databases to their hosted platform. They were pushing pgCat as a proxy to mirror database traffic. One of the co-founders joined Anthropic. The other co-founder created a new proxy project called pgDog . Derby This is one of the first DBMSs written in Java, dating back to 1997 (originally called "Java DB" or "JBMS"). IBM donated it to the Apache Foundation in the 2000s, and it was renamed as Derby. In October 2025, the project announced that the system would enter "read-only mode" because no one was actively maintaining it anymore. Hydra Although there is no official announcement for the DuckDB-inside-Postgres startup, the co-founders and employees have scattered to other companies. MyScaleDB This was a fork of Clickhouse that adds vector search and full-text indexing using Tantivy. They announced they were closing in May 2025. Voltron Data This was supposed to be the supergroup of database companies. Think of it like having Run the Jewels team of heavy hitters. You had top engineers from Nvidia Rapids, the inventor of Apache Arrow and Python Pandas, and the Peruvian GPU wizards from BlazingSQL . Then throw in $110m in VC money from top firms that included the future CEO of Intel (and a board of trustee member at CMU ). 
They built a GPU-accelerated database (Theseus), but failed to launch it in a timely manner.

Lastly, although not a business, I would be remiss not to mention the closing of IBM Research Almaden. IBM built this site in 1986, and it was the database research mecca for decades. I interviewed at Almaden in 2013 and found the scenery to be beautiful. The IBM Research Database Group is not what it used to be. Still, the alum list of this hallowed database ground is impressive: Rakesh Agrawal, Donald Chamberlin, Ronald Fagin, Laura Haas, Mohan, Pat Selinger, Moshe Vardi, Jennifer Widom, and Guy Lohman.

Update 2026-01-05: I missed that Gel was acquired by Vercel in December 2025. [Credit]

Update 2026-01-05: I also missed that Supabase raised two funding rounds in 2025.

Update 2026-01-05: Although TurboPuffer has not made an official announcement about raising a round, their CEO mentions adding somebody from Thrive Capital to their team. [Credit]

Update 2026-01-05: Apparently I need a better way to track fundraises because I missed LanceDB's Series A round too! [Credit]

Andy's Take: Somebody claimed that I judge the quality of a database based on how much funding the backing company raises for its development. This is obviously not true. I track these happenings because the database research game is crowded and high-energy. Not only am I "competing" against academics at other universities, but big tech companies and small start-ups are also putting out interesting systems I need to follow. The industry research labs are not what they used to be, except for Microsoft Research, which is still aggressively hiring top people and doing incredible work.

I predicted in 2022 that there would be a large number of database company closings in 2025. Yes, there were more closings this year than in previous years, but not at the scale I expected. The death of Voltron and the sort-of acquihire of HeavyDB seem to continue the trend of the inviability of GPU-accelerated databases.
Kinetica has been milking government contracts for years, and SQream still appears to be kicking. These companies are still niche, and nobody has been able to make a significant dent in the dominance of CPU-powered DBMSs. I can't say who or what, but you will hear some major GPU-accelerated database announcements from vendors in 2026.

It also provides further evidence of the commoditization of OLAP engines: modern systems have gotten so fast that the performance difference between them is negligible for low-level operations (scans, joins), so the things that differentiate one system from another are user experience and the quality of the query plans their optimizers generate.

The Couchbase and SingleStore acquisitions by private equity (PE) firms might signal a future trend in the database industry. Of course, PE acquisitions have happened before, but they all seem to be recent: (1) MarkLogic in 2020, (2) Cloudera in 2021, and (3) MariaDB in 2023. The only ones I can find before 2020 were SolidDB in 2007 and Informatica in 2015. PE acquisitions might replace the trend of plateaued database companies being bought by holding companies that milk the maintenance fees until eternity (Actian, Rocket). Even Oracle is still making money off RDB/VMS after buying them 30 years ago!

Lastly, props to Nikita Shamgunov. As far as I know, he is the only person to have co-founded two database companies (SingleStore and Neon) that were both acquired in a single year. Like when DMX (RIP) released two #1 albums in a single year (It's Dark and Hell Is Hot, Flesh of My Flesh), I don't think anybody is going to break Nikita's record any time soon.

Peak Male Performance

Talk about a banner year for the database OG Larry Ellison. The man turned 81 and accomplished more in one year than most people do in their lifetime. I will cover it all in chronological order.

Larry started the year ranked third-richest in the world.
The idea that he would be worth less than Mark Zuckerberg was keeping him up at night. Some were saying Larry's insomnia was due to a diet change after he bought a famous British pub and was eating more pies. But I assure you that Larry's "veg-aquarian" diet has not changed in 30 years.

Then, in April 2025, we got the news that Larry had become the second-richest person in the world. He started sleeping a little better, but it still wasn't good enough. There was also still a lot going on in his life that was stressing him out. For example, Larry finally decided to sell his rare, semi-road-legal McLaren F1 supercar, complete with the original owner's manual in the glovebox.

In July 2025, Larry graced us with his third tweet in 13 years (known as "#3" by Larry aficionados such as myself). This was an update about the Ellison Institute of Technology (EIT) that Larry established near the University of Oxford. With the name EIT and its association with Oxford, it sounds like it would be a pure-research, non-profit institution, similar to Stanford's SRI or CMU's SEI. But it turns out to be an umbrella organization for a series of for-profit companies owned by a California-based limited liability company. Of course, a bunch of weirdos replied to #3 with promises of blockchain-powered cryogenic freezing or room-temperature superconductors. Larry told me he ignores those. Then there are people like this guy who get it.

The biggest database news of the year (possibly the century) hit us on Wednesday, September 10th, at approximately 3:00pm EST. After waiting for his turn for decades, Larry Joseph Ellison was finally anointed the richest person in the world. $ORCL shares rose by 40% that morning, and since Larry still owns 40% of the company, his estimated total worth is $393b. To put this in perspective, this not only made Larry the wealthiest person in the world, but also the richest person in the entire history of humanity.
The peak net worths, adjusted for inflation, of John D. Rockefeller and Andrew Carnegie (yes, the 'C' in CMU) were only $340b and $310b, respectively.

On top of Larry's ascension to the top of the world, Oracle was also involved in the acquisition of the U.S. company controlling TikTok, and Larry bankrolled Paramount's bid (Paramount is controlled by his son from his fourth marriage) to take over Warner Bros. The U.S. president even chided Larry to take control of CNN's news division, since Larry is the majority shareholder of Paramount.

Andy's Take: I don't even know where to begin. Of course, when I found out that Larry Ellison had become the richest person in the world, all thanks to databases, I was heartened that something positive had finally happened in our lives. I don't care that Oracle's stock was artificially pumped up by splashy deals to build AI data centers instead of its traditional software business. I don't care that he dropped down the rankings after personally losing $130b in two months. That's like you and me blowing a paycheck on FortuneCoins. It stings a little, and we had to eat rice and beans for two weeks mixed with expired hot sauce packets we took from Taco Bell, but we'll be alright.

Some people claim that Larry is out of touch with ordinary people. Or that he has lost his way because he is involved in things not directly related to databases. They point to things like his Hawaiian robot farm selling lettuce at $24/pound (€41/kg). Or that 81-year-old men don't have naturally blonde hair. The truth is that Larry Ellison has conquered the enterprise database world, competitive sailing, and techbro wellness spas. The obvious next step is to take over a cable TV channel watched by thousands of people waiting in airports every day.

Every time I talk with Larry, he makes it clear that he does not care one bit what people say or think about him. He knows his fans love him. His (new) wife loves him. And in the end, that's all that matters.
Conclusion

Before we close, I want to give some quick shout-outs and words of advice. First is to PT for keeping their database game tight with Turso in lockdown (see you on the outside). Condolences to JT for losing their job for trapping their KevoDB database sidepiece. And be sure to only put fake data in your database for testing, and not to sell it for $175m only to end up getting a seven-year bid.

My Ph.D. students and I also have a new start-up. I hope to say more on that soon.

Word is bond.
</article>

<comments>
1. Maybe off-topic but,

If you're not familiar with the CMU DB Group you might want to check out their eccentric teaching style [1].

I absolutely love their gangsta intros like [2] and pre-lecture dj sets like [3].

I also remember a video where he was lecturing with someone sleeping on the floor in the background for some reason. I can't find that video right now.

Not too sure about the context or Andy's biography, I'll research that later, I'm even more curious now.

[1] https://youtube.com/results?search_query=cmu+database

[2] https://youtu.be/dSxV5Sob5V8

[3] https://youtu.be/7NPIENPr-zk?t=85

2. Indeed, I was delighted when I read the part about Wu-Tang's time capsule; OP is obviously a Wu-Tang and general hip-hop fan. The intro you shared is dope!

3. I can't tell whether their "intro to database systems" is an introductory (undergrad) level course or some advanced course (as in, an introduction to database internals).

Anyone willing to clarify this? I'm quite weak at database stuff, i'd love to find some undergrad-level proper course to learn and catch up.

4. It is an undergrad course, though it is cross-listed for masters students as well. At CMU, the prerequisites chain looks like this: 15-122 (intro imperative programming, zero background assumed, taken by first semester CS undergrads) -> 15-213 (intro systems programming, typically taken by the end of the second year) -> 15-445 (intro to database systems, typically taken in the third or fourth year). So in theory, it's about one year of material away from zero experience.

5. It's the internals.

He is training up people to work on new features for existing databases, or build new ones.

Not application developers on how to use a database.

Knowing some of the internals can help application developers make better decisions when it comes to using databases though.

6. Here is the playlist: https://www.youtube.com/playlist?list=PLSE8ODhjZXjYMAgsGH-Gt...

You can tell from the topics, it's related to building databases, not using them.

7. (I consed "https://" onto your links so they'd become clickable. Hope that's ok!)

8. While the author mentions that he just doesn't have the time to look at all the databases, none of the reviews of the last few years mention immutable and/or bi-temporal databases.

Which looks more like a blind spot to me honestly. This category of databases is just fantastic for industries like fintech.

Two candidates are sticking out.
https://xtdb.com/blog/launching-xtdb-v2 (2025)
https://blog.datomic.com/2023/04/datomic-is-free.html (2023)

9. > none of the reviews of the last few years mention immutable and/or bi-temporal databases.

We hosted XTDB to give a tech talk five weeks ago:

https://db.cs.cmu.edu/events/futuredata-reconstructing-histo...

> Which looks more like a blind spot to me honestly.

What do you want me to say about them? Just that they exist?

10. Nice work Andy. I'd love to hear about semantic layer developments in this space (e.g. Malloy etc.). Something to consider for the future. Thanks.

11. > I'd love to hear about semantic layer developments in this space (e.g. Malloy etc.)

We also hosted Lloyd to give a talk about Malloy in March 2025:

https://db.cs.cmu.edu/events/sql-death-malloy-a-modern-open-...

12. You can get pretty far with just PG using tstzrange and friends: https://www.postgresql.org/docs/current/rangetypes.html

Otherwise there are full bitemporal extensions for PG, like this one: https://github.com/hettie-d/pg_bitemporal

What we do is range types for when a row applies or not, so we get history, and then for 'immutability' we have 2 audit systems, one in-database as row triggers that keeps an on-line copy of what's changed and by who. This also gives us built-in undo for everything. Some mistake happens, we can just undo the change easy peasy. The audit log captures the undo as well of course, so we keep that history as well.

Then we also do an "off-line" copy, via PG logs, that get shipped off the main database into archival storage.

Works really well for us.
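A minimal runnable sketch of the validity-period pattern described above. The schema is hypothetical, and plain `valid_from`/`valid_to` columns in SQLite stand in for PostgreSQL's `tstzrange` (an open-ended period is modeled as `valid_to = NULL`):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Each row carries the period during which it was the truth.
    CREATE TABLE price (
        item_id    INTEGER NOT NULL,
        amount     REAL    NOT NULL,
        valid_from TEXT    NOT NULL,
        valid_to   TEXT              -- NULL = still current
    );
""")

def set_price(item_id, amount, ts):
    # An "update" never destroys history: close the open row, add a new one.
    con.execute("UPDATE price SET valid_to = ? WHERE item_id = ? AND valid_to IS NULL",
                (ts, item_id))
    con.execute("INSERT INTO price (item_id, amount, valid_from) VALUES (?, ?, ?)",
                (item_id, amount, ts))

def price_as_of(item_id, ts):
    # Point-in-time read: the row whose validity period contains ts.
    row = con.execute("""SELECT amount FROM price
                          WHERE item_id = ? AND valid_from <= ?
                            AND (valid_to IS NULL OR valid_to > ?)""",
                      (item_id, ts, ts)).fetchone()
    return row[0] if row else None

set_price(42, 19.99, "2025-01-01")
set_price(42, 24.99, "2025-06-01")

print(price_as_of(42, "2025-03-15"))  # 19.99 (historical state survives the update)
print(price_as_of(42, "2025-07-01"))  # 24.99 (current state)
```

In PostgreSQL itself, a single `tstzrange` column with a gist exclusion constraint additionally prevents overlapping periods for the same key, which this SQLite stand-in does not enforce.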

13. People are slow to realize the benefit of immutable databases, but it is happening. It's not just auditability; immutable databases can also allow concurrent reads while writes are happening, fast cloning of data structures, and fast undo of transactions.

The ones you mentioned are large backend databases, but I'm working on an "immutable SQLite"...a single file immutable database that is embedded and works as a library: https://github.com/radarroark/xitdb-java

14. I see people bolting temporality and immutability onto triple stores, because xtdb and datomic can't keep up with their SPARQL graph traversal. I'm hoping for a triple store with native support for time travel.

15. Lance graph?

16. FYI I made a comment very similar to yours, before reading yours. I'll put it here for reference. https://news.ycombinator.com/item?id=46503181

17. Why fintech specifically?

18. Destructive operations are both tempting to some devs and immensely problematic in that industry for regulatory purposes, so picking a tech that is inherently incapable of destructive operations is alluring, I suppose.

19. I would assume that it's because in fintech it's more common than in other domains to want to revert a particular thread of transactions without touching others from the same time.

20. Not only transactions - but state of the world.

21. compliance requirements mostly (same for health tech)

22. Because, money.

23. XTDB addresses a real use-case. I wish we invested more in time series databases actually: there's a ton of potential in a GIS-style database, but 1D and oriented around regions on the timeline, not shapes in space.

That said, it's kind of frustrating that XTDB has to be its own top-level database instead of a storage engine or plugin for another. XTDB's core competence is its approach to temporal row tagging and querying. What part of this core competence requires a new SQL parser?

I get that the XTDB people don't want to expose their feature set as a bunch of awkward table-valued functions or whatever. Ideally, DB plugins for Postgres, SQLite, DuckDB, whatever would be able to extend the SQL grammar itself (which isn't that hard if you structure a PEG parser right) and expose new capabilities in an ergonomic way so we don't end up with a world of custom database-verticals each built around one neat idea and duplicating the rest.

I'd love to see databases built out of reusable lego blocks to a greater extent than today. Why doesn't Calcite get more love? Is it the Java smell?

24. > it's kind of frustrating that XTDB has to be its own top-level database instead of a storage engine or plugin for another. XTDB's core competence is its approach to temporal row tagging and querying. What part of this core competence requires a new SQL parser?

Many implementation options were considered before we embarked on v2, including building on Calcite. We opted to maximise flexibility over the long term (we have bigger ambitions beyond the bitemporal angle) and to keep non-Clojure/Kotlin dependencies to a minimum.

25. From my perspective on databases, two trends continued in 2025:

1: Moving everything to SQLite

2: Using mostly JSON fields

Both started already a few years back and accelerated in 2025.

SQLite is just so nice and easy to deal with, with its no-daemon, one-file-per-db and one-type-per value approach.

And the JSON arrow functions make it a pleasure to work with flexible JSON data.
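The JSON access the comment describes can be sketched with Python's bundled `sqlite3` module. The table and data are hypothetical, and this assumes a SQLite build with the JSON functions available (built in by default since 3.38, where the arrow operators also work):

```python
import json
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")

rows = [
    {"user": "alice", "action": "login",    "meta": {"ip": "10.0.0.1"}},
    {"user": "bob",   "action": "logout",   "meta": {"ip": "10.0.0.2"}},
    {"user": "alice", "action": "purchase", "meta": {"amount": 12.5}},
]
con.executemany("INSERT INTO events (payload) VALUES (?)",
                [(json.dumps(r),) for r in rows])

# Pull typed values out of the JSON blob; on SQLite >= 3.38 the same thing
# can be written with the arrow operators, e.g. payload ->> '$.user'.
result = con.execute("""
    SELECT json_extract(payload, '$.user')   AS user,
           json_extract(payload, '$.action') AS action
      FROM events
     WHERE json_extract(payload, '$.user') = 'alice'
     ORDER BY id
""").fetchall()

print(result)  # [('alice', 'login'), ('alice', 'purchase')]
```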

26. From my perspective, everything's DuckDB.

Single file per database, Multiple ingestion formats, full text search, S3 support, Parquet file support, columnar storage. fully typed.

WASM version for full SQL in JavaScript.

27. This is a funny thread to me because my frustration is at the intersection of your comments: I keep wanting sqlite for writes (and lookups) and duckdb for reads. Are you aware of anything that works like this?

28. DuckDB can read/write SQLite files via extension. So you can do that now with DuckDB as is.

https://duckdb.org/docs/stable/core_extensions/sqlite

29. My understanding is that this is still too slow for quick inserts, because duckdb (like all columnar stores) is designed for batches.

30. The way I understood it, you can do your inserts with SQLite "proper", and simultaneously use DuckDB for analytics (aka read-only).

31. Aha! That makes so much sense. Thank you for this.

Edit: Ah, right, the downside is that this is not going to have good olap query performance when interacting directly with the sqlite tables. So still necessary to copy out to duckdb tables (probably in batches) if this matters. Still seems very useful to me though.

32. Analytics is done in "batches" (daily, weekly) anyways, right?

We know you can't get both, row and column orders at the same time, and that continuously maintaining both means duplication and ensuring you get the worst case from both worlds.

Local, row-wise writing is the way to go for write performance. Column-oriented reads are the way to do analytics at scale. It seems alright to have a sync process that does the order re-arrangement (maybe with extra precomputed statistics, and sharding to allow many workers if necessary) to let queries of now historical data run fast.

33. Not all olap-like queries are for daily reporting.

I agree that the basic architecture should be row order -> delay -> column order, but the question (in my mind) is balancing the length of that delay with the usefulness of column order queries for a given workload. I seem to keep running into workloads that do inserts very quickly and then batch reads on a slower cadence (either in lockstep with the writes, or concurrently) but not on the extremely slow cadence seen in the typical olap reporting type flow. Essentially, building up state and then querying the results.

I'm not so sure about "continuously maintaining both means duplication and ensuring you get the worst case from both worlds". Maybe you're right, I'm just not so sure. I agree that it's duplicating storage requirements, but is that such a big deal? And I think if fast writes and lookups and fast batch reads are both possible at the cost of storage duplication, that would actually be the best case from both worlds?

I mean, this isn't that different conceptually from the architecture of log-structured merge trees, which have this same kind of "duplication" but for good purpose. (Indeed, rocksdb has been the closest thing to what I want for this workload that I've found; I just think it would be neat if I could use sqlite+duckdb instead, accepting some tradeoffs.)

34. > the question (in my mind) is balancing the length of that delay with the usefulness of column order queries for a given workload. I seem to keep running into workloads that do inserts very quickly and then batch reads on a slower cadence (either in lockstep with the writes, or concurrently) but not on the extremely slow cadence seen in the typical olap reporting type flow. Essentially, building up state and then querying the results.

I see. Can you come up with row/table watermarks? Say your column store is up-to-date with certain watermark, so any query that requires freshness beyond that will need to snoop into the rows that haven't made it into the columnar store to check for data up to the required query timestamp.

In the past I've dealt with a system that had read-optimised columnar data that was overlaid with fresh write-optimised data and used timestamps to agree on the data that should be visible to the queries. It continuously consolidated data into the read-optimised store instead of having the silly daily job that you might have in the extremely slow cadence reporting job you mention.

You can write such a system, but in reality I've found it hard to justify building a system for continuous updates when a 15min delay isn't the end of the world, but it's doable if you want it.

> I'm not so sure about "continuously maintaining both means duplication and ensuring you get the worst case from both worlds". Maybe you're right, I'm just not so sure. I agree that it's duplicating storage requirements, but is that such a big deal? And I think if fast writes and lookups and fast batch reads are both possible at the cost of storage duplication, that would actually be the best case from both worlds?

I mean that if you want both views in a consistent world, then writes will bring things to a crawl as both, row and column ordered data needs to be updated before the writing lock is released.
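The watermark overlay discussed in this thread can be sketched as follows. The schema and single-process setup are hypothetical, with two SQLite tables standing in for the write-optimized and read-optimized stores:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE hot  (ts INTEGER, key TEXT, val INTEGER);  -- write-optimized
    CREATE TABLE cold (ts INTEGER, key TEXT, val INTEGER);  -- read-optimized copy
""")
watermark = 0  # everything with ts <= watermark has been consolidated into cold

def write(ts, key, val):
    con.execute("INSERT INTO hot VALUES (?, ?, ?)", (ts, key, val))

def consolidate(up_to):
    # Move rows at or below the new watermark into the read-optimized store.
    global watermark
    con.execute("INSERT INTO cold SELECT * FROM hot WHERE ts <= ?", (up_to,))
    con.execute("DELETE FROM hot WHERE ts <= ?", (up_to,))
    watermark = up_to

def total(key):
    # A query overlays the consolidated store with anything still in hot,
    # using the watermark to decide which store owns which timestamps.
    (s,) = con.execute("""
        SELECT COALESCE(SUM(val), 0) FROM (
            SELECT val FROM cold WHERE key = :k AND ts <= :w
            UNION ALL
            SELECT val FROM hot  WHERE key = :k AND ts >  :w
        )""", {"k": key, "w": watermark}).fetchone()
    return s

write(1, "a", 10)
write(2, "a", 20)
consolidate(up_to=1)   # only ts=1 has been consolidated so far
write(3, "a", 5)

print(total("a"))  # 35: 10 from cold, 20 and 5 still read from hot
```

Reads stay consistent without blocking writes because each store serves a disjoint timestamp range; consolidation just advances the boundary.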

35. Yes! We're definitely talking about the same thing here! Definitely not thinking of consistent writes to both views.

Now that you said this about watermarks, I realize that this is definitely the same idea as streaming systems like flink (which is where I'm familiar with watermarks from), but my use cases are smaller data and I'm looking for lower latency than distributed systems like that. I'm interested in delays that are on the order of double to triple digit milliseconds, rather than 15 minutes. (But also not microseconds.)

I definitely agree that it's difficult to justify building this, which is why I keep looking for a system that already exists :)

36. I think you could build an ETL-ish workflow where you use SQLite for OLTP and DuckDB for OLAP, but I suppose it's very workload dependent, there are several tradeoffs here.

37. Right. This is what I want, but transparently to the client. It seems fairly straightforward, but I keep looking for an existing implementation of it and haven't found one yet.

38. very interesting. whats the vector indexing story like in duckdb these days?

also are there sqlite-duckdb sync engines or is that an oxymoron

39. https://duckdb.org/docs/stable/core_extensions/vss

It's not bad if you need something quick. I haven't had a large need of ANN in duckdb since it's doing more analytical/exploratory needs, but it's definitely there if you need it.

40. From my perspective - do you even need a database?

SQLite is kind-of the middle ground between a full fat database, and 'writing your own object storage'. To put it another way, it provides 'regularised' object access API, rather than, say, a variant of types in a vector that you use filter or map over.

41. If I would write my own data storage I would re-implement SQLite. Why would I want to do that?

42. Not sure if this is quite what you are getting at, but the SQLite folks even mention this as a great use-case: https://www.sqlite.org/appfileformat.html

43. As a backend database that's not multi user, how many web connections that do writes can it realistically handle? Assuming writes are small say 100+ rows each?

Any mitigation strategy for larger use cases?

Thanks in advance!

44. Couple thousand simultaneous should be fine, depending on total system load, whether you're running on spinning disks or on SSDs, p50/99 latency demands and of course you'd need to enable the WAL pragma to allow simultaneous writes in the first place. Run an experiment to be sure about your specific situation.

45. You also need BEGIN CONCURRENT to allow simultaneous write transactions.

https://www.sqlite.org/src/doc/begin-concurrent/doc/begin_co...
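A small demonstration of the WAL behavior using Python's bundled `sqlite3`. Note that `BEGIN CONCURRENT` lives in a separate SQLite branch, so this sketch only shows WAL's readers-don't-block-on-a-writer property, not multiple simultaneous write transactions:

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "app.db")

# isolation_level=None: the module stays out of the way, we issue BEGIN/COMMIT.
writer = sqlite3.connect(path, isolation_level=None)
mode = writer.execute("PRAGMA journal_mode=WAL").fetchone()[0]

writer.execute("CREATE TABLE t (x INTEGER)")
writer.execute("INSERT INTO t VALUES (1)")

# Leave a write transaction open...
writer.execute("BEGIN IMMEDIATE")
writer.execute("INSERT INTO t VALUES (2)")

# ...and a second connection can still read the last committed snapshot.
# In the default rollback-journal mode this read could block on the writer.
reader = sqlite3.connect(path, isolation_level=None)
before = reader.execute("SELECT COUNT(*) FROM t").fetchone()[0]

writer.execute("COMMIT")
after = reader.execute("SELECT COUNT(*) FROM t").fetchone()[0]

print(mode, before, after)  # wal 1 2
```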

46. After 2 years in production with a small (but write-heavy) web service... it's a mixed bag. It definitely does the job, but not having a DB server has drawbacks as well as benefits. The biggest is the lack of caching of the file/DB in RAM. As a result I have to do my own read caching, which is fine in Rust using the moka caching library, but it's still something you have to do yourself that would otherwise come for free with Postgres.
This of course also makes it impossible to share the cache between instances; doing so would require employing redis/memcached, at which point it would be better to use Postgres.

It has been OK so far, but definitely I will have to migrate to Postgres at one point, rather sooner than later.

47. How would caching on the db layer help with your web service?

In my experience, caching makes most sense on the CDN layer. Which not only caches the DB requests but the result of the rendering and everything else. So most requests do not even hit your server. And those that do need fresh data anyhow.

48. As I said, my app is write heavy. So there are several separate processes that constantly write to the database, but of course, often, before writing, they need to read in order to decide what/where to write. Currently they need to have their own read cache in order to not clog the database.

The "web service" is only the user facing part which bears the least load. Read caching is useful there too as users look at statistics, so calculating them once every 5-10 minutes and caching them is needed, as that requires scanning the whole database.

A CDN is something I don't even have. It's not needed for the amount of users I have.

If I was using Postgres, these writer processes + the web service would share the same read cache for free (coming from Postgres itself). The difference wouldn't be huge if I migrated right now, but now I already have the custom caching.

49. I am no expert, but SQLite does have in memory store? At least for tables that need it..ofc sync of the writes to this store may need more work.

50. Why have multiple connections in the first place?

If your writes are fast, doing them serially does not cause anyone to wait.

How often does the typical user write to the DB? Often it is like once per day or so (for example on hacker news). Say the write takes 1/1000s. Then you can serve

1000 * 60 * 60 * 24 = 86 million users

And nobody has to wait longer than a second when they hit the "reply" button, as I do now ...
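The back-of-the-envelope arithmetic above, written out:

```python
# One write takes 1 ms, handled serially on a single connection,
# and each user writes about once per day.
writes_per_second = 1000
seconds_per_day = 60 * 60 * 24   # 86,400

writes_per_day = writes_per_second * seconds_per_day
print(writes_per_day)  # 86400000 -> roughly 86 million once-a-day users
```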

51. > If your writes are fast, doing them serially does not cause anyone to wait.

Why impose such a limitation on your system when you don't have to by using some other database actually designed for multi user systems (Postgres, MySQL, etc)?

52. Because development and maintenance faster and easier to reason about. Increasing the chances you really get to 86 million daily active users.

53. So in this solution, you run the backend on a single node that reads/writes from an SQLite file, and that is the entire system?

54. That's basically how the web started. You can serve a ridiculous number of users from a single physical machine. It isn't until you get into the hundreds-of-millions-of-users ballpark that you need to actually create architecture. The "cloud" lets you rent a small part of a physical machine, so it feels like you need more machines than you do. But a modern server? Easily 16-32+ cores, 128+ GB of RAM, and hundreds of TB of space. All for less than $2k per month (amortized). Yeah, you need an actual (small) team of people to manage that, but that will get you so far that it is utterly ridiculous.

Assuming you can accept 99% uptime (that's ~3 days a year being down), and if you were on a single cloud in 2025; that's basically last year.

55. I agree...there is scale and then there is scale. And then there is scale like Facebook.

We need not assume internet FB level scale for typical biz apps where one instance may support a few hundred users max. Or even few thousand. Over engineering under such assumptions is likely cost ineffective and may even increase surface area of risk. $0.02

56. That depends on the use case. HN is not a good example. I am referring to business applications where users submit data. Ofc in these cases we are looking at 00s not millions of users. The answer is good enough.

57. Pardon my ignorance, yet wasn't the prevailing thought a few years ago that you would never use SQLite in production? Has that school of thought changed?

58. SQlite as a database for web services had a little bit of a boom due to:

1. People gaining newfound appreciation of having the database on the same machine as the web server itself. The latency gains can be substantial and obviously there are some small cost savings too as you don't need a separate database server anymore. This does obviously limit you to a single web server, but single machines can have tons of cores and serve tens of thousands of requests per second, so that is not as limiting as you'd think.

2. Tools like litestream will continuously back up all writes to object storage, so that one web server having a hardware failure is not a problem as long as your SLA allow downtimes of a few minutes every few years. (and let's be real, most small companies for which this would be a good architecture don't have any SLA at all)

3. SQLite has concurrent writes now, so it's gotten much more performant in situations with multiple users at the same time.

So for specific use cases it can be a nice setup because you don't feel the downsides (yet) but you do get better latency and simpler architecture. That said, there's a reason the standard became the standard, so unless you have a very specific reason to choose this I'd recommend the "normal" multitier architectures in like 99% of cases.

59. > SQLite has concurrent writes now

Just to clarify: Unless I've missed something, this is only with WAL mode and concurrent reads at the same time as writes, I don't think it can handle multiple concurrent writes at the same time?

60. I think only Turso — SQLite rewritten in Rust — supports that.

61. I’m a fan of SQLite but just want to point out there’s no reason you can’t have Postgres or some other rdbms on the same machine as the webserver too. It’s just another program running in the background bound to a port similar to the web server itself.

62. SQLite is likely the most widely used production database due to its widespread usage in desktop and mobile software, and SQLite databases being a Library of Congress "sustainable format".

63. Most of the usage was/is as a local ACID-compliant replacement for txt/ini/custom local/bundled files though.

64. "Production" can mean many different things to different people. It's very widely used as a backend structured file format in Android and iOS/macOS (e.g. for apps like Notes, Photos). Is that "production"? It's not widely used, and largely inappropriate, for applications with many concurrent writes.

Sqlite docs has a good overview of appropriate and inappropriate uses: https://sqlite.org/whentouse.html
It's best to start with Section 2 "Situations Where A Client/Server RDBMS May Work Better"

65. Only for large scale multiple user applications. It’s more than reasonable as a data store in local applications or at smaller scales where having the application and data layer on the same machine are acceptable.

If you’re at a point where the application needs to talk over a network to your database then that’s a reasonable heuristic that you should use a different DB. I personally wouldn’t trust my data to NFS.

66. What is a "local application"?

67. Funny how people used to ask "what is a cloud application", and now they ask "what is a local application" :-)

Local as in "desktop application on the local machine" where you are the sole user.

68. This, though I think other posters have pointed to a web app/site that’s backed by SQLite. It can be a perfectly reasonable approach, I think, as the application is the web server and it likely accesses SQLite on the same machine.

69. The reason you heard that was probably because they were talking about a more specific circumstance. For example SQLite is often used as a database during development in Django projects but not usually in production (there are exceptions of course!). So you may have read when setting up Django, or a similar thing, that the SQLite option wasn't meant for production because usually you'd use a database like Postgres for that. Absolutely doesn't mean that SQLite isn't used in production, it's just used for different things.

70. I would say SQLite when possible, PostgreSQL (incl. extensions) when necessary, DuckDB for local/hobbyist data analysis and BigQuery (often TB or PB range) for enterprise business intelligence.

71. I think the right pattern here is edge sharding of user data. Cloudflare makes this pretty easy with D1/Hyperdrive.

72. For as much talk as I see about SQLite, are people actually using it or does it just have good marketers?

73. Among people who can actually code (in contrast to just stitch together services), I see it used all around.

For someone who openly describes his stack and revenue, look up Pieter Levels, how he serves hundreds of thousands of users and makes millions of dollars per year, using SQLite as the storage layer.

74. It's the standard for mobile. That said, in server-side enterprise computing, I know no one who uses it. I'm sure there are applications, but in this domain you'd need a good justification for not following standard patterns.

I have used DuckDB on an application server because it computes aggregations lightning fast which saved this app from needing caching, background services and all the invalidation and failure modes that come with those two.

75. > are people actually using it or does it just have good marketers?

_You_ are using it right this second. It's storing your browser's bookmarks (at a minimum, and possibly other browser-internal data).

76. If you use desktops, laptops, or mobile phones, there is a very good chance you have at least ten SQLite databases in your possession right now.

77. It is fantastic software, have you ever used it?

78. I don't have a use case for it. I've used it a tiny bit for mocking databases in memory, but because it's not fully Postgres, I've switched entirely to TestContainers.

79. FWIW (and this is IMHO of course) DuckDB makes working with random JSON much nicer than SQLite, not least because I can extract JSON fields to dense columnar representations and do it in a deterministic, repeatable way.

The only thing I want out of DuckDB core at this point is support for overriding the columnar storage representation for certain structs. Right now, DuckDB decomposes structs into fields and stores each field in a column. I'd like to be able to say "no, please, pre-materialize this tuple subset and store this struct in an internal BLOB or something".
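For comparison, here is what that extraction step looks like on the SQLite side of the trade-off the comment describes, as a sketch assuming the bundled SQLite exposes the JSON1 functions (the table and field names are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw (doc TEXT)")
conn.execute("""INSERT INTO raw VALUES ('{"user": "alice", "score": 42}')""")

# Materialize JSON fields into a dense, typed table; this is the step
# the comment finds more ergonomic and repeatable in DuckDB.
conn.execute("""
    CREATE TABLE dense AS
    SELECT json_extract(doc, '$.user')  AS user,
           json_extract(doc, '$.score') AS score
    FROM raw
""")
rows = conn.execute("SELECT user, score FROM dense").fetchall()
print(rows)  # [('alice', 42)]
```

DuckDB's equivalent can infer the whole struct's schema in one pass, which is the ergonomic gap being described.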

80. Pavlo is right to be skeptical about MCP security. The entire philosophy of MCP seems to be about maximizing context availability for the model, which stands in direct opposition to the principle of Least Privilege.

When you expose a database via a protocol designed for 'context', you aren't just exposing data; you're exposing the schema's complexity to an entity that handles ambiguity poorly. It feels like we're just reinventing SQL injection, but this time the injection comes from the system's own hallucinations rather than a malicious user.

81. Totally agree, unfettered access to databases are dangerous

There are ways to reduce injection risk: since LLMs are stateless, you can monitor the origin and trustworthiness of the context that enters the LLM and then decide whether MCP actions that affect state are dangerous or not.

We've implemented a mechanism like this, based on Simon Willison's lethal trifecta framework, as an MCP gateway monitoring what enters context. LMK if you have any feedback on this approach to MCP security. It's not as elegant as the approach Pavlo talks about in the post, but we believe it's a good band-aid solution for the time being as the technology matures.

https://github.com/Edison-Watch/open-edison
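The core of a lethal-trifecta check is small enough to sketch. This is a hypothetical illustration with invented flag names, not the linked project's API; a real gateway tracks much richer provenance than a set of strings:

```python
# Simon Willison's "lethal trifecta": an agent is dangerous when it has
# (1) exposure to untrusted input, (2) access to private data, and
# (3) an external communication channel, all in the same context.
def trifecta_risk(context: set[str]) -> bool:
    """True when all three legs of the lethal trifecta are present."""
    dangerous = {"untrusted_input", "private_data", "external_comms"}
    return dangerous <= context

print(trifecta_risk({"untrusted_input", "private_data", "external_comms"}))  # True
print(trifecta_risk({"untrusted_input", "private_data"}))                    # False
```

The point of a gateway is that denying any one leg (for instance, blocking external comms once untrusted content has entered context) defuses the whole attack.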

82. > Totally agree, unfettered access to databases are dangerous

Any decent MVCC database should be able to provide MCP access to a mutable yet isolated snapshot of the DB, though, and it doesn't strike me as crazy to let the agent play with that.

83. For this, the database has to have nested transactions, where a COMMIT propagates up one level rather than to the actual database, and not many databases have them. Also, a double COMMIT may propagate changes outside of the agent's playbox.

84. > For this database has to have nested transactions, where COMMITs do propagate up one level and not to the actual database,

Correct, but nested transaction support doesn't seem that much of a reach if you're an MVCC-style system anyway (although you might have to factor out things like row watermarks to lookaside tables if you want to let them be branchy instead of XID being a write lock.)

You could version the index B-tree nodes too.

85. > but nested transaction support doesn't seem that much of a reach if you're an MVCC-style system anyway

You are talking about code that has to be written and tested.

Also, do not forget about double COMMIT, intentional or not.
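SQLite's SAVEPOINTs are one existing, single-node instance of the "COMMIT propagates up one level" semantics under discussion: RELEASE folds the inner work into the enclosing transaction instead of reaching the database file, and the outer transaction can still discard everything. A sketch:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.isolation_level = None  # manage transactions manually
conn.execute("CREATE TABLE t (v INTEGER)")

conn.execute("BEGIN")                  # outer transaction
conn.execute("INSERT INTO t VALUES (1)")

conn.execute("SAVEPOINT agent")        # inner, nested "transaction"
conn.execute("INSERT INTO t VALUES (2)")
conn.execute("RELEASE agent")          # a "COMMIT" that only propagates one level up

conn.execute("SAVEPOINT agent2")
conn.execute("INSERT INTO t VALUES (3)")
conn.execute("ROLLBACK TO agent2")     # discard the sandboxed work

conn.execute("ROLLBACK")               # outer rollback still undoes everything
count = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
print(count)  # 0
```

This doesn't answer the double-COMMIT concern by itself: an agent that issues RELEASE twice, or a bare COMMIT, can still escape its level unless the gateway intercepts those statements.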

86. I don't know anyone with a brain who is using a DB MCP with write permissions in prod. Trying to lay the blame on a protocol for doing something as nuts as that seems unfair.

87. Was the trade-off so exciting that we abandoned our own principles? Or, are we lemmings?

Edit: My apologies for the cynical take. I like to think that this is just the move fast break stuff ethos coming about.

88. I think it's time for a big move towards immutable databases that weren't even mentioned in this article. I've already worked with Datomic and immudb: Datomic is very good, but extremely complex and exotic, difficult learning curve to achieve perfect tuning. immudb is definitely not ready for production and starts having problems with mere hundreds of thousands of records. There's nothing too serious yet.

89. The author mentions it in the name change from EdgeDB to Gel. However, it could also have been added to the Acquisitions landscape: Gel joined Vercel [1].

1. https://www.geldata.com/blog/gel-joins-vercel

90. Thanks for catching this. Updated: https://www.cs.cmu.edu/~pavlo/blog/2026/01/2025-databases-re...

I need to figure out an automatic way to track these.

91. You just ruined my day. The post makes it sound like gel is now dead. The post by Vercel does not give me much hope either [1]. Last commit on the gel repo was two weeks ago.

[1] https://vercel.com/blog/investing-in-the-python-ecosystem

92. From discord:

> There has been a ton of interest expressed this week about potential community maintenance of Gel moving forward. To help organize and channel these hopes, I'm putting out a call for volunteers to join a Gel Community Fork Working Group (...GCFWG??). We are looking for 3-5 enthusiastic, trustworthy, and competent engineers to form a working group to create a "blessed" community-maintained fork of Gel. I would be available as an advisor to the WG, on a limited basis, in the beginning.

> The goal would be to produce a fork with its own build and distribution infrastructure and a credible commitment to maintainership. If successful, we will link to the project from the old Gel repos before archiving them, and potentially make the final CLI release support upgrading to the community fork.

> Applications accepted here: https://forms.gle/GcooC6ZDTjNRen939

> I'll be reaching out to people about applications in January.

93. I want to thank Andy and the entire DB Group at CMU. They’ve done a great job of making database accessible to so many people. They are world class.

94. What did they do?

95. look up the cmu db youtube

96. What an amazing set of articles, one thing that I think he's missed is the clear multi year trends.

Over the past 5 years there have been significant changes and several clear winners. Databricks and Snowflake have really demonstrated an ability to stay resilient despite strong competition from the cloud providers themselves, often through the privatization of what was previously open source. This is especially relevant given the article's mention of how Cloudera and Hortonworks failed to make it.

I also think the quiet execution of databases like ClickHouse has been extremely impressive; they've filled a niche that wasn't previously served by an obvious solution.

97. PG 18 is an absolutely fantastic release. Everyone talks about the async I/O worker support, but there's so much more: built-in Unicode locales, unique indexes/constraints/FKs that can be added in an unvalidated state, virtual generated (expression) columns, skip scans on btree indexes (absolutely huge), uuidv7 support, and more.
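Of those, uuidv7 is easy to picture from the spec alone: a 48-bit millisecond timestamp leads the value, so new ids arrive in roughly ascending order and append nicely at the right edge of a btree. A hedged client-side Python sketch of the RFC 9562 bit layout (not Postgres's implementation):

```python
import os
import time

def uuid7_hex() -> str:
    """Sketch of the RFC 9562 UUIDv7 layout: 48-bit ms timestamp + random bits."""
    raw = bytearray(int(time.time() * 1000).to_bytes(6, "big") + os.urandom(10))
    raw[6] = (raw[6] & 0x0F) | 0x70   # version nibble -> 7
    raw[8] = (raw[8] & 0x3F) | 0x80   # RFC 9562 variant bits -> 10xx
    h = bytes(raw).hex()
    return f"{h[:8]}-{h[8:12]}-{h[12:16]}-{h[16:20]}-{h[20:]}"

u1 = uuid7_hex()
time.sleep(0.005)
u2 = uuid7_hex()
print(u1[14], u1 < u2)  # 7 True  (version nibble; later ids sort after earlier ones)
```

Compare uuid4, where inserts scatter uniformly across the index; the timestamp prefix is what makes uuidv7 a reasonable default primary-key type.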

98. Supabase seems to be killing it. I read somewhere they are used by ~70% of YCombinator startups. I wonder how many of those eventually move to self-hosted.

99. Regarding distributed(-ish) Postgres, does anyone know if something like MySQL/MariaDB's multi-master Galera† is around for Pg:

> MariaDB Galera Cluster provides a synchronous replication system that uses an approach often called eager replication. In this model, nodes in a cluster synchronize with all other nodes by applying replicated updates as a single transaction. This means that when a transaction COMMITs, all nodes in the cluster have the same value. This process is accomplished using write-set replication through a group communication framework.

* https://mariadb.com/docs/galera-cluster/galera-architecture/...

This isn't necessarily about being "web scale", but having a first-party, fairly automated replication solution would make HA for a bunch of internal-only stuff much simpler.

† Yes, I am aware: https://aphyr.com/posts/327-jepsen-mariadb-galera-cluster

100. I can't believe that article has no mention of SQLite ??

101. > I can't believe that article has no mention of SQLite ??

https://www.cs.cmu.edu/~pavlo/blog/2026/01/2025-databases-re...

102. No MSSQL, DB2 or Oracle either. Anything this proven & stable is probably not worth blogging about in this context. SQLite gets a lot of attention on HN but that's a bit of an exception.

103. Same. CMD-F, 'sqlite', no hits, skip and go straight to comments.

104. > Acquisitions ... Gel → Vercel

is a bit misleading. Gel (formerly EdgeDB) is sunsetting its development. The (extremely talented) team joins Vercel to work on other stuff.

That was a hard hit for me in December. I loved working with EdgeQL so much.

105. It is a beautifully designed language and would make a great starting point for future DB projects.

106. No mention of DuckDB? Surprising.

107. Also somewhat surprised. DuckDB traction is impressive and on par with vector databases in their early phases. I think there's a good chance it will earn an honorable mention next year if adoption holds and becomes more mainstream. But my impression is that it's still early in its adoption curve where only those "in the know" are using it as a niche tool. It also still has some quirks and foot-guns that need moderately knowledgeable systems people to operate (e.g. it will happily OOM your DB)

108. Same surprise here. However in practice, the community tends to talk about DuckDB more like a client-side tool than a traditional database

109. I would like to mention that vector databases like Milvus got lots of new features to support RAG, Agent development, features like BM25, hybrid search etc..

110. Also emmer (perhaps too niche to get mentioned in an article like this), which focuses more on being a quick/flexible 'data scratchpad' rather than on scale.

https://hub.docker.com/r/tiemster/emmer

111. Nice to see it get mentioned here :) I like using it for scripts etc. too. Quite flexible, because you can do everything with the API.

112. Didn't know MongoDB was suing the company behind FerretDB. That's disgusting.

113. Andy has a balanced and appropriate take here.

114. With a trend towards immutable single writer databases MMAP seems like a massive win.

115. Barely any mention of Oracle or MS SQL Server, commonly reckoned to be the #1 and #3 most used databases in the world

https://db-engines.com/en/ranking

116. Oracle is mentioned at the start, where he proclaims the "dominance" of Postgres and then admits its newest features have been in Oracle for nearly a quarter of a century already. The dominance he's talking about is only about how many startups raise how many millions from investors, not anything technical.

And then of course at the end he has a whole section about Larry Ellison, like always.

117. Isn't it because it's about news , as in what's changing, rather than being about what's staying the same? He's a researcher, so his interests are always going to be more oriented toward new systems and new companies more than the big dominant systems.

118. There's nothing technically new that he's covering here though? It's all just startups adding stuff to Postgres that Oracle had for decades already.

119. The startups are new.

120. Nothing about time series-oriented databases?

121. > Nothing about time series-oriented databases?

https://www.cs.cmu.edu/~pavlo/blog/2026/01/2025-databases-re...

122. Not much happened, I guess. ClickHouse has an experimental time-series engine: https://clickhouse.com/docs/engines/table-engines/special/ti...

123. QuestDB at least is gaining some popularity: https://questdb.com/

I was hoping to learn about some new potentially viable alternatives to InfluxDB, alas it seems I'll continue using it for now.

124. Over here, it is DB2, SQL Server, or Oracle if using a plain RDBMS, or whatever DB abstraction layer is provided on top of a SaaS product, where we get to query with some kind of ORM abstraction preventing raw SQL, or GraphQL, without knowing the implementation details.

125. This sounds like a flashback to J2EE. Which I know is still alive and well. Banks, insurance companies and the tax agency do not much care for fancy new stuff, but that it works.

126. I describe these techs like garbage trucks. No one likes to see them but they’re there every day doing a decent part of what it takes to hold society together hah.

127. Scott Hanselman has a good term for all these kind of jobs, the dark matter developers.

https://www.hanselman.com/blog/dark-matter-developers-the-un...

128. Yep, Fortune 500 enterprise consulting, boring technology that pays the bills.

Java, .NET, C++, nodejs, Sitecore, Adobe Experience Manager, Optimizely, SAP, Dynamics, headless CMSes,...

129. Never felt so old, seeing nodejs in a list of old boring stuff.

130. Yeah, it is on the edge, but unavoidable in many Web projects.

131. Andy is probably the only person who adores Larry Ellison (Oracle) unironically.

132. Ironically unironically.

133. I love these yearly review posts. Thanks Andy and team.

134. we had to restrict ours to views only because it kept trying to run updates. still breaks sometimes when it hallucinates column names but at least it can't do anything destructive
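That views-only restriction can also be pushed down to the connection itself, so the agent never holds a handle that can write. With SQLite the analogue is a read-only URI open (a sketch; for a server database you'd grant a read-only role instead):

```python
import os
import sqlite3
import tempfile

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "prod.db")

    # Seed the database the way the real application would.
    rw = sqlite3.connect(path)
    rw.execute("CREATE TABLE users (id INTEGER, name TEXT)")
    rw.execute("INSERT INTO users VALUES (1, 'alice')")
    rw.commit()
    rw.close()

    # Hand the agent a read-only handle: SELECTs work, any write raises.
    ro = sqlite3.connect(f"file:{path}?mode=ro", uri=True)
    row = ro.execute("SELECT name FROM users WHERE id = 1").fetchone()
    try:
        ro.execute("UPDATE users SET name = 'mallory'")
        write_blocked = False
    except sqlite3.OperationalError:
        write_blocked = True
    ro.close()

print(row, write_blocked)  # ('alice',) True
```

Hallucinated column names then fail with an ordinary error instead of doing damage, which matches the failure mode the comment describes.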

135. > "The Dominance of PostgreSQL Continues"

It seems like the author is more focused on database features than user base. Every metric I can find online says that MySQL/MariaDB is more popular than PostgreSQL. PostgreSQL seems "better" (more features, better standards compliance) but MySQL/MariaDB works fine for many people. Am I living in a bubble?

136. > Every metric I can find online says that MySQL/MariaDB is more popular than PostgreSQL

What are those metrics? If you're talking about things like the db-engines rankings, those are heavily skewed by non-production workloads. For example, MySQL still being the database for WordPress means it will forever have a high number of installations and of developers using it and asking StackOverflow questions. But when a new company or an established one is deciding which database to use for a custom application, MySQL is seldom in the running like it was 8-10 years ago.

137. Popularity can mean multiple things. Are we talking about how frequently a database is used or how frequently a database is chosen for new projects? MySQL will always be very popular because some very popular things use it like WordPress.

It does feel like a lot of the momentum has shifted to PostgreSQL recently. You even see it in terms of what companies are choosing for compatibility. Google has a lot more MySQL work historically, but when they created a compatibility interface for Cloud Spanner, they went with PostgreSQL. ClickHouse went with PostgreSQL. More that I'm forgetting at the moment. It used to be that everyone tried for MySQL wire compatibility, but that doesn't feel like what's happening now.

If MySQL is making you happy, great. But there has certainly been a shift toward PostgreSQL. MySQL will continue to be one of the most used databases just as PHP will remain one of the most used programming languages. There's a lot of stuff already built with those things. I think most metrics would say that PHP is more widely deployed than NodeJS, but I think it'd be hard to argue that PHP is what the developer community is excited about.

Even search here on HN. In the past year, 4 MySQL stories with over 100 points compared to 28 PostgreSQL stories with over 100 points (and zero MariaDB stories above 100 points, and 42 for SQLite). What are we talking about here on HN? Not nearly as frequently MySQL - we're talking about SQLite and PostgreSQL. That's not to say that MySQL doesn't work great for you or that it doesn't have a large installed base, but it isn't where our mindshare is about the future.

138. > ClickHouse went with PostgreSQL.

What do you mean by this? AFAIK they added MySQL wire protocol compatibility long before they added Postgres. And meanwhile their cloud offering still doesn't support Postgres wire protocol today, but it does support MySQL wire protocol.

> Even search here on HN.

fwiw MySQL has been extremely unpopular on HN for a decade or more, even back when MySQL was a more common choice for startups. So there's a bit of a self-fulfilling prophecy where MySQL ecosystem folks mostly stopped submitting stories here because they never got enough upvotes to rank high enough to get eyeballs and discussion.

That all said, I do agree with your overall thesis.

139. I think the author is basing his observations on where the money is flowing.
PostgreSQL-adjacent startups and businesses are seeing a lot of investment.

140. > Am I living in a bubble?

There are rumblings that the MySQL project is rudderless after Oracle fired the team working on the open-source project in September 2025. Oracle is putting all its energy in its closed-source MySQL Heatwave product. There is a new company that is looking to take over leadership of open-source MySQL but I can't talk about them yet.

The MariaDB Corporation financial problems have also spooked companies and so more of them are looking to switch to Postgres.

141. > There are rumblings that the MySQL project is rudderless after Oracle fired the team working on the open-source project in September 2025.

Not just the open-source project; 80%+ (depending a bit on when you start counting) of the MySQL team as a whole was let go, and the SVP in charge of MySQL was, eh, “moving to another part of the org to spend more time with his family”. There was never really a separate “MySQL Community Edition team” that you could fire, although of course there were teams that worked mostly or entirely on projects that were not open-sourced.

142. How is SpacetimeDB not mentioned here?

143. > How is SpacetimeDB not mentioned here?

https://www.cs.cmu.edu/~pavlo/blog/2026/01/2025-databases-re...

144. Why do "database" surveys like this not include DuckDB and SQLite, which are great [1] embedded answers to ClickHouse and PostgreSQL? Both are excellent and useful databases; DuckDB's reasonable syntax, fast vectorized everything, and support for ingesting the hairiest of data as in-DB ETL make me reach for it first these days, at least for the things I want to do.

Why is it that in "I'm a serious database person" circles, the popular embedded databases don't count?

[1] Yes, I know it's not an exact comparison.

145. TiDB has gained some momentum in silicon valley with companies looking to adopt it. Does he have any commentary on TiDB which is an OLTP and OLAP hybrid?

146. Can we even say that Anyblox is a file format? By my understanding of the project it's "just" a decoder for other file formats to solve the MxN problem.

147. It's so weird how everyone nowadays is using Postgres. It's not like end users can see your database.

It's disturbing how everyone is gravitating towards the same tools. This started happening with React and has kept getting worse. Software development sucks nowadays.

All technical decisions about which tools to use are made by people who don't have to use the tools. There is no nuance anymore. There's a blanket solution for every problem and there isn't much to choose from. Meanwhile, software is less reliable than it's ever been.

It's like a bad dream. Everything is bad and getting worse.

148. What's wrong with Postgres?

149. Which alternatives to PostgreSQL would you like to see get more attention?

150. All of them. Nothing wrong with Postgres, I like Postgres. But the more alternatives the better. My favorite database is RethinkDB but officially, it's a dead project. Unofficially it's still pretty great.
</comments>

Based on the content above, return a JSON object with this exact structure:
{
  "article_summary": "A short paragraph summarizing the article",
  "comment_summary": "A short paragraph summarizing the overall discussion",
  "topics": ["Topic Name # description of related ideas", ...]
}

Generate up to 20 distinct topics from the comments, focusing on the most interesting and prevalent themes. Each topic should have:
- A concise name (2-6 words) before the # separator
- A description after the # separator

For each topic, the description after the # separator should be 20-80 words identifying ideas, phrases, subtopics, and/or themes that appear in the comments that relate to this topic.

Remember: Output ONLY the JSON object, no commentary or markdown formatting.
