Features I'd like in PostgreSQL

Submitted by gil on

I’ve put in my fair share of time with PostgreSQL. It has no shortage of functionality and I believe its expansive feature set drove a lot of its growing popularity in the 2010s. As my time with the database has worn on I’ve come up with a handful of features that I wish the database had. Some of these are ideas borrowed from other systems, some are inventions of my own. The real tragedy is that I simply don’t have the time to implement the patches myself.

--i-am-a-dummy mode

This idea comes from MySQL, whose stock CLI client supports an --i-am-a-dummy flag. When enabled, MySQL identifies any UPDATE or DELETE that lacks a WHERE clause and fails it with an error. In other words, if you intended to DELETE FROM table1 WHERE pk = 10, started typing it into your prompt, and only got as far as DELETE FROM table1 before accidentally executing the query, the query errors out instead of deleting every row. More modern versions of MySQL also use the flag to block queries that return excessively large result sets, require the WHERE to reference a key column, or accept a LIMIT in place of a WHERE.
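
In the meantime, the closest thing to a safety net is the manual discipline the flag would automate: run destructive statements inside an explicit transaction so a missing WHERE can be rolled back (table1 and pk are the same hypothetical names as above):

BEGIN;
DELETE FROM table1;   -- oops, executed before typing the WHERE clause
-- psql prints the affected row count; a huge number here is the red flag
ROLLBACK;             -- nothing lost
-- with the intended WHERE clause you'd COMMIT instead:
-- DELETE FROM table1 WHERE pk = 10;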

Update: User felixyz on lobste.rs points out that there is a pg_safeupdate server-side extension that blocks UPDATE/DELETE without WHERE.

Update: Andrey Borodin and Nikolay Samokhvalov of Postgres TV streamed themselves implementing this behind a GUC! They sent in the patch but it was rejected by Tom Lane.

Unit test mode (random result sorting)

The order of a query’s results is famously unspecified unless you nail it down with an ORDER BY. This flexibility is useful: sometimes you need a stable order for whatever code consumes the query results and sometimes you just don’t. But there is a secret, third thing: some code comes to accidentally depend on a coincidental ordering of the results. To help flush these out, it would be neat if PostgreSQL could intentionally shuffle the order of query results whenever a query did not specify an ORDER BY. I imagine this as a run-time configuration field for the database that you’d turn on when running automated tests or in a testing environment. The alternative, folding every query in your application into a subquery that randomly sorts the results, is ugly and invasive, although perhaps some ORMs could implement the feature at their level.
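
For the record, that workaround looks something like this, with the inner query standing in for whatever your application actually runs (users, id and name are placeholder names):

SELECT *
FROM (
    SELECT id, name FROM users   -- the original, ORDER BY-less query
) AS unordered
ORDER BY random();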

Update: User avianlyric on Hacker News pointed out that CLUSTER can be used to reorder results in a pinch.

Query progress in psql

This is another feature from MariaDB. In both MySQL and MariaDB you can query the server for progress information on several types of statements, but ALTER TABLE, CREATE INDEX, DROP INDEX and LOAD DATA INFILE also automatically report their progress to the CLI client. The CLI uses this to draw a nice animated status bar with some informational messages. PostgreSQL lets you query the system catalog for the progress of several statement types but has no live-update functionality in the psql client.
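
Here is the manual polling you are left with today while a CREATE INDEX runs in another session (pg_stat_progress_create_index has been available since PostgreSQL 12):

-- psql won't surface any of this on its own; you have to run it yourself.
SELECT pid, phase, blocks_done, blocks_total, tuples_done, tuples_total
FROM pg_stat_progress_create_index;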

Pandas-like join validation

I am a true database expert who has never screwed up a join before. But there are many beginners out there and it would be neat to extend the SQL language to support validation features like Pandas has for its join function. When doing a join you’d specify which relation you expect to have unique keys and which one has potential duplicates, and PostgreSQL could warn or panic when this assertion is violated. Some hypothetical syntax:

SELECT x, y
FROM t1
JOIN t2 USING (key) VALIDATE 1:m

This would have PostgreSQL check that t1.key has no duplicate values and error if it comes across any. This is another place where it would be useful to have a runtime option to disable the expensive uniqueness checks.
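
Until syntax like that exists, the assertion has to be run by hand as a separate query, something like:

-- Any rows returned here mean t1.key is not unique and a 1:m join would fan out.
SELECT key, count(*)
FROM t1
GROUP BY key
HAVING count(*) > 1;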

Pandas’ “suffixes” argument, which automatically renames overlapping columns in the output, and “indicator”, which emits a new column identifying the left or right table or both as the source of an output row, are also neat ideas that seem really helpful for relational newbies trying to troubleshoot their joins. Pandas is insanely popular, it’s taught to a lot of very inexperienced developers, and I think features like these were earned with newbie blood. I suspect PostgreSQL’s users and developers don’t get that sort of exposure to newbies and their typical problems but there’s no reason why they can’t learn from the hard-earned experience of Pandas and incorporate some good ideas.
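
The indicator behavior in particular can be emulated in plain SQL today, though nobody would call this newbie-friendly (t1, t2 and key are the same placeholder names as above; _merge mirrors Pandas’ default column name):

SELECT key,
       CASE WHEN t2.key IS NULL THEN 'left_only'
            WHEN t1.key IS NULL THEN 'right_only'
            ELSE 'both'
       END AS _merge
FROM t1
FULL JOIN t2 USING (key);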

Update: User richbell on Hacker News brought up DuckDB's Friendlier SQL, a collection of sensible extensions to standard SQL. I endorse all of these ideas except for the trailing comma fix - join the dark side and use leading commas!

JIT support for CREATE INDEX

Today, the JIT support in PostgreSQL is closely tied to query planning and execution. It rests upon the object-oriented nature of the execution plan and emits LLVM IR for each plan node as it traverses the plan tree. It’s an elegant technical approach, and for its sweet spot of long-running analytical queries it can cut down on query time and free up CPU time.

DDL statements are not JITed, and for good reason - the vast majority of them are going to be I/O bound and not particularly interesting or useful to JIT. However, there is one exception: CREATE INDEX. Even though it does a lot of I/O (typically scanning the entire table), in my experience it is CPU-bound. Even moderately sized tables of tens or hundreds of gigabytes can take 10-30 minutes to index. And sometimes you just have to reindex everything at once, like when you restore a backup with pg_restore. JIT compilation of the index creation code would be a huge and immediate benefit to just about every PostgreSQL user. And the benefit is real: the PGCon 2017 presentation “JIT-Compiling SQL Queries in PostgreSQL Using LLVM”, which first explored JIT compilation, found up to a 19% improvement in index creation time.
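
For contrast, the knobs that exist today only influence planned queries; a quick sketch (big_table and its column x are made-up names):

SET jit = on;
SET jit_above_cost = 0;                          -- force JIT even for cheap plans
EXPLAIN (ANALYZE) SELECT sum(x) FROM big_table;  -- plan output gains a JIT section
CREATE INDEX ON big_table (x);                   -- DDL: interpreted, no JIT today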

Reduce the memory usage of prepared queries

Like many databases, PostgreSQL has a PREPARE statement that takes an entire query with placeholder values and stores its query plan for future re-use. You can later invoke the stored query with EXECUTE, passing values in for the placeholders, and take advantage of all the precomputed query planning work. If your query is executed more than once, even if only a few times, this winds up being a measurable win. The database feature has been around for years and years and years, but over on the Python driver side, we’re finally starting to see database drivers automatically preparing and managing prepared queries for you (see psycopg3 and asyncpg). The rollout of these drivers is slow, even if they are production-ready, but over time we are only going to see more and more inadvertent users of prepared statements.
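
For anyone who has not used them directly, the explicit flavor looks like this (fetch_user and the users table are placeholder names); drivers like psycopg3 and asyncpg issue the equivalent protocol-level prepares for you:

PREPARE fetch_user (int) AS
    SELECT id, name FROM users WHERE id = $1;
EXECUTE fetch_user(42);    -- re-uses the stored planning work
EXECUTE fetch_user(43);
DEALLOCATE fetch_user;     -- drop it once you know it won't be used again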

On the server side, prepared statements are per-connection. They don’t take up a ton of memory, especially on the sort of beefy machine you’d use as a database host, but they take up more than you might expect, to the point where it made sense for me to deallocate prepared plans when I knew they wouldn’t be used anymore. The per-connection scope means that there’s no way to share query plans across connections. The first feature I’d like to see around PREPARE is a way to share prepared statements in some sort of global cache. A typical OLTP database is going to have a handful of long-lived connections, behind some sort of bouncer, and some number of hot queries, whether they are determined automatically by a driver or explicitly by the application developer. A global cache could be hydrated when a long-lived connection starts up or as queries are issued, and it would pick up some easy performance wins.
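
You can see the per-connection scope for yourself: the pg_prepared_statements view only lists statements prepared on the current connection, and other backends’ statements are invisible to it.

SELECT name, statement, prepare_time, from_sql
FROM pg_prepared_statements;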

This issue was reported in 2019 by Daniel Migowski, who provides some statistics. He happens to be working with some decently large queries which turn out to have 30-45 MB stored query plans, totalling 7 GB of RAM over his dataset of around 250 queries. It can get ugly! Under the hood, the PREPARE statement stores everything in struct CachedPlanSource, which includes the original query’s text, its parse tree, its query tree, and possibly the plan tree if the database decides to re-use a generic plan for each execution. All of this is potentially needed to execute the prepared statement in the future.
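
If you want to see where that memory goes on your own connection, PostgreSQL 14 added a view over the backend’s memory contexts; as far as I can tell the plan cache contexts are the ones named CachedPlan*, with the query text in the ident column:

SELECT name, ident, total_bytes
FROM pg_backend_memory_contexts
WHERE name LIKE 'CachedPlan%'
ORDER BY total_bytes DESC;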

But I’d also like to draw your attention to an email by Tom Lane later in the thread where he suggests some techniques to reduce the memory usage of the individual query plan. In particular, one of the suggestions is:

A totally different idea is to make a variant version of copyObject that is intended to produce a compact form of a node tree, and does not create a separate palloc allocation for each node but just packs them as tightly as it can in larger palloc chunks.  [...] The stumbling block here is that nobody is gonna tolerate maintaining two versions of copyfuncs.c, so you'd have to find a way [...] to autogenerate the copy functions from annotated struct definitions; people have muttered about that for years but not done anything.

Some background on PostgreSQL’s memory management circus: there are a lot of tree data structures in the backend of the database. Originally, when the database was written in Lisp, the trees and their nodes were what I’d broadly call object-oriented, with standard interfaces, constructors and destructors. When the database was ported to C, its developers created the copyObject function, which can traverse a tree, inspect each node and do the appropriate work to copy it. I’d still call it object-oriented, but it is definitely C-flavored OO. This high-level interface over trees is the first of three legs of memory management in PostgreSQL.

The second leg is memory contexts. Each context has a malloc-like interface (palloc, as mentioned by Tom above) that allocates within that context. All memory contexts are organized into a hierarchy, and there is always a current context used by palloc() that gets set by the backend as it moves through the different stages of query execution. By grouping allocations with the same lifetime into one context you reduce memory fragmentation but also make cleanup easier - when the user is finished with a query or it aborts with whatever error, you can tear down the relevant parent memory context and have all of its children recursively cleaned up.
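
That same pg_backend_memory_contexts view exposes the hierarchy, if you want to poke at it:

-- The top of the current backend's context tree; children roll up under parents.
SELECT name, parent, level, total_bytes
FROM pg_backend_memory_contexts
ORDER BY level, total_bytes DESC
LIMIT 15;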

The memory context API is also object-oriented in that it is a standard interface implemented by several sub-allocators. They make up the third leg of memory management in the database. The most frequently used one is AllocSet, an arena allocator. This is the actual code that calls malloc() and free() to get whole blocks of memory from the system. AllocSet maintains its own freelist to avoid too much round-tripping between PostgreSQL and the system allocator and groups smaller allocations of the same size onto the same page to reduce fragmentation.

There’s a lot of flexibility in this system but there’s also a decent amount of overhead, as each step allocates one or more structs for its own bookkeeping data. Avoiding this overhead is what Tom is proposing: there could be a special copyObject that figures out the full size of the tree it has to copy, makes a single allocation, and packs everything into it, avoiding most of the overhead from all the small data structures needed for the prepared query. This is the second feature I’d like to see for reducing PREPARE memory usage.

An entirely new allocator?

AllocSet is a mature, battle-tested, and pretty simple piece of code which has sat at the heart of PostgreSQL since 1999 with little change to its design since then. There is something to be said for the longevity of such a fundamental piece of code, and I would not be surprised if alternative implementations couldn’t beat its performance simply because of its simplicity. Still, I’d like to see someone try to slot one of the modern malloc implementations (e.g. jemalloc, mimalloc, tcmalloc) into PostgreSQL. Even if it isn’t a win I’d like to see the results. Modern implementations take advantage of everything we’ve learned about writing fast memory allocators on modern hardware and operating systems and have seen lots of real-world testing. I also don’t think the palloc()/memory context API needs to change at all; modern allocators typically have an arena-style interface and could be slotted in behind the memory context API without breaking it. Some concerns would be licensing or performance losses from synchronization (the PostgreSQL backend is single-threaded, so any effort spent on locking or synchronization is wasted), but I am hopeful that there is a malloc that can fit the bill.

Batching query plans for joins

In the terminology of the PostgreSQL query planner, when you join two tables, the database picks one table to be the outer relation and one to be the inner relation. The outer relation is the table that gets scanned first and the matching rows it emits are checked one at a time against the inner relation to get the final, joined result row(s).

Most uses of a foreign data wrapper have you talking to another database system over a network. With that comes overhead from using the network and some per-query overhead in the foreign database. It would be bad to repeatedly query the foreign database in an inner loop as you'd pay this overhead each time. Unfortunately, if the database picks your foreign data wrapper to be the inner relation in a join, you get one query for each row emitted by the outer relation and pay that overhead every time.

What I'd like to see is an execution technique (in PostgreSQL terminology, a path) that is capable of batching rows from the outer relation, sending them as one query to the inner relation, and sorting out the results. I have a hunch this would be a win for typical FDW use cases but I have not tested it. This would be useful for the built-in PostgreSQL-to-PostgreSQL wrapper (postgres_fdw), which has to emit one query per outer-relation row, but once implemented the same path could be used in other wrappers. It would be a useful addition to the core database, but it's actually possible to implement your own paths in an extension via the little-known custom scan path API. A sketch of the difference appears below.
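
Here is that sketch, from the remote server's point of view (remote_table and key are made-up names, and the batched form is what the hypothetical path would send, not anything postgres_fdw does today):

-- Today: the inner side is queried once per outer row.
--   SELECT * FROM remote_table WHERE key = $1;    -- executed N times
-- With batching: collect a chunk of outer rows and ship them in one round trip.
SELECT *
FROM remote_table
WHERE key = ANY (ARRAY[1, 2, 3 /* ... one entry per outer row */]);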