Repack Concurrently | Scaling Postgres 414

Episode 414 | April 26, 2026 | 00:19:55

Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, we discuss repack concurrently coming to PG 19, how to design your schema, all about hints, and enforcing constraints across partitions.

To get the show notes as well as get notified of new episodes, visit: 
https://www.scalingpostgres.com/episodes/414-repack-concurrently/

Want to learn more about Postgres performance? Join my FREE training called Postgres Performance Demystified here: https://www.scalingpostgres.com/courses/postgres-performance-demystified/

Episode Transcript

[00:00:00] I tried using pg_repack once and I had some issues with it. It didn't really finish fully compacting the table. Now granted, the table was almost a terabyte, so maybe I should have started with smaller tables, but I've never really used the pg_repack extension to try to compact tables and get rid of bloat. I have heard some good things about pg_squeeze, but I've never really used that either. What I have used a lot to compact things is reindexing. So reindexing your indexes is a great way to reduce bloat in your indexes. Now of course it won't touch your tables, but for many of the databases I work with, index bloat is much worse than the table bloat. And basically I've relied on logical replication upgrades to actually compact the table as opposed to using one of these tools. Well, that may change, because it looks like repack might be coming in Postgres 19, and especially repack concurrently. But I hope you, your friends, family and co-workers continue to do well. Our first piece of content is "Waiting for PostgreSQL 19 – add CONCURRENTLY option to REPACK". This is from depesz.com, and I'm mostly going to be looking at the commit notes here.

[00:01:25] So when you repack concurrently, it no longer acquires an access exclusive lock while the new copy of the table is being created.

[00:01:34] It creates this copy under a share update exclusive lock, which is basically the same as vacuum.

[00:01:41] And importantly, this copy is following a snapshot that's been taken with a replication slot enabled. So it basically uses a background worker to logically decode the changes and stash them in a concurrent data changes area.

[00:01:58] And then those changes are reapplied to the copy just before swapping the relfilenodes, or basically flipping over to the new table. And they say applications can continue to access the original copy of the table normally until just before the swap, at which point an access exclusive lock is needed.

[00:02:18] Now he says there are some loose ends. This process needs a dedicated replication slot. And secondly, due to the way snapshots are set up, you can only have one repack process running at a time in the whole system, which I don't think is necessarily a downside.

[00:02:35] And of course there's a danger of deadlocking and aborting the process during the final phase, where it's trying to take an access exclusive lock.

[00:02:45] But irrespective of these loose ends, this is awesome. And depesz actually runs through the process of using this. I think this is awesome if it gets into core, because that means a lot more people are going to be using this regularly and it'll hopefully become more and more reliable. So definitely check this out if you're interested. The next blog post is related: "Understanding PostgreSQL repack through repack.c". This is from highgo.ca, and here they walk through the C code and actually explain how it works from a nuts-and-bolts perspective. So if you want this greater detail, definitely check out this blog post.

[00:03:25] Next piece of content: there was another episode of Postgres FM. This one was on a schema design checklist, so they mentioned about five to seven items that they would suggest when building your schema. The first is to use either bigints or UUID version 7s for your primary keys. I think the debate still rages as to what people prefer, but definitely don't use standard ints, use bigints. And definitely don't use UUID version 4s that are entirely random; use a time-ordered one like version 7.
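As a rough illustration of that primary key advice (the table and column names here are made up, and uuidv7() assumes Postgres 18 or newer; on older versions you would generate version 7 UUIDs in the application or via an extension):

    -- bigint surrogate key
    CREATE TABLE orders (
        id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        created_at timestamptz NOT NULL DEFAULT now()
    );

    -- or a time-ordered UUID key
    CREATE TABLE events (
        id uuid PRIMARY KEY DEFAULT uuidv7(),
        created_at timestamptz NOT NULL DEFAULT now()
    );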
[00:04:04] The next recommendation is using not null for your columns, and I generally like to use not null on everything, especially text and booleans, and I set defaults for those in the database columns. In terms of integers and timestamps, I let the use case drive whether I want them to be null or not null. Next is setting up foreign keys for your database. They definitely recommend that, especially if it's a new database, because it helps you maintain your data integrity, although they were a little bit cautious about cascading too many things. Next is using tools like constraints instead of setting up things like enums, which are really difficult to change. Then they talked about engaging in column tetris. This is ordering your columns to get the maximum packing of your columns together, to make the space in the table very efficient. I pretty much never do this, because so many things change with tables: oh, we need to add this new column; oh, we don't need this column anymore. Especially on new projects it's really hard to do this, but if you have a schema structure you're relatively sure is not going to be changing, especially if it's something like a data warehouse or a data mart, it makes a ton of sense to go ahead and go through the process of arranging columns in the most efficient order.

[00:05:27] They also talked about over-indexing and under-indexing, and they said with people talking to AIs, it's very easy to get over-indexed. They did cover partitioning and they debated whether you should pursue this early or not.

[00:05:42] Personally, there are some downsides to partitioning. The first one is you cannot drop an index concurrently from a partitioned table. There's a process to add it concurrently, but you cannot drop it concurrently without locking the table and all of its partitions, so that reduces your flexibility. And secondly, globally unique indexes aren't supported, so you can't have a unique index on that table unless it includes the partition key. So for these reasons I usually don't like partitioning from the start. And the last thing they covered is RLS, which is row-level security. And the thing is, when you're considering RLS you definitely need to think about performance. But this was a great episode. If you want to learn more, you can listen to the episode here or watch the YouTube video down here.

[00:06:30] Next piece of content: I don't know what happened this week, but Christophe Pettis at thebuild.com released seven or eight blog posts, so we'll hit on some of them, but he was definitely prolific last week.

[00:06:46] The first blog post is "Hints, Part 1: The State of the Art Everywhere but Here". And this post talks about what hints are across different database platforms. He talks about Oracle, and basically these are SQL-based hints that you embed with a special comment along with the SQL, which to me is a nightmare to deal with. And he expresses, quote, they built an elaborate culture around hints, and then, quote, when someone moves from Oracle to PostgreSQL, they arrive with tens of thousands of queries containing hint comments and the expectation is that their new database will honor them. I can't fathom how that's manageable. He does cover SQL Server hints, which go in an OPTION clause at the end of the query, and talks about MySQL, which actually has two generations of hint syntax, again in the SQL. And me looking at this, I'm thinking, why would you want this?
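For concreteness, here is roughly what those in-query hints look like in the systems he names (illustrative syntax only, reproduced from memory; check each database's documentation):

    -- Oracle: hint in a special comment immediately after SELECT
    SELECT /*+ INDEX(e emp_name_idx) */ *
    FROM employees e
    WHERE e.last_name = 'Smith';

    -- SQL Server: hints in an OPTION clause at the end of the statement
    SELECT *
    FROM orders o JOIN customers c ON c.id = o.customer_id
    OPTION (HASH JOIN, MAXDOP 4);

    -- MySQL: an older generation of index hints in the table reference,
    -- and a newer generation of /*+ ... */ optimizer hints
    SELECT * FROM orders USE INDEX (orders_created_at_idx)
    WHERE created_at > '2026-01-01';

    SELECT /*+ JOIN_ORDER(o, c) */ *
    FROM orders o JOIN customers c ON c.id = o.customer_id;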
He also mentioned MariaDB and DB2, and he says the general pattern is that you have inline hints in the query text, predominantly Oracle and MySQL, and then you have out-of-band plan guidance, such as SQL Server plan guides and DB2 optimization profiles, and each has pros and cons that he mentions here.

[00:08:02] But what's interesting is that pg_plan_advice, which is being developed potentially for Postgres 19, is aimed squarely at the second.

[00:08:11] So out-of-band plan guidance: you don't have to embed all of these hints in your query text, which personally I think is a win. The next post is "Hints, Part 2: Features We Do Not Want", and this basically talks about the history of why Postgres has resisted hints all this time, and I'll just hit the things he mentioned here. First is maintainability.

[00:08:33] And this kind of goes back to what he was mentioning with Oracle: if you have 10,000 queries, you have 10,000 places to audit for the hints. Second, you have upgrade interference, where he says today's helpful hint becomes tomorrow's performance problem. Third is that it enables bad DBA habits. So rather than doing a statistics fix or a schema change or writing a better query, you reach for a quick hint rather than doing the harder work.

[00:08:59] Fourth is that it does not scale with the data size. So a hint that's right for a 1 million row table is different from one that's right for a 100 million row table.

[00:09:08] Fifth, he says they're usually unnecessary, because most of the time the planner is right. But personally I think that's only right if it has the correct information. So you do need accurate statistics and good configuration to make sure the planner is making the correct decision based upon reality. And lastly, interference with planner development. In other words, people use hints to bypass issues with the planner, and the developer team would rather understand there's a problem and fix the planner. Now he says Postgres actually does have hints, or a way to guide what the planner does, and that's through configuration options. So in a session you can turn off sequential scans, nested loops, hash joins, etc. You can also adjust planner costs with regard to the random page cost, sequential page cost, CPU tuple cost, and the effective cache size. But Postgres has never had those explicit inline query hints. He says the answer to that has been the extension pg_hint_plan, which basically gives you Oracle-style hints, so you could always use that if you choose to. What's interesting is that the argument has shifted. It's not necessarily about letting users override the planner, but that we should let users stabilize their plans across upgrades.

[00:10:27] So plan stability is a hugely important thing, especially as your database size grows.

[00:10:33] So having the ability to support that is great. And that's kind of part of the reason that pg_plan_advice was developed.
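For comparison, here is roughly what the levers Postgres already offers look like, the session-level planner settings mentioned above and a pg_hint_plan style comment (the table, index, and cost values are just examples):

    -- Session-level "hints": discourage plan shapes or adjust cost assumptions,
    -- ideally scoped to one transaction with SET LOCAL
    BEGIN;
    SET LOCAL enable_seqscan = off;
    SET LOCAL enable_nestloop = off;
    SET LOCAL random_page_cost = 1.1;
    EXPLAIN (ANALYZE) SELECT * FROM orders WHERE customer_id = 42;
    COMMIT;

    -- With the pg_hint_plan extension, an Oracle-style hint in a leading comment
    /*+ IndexScan(orders orders_customer_id_idx) */
    SELECT * FROM orders WHERE customer_id = 42;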
[00:10:41] So now we go on to the third blog post, "Hints, Part 3: Advice, Not Orders", and it talks about pg_plan_advice.

[00:10:48] This is the core mechanism for generating the plan advice strings that you see. So if you run EXPLAIN with plan advice, it tells you what it's going to do. And then you can set specific advice for a particular query in a session or a transaction by setting the plan advice to a text string that you want the planner to follow.

[00:11:11] Now, there is additional tooling that was added, called pg_collect_advice and pg_stash_advice. But these are example extensions showing how advice collection could be extended, or how advice strings could be matched to queries via a query identifier. So this goes in the configuration of the database to say, hey, when you see this query, go ahead and use this advice. That advice lives in the database configuration as opposed to in all of your SQL.

[00:11:37] And he goes over all the details, some of which we've covered in previous blog posts about this.

[00:11:43] Now, unfortunately, there's a quote that time is rapidly running out to get things into version 19.

[00:11:49] So we may not have pg_plan_advice in 19, and there are still some arguments back and forth, he says, on the listservs. But it's not wholesale resistance to it; they're just negotiating how it will actually end up in the product.

[00:12:05] But check this out if you're interested.

[00:12:08] Next piece of content, also from thebuild.com, is about async I/O in PostgreSQL 19, the year after.

[00:12:15] So async I/O was released in 18, and it had various I/O methods you could set. The default was worker, and you could set the number of workers. Well, that's becoming more dynamic in Postgres 19, apparently.

[00:12:29] So additional configuration options have been added: you have io_min_workers and io_max_workers, so you can set a range for your worker pool. You can define when workers can be reaped with io_worker_idle_timeout, as well as how quickly they can be added with io_worker_launch_interval. So he says this is equivalent to an autoscaler: you give it parameters and it dynamically adjusts the workers based upon the activity of the database. So this is definitely cool. He says you definitely want to benchmark your performance on Postgres 19. They have also reworked the read-ahead scheduler for large I/O requests and added an I/O option to EXPLAIN ANALYZE to track this asynchronous I/O activity per plan node. So it definitely looks like some great stuff is coming in 19 with regard to async I/O.

[00:13:24] Next piece of content is "Enforcing constraints across Postgres partitions". This is from pgedge.com, and really the constraints he's talking about are unique constraints, because again, Postgres does not have globally unique indexes. When working with partitions, you can put a unique index on an individual partition but not across them.

[00:13:44] So in the example he has here, where you're partitioning by a range of IDs, you cannot have a global unique index that does not include the partition key. So if you include the partition key, sure, you can have a unique index, but you can't, say, have a unique constraint on the event date timestamp. But if you want that, how can you accomplish it? And he says, well, you could use triggers.

[00:14:08] So in this example, he puts a deduplication ID on the events table, creates an index on it, and then adds a trigger that looks to see if that deduplication ID exists anywhere in the table and, if it's found, raises an exception.

[00:14:24] So this works. The disadvantage of it is that inserts become three times slower because of that trigger.

[00:14:32] Now, you can make it more efficient by creating a dedicated duplicates table, so it only has one column, the deduplication ID in his case. And now the trigger function just inserts the value into that table, and if it fails, it means you've hit a duplicate. And he says you also need to add a trigger to handle deletions from the events table as well.
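A minimal sketch of that dedicated-table approach could look like this (the table and column names are assumptions, not taken from the post, and the matching delete trigger is omitted):

    -- One-column table whose primary key enforces global uniqueness
    CREATE TABLE events_dedup (
        dedup_id uuid PRIMARY KEY
    );

    CREATE FUNCTION check_dedup_id() RETURNS trigger AS $$
    BEGIN
        -- A duplicate dedup_id raises a unique_violation here,
        -- which aborts the insert into the partitioned events table
        INSERT INTO events_dedup (dedup_id) VALUES (NEW.dedup_id);
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER events_check_dedup
        BEFORE INSERT ON events
        FOR EACH ROW EXECUTE FUNCTION check_dedup_id();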
[00:14:56] Now, this is a little bit faster, but it's still twice as slow as doing an insert without a trigger, at least with the 2 million row insert he did. Now, another enhancement he adds is that if you're concerned with the size of the deduplication table, well, you can partition that too. So he shows an example of partitioning it by hash to keep that efficient as well. So there are definitely some downsides to doing this, but it is a way to handle uniqueness across partitioned tables when you can't add the partition key to the index. So check this out if you're interested. Next piece of content: "How Postgres ctids gave me a 30x speedup on processing 200 million rows". This is from connorhallman.com, and he had an issue where he needed to process 200 million records, or 40 gigabytes. He needed to process every row, do it in batches, and be able to pause and resume. Now, the other benefit he had is that the data was completely static, so there were no inserts, updates, or deletes happening on this table during the processing.

[00:16:06] So his first attempt was just adding a processed-result column, putting a partial index on it, and after a batch is processed, marking the whole batch as processed. But that was slower than he anticipated; his fetch was slow.

[00:16:23] So the index was fast, but it had to jump all over the heap looking for the data.

[00:16:29] And the update was worse, because Postgres had to write new tuple versions and update the partial index.

[00:16:36] So his next iteration was to use a separate results table, and he also memory-mapped the IDs. So he loaded all the IDs into a local memory-mapped file, and he stored the results in a separate table. Similar to the last post, where all the deduplication IDs got their own table, he's putting all the processed results in a dedicated table.

[00:16:57] Now, because of this, the writes were much faster.

[00:17:00] The problem was the fetch was even slower, because they were loading a batch by UUIDs.

[00:17:07] So again, they were going all over the disk trying to retrieve this data. But since his data was static, he said, maybe I can use the physical row identifier, essentially the ctid.

[00:17:19] So that way he knows there's alignment between the physical representation and how he's querying the data.

[00:17:26] So he used a ctid-based query to pull back the data a thousand rows at a time in physical order.

[00:17:33] Now, ctids are not stable, but if you're not changing the data, that should be fine. And using this solution was essentially 30 times faster than his initial implementation. And he says that because he's using the ctid-based processing approach, it is possible to parallelize it as well. But if you want to learn more, definitely check out this blog post.
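His exact query isn't reproduced here, but a ctid-batched scan generally looks something like this (table and column names assumed; it relies on the table staying static, and the tid comparison operators and TID range scans arrived in Postgres 14):

    -- Process one slice of the heap per batch, in physical order;
    -- the next batch starts where this one's ctid range ends
    SELECT ctid, id, payload
    FROM big_table
    WHERE ctid >= '(0,0)'::tid
      AND ctid <  '(4096,0)'::tid
    ORDER BY ctid;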
Next piece of content: "When upserts don't update but still write: debugging Postgres performance at scale". This is from datadoghq.com, and they needed to keep track of the hosts that are sending them telemetry data. So they created a table called host_last_ingested that records when data from each host was last ingested. And they wanted to insert the host into the table, and if there's a conflict with that host, just update the last-ingested time, but only do that if the last-ingested date is more than a day old.

[00:18:29] So basically it's an INSERT ... ON CONFLICT DO UPDATE. But the problem with ON CONFLICT DO UPDATE is that it actually locks the row even if the update doesn't happen.

[00:18:40] And they discovered this by looking at the metrics, with increased writes happening predominantly due to WAL. And when they investigated the WAL output using pg_walinspect, they found these locks happening. And it wasn't just a lock, it was also a commit, because a transaction ID was assigned to that transaction. In contrast, INSERT ... ON CONFLICT DO NOTHING doesn't use a transaction ID. So basically they rewrote their query. They're using a CTE, and in the CTE they insert the value; if there's a conflict, they do nothing and return the rows. So basically the CTE gives them the host, and then they do an update only when the insert didn't return a row, meaning the record already existed, and it hasn't been updated in the last day. They do say this introduces a race condition, but they're fine living with that. But if you want to learn more, definitely check out this blog post.
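Their exact query isn't shown in the transcript, but a rough reconstruction of the pattern they describe (hypothetical table and column names, with host_id assumed to be the primary key) is:

    -- Replaces a plain INSERT ... ON CONFLICT DO UPDATE ... WHERE, which would
    -- lock the row and assign a transaction ID even when nothing changes
    WITH ins AS (
        INSERT INTO host_last_ingested (host_id, last_ingested_at)
        VALUES ('host-123', now())
        ON CONFLICT (host_id) DO NOTHING
        RETURNING host_id
    )
    UPDATE host_last_ingested h
    SET last_ingested_at = now()
    WHERE h.host_id = 'host-123'
      AND NOT EXISTS (SELECT 1 FROM ins)                 -- the row already existed
      AND h.last_ingested_at < now() - interval '1 day';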

I hope you enjoyed this episode. Be sure to check out scalingpostgres.com where you can find links to all the content discussed, as well as sign up to receive weekly notifications of each episode. There you can find an audio version of the show as well as a full transcript. Thanks. I'll see you next week.