Episode Transcript
[00:00:00] Historically, when dealing with performance problems, sharding your database is usually the last thing you try. The initial steps are usually to add read replicas to handle heavy read load. But if your writes get up to a certain level, at that point you basically have to consider sharding. It's been interesting watching the advances that have happened in the sharding infrastructure, though. Early on we had Citus that did scale-out sharding, but now we have at least two new entrants into the space.
[00:00:34] One is PgDog; the other is Multigres, which the makers of Vitess for MySQL are building for Postgres. And with some of the advances in these projects, it may make sense to shard earlier than you might have historically.
[00:00:54] I hope you, your friends, family and co-workers continue to do well. Our first piece of content is "Shard Postgres with one command". This is from pgdog.dev, and I'll just quote this first paragraph here: PgDog is on a mission to shard Postgres.
[00:01:10] To make this work, it needs to be possible without downtime and with little to no intervention from the user. On the road to that ambitious goal, we built the reshard command.
[00:01:20] So this is basically one command to shard your database. Now, of course it can reshard things, so if you've already sharded it, it can reshard it to a different topology or number of shards.
[00:01:33] However, it still works even if you have no shards and just one primary database. This blog post walks through the reshard command that's available in PgDog. And again, PgDog is their sharding pooler, I guess you would say; it's a piece of technology that works in conjunction with Postgres, so you can use it in pretty much any environment. You should be able to shard whether you're on hosted Postgres, running your own server instances, or on Kubernetes clusters, apparently. So basically PgDog has a configuration file, and you need to list out your existing primary database. Assume you just have one, and then you want to split that into two shards.
[00:02:20] You would add those shard definitions as well, with the same database name but with different sharded host names, and you start from an entirely empty database on the destination cluster. The resharding itself is done using COPY commands as well as logical replication.
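As a rough sketch, that description implies a configuration along these lines: the source database plus two destination entries sharing one database name. The file name and keys here are my assumptions for illustration, not verbatim from PgDog's docs:

```toml
# Hypothetical PgDog config sketch -- one source database and two
# destination shards that share the same database name.
[[databases]]
name = "app"
host = "source.db.internal"   # the existing primary

[[databases]]
name = "app"
host = "shard0.db.internal"   # empty destination, shard 0
shard = 0

[[databases]]
name = "app"
host = "shard1.db.internal"   # empty destination, shard 1
shard = 1
```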
[00:02:40] So once you have the configuration set up, you do need to create a publication called pgdog for all tables on the source database and then run the reshard command. And that's pretty much it; it says PgDog takes care of everything else: schema synchronization, moving and resharding the data online, and cutting over traffic. With the strategy they're using, they say they do zero writes to the source database; the COPY and logical replication are how they transfer the data over. In terms of how the data is sharded, you can use hash-based sharding, list-based sharding, or range-based sharding, though that last one isn't used that frequently. And apparently even schema-based sharding is possible as well. And of course you need that shard key in every single table of your database.
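The one piece of setup the post calls out can be sketched in SQL, along with an illustration of what hash-based sharding on a shard key means. The publication name comes from the post; whether PgDog hashes with Postgres' built-in hashint8 specifically is my assumption:

```sql
-- On the source database: the publication PgDog expects, covering all tables
CREATE PUBLICATION pgdog FOR ALL TABLES;

-- Illustration only: with hash-based sharding across 2 shards, each row
-- lands on (hash of shard key) mod 2. hashint8() is a built-in Postgres
-- hash function; PgDog's exact hashing scheme may differ.
SELECT user_id,
       abs(hashint8(user_id)) % 2 AS destination_shard
FROM   users;
```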
[00:03:33] So the first thing this reshard command does is it synchronizes the schema. So basically it copies all the schema and objects from the source database to the destination cluster.
[00:03:44] Then it sets up replication slots, starts copying tables, and does a query rewrite to actually route the data to the appropriate destination shard, because they have a specific COPY parser that does this. In addition, you can shard from a replica. They show an example here where you're sharding from one of your database replicas: the users table gets sent to PgDog and split across the shards on an entirely different replica, and it could be copying over the user settings table at the same time, sharding it appropriately during the copy. There are of course monitoring commands so you can check the state, and once everything is in sync, you can do the cutover process.
[00:04:28] But in order to do this cutover, the replication lag has to come down to zero.
[00:04:33] And pretty much the only way to do this is to stop all write queries from reaching the production database.
[00:04:39] And in their experience, they say it is a very brief downtime.
[00:04:43] They say, quote, the entire operation takes less than one second to do that cutover. And PgDog actually updates its own configuration, they say here, and swaps the databases so that you are now using the sharded cluster as your main database. They also say that before they resume traffic, they create logical replication connections from the new shards back to the old production database.
[00:05:06] So this sounds super impressive and I can't wait to hear a little bit more about it. And part of me is wondering, could you do this for a logical replication upgrade and go from an n-of-one cluster to another n-of-one cluster?
[00:05:22] Would that be possible to do with this? But if you want to learn more, definitely check this out. Next piece of content: "Postgres Performance Regression: Are we there yet?" This is from mydbanotebook.org, and this blog post is in reference to the first news article I covered in last week's episode of Scaling Postgres, talking about the performance regression that could be coming to Postgres on the Linux 7 kernel, based upon changes that were made to it.
[00:05:49] So she covers this in detail. I don't have time to go over all of it, but I'll try to hit the highlights. Basically, Postgres has been relying on preempt none for kernel operations, and that's pretty much what Linux 7 got rid of. She says on modern architectures (ARM64, x86, PowerPC, RISC-V, s390, LoongArch, all of them), the kernel now only offers preempt full and preempt lazy. So initially they thought this might be related to Postgres spinlocks, but after further examination it looks like it might be more due to TLB misses; the TLB is the translation lookaside buffer that maps virtual memory to physical memory. And on the large instance size being tested, huge pages were turned off.
[00:06:42] And usually when you have a shared buffers size anywhere close to 100 gigabytes, you're going to want to enable huge pages. But with huge pages off and a 100-gigabyte-plus buffer, the first access to each page causes a minor fault, and when that happens while a spinlock is held, the thread stalls. So Andres Freund redid the test enabling huge pages and could not reproduce the regression, and when he disabled huge pages, the contention reappeared.
[00:07:13] So when you have a large shared buffers size and don't have transparent huge pages or huge pages enabled, this is an issue that can happen.
[00:07:22] And the important thing to mention is that it only really happens on cold startup or on a cold buffer pool. So an ongoing system may not experience this issue.
[00:07:35] So she says, you know, should you be scared?
[00:07:38] Well, assuming nothing changes with regard to the code, if you choose to upgrade to a Linux 7 kernel and you have a large shared buffers size, you should enable huge pages or transparent huge pages on your machine. This shouldn't be too hard to do if you're running on bare metal or a dedicated VM, but if you're running in a containerized environment, adjusting that could be a problem, because you would apparently need access to the host to configure it.
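For reference, enabling huge pages is roughly a two-part change: reserve them on the host, then ask Postgres to use them. The page count below is just an example; size it to cover your shared_buffers:

```ini
# postgresql.conf
huge_pages = on        # or 'try' to fall back to normal pages if none are free

# /etc/sysctl.conf on the host (this is the part containers usually can't touch):
# reserve enough 2MB huge pages to cover shared_buffers, e.g. for ~100GB:
#   vm.nr_hugepages = 51200
```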
[00:08:06] But of course she says at the end you still should test before you upgrade to a Linux 7 kernel just to make sure there are no performance regressions. And again in a running system, not from a cold start, you may not see any performance regression. But definitely check out this blog post if you want to learn more.
[00:08:24] Next piece of content: "Introducing Xata OSS: Postgres platform with branching, now Apache 2". This is from xata.io, and apparently they have open sourced their platform that allows fast branching using copy-on-write, and scale-to-zero functionality so you don't have to spend money on databases that aren't using compute.
[00:08:49] It runs on 100% vanilla Postgres, so it's all the tooling around it that supports this capability, and it was built with production-grade requirements like high availability, read replicas, failovers, upgrades, backups, IP filtering, etc. Their current customers use it for creating preview and testing environments quickly, or as a fully managed Postgres service to run their production database. They talk a little bit about the architecture and how they're using copy-on-write; pretty much they're using ZFS file system storage to do that. And they don't just use it on the same node as the database; it's a separate storage node, so they can separate storage from compute, and they use NVMe over Fabrics to connect to those volumes.
[00:09:38] And some of the core technology they use to run this is CloudNativePG as their Kubernetes operator and OpenEBS for the storage.
[00:09:48] But if you're interested in learning more, definitely check out this blog post.
[00:09:52] Next piece of content: "Checkpoints, Write Storms and You". This is from pgedge.com, and he's talking about checkpoints, which is basically taking all the dirty buffers that exist in memory and writing them to the actual disk, and how you want to keep this optimized for the best performance.
[00:10:09] And he says by default checkpoints run every five minutes. And the checkpoint_completion_target by default is 0.5, but normally people change it to 0.9, so that the checkpoint completes partway through that five minutes. When you set it at 0.9, the writes are spread out across 90% of that interval to achieve the checkpoint. And he says this is the timed checkpoint behavior.
[00:10:36] This can also happen earlier if you exceed your max_wal_size parameter. This is the maximum amount of WAL to keep in the WAL directory; if you exceed it, Postgres needs to do an early checkpoint to keep the number of WAL files under control.
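In postgresql.conf terms, the knobs being described are these (the 4GB value is the larger one from his test, not a universal recommendation):

```ini
checkpoint_timeout = 5min            # timed checkpoint interval (the default)
checkpoint_completion_target = 0.9   # spread checkpoint writes over 90% of it
max_wal_size = 4GB                   # exceeding this forces an early checkpoint
```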
[00:10:51] So he did a little test here to show this: he did a pgbench run where max_wal_size was set at 1 gigabyte, and you can see the transactions per second as the run kicked off. It slowly started rising, and then at about the 40-second mark or so it exceeded max_wal_size and a checkpoint was forced, which knocked the performance way down until the test was complete. In contrast, with max_wal_size set at 4 gigabytes, throughout the test it never hits that threshold to kick off a checkpoint. And again, this test was only 60 seconds in duration, so it's not going to hit that five-minute timed checkpoint either.
[00:11:31] So forced checkpoints can be a big problem, and it's important to set your max_wal_size sufficiently high so that you're not checkpointing too frequently. You can check the status of that, if you're on Postgres 17 and higher, using the pg_stat_checkpointer system view, or on earlier versions you could use pg_stat_bgwriter as well. He also encourages turning on log_checkpoints, because then you can see whether it was a timed or a WAL-based checkpoint that was kicked off.
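A quick way to check that from SQL on both version families:

```sql
-- Postgres 17+: timed vs. requested (forced) checkpoints
SELECT num_timed, num_requested FROM pg_stat_checkpointer;

-- Postgres 16 and earlier
SELECT checkpoints_timed, checkpoints_req FROM pg_stat_bgwriter;
```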
[00:12:02] And he says you also want to be wary of setting max_wal_size too high, because then you have a lot of WAL files staying on your database system that all need to be replayed if the system happens to crash and needs to start a recovery. Now, what's interesting here is he did not mention the checkpoint_warning setting, because this is a way to log the fact that an early checkpoint was done, and I like to set it about five minutes from the checkpoint_timeout. So if I have checkpoint_timeout at 15 minutes, maybe I set checkpoint_warning at 10 minutes, so I know if max_wal_size was exceeded and it did an early checkpoint. So I like using the checkpoint_warning configuration option as well.
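That suggestion maps to settings like these (15 and 10 minutes are just the example pairing from above):

```ini
checkpoint_timeout = 15min   # timed checkpoint interval
checkpoint_warning = 10min   # log a warning when WAL-size-triggered checkpoints
                             # happen closer together than this
log_checkpoints = on         # log every checkpoint and what triggered it
```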
[00:12:45] But check this out if you want to learn more. Next piece of content: "0 autovacuum_vacuum_cost_delay, write storms and you". This is from ardentperf.com, and he's following up the article by Shaun Thomas that we just covered, sharing his experience with WAL writing. He says he encountered a problem where there were a ton of wait events reporting IPC SyncRep. This essentially means the replicas were lagging behind and the primary was blocking on commit acknowledgments. When they looked at the WAL contents, there were a ton of full-page image writes happening, and after further investigation it was coming from a 40-gigabyte TOAST table where he was only seeing inserts. So he was trying to determine where these hint-bit updates were coming from. And he says, quote, a Postgres SELECT can dirty pages by setting tuple hint bits when it reads rows whose inserting or deleting transaction has committed but whose visibility status has not been cached in the tuple header, which commonly happens after recent updates, inserts and deletes or other write activity. And based upon how many tuples were changing per second, he was thinking maybe these dirty buffers were triggering checkpoints. But it wasn't the checkpoints; the performance problem was actually happening when autovacuum was running, and it aligned perfectly with each WAL storm. Now, the thing about this system, he says, is that autovacuum_vacuum_cost_delay was set to zero.
[00:14:22] So vacuum was setting hint bits on this append-only table very fast, and with these being set, it needs to do full-page image writes after a checkpoint, which was potentially causing the WAL storm. With the cost delay set to zero, he says, without any cost throttling, autovacuum was racing through large tables at full speed, dirtying pages with hint-bit updates and writing the full pages to the WAL faster than the system could replicate them.
[00:14:53] So that spiked WAL traffic, replication lag, et cetera. Once they set autovacuum_vacuum_cost_delay to 2 milliseconds, everything smoothed out. And he has this interesting quote here: it's a big server and workload, cost delay zero won't do anything that bad, won't completely burn down the server, so I can probably let that one stay where it is.
[00:15:14] And his other quote here was I was completely wrong.
[00:15:18] So I guess the moral of the story here is don't set your cost delays to zero.
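The fix they landed on corresponds to something like this; 2 milliseconds is the value from the post, and autovacuum settings take effect on a config reload rather than a restart:

```sql
-- Restore autovacuum cost-based throttling instead of running unthrottled
ALTER SYSTEM SET autovacuum_vacuum_cost_delay = '2ms';
SELECT pg_reload_conf();
```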
[00:15:23] Next piece of content: "Waiting for Postgres 19: Reduce timing overhead for EXPLAIN ANALYZE with RDTSC". This is from pganalyze.com. Whenever you run EXPLAIN ANALYZE, there is a performance impact from that measurement. So when you turn on timing in psql, it reports how long it took a query to run. He has a table here with, quote, lots of rows.
[00:15:49] When you run a count from it, it runs in just about two seconds. When you run EXPLAIN ANALYZE with the timing off (this is the timing that appears next to each node in the EXPLAIN ANALYZE output), it now runs maybe 20% slower, at 2.3 seconds.
[00:16:09] But if you run EXPLAIN ANALYZE with the timing on, which is the default when you just use EXPLAIN ANALYZE, it takes over 4.2 seconds. So it takes twice as long to run compared to not using EXPLAIN ANALYZE at all, which is a pretty significant performance difference.
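The three measurements correspond to invocations like these, where big_table stands in for his "lots of rows" table (timings are his, from the post):

```sql
SELECT count(*) FROM big_table;                                -- baseline: ~2s

EXPLAIN (ANALYZE, TIMING OFF) SELECT count(*) FROM big_table;  -- ~2.3s

EXPLAIN (ANALYZE) SELECT count(*) FROM big_table;              -- TIMING ON is
                                                               -- the default: ~4.2s
```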
[00:16:28] So he was actually involved in an optimization to improve this for Postgres 19.
[00:16:35] And they're using different CPU clock instructions to do this optimization, because the instruction currently used waits until all prior instructions have finished.
[00:16:47] So basically timing is not just getting the time, but it's blocking other activity from occurring.
[00:16:52] But you can use a slightly different instruction, he says, RDTSC without the P, and it does not have this issue of blocking other concurrent instructions.
[00:17:03] So they wanted to use this new capability.
[00:17:06] And in Postgres 19 there is presumably going to be a timing clock source setting. By default it's going to use the faster method, but it will fall back to the old way.
[00:17:18] Or you can configure it of course.
[00:17:20] And here's an example of what the performance looks like. He is using a different machine for this test, so the performance is different. So he runs it without ExplainAnalyze. It runs in 268 milliseconds.
[00:17:32] He then runs it with timing off. It runs in 297 milliseconds. So there's still an impact, of course.
[00:17:39] And then when he turns timing on, it runs in 352 milliseconds. So there's still about a 20 to 25% overhead from EXPLAIN ANALYZE; it's not free, but it's not double like we saw before. And in this scenario, when he set the timing clock source to the old way of calculating timing, it ran in 800 milliseconds.
[00:18:02] So that's more than double the time it took originally, which is crazy.
[00:18:07] So this looks great; if you're interested in learning more, you can check this out. Next piece of content: "Waiting for PostgreSQL 19: Online enabling and disabling of data checksums".
[00:18:17] So I like this quote down here. Wow.
[00:18:21] So in Postgres 18 we got checksums enabled by default. Well, presumably in 19 you will be able to turn them on or off on a running cluster, which is crazy.
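If the committed interface matches earlier revisions of this long-running patch series, enabling checksums online looks something like a function call that kicks off background work; the function name here is my assumption from those earlier revisions, not confirmed from the final commit:

```sql
-- Hypothetical, based on earlier patch revisions: start enabling data
-- checksums in the background on a running cluster
SELECT pg_enable_data_checksums();

-- data_checksums is a read-only parameter today; it would report 'on'
-- once every page has been checksummed
SHOW data_checksums;
```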
[00:18:33] And here's the first quote from the patch: this allows data checksums to be enabled or disabled in a running cluster without restricting access to the cluster during processing, which is awesome. So if you want to learn more about that, definitely check out this blog post. Next piece of content: "Understanding PostgreSQL wait events". This is from richyen.com, and he talks about different wait events and how to analyze them in Postgres.
[00:18:59] And last piece of content: "Waiting for PostgreSQL 19: new pg_get_..._ddl functions". This is from depesz.com. They show this database function here, pg_get_database_ddl: what you do is pass it a database and it will show you the DDL to recreate that database and its configuration.
[00:19:21] They also have one for recreating a role as well as recreating a table space.
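Per the post, usage appears to be passing the object's name to the matching function. I'm reading the exact function names from the post's title, so treat them, and the argument form, as approximate:

```sql
-- Approximate usage: emit DDL that would recreate each existing object
SELECT pg_get_database_ddl('mydb');
SELECT pg_get_role_ddl('app_user');
SELECT pg_get_tablespace_ddl('fast_ssd');
```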
[00:19:28] So this lets you see the DDL to recreate existing objects. So if you want to learn more, definitely check that out. I hope you enjoyed this episode. Be sure to check out scalingpostgres.com where you can find links to all the content mentioned, as well as sign up to receive weekly notifications of each episode. There you can also find an audio version of the show as well as a full transcript. Thanks. I'll see you next week.