Episode Transcript
[00:00:00] When you have to start scaling Postgres, usually the first thing you do is scale out reads, because in most applications the reads outweigh the writes. But if your writes get to the point that even a large instance can't handle them all, then usually the next alternative is sharding.
[00:00:21] Now some people have sharded their application along some sort of application boundary, but you can also shard by each customer in your database, like you could do with Citus. But the problem becomes: when you want to start sharding, how do you actually move to it? Because you already have an existing database that is probably fairly large, how do you migrate to sharding? Usually I hear about people having to do it in the application, sending dual writes and dual reads as they go through the migration process.
[00:00:55] But what if there was a push-button way to do it? We're going to talk about that in the first blog post this week. Also, be sure to stay till the end for my consulting corner, where I talk about whether Postgres is good enough to do basic text searching. But I hope you, your friends, family and co-workers continue to do well. Our first piece of content is Sharding Postgres at network speed. This is from pgdog.dev, and as a reminder, this is an open source project that basically introduces a pooler in front of one or more Postgres instances that allows you to shard the databases that sit behind it. This particular post talks about some enhancements they've recently made to shard databases really fast.
[00:01:41] So the key to sharding is usually setting up logical replication, because you want to copy the data from one database to a new sharded database.
[00:01:51] And they have an example here: say you have your source database, you're going to send those replication commands to PgDog, and it's going to shard the database. And typically how they do sharding is by tenant. So basically the requirement is for every table to have some sort of tenant column; that could be a tenant ID or, like I have used, an account ID, whatever you use to segment each customer's data.
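As a rough illustration of that requirement, here is a hypothetical tenant-keyed table and the kind of hash routing a sharding pooler could apply to it; the table, column names, and hash function choice below are mine for illustration, not necessarily what PgDog uses internally.

```sql
-- Hypothetical tenant-keyed table: every table carries the tenant (account) id
-- so a sharding layer can route its rows by hashing that column.
CREATE TABLE orders (
    tenant_id  bigint      NOT NULL,
    order_id   bigint      NOT NULL,
    total      numeric     NOT NULL DEFAULT 0,
    created_at timestamptz NOT NULL DEFAULT now(),
    PRIMARY KEY (tenant_id, order_id)
);

-- Illustrative routing: hash the tenant id and take it modulo the shard count.
SELECT tenant_id, abs(hashint8(tenant_id)) % 4 AS shard
FROM orders;
```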
[00:02:16] Now their new architecture actually leverages the enhancement in Postgres 16 where you can do logical replication from physical replicas, because it basically allows you to do what his headline here says: parallelize everything.
[00:02:33] So PgDog is sitting here. It is not logically replicating from the primary database, but from multiple read replicas that are doing physical replication from the primary.
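For context, this is the Postgres 16 capability being leveraged: a logical replication slot can now be created on a physical standby, which is what lets the copy work fan out across replicas. A minimal sketch with an assumed slot name; the primary needs wal_level set to logical, and hot_standby_feedback is recommended on the standby.

```sql
-- Run on a physical standby (Postgres 16+); the slot name is illustrative.
SELECT * FROM pg_create_logical_replication_slot('pgdog_copy_slot', 'pgoutput');

-- The standby then serves logical changes from this slot much like a primary would.
```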
[00:02:45] So that enables PgDog to copy the users table, say, from this replica and send the results to each of the different shards, while this other process in PgDog copies the orders table and sends it to specific shards. And the cycle is as listed here: basically, PgDog picks a replica from the config, so one of however many replicas you have; he actually tested with 16.
[00:03:08] You start a transaction, you get the current LSN by creating a logical replication slot, and then you copy the data out. And as the copy is happening, it sends the data to each of the shards. And then of course, once the data is copied out, that slot continues to stream changes, and those keep getting sent to the shards. And what he found is that each stream can read at 500 megabytes per second and the shards can write at the same speed, so basically they can copy, shard and write data at 8 gigabytes per second, which he says is essentially enough to make this process network bound. And if you had a 16 terabyte database, they could reshard it in about 55 minutes. So that's pretty crazy. But what's even more interesting is how they're doing the sharding, because they can't just use logical replication straight from the source to each shard. PgDog basically rewrites the logical replication protocol stream: it takes the XLogData messages and transforms them into Bind and Execute messages. So it receives the logical replication data stream like this, but then it actually sends prepared statements using the query protocol. And he says, yes, the query protocol is slower than the logical replication protocol, but because you're, number one, using prepared statements and, number two, sending it to 16 different databases, it can still keep up with the load. So this is super fascinating to me. And as he says here, the goal for PgDog is to be a push-button scaling solution for Postgres. Although I wonder how this works with foreign keys, because I know when I've set up logical replication, I try to avoid crossing foreign key boundaries when I'm setting up multiple publishers and subscribers, to try to avoid any conflicts. But maybe with the rewriting that PgDog is doing, you can avoid that.
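To make that cycle concrete, here is a hedged sketch of roughly what one copy pass looks like if you were to drive it by hand. PgDog does all of this internally; the slot name, table name, and snapshot value below are placeholders.

```sql
-- One copy pass, roughly as described above (names and snapshot id are placeholders).
BEGIN ISOLATION LEVEL REPEATABLE READ;

-- On a separate replication connection, creating the slot pins a consistent LSN
-- and can export a snapshot for the COPY to reuse:
--   CREATE_REPLICATION_SLOT pgdog_orders LOGICAL pgoutput (SNAPSHOT 'export');

SET TRANSACTION SNAPSHOT '00000003-0000001B-1';  -- snapshot name returned by the slot
COPY orders TO STDOUT;  -- rows get split by tenant hash and copied into each shard
COMMIT;

-- After the snapshot copy, the slot streams every change made after the recorded
-- LSN, and those changes keep flowing to the shards.
```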
[00:05:01] But I found this fascinating. Feel free to check it out if you're interested. Next piece of content. Making Postgres 42,000 times slower because I am unemployed. This is from byteofdev.com and as opposed to speeding the database up, he wants to slow it down.
[00:05:18] And he says all changes must be to parameters in postgresql.conf. So he's not doing anything to make the machine itself slower, like the hardware or the underlying operating system, and he's not adjusting indexes or things like that. He's only changing postgresql.conf. So in his initial test using TPC-C with 128 data warehouses, he got 7,082 transactions per second. The first thing he said was, all right, let's drop shared_buffers, so it probably has to access disk a bit more. He dropped it to eight megabytes, then said, well, can I drop it lower? He eventually got it down to 2 megabytes of shared_buffers, and that dropped it down to 500 requests per second, so much less than the 7,000.
[00:06:05] Then he says, all right, let's do as much background work as possible. So he set autovacuum to run as much as possible: take as many resources as you want, but make the vacuuming process as slow as possible. He reduced maintenance_work_mem, turned on the logging collector, pointed it at certain log destinations, and set log_autovacuum_min_duration to 0, basically logging as much as possible. That dropped it down a bit. And now he says, all right, now I want to write a lot. So he set wal_writer_flush_after to zero and wal_writer_delay to one, made the min and max WAL sizes as small as possible, and put the checkpoint_timeout at 30 seconds. But he wanted to maximize WAL writes, so he set wal_sync_method to open_datasync, which he felt was the slowest method, added logical to the wal_level, and turned on wal_log_hints, summarize_wal and track_wal_io_timing. That dropped the performance down further, but again, not by that much. The thing that hammered it a lot more is that he put random_page_cost and cpu_index_tuple_cost super high, so essentially it wouldn't use indexes. Now the TPS is less than 1 per second, 0.87 requests per second, which is essentially 7,000 times slower than the default tuning. And then he forced I/O into one thread. Why?
[00:07:28] Because he was using the Postgres 19 development version, and he set io_method to worker and configured only a single I/O worker. And doing that knocked it down to 0.1 transactions per second, which is insane. So it's 42,000 times slower than what he started with. Again, this is another post I thought was fascinating. Feel free to check it out if you're interested.
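Pulling the changes he describes together, the de-tuned configuration looks roughly like this. The blog edits postgresql.conf directly; this sketch expresses the same settings as ALTER SYSTEM statements, and the specific values are approximations for illustration rather than his exact numbers.

```sql
-- Approximate reconstruction of the de-tuning described above (values illustrative).
ALTER SYSTEM SET shared_buffers = '2MB';              -- almost no buffer cache
ALTER SYSTEM SET maintenance_work_mem = '128kB';      -- starve vacuum
ALTER SYSTEM SET logging_collector = on;
ALTER SYSTEM SET log_autovacuum_min_duration = 0;     -- log every autovacuum run

ALTER SYSTEM SET wal_writer_flush_after = 0;
ALTER SYSTEM SET wal_writer_delay = '1ms';
ALTER SYSTEM SET min_wal_size = '32MB';               -- as small as Postgres allows
ALTER SYSTEM SET max_wal_size = '32MB';
ALTER SYSTEM SET checkpoint_timeout = '30s';
ALTER SYSTEM SET wal_sync_method = 'open_datasync';
ALTER SYSTEM SET wal_level = 'logical';
ALTER SYSTEM SET wal_log_hints = on;
ALTER SYSTEM SET summarize_wal = on;
ALTER SYSTEM SET track_wal_io_timing = on;

ALTER SYSTEM SET random_page_cost = 1000000;          -- make index scans look absurdly expensive
ALTER SYSTEM SET cpu_index_tuple_cost = 1000000;

ALTER SYSTEM SET io_method = 'worker';                -- route I/O through the async worker pool
ALTER SYSTEM SET io_workers = 1;                      -- ...and give it a single worker
```

Many of these (shared_buffers, wal_level, io_method and friends) require a restart before they take effect.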
[00:07:50] Next piece of content. There was another episode of Postgres FM last week. This one was on the monitoring from Postgres AI. And last week I did mention that Postgres AI had their release week. Well, along with that, on Friday they put out this episode, and they talked more about the monitoring than some of the other announcements that they had. But again, this is a monitoring tool that is built on top of pgwatch 3. It uses Prometheus as its data store and PromQL to access it, but he said he does store the slow query statements in Postgres, because it wasn't efficient to do that in Prometheus. They discussed the different dashboards and what he felt was important and not. And he kind of fought against creating his own monitoring system for a while, but he felt compelled to do it now. It is an open source resource that you can access, because his vision is to have a really good monitoring and data collection tool that LLMs or experts can then utilize to diagnose system problems. And then of course you can use his other tools like DBLab or the Postgres AI site to do experiments and figure out performance issues.
[00:09:07] But if you want to learn more, you can definitely listen to the episode here or watch the YouTube video down here. And now I'm just going to briefly discuss each of the posts that were part of the Postgres AI launch week. So they had the release of DBLab 4.0: instant database branching with O(1) economics. Basically this is a very fast way of creating database branches, akin to creating a branch in Git, but doing it for databases, and you can do it super fast. It is offered as an open source tool that you can run on macOS, and I believe they have an online solution as well, so you can check that out.
[00:09:44] Next piece of content was Postgres AI monitoring: an expert-level Postgres monitoring tool for humans and AI. Pretty much this is what was covered in the podcast, so you can listen to that or look at the post here to get an example of it. But again, this is an open source tool.
[00:10:01] Next piece of content. Postgres AI Checkup Service: expert-led, AI-assisted, comprehensive database health assessments. So basically this is the consulting arm of what they do: they can put the monitoring in place for you, and then you can ask for their assistance with analyzing the data and coming up with conclusions. The next post was Introducing Postgres how-tos: practical guides for real-world database challenges. So if you've ever read any of those cookbook-style books that address a particular topic, well, this is basically a Postgres cookbook as far as I can best understand it, and they've got a link to the how-to guides right here. Next piece of content. Self-driving Postgres. This is Nikolai talking about where he sees AI being introduced to better automate Postgres, and some examples he mentions are the database having automated tuning to improve performance.
[00:10:54] Indexes can be created automatically, or the planner gets smarter, learning from past query executions to improve plan choices. Now, based upon their consulting, they actually identify these 25 areas as ways automation could assist, as well as some areas that aren't really addressed in the market currently, like figuring out what's going wrong with a particular installation. I thought this was an interesting read, but before I trust my database to an LLM, I still think there needs to be significant advancement of its capabilities, because it is too often incorrect and it never seems to do the same thing the same way twice. Like in a given LLM session, if you ask it to write a particular bit of code and then ask it to write it again a little later in that session, it will write it completely differently. Now that's fine, but part of me wonders whether that's also what's introducing some of the error rates you see when working with LLMs.
[00:11:49] But check this out if you're interested. Next piece of content. Postgres as a platform: building event-driven systems with schema changes. This is from neon.com, and I haven't seen much of a need for this, but if you have a whole pipeline that needs to detect schema changes in your database, maybe because you're using CDC to send data off to some analytics platform or a data lake, maybe you need to capture those schema changes. Well, this is a solution for that. So you could use something like event triggers that fire on any DDL activity. That trigger would store a row in some sort of schema_change_log table, which would then use LISTEN/NOTIFY pub/sub, so you can do this within Postgres, to kick off various automations.
[00:12:36] So he goes through the process of setting something like this up if you're interested.
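Here is a minimal sketch of that pattern, assuming a hypothetical schema_change_log table and channel name; it is not Neon's exact code, just the shape of the event trigger plus LISTEN/NOTIFY approach described above.

```sql
-- Hypothetical log table for captured DDL.
CREATE TABLE schema_change_log (
    id          bigserial   PRIMARY KEY,
    command_tag text        NOT NULL,
    object_id   text,
    changed_at  timestamptz NOT NULL DEFAULT now()
);

-- Event trigger function: record each DDL command and notify listeners.
CREATE OR REPLACE FUNCTION log_schema_change() RETURNS event_trigger AS $$
DECLARE
    cmd record;
BEGIN
    FOR cmd IN SELECT * FROM pg_event_trigger_ddl_commands() LOOP
        INSERT INTO schema_change_log (command_tag, object_id)
        VALUES (cmd.command_tag, cmd.object_identity);
        -- Wake up any worker that ran LISTEN schema_changes.
        PERFORM pg_notify('schema_changes', cmd.object_identity);
    END LOOP;
END;
$$ LANGUAGE plpgsql;

CREATE EVENT TRIGGER track_schema_changes
    ON ddl_command_end
    EXECUTE FUNCTION log_schema_change();
```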
[00:12:41] And the last piece of content. CloudNativePG 1.26.1, 1.25.3 and 1.27.0 release candidate 1 are released. So this has a general recommendation that if you are using CloudNativePG, go ahead and upgrade to the most recent version. But they also talked about some of the enhancements coming in 1.27, such as dynamic loading of PostgreSQL extensions, logical decoding slot synchronization, and primary isolation checks, as well as an experimental failover quorum. They just keep adding all these features that help make managing Postgres easier, so I'm still keeping an eye on it. I haven't moved any major platform to CloudNativePG yet, but we'll see how things go as they continue to work on it. And now it's time for my consulting corner. So I do have a client who has some minimal search features in their application, and they have been using Elasticsearch. They actually did some experimentation and said, hmm, I wonder if we could get rid of this Elasticsearch cluster and just do some basic searching using Postgres. They chose to use the pg_trgm extension as well as a btree_gin index. So basically they wanted to search with trigrams, and they also needed to narrow their search query by a particular tenant ID, for example, which is why they needed the btree_gin index. And based on how they set it up, they actually got better performance than Elasticsearch. Now, the devil's in the details, and I don't know why it was necessarily faster, but with the performance they tested in Postgres, the searching was more than sufficient for what they were looking for. So they're actually going to explore using this solution. I did make a recommendation that they should definitely be sure to vacuum relatively frequently to keep the GIN index optimized, as well as make sure they have sufficient work_mem, because sometimes when you're using a GIN index you may end up with some lossy heap blocks, which can reduce your performance; you want to try to keep those exact. So basically you can run EXPLAIN ANALYZE on a GIN index query, and if you have some lossy blocks, usually increasing work_mem will make those exact and improve your performance. But I think this might be a case of Postgres search being good enough.
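As a rough sketch of that setup, assuming a hypothetical products table with a tenant_id column (none of this is the client's actual schema):

```sql
-- Trigram search narrowed by tenant, using a combined GIN index.
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE EXTENSION IF NOT EXISTS btree_gin;  -- lets the GIN index also cover the scalar tenant_id

CREATE INDEX products_tenant_name_trgm_idx
    ON products USING gin (tenant_id, name gin_trgm_ops);

-- Narrow by tenant, then fuzzy-match the name with trigrams.
SELECT id, name
FROM products
WHERE tenant_id = 42
  AND name ILIKE '%wireless keyboard%';

-- If EXPLAIN (ANALYZE, BUFFERS) shows "Heap Blocks: ... lossy=..." on the bitmap
-- heap scan, raising work_mem usually turns those lossy blocks back into exact ones.
SET work_mem = '64MB';
```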
[00:14:56] I hope you enjoyed this episode. Be sure to check out scalingpostgres.com, where you can find links to all the content mentioned, as well as sign up to receive weekly notifications of each episode there. You can also find an audio version of the show, as well as a full transcript. Thanks, and I'll see you next week.