100 TB and Beyond! | Scaling Postgres 320

Episode 320 | June 16, 2024 | 00:17:32
Scaling Postgres

Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, we discuss three organizations scaling their databases to 100 TB and beyond, collation speed, configuring memory, and new AI extensions.

To get the show notes as well as get notified of new episodes, visit: 

https://www.scalingpostgres.com/episodes/320-100tb-and-beyond/

Want to learn more about Postgres performance? Join my FREE training called Postgres Performance Demystified here: https://www.scalingpostgres.com/courses/postgres-performance-demystified/

 

 


Episode Transcript

[00:00:00] In terms of the clients I work with on a consulting basis, the largest database cluster I think I've helped work with is about 100 terabytes, with tables that are of course multi-terabyte. Well, in this week's episode we talk about some organizations that far exceed that, but I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is "To 100 terabytes and beyond." This is from Postgres FM, and it's Postgres FM's 100th episode, and it was an excellent one. I don't think they could have done their hundredth episode better. In it, Michael and Nikolay had three guests: Arka from Notion, Sammy from Figma, and Derk from Adyen, which I hadn't heard of before. That organization does payment processing. It was really interesting hearing their different perspectives and the different issues they've dealt with managing really large database clusters. Now, interestingly, most of them were doing sharding, so they had gone ahead and sharded their databases, but only one, I think, actually used partitioning. I was surprised by that; I thought that feature set might have been used more. Two of them were running on RDS, and I think all of them, or most of them, were using PgBouncer. They all mentioned dealing with different vacuum issues and transaction ID wraparound and the techniques they use to avoid problems with those, because when one partition of a partitioned table gets up to 30 terabytes by itself, that can definitely cause some issues. Also with regard to partitions, they talked about locking issues they were running into, as well as excessive planning times: the planner was trying custom plans as opposed to just using a generic plan, so they resorted to using generic plans. Even though the execution time was a bit slower, they were saving more time in the planning stage. They also talked about how important HOT updates are, particularly at scale, because of the drastic reduction in WAL they allow you to have. So overall, I thought this was fascinating, and I think I'm going to have to watch it two or three more times just to get all the relevant information out of it. Now, I also encourage you, and I'm going to do this as well, to check out all of the different links, because these organizations have also written up what they've done with their databases. So you can get even more insight by checking out the links to the blog posts they've published about managing Postgres. I definitely highly encourage you to listen to it or watch the YouTube video down here.

[00:02:43] And if you're wanting to learn more about how to scale Postgres, I have posted my training, which runs approximately an hour, called Postgres Performance Demystified. Now, this is similar to the training I did at the end of 2023, so if you participated in Postgres Performance Demystified at that time, I don't think you're going to get too much more out of it. I did make some tweaks, but not significant ones. But this is free training that you can participate in, and I share the framework I use to achieve thousand-fold improvements. I'll have a link in the description below where you can sign up.

[00:03:18] Next piece of content: "How much speed you are leaving at the table if you use default locale." This is from dipesh.com, and with Postgres 17 coming out with its new built-in collation provider, he decided to do a performance test of different collation types.
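As a rough illustration of what declaring these collations can look like (this is my own sketch, not taken from the post, so the table and column names are made up):

-- Sketch only: assumes Postgres 17 built with ICU support; names are hypothetical.
CREATE COLLATION IF NOT EXISTS en_us_icu (provider = icu, locale = 'en-US');

CREATE TABLE words_libc_c  (word text COLLATE "C");         -- classic C locale, plain byte-order (memcmp) sorting
CREATE TABLE words_builtin (word text COLLATE pg_c_utf8);   -- PG 17's new built-in provider, C.UTF-8
CREATE TABLE words_icu     (word text COLLATE en_us_icu);   -- ICU, linguistic (en-US) sort order

-- After loading the same data into each table, compare sort speed, for example:
EXPLAIN ANALYZE SELECT word FROM words_libc_c ORDER BY word;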
So he checked out the libc provider, which is usually the default, the ICU provider, as well as the new built-in provider, and for each of them he tested C or C.UTF-8 as well as English UTF-8 where it was available. He created each of the tables using the different collations, loaded the data in, and conducted five different speed tests. Now, all the raw data is here, but I'm going to focus on the graph down here, which shows relative performance on the y-axis, where 100% is basically the fastest. The fastest one, as you can see here, is libc C, and the slowest one is the libc English US collation. So again, this highlights the performance of C collations, or binary sort order, versus linguistic sort order that we've discussed in previous episodes of Scaling Postgres. Now, I expected the built-in C to be equivalent to libc C, and in many of the test cases it was, but there were a few variations: in test one and test two, the built-in C provider was slower than the libc provider, so I'm not quite sure what's going on there. There were some cases where C.UTF-8 for the built-in provider was a little bit slower, but not by much, and that can be expected, I suppose, depending on the functions being used. But the performance of the built-in provider, which is the blue and the orange, was very consistent, whereas libc's C versus C.UTF-8 had much more variation. So after looking at this, I am definitely looking forward to the built-in C and C.UTF-8 collations coming in Postgres 17. I guess my only outstanding question is why there was this level of variation with libc.

[00:05:26] But if you want to learn more, definitely check out this blog post.

[00:05:30] Next piece of content: "How to get the most out of Postgres memory settings." This is from tembo.io, and in terms of configuring memory, the first thing you generally adjust is your shared buffers, and the generally accepted value is about 25% of memory. Now, you can do that and mostly just leave it alone, but if you want to get the most out of it, he suggests using the pg_buffercache extension to monitor how much of the cache is being utilized. He says if your database has been running for a while and the buffer cache isn't 100% utilized, your setting may be too high and you can actually reduce shared buffers. Conversely, if it is 100% utilized and only a small fraction of your most-used tables is cached, then you probably want to increase your shared buffers a bit. He also said that the new system view pg_stat_io can help you configure shared buffers to make sure you're not constantly cycling the same pages in and out of memory, and it will give you a bit of a clue as to that. The next area of memory he considers is work memory. This is memory used by each session as it's processing statements, and by default it is four megabytes. But you want to be cautious, because this much memory can be used per query node, so if you have multiple joins, that work memory could be used multiple times in a single query. He actually has a formula further down: the potential memory usage is the number of connections, which is basically the number of sessions, times the number of nodes per query (he says an average of five is maybe something to consider), times the amount of work memory. So that's the total memory that could potentially be used by the sessions of the database.
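To make that formula concrete, here's a rough back-of-the-envelope sketch of my own (the numbers are made up, not from the article), plus the pg_buffercache check mentioned above:

-- Hypothetical worst case: work_mem x nodes per query x connections
--   4 MB x 5 x 200 connections = roughly 4 GB of potential session memory,
--   on top of shared_buffers.
SHOW work_mem;          -- defaults to 4MB
SHOW max_connections;   -- e.g. 100 or 200

-- And the buffer cache check from earlier: how many shared buffers are actually in use.
CREATE EXTENSION IF NOT EXISTS pg_buffercache;
SELECT count(*) FILTER (WHERE relfilenode IS NOT NULL) AS used_buffers,
       count(*)                                        AS total_buffers
FROM pg_buffercache;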
So you want to make sure that fits within the total memory of the instance you're using. Now, the risk of setting it too high is that your machine runs out of memory and processes may get canceled; the downside of setting it too low is that you start writing temporary files to disk, and he says you can watch for that by looking at the pg_stat_database view to see whether you should increase your work memory or not. The next area he covers is ongoing maintenance, and this is basically maintenance work memory. This is memory for autovacuum as well as for things like index creation, and for most normally sized servers it might be set at 1GB, though you can of course make it more. And I think in Postgres 17 they have now removed the 1GB limit for autovacuum, so that could prove beneficial for making autovacuum faster on instances with larger memory. The next topic he covers is session pooling, and basically he's talking about using a connection pooler; in this example he's using PgBouncer. The advantage is that you can have many more clients connecting to PgBouncer while having fewer connections to the database. Now, I want to caution you where he says session pooling here: session pooling with PgBouncer won't get you this, so just keep that in mind. You actually have to use something like transaction pooling with PgBouncer to be able to multiplex connections like this and, as he says, go from over 1,000 client connections on one side to maybe 40 or 50 on the database side. Session pooling is just a one-to-one ratio of clients to database connections, so it doesn't really give you that memory savings. And the reason this saves memory is basically that you're using fewer connections: if you think back to the formula of work memory times the max connections times an average of five nodes, you can see how that would help free up more memory for your database. The last point he covers is reducing bloat. If you have bloated tables, you're storing information in pages in memory that has no relevance, so the more you can reduce bloat, the more efficiently you will use your memory. An extension to help with that is pgstattuple, which gives you a sense of how bloated a table is. This is a very comprehensive post, so if you're interested in configuring your memory, I definitely suggest you check it out.

[00:09:35] Next piece of content: "PGSQL Phriday #017." This is from pgsqlphriday.com, and basically the upcoming challenge for the monthly blogging event is artificial intelligence, so check out this post and participate if you want. And if you're interested in AI, Timescale posted five blog posts on the same day about two extensions they released. Now, I think this is probably what I would call the umbrella blog post: "Making PostgreSQL a Better AI Database." The first extension they released is pgvectorscale, where they introduced StreamingDiskANN, an approximate nearest neighbor index. What they're doing with this is storing the index on a local SSD drive as opposed to in memory. Indexes for vectors can be huge, so rather than keeping them in memory, they store them directly on a local SSD drive to save costs, and apparently it results in excellent performance. The second thing it has is statistical binary quantization, which basically improves on binary quantization techniques.
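Based on my reading of the pgvectorscale README, using it looks roughly like the sketch below; treat the extension name, index method, and operator class as assumptions to verify against their docs, and the table definition is made up:

-- Sketch only: vectorscale, diskann, and vector_cosine_ops are my recollection
-- of the pgvectorscale documentation; double-check before relying on them.
CREATE EXTENSION IF NOT EXISTS vectorscale CASCADE;   -- should also pull in pgvector

CREATE TABLE document_embedding (
    id        bigint GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
    contents  text,
    embedding vector(1536)                             -- dimension chosen arbitrarily
);

-- StreamingDiskANN index, intended to live on a local SSD rather than in memory
CREATE INDEX ON document_embedding USING diskann (embedding vector_cosine_ops);

-- Nearest-neighbor search by cosine distance ($1 is the query embedding)
SELECT id, contents
FROM document_embedding
ORDER BY embedding <=> $1
LIMIT 10;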
In terms of results, they show a comparison against Pinecone where they were able to see a 28x performance gain as well as being 75% cheaper. Now, when I was looking at this, I was thinking: why didn't they just add these features to pgvector? Why do we have to have another extension for this? But I think the reason is right here, because they say that while pgvector is written in C, pgvectorscale is written in Rust, offering the PostgreSQL community a new avenue to contribute to pgvector capabilities. So if you have developers who only know Rust and don't really want to learn C, I can see them dedicating themselves to one open source project or the other based upon what languages they know. It is probably less convenient for a user to potentially have to use two different extensions, but I can see the perspective on it. The next extension they created is pgai, which they say gives developers AI engineering superpowers, basically trying to take a full-stack engineer up to being an AI engineer. Now, initially I was intrigued by it, but it looks like all it's doing is enabling you to contact the OpenAI API through Postgres, to generate embeddings and then to send API query calls to it. And I think in one of the other blog posts they had an example of someone who knew C but didn't want to have to learn Python for doing AI work or working with the OpenAI API. But after I built my own semantic search solution just using Ruby, it wasn't that difficult to do. So if the pgai extension is only a wrapper around contacting something like the OpenAI API, I don't see too much value in it right now; but if you want to do more things in the database, maybe this is exactly what you want. Now, for the names of the other posts, where they go into all the detail I've just summarized here: they have "Pgai: Giving PostgreSQL Developers AI Engineering Superpowers," which goes into detail on that, and "Pgvector vs. Pinecone: Vector Database Performance and Cost Comparison."

[00:12:54] Next was "How We Made PostgreSQL as Fast as Pinecone for Vector Data," which goes into their disk-based approximate nearest neighbor search as well as the binary quantization they developed. And finally, "PostgreSQL and Pgvector: Now Faster Than Pinecone, 75% Cheaper, and 100% Open Source." So feel free to review any of these blog posts to get all the details.

Next piece of content: "CloudNativePG Recipe 8: Participating in the PostgreSQL 17 Testing Program in Kubernetes." This is from gabrielebartolini.it, and basically, now that 17 is at beta one, they're starting testing with their CloudNativePG Kubernetes operator. So check this out if you want to participate in that. And then we have six other blog posts related to CloudNativePG, where Daniel Westermann wrote about working with it. The first one here is on Minikube on openSUSE Leap Micro 6, and all of these are from dbi-services.com. The next posts, all about CloudNativePG, cover bootstrapping an empty cluster, configuring the PostgreSQL instance, PostgreSQL extensions, the kubectl plugin, and finally scaling up and down. So all of these blogs could be a good way for you to get started with CloudNativePG if you're interested in that.
Next piece of content: "2024.pgconf.dev and Growing the Community." This is from rhaas.blogspot.com, and it's a discussion of things that transpired at pgconf.dev, but a lot of it is about community involvement and making it better. So if you want to learn more about what the PostgreSQL community is thinking about and trying to improve in terms of the community, definitely check out this blog post. And the last piece of content: "Benchmark Amazon RDS for PostgreSQL Single-AZ DB instance, Multi-AZ DB instance, and Multi-AZ DB cluster deployments." This is from Amazon.com, and I thought this was interesting just from a performance perspective. They compared the three different ways you can use RDS. One is just a single instance, so you have a single instance running; there's no redundancy and no replica, and application servers in multiple availability zones can reach out and contact this instance. Now, this is probably the most performant way of running Postgres, but the problem is you don't have any redundancy. The next one they discussed is the Multi-AZ DB instance, and this basically provides a single database instance like before, but with a synchronous standby replica ready. It's not a hot standby, so you can't query it; it just exists in the background without you seeing it, but in the event of an availability issue RDS can fail over to it. And they said this is actually the slowest solution, because it uses synchronous replication and it stores all the WAL in EBS volumes, which adds more of a network round trip for storing things on disk. Then they cover the Multi-AZ DB cluster, and I think these naming conventions are a little too similar, but with this one they provide at least three instances: one is your writer database, which replicates to two readers in separate availability zones. This replication they call semi-synchronous, which means they send a write to both readers, and once the first one responds, they complete the commit on the primary. The performance of this actually mirrors a single instance, they say, because they're using local NVMe drives as opposed to EBS volumes to store the WAL, which lets them quickly get confirmation back to the primary instance that something has been committed. They did HammerDB benchmarks to confirm this, and they show the results of the test here. So, like I said, the Multi-AZ DB instance, the one with the hidden standby writing WAL to EBS, gives the least performance compared to a single instance or a Multi-AZ DB cluster writing WAL to local NVMe drives. But if you want to learn more, check out this blog post.

[00:17:07] I hope you enjoyed this episode. Be sure to check out scalingpostgres.com, where you can find links to all the content mentioned, as well as sign up to receive weekly notifications of each episode. There you can also find an audio version of the show and a full transcript. Thanks, and I'll see you next week.

Other Episodes

Episode 110

April 20, 2020 00:13:33

Max Connections, SQL Tricks, Fast Text Search, Security & Compliance | Scaling Postgres 110

In this episode of Scaling Postgres, we discuss tuning max connections, different SQL tricks, setting up fast text search and handling security & compliance....


Episode 106

March 23, 2020 00:14:58

Recursive CTEs, DB Access Patterns, Dates & Times, Cluster | Scaling Postgres 106

In this episode of Scaling Postgres, we discuss how to use recursive CTEs, database access patterns, using dates and times and how to cluster...


Episode 143

December 07, 2020 00:18:42

Microsoft SQL Server Compatible, Time Series Performance, Feature Casualties, BLOB Cleanup | Scaling Postgres 143

In this episode of Scaling Postgres, we discuss Microsoft SQL server compatibility, the performance of time series DBs, DB feature casualties and how to...
