Episode Transcript
[00:00:00] Here's a question. Would you want to support 800 million users on a single Postgres primary database? I think at that scale I would definitely want to shard things, but ChatGPT has not done so yet. It looks like they've started down that path a little bit, but their main database is still a single, unsharded primary.
[00:00:26] So the first article is actually one that's been going around since I recorded last week's episode of Scaling Postgres. We'll talk about the ChatGPT article, and I will do a very quick Consulting Corner where I actually have a question for you: what do you use to maintain your source of record for your partition tables, if you're using them? But I hope you, your friends, family and co-workers continue to do well. Our first piece of content is Scaling PostgreSQL to power 800 million ChatGPT users. This is from openai.com, and they're talking about the fact that they use a single primary and 50 readers to support their 800 million users, and this is their main database. Because this article has been making the rounds, I thought I'd make it the first article we talk about this week. A lot of the insights they mention might be old hat to many of you, but I did find a few interesting things, so we'll review it. They mention that the main pain they've been experiencing with scaling Postgres since the launch of ChatGPT is anything that increases the Postgres load: a cache layer failure, expensive queries, or write spikes, maybe from a new feature that's writing more than they expected. When that happens, requests become slower and time out, which triggers retries, which causes even more Postgres load. It's a positive feedback loop that compounds the problem. So they put in a lot of engineering work to address these areas and minimize load, especially on the primary, since they only have one. The readers can be impacted too, but there are 50 of them; still, if your caching layer goes down, they're all going to be hit by the traffic that the cache layer was absorbing. And they do say, quote, although Postgres scales well for read-heavy workloads, we still encounter challenges during periods of high write traffic. Essentially, they're talking about the fact that when you update a tuple, it creates another row version; this is MVCC, which doesn't update a row in place. It's a well-known characteristic of Postgres, and it's why I'm so interested in getting OrioleDB available as a new storage engine, to at least put some update-heavy workloads on it. Insert-heavy workloads are fine; it's the update-heavy ones where you're going to run into issues.
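As a quick aside, it's easy to see that behavior for yourself: a row's ctid (its physical location) changes after an UPDATE, because Postgres writes a new row version rather than modifying the old one in place. A minimal sketch, with a throwaway table name just for illustration:

```sql
-- Throwaway demo table; the name is just for illustration.
CREATE TABLE mvcc_demo (id int PRIMARY KEY, val text);
INSERT INTO mvcc_demo VALUES (1, 'before');

SELECT ctid FROM mvcc_demo WHERE id = 1;   -- e.g. (0,1)

UPDATE mvcc_demo SET val = 'after' WHERE id = 1;

SELECT ctid FROM mvcc_demo WHERE id = 1;   -- now a different location, e.g. (0,2);
                                           -- the old version sticks around until vacuum cleans it up
```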
[00:03:03] And they do say their infrastructure is handling millions of queries per second. So basically the first thing they did was put every read they could on the read replicas and only read from the primary when absolutely necessary. Now, usually when you see a Postgres installation this large, they've already moved to sharding or at least have it in their plans, and they say they're not ruling out sharding in the future. In addition, they have already started moving certain workloads to a new sharded system that runs on Azure Cosmos DB, which I'm assuming is a scale-out sharded solution, but they are keeping the workloads that are harder to shard on their main Postgres cluster. Then they talk about a number of steps they follow to make the system more efficient. Number one is query optimization, of course, because that's where you can get the biggest wins. They said they have had issues with system performance if a costly query gets through, say one joining 12 tables, so they've had to put controls in place to prevent those queries from getting into the system, and they even talk about moving some join logic into the application layer instead. Although at this scale, I would have expected that if they don't need real-time data, they'd use change data capture to offload that work to a data warehouse or a data mart; if you're joining 12 tables together, that seems like more of an analytical workload. The next point they talk about is single point of failure mitigation, meaning they only have a single primary. What do they do to address that? They have implemented high availability with a hot standby, a continuously synchronized replica able to take over, and I believe they are running on the Azure Postgres offering. They also do some workload isolation, which I found interesting: they isolate some workloads onto dedicated instances, I'm assuming read-only instances, to ensure that sudden spikes don't impact other traffic. Requests go into a low-priority or a high-priority tier, so it sounds like if you had a job system where you assign priorities to different jobs, those would presumably hit different database systems. Next they talked about connection pooling, which is definitely essential at this scale. They have PgBouncers in front of each reader and, I'm assuming, the writer as well; they say they have a dedicated Kubernetes deployment with multiple PgBouncer pods in front of each replica. They do have a caching layer to avoid the database receiving all of the query traffic. And a pretty big challenge is scaling read replicas, because now they're up to 50, and all of that data has to be synced from the one primary, so synchronizing 50 replicas becomes a not-insignificant load on it. Currently they're investigating cascading replication, cascading from one replica to another to handle this, and they're working with the Azure Postgres team to set that up. They've also set various rate limits at the application layer, the connection pooler layer, the proxy layer, and the query layer to prevent traffic spikes that would impact the database. And they also have to implement a fair amount of schema management, which basically means avoiding database migrations that would cause a full table rewrite. So definitely a common thing you have to do.
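On that last point, here's a minimal sketch of the kind of migration pattern that avoids a long lock or full table rewrite; the table, column, and constraint names are just hypothetical:

```sql
-- Adding a constraint in two steps: NOT VALID skips the immediate full-table scan,
-- and VALIDATE later checks existing rows without blocking normal reads and writes.
ALTER TABLE events
  ADD CONSTRAINT events_status_not_null CHECK (status IS NOT NULL) NOT VALID;

ALTER TABLE events VALIDATE CONSTRAINT events_status_not_null;

-- Likewise, build indexes concurrently so you don't block writes on a busy table.
CREATE INDEX CONCURRENTLY events_created_at_idx ON events (created_at);
```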
So it looks like they are continuing on this path with a single primary, but they are starting to move newer workloads that are shardable, they say, onto what is presumably the sharded Cosmos DB that Azure offers. And I guess they didn't want to do a big migration onto a different sharded solution.
[00:06:53] But if you haven't checked out this blog post, feel free to go ahead and do that. The next piece of content, also related to sharding, is another episode of Postgres FM from last week. This one was a PgDog update. We've covered PgDog on Scaling Postgres before; it's a sharding pooler, so it does sharding, it does pooling, it also does some load balancing, and it basically sits between your application layer and your database layer.
[00:07:21] Lev Kokotov, the founder of PgDog, joined Nikolay and Michael to give an update, since he was last on the show about eight months ago, and they basically go through where PgDog is now. He says this is a real thing now, they have real customers in production, and he talked about the various enhancements he's been adding to the product based on what different potential customers are looking for. Basically he wants to get to the point of one-button sharding for your Postgres database. So I thought this was an appropriate follow-on from the OpenAI article, and he had a couple of interesting quotes. At one point in the show he said he thinks PgDog has a better pooler than PgBouncer and that the load balancing aspect is a game changer, but sharding is its North Star; that's what they're focusing on the most. And I'm definitely still keeping track of PgDog and what they're working on, as well as Multigres, of course, and how that's progressing.
[00:08:23] And if you look at the PgDog website, they do have a number of customers listed who presumably are using it now, given that they're on the website.
[00:08:32] But if you want to learn more and hear the discussion, you can listen to the episode here or watch the YouTube video down here.
[00:08:38] The next piece of content is also from PgDog: Replacing protobuf with Rust to go five times faster.
[00:08:46] So this is an enhancement they've added to PgDog to make the product, as he says, five times faster.
[00:08:54] So it's still in heavy development, and they're optimizing the solution to make it better. Now, this is a very technical blog post about all the optimizations they added, so I'm not going to go into detail here, but if you're interested, definitely feel free to check it out.
[00:09:08] Next piece of content: Fast, scalable, enterprise-grade Postgres natively integrated with ClickHouse. This is from clickhouse.com, and what ClickHouse has done is start offering a Postgres OLTP, or transactional, solution that works with their analytics database. So basically, all the work of moving your data into ClickHouse becomes much easier if they offer you a Postgres service to run your OLTP application.
[00:09:40] So what they've done is partner with Ubicloud, which is an open-source cloud company founded by the same team that started Citus Data and was involved with Heroku, so definitely no slouches. It's interesting that they didn't acquire Ubicloud; they're basically partnering with them, and Ubicloud will be providing the hosted Postgres OLTP service that works in conjunction with ClickHouse analytics in this unified solution ClickHouse is talking about. They're using NVMe-backed disk storage to give you great performance, which they say is particularly important with AI workloads. So much like PlanetScale is doing with their Metal offering for Postgres, they're doing something similar and using NVMe-backed Postgres. And they say, quote, we evaluated alternative architectures separating storage and compute.
[00:10:35] So basically they decided to go with the hosted NVMe storage offering that Ubicloud had developed. And this makes a lot of sense alongside the PG ClickHouse extension they released a month or two ago, which we discussed on Scaling Postgres; it enables you to run analytic queries from within Postgres and have them pushed down into a ClickHouse database.
[00:10:57] So this image clearly shows that you have an application: you run your transactions through your Postgres database, and you can run your analytic queries through the Postgres database as well, but when you're querying data through PG ClickHouse, it does the pushdown and runs them, they say, up to 100 times faster. And the data is kept in sync by the ClickPipes CDC pipeline. This is what I was thinking about with the OpenAI post: when they said a 12-table join significantly impacted the performance of their database system, well, maybe they should be doing some sort of CDC, if they're not already, because a 12-table join sounds more like an analytical workload, and therefore it could be pushed down into something like ClickHouse. But I thought this was an interesting offering, and it's similar to moves we've seen by other companies, like Snowflake acquiring Crunchy Data, where they basically purchased an OLTP hosting solution.
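On the CDC point, the Postgres side of logical-replication-based change data capture is fairly simple. This isn't how ClickPipes itself is configured, just a hedged sketch of the general pattern, with made-up table and slot names:

```sql
-- Requires wal_level = logical on the publishing server.

-- Publish the tables whose changes you want to stream to the analytics side.
CREATE PUBLICATION analytics_cdc FOR TABLE orders, order_items;

-- The downstream consumer reads changes from a logical replication slot.
SELECT pg_create_logical_replication_slot('analytics_cdc_slot', 'pgoutput');
```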
[00:11:59] But if you want to learn more, check out this blog post.
[00:12:01] Next piece of content: Introduction to buffers in PostgreSQL. This is from boringsql.com. In building RegreSQL, his tool for doing regression testing in Postgres, he had to learn a lot about shared buffers, so he basically shares it here. He says the first thing you need to know is that the database always reads from and writes to disk in 8-kilobyte pages; even if you only need a very small row, it's always going to read or write whole 8-kilobyte pages. And there is a Postgres cache called shared buffers that sits in front of those disk reads and writes, maintaining an in-memory copy of the current state of those pages. He goes into detail about how shared buffers work; I won't get into that here, but if you're interested, definitely check out this blog post because it is very good. He talks about dirty buffers, the background writer, checkpointing, the ring buffers and their role, as well as local buffers, which live within each backend to handle temporary tables, along with the role of the OS page cache, because even with shared buffers we still rely on the operating system cache. That's why it's never really recommended to give shared buffers 100% of your memory. But again, if you want to learn more, you can check out this blog post.
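If you want to poke at this yourself, the pg_buffercache extension is an easy way to see what's actually sitting in shared buffers. A rough sketch:

```sql
CREATE EXTENSION IF NOT EXISTS pg_buffercache;

SHOW block_size;       -- 8192 bytes on a default build
SHOW shared_buffers;   -- total size of the cache

-- Which relations in the current database occupy the most buffers, and how many are dirty?
SELECT c.relname,
       count(*) AS buffers,
       count(*) FILTER (WHERE b.isdirty) AS dirty
FROM pg_buffercache b
JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid)
WHERE b.reldatabase = (SELECT oid FROM pg_database WHERE datname = current_database())
GROUP BY c.relname
ORDER BY buffers DESC
LIMIT 10;
```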
[00:13:20] The next piece of content asks whether the future of MySQL is PostgreSQL, or MariaDB, or TiDB, or something else. It's basically discussing the fact that a lot of forward momentum has stopped or paused with the community edition of MySQL, and Oracle has laid off many of the top performers in the MySQL group. So the question is, you know, what's going to happen? He says, well, you could do nothing, which basically means staying on MySQL at whatever version it's at, but with the importance of security, I can't necessarily see that. The next option, he says, is to go with the elephant, so basically switch to Postgres. Or if you don't even want to consider that, you could try MariaDB; although things have diverged from MySQL, he says it is probably the second easiest port you can do. You could use the Percona solution, or you could go with Oracle's enterprise MySQL and, you know, pay for it.
[00:14:17] And he mentions a few other options. But this makes me wonder: is part of why PostgreSQL continues to grow in popularity that other databases are declining? Next piece of content: Finding Abbey Road when users type beatles abby erd, fuzzy and semantic search in PostgreSQL. This is from Bendiment IE, and he's talking about the situation where you have a set of data and users want to find something in it, and they search for all sorts of crazy stuff in order to find, in this example, an actual album. So how can you handle that? He proposes two ways of doing it. One is fuzzy matching using trigrams, for which he uses the pg_trgm extension. The other is semantic search, which is basically using pgvector, so you're searching on meaning via embeddings. He creates both extensions in his Postgres database, creates an album table, and loads in data from the Spotify data set.
[00:15:18] He creates some indexes: a GIN index with gin_trgm_ops and an IVFFlat index with vector_cosine_ops and lists = 100. And here's his fuzzy query using a similarity search; but he says that doesn't use the index, so you're going to want to use the % operator in order to use the GIN index.
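Roughly, the trigram side looks like this, assuming an albums table with a name column like the one he loads from the Spotify data set:

```sql
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- GIN index using the trigram operator class.
CREATE INDEX albums_name_trgm_idx ON albums USING gin (name gin_trgm_ops);

-- similarity() on its own won't use the index, but the % similarity operator will.
SELECT name, similarity(name, 'beatles abby erd') AS score
FROM albums
WHERE name % 'beatles abby erd'
ORDER BY score DESC
LIMIT 5;
```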
[00:15:43] So basically pg_trgm works well for misspellings, essentially. If you want semantic search, then you have to generate embeddings: you pick an AI model, generate embeddings for the data you're interested in, store those in the database, and then you can search those embeddings using queries like this. When someone does a search, you do have to convert their query to an embedding as well and then compare it in the query.
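The semantic side then looks roughly like this; I'm assuming a 384-dimension model here, and the $1 parameter is the embedding of the user's query, computed in the application:

```sql
CREATE EXTENSION IF NOT EXISTS vector;

-- Column sized to match whatever embedding model you picked (384 dimensions assumed here).
ALTER TABLE albums ADD COLUMN embedding vector(384);

-- IVFFlat index with cosine distance, as in the post.
CREATE INDEX albums_embedding_idx ON albums
  USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);

-- Nearest albums by cosine distance to the query embedding.
SELECT name
FROM albums
ORDER BY embedding <=> $1
LIMIT 5;
```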
[00:16:09] Now, he says what really helped him on a project, and he gives an example of it here, is doing some text normalization.
[00:16:16] So if you see common patterns in people's searches, you can actually convert them. If you see "feat", for example, go ahead and expand it to "featuring", or "v" to "volume", or remove certain noise patterns, or strip articles like "the" from the beginning of searches. Now, he combines both approaches when doing a search. He doesn't use the reciprocal rank fusion that other blog posts have mentioned; instead, he does the fuzzy matching first, and only if the score is low does he fall back to the embeddings, using the better result of the two.
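Sketched out, that flow might look something like this, reusing the hypothetical albums table from above; the 0.3 threshold and the normalization rule are just placeholders:

```sql
-- Normalize the raw search text first, e.g. expanding "feat" to "featuring" (one rule per pattern):
-- SELECT regexp_replace(lower('Beatles feat. Billy Preston'), '\mfeat\M\.?', 'featuring', 'g');

-- Step 1: trigram match against the normalized query ($1).
SELECT name, similarity(name, $1) AS score
FROM albums
WHERE name % $1
ORDER BY score DESC
LIMIT 5;

-- Step 2 (decided in the application): if the best score comes back below, say, 0.3,
-- run the semantic query instead and keep whichever result set looks better.
SELECT name
FROM albums
ORDER BY embedding <=> $2   -- $2 = embedding of the normalized query
LIMIT 5;
```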
[00:16:55] So if you want to learn more about this, definitely check out this blog post. All right, now it's time for my Consulting Corner. I want to talk a little bit about partition tables, and particularly, what is your source of truth for which partitions need to be built? Personally, I've built it into the code that creates the partition tables: it just runs on a daily basis, checks whether there are any new partitions that need to be created that haven't been yet, and goes ahead and creates them (a sketch of that is below). So my source of truth is in the code. But I know if you're using a tool like pg_partman, that source of truth is generally contained within Postgres tables: you typically have your configuration in a separate schema that defines the partition sets you're managing, as well as the template tables, so you can alter the configuration of some of the templates and have it applied to new partitions. But really, I'm interested in knowing what you use as the source of truth for your partitions.
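For reference, the code-driven approach boils down to the daily job issuing something like this; a minimal sketch assuming a range-partitioned events table on created_at, with the names and dates just for illustration:

```sql
-- Create tomorrow's daily partition if it doesn't already exist.
CREATE TABLE IF NOT EXISTS events_2025_01_02
  PARTITION OF events
  FOR VALUES FROM ('2025-01-02') TO ('2025-01-03');
```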
[00:17:57] Do you like storing this data somewhere in Postgres tables? Do you store it in the code that needs to run to generate the new partition tables? Let me know in the comments below. I hope you enjoyed this episode. Be sure to check out scalingpostgres.com, where you can find links to all the content mentioned, as well as sign up to receive weekly notifications of each episode. There you can also find an audio version of the show, as well as a full transcript. Thanks, and I'll see you next week.