Scaling Out To 226K TPS | Scaling Postgres 422

[00:00:00] Well last week we talked about scale out funding where investors seem to be investing in on how to better scale out Postgres through sharding and this week we look into a sharding example but they did it at the application layer as opposed to a Postgres layer. But I hope you, your friends, family and co workers continue to do well. Our first piece of content is is from Christmas outage to number one app store ranking in RF frames Postgres scaling retrospective. This is from andyatkinson.com and he's talking about the company rframes which I think they have electronic frames you can buy and store your images on them and they rotate through the images and apparently they have massive load on Christmas Day I guess everyone uploading their images to it. [00:00:53] And in 2024 the service was unavailable for three hours and presumably it was database related. [00:01:00] Well with the new implementation presumably in 2025 the transactions per second across multiple DBs. So they did scale it out peaked at 226,000 transactions per second with more than 100,000 TPS sustained for 10 hours on multiple days. [00:01:18] So this posed is long and it has quite a story to it. I can't cover everything in it here, but in 2024 during Christmas they were using the largest R6G instance available at the time and they basically were write constrained even with having a dedicated log volume available for it, which is a storage volume that's separate from the database table volume. [00:01:45] So all of this load caused multiple challenges as he lays out here. [00:01:50] So their solution was to scale out the writes and they almost 5x'd times the amount of hardware that they used. [00:01:59] And he has various stats you can look at here they are using PgBouncer. He described the process of setting up those DBS and he's using application sharding. So they're basically dedicating particular tables to particular database systems and they are using Ruby on Rails. So they are using the ActiveRecord Multiple Databases capability to do that. And he described the steps to do the switch over to the new DBs. They tried logical replication but it wasn't fast enough. [00:02:29] So they actually just did a full logical replication and I think then removed the data that was needed. Yes, cleanup of unneeded tables. [00:02:39] So this is mostly the database side of the plan. He also wrote scaling Rails at RF frames, splitting to eight primary DBs and reaching number one in the app Store. This blog post is the same story, but it mostly looks at it from the Ruby on Rails perspective. So there's A lot more Ruby code in describing how the application sharding was done and it's basically using built in capabilities that exist in the Ruby on Rails framework now. But he did describe some various changes that of course needed to be done to the application to get it working. [00:03:12] So I thought this was super interesting. But it also again underscores the need for scaling out solutions when your current primary database can't scale up enough. And that's exactly what pgdog and multigres are targeting. [00:03:28] So I'm definitely going to keep an eye out on those two solutions and see what happens with them. But check these two blog posts out if you're interested. Next piece of Content the only scalable delete in Postgres is drop table. This is from planetscale.com and he's talking about when you do large deletes to a database it basically causes a lot of work because it doesn't immediately delete the data, it marks it for deletion. [00:03:54] So essentially that is write activity that has to happen and it's a vacuum that needs to go in and then ultimately remove that delete. And depending on the database table and how much activity you have, you could run into locking conflicts as well during the delete process. [00:04:11] So he's advocating using drop table or truncate as opposed to delete when you can. [00:04:17] Now of course this drops the entire table or truncates the table so you can't just swap it for a delete. So what it basically means is if you have ongoing tables that you need to delete data from frequently move them into partition tables because hopefully that set of rows you would need to delete can be in a partition that could be removed. And if you're talking about time series data partitioning by a timestamp is a great way to delete older data. [00:04:46] He also discusses one off deletes and that's a case where you need to delete a lot of data one time and he uses an example of where invalid data was inserted into a table. So what they actually did, they could afford a long term lock on this table. So they opened a transaction, locked the table, created a temporary table, and moved all the data into it. Then they truncated the original table and then moved the data they wanted to keep back into the table. [00:05:15] So that essentially removed the rows that they didn't want by using truncate as opposed to delete. And then of course for ongoing deletes use partition tables that I mentioned. [00:05:26] But check this out if you want to learn more Next piece of content Disaster Recovery is a Process, Not a Tool Part 1 this is from richyan.com and this is another fairly long blog post, but it's all talking about disaster recovery and the process that you need to go through. Because when you have a disaster, it's not necessarily some sort of natural event like a flood, earthquake, wildfire, etc. [00:05:52] It may be a performance regression after some failover, or it could be some sort of corruption or a security incident, or an application bug that writes incorrect values or accidental deletes. These are all disasters that need to be considered and how to recover from and he says when you think about disaster recovery, most people think about being prepared and setting up prevention. [00:06:19] But he says recovery is what happens after preparation and prevention have failed, so you need to have a process for that recovery. And he talks about numerous backup tools that are available to handle recovery and just be prepared to use them in particular incidents. [00:06:36] Then he talks about RPO and rto, basically recovery point objective and recovery time objective, and how there are negotiations because one interacts with the other to a certain extent, and what actions do you follow based upon your objectives? [00:06:53] And he shows an example here of different recovery paths based upon what's happening. [00:06:58] So he says, for example, a lower RPO isn't just better, it's a design and operational commitment. And that commitment costs money, time, and people's attention. [00:07:08] And your disaster recovery plan should have an infrastructure component, a procedural component, and human component. But I thought this was a very insightful article. If you want to learn more, definitely encourage you to check it out. And the next piece of content is the null in your not in. This is from boringsql.com and he says a not in query can return the wrong answer without telling you, because basically if you have a null sitting somewhere in that set of records you're looking at, it will return nothing. So he shows an example here, a list of products and a list of archived category IDs, and one of them is null. So if you do a not in, the results will give you nothing, even though you might expect it to return the 10 record, it gives you none instead. [00:07:58] But if instead you change that not in to exclude nulls, then you get the right set of records back. Now why this works is because an regular in statement is considered an or. [00:08:11] So give me where x equals a or x equals b, or x equals c, not in is more of an and between the values. [00:08:20] So X is not equal to A and not equal to B and not equal to C, and you can see what the actual planner outputs for n and not n and essentially it's this not equal to and that causes the records to vanish. [00:08:35] Now we show some different examples of this of course now in Postgres 19 some of this may be resolved because some not ends may be converted to anti joins which can avoid this problem. [00:08:48] But he says probably don't want to wait for that. And there's a couple of ways you can resolve this. You can default to using not exists that will avoid the problem, or filter out the nulls in the subquery as he did in the example above. [00:09:03] Or this is an example of using the accept command as well. But if you want to learn more, definitely check out this blog post. Next piece of content looking forward to Pesque 19 it's about time. This is from pjedge.com and he's talking about enhancements coming in PG19 to help make temporal tables easier. So this is being able to identify a row by some sort of identifier and the time range it was valid for. [00:09:31] And historically you would have to use B tree gist indexes and using exclude to avoid overlaps and it's just not inherently temporal aware. [00:09:43] However, with the enhancements in 19 you can now define a primary key with two values, say an integer for the ID and a range timestamp and classify it as without overlap. So very simple to do. [00:09:58] And of course have a date range on the table as well. So you don't need the b tree gist extension, you don't need to define the time as two columns, you can use the time range and you don't need this exclusion constraint. And then he goes through and shows how you can use these temporal tables. So if you're interested in that, definitely check out this blog post. Next piece of content related to time is our favorite subject Time Zones British Columbia Time zones and Postgres because apparently British Columbia did their spring forward to UTC back in March, but they will not fall back to UTC 8 in November so they're permanently staying UTC -7. And he took this opportunity on how to handle transitions like this and talks about time zone storage but but if you have a time zone shift and you have people needing to arrive at a particular time for appointments, you may want to implement a dual or a triple column pattern to track separately a local timestamp, local time zone and the UTC timestamp in order to better handle transitions like these. And he walks through the process of doing that. So if you're interested in that or need to do it definitely Check out this blog post. [00:11:19] Next piece of content pgkpart this is from Hexacluster AI and this is a new extension that rejects queries that would scan every partition of a partition table without a usable predicate on the partition key. [00:11:35] So if you have a lot of partitions in your database, this could be an advantageous extension to use. [00:11:43] So that way you avoid queries not using a partition key and hitting all of your partitions when it runs. [00:11:50] So if you're interested in something like that, definitely check out this blog post. [00:11:54] Next piece of content 1000 Postgres branches for $1 this is from Zeta I.O. and really the headline is what pulled me into this and they've done a lot of optimization to their branching where they've got times down to 1 to 2 seconds and I don't know how real this is, but 1000 DBAs for $1 on their video here based upon how fast they can scale and they do show some real world use cases of how people are using them. [00:12:22] So some people use an isolated Postgres branch in every CI build. So basically they're getting production shape data without touching a shared database and when the CI build is not running the branches are scaled down to zero. [00:12:35] This is another example where hooting agents are working on independent tasks with their own copy of the db. [00:12:42] So I guess that's an insight into why so many people are using so many databases today. [00:12:47] But if you want to learn more about the enhancements, you can check it out here. [00:12:51] And the last piece of content there was another episode of the Talking Postgres podcast. This one is how I got started running a Postgres user group with Jeremy Schneider. So if you're interested in learning more about him and his work, definitely check out the podcast here. I hope you enjoyed this episode. Be sure to check out scalingpostgres.com where you can find links to all the content discussed, as well as sign up to receive weekly notifications of each episode there. You can also find an audio version of the show as well as a full transcript. Thanks. I'll see you next week.

Show Notes

Episode Transcript

Other Episodes

Episode 79

PgBouncer SCRAM, Art of PostgreSQL, Distributed Time-Series, Window Functions | Scaling Postgres 79

Episode 372

The Future of Postgres Architecture | Scaling Postgres 372

Episode 396

600 Times Faster Analytics | Scaling Postgres 396