Scaling For High Volume | Scaling Postgres 364

Scaling Postgres

Episode 364 | May 04, 2025 | 00:20:39

Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, we discuss scaling for high volume, why workflows should be Postgres rows, a planner gotcha and ways to resolve planner issues.

To get the show notes as well as get notified of new episodes, visit: 
https://www.scalingpostgres.com/episodes/364-scaling-for-high-volume/

Want to learn more about Postgres performance? Join my FREE training called Postgres Performance Demystified here: https://www.scalingpostgres.com/courses/postgres-performance-demystified/

Episode Transcript

[00:00:00] There are some typical best practices to follow when you scale Postgres, but what I find interesting is when certain organizations choose to bend or break those rules to be able to scale their systems. So we'll talk about that today. Also, be sure to stay to the end for my consulting corner, where I talk about how to get around a case where the planner is clearly choosing a suboptimal plan. But I hope you, your friends, family and coworkers continue to do well.

Our first piece of content is "MerkleMap: Scaling certificate transparency with 100 billion rows of data." They have a system with about 20 terabytes of storage and over 100 billion unique rows. So not the largest system, but definitely not small. And they go through some of the optimizations they've done to it.

They are using ZFS as their storage, with the default record size of 128 kilobytes. If they kept the 8 kilobyte page size that Postgres has by default, that would mean a 16x overhead for many operations. So rather than decreasing the ZFS record size to 8 or maybe 16 kilobytes, they kept the 128 kilobyte ZFS record size and increased the Postgres block size as much as they could, to 32 kilobytes. So they're using a 32 kilobyte block size for Postgres.

The next thing they did is optimize for append-only operation. They tried to minimize updates as much as possible, which definitely helps Postgres run at its best, because the fewer updates you have, the less vacuum has to run. They did talk about some tweaks they made to the configuration, and they even described how they handle transaction ID wraparound: when the write pressure is so high that the transaction ID age is getting close to wraparound, they actually slow down ingestion in their application to let the database catch up, and then ramp ingestion back up again. So that's an interesting way to deal with that.

They did say many people suggest using COPY for ingestion, but they wanted the integrity checks of Postgres, so they're actually doing INSERT statements. I'm assuming they're doing multi-row inserts to make it more efficient, but I didn't see them mention that here. They wanted the database's integrity and consistency checks, etc. They also said they made some modifications to commit behavior to increase performance, using the commit_delay and commit_siblings parameters to essentially batch multiple commits together.

Now, normally people with very large databases partition, but they chose not to because they felt it brought more challenges. It does bring some challenges, but generally if you're deleting data, I would still advocate for partitioning. They also say they're very strict about the queries they run: everything uses an index to give optimum performance. They are using the LZ4 algorithm for ZFS compression. They tried Zstandard as well, but LZ4 offered better performance and storage savings for what they were looking for. And they said they have a resilient ingestion strategy: basically fail fast and retry often, so if something doesn't work, try it again. I didn't see that they're using an independent connection pooler. It looks like they're using Rust, and I think they're just using the connection pool in their Rust library, so I don't think they're using PgBouncer or some other solution. But I thought this was an interesting take on a slightly different way to scale Postgres, so be sure to check it out if you're interested. Oh, and I should say this is from blog.transparency.dev.
Next piece of content: "Why your workflows should be Postgres rows." This is from dbos.dev, and I thought this was an interesting pattern. Basically they have a scenario where, say, you're a utility tracking usage data. You want to calculate that data, prepare an invoice, bill the customer, send the receipt, and log the fact that it happened, and you have that workflow here. What they're advocating is that for each step of these workflows, you should actually insert a row into the database. That way you can very precisely track what stage different things are at, and even track input and output parameters for each of these steps. And this is per job.

In their first example, they have a workflow status table. It maintains a unique workflow identifier they call the workflow ID, the workflow name (in this case, billing workflow), a status, the input parameters for the workflow, the ultimate output of the workflow, and any errors. Now, when I saw the status field I got a little concerned, but basically this table just receives an insert and then one update at the conclusion of the workflow, whether it failed or succeeded. So it's not too bad; the status is not receiving a ton of updates, or at least it's not intended to.

A separate table actually tracks all the steps, and that table they happen to call operation outputs. It has the workflow ID referencing that unique workflow, the step ID for which step it's on, what step name was executed, the output from that step, and an error in case something happened. Now, what's interesting is that by recording the output, you can pick up at the next step by using that output as the input of the next workflow step. So that's pretty interesting. And you can see that here, where there is some issue with billing the customer: you can take the output from generate invoice, use it as the input for bill customer, and then the workflow should continue processing. They have some Python code explaining that here.

So this is an interesting pattern I hadn't seen before. I'm going to bookmark this to consider using it in my application development in the future, if I think it makes sense. A lot of times I try to make my jobs idempotent, so no matter how many times you run them, you'll still get the same result and they will automatically try to catch up. But this is a way to store where the state of things is at in the database in a more logical and consistent fashion. If you're interested, you can take a look at this blog post.

Next piece of content: "A PostgreSQL planner gotcha with CTEs, DELETE and LIMIT." This is from shayon.dev, and he had a query here where a CTE called deleted tasks does a DELETE from a task queue where an ID is in some set of IDs. The subquery selects the ID from the task queue where a group ID is some value, always limiting it to one and doing a FOR UPDATE SKIP LOCKED. So ideally this should pull one ID, delete that ID, get the returning item ID from the DELETE statement, and use it to select from deleted tasks.
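For reference, the query described is shaped roughly like this; the table and column names are my reconstruction from the description, not copied from the post:

```sql
WITH deleted_tasks AS (
    DELETE FROM task_queue
    WHERE id IN (
        SELECT id
        FROM task_queue
        WHERE group_id = 42        -- "some amount"
        LIMIT 1
        FOR UPDATE SKIP LOCKED
    )
    RETURNING item_id
)
SELECT item_id
FROM deleted_tasks;
```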
Now, what's odd about this is that the query planner was actually doing a nested loop semi join, and it ended up running the limit subquery five times — you can see five loops for the limit. So it was actually returning five rows, which is a little confusing to me, because I would assume this query would return one row, that row would be deleted, it would return the item ID, and then you could select from it. The way he got around it is he didn't use a CTE, he just used a subquery, but he says this only encourages the planner to evaluate the subquery once, so I don't know that it works 100% of the time. And what I find interesting is that normally, the way I've seen this pattern done, the CTE — the WITH statement — is the SELECT query that picks the one row, and then you DELETE using that, returning the item ID, so you don't necessarily need the final SELECT either. But if you still had issues with it, you should be able to materialize that query in an independent CTE so it will always truly only pick one row. Anyway, I thought this was an interesting blog post due to the seemingly strange planner choices, but let me know what you think.

Next piece of content: "Waiting for PostgreSQL 18 – Allow NOT NULL constraints to be added as NOT VALID." So this is awesome. We can now add a NOT NULL constraint as NOT VALID, so everything new coming in must be not null, and then once all the existing data in the table has been updated, you can validate that constraint, and that should happen relatively quickly. He goes through discussing that and looking at it here, so that's great. This is from depesz.com.

Next piece of content: "Orphan files after a PostgreSQL crash." This is from cybertec-postgresql.com. Whenever Postgres happens to crash — and he asks, well, when does it crash? It's very rare, but he mentions some cases here. Maybe the out-of-memory killer kills it because you didn't set memory overcommit properly. Maybe your pg_wal directory ran out of disk space, causing it to crash. He mentions that there are critical sections of processing where Postgres may escalate an error to a panic. There could also be faulty hardware, an administrator could inadvertently corrupt the database — although I think you'd have to do that pretty intentionally — or just a general software bug in Postgres, rare as they thankfully are. But once you have a crash, it's hard to identify files that need cleaning up, because the database itself is full of tons of files, and trying to go through all of those would take a very long time when bringing up the system. So he actually agrees that there shouldn't be an automatic cleanup mechanism.

But then how do you clean up these files? He says the safest way is to do a dump and restore of the database, because then you're just taking the logical database and bringing it over to a new database cluster. You could also use logical replication to do that. He says there's also a risky way to try to identify and remove these files, and personally I wouldn't want to touch that with a ten foot pole, but he does talk about ways you can try to identify these orphan files. Frankly, I would probably just leave them alone. But if you want to learn more, you can check out this blog post.

The next one, also from cybertec-postgresql.com, is "PostgreSQL trusted extensions for beginners," and this talks about what trusted extensions are versus regular extensions. It is basically just a flag that says trusted, and whether it's true or false.
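As a quick illustration of that flag — the extension names here are just common contrib modules, and this assumes PostgreSQL 13 or later, where the trusted mechanism exists:

```sql
-- See which available extensions are flagged as trusted
SELECT name, version, trusted
FROM pg_available_extension_versions
WHERE name IN ('pgcrypto', 'hstore', 'postgres_fdw');

-- A non-superuser who has CREATE on the current database
-- can activate a trusted extension themselves
CREATE EXTENSION IF NOT EXISTS pgcrypto;
```

Something like postgres_fdw, by contrast, is not trusted and still needs a superuser, so the query should show a mix of true and false.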
You can see this flag in the control file of an extension. What it's meant to mean is that the extension is considered safe enough that a general user — not a superuser, just someone with CREATE privileges on the database — can create, or essentially activate, this extension in the database. But he makes an important distinction: trusted does not mean the extension is bug free, and a bug could still cause Postgres to crash. It does not mean the code has been officially audited by the PostgreSQL core developers — that hasn't been done. And it does not mean that it's completely safe in every possible scenario. It's basically the author of the extension determining that it should be relatively safe for a general user to activate on their database. He has the quote here: "we believe that enabling this extension should not allow users to bypass security or harm the database server." Of course this is important for cloud providers, so people can enable particular extensions on a database. And literally, a trusted extension just has this trusted flag, so if you are a superuser or an admin of a database server, you can choose to make an extension trusted or not. But if you want to learn more, definitely check out this blog post.

Next piece of content: "Source code locations for database queries in Rails with Marginalia and Query Logs." This is from andyatkinson.com, and he brings up an important tool for your toolkit when managing your application and working with your database: being able to log code locations along with your queries. The way this is typically done is that a comment is added to a query being executed, giving you additional information about what part of the code executed a particular statement. He's using a Rails example here, and he says this has been built in since Rails 7.2, but you can also use the Marginalia gem, and I've seen this for other languages like Python, etc. In this Rails example, the comment shows the application, the controller it came from (which is part of the MVC pattern in Rails), what action it came from, as well as the exact line of the source code. So this really helps speed up resolution of issues you may encounter with your database and your application. Check this out if you want to learn more.

Next piece of content: "Taming large data sets in PostgreSQL: archiving and retention without the pain." This is from dataegret.com, and they have a few suggestions for how to manage larger Postgres sizes as your data grows. First, they say use partitioning — unlike the first blog post, which chose not to use it. But again, if you're going to be archiving data, it is much easier and less taxing on the system to just drop an old partition as opposed to running delete statements against it. Next, they say you can use compression to save space, particularly with cold data, and here they're advocating using the LZ4 algorithm for TOAST as opposed to the built-in pglz algorithm to give you a little bit more compression. Or, like the first blog post, you could compress the entire file system using ZFS. Next is to use compression when you're doing backups. They show doing it with pg_dump, but you could also do this with pg_basebackup, and they're just using Zstandard compression. The next option, for old data, is actually taking it entirely out of the cluster and putting it in another PostgreSQL database. And if you do that, you can use postgres_fdw, the PostgreSQL foreign data wrapper, to run queries against that other database cluster.
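Here is a rough sketch of that foreign data wrapper setup; the server name, credentials, and table definition are all hypothetical:

```sql
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

-- Point at the archive cluster that holds the cold data
CREATE SERVER archive_db
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'archive.internal', port '5432', dbname 'archive');

CREATE USER MAPPING FOR CURRENT_USER
    SERVER archive_db
    OPTIONS (user 'app_reader', password 'secret');

-- A foreign table that proxies queries to the archived data
CREATE FOREIGN TABLE measurements_archive (
    id          bigint,
    device_id   bigint,
    recorded_at timestamptz,
    reading     numeric
)
SERVER archive_db
OPTIONS (schema_name 'public', table_name 'measurements');

SELECT count(*)
FROM measurements_archive
WHERE recorded_at < now() - interval '2 years';
```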
Or you can store the data in CSV files and access them via the file_fdw foreign data wrapper, or even send the data to Parquet files on S3 or local storage and use the pg_parquet extension to access it. And if you are in the cloud, many of the cloud providers offer storage tiers that let you spend less money to store large amounts of data. But if you want to learn more, you can check out this blog post.

Next piece of content: "What really happens when you drop a column in Postgres?" This is from thenile.dev, and they had a riddle: we have a table with two columns, ID and name — what will happen if we add and remove a column 2,000 times? The answer is you get an error: tables can have at most 1,600 columns. So the column limit for a table in Postgres is 1,600 columns, and it's hit that limit. But why, when we only have two columns, because we kept adding and deleting over and over? Well, what happens is that a dropped column is not truly removed. It's just marked as dropped in the catalog, and its data still exists within the database files. This blog post examines that using the pg_filedump utility to actually see the data. But of course, if those columns are still there, how do you get rid of them? Vacuum removes rows that have been deleted, but it doesn't do anything for columns — that's the normal standard vacuum or autovacuum that runs. You can do a VACUUM FULL, because that does a full table rewrite and will remove the dropped column data from the table. But VACUUM FULL can be a problem because it locks the whole table during that rewrite, so you can't read or insert new data into it. Now, there are extensions that do online rewrites, like pg_repack or pg_squeeze, and a logical replication transfer works as well. But if you want to learn more, definitely check out this blog post.
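You can see that leftover column yourself with a quick look at the catalog; the table here is a hypothetical example:

```sql
CREATE TABLE demo (id bigint, name text);
ALTER TABLE demo DROP COLUMN name;

-- The dropped column still has a pg_attribute entry, renamed and marked dropped
SELECT attnum, attname, attisdropped
FROM pg_attribute
WHERE attrelid = 'demo'::regclass
  AND attnum > 0;

-- A full rewrite reclaims the space its values occupied in existing rows
VACUUM FULL demo;
```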
[00:16:13] Next piece of content: "Striping Postgres data volumes — a free lunch?" This is from kmoppel.github.io, and he was doing an experiment to see what kind of performance he would get by striping volumes. He used a 4 vCPU instance in Amazon, a c6g.xlarge, and he examined striping across 2, 4, 8 and 16 volumes — I think he's attaching that many volumes to the instance and then using RAID to stripe across them — along with several chunk sizes: 8, 16, 32, 64 and 128. Then he looked at various read performance tests. On a random key read, for example, you can see the TPS reach a new plateau with each stripe added, and the peak performance was close to 30,000 transactions per second across 16 stripes. Additionally, latency drops significantly as well, from about 16 milliseconds to close to 1 millisecond. The batched reads improve too, their latency drops, and even the full scan speed improves. For this particular test it looked like a chunk size of 32 worked the best, and the more stripes — the more volumes used — the better. He also compared the performance to provisioned IOPS, his local system, a single gp3 volume, and in-memory performance, but the 16 stripes held up pretty well and seemed much more economical from his examination. If you want to learn more, you can check this out.

The last piece of content is "Incremental backups in PostgreSQL 17." This is from stokerpostgresql.blogspot.com, and it's just a quick review of using incremental backups in Postgres 17. He has a link to a video by Robert Haas you can look at as well, and he goes through the different steps of setting up an incremental backup, so definitely check that out if you're interested.

Now it's time for my consulting corner. I was asked to take a look at a query across three tables that was much slower than anticipated. I took a look at it, and the planner was definitely making a mistake in the plan it was using; it was very inefficient. Basically, it was joining one table to an intermediate table to a third table, and the query was narrowing the result sets of the first table and the third table. Postgres decided to scan the index on the first table, scan the index on the third table, and then hash them together, along with doing a nested loop join for the intermediate table, to get the results. But that was actually the worst option, because the query on the first table only returned one to three rows, so it only had to deal with a few rows and then join to the intermediate table — maybe it picked up a few extra rows there — and then join to the final table, where the filter reduced the result set even more.

Now, I didn't see any way to simplify the query; it wasn't doing anything so exotic that I thought it might confuse the planner. So I assumed it was just a statistics issue, and when looking at the row estimate for that first table, the planner expected 256 rows, whereas it actually returned anywhere from one to three, maybe four rows at most. Once I altered the statistics on that column to collect ten times the number of samples, the plan did the join from the first table to the second to the third without trying to hash things together, and it ran super fast — it read about 10,000 fewer buffers compared to the original plan.

Another way you could have handled this is to use a materialized CTE to first run the query against that one table to get the one to three rows. Then the planner says: okay, run this materialized CTE — how many rows do I have? And if it's only one to three, it's very easy to say, all right, we'll just join these tables using the join columns, and don't worry about driving the plan from the other filter on the third table, since that's going to be a less efficient plan. So that's another way to have handled it, but hopefully you found that interesting.

[00:20:21] I hope you enjoyed this episode. Be sure to check out scalingpostgres.com, where you can find links to all the content mentioned, as well as sign up to receive weekly notifications of each episode. There you can also find an audio version of the show, as well as a full transcript. Thanks, and I'll see you next week.
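A quick footnote on the consulting corner item: here is a minimal sketch of the two fixes described. The table names, column, and values are hypothetical stand-ins, not the actual schema from the engagement.

```sql
-- Option 1: collect ten times the default number of statistics samples
-- on the overestimated column, then refresh the statistics
ALTER TABLE table_one ALTER COLUMN customer_id SET STATISTICS 1000;
ANALYZE table_one;

-- Option 2: force the tiny first-table result to be computed up front,
-- so the planner joins outward from those few rows
WITH narrowed AS MATERIALIZED (
    SELECT id
    FROM table_one
    WHERE customer_id = 42
)
SELECT t3.*
FROM narrowed n
JOIN table_two   t2 ON t2.table_one_id = n.id
JOIN table_three t3 ON t3.table_two_id = t2.id
WHERE t3.status = 'active';
```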

Other Episodes

Episode 68

June 16, 2019 00:16:02
Prepared Plans, Materialization, Recursive CTEs, Safe Migrations | Scaling Postgres 68

In this episode of Scaling Postgres, we discuss prepared plans, materialization, recursive CTEs and how to run safe database migrations. To get the show...

Episode 299

January 21, 2024 00:12:21
Build A GPT In SQL | Scaling Postgres 299

In this episode of Scaling Postgres, we discuss how you can build a GPT in 500 lines of SQL code, how to optimize extension...

Episode 233

September 18, 2022 00:13:43
Useful Features, Community Blogging, PG 15 Anticipation, Generic Plans | Scaling Postgres 233

In this episode of Scaling Postgres, we discuss little known useful features, a monthly community blogging initiative, PG 15 Anticipation, and getting generic plans....
