Managing Autovacuum, Better JSON, Avoiding Updates, OS Tuning | Scaling Postgres 168

Episode 168 June 06, 2021 00:16:06

Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, we discuss managing autovacuum, better JSON in Postgres 14, how to avoid redundant updates and operating system tuning.

To get the show notes as well as get notified of new episodes, visit: 

 https://www.scalingpostgres.com/episodes/168-managing-autovacuum-better-json-avoiding-updates-os-tuning/


Episode Transcript

[00:00:00] In this episode of Scaling Postgres, we talk about managing autovacuum, better JSON, avoiding updates, and OS tuning. I'm Creston Jamison, and this is Scaling Postgres, episode 168.

[00:00:21] Alright, I hope you, your friends, family and coworkers continue to do well. Our first piece of content is "Enabling and disabling autovacuum in PostgreSQL". This is from cybertec-postgresql.com, and they're talking about more than just enabling and disabling; they also cover a bit of how vacuum works. The three main things that vacuum takes care of are: first, creating statistics for the Postgres optimizer; second, cleaning out dead rows from a table; and third, handling wraparound protection.

In terms of creating statistics, the query planner uses statistics gathered from the tables to build the most optimized plan for answering queries, and updating those statistics is one of autovacuum's jobs. You can run a separate ANALYZE command, but autovacuum does it as well. They mention that you can tweak the frequency at which it runs, either for the whole database or at the table level, with configuration parameters such as autovacuum_analyze_threshold and autovacuum_analyze_scale_factor.

Autovacuum also cleans out dead rows. Anytime you run an update or delete, Postgres doesn't remove the row immediately, because it uses multiversion concurrency control (MVCC): it keeps old row versions around so other users of the system can still access that data until it's no longer needed by any transaction. Therefore an update or delete merely marks rows for deletion, and at some point later they are actually removed. That removal is handled by vacuum or autovacuum.

The next area autovacuum covers is handling transaction wraparound. Every transaction in the system is identified by an ID, and that ID space wraps around: every new transaction gets a new ID, moving forward, but there is roughly a 2 billion limit, and you don't want to hit it. So vacuum goes in and freezes old transaction IDs so they can be reused in the future. This wraparound issue needs to be addressed, and that's another thing autovacuum does.

Now, they note you can enable or disable autovacuum on a per-table basis by setting the storage parameter autovacuum_enabled to off or on, but you generally don't want to do that. They say there's really only one use case where it makes sense, and that is to avoid some autovacuum overhead on a temporary, throwaway table you're creating just to process some data. Since you're just creating the data, storing it elsewhere and then getting rid of the table anyway, you could turn autovacuum off there.

The last thing they cover is why autovacuum is so slow. Generally, when you run a manual vacuum it runs pretty quickly, whereas autovacuum runs much slower. The main reason is autovacuum_vacuum_cost_delay: in versions before 12 it was set at 20 milliseconds, and for versions 12 and higher it's set at 2 milliseconds, but for a manual vacuum it's essentially zero. The other factor that could help improve performance is increasing the cost limit, which controls how many actions, each having a particular cost, can be executed while a vacuum is running. Extending that limit or reducing autovacuum_vacuum_cost_delay should speed up your vacuums. So if you want to learn more, you can check out this blog post.
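As a rough sketch of the settings being discussed (the table names here are hypothetical examples, not from the post), per-table autovacuum behavior is controlled through storage parameters, while the delay and cost limit are ordinary configuration parameters:

    -- Make autovacuum run more often on a hot table (defaults are 0.2 and 0.1)
    ALTER TABLE accounts SET (
        autovacuum_vacuum_scale_factor  = 0.05,
        autovacuum_analyze_scale_factor = 0.02
    );

    -- Disable autovacuum only for a throwaway staging table
    ALTER TABLE staging_load SET (autovacuum_enabled = off);

    -- Illustrative cluster-wide changes: lower the delay, raise the cost limit
    ALTER SYSTEM SET autovacuum_vacuum_cost_delay = '1ms';
    ALTER SYSTEM SET autovacuum_vacuum_cost_limit = 2000;
    SELECT pg_reload_conf();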
[00:04:08] The next piece of content is "Better JSON in Postgres with PostgreSQL 14". They're talking about the new feature of being able to use subscripting with JSON. They also cover the fact that Postgres supports JSON with two data types: one is the json data type, the other is jsonb. In general, you always want to use jsonb, because it enables indexing and it compresses the data, getting rid of whitespace in the JSON. If you want to preserve exactly how the JSON was received, then you could use the json data type; otherwise, just use jsonb. Now, before, when you wanted to query a table and pull information out of a JSON column in the WHERE clause, you had to use the arrow operators to extract the exact data you wanted. But now, with subscripting, it's much cleaner, particularly for people who are used to Ruby or Python or other languages that do something similar, and you can easily update a single value within a JSON document with an UPDATE statement. So this is far superior to the old arrow operators. The key thing you still need to do is index it for better performance. You can index jsonb columns using a GIN index, but then you have to query slightly differently and use the contains operator. I haven't tested it yet, but it looks like you can't necessarily use subscripting and get the indexes to work in some cases, so maybe run an EXPLAIN ANALYZE if you're trying it out with subscripting. This is another post from Crunchy Data discussing the new JSON subscripting allowed in Postgres 14.
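To make the before-and-after concrete, here is a small sketch (the shirts table and its details column are hypothetical, not necessarily the post's example):

    -- Pre-14 style: arrow operators, ->> returns text
    SELECT * FROM shirts WHERE details->>'color' = 'neon yellow';

    -- Postgres 14 subscripting; the right-hand side is a jsonb value,
    -- so the string keeps its double quotes
    SELECT * FROM shirts WHERE details['color'] = '"neon yellow"';

    -- Updating a single key in place
    UPDATE shirts SET details['color'] = '"purple"' WHERE id = 123;

    -- A GIN index is used with the contains operator; verify subscripted
    -- queries with EXPLAIN ANALYZE, since they may not use the index
    CREATE INDEX shirts_details_idx ON shirts USING gin (details);
    SELECT * FROM shirts WHERE details @> '{"color": "neon yellow"}';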
[00:05:47] The next piece of content is "PostgreSQL built-in trigger function to speed up updates". This is from fluca1978.github.io. Now, I don't quite agree with the post's name for what it discusses: it's about a built-in trigger function called suppress_redundant_updates_trigger(), and that's exactly what this post covers. It's a way to avoid essentially redundant updates of a value. For example, if you update pgbench_accounts and set the filler column equal to itself, every update is entirely redundant; you don't need to do it. But if you run that command, you essentially double the number of row versions, because, as we were discussing with MVCC or multiversion concurrency control, an update creates a new row version rather than changing the old one, which then needs to be vacuumed away. So it doubles the size of the table while completing no real work, and you can see it updated 10 million rows here. However, if you use this built-in trigger function, creating a before-update trigger on pgbench_accounts that executes it for each row, then when you run the same update, zero rows are updated. In his timings, depending on how bloated the table was, he saw maybe a 10% to 60% performance impact from the trigger, but the great thing is it does essentially no work and creates no additional row versions that then need to be vacuumed up.

He goes on to do some further analysis and even implements the same behavior using normal triggers, but this internal trigger function is much more efficient. So this is a great tool to avoid redundant updates that may be happening from your application, and if you want to learn more about it and how it works, definitely check out this blog post.
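A minimal sketch of the setup described (pgbench_accounts and its filler column come from pgbench):

    -- suppress_redundant_updates_trigger() is the built-in function discussed;
    -- it skips any UPDATE that would leave the row unchanged
    CREATE TRIGGER skip_redundant_updates
        BEFORE UPDATE ON pgbench_accounts
        FOR EACH ROW
        EXECUTE FUNCTION suppress_redundant_updates_trigger();

    -- This now reports UPDATE 0 and creates no dead row versions
    UPDATE pgbench_accounts SET filler = filler;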
[00:07:38] The next piece of content is "Tuning Debian/Ubuntu for PostgreSQL". This is from enterprisedb.com, and they're talking about settings you should make at the OS level to fully optimize PostgreSQL performance. Now, I haven't necessarily set all of these in my installations, but we'll cover some of what they discuss. One part is kernel-level tuning using sysctl: changing how readily the system swaps, how dirty pages are flushed, turning off TCP timestamps, and a few others. They also discuss disabling transparent huge pages, as well as enabling and configuring huge pages. So if you want to get the most performance out of your hardware, definitely check out this post from enterprisedb.com.

[00:08:29] The next piece of content is "Monitor PostgreSQL performance using a database health dashboard". This is from arctype.com, and a lot of this post does talk about Arctype and using it as a dashboard tool. But what I found more interesting is that the post includes the queries you can use to get the same information to present on a dashboard, so you could just run these queries against Postgres and build your own dashboard using another tool. For example, they show the query they use to get a list of all the tables in the database, how you can get row counts in a given schema, how you can check for open connections, how to get the size of your database, how to calculate cache hit ratios, how to scan for unused indexes, and a way to inspect database caches. So this post is a great one for covering the different ways to collect metrics about your database using plain SQL queries. If you're interested in that, you can check out this post.
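As one example of the kind of dashboard query mentioned, a common way to compute the table cache hit ratio from the statistics views looks like this (a sketch, not necessarily the exact query from the post):

    -- Fraction of table block reads served from shared buffers rather than disk
    SELECT sum(heap_blks_hit)::float
           / nullif(sum(heap_blks_hit) + sum(heap_blks_read), 0) AS cache_hit_ratio
    FROM pg_statio_user_tables;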
[00:09:34] The next piece of content is "postgres_fdw enhancement in PostgreSQL 14". This is from percona.com, and they're talking about all the different enhancements that have been made to the foreign data wrapper for Postgres. Again, they're moving toward basically a scale-out scenario: new features keep being added to the foreign data wrapper because it's envisioned as the method for scaling out databases. The first performance feature they mention is being able to run parallel, asynchronous scans against foreign servers, and they discuss how you can implement this. They mention bulk inserts, which came up in last week's episode of Scaling Postgres, where you can insert rows in bulk more easily. You can now run the TRUNCATE command against a foreign table. You can use LIMIT TO when importing a foreign schema to bring in child partitions. And you can see which foreign data wrapper connections exist, define whether to keep them, and reestablish broken connections. So if you want to learn more about the foreign data wrapper features being added in Postgres 14, definitely check out this blog post.

[00:10:45] The next piece of content is "Postgres 14 highlight: REINDEX TABLESPACE". This is from paquier.xyz. He's talking about a new feature coming in 14 where you can do a REINDEX and, at the same time, choose a different tablespace for those indexes to go to. This is great if you're running out of disk space on, say, your primary tablespace and you want to set up a second one. The number one thing that I mostly do in that situation is recreate indexes on that secondary tablespace to reclaim space on the primary one. Well, with this you could just run a REINDEX CONCURRENTLY command, specify the other tablespace, and it will move the indexes for you, which sounds much more convenient. So if you want to learn more about this patch, definitely check out this post.

[00:11:31] The next piece of content is "Dropping support for version 2 protocol". This is from pgsqlpgpool.blogspot.com, and he's talking about how, since PostgreSQL 7.4, Postgres has used the version 3 protocol for clients talking to servers. Through all this time, they've still been supporting version 2, the one prior, but with 14 they're actually removing version 2 and supporting only version 3. So basically the thing to be aware of is that if you have a really old client talking to Postgres and you're upgrading to 14, you may want to update that client as well, because the version 2 protocol is going away. But if you want to learn more, definitely check out this blog post.

The next piece of content is "Implementing Incremental View Maintenance for PostgreSQL (part two)". This is a case where they want a dynamically updated materialized view, so you don't have to refresh it; it refreshes automatically, and they're calling it incremental view maintenance. So basically, when one of the underlying tables gets updated, the materialized view is automatically updated as well. Now, they had relatively good performance in the earlier work with, I think, a single table and basically an OLTP workload. In this one, they're looking at multiple tables: I think they're joining six tables and seeing what the performance is like. While performance as they added more connections was pretty good without the materialized view, unfortunately it tanked when they started using their incremental view maintenance, because it has to take locks to apply the updates to the materialized view, particularly when multiple tables are involved. So that's not great, but it's not unexpected; you can't really bend the laws of physics to get something for nothing. This is a feature in progress, and they're going to continue working on it to see what improvements they can make to this potential new feature of incremental view maintenance. So if you want to learn more, definitely check out this blog post.

[00:13:38] The next piece of content is "Getting started with pg_bulkload". This is from highgo.ca, and they're talking about bulk loading data using the pg_bulkload tool. Now, typically I just use PostgreSQL's COPY; I haven't needed to get every single bit of performance out of it. But if you're looking for that, maybe you want to consider a tool like this, because this post covers how to install it, get it working, and import data.

The next piece of content is "Disaster recovery strategies for PostgreSQL deployments on Kubernetes". This is from b-peng.blogspot.com, and they're talking about deploying Postgres clusters on Kubernetes to achieve disaster recovery and high availability. For that purpose, they're using the Crunchy Data PostgreSQL Operator, and they have a diagram of the two clusters they created, presumably in different availability zones, and they go through all the commands they ran to set it up and how it works. So if you're interested in doing this, definitely check out this blog post.

The next piece of content is "PostGIS at 20, the beginning". PostGIS is 20 years old as of May 31, and this blog post goes into its history and how it evolved and developed. So if you're interested in that historical context, definitely check out this blog post from cleverelephant.ca.

The next piece of content is "pgSCV - metrics exporter for PostgreSQL (and not only)". This is from lesovsky.medium.com, and this is a new metrics exporter for Prometheus. So if you use Prometheus for monitoring Postgres, maybe you want to check out this exporter as an alternative to the postgres exporter, because he also mentions that it supports PgBouncer as well as Postgres. If you want to learn more about this tool, definitely check out this blog post.

[00:15:29] And the last piece of content: the PostgreSQL person of the week is Nikolay Samokhvalov. So if you're interested in learning more about Nikolay and his contributions to Postgres, definitely check out this blog post.

That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.

Other Episodes

Episode 145

December 20, 2020 00:14:06

PgMiner Botnet, Collation Index Corruption, postgresql.conf, Custom Data Types | Scaling Postgres 145

In this episode of Scaling Postgres, we discuss the PGMiner botnet attack, how collation changes can cause index corruption, managing your postgresql.conf and implementing...


Episode 36

October 29, 2018 00:10:51

Sharding, Giving Back, SCRAM, Partitioning Evolution | Scaling Postgres 36

In this episode of Scaling Postgres, we review articles covering PG11 sharding, giving back, SCRAM authentication and the evolution of partitioning. To get the...


Episode 115

May 25, 2020 00:12:50

PGCon Online, Performance Tips, Tracking Counts, Essential Monitoring | Scaling Postgres 115

In this episode of Scaling Postgres, we discuss PGCon going online, application performance tips, ways to track counts and essential areas to monitor. To...
