Checking Vacuum, Fast Distinct Count, Group By Speed Up, Health Check

[00:00:00] In this episode of Scaling Postgres, we talk about checking vacuum. Fast distinct count group up. Speed up in postgres health check. I'm Kristen Jamison and this is Scaling Postgres, episode 57. [00:00:21] Alright, I hope everyone is having a great week. So our first piece of content is Scaling PostgreSQL check your vacuum. And this is from Git. Engineering, I believe. Blog. And in it they're talking about using a table. And they were doing a lot of inserts, like 5000 requests per minute, 30GB inserted daily. And because it was mostly inserts and they didn't need some of the historical data in order to avoid void deletes, they went ahead and used partitions. So that's great, they can just drop the oldest partition whenever they want to. And they don't have to have the overhead of deletes because due to postgres's MVCC architecture, the deletes need to be vacuumed up. And they said, quote great, we had tables with daily partitions and we thought it all had it all figured out. Surprisingly though, after about two weeks in production, the response time suddenly and sporadically escalated to around 200 milliseconds from what was about 20 milliseconds. And basically they discovered that it was an issue with auto vacuum because even though they were inserting this data, apparently they were doing some updates and deletes that of course required auto vacuum to run to clean up the old dead rows. So basically they went with what they're calling the event store approach. So only doing inserts, keeping the data immutable, basically trying to avoid updates and deletes as much as possible. But again, they say, quote, like the previous solution, this ID was running just fine a couple of weeks and everything looked great. Unfortunately, about three weeks into this variation, we suddenly started experiencing the same issue as before, a sudden unexpected performance degradation. Now, the next thing they did is move to postgres Ten because they were on 9.6. And one of the reasons they did that is because 9.6 uses a callback function which actually inserts a row into the parent table and then deletes it, which essentially leaves a dead row and causes the vacuum process to kick off. Now, part of this is due to the fact that they were using an orm, and they say here since orms rely on PostgreSQL's returning statement for every database transaction, which can only be achieved by first inserting the row into a parent table. So there may have been another way around it, but basically they're saying they could choose to move away from the orm or use the native partitioning solution in version ten, which is what they went with doing. Unfortunately, yet again, they ran into a problem with auto vacuum and that is due to the anti wraparound mechanism. And of course this is due to postgres's MVCC architecture with determining who can see what rows are visible. And it uses this transaction ID, this XID, to be able to map what rows are visible and what rows are not. If you hit 2 billion in your XID, it can essentially shut down your database because it requires a manual vacuum to clean and freeze those rows so they aren't visible anymore. And what was happening is they were putting about 200 million rows into the table each day, which happens to be at what point a required auto vacuum kicks off to avoid a transaction ID wraparound. So how they solve this issue is they actually adjusted their auto vacuum freeze max age to 500 million transactions and that avoided the wraparound auto vacuum from kicking off. So basically, as they say here quote, we figured it out. So I thought this was a very interesting story about how they discovered and explained how auto vacuum works in different cases and for their use case what kind of issues they were running into. So, definitely a pretty great blog post for getting a story based education about using auto vacuum. [00:04:03] The next post is best way to count distinct indexed things in PostgreSQL. Now the TLDR is that when you're wanting to do a distinct count, this is the wrong way. So from sumtable do a count and then of the distinct text hashes in it. So what you'll get is a sequential scan even though you have an index on the text hash. And really what you want to do is select all the distinct text hashes as part of a subquery or CTE and then do a select count of all of those rows. Then it can use the index and return data much faster. In his example here, it returned the result in 5 seconds versus 40 seconds. So if you're needing to do distinct count of items, definitely keep this in mind. And he also references this blog post from Citizens Data that you can examine that kind of goes into more detail about it. [00:04:59] The next blog post is speeding up group by in PostgreSQL. Now, they describe this as a way to squeeze the last couple of percentage points out of a query so to basically squeeze a bit more performance and they classify it as a small optimization, but it can help. So basically they created a table and generated some data within it looks like about 10 million rows. And then they ran a simple aggregation, doing an average of a numeric column and grouping it by two other columns and it will do a sequential scan and it says PostgreSQL will read the entire table sequentially and perform a hash aggregate which is where it spends most of its time in the hash aggregate. And basically the only change is reordering the group by putting the value with the most distinct values first, grouped by the less frequent values later. And that will run the hash aggregate algorithm more efficiently. And here he shaved about 400 milliseconds off of what was a 3300 millisecond query, so definitely more than a 10% improvement. Now, what's interesting is that they mentioned it's very highly likely that future versions of PostgreSQL will already do some of this change automatically. But in the meantime it's just something to keep in mind if you want to make your queries as fast as possible. So definitely a blog post to check out. [00:06:22] The next post is a health check playbook for your postgres database and this is from Slitistdata.com and basically it's a playbook for you to what to consider and examine to maintain the performance of your database. The first one is the cache rules everything around me which is basically monitor your cache hit ratio and try to maintain it at 99% by doing things to maximize how your cache is being hit. This could be getting more memory for the server, this could be getting rid of unused indexes. Whatever you would need to do to try and maximize the amount is done using memory only and not having to go to disk. [00:07:00] The next consideration is beware of dead tuples. So basically keep on top of your auto vacuum settings to make sure that your Bloat is as small as possible as well as helps you avoid auto vacuum wraparound issues. [00:07:15] And as you can see here, have different queries that you can run to determine the state of the database for these different things. And then they talk about over optimizing is a thing where they're talking about periodically you want to get rid of unused indexes because they will slow down your insert update delete performance as well as cause you use more disk space and potentially might reduce your cache hit rate. And then of course they consider checking on your query performance. So using things like PG stats statements to understand what queries are running against your database, you could also turn on the slow query log so you can see what queries are running slowly in your database and get a sense of what queries you need to improve to improve your overall database performance. So generally pretty good review of the things you should keep on top of with regard to maintaining your postgres database. [00:08:04] The next post is simply called Indexes in PostgreSQL for Btree and this is from Habr.com and I believe it is from a previous post on Postgres Ru and it's been translated from this site which is in Russian. So this is the English version of it and it really goes into a comprehensive discussion of Btree indexes from how they work to sorting to how to handle order of columns and nulls and even mentioning things such as covering indexes in postgres eleven and what's more than that. This is actually the fourth post with regard to this and they have number one was on the indexing engine of Postgres, the interface of access methods and the hash index. So if you want to learn more about PostgreSQL and its indexing, these are definitely a series of blog posts to check out. The next post is waiting for PostgreSQL twelve reindex concurrently. Now, right now there is the reindex command, but it doesn't really work concurrently. So usually that means large databases or any active database can't really use it. They have to actually do a create index concurrently and then drop the index concurrently. And then if you want to maintain your naming scheme, then you have to alter the index to give it a new name. But with this patch, presumably you can do a reindex and it even works on primary keys. So this looks to be a great addition in PostgreSQL twelve. [00:09:38] The next post is loading data into PostgreSQL and this is from a presentation that happened at Pgcomf 2019. I'm not quite sure which one, but this is on the PostGIS US site and it's a PDF presentation and it talks about different forms of loading data into postgres. So from using the copy to copy, from talking about loading data into large binary objects, using foreign data wrappers to load data in. So you can connect to say, an external postgres database or Oracle database to load data in, as well as just reading directly from it, or even using the foreign data wrapper file to be able to pull data in from a file. And they also covered the Http extension. If you want to reach out and contact an Http site to pull data into postgres, that's covered as well as some command line tools such as a PG loader and some of the others mentioned here. So if you have a need to load data into PostgreSQL, definitely a presentation to check out. [00:10:39] The next post is where and when you need root CRT file. And this is from secondquadrant.com blog and this is a discussion, this is a very short post, but it's talking about TLS, which is basically SSL. And in setting that up with PostgreSQL, it talks about the cases when you would need a root cert file. Basically if you want to do some things to authenticate, you're contacting the right server and where and when you would need this root CRT file. So if you use SSL connections or TLS connections with your PostgreSQL networking setup, definitely a blog post to check out the next post is GeoJSON features from PostGIS. So again, this is about PostGIS and something I wasn't familiar with until I looked at this post. And GeoJSON, it's basically a standard to convey geometry or geographic information within JSON. And it says here quote, with a little scripting, it's possible to make a passable function that converts the rows from postgres into GeoJSON. So he sets up a simple table here. He shows how you can convert it to JSON B from just querying the table. And then he shows this function that he developed. It actually converts it into the GeoJSON format. He says, quote now we can turn any relation into a proper GeoJSON feature with just oneish call. So he shows you how to use it. So if you use PostGIS and want to convert things to GeoJSON, definitely a blog post to check out. [00:12:14] The next post is Text Search, a custom dictionary to avoid long words. So this is about text search and this is actually very in depth post that involves looks like a little bit involved working with C code. So it's definitely some low level stuff he's talking about. With Text search, basically text is converted into TS vectors and normally when you're converting into TS vectors you're going to be reading some sort of a document. He says what about undesirable pieces of text? So he says quote, for instance, when indexing mail messages, badly formatted messages may have base 64 contents that slip into text parts and that these can be pretty big. So basically he wants to avoid these types of super long terms when converting to a TS vector. One solution that he mentions here is filtering by length and he mentions several dictionaries that exist in the PostgreSQL source code that you can use as a reference to build your own custom index to do this. And here's where they get into the different C functions for kind of setting this up and then how you could potentially use this. So if you have a need for this use case, this is an interesting blog post to check out. [00:13:28] The last post is PG Backrest Archiving Tricks, and this post specifically covers asynchronous archiving of the wall files and it says with the PG Backrest info command for performance reasons, PG Backrest doesn't check that all the needed wall segments are still present, whereas Checkpg Backrest is built for that. And he covers setting up PostgreSQL PG Backrest and ways to potentially avoid that. So if you use PG Backrest, this is definitely a blog post I suggest checking out. [00:14:01] That does it. For this episode of Scaling Postgres, you can get links to all the content mentioned in the show notes. Be sure to head over to where you can sign up to receive weekly notifications of each episode. Or you could subscribe via YouTube or itunes. Thanks.

Show Notes

Episode Transcript

Other Episodes

Episode 261

Easy Foreign Data Wrappers, JSONB Cheatsheet, Updating Cost Limit, Parallel Aggregate | Scaling Postgres 261

Episode 358

Postgres Conference 2025 | Scaling Postgres 358

Episode 80

Just Upgrade, Vacuum Updates, Connection Strings, Postgres Showcase | Scaling Postgres 80

Checking Vacuum, Fast Distinct Count, Group By Speed Up, Health Check | Scaling Postgres 57