Unconventional Advice? | Scaling Postgres 379

Episode 379 | August 17, 2025 | 00:19:03

Scaling Postgres

Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, we discuss when you should reindex, how to handle case-insensitive data, how to index JSONB, and the top recommendations when doing performance optimizations.

To get the show notes as well as get notified of new episodes, visit: 
https://www.scalingpostgres.com/episodes/379-unconventional-advice/

Want to learn more about Postgres performance? Join my FREE training called Postgres Performance Demystified here: https://www.scalingpostgres.com/courses/postgres-performance-demystified/

Episode Transcript

[00:00:00] So there's not as much content this week. It may be due to people taking vacations, so this episode may be a little shorter than normal. But it is interesting, and I think there are some blog posts with some unconventional ideas, and I'll probably contribute some of my own as well. [00:00:19] Also, be sure to stay till the end for my Consulting Corner segment, where I talk about some of the common things I've seen in some of my performance optimization engagements. [00:00:29] But I hope you, your friends, family and co-workers continue to do well.

Our first piece of content is "Should I rebuild my PostgreSQL index?" This is from cybertec-postgresql.com, and this blog post is a reaction to people asking if they should rebuild their indexes regularly because they have some concern about fragmentation or maybe bloat. Most of this post talks about B-tree indexes, but of course there are other types of indexes and other reasons you may want to reindex those. But first he talks about the B-tree index and the importance of having densely packed pages. Now he does mention that by default indexes have a fill factor of 90%, which means it reserves a 10% buffer, or you could call that bloat, within the index to allow for more insertions, because a B-tree is ordered and it is beneficial to have some padding within the index to account for new index entries. But in terms of performance, he says performance is inversely proportional to the number of levels, or the depth, of the index. So basically, the fewer levels the better. Now what determines the number of levels? Well, since your index page is 8 kilobytes in size, and that doesn't grow or get smaller, the question is how many entries you can put into that 8 kilobyte page. Which implies, say, a 4-byte integer index will be a lot faster than a multi-column index with two dates and a bigint, for example, or even UUIDs. So a more densely packed index is more performant. That also means, of course, that bloat can hinder the performance of the index, so you don't want to let an index get too bloated. Now he does say that when accessing a single row with a B-tree index, bloat is relatively negligible; where it can really impact performance is when you're doing range scans. So if you want a whole set of data pulled by that index, as it's doing that range scan it has to skip over these bloated areas, and you lose the advantage of the operating system kernel potentially prefetching disk blocks, which can really improve the performance of index scans. Now he says that as a table is being actively worked on, it's really common for the average leaf density of a B-tree index to settle around 70%, but his opinion is that's not necessarily a reason to rebuild the index; you should keep track of it and monitor it. And how do you monitor it? You use the pgstattuple extension and its pgstatindex() function, and he shows an example of that here. So he created a table, created an index, created the pgstattuple extension, and now he can run pgstatindex() against a particular index and look at, say, the average leaf density, which is 90% for a new index, as well as how much fragmentation exists. So as a test, he deleted a bunch of rows and then checked it again, and nothing happened; you have the same value. That's because nothing has been vacuumed up yet. But after vacuum runs, you can see the average leaf density dropped to 60%.
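As a rough sketch of that monitoring workflow (the index name here is just a placeholder):

CREATE EXTENSION IF NOT EXISTS pgstattuple;

-- avg_leaf_density starts near the fill factor (90% by default)
-- and drops as rows are deleted and later vacuumed away
SELECT avg_leaf_density, leaf_fragmentation
FROM pgstatindex('my_table_pkey');

-- if the density keeps trending down over time, rebuild without blocking writes
REINDEX INDEX CONCURRENTLY my_table_pkey;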
[00:03:39] So his advice is: definitely keep track of your bloat, and if it continues to increase, so there's a pattern of it continuously growing higher, then you should probably consider reindexing. I know personally I had one client I worked with a number of years ago that was actually running into disk space issues, and due to the level of bloat they had in their indexes, we decided to do weekly reindexing jobs. That really helped knock down the disk usage for this particular application, given how much data was being updated, deleted and inserted in these tables. I don't necessarily advocate that, but it happened to work in this case. One thing he does mention is that indexes have a tendency to become more bloated than a table, because in a table you can insert data into relatively any free space, whereas in an index an entry has to go into a particular place to keep the index order. So indexes do tend to become more bloated. But if you want to learn more, you can check out this blog post.

Next piece of content: there was another episode of Postgres FM last week. This one was on case insensitivity, and they were basically talking about how you deal with something like email addresses or maybe usernames in your system and how you handle case differences. A lot of times you want to keep usernames unique and you want to keep emails unique, but if you use just a text or character varying data type, different casings of the emails or usernames would be considered different, and ideally you'd probably want to treat those as the same. So the question is, how do you handle this? A lot of places I know use expression indexes, so they create a unique index on lower(email) or lower(username). That is a way to ensure you have case-insensitive unique emails or usernames. Now the disadvantage of that is every time you go to query this column, you have to include lower(email) in the query, so that's potentially spread all over the application when you go to query that, and I think Michael was saying that generally makes your code pretty ugly. Another alternative they talked about is citext, or case-insensitive text. That's an extension you can install in Postgres, and it provides a text data type that is case-insensitive. Now the irony, they said, is that when you look at that extension, it actually advocates not using the extension but using case-insensitive collations. [00:06:08] So you can use that extension and that data type if you want. I personally don't have any experience with it; I haven't used it. I've mostly seen people using expression indexes on those columns I was talking about, with just a general text field. But I also haven't run into anyone using collations, which appears to be the more elegant way to deal with it. But even then, Nikolay and Michael got into talking about whether this could potentially become corrupted when you go to upgrade your database, because what if glibc versions change or ICU versions change? Could your collations be corrupted? That's just another thing to worry about. They didn't mention it, but I wonder if you could use the built-in collation provider as well. That came in Postgres 17, so it would potentially make this easier to deal with. But they did reference a blog post on setting up a case-insensitive collation, because Django actually removed citext support, and that blog post explains how to use case-insensitive collations instead.
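A minimal sketch of the collation approach, following the pattern in the Postgres documentation (the collation and table names here are just placeholders, and this assumes an ICU-enabled build):

CREATE COLLATION case_insensitive (
  provider = icu,
  locale = 'und-u-ks-level2',  -- ignore case differences when comparing
  deterministic = false
);

CREATE TABLE users (
  email text COLLATE case_insensitive UNIQUE
);

-- 'Bob@Example.com' and 'bob@example.com' now compare as equal,
-- so the unique constraint treats them as duplicates
SELECT * FROM users WHERE email = 'Bob@Example.com';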
Now you can definitely do that, but I personally have taken a different route. I basically asked, okay, how many ways is this email going to be updated? [00:07:20] And frequently, if you're talking about an object-oriented language, that is just a single model or a single class that has the responsibility for updating an email. So basically I force emails or usernames to be lowercase at that point. Now I know that doesn't safeguard everything, and someone could do a bulk load of email addresses into the system that would bypass that kind of protection, but you could put an audit on it to check for that, or maybe some sort of automated test. But up to this point, that's frankly how I've dealt with it. If you want to learn more, you can definitely listen to the episode here or watch the YouTube video down here.

[00:07:59] Next piece of content: Indexing JSONB in Postgres. This is from crunchydata.com, and he's talking about how you would index the JSONB type. The first type of index he focuses on is the GIN index. He says they are great for querying JSONB when you're looking to see whether it contains something, so you can use the containment operator, the at sign and the greater-than symbol (@>). You just create your index using GIN on the particular column you want, and he says you can even create partial indexes with GIN as well. Now you need to be aware that only certain operators work with a GIN index when you're working with JSONB. You can use the containment operator, you can use key existence (does a key exist?), and you can do an any-key match or an all-keys match. But there are some operators that don't. He says when you're using path-based navigation to try to find values, that won't use a GIN index, and when you're trying to do comparisons within the JSONB, that won't work either, nor will regex patterns. Those can't use the index; queries have to use the containment operator or the existence of certain keys. He also brings up, like the first post, the importance of reindexing when needed and of using the pgstattuple extension to monitor bloat, because he says you can really get a lot of bloat if you're frequently updating large JSONB columns. [00:09:26] Now another alternative for working with JSONB is actually using standard B-tree indexes, but essentially as expression indexes, so you can target very specific segments of the JSONB and check a given value. But if you want to learn more, you can definitely check out this blog post.

[00:09:44] Next piece of content: Bypass PostgreSQL catalog overhead with direct partition hash calculations. This is from shayon.dev, and he was setting up hash partitioning, and he actually set up two levels of hash partitioning. So for example, he was querying an events table and wanted to choose a partition based upon the user ID and event type, so he created two levels of partitioning, and he was getting some poor performance from that. [00:10:15] Now, at first I thought, well, is it the hash calculations? But he also says the catalog lookup was taking some time as well. But he said what you can also do is directly target the partition if you know the hashing algorithm. So, for example, as opposed to querying the parent table, you already know which partition the data is stored in, so you can go ahead and query that partition directly. [00:10:40] So he says, quote, this completely bypasses the PostgreSQL catalog traversal and you have no hash function calls and no hierarchy navigation.
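As a rough sketch of what that direct targeting looks like (a single level of hash partitioning is shown here for brevity, and the table, column and partition names are hypothetical):

CREATE TABLE events (
  user_id    bigint,
  event_type text,
  payload    jsonb
) PARTITION BY HASH (user_id);

CREATE TABLE events_p0 PARTITION OF events FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE events_p1 PARTITION OF events FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE events_p2 PARTITION OF events FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE events_p3 PARTITION OF events FOR VALUES WITH (MODULUS 4, REMAINDER 3);

-- Normal query: the planner resolves the right partition through the catalog
SELECT * FROM events WHERE user_id = 12345;

-- Direct query: the application has computed the same hash Postgres would use
-- and knows (for illustration) that user 12345 lives in events_p1
SELECT * FROM events_p1 WHERE user_id = 12345;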
Of course, the question is, if you're doing this, do you even want to use PostgreSQL's partitioning? Because if you're essentially doing the work in the application already, you could devise your own partitioning scheme. So I don't know if I would necessarily advocate that. But he did create a Ruby gem that mimics the hash functions Postgres uses. Right now it only supports partitioning on integers; it doesn't support text strings or UUIDs or other data types for hashing. Now, he says you can also call these PostgreSQL hashing functions directly, and he showed an example of that here. He's claiming that it's 20% faster to do the calculation in Ruby, which I was a little suspicious of because, you know, Postgres is written in C, and C is a lot faster than Ruby. But he says, quote, the benefit here is from eliminating network round trips. So presumably the slowness is due to the application contacting the database, doing the hashing there, coming back to the application, and then sending the query to the right table, which seems like an awful lot of work to avoid PostgreSQL doing the partition targeting calculations. [00:12:03] So I'm not sure I would necessarily do this. I mean, there is one advantage to doing it in the application: any sort of calculation can be spread across 20, 40 or 100 application servers, whereas generally you only have one big database server. So if you can spread the compute to all of those application servers, then I can understand that. But I really wonder how much of a win this was. But if you want to learn more, definitely check out this blog post.

[00:12:34] Next piece of content: PG Lord of the Ring. This is from mydbanotebook.org, and she's expressing frustrations with using the various client tools that Postgres offers. I'm assuming she means psql, pg_dump, pg_dumpall, pg_basebackup, all the different utilities that Postgres offers. What was bothering her is that there was inconsistency and even incompatibility among the flags. For example, sometimes you need the -d flag for the database name, other times it's optional. And I'm right there with her. I've experienced this, and I'm frustrated about it at times, because I'm like, I can't use that flag with this one, but I can use it with this one. She did say she raised the issue with the community but got a lot of pushback because of backward compatibility. [00:13:24] Frankly, I think it would be a good thing to bring all the different client tools into alignment, using certain flags for certain functions, target those changes for a certain version maybe two to three years down the road, and emit deprecation warnings if flags that are going away or changing are used. I think there's a way to do it, but I don't know if people in the community have interest in doing that. We'll have to see. [00:13:50] But she actually bypassed that and created her own tool she calls PG Lord of the Ring. So basically she's trying to create a consistent tool with consistent flags that you can use with Postgres, and you can check out the tool and the documentation down here.

[00:14:06] Next piece of content: pg_stat_monitor needs you, join the feedback phase. This is from percona.community, and they have an open source monitoring tool called pg_stat_monitor. They are looking for feedback and for people to join the open source project to continue its development. So if you're interested in that, you can check this out.

[00:14:25] Next piece of content: Lies, Damn Lies, and LLM Output.
This is from thebuild.com, and he's expressing his frustration over a blog post that was released saying, oh, look at all the different features coming in version 18 of Postgres. So he took a look and said, oh, oh dear, oh sweet mother of God. [00:14:47] The blog post lists 10 new incredibly amazing features in Postgres, but as he looked around, there was a high degree of inaccuracy. [00:14:55] Things they said were coming were already here. So basically it's a rant about trying to get LLMs to write Postgres content. And he said it probably took him much longer to write this blog post than it took the original author to write the article. And I like the first comment down here: Preach. [00:15:12] So I definitely agree with this. If LLMs were writing all the blog posts that I had to review, the show would be much less interesting. But check this out if you want to learn more.

Next piece of content: there was another episode of the Talking Postgres podcast. This one is actually on AI for Data Engineers with Simon Willison, so I don't know how much they talk about Postgres. They did say they go over Postgres fine-grained permissions for working with AI-powered workflows, so if you're interested in that, you might find this episode of interest.

And the last piece of content is that CloudNativePG 1.27.0 was released. This is from postgresql.org, and the highlights of this release are dynamic loading of PostgreSQL extensions, logical decoding slot synchronization, and the primary isolation check, which is now stable. So check that out if you're interested.

And now it's time for my Consulting Corner. I had a performance optimization engagement recently, and I definitely have a few of these, but I thought I'd mention some things that I tend to see with the clients I'm working with. Usually the number one recommendation that I have is implementing some sort of multi-column index. They do usually have a fair number of indexes on single columns already; that's usually not a problem. But adding a multi-column index in certain cases can definitely improve the performance of certain queries, particularly ones that are run a lot on the database system; it can be worth it to have that multi-column index in place. This can be beneficial not just for narrowing down the number of rows, but also if you have some ORDER BYs: one column helps narrow what's being pulled, and the other column assists with ordering in the right way, and that ordering column should be last in the index. The next type of recommendation I tend to make is partial indexes. Again, I've mentioned this before on the show, but partial indexes are great when you have very narrow statuses, maybe a boolean field, where you can create an index where some value is true or where some value is false. And sometimes it's more efficient to create two indexes, one where the value is true and one where the value is false. But again, that depends on your workload. Partial indexes can definitely give you really large wins.
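To make those two recommendations concrete, here is a minimal sketch; the orders table, its columns and the status value are just placeholders:

-- Multi-column index: the narrowing column first, the ORDER BY column last
CREATE INDEX idx_orders_customer_created
  ON orders (customer_id, created_at);

-- Lets a common query filter and sort using the same index
SELECT * FROM orders
WHERE customer_id = 42
ORDER BY created_at DESC
LIMIT 50;

-- Partial index: only index the narrow slice of rows the hot queries care about
CREATE INDEX idx_orders_unshipped
  ON orders (created_at)
  WHERE shipped = false;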
The other recommendation is actually considering pulling less data. So for example, could you pull data within a particular date range? Because sometimes you're pulling 100,000 rows or 10,000 rows, and if you're not doing analytics, how important is it for you to be pulling all of that historical data? Could you put some sort of boundaries around it to help the database do less work? But if it's still needed, maybe you need to consider storing aggregates and then only pulling the recent information to combine with the aggregates to present data to your users. And then apart from that, it's just asking the question: are you sure you really want to run this particular query 500 times a second? You know, why do you need 2,000 rows of information 500 times a second? Being able to reveal that to developers and put that question out there can definitely prompt other types of application changes to potentially reduce database usage. But I would say those are the top recommendations I've been giving recently, so I hope you found that beneficial.

[00:18:43] I hope you enjoyed this episode. Please be sure to check out scalingpostgres.com, where you can find links to all the content mentioned, as well as sign up to receive weekly notifications of each episode. There you can also find an audio version of the show as well as a full transcript. Thanks, and I'll see you next week.
