Go Faster, GIN Indexes, Collation Stability, PG14 & Beyond | Scaling Postgres 194

Episode 194 December 10, 2021 00:16:00

Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, we discuss going slower to go faster, GIN indexes, collation stability and features of PG14 and beyond.

To get the show notes as well as get notified of new episodes, visit: 

https://www.scalingpostgres.com/episodes/194-go-faster-gin-indexes-collation-stability-pg14-beyond/


Episode Transcript

[00:00:00] In this episode of Scaling Postgres, we talk about going faster, GIN indexes, collation stability, and PG 14 and beyond. I'm Creston Jamison, and this is Scaling Postgres episode 194.

[00:00:23] All right, I hope you, your friends, family and coworkers continue to do well. My apologies for being late with the episode this week due to technical difficulties; I did publish the web page with the links for this week on time, but the YouTube video and podcast were significantly delayed. Hopefully we'll be back on our regular schedule next week.

Our first piece of content is "Slow things down to make them go faster", which was presented at Postgres Build 2021. This is from the website vyruss.org, which has a brief summation of the presentation along with the YouTube video and a link to the slides. This was a very good presentation, and it talked a lot about how to make Postgres go faster by not slowing it down with your application: looking at lock contention and ways to work better with Postgres's process model. What "slowing things down to make them go faster" refers to is that sometimes you can throw so many connections at Postgres that its limited CPU resources have trouble context switching between all the different connections, because each connection is its own process in Postgres. They present an example where, with 100 connections, you get around 1,500 transactions per second (TPS) at a latency of 52 milliseconds. If you ramp that up to 1,000 connections on the same instance size (an r5.8xlarge), the TPS drops by about a fifth to roughly 1,200, and the latency gets over ten times longer, at 668 milliseconds. So this is the problem he's talking about: when you try to throw too many connections at Postgres, you can end up with hindered performance, and if you actually reduce the number of connections, you can get more throughput. How you handle this is with a connection pooler. PgBouncer, for example, is a popular one that I tend to use and the one he mentions in this talk, but there are others you can use as well. It basically slows down those connections so you can actually run faster. He also covers some other points, including trying to avoid explicit locking and perhaps making more use of serializable transactions, but it's definitely a great presentation and I suggest you check it out.

The next piece of content is "Understanding Postgres GIN Indexes: The Good and the Bad". This is from pganalyze.com, and they're talking about GIN indexes. The thing to keep in mind with a GIN index, compared to the typical B-tree index, is that a B-tree has a corresponding entry for every row in the table, essentially a one-to-one relationship: Postgres looks a value up in the index and then knows where to find the rows in the heap. A GIN index is a little different in that it stores multiple row locations per value. The index is organized by value, which is what makes it good for text searching as well as for JSONB fields that hold multiple values within the structure of the field.

[00:03:45] So you can think of a GIN index like a typical book index: it knows about a particular value, and it knows all the locations where the information for that value can be found.
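To make the article's use cases concrete, here is a minimal sketch of what these GIN indexes can look like in SQL. The table and column names are hypothetical; gin_trgm_ops and jsonb_path_ops are the standard operator classes the article discusses, and the trigram example assumes the pg_trgm extension is available.

```sql
-- Hypothetical table for illustration.
CREATE TABLE articles (
    id    bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    title text,
    body  text,
    attrs jsonb
);

-- Full text search: a GIN index over a tsvector expression.
CREATE INDEX articles_body_fts ON articles
    USING gin (to_tsvector('english', body));

-- Similarity search: gin_trgm_ops requires the pg_trgm extension.
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX articles_title_trgm ON articles
    USING gin (title gin_trgm_ops);

-- JSONB containment: jsonb_path_ops supports the @> operator.
CREATE INDEX articles_attrs_path ON articles
    USING gin (attrs jsonb_path_ops);

-- Queries these indexes can serve:
SELECT id FROM articles
 WHERE to_tsvector('english', body) @@ to_tsquery('english', 'postgres & index');
SELECT id FROM articles WHERE title % 'postgress';                 -- similarity
SELECT id FROM articles WHERE attrs @> '{"status": "published"}';  -- containment
```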
As they note, a single row in a GIN index can be represented in many places within the tree. The next thing they talk about is the use cases I mentioned. For full text search, they show how you can use a GIN index with the relevant text-search operators. You can also do similarity searches by specifying the gin_trgm_ops operator class when you create the GIN index, which allows you to run similarity searches against a table using that index. They also mention, of course, that you can use a GIN index on JSONB columns, and they show use cases for looking up the value of a particular field stored in there. You can use operator classes like jsonb_path_ops to find JSONB where a particular value is contained within it, using the containment operator. They even mention that you can use a B-tree index on a JSONB field, with syntax that targets a specific value you want stored in that B-tree, as an alternative to a GIN index.

[00:05:02] Now, sometimes you're going to want to combine these. A lot of times you may want to look up a given customer ID, as they mention here, as well as a particular set of data within a JSONB field. There are a couple of ways to do that, but one of the great ways is an extension called btree_gin, which allows you to combine B-tree and GIN indexing in the same index, and they show a use case of doing that here.

[00:05:29] Now, one downside of storing the index by value, with each value referencing multiple locations, is expensive updates, as they mention here. Once a value changes, you may have to visit all the different locations that value references to update the information on where to find it. GIN handles this by deferring some of these updates: it maintains, as they mention here, a pending list of deferred changes that still need to be applied. What triggers the deferred update is the gin_pending_list_limit setting, which defaults to four megabytes. If the pending list hits that limit during an index update, Postgres knows it needs to do the work of committing those entries to the actual index instead of the pending list. The pending list is also flushed by calling the gin_clean_pending_list() function, which empties it out and updates all of the index entries, and by autovacuum running on a table with a GIN index. But this can also lead to performance problems, and they mention an issue GitLab ran into with this recently and describe the situation they encountered. As for ways to handle the disadvantages of a GIN index, they list six (a sketch of several of these in SQL follows below). One, you could reduce gin_pending_list_limit, which may allow these flushes to happen more frequently and faster, as they mention here. Two, you could increase the limit so that the flush maybe doesn't happen during prime times and you can better control when it happens. Three, you could turn off fastupdate, so Postgres no longer defers changes to a pending list and just does the work as it comes in; as they say here, quote, you're taking the overhead of each individual insert/update at the time it happens as opposed to deferring it. Four, you could tune autovacuum to run more frequently. Five, you could explicitly call the gin_clean_pending_list() function more frequently instead of relying on autovacuum. Or six, you could drop the GIN index.
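As a rough sketch of what those mitigations look like in practice — the index and table names carry over from the hypothetical example above, while gin_pending_list_limit, gin_clean_pending_list(), fastupdate, and the autovacuum storage parameter are the actual Postgres knobs:

```sql
-- Each statement below is an independent option, not a sequence.

-- 1. Reduce the pending-list limit for one index (value in kB; default 4096).
ALTER INDEX articles_attrs_path SET (gin_pending_list_limit = 1024);

-- 2. Or raise it, to better control when the flush happens.
ALTER INDEX articles_attrs_path SET (gin_pending_list_limit = 16384);

-- 3. Turn off the pending list entirely, paying the cost on each write.
ALTER INDEX articles_attrs_path SET (fastupdate = off);

-- 4. Make autovacuum visit the table more often.
ALTER TABLE articles SET (autovacuum_vacuum_scale_factor = 0.01);

-- 5. Flush the pending list explicitly instead of waiting for autovacuum.
SELECT gin_clean_pending_list('articles_attrs_path');

-- 6. Or simply drop the index.
-- DROP INDEX articles_attrs_path;
```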
But this is a great blog post explaining GIN indexes, their potential downsides, and of course their advantages, so if you're interested in that, you can check it out.

[00:07:43] The next piece of content is "Collation stability". This is from rhaas.blogspot.com, and he's talking about an issue with collations with regard to indexes. By default, Postgres uses the operating system's libraries for handling collations, but it also has support for ICU-based collations. These collations are important for the ordering of indexes, and unfortunately, if your collations shift, say you do an operating system upgrade without changing the Postgres version and the underlying collation libraries change significantly, then when you go to search data using an index, you can get invalid results, which is a big issue. Unfortunately, what this article mentions is, quote, the system Thomas Munro designed to detect changes in the collation version had enough problems that it had to be reverted. This was a change that would at least have notified you, hey, your collation version changed, which may impact your indexes; unfortunately there were issues with it and, clearly, it had to be removed. So this post doesn't offer a complete plan for handling the problem; I mostly wanted to showcase it to let you know this issue exists. One way to mitigate it is to use a separate library, namely ICU collations, instead of your operating system's built-in library, so you can specifically control which version you're on. And if you run into this and your indexes stop working after some sort of upgrade, the solution is basically to reindex your text-based indexes. You wouldn't need to reindex, say, a boolean or integer column; it's the things sorted by collation, your text-based fields. But if you want to learn more about these collation issues, I definitely encourage you to check out this blog post.
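As a hedged illustration of that mitigation, here's a minimal sketch assuming a Postgres build with ICU support; the collation, table, and index names are hypothetical, while CREATE COLLATION with the icu provider and REINDEX are standard syntax:

```sql
-- Pin sort order to the ICU library instead of the OS libc.
CREATE COLLATION german_icu (provider = icu, locale = 'de-DE');

-- Use the collation on a text column (hypothetical table).
CREATE TABLE customers (
    name text COLLATE german_icu
);
CREATE INDEX customers_name_idx ON customers (name);

-- If an OS or library upgrade changes collation behavior, rebuild
-- the affected text-based indexes.
REINDEX INDEX customers_name_idx;
```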
The next piece of content is "PostgreSQL 14 and beyond". This is from postgresql.fastware.com. It's a really long post covering all the different features introduced in Postgres 14, as well as some future features coming in Postgres 15, i.e. what the community is planning at this time. In terms of the big features in 14, they talk about the logical replication improvements, additional SQL features, data corruption fixes, indexing changes, extended statistics, and vacuuming, and they follow up with the performance improvements that were done. In terms of 15, they again mention a lot of different things being worked on: more improvements focused on logical replication; server-side compression for backups, which would be great to see for making backups smaller; improvements in automatic switchover/failover from a primary to replicas; improvements in hash indexes; a shared-memory stats collector to give more relevant information about server state; and changes with regard to parallelism, in terms of parallel inserts, COPY FROM, things of that nature. So if you want to learn more about all the updates that came in Postgres 14, as well as potential future improvements, definitely check out this blog post.

[00:11:07] The next piece of content is "PostgreSQL vs. Python for data cleaning: A guide". This article talks a lot about data scientists using Python to prepare their data, particularly cleaning it before they start their analysis. It shows a way you could do that within PostgreSQL instead, which should be faster and more efficient because you're working right within the database as opposed to using external Python tooling for the cleaning process. So if you want to learn more about this process, definitely check out this blog post.

[00:11:43] The next piece of content is "Primary keys vs. unique constraints in PostgreSQL". This is from cybertec-postgresql.com. They're talking about the difference between designating something as a primary key versus just creating a column with a unique constraint. Basically, the only difference is that a primary key is NOT NULL, while a uniquely constrained column can contain nulls, although you can of course declare it NOT NULL as well. The system does give a primary key a special designation, though, and as mentioned later in the blog post, or in the comments, I believe, some tools rely upon that primary key: they'll give specific messages if one doesn't exist, or require it in certain cases. So you generally do want a primary key on every table you have, because some tooling and utilities rely upon it, but in general it's identical to a unique constraint on a NOT NULL column. If you want to learn more about that, you can check out this blog post.
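A minimal sketch of that difference, using hypothetical table names:

```sql
-- A primary key is in effect a unique constraint plus NOT NULL,
-- with a special designation in the system catalogs.
CREATE TABLE orders (
    id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY
);

-- Nearly equivalent, but without the primary-key designation
-- that some tools and ORMs look for.
CREATE TABLE orders_alt (
    id bigint GENERATED ALWAYS AS IDENTITY UNIQUE NOT NULL
);
```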
[00:12:49] The next piece of content is "PostgreSQL 14 database monitoring and logging enhancements". This is from percona.com, and they cover a number of enhancements directly related to monitoring and logging. The main area they discuss is the query identifier, which is now unique across the different statistics views, so you can correlate a query in the log, in EXPLAIN, and in the pg_stat_activity view. The next thing they mention is that you now get I/O timings for autovacuum and autoanalyze, and there's also additional connection logging you can enable. So if you want to learn more, you can check out this blog post.

The next piece of content is "Transition tables in Incremental View Maintenance, part two: the multiple tables modification case". This is from yugonagata-pgsql.blogspot.com, and he's talking about the new feature being worked on, Incremental View Maintenance, which would allow you to create a materialized view that essentially keeps itself updated. This particular post looks at a view that encompasses multiple tables, and how the view stays up to date when both of those tables are modified. So if you want to learn more about the status of this new feature, you can definitely check out this blog post.

[00:14:08] The next piece of content is "PostGIS Day 2021". This is from crunchydata.com, and the post covers a number of presentations given on PostGIS Day: PostGIS for neuroscience, different climate tools, various Postgres extensions, and different types of mapping projects. So if you want to review these presentations, check out this blog post.

[00:14:33] The next piece of content, "Tricks for faster spatial indexes", is also from crunchydata.com. They're talking about how, if you randomize your data before creating indexes relevant to PostGIS, particularly GiST indexes, you can actually get a performance improvement, and the post explains why that is the case. The performance was about 11% faster, so not a huge improvement, but if you do this simple randomization before building these indexes, you'll get a speed boost. If you're interested in learning more about that, check out this blog post.

The next piece of content is "Permissions required for PostGIS". This is from rustprooflabs.com, and they're talking about the permissions required to get PostGIS installed and running. So if you're interested in that, you can check out this blog post.

[00:15:23] The last piece of content: the PostgreSQL Person of the Week is Fabien Coelho. If you're interested in learning more about Fabien and his contributions to Postgres, definitely check out this blog post.

That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.

Other Episodes

Episode 25

August 13, 2018 00:13:14

generate_series, PG11 Partitioning, GPU Queries | Scaling Postgres 25

In this episode of Scaling Postgres, we review articles covering graphing with generate_series, partitioning in Postgres 11 and GPUs for queries. To get the...


Episode 19

July 02, 2018 00:17:42

100K Tenants, Extensions, Query Planning, Crosstabs | Scaling Postgres 19

In this episode of Scaling Postgres, we review articles covering 100,000 tenants, Postgres Extensions, query planning and how to create crosstabs. To get the...


Episode 146

January 04, 2021 00:14:45

Active-Active & Federated, Memory Speed, Table Rewrite Causes, PostgreSQL Shibboleth | Scaling Postgres 146

In this episode of Scaling Postgres, we discuss setting up active-active instances with federation, the impact of memory speed, the causes of table rewrites...
