Index Downsides, TCP Keep Alive, Development with Postgres, Learning PL/pgSQL | Scaling Postgres 189

Scaling Postgres

Episode 189 | Nov 01, 2021 | 00:15:26

Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, we discuss the downsides of indexes, TCP keep alive options, developing with Postgres as your DB and learning PL/pgSQL.

To get the show notes as well as get notified of new episodes, visit: 

https://www.scalingpostgres.com/episodes/189-index-downsides-tcp-keep-alive-development-with-postgres-learning-plpgsql/


Episode Transcript

[00:00:00] In this episode of Scaling Postgres, we talk about index downsides, TCP keep-alive, development with Postgres, and learning PL/pgSQL. I'm Creston Jamison, and this is Scaling Postgres episode 189.

[00:00:25] All right, I hope you, your friends, family and coworkers continue to do well. Our first piece of content is "How partial, covering, and multicolumn indexes may slow down UPDATEs in PostgreSQL". This is from Postgres.ai, and they're talking about cases where indexes can cause downsides. The first thing he mentions here, well, I'm not going to try to pronounce it, but it's basically "first, do no harm", which is kind of like the Hippocratic Oath. It's basically saying that when you add an index, you want to make sure you don't harm the performance of the database. Of course, any time you add an index, there is a burden on the system to maintain that index, but he's talking about certain indexes that can have a pretty big impact on performance.

First he covers the different index types he's going to test out. The first one is partial indexes, which only index a portion of the table. So, for example, where it says "build is not true", you're only going to index rows where build is false. Then he talks about multicolumn and covering indexes. A multicolumn index basically has multiple columns as part of the index, to help efficiently pull data when you're asking for information from both columns. A covering index is basically an additional payload added to an index. Now, the benefit of these is that they allow you to get more index-only scans. With an index-only scan, you can get all the information you need for a query just from the index; you don't have to go to the actual table in the heap to pull more information. So if you're only selecting column one and column two, in both of these cases, either with a multicolumn index or a covering index, if all you're asking for in the select is those two columns, then you only need to go to the index to pull the information.

Then he set up a number of experiments to see what impact different indexes have when certain columns are updated. He has an example of an asset table with a primary key ID, an owner ID, a created-at timestamp, and a price. He inserted random data into this table, and he wanted to look at a sum of the price by owner ID. He also added a variant where the price is not null, to test a partial index, and he tested updating the price based on the ID. So he took these three different queries and did a pgbench run on them. Now, he had some additional considerations here, and you should definitely review this post, but I'm not going to go into each of the points he makes.

Based on what he wanted to test, he set up a baseline condition where essentially no additional indexes are added, to see how the transactions perform. Then he added a one-column index on the owner ID; a multicolumn index including owner ID and price; a partial index on the owner ID where the price is not null; a covering index on the owner ID that includes the price as a payload; and then a partial covering index on the owner ID with the price as a payload, where the price is not null. And then he looked at the resulting performance.
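As a rough sketch, assuming a table and columns matching that description (the names here are illustrative, not copied from the post), the tested index variants would look something like this:

  CREATE INDEX ON asset (owner_id);                   -- one-column
  CREATE INDEX ON asset (owner_id, price);            -- multicolumn
  CREATE INDEX ON asset (owner_id)
    WHERE price IS NOT NULL;                          -- partial
  CREATE INDEX ON asset (owner_id) INCLUDE (price);   -- covering
  CREATE INDEX ON asset (owner_id) INCLUDE (price)
    WHERE price IS NOT NULL;                          -- partial covering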
Basically, without any index, the transactions per second were quite low on the first select, because it was doing a parallel sequential scan. The second select, where the price is not null, again used a parallel sequential scan and was a little bit slower, about 50 transactions per second. So really, really slow. The update transactions per second were pretty fast, 46,000 transactions per second. He also includes the HOT update ratio. A HOT update is a heap-only tuple update: Postgres can just go to the heap and update the tuple's price, in this case, without having to update or change anything about the index. This column shows you the percentage of heap-only tuple updates that you get.

Now, adding the one-column index on owner boosted the select transactions by about 1,000-fold for both version one and version two. So that was quite significant in improving performance. The update transactions per second actually went a little bit higher, and the heap-only tuple ratio was equally high. There was a little bit of an increase in the index size, though. This index size is measured before and after the tests, so it shows how much bloat happened as a result. Because nothing really changed in the index, it was basically doing heap-only tuple updates of the price, nothing changed very much.

Now, looking at the multicolumn index, or even the covering index, because those were similar, they actually gave about a 2,000-fold improvement, so faster than the one-column index, but they suffered performance-wise with the updates. The updates went to about 30,000 transactions per second, a loss of about 33% of the performance. And you're not getting any heap-only tuple updates, because the price is now part of the index. That's the key thing to keep in mind here: if the column you're adding to an index, be it a multicolumn index or a covering index, is a column that gets updated, you are not going to get heap-only tuple updates. That can cause quite a bit of harm to your transactions per second for updates. And as you can see in this table, you're also going to have a fair amount of bloating in that index, which will require more vacuuming to resolve.

So basically, this can be one significant downside of adding indexes to improve select performance. By adding some of these indexes, you got a 2,000-fold improvement in select performance, but you've reduced your update transactions per second by about a third. Keep in mind that this only happens because it's a column that is being updated frequently. If that column isn't being updated, then you won't see this kind of reduction in update performance and you won't see the bloat. And there may be cases where the improved select performance outweighs the reduction in update performance. He mentioned in this blog post that he would probably go with the single-column index, because it still gives you a thousand-fold improvement while allowing a lot of heap-only tuple updates and avoiding a lot of index bloat. So this was a great post, and I highly encourage you to check it out, because it showcases, in a pretty obvious way, some examples of where there can be downsides to adding new indexes.
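If you want to see the HOT update ratio for your own tables, a minimal sketch against the standard pg_stat_user_tables statistics view (this query is mine, not from the post) could look like this:

  -- percentage of updates that were heap-only (HOT), per table
  SELECT relname,
         n_tup_upd,
         n_tup_hot_upd,
         round(100.0 * n_tup_hot_upd / nullif(n_tup_upd, 0), 1) AS hot_update_pct
  FROM pg_stat_user_tables
  ORDER BY n_tup_upd DESC;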
[00:07:08] The next piece of content is "TCP keepalive for a better PostgreSQL experience". This is from Cybertec-postgresql.com, and they're talking about TCP keepalive, which basically helps detect broken or abandoned connections. Now, they mention there are some new features in recent versions that can help track and monitor some of these. There are new session statistics in version 14 of Postgres that allow you to track sessions abandoned by a disconnect, and there's also a new parameter called idle_session_timeout, which closes idle connections after a while. But they say the problem with this particular feature is that it can terminate healthy idle connections as well. Maybe you want the connection pooler for your application to keep those database connections open, even though there's no active traffic going through them, so they're there and ready to process database statements when one comes in. What they're saying is that you can alter your TCP keepalive values to get a much better solution to this problem.

They describe TCP keepalive as a way, quote, "to keep network connections from being idle and to detect if the other communication end has left without closing the network connection". And it has three parameters you can adjust. One is the idle time, the amount of time a connection is allowed to sit idle, with nothing going on, before anything happens. Then there's the interval: after the idle time, probes are sent at this interval to check whether the connection is still there. And finally the count: how many failed probes are allowed before the connection is actually timed out.

Now, one thing they say is that, at the default idle time of 2 hours, firewalls or routers may often have closed those connections already. So the first thing you're probably going to want to do is reduce the idle time to less than whatever that timeout is, and that will help you keep an idle database session alive. Secondly, you can adjust the interval and the count to detect dead connections faster, because otherwise it's going to take approximately ten minutes with the default interval of 75 seconds and a count of 9. So you're probably going to want to reduce those to more quickly detect invalid connections.

Now, to change these parameters, you can do it in the postgresql.conf file, since they are configuration variables you can set for Postgres. On the client, there are corresponding connection parameters, and you can change each of these settings there as well. And finally, you can do it at the operating system level, which would cover every application on the machine, not just the database, but that could be another option if you want to go that route. Basically, these settings give you fine-grained control over keeping open the sessions you want to keep, while detecting early when database connections go down. So if you're interested in learning more, I highly suggest you check out this post from Cybertec-postgresql.com.
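As a hedged sketch, tightening these on the server side might look like this (tcp_keepalives_idle, tcp_keepalives_interval and tcp_keepalives_count are the actual Postgres parameter names; the values are just illustrative):

  # postgresql.conf: probe idle connections before a typical firewall
  # timeout, and detect a dead peer within about a minute
  tcp_keepalives_idle = 300        # seconds idle before the first probe
  tcp_keepalives_interval = 10     # seconds between probes
  tcp_keepalives_count = 5         # failed probes before dropping the connection

On the client side, libpq accepts corresponding connection parameters, for example (the host and database names here are hypothetical):

  psql "host=db.example.com dbname=app keepalives_idle=300 keepalives_interval=10 keepalives_count=5"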
[00:10:12] The next piece of content is actually a YouTube video from ElixirConf: "I can do all things through PostgreSQL: lessons for the Elixir dev". Now, this is a great video. It's about 35 minutes in length and has a fair bit of humor in it. It basically covers where you would want to use the tools offered by Postgres, as opposed to trying to solve problems on your own within your application code. Now, Elixir is known to be a highly performant language, so if there are cases where it's more advantageous to use Postgres to solve a problem, you could probably apply this lesson to any other language, particularly Ruby, which I am also familiar with, and which tends to be much slower at these tasks than Elixir.

They cover things like using constraints within the database, using citext to handle case-insensitive text, and the benefits of Postgres for full text search and fuzzy matching, as well as Bloom filters. Basically, in a lot of areas where you're dealing with a large data set, it's more efficient to use the tools within Postgres than to pull a bunch of data over and process it in Elixir. Even if you're not familiar with Elixir, I thought this was a great presentation.

[00:11:29] The next piece of content is "Exploring PL/pgSQL: strings, arrays, recursion, and parsing JSON". This is from Eatonphil.com, and he's basically giving a little mini-course on starting to use PL/pgSQL. He starts off very basic with, hey, this is the easiest function you can create: he's basically renaming the length function, and it just calls length itself. So this is one of the simplest functions you can design. He then shows how you can use named and unnamed arguments, specify OUT declarations, and do a few things like working with numbers, recursion, strings and arrays, and even building a parser for JSON within it. So I thought this was a great post on how to get started with PL/pgSQL, and if you're interested in that, you can check out this blog post.
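To give a flavor of that starting point, a wrapper of the kind he describes, shown here under a hypothetical name, looks roughly like this:

  -- the simplest kind of PL/pgSQL function: a thin wrapper around length()
  CREATE FUNCTION my_length(s text) RETURNS int AS $$
  BEGIN
      RETURN length(s);
  END;
  $$ LANGUAGE plpgsql;

  -- usage:
  SELECT my_length('scaling postgres');  -- returns 16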
[00:12:23] The next piece of content is "Secure PostgreSQL 14 with CIS Benchmark". This is from Crunchydata.com, and they've released a new version of their Center for Internet Security benchmark that applies to PostgreSQL 14. This benchmark, published by the Center for Internet Security, discusses how to handle things like installation and patches, directory and file permissions, logging, monitoring and auditing, user access and control, connections and logins, Postgres settings, replication, et cetera. So if you want this benchmark document, you can definitely check out this blog post; they have a link to it here.

The next piece of content, also security related, is "Database Security Best Practices on Kubernetes". This is from Crunchydata.com, and they're talking about different practices you should use for securing Postgres on Kubernetes. A lot of these are applicable even if you're not running on Kubernetes, but they call out specific configuration variables you're going to want to set for it. They cover areas such as running the container as an unprivileged user, how to encrypt your data, how to handle credential management (passwords, keys, and things of that nature), how to keep your software up to date by keeping up with new versions, how to follow configuration best practices, how to limit where you can write by defining certain file systems as read-only, and how to identify the weakest link: securing Kubernetes is one thing, but you have to cover other areas like your hardware, software, operating system, the applications themselves, who has access, and so on.

And then a final post is on encrypting Postgres data at rest in Kubernetes. This is also from Crunchydata.com, and it specifically talks about using Kubernetes and Postgres within the Amazon AWS framework, using their EKS service, and how you can handle encryption with this setup. So if you're interested in the security articles, you can definitely check these out.

[00:14:29] The next piece of content: the PostgreSQL person of the week is Alexander Kukushkin. If you're interested in learning more about Alexander and his contributions to Postgres, you can check out this blog post.

And the last piece of content: we had another episode of the Rubber Duck Dev Show this past Wednesday. In this episode, we talked about how many third-party libraries your application code should use. So if you're interested in that discussion, you can check out this episode. The next episode will cover how much planning you should do before you start coding.

[00:15:01] That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to Scalingpostgres.com, where you can sign up to receive weekly notifications of each episode. Or you can subscribe via YouTube or iTunes. Thanks.

Other Episodes

Episode 155

March 08, 2021 00:17:16

Citus Open Source, Time Series Performance, Subscripting Updates, target_session_attrs | Scaling Postgres 155

In this episode of Scaling Postgres, we discuss Citus 10 open source, time series performance in native Postgres, using subscripting for updates and new...


Episode 182

September 12, 2021 00:13:50

Boundless Text, Revoked Permissions, Index Bloat, Hardware Performance | Scaling Postgres 182

In this episode of Scaling Postgres, we discuss having boundless text fields, revoking public schema permissions, less index bloat in PG14 and comparing hardware...


Episode 191

November 15, 2021 00:14:54

Man-In-The-Middle, pg_auto_failover, Lesser Known Features, LZ4 Compression | Scaling Postgres 191

In this episode of Scaling Postgres, we discuss new releases of Postgres due to a man-in-the-middle vulnerability, the high availability solution pg_auto_failover, lesser known...
