Overhead of pg_stat_statements | Scaling Postgres 304

Episode 304 | February 25, 2024 | 00:14:09
Scaling Postgres

Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, we discuss new Postgres releases, performance comparisons between PgBouncer, PgCat and Supavisor, a new extension pg_analytics, and new planner capabilities in PG16.

To get the show notes as well as get notified of new episodes, visit: 

https://www.scalingpostgres.com/episodes/304-overhead-of-pg_stat_statements/

Episode Transcript

[00:00:00] The number one extension I always want installed on any Postgres system is pg_stat_statements, because it's the best tool right now to give you insight into what queries are running on your system. But the question is: what is its impact? That's what we're going to be talking about today in the first piece of content. But I hope you, your friends, family and coworkers continue to do well.

Our first piece of content is "Overhead of pg_stat_statements and pg_stat_kcache". This is from Postgres FM; this is their most recent episode, and Nikolay and Michael discuss this overhead. I think this may have been touched on in the Postgres.AI post that I mentioned a couple of weeks ago on Scaling Postgres, where Postgres.AI was stress testing, as they say, spot instances up to 2 million transactions per second. In this episode they talk about the impact of both pg_stat_statements and pg_stat_kcache. pg_stat_kcache is less common; it's primarily a means to track CPU utilization and disk usage for Postgres processes, and it's not used that often in my experience. pg_stat_statements, on the other hand, I'd like to see used everywhere, ideally on by default. So what is its overhead? They mention having heard people claim 7%, or less than 10%, or 1% or less. But the real question is: what is the bottleneck in the pg_stat_statements code? What is it that slows things down in particular scenarios? To my knowledge no one had looked at that, so the benchmarking they did and what they found is interesting.

They were testing single select queries using pgbench, basically running them as fast as possible to try to hit millions of transactions per second. Because the same statement was run over and over again, all of those executions were hitting a single row in pg_stat_statements, and that's where they hit a bottleneck. Much like updating a single row in Postgres over and over again fast enough will cause lock contention, not to mention bloat, a similar thing appears to happen with pg_stat_statements: update the same entry in fast succession and you hit a limit, because apparently there is locking involved in updating it.

If you're not familiar with it, pg_stat_statements takes the statement you're running and strips out the parameter values. So if you're looking up an account by its ID, you'll see something like SELECT * FROM account WHERE id = $1; it doesn't show you each individual ID. Run that query for a thousand different accounts and you'll still see only one entry in pg_stat_statements. So the bottleneck appears when the same statement structure runs many, many times. If you're getting a user by ID and you have 20 million users, that query is going to run a lot, and that causes higher overhead for pg_stat_statements, whereas a more varied set of statements should encounter minimal overhead. There may be other bottlenecks in the system, but the parallel is interesting: update the same row of information frequently enough and you will hit a bottleneck.
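
As an aside, you can see this normalization for yourself by querying the view directly. Here's a minimal sketch, assuming the extension has been preloaded at server start and using a hypothetical account table:

    -- Requires shared_preload_libraries = 'pg_stat_statements', then
    -- enabling the extension once per database.
    CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

    -- These two statements differ only in their parameter value...
    SELECT * FROM account WHERE id = 1;
    SELECT * FROM account WHERE id = 2;

    -- ...so they appear as one normalized entry (with a $1 placeholder)
    -- whose calls counter increments on every execution.
    SELECT calls, query
    FROM pg_stat_statements
    ORDER BY calls DESC
    LIMIT 10;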

So the more common your statements are, that is, the more calls you have for a particular query, the higher the overhead of pg_stat_statements; the more varied the queries hitting the system, the lower the overhead. They did mention one way around this: sampling. Above some threshold of calls per unit of time, you would start sampling the data as opposed to recording every single call. I found this episode super interesting, and you can listen to it here or watch the YouTube video down here.

[00:03:55] Next piece of content: "Logging: what, why and when". This is from hdombrovskaya.wordpress.com, and she's talking about logging in Postgres and how a lot of people she knows don't really consult the logs for anything. Quote: "we rarely use Postgres logs for anything". A lot of monitoring happens on the instance itself, maybe looking at high CPU load, disk usage, things of that nature; if there's a performance problem, people look at pg_stat_statements, but not frequently at the logs. And quote: "all cases when I or anybody whom I know would use Postgres logs, it was for investigative purposes", meaning trying to find some specific event that happened at some specific moment in time. She also asks that anybody reading the post who uses Postgres logs as-is on a regular basis let her know.

Well, I'll say I use Postgres logs on a regular basis. I consult them weekly, because I like to look at what fatal errors, warnings, and errors have happened. They can give you insight into problems before they get really bad, or surface strange errors that you can then investigate to find the actual cause. So I might be an outlier, but I definitely scan the logs regularly to know how my instances are running, because there's information in there I can't really get anywhere else.

She then goes through the logging settings, mentioning pgBadger and its recommended logging configuration. Those recommendations are super aggressive: basically logging all statements, which I can't fathom doing on some production instances, and logging connections and disconnections, which, depending on your application, could be really bad, because some application frameworks for some reason don't use a pooler and constantly connect and disconnect. I know, that's horrible for performance, but it's how some frameworks are designed, I guess. I also don't know if you really need to log all temp files; I suppose in certain cases that could be beneficial, but I always like setting some limit on this. The rest makes sense to me: tracking vacuums, logging lock waits, and logging checkpoints are definitely important. A sketch of a more conservative variant of these settings follows below. She also mentions a technique some people use: when they start noticing problems, they turn all the logging on to gather information, then turn it off once the situation is resolved. I found this an interesting post, and a question for you, the viewer or listener: do you use Postgres logs and monitor them for issues, or do you only look at them when you're trying to track down some issue that already happened?
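
For reference, here's a minimal sketch of what that more conservative variant might look like, using ALTER SYSTEM; the values are illustrative rather than recommendations, and all of these settings take effect on a reload:

    -- Log only statements slower than a threshold, not every statement.
    ALTER SYSTEM SET log_min_duration_statement = '1s';
    -- Log temp files only above a size limit, not all of them.
    ALTER SYSTEM SET log_temp_files = '10MB';
    -- Log every autovacuum run to track vacuum activity.
    ALTER SYSTEM SET log_autovacuum_min_duration = 0;
    -- Log lock waits that exceed deadlock_timeout.
    ALTER SYSTEM SET log_lock_waits = on;
    -- Log checkpoints (the default since Postgres 15).
    ALTER SYSTEM SET log_checkpoints = on;
    SELECT pg_reload_conf();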

[00:06:41] Next piece of content: "A Story of a Spectacular Success: Intro to AWS RDS Database Migration Series". This is from smily.com, and I found this post a bit odd. They were running on EC2 instances, with one database up to a terabyte and another at eleven terabytes, and they wanted to migrate to RDS. When they evaluated it, they said it looked like they could significantly drop their costs, and my reaction was: what? But the reason they said that is that they shrunk their database to 3% of its original size. The way the post is written, it looks like they're saying RDS is the reason, when it seems more likely that the logical replication migration they performed got rid of all their bloat, which is why the database ended up at 3% of its former size. Not that they should have, but they could have set up new EC2 instances, done a logical replication migration from one to the other, eliminated all of that bloat, and gotten the same savings. They don't really discuss how they did anything here; this is an intro post, with more coming later. They did use a custom migration service, which is interesting: on the one hand they describe the problems they had with AWS's Database Migration Service and issues with its documentation, but then they praise AWS because now they're running on RDS and don't have to deal with infrastructure issues. That struck me as a bit of a dichotomy. Anyway, to me it raises the question of what they were doing before that caused their database to grow to such a huge extent. If you want to learn more, you can check out this blog post.

Next piece of content: "FOSDEM 2023: How PostgreSQL can help you enforce best practices". This is from vyruss.org, the blog of Jimmy Angelakos, and I really like his content and the videos he produces. He has a YouTube video of the presentation he gave, as well as the slides here, and he covers all sorts of best practices: which data types to use, how locks can affect performance, different configuration options, security best practices, high availability best practices. A lot of good content, so definitely check it out if you have the time.

[00:09:06] Next piece of content: "SQL optimization: a comprehensive developer's guide". This is from eversql.com, and it's a list of suggestions for running your SQL queries more efficiently: using indexes, improving joins, avoiding joins when you can, improving subqueries, paginating results, using window functions, using partitions, and using materialized views. It reads a bit like a reference manual, but it might warrant a perusal if you're interested in something like that.

Next piece of content: "Last updated columns with Postgres". This is from morling.dev. A lot of application frameworks create created_at and updated_at timestamps for every table, and he would really like to maintain these in the database itself rather than relying on the framework to do it. The created_at column is easy, because you can use the current timestamp as its default. The updated_at column is more of a problem. You can again set the current timestamp as its default, but then in every UPDATE statement you have to set updated_at to DEFAULT, which re-applies the column default and updates the column. Of course, you have to remember to do this on every update, otherwise updated_at won't change; and if you have to write it anyway, how is that different from just writing something like the now() function?
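
Here's a minimal sketch of that pattern, using a hypothetical todo table:

    CREATE TABLE todo (
      id         bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
      title      text        NOT NULL,
      created_at timestamptz NOT NULL DEFAULT now(),
      updated_at timestamptz NOT NULL DEFAULT now()
    );

    -- The trick: re-evaluate the column default during the update...
    UPDATE todo SET title = 'buy milk', updated_at = DEFAULT WHERE id = 1;

    -- ...which saves nothing over writing now() directly and is just
    -- as easy to forget, which is the point being made above.
    UPDATE todo SET title = 'buy milk', updated_at = now() WHERE id = 1;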

But this was just a quick blog post about that.

Next piece of content: "Moving local settings for pg_hba.conf and postgresql.conf out of PGDATA". This is from dan.langille.org. I really like the way Ubuntu does it, where pg_hba.conf and postgresql.conf are stored under the /etc directory and namespaced with the version and cluster name of the Postgres instance; that makes them super easy to manage. He talks about using the include_dir option in postgresql.conf to add his own configuration that overrides the defaults. This is exactly the way I like to do it too: I put a custom file that defines all my settings and never change the original postgresql.conf file. You can check out this blog post if you want to learn more about that.

Next piece of content: "An efficient way to check for existence of multiple values in SQL". This is from blog.jooq.org, and he's talking about how you can use EXISTS to determine whether a row exists, which is much more efficient than counting all the rows and checking whether the count is greater than zero. But what if you want to check that there are at least two rows in the table? You can't use EXISTS for that, but you can still use a count: the trick is to LIMIT the subquery to two rows. That way you don't count through all the rows; the query stops as soon as it finds two, and the count over that gives you your answer. A sketch of both patterns follows at the end of the transcript. Check out this blog post if you want to learn more about that.

Next piece of content: "Out-of-range planner statistics and get_actual_variable_range in Postgres". This is from pganalyze.com, and it's Lukas' Five Minutes of Postgres. He talks about out-of-range planner statistics: Postgres collects stats on the values in each table and tries to maintain a histogram of the values in each column, and when a column is constantly increasing, like a serial data type or an ever-advancing timestamp, your stats will continuously fall out of date, depending on how fast the data changes. It really underscores the importance of frequently analyzing your tables to keep those stats as current as possible. He also mentions that an index can give the planner some assistance in knowing how many rows exist for a particular query. If you want to learn more, you can definitely check out his piece of content.

And the last piece of content: "RFC: Extension metadata typology". This is from justatheory.com, and it's about the Postgres extension network that David Wheeler is working on. He's specifically talking about the control file metadata that will be a part of it, and he walks through the typology of metadata he's looking for in designing this. Definitely check out this piece of content if you're interested.

[00:13:42] I hope you enjoyed this episode. Be sure to check out scalingpostgres.com, where you can find links to all the content discussed, as well as the podcast audio and video and a full transcript. While you're there, you can also sign up to receive weekly notifications of each episode via email. Thanks, and I will see you next week.
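
As mentioned above, here's a minimal sketch of the two existence-check patterns from the jOOQ post, using a hypothetical orders table:

    -- At least one row: EXISTS stops scanning at the first match.
    SELECT EXISTS (
      SELECT 1 FROM orders WHERE customer_id = 42
    ) AS has_any;

    -- At least two rows: count over a LIMITed subquery, so the scan
    -- stops after the second match instead of counting every row.
    SELECT count(*) = 2 AS has_at_least_two
    FROM (
      SELECT 1 FROM orders WHERE customer_id = 42 LIMIT 2
    ) t;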

Other Episodes

Episode 262

April 23, 2023 00:14:43

pg_failover_slots, Standby Logical Decoding, Trusted Language Extensions, Postgres Package Manager | Scaling Postgres 262

In this episode of Scaling Postgres, we discuss the pg_failover_slots extension, how PG16 allows logical decoding on standbys, what are trusted language extensions and...

Listen

Episode 189

November 01, 2021 00:15:26

Index Downsides, TCP Keep Alive, Development with Postgres, Learning PL/pgSQL | Scaling Postgres 189

In this episode of Scaling Postgres, we discuss the downsides of indexes, TCP keep alive options, developing with Postgres as your DB and learning...

Listen

Episode 315

May 12, 2024 00:19:31

Hacking On Postgres is Hard! | Scaling Postgres 315

In this episode of Scaling Postgres, we discuss how hacking on Postgres is hard, a notifier pattern for using Listen/Notify, using histograms for metrics...

Listen