How Much Observability Is Needed? | Scaling Postgres 294

Episode 294 December 11, 2023 00:14:09

Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, we discuss how much observability is needed, avoiding locks during database migrations, a huge Postgres events blogging event and FIPS mode.

To get the show notes as well as get notified of new episodes, visit: 

https://www.scalingpostgres.com/episodes/294-how-much-observability-is-needed/

 


Episode Transcript

[00:00:00] How easy is it to do Postgres performance optimization? Personally, finding where the problem is is not too difficult. I mean, generally I just look at pg_stat_statements and it's pretty obvious what's causing at least a database-wide problem. The hard part is actually figuring out how to fix it. Sometimes it's really easy, but other times it can take quite a bit of time to think about the best way to handle a particular problem. Our first post talks a lot about building in some observability so you can figure out what's going on in Postgres, but frankly, my number one metric is how the CPU is doing, and based upon that I can generally track down what the problems are. But I hope you, your friends, family and coworkers continue to do well.

So our first piece of content is "Optimizing performance of Spark's Postgres database". This is from mitos.net, and they had an application that they had just launched, and they, quote, "quickly discovered that our database was having a hard time keeping up with the increased load". So the first thing they wanted to do is set up more observability. They were on the Fly.io platform, and they said, okay, we generally want to know how many queries per second are happening and what the overall query duration is. Then, as they researched it a little bit more: oh, we also want to know the fetches, the inserts, updates and deletes, the proportion of index scans over total scans, the CPU load and the memory use. Now, like I was alluding to, most of this is nice to have, for me personally. If I have the CPU load and I can distinguish between what's in I/O wait and what's system or user time, I can get a good sense of how hard the database is working and roughly where some problems might reside. Not highly specific, but at that point you can look at pg_stat_statements and see, okay, what statements are taking the most time, and then from there do your query optimization.

But this blog post talks about how he got all these different metrics, and you'll notice a lot of them are coming from pg_stat_database, which is just a view present on all Postgres installations. The query throughput, again, is also in pg_stat_database. For his query duration, there might be a way to do that in pg_stat_statements, I'd have to think about that, but he used a metric that apparently Fly.io is collecting, which gives the maximum transaction duration from pg_stat_activity. The rows fetched versus rows returned, again, is from pg_stat_database, and index versus sequential scans is from pg_stat_user_tables. Looking at this, he says they had a low fetched-to-returned ratio. That's not something I generally look at or pay too much attention to. Max duration had peaks that were gradually becoming higher, and the CPU load had spikes reaching 15. Okay, so the database is doing a lot of work on something. Then he looks at finding missing indexes, which requires installing the pg_stat_statements extension and enabling it. He also had to set up some way to collect and store these metrics, so he goes into a number of different ways to do that, although it's quite easy just to query pg_stat_statements. But actually fixing the slow queries, he said, was the easy part. I guess if it's simply a matter of some missing indexes, yes, that would be easy to do. And again, this is what I generally do: find the slowest queries reported in pg_stat_statements and then figure out which columns need to be indexed.
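Just to show what I mean by querying pg_stat_statements directly, here is a rough sketch of the kind of query I run to find the statements consuming the most database time. This is not from the blog post; it assumes the pg_stat_statements extension is installed and Postgres 13 or later, where the timing column is total_exec_time (older versions call it total_time):

-- Top 10 statements by total time spent in the database
SELECT calls,
       round(total_exec_time::numeric, 1) AS total_ms,
       round(mean_exec_time::numeric, 1)  AS mean_ms,
       rows,
       left(query, 80) AS query_start
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;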
So I was looking at all these things he was trying to measure, and my thought was: okay, if you have high CPU, make sure pg_stat_statements is installed, track what's going on, and then address the slowest queries. He also mentioned that they did some things at a caching layer above the database, which resulted in better application performance. Of course, you're storing cached values somewhere and not hitting the database, so that can be very performant. And as a result of these changes, they saw the CPU load on the database drop from 15 to 0.4, which is good. And the max query duration went down, so you should be able to see that as well in pg_stat_statements.

[00:04:02] He also did some things to reduce a lot of aggregate queries that were summarizing a lot of data, and created periodic aggregates that they could query instead of the raw data, and that sped things up as well. Anyway, he has this quote here: "Setting up observability for Postgres performance requires a bit of work." It can be, but it gives you valuable insights into the performance of your database, and it definitely does give you insights. But again, most of the time for me, if I know what the CPU is doing, how busy it is, how much time is being spent in I/O wait, and I have access to pg_stat_statements, I can generally detect when something's going on and then investigate and resolve it. But I'd love to know what you think. How important are all these other metrics to you? Do you take actions based upon them? Like, do you look at the ratio of rows fetched to rows returned, and does that lead you to take certain actions? Anyway, let me know in the comments.

Next piece of content: "How to avoid deadlocks in PostgreSQL migrations". This is from the Dovetail Engineering section of Medium, and they had an issue where they were running a migration within a transaction that altered a row level security policy on one table and then altered another table. At the same time, they had user requests running queries against both of these tables, and they were running into deadlocks when they tried to run this migration. So basically, a deadlock is when you have a sequence of locks that cannot be resolved: one transaction is waiting for a lock, but what it's waiting on is held by another process that is in turn waiting on the first one. Postgres chooses one of those transactions to be terminated as a result. So that's a deadlock. Now, their proposal for fixing this migration is to actually do the changes in separate transactions. So, for example, first alter the row level security policy of one table, and then do your other alter statement in a separate transaction. They also give some information about how you can actually see what is potentially locked when you're doing a particular DDL statement. In this case, they start a transaction, start the alter command they want to do, and then check the pg_locks table to see what impact it has (there's a rough sketch of that below, after the next paragraph). Now, it's not their recommendation to do this check for every migration, but they're showing it here to demonstrate how you can see what things are locked. Their first recommendation is basically: don't do multiple DDL statements within a transaction. Now of course, that runs counter to one of the benefits of Postgres, which is the fact that you can do exactly that. You can do a series of DDL changes, and if something goes wrong you can easily roll back and not have partial DDL changes committed.
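As a rough sketch of that kind of check (the table and column names here are just placeholders, not taken from their post), it looks something like this:

BEGIN;
ALTER TABLE users ADD COLUMN last_seen timestamptz;
-- Before committing, in the same session, see what locks this transaction holds
SELECT locktype, relation::regclass AS relation, mode, granted
FROM pg_locks
WHERE pid = pg_backend_pid();
ROLLBACK;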
But that's one proposal to try to avoid deadlocks. The other is to acquire your locks in a consistent order; if you always take them in the same order, you can avoid deadlocks. That's another possibility. The other thing they mention is linting your database migrations. There are tools that look for certain operations that can lock tables for a long time and potentially bring your database down. For example, I use Ruby on Rails, and there is a strong_migrations gem that helps you avoid some of these problems when you're doing migrations against your database. They also mention that there are some Postgres configuration parameters you can adjust, such as the deadlock timeout or the maximum number of locks per transaction, but they said they actually haven't changed any of these yet.

But the number one thing I did not see them talk about is the lock_timeout parameter. I would say that's the number one tool in your toolkit to avoid deadlocks or other things that can lock up your application or prevent queries from running. Basically, when you run a DDL statement in a transaction, you set a lock timeout for that session. Maybe you set it to 900 milliseconds, or maybe a few seconds, depending upon what you're trying to do. If that DDL statement cannot acquire the locks it needs within that period of time, it errors out and the transaction rolls back. So no harm, no foul, and queries can continue running without an issue. Now, if you're having an issue with deadlocks when you're trying to run a migration, I would think the lock_timeout parameter could help. So maybe you set it at 900 milliseconds, which should be below the default deadlock_timeout, which I think is 1 second. That way, if the migration is unable to acquire its locks within 900 milliseconds, maybe because something is locked and it would eventually result in a deadlock, it rolls back, allowing other queries to proceed unimpacted. So that's my number one tool when doing migrations to avoid problems such as these.
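As a rough sketch of what I mean (the 900 millisecond value and the table and column names are just illustrative, not from the post):

BEGIN;
-- Only wait up to 900ms for any lock; this stays below the default deadlock_timeout of 1s
SET LOCAL lock_timeout = '900ms';
ALTER TABLE orders ADD COLUMN note text;
COMMIT;
-- If the ALTER cannot acquire its lock within 900ms, it errors out, the transaction
-- rolls back, and other queries continue unaffected; you can simply retry later.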
This is also the post covered in Five Minutes of Postgres at pganalyze, called "Avoiding deadlocks in Postgres migrations". Lukas covers this post, and he also shows the impact of lock queues: once you have something locked up in this fashion, even if it's not a deadlock, you can see other requests start to queue up behind it, which can cause big problems for your application. And he also mentions the strong_migrations gem, which Andrew Kane actually wrote. So definitely check out these two pieces of content if you want to learn more.

Next piece of content: "PGSQL Phriday #014 wrap-up". This is from cybertec-postgresql.com, and PGSQL Phriday is the monthly blogging event. This one was on Postgres events: not events within Postgres, but events about Postgres that people go to, online or offline. And this had an enormous amount of participation; there were twelve blog posts written about this topic, and this post gives his own summary of them and links back to them. But we'll just quickly look at them in the tabs here. So the first one is from Franck on dev.to in the AWS Heroes section, and he recounted what happened with regard to PostgreSQL at the AWS re:Invent event. The next one was actually from Postgres FM, and this was the weekly episode they did on events, so you can listen to their episode here or watch the YouTube video down here.

They talked a lot about Nikolai's experience actually running some events, and both of them give their experiences as well as different advice if you're attending an event or want to be a speaker at one. The next one is from [email protected], and he has the perspective of being a newbie to the Postgres world. He felt it's very welcoming, but compared to the Microsoft SQL community, it feels very top-down structured, at least at the big events. The next one is from [email protected], and he recounted attending the PASS Data Community Summit and the Seattle Postgres User Group. Next is from pgMustard, and Michael, who's also from Postgres FM, gave some additional words of advice and his experience in written form as well. [00:10:43] The next one is Ryan from softwareandbooz.com, talking about his experience at the PASS Data Community Summit. Next is Claire from citusdata.com, giving an illustrated guide to the same PASS Data Community Summit and talking a lot about the different talks that were presented at that event. [00:11:01] Next is Ryan again, and he gives more general experience about the benefits of networking, and also talks about virtual versus in-person events and the benefits of each. And Andreas, from andreas.scherbaum.la, actually leveraged the different people he has interviewed for PostgreSQL Life and counted how many times all the different conferences were mentioned, so that's some interesting data. [00:11:25] Henrietta, at hdombrovskaya.wordpress.com, talked all about her experience with events. And then we had Rajiv, at Rajivarvalka Tech, who gave a short synopsis of his experience. [00:11:39] And lastly, [email protected] gave his experience of attending Postgres conferences as well as the benefits of each of them. So definitely check out these blog posts if you want to learn more.

[00:11:52] Next piece of content: "PostgreSQL and FIPS mode". This is from peter.eisentraut.org, and FIPS refers to the Federal Information Processing Standards. Mostly you hear about this with regard to security, particularly with regard to OpenSSL. He says the one that's probably the most relevant is FIPS 140-2, which I'll have to take his word for because I'm not that familiar with it. And he says that, out of the box, OpenSSL does not satisfy the requirements of FIPS 140-2, but some versions can be configured to operate in a mode that does satisfy it. Now, he talks about this with regard to OpenSSL and then how it impacts PostgreSQL as well, and he goes through a lot of detail, but he has a very good summary at the bottom. So basically, in Postgres 14 and higher, MD5 authentication won't work if you've enabled FIPS mode. In PostgreSQL 15 and higher, pgcrypto requires OpenSSL and is therefore subject to FIPS mode. And in Postgres 17 and higher, all the PostgreSQL test suites will pass with OpenSSL in FIPS mode. So basically this blog post is a result of him trying to get all the tests to pass with OpenSSL in FIPS mode. If this is something of interest to you, I highly encourage you to check out his blog post.

[00:13:11] Next piece of content: "Fun with Postgres: ASCII Maps and Cardinal Directions". This is from crunchydata.com, and this is the day 23 challenge from the 2022 Advent of Code. Again, I won't go through it further because it would spoil it. But the last piece of content is "Advent of Code in PostgreSQL: tips and tricks from 2022". So if you're looking to use Postgres to solve some of these challenges in 2023, here are some tips that you can go through to do that.
So definitely check out this blog post if you want to learn more. [00:13:46] I hope you enjoyed this episode. Be sure to check out scalingPostgres.com where you can find links to all the content mentioned in this episode, as well as the podcast version of the show and eventually a full transcript. Thanks, and I will see you next week.

Other Episodes

Episode 154

March 01, 2021 00:19:24

Connection Queuing, Connection Pool Size, Simple Auditing, Fuzzy Matching | Scaling Postgres 154

In this episode of Scaling Postgres, we discuss connection queuing, estimating connection pool size, how to do simple auditing and how to run queries...

Listen

Episode 234

September 25, 2022 00:16:20

Rust for Extensions, Timescale vs. Postgres, Uninterrupted Sharding, Data Flow | Scaling Postgres 234

In this episode of Scaling Postgres, we discuss using rust for Postgres extensions, performance comparisons of TimescaleDB vs. Postgres, uninterrupted writes when sharding in...

Listen

Episode 96

January 13, 2020 00:12:11

Account Configuration, Pricing, System Identifier, HyperLogLog | Scaling Postgres 96

In this episode of Scaling Postgres, we discuss user account configuration, pricing, the system identifier and the HyperLogLog extension. To get the show notes...

Listen