[00:00:00] In this episode of Scaling Postgres, we talk about query optimization, normalization, visualizing vacuum, and sharding's future. I'm Creston Jameson and this is Scaling Postgres, episode 66.
[00:00:22] Alright, I hope you're having a great week. Our first piece of content is that PostgreSQL 12 Beta 1 has been released, so you can download it now and try it out. The link is in the PostgreSQL news and it goes over all the different feature highlights. I'm not going to go into depth on all of these because we've mentioned some of these features in previous episodes. But in terms of indexing performance, functionality and management, mainly with regard to management, the REINDEX operation that can now be done concurrently is pretty huge, and there are also some performance and efficiency gains for different index types. Another good one is the inlining of WITH queries, or basically common table expressions, which are no longer materialized by default, though you can always add the MATERIALIZED option if you want the old behavior. There are also different partitioning improvements, JSON path queries per the SQL specification, the nondeterministic collations feature, most-common-value extended statistics so you can create statistics to cover more complex relationships between columns, the generated columns feature, the new pluggable table storage interface, page checksums, and authentication and connection security changes. There are also notable behavior changes that are definitely something to watch out for. One is that the recovery.conf configuration file is now merged into the main postgresql.conf file, so that's definitely something to be aware of. Also, interestingly, JIT compilation is now enabled by default, and OIDs can no longer be added to user-created tables. So definitely a lot of new features coming in 12, some of them very welcome. If you want to try it out, feel free to go ahead and download it.
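To make a couple of those concrete, here's a minimal sketch of the new syntax (the index and table names are hypothetical):

    -- Rebuild an index without blocking writes (new in PostgreSQL 12):
    REINDEX INDEX CONCURRENTLY idx_orders_customer_id;

    -- CTEs are no longer materialized by default in 12; you can ask for
    -- the old behavior explicitly with the MATERIALIZED keyword:
    WITH recent AS MATERIALIZED (
        SELECT * FROM orders WHERE created_at > now() - interval '1 day'
    )
    SELECT count(*) FROM recent;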
[00:02:06] The next article is A Tale of Query Optimization, and this is from parallelthoughts.xyz. He talks about a slow query they were running for, I believe, the Visual Website Optimizer platform.
[00:02:21] A customer was running a query that was slow, and in the investigation he shows what the query looked like. They actually have tables per account, so it wasn't another customer causing the problem, and they were trying to figure out why it was taking so long to run.
[00:02:38] Now, what I found interesting is that he shows the planning time and execution time, but he doesn't show the EXPLAIN plan anywhere. I really like posts like this, where they look at a performance optimization and what they did to get queries running faster, and I usually like looking at the EXPLAIN ANALYZE output to see, okay, where exactly does this need to be improved, or at least to hypothesize, oh, that's the problem. But he didn't include that in this particular post that I could see anywhere. Maybe he looked at EXPLAIN ANALYZE, but he basically started working through what could be the problem. Maybe it's the ILIKE doing the scan? So he looked at that part of the query in isolation, and that wasn't too bad. Then he asked, all right, is it the multiple join statements? So he looked at all the different joins, and those were pretty fast too. Then he thought maybe it's the URL subquery, so he tried using an EXISTS instead, and that was pretty fast. Then he tried moving the subquery into a CTE, trying to break up what was required, as he said, but it was still extremely slow. Again, for some of these tests, particularly this one, I would have liked to have seen the EXPLAIN ANALYZE plan to see if it targeted what the problem was. But basically the issue was the && operator, the array overlap operator. Now, he did mention he ran EXPLAIN ANALYZE when looking at this && operator, and he said there were multiple rows of just these filters, quote, "which means that not only was this operation expensive, it ran multiple times." So he basically took the problem in isolation and wrote SELECT queries to pull out the data himself, to do the comparison by unnesting what was there. As he says, we can look at this operation as traversing items in an array, or rows in a table, and stopping as soon as the matching condition is met. So, bringing it all together, he redesigned the query, and the final execution time was 2 seconds. Now, in case you didn't catch how long the original one took, the original query took 24 minutes, so getting it down to 2 seconds is a huge improvement. I think there are definitely some insights to gain here, and if you want to look into this post in more depth, I definitely encourage you to check it out.
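The post's exact query isn't reproduced here, but here's a hedged sketch of the general idea, using a hypothetical events table with a tags array column: instead of repeatedly evaluating an expensive && (array overlap) filter, you unnest the array and let an EXISTS stop at the first match.

    -- Original shape: && checks whether two arrays share any element,
    -- and here it was both expensive and evaluated multiple times:
    SELECT count(*) FROM events
    WHERE tags && ARRAY['signup', 'purchase'];

    -- Rewritten shape: traverse the array's items and stop as soon as
    -- the matching condition is met:
    SELECT count(*) FROM events e
    WHERE EXISTS (
        SELECT 1
        FROM unnest(e.tags) AS t(tag)
        WHERE t.tag IN ('signup', 'purchase')
    );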
[00:05:15] The next post is How I Decimated Postgres Response Times for My SaaS, and this is from the Checkly blog at checklyhq.com. Now, this is again another query performance optimization post, but this one's a bit simpler, so I'll skip to the end, where basically it was just a single multicolumn index that drastically improved performance. He had an index on a check ID, because they do API checks, and an index on a date field; making it a multicolumn index for queries where the check ID equals a certain value and the results are ordered by that datetime field resulted in an enormous performance benefit for him. So again, this is a simpler perspective on query optimization, just using a multicolumn index to satisfy WHERE and/or ORDER BY clauses and make them run more efficiently. But if you're interested, definitely a blog post to check out.
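As a hedged sketch with hypothetical table and column names, the fix looks something like this: one index covering both the equality filter and the sort.

    -- A multicolumn index serving both the WHERE and the ORDER BY:
    CREATE INDEX idx_check_results_check_created
        ON check_results (check_id, created_at DESC);

    -- A query of this shape can then read pre-sorted rows from the index:
    SELECT *
    FROM check_results
    WHERE check_id = 42
    ORDER BY created_at DESC
    LIMIT 50;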
[00:06:19] The next post is Optimizing Your App by Understanding Your PostgreSQL Database, and this is from a speaker from Citus Data; it was a presentation given at RailsConf and is part of the Confreaks YouTube channel. This video is about 40 minutes in length, and it is a very, very good presentation, because he approaches it from the perspective of looking at what your database statistics are telling you. So not just looking at a single query, but looking at the database holistically. The number one thing he says to look at first, for example, is your cache hit ratio, and to keep that in the 99% realm, and if it's not, to look at some ways to improve it. He also covers things like the pg_stat_database view, pg_stat_user_tables and pg_stat_user_indexes to get an overall sense of how your database is running: how often vacuums are running, whether tables are being accessed via indexes or sequential scans, whether it's more updates, deletes or inserts, and also examining pg_stat_statements, where you can get statistics on your overall queries against the database. So it's definitely a very good presentation that I encourage you to check out.
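For reference, here's a sketch of the kinds of queries he's describing (the exact queries in the talk may differ):

    -- Cache hit ratio per database, from pg_stat_database:
    SELECT datname,
           round(blks_hit * 100.0 / nullif(blks_hit + blks_read, 0), 2)
               AS cache_hit_pct
    FROM pg_stat_database;

    -- Index vs. sequential scan counts per table, plus write activity,
    -- from pg_stat_user_tables:
    SELECT relname, seq_scan, idx_scan, n_tup_ins, n_tup_upd, n_tup_del
    FROM pg_stat_user_tables
    ORDER BY seq_scan DESC;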
[00:07:32] The next post is Normalize to Save Space. This is from Luca Ferrari at fluca1978.github.io, and he basically looks toward normalization as a way to save space. He says, quote, "I don't tend to over normalize data ahead of design."
[00:07:54] So he had a case where a name column existed on essentially a statistics table, or I should say a sensor table, and was also stored in another one. He decided to normalize it and remove the name from that table. And of course, when he did that, he went from an overall database size of 13 GB, not huge, down to essentially 9 GB. So it's just a short little post demonstrating how, when you practice normalization, you can actually decrease the amount of data your database contains.
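As a sketch of the idea with hypothetical table names: pull the repeated name out into its own table and reference it by id, so each reading row stores a small integer instead of the full text.

    CREATE TABLE sensors (
        id   serial PRIMARY KEY,
        name text NOT NULL UNIQUE
    );

    CREATE TABLE readings (
        id        bigserial PRIMARY KEY,
        sensor_id integer NOT NULL REFERENCES sensors (id),
        value     numeric NOT NULL,
        read_at   timestamptz NOT NULL DEFAULT now()
    );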
[00:08:25] The next post is Visualizing PostgreSQL Vacuum Progress, and this is from Dave Pacheco's blog.
Now, this is a pretty long post, but it's very interesting. He goes over how they visualized vacuum's progress, because vacuum for a lot of people is a black box, and there is one area that remains a black box even when using the pg_stat_progress_vacuum view added in recent versions. He's using Grafana to visualize the progress of vacuum. So this is a single table being vacuumed: the blue line is the total number of heap blocks that need to be scanned, and the view can tell you how many heap blocks have been scanned, so you can see it progressing toward that goal over time. The green line indicates what vacuum phase it's in. So basically, first it is scanning the heap, and then it goes into the next phase, which is vacuuming indexes. It vacuums each index, identifying and removing references to dead tuples, and then it goes on to actually vacuuming the heap and removing those tuples. You can see here that the heap blocks vacuumed climb up to the heap blocks total over time. So the view does give us insight into when things are being scanned and when things are being vacuumed, but in this middle part, the vacuuming of indexes, we have essentially no idea what's going on. Now, with a manual vacuum, you can run it in verbose mode and it will tell you about each index as it happens, so you can get some insight there, but it's not contained within the pg_stat_progress_vacuum view. This is an interesting visualization, and throughout this post he talks all about how vacuum operates, so again, this is another great post I suggest you check out. But going back to multiple index scans: if your maintenance_work_mem is not high enough, vacuum actually has to do multiple index scan passes to clean up all the dead tuples. A more complex graph explains this, where it goes through the phases of scanning the heap, doing the index scans and vacuuming the heap, but then it must go back to scanning the heap again, because maintenance_work_mem wasn't sized large enough to hold everything that needed to be vacuumed in memory. So it must basically do a second pass of everything. So it's definitely a message to make sure your maintenance_work_mem is sufficiently sized. If you have large vacuums, maybe you want to look into this type of visualization to know what's going on and how many passes your vacuum is essentially doing, like this graph demonstrates, where it had to essentially vacuum twice because maintenance_work_mem wasn't set sufficiently. He also provides the view here showing their vacuum progress layout in Grafana. So again, I find this super interesting, and he goes into a lot of depth explaining how things work. If you want more insight into your vacuum process, or even just to know a little bit more about it, it's definitely a blog post I suggest checking out.
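If you want the raw numbers behind a visualization like this, the view can be polled directly; a minimal sketch:

    -- pg_stat_progress_vacuum reports one row per backend running VACUUM:
    SELECT pid,
           relid::regclass AS table_name,
           phase,
           heap_blks_total,
           heap_blks_scanned,
           heap_blks_vacuumed,
           index_vacuum_count   -- how many index scan passes so far
    FROM pg_stat_progress_vacuum;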
[00:11:39] The next post is Introducing pg_auto_failover: A High Availability and Automated Failover Postgres Extension, and this is from the citusdata.com blog. Now, in a previous episode, I believe it was episode 63, I mentioned pg_auto_failover; it was the Microsoft blog that announced it. This one again makes an announcement on this blog, and it gives a good introduction to how there exists a monitor, which is a Postgres database itself with the auto failover extension, and it basically registers and checks the health of the different active Postgres nodes. So it's basically a self-contained high availability solution, or I shouldn't say self-contained, but this one open source project is a high availability Postgres solution. You register each Postgres node with the monitor, and the monitor basically manages these Postgres instances for you; if it notices that one goes down, it switches the traffic to a new one. It goes over in depth how it works, has a good flowchart of their finite state machine, and even shows exactly what commands you need to run to make it work. So again, this is a good complement to the previous post I covered. If you're interested in a high availability solution that looks relatively easy to set up, definitely check out this blog post and the previous one in episode 63 and try it out. I'm actually going to be looking into this myself, because I do find it pretty interesting and wonder how easy it will be to set up for perhaps some of my systems.

The next post is Swoop Dedup, and this is from the 2ndQuadrant.com blog. They're talking about the problem of duplicate rows, and they have an imaginary scenario where they created a contacts table with no controls, no keys or anything to prevent duplicate records. Say you use COPY to load the data multiple times because there were issues: it's supposed to have 26 records, but it has 104. The solution they use is to first add a primary key as a serial type, and then use a window function, ROW_NUMBER() OVER with a PARTITION BY on the relevant columns, to find duplicate names, and then run a DELETE against that query where the row number is greater than one. So, for example, where the row number is two or three, anything more than one, eliminate those rows and keep the first one. This is one solution you can use to remove duplicate data from your tables.
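A hedged sketch of that dedup technique, with a hypothetical contacts table and name columns:

    -- Add a surrogate key so duplicate rows become distinguishable:
    ALTER TABLE contacts ADD COLUMN id serial PRIMARY KEY;

    -- Number each row within its group of duplicates, then delete
    -- everything past the first:
    DELETE FROM contacts
    WHERE id IN (
        SELECT id FROM (
            SELECT id,
                   row_number() OVER (PARTITION BY first_name, last_name
                                      ORDER BY id) AS rn
            FROM contacts
        ) numbered
        WHERE rn > 1
    );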
[00:14:12] The next post is The Future of Postgres Sharding by Bruce Momjian, and this is from Momjian.us. Again, this is something that's coming in the future: Postgres is slowly moving toward having sharding built in. He talks about scaling up versus scaling out, and about what sharding support exists today, where essentially, relying on the postgres foreign data wrapper, you can send queries to multiple foreign servers, as they're called, and with the most recent version, now 11, you can push different types of aggregates down to those servers. What he's saying here is that some of these operations need to be done serially, for example when calculating aggregates, but they basically want to have parallel shard access, so that's one of the next stepping stones. The other thing is joins with replicated tables: for example, maybe you have certain lookup tables you want to exist on each shard to make push-down queries easier when looking at data on a particular shard, or in a particular foreign database. They're looking into that, as well as, of course, shard management and global snapshots. So this is a pretty brief presentation, but it puts forth some of the additional features that are needed to make sharding a built-in component of community PostgreSQL.
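The building blocks that exist today are already usable; here's a hedged sketch of FDW-based sharding with hypothetical server and table names, along the lines of what the presentation describes:

    CREATE EXTENSION postgres_fdw;

    -- Each shard is a regular Postgres server registered as a foreign server:
    CREATE SERVER shard1 FOREIGN DATA WRAPPER postgres_fdw
        OPTIONS (host 'shard1.internal', dbname 'app');

    CREATE USER MAPPING FOR CURRENT_USER SERVER shard1
        OPTIONS (user 'app', password 'secret');

    -- A hash-partitioned table whose partitions live on the foreign servers:
    CREATE TABLE measurements (
        device_id integer NOT NULL,
        reading   numeric
    ) PARTITION BY HASH (device_id);

    CREATE FOREIGN TABLE measurements_p0
        PARTITION OF measurements
        FOR VALUES WITH (MODULUS 4, REMAINDER 0)
        SERVER shard1;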
[00:15:34] The next post is Parallel PostGIS and PostgreSQL 12. Now, the TLDR, he says, is that PostgreSQL 12 and PostGIS 3 have finally cracked the parallel spatial query execution problem, and all major queries execute in parallel without extraordinary interventions. Now, we mentioned this previously in Scaling Postgres, where you basically had to alter function costs to trigger some of the PostGIS queries to run in parallel. But with the default costs in these versions, you can get parallel sequential scans, aggregates and joins. So he's particularly pleased by this capability. If you use PostGIS, maybe you want to start checking out PostgreSQL 12 to see if it can benefit your queries.
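For context, the old workaround looked something like this (a sketch only): inflating a spatial function's planner cost estimate to make a parallel plan attractive.

    -- Pre-PostgreSQL 12 / PostGIS 3, you could nudge the planner by
    -- raising the cost estimate of an expensive spatial function:
    ALTER FUNCTION ST_Intersects(geometry, geometry) COST 10000;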
[00:16:24] The next post is Deploying Active-Active PostgreSQL on Kubernetes. This post uses SymmetricDS, which I believe is a trigger-based replication system for doing a master-master type PostgreSQL setup. So if you're wanting to use Kubernetes with PostgreSQL, it's definitely a blog post to check out.
[00:16:46] That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to Scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.