Episode Transcript
[00:00:00] In this episode of Scaling Postgres, we talk about planner optimization, teaching elephants to fish, large databases and spatial data. I'm Creston Jamison, and this is Scaling Postgres, episode 213.
[00:00:22] All right, I hope you, your friends, family and coworkers continue to do well. Our first piece of content is Optimizing with the PostgreSQL Deterministic Query Planner. This is from Aha.io, and they're talking about the perspective that, quote, the PostgreSQL query planner will always discover the optimum query plan, given its statistical knowledge of your system and your data. Now, I usually don't deal with absolutes, so I'm a little hesitant to say always, but if you give the query planner as much information as you can, it will find you the right plan the vast majority of the time. Does it do it 100% of the time? I'm not willing to go quite that far, but this is a great post that describes different ways you can give more information to the planner to optimize your queries. The first area they suggest changing is setting the right cost for page access. Here they're talking about the ratio between the sequential page cost and the random page cost. This is something you adjust in the configuration parameters of Postgres, and by default the random page cost is set to four and the sequential page cost is set to one. The reason it's set this way is the assumption that a random read from a hard disk is going to be roughly 40 times slower than a sequential read, and that maybe 90% of your disk pages are cached in memory, which works out to about a four-to-one ratio. What they basically did is re-examine the assumptions behind that ratio. They said, well, we're not using magnetic disks; they're using NVMe drives or something equivalent, some sort of fast network-based storage. So what is the appropriate ratio? Their cache hit rate is close to 99%, and they're assuming there's probably a 5x cost ratio, instead of 40x, for a random access versus a sequential access. Plugging that into the formula, the random page cost, instead of being four, should be something like 1.5. So that's what they set it to, and they saw a marked decrease in full table scans, except on very small tables. That, of course, resulted in better performance, because the planner now estimates that an index scan with its random page accesses will be faster than a full sequential table scan. The second area they looked at was setting up statistics for dependent columns. They have a query here against a particular table with an account ID, product ID and initiative ID, and Postgres considers these columns independent. So if you're filtering on all of them, it's potentially going to try to do a bitmap merge of all three indexes to get the right answer. But the reality is these columns are not independent, and the data looks more like this from their perspective. In other words, everything is encompassed within an account ID, the product ID is a part of it, and an initiative ID is a significant part of the product ID. There's this overlap, and you can communicate this type of dependency to the planner so it can make better plans for querying it. You do that using CREATE STATISTICS: they created a statistics object called feature stats and defined dependencies on the account ID, product ID and initiative ID columns. Then once they analyzed the table, the statistics got updated, and when querying it they saw a 20,000-fold improvement. Now, I've seen improvements like this using, say, compound indexes; something like that can give a drastic improvement too.
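To make those two changes concrete, here's a minimal sketch in SQL. The table and column names follow the example in the post (a features table with account, product and initiative IDs), so treat them as placeholders for your own schema, and the 1.5 value is the one they arrived at for their hardware, not a universal recommendation:

```sql
-- Lower random_page_cost to reflect SSD/NVMe storage and a high cache hit
-- rate; the default of 4.0 models spinning disks with heavy caching.
ALTER SYSTEM SET random_page_cost = 1.5;
SELECT pg_reload_conf();

-- Tell the planner these columns are functionally dependent rather than
-- independent, then re-analyze so the extended statistics get collected.
CREATE STATISTICS feature_stats (dependencies)
    ON account_id, product_id, initiative_id
    FROM features;
ANALYZE features;
```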
Coming back to the stats example, essentially you're relying on the planner to choose the path that a custom-crafted compound index might otherwise give you. Basically, it can see that because the initiative is such a small subset, it can query on that and use a filter to throw out any product IDs or account IDs that don't match. Well, you won't find rows for, say, initiative three under any account ID other than one, and there are just a few product IDs that have to be filtered out. So that allows the planner to make a far more efficient plan. The next area they covered is beware the join collapse. Now, a lot of this is fascinating because some of it was actually new to me. The situation is that when you have a lot of joins, planning can really start slowing down. The reason is that Postgres uses a deterministic method to find the optimum plan to execute your query, and when you have too many joins, the possibilities grow exponentially and planning would take forever, or at least a really long time. So there was actually another organization that developed a genetic-algorithm-based query optimizer that is based more on heuristics. It applies the genetic concept of evolution to pick the best plan, so there's a bit of randomness, or nondeterminism, in how it seeks out the best plan. Now, it's not always going to find the best plan; there's a chance it will choose a poor one rather than the best one the way a deterministic search would, but it lets you avoid checking all the paths while still picking the optimum plan most of the time. When to switch from the deterministic to the nondeterministic planner is determined by geqo_threshold, and by default this is set at twelve. How much work that heuristic planner actually does is based upon geqo_effort, which ranges from one to ten and defaults to five. They've had good experience increasing the threshold to 20, so pushing back when that heuristic calculation gets used. And for when a query does go over that threshold and the heuristic algorithm kicks in, they're considering increasing the effort value to spend more time trying to get the best plan, because for them it's more important to get an accurate plan than to make the queries a little bit faster by cutting down on planning time. But this is a pretty interesting adjustment that has allowed them to get a bit more performance for their particular queries. Now, the other area that they didn't mention, which you'll definitely want to do, is increasing your default statistics target. By default, Postgres samples a set number of rows from each table to collect statistics about it. As your data sizes increase, you're probably going to want to increase the number of rows sampled, because I've had queries that suddenly started performing poorly, and to resolve the situation all we had to do was bump up the number of rows included in the statistics tenfold. So that's another area: increasing that either at the whole-database level or at a table level, so the planner has more relevant rows to work with to determine what the best plan should be. But this is a really great post I encourage you to check out.
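As a rough sketch of the planner settings discussed here: the geqo_threshold of 20 is the value they mention, while the geqo_effort of 7 and the statistics targets below are just illustrative numbers, and the events table with its account_id column is hypothetical:

```sql
-- Raise the point at which Postgres switches from the exhaustive planner to
-- the genetic (GEQO) planner, and spend more effort when it does kick in.
ALTER SYSTEM SET geqo_threshold = 20;  -- default is 12
ALTER SYSTEM SET geqo_effort = 7;      -- default is 5, range 1 to 10

-- Sample more rows when gathering statistics, either cluster-wide...
ALTER SYSTEM SET default_statistics_target = 1000;  -- default is 100
SELECT pg_reload_conf();

-- ...or just for a particular column on a particular table.
ALTER TABLE events ALTER COLUMN account_id SET STATISTICS 1000;
ANALYZE events;
```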
[00:07:12] Now, related to this, this post was also covered in Five Minutes of Postgres. Five Minutes of Postgres episode 15 is Tuning Postgres' Deterministic Query Planner, Extended Statistics and Join Collapse Limits. He reviews this post as well, and I definitely encourage you to check out his episode for his perspective on it. The next piece of content is Teaching Elephants to Fish. This is from Timescale.com, and they're talking about Babelfish. If you're not familiar, Babelfish is an extension that was developed, I believe, by Amazon AWS that allows a PostgreSQL instance to be wire compatible with Microsoft SQL Server clients. So basically you can have a Microsoft SQL Server client that speaks its standard wire protocol, the one it uses to talk to other Microsoft SQL Servers, but it can actually talk to Postgres and send the exact same types of data. Now, with regard to this, the post is really talking about community. Looking at the DB-Engines ranking, you can see that over the last nine years Postgres has continued to increase, the other relational database systems have been declining slightly, but Microsoft SQL Server has been declining more. So he's saying there's the potential for many more new community members to come into Postgres from Microsoft SQL Server because of this, and he basically says we need to have a plan for it. He goes over five areas we should consider in terms of community management for potential new members. One is lead with empathy and curiosity. Two is lower the bar for entry-level #PgHelp. Third is support new members by cultivating more leaders. Fourth, seek leaders proactively, as opposed to just waiting for them to come forward. And fifth, develop consistent messaging around community. So if you're interested in keeping and developing a healthy community, this is definitely a blog post to check out. The next piece of content is Working with Large PostgreSQL Databases. This is from Percona.com, and they're talking about things you should do when you're working with a large database as opposed to a small database. Now, that's kind of a loaded question: what is small, and what is large? Some people may say small is in the megabytes, some say the gigabytes, while there are people with large databases dealing with petabytes. But the other thing to consider, because a lot of the things mentioned here aren't necessarily about small versus large, is that when you have a large database, you're going to have more customers. Those customers are going to have more eyes on your product, and if you go down for any reason or have issues, it's going to be far more visible to more people. A large database also implies you're making a lot more money, so there's more money on the line if something happens to the database. And it implies, again with more money, that you probably have more engineers working on the product and more people working with the database. That also pushes a lot of the things mentioned in this article. For example, they talk about starting to use more automated tools; well, that becomes more important the more people's hands are in the mix, because you don't want to do things manually. They talk about being consistently monitored and going through an active tuning cycle; again, these become more important the more people are using it.
Now, from a technical perspective, as you get large you will have more bloat to deal with, and you'll have to do more autovacuum tuning compared to when the database was small. You also have to have optimized indexes; they're much more important on large databases than on small ones. So there are definitely technical differences that need to be addressed. But this post does a pretty good overview of the things you need to be aware of as your database increases in size into that large area, whatever you call it. If you're interested in learning more, you can definitely check out this blog post.
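On the autovacuum point, one common approach for very large tables is to tune autovacuum per table rather than relying on the global defaults. A minimal sketch, with a hypothetical table name and illustrative values that would need testing against your own workload:

```sql
-- Vacuum and analyze a huge table much more aggressively than the defaults,
-- since the default scale factors let a lot of dead rows accumulate first.
ALTER TABLE big_events SET (
    autovacuum_vacuum_scale_factor  = 0.01,   -- vacuum after ~1% dead rows
    autovacuum_analyze_scale_factor = 0.005,  -- refresh statistics sooner
    autovacuum_vacuum_cost_delay    = 2       -- let vacuum work faster (ms)
);
```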
[00:11:02] The next piece of content is Using Uber's H3 Hex Grid in Postgres. This is from Rustprooflabs.com. They're talking about an extension called H3, which implements the hexagonal grid system developed by Uber. It's packaged as an extension, so you can create it in your PostgreSQL database, and it allows you to create things like the geometry boundaries for the hexes and even work at different resolutions. Here he loaded in some OpenStreetMap data to give a hex representation of different concentrations of traffic signals; there's a rough sketch of that kind of query just below. So if you're interested in that, you can check out this blog post. The next piece of content is Mapping Fracking Wells in Colorado with PostgreSQL. This is from Selectrum Dev, and he's talking about using PostgreSQL to analyze and draw maps of these fracking wells. Again, this uses PostGIS to be able to map and present that data. So I encourage you to check out this blog post if you're interested.
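Going back to the H3 post for a moment, here's a rough sketch of what bucketing points into hexes can look like. It assumes the h3-pg extension plus its PostGIS companion, and the function name below follows the v4-style API, which may differ in other versions of the extension; the osm_traffic_signals table and its way geometry column are hypothetical stand-ins for the OpenStreetMap data he loads:

```sql
CREATE EXTENSION h3;
CREATE EXTENSION h3_postgis CASCADE;

-- Count traffic signals per hex cell at resolution 7.
SELECT h3_lat_lng_to_cell(way, 7) AS hex,
       count(*)                   AS signals
FROM   osm_traffic_signals
GROUP  BY hex
ORDER  BY signals DESC;
```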
[00:12:01] The next piece of content is Analyzing Historical Flight Data with MobilityDB. This is from Cybertec-postgresql.com. MobilityDB is actually an extension that sits on top of PostGIS to do spatiotemporal analysis of data. He walks through getting it set up, shows the stack he's using for the analysis, and shows the different commands to go ahead and do this type of analysis on historical flight patterns. So if you're interested in that, you can check out this blog post.
[00:12:33] The next piece of content is Global Deadlock in a Distributed Database Cluster. This is from Highgo.ca. And of course, a deadlock is when one session has locked a row and needs to lock another row, while a second session has already locked that other row and is waiting on the row the first session holds. Basically, there's no way to resolve that locking situation, so you have to declare a deadlock and kill one of the sessions. Now, this happens on the local PostgreSQL cluster side, but what happens when you start trying to run a distributed cluster of PostgreSQL systems? Well, then you need some sort of global deadlock mechanism. This blog post walks through setting this up using the postgres foreign data wrapper. They create two coordinator nodes and two data nodes and show an example of a deadlock happening, and basically it waits forever. So this is a scenario we can get into with the current approach of trying to maintain a cluster of PostgreSQL instances. How are we going to solve this particular problem? They have a little bit of a discussion about it, but basically this is a work in progress; the post highlights the issue at hand and some thoughts on how to address it. So if you're interested in that, you can check out this blog post.
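To make the single-node case concrete before the distributed one, here's a minimal deadlock sketch in plain SQL; the table and values are hypothetical, and the two sessions have to run in separate connections, interleaved in the order of the step comments:

```sql
CREATE TABLE accounts (id int PRIMARY KEY, balance int);
INSERT INTO accounts VALUES (1, 100), (2, 100);

-- Session A, step 1
BEGIN;
UPDATE accounts SET balance = balance - 10 WHERE id = 1;

-- Session B, step 2
BEGIN;
UPDATE accounts SET balance = balance - 10 WHERE id = 2;

-- Session A, step 3: blocks, waiting on the row session B holds
UPDATE accounts SET balance = balance + 10 WHERE id = 2;

-- Session B, step 4: now waits on session A's row; after deadlock_timeout
-- (1 second by default) Postgres detects the cycle and aborts one session
-- with "ERROR: deadlock detected"
UPDATE accounts SET balance = balance + 10 WHERE id = 1;
```

In the distributed setup from the post, each data node only ever sees its half of that wait cycle, so neither local deadlock detector fires and the sessions wait forever, which is exactly why a global mechanism is needed.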
[00:13:53] The next piece of content is Formatting SQL Code with pgFormatter within Vim. This is from Endpointdev.com, and we've mentioned pgFormatter on previous episodes of Scaling Postgres, but here he's using it within the Vim editor to auto-format your SQL. There's a quick little animated GIF that shows taking unformatted SQL and then formatting it. He also links to someone doing the same thing in Emacs. So if you're interested in that, you can check out this blog post.
[00:14:28] The next piece of content is Postgres Major Version Upgrades Using PGO v5.1. This is from Crunchydata.com, and this is about PGO, their Postgres operator for Kubernetes. So if you're using Kubernetes to manage your PostgreSQL cluster, their operator allows you to do a major version upgrade. Now, it does require a stop and then a start of the Postgres instances, but it does seemingly allow you to do it quite easily using their operator. So if you're interested in that, you can definitely check out this blog post.
[00:15:02] The next piece of content: the PostgreSQL person of the week is Philippe Peruis. If you're interested in learning more about Philippe and his contributions to Postgres, definitely check out this blog post. And the last piece of content: we did have another episode of The Rubber Duck Dev Show this past Wednesday evening. This episode was on whether or not to DRY your code. So if you're interested in more long-form, developer-based content, maybe you'd like to check out our show.
[00:15:30] That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to Scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.