Episode Transcript
[00:00:00] In this episode of Scaling Postgres we talk about the Postgres query planner, BRIN indexes, approximation algorithms and delayed replication. I'm Creston Jamison and this is Scaling Postgres episode 53. Alright, the first article this week is "Debugging the Postgres query planner", and this is from the gocardless.com blog. In it they're basically talking about how they have a two terabyte database and they saw that a query, or a set of queries, started running slower. And apparently, based upon the amount of data, the statistics caused the planner to start using a different plan. So they say, quote, we saw a concerning number of queries that were very long running that normally execute in under 50 milliseconds, and this was the query they were having difficulty with. They give a little bit of background about how their data is laid out: they have a payments table that connects to a payment_transitions table, and they make a number of observations, like approximately 20% of the payment_transitions table will be marked with a payout ID. Now, the first thing they did to examine this was use EXPLAIN to get a sense of what query plan is being used, and they noticed it was using a nested loop, which is a little less efficient compared to some other ways of doing the join. So basically they go through looking into it and trying to get a sense of what the statistics say and how many rows the planner should expect. They looked at the pg_stats view to get some of this information, and then they had a discussion about what their ideal plan would be. They even drill into the source code trying to determine what went wrong in the planner. And they think the major issue was related to the query's LIMIT 1: because the statistics were off, the planner thought that scanning sequentially for that one row would be the fast way to do it, when in reality that plan had to search through far too many rows to find a match. Then they talked about how they fixed it, which was basically adjusting the statistics for the payout ID column on the payment_transitions table so that it gets sampled at a higher rate. But of course they mention that you don't want to sample at too high a rate, because then your ANALYZE runs, including autoanalyze, become more expensive. So I thought this was an interesting blog post, but one thing I tend to like to do, going back to the EXPLAIN plan, is run EXPLAIN ANALYZE. EXPLAIN on its own only gives you the planner's estimated costs and row counts, whereas adding ANALYZE actually runs the query and reports the actual rows and timings, so you can more easily see discrepancies between what the planner thinks is there versus what it actually took to run the query. So in general I always like to do that.
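To make that concrete, here's a rough sketch of the kind of commands involved. The payment_transitions and payout_id names are stand-ins rather than GoCardless's exact schema:

    -- Compare the planner's estimates with what actually happens at runtime
    EXPLAIN ANALYZE
    SELECT *
    FROM payment_transitions
    WHERE payout_id = 'PO123'
    LIMIT 1;

    -- Raise the sampling rate for this one column (the default_statistics_target is 100)
    ALTER TABLE payment_transitions
      ALTER COLUMN payout_id SET STATISTICS 1000;

    -- Re-gather statistics so the planner sees the new, larger sample
    ANALYZE payment_transitions;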
[00:03:06] The other thing I noticed is that they said approximately 20% of the payment_transitions rows will be marked with a payout ID. That means 80% of them don't have one, and they have a simple index on that column. My first inclination would be to use a partial index. So perhaps even before looking at adjusting the statistics, maybe put a partial index on the column where the payout ID is not null. That gives the index far fewer rows to sort through and will probably increase the speed of the query in general. But anyway, I thought this was a good post that goes through how they diagnosed a slow query on a large table and how they chose to adjust statistics as a way to influence what the query planner was doing.
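If you wanted to try that approach, a minimal sketch would look something like this, again using hypothetical table and column names:

    -- Index only the ~20% of rows that actually have a payout assigned
    CREATE INDEX CONCURRENTLY idx_payment_transitions_payout_id
      ON payment_transitions (payout_id)
      WHERE payout_id IS NOT NULL;

The resulting index is smaller and cheaper to maintain, and the planner can use it for any query whose WHERE clause implies payout_id IS NOT NULL.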
[00:03:52] The next post is "PostgreSQL BRIN Indexes: Big Data Performance With Minimal Storage", and this is from the Crunchydata.com blog. Basically they're exploring BRIN indexes, and the use case they have in mind is time series data. So they're talking about sensor readings: they create a table that has a unique ID, the value scanned by the sensor as a float, and a timestamp, and they want to be able to query by the timestamp. First they inserted 10 million rows into this table and then ran a query over a time range, grabbing a month of the data aggregated by day. With parallelism turned off, the query ran in 1.6 seconds; with parallel query turned on it ran in 460 milliseconds, so over three times as fast. But one thing they didn't mention in the blog post that I observed is that the non-parallel plan is sorting on disk with an external merge, which is going to really slow down the query, whereas the parallel workers are sorting in memory, I guess because each worker's share of the data fits under work_mem. So I wonder what the timing would be if work_mem were bumped up enough for the whole sort to fit in memory, and what the difference between parallel and single-process execution would be then. Now, all of this is done sequentially with no indexes. So he goes ahead and adds a standard B-tree index, and with that the query runs in 1.1 seconds. So it's faster than a plain sequential scan but not as fast as the parallel query. But again, we're running into the sort happening on disk, so that may be the reason for the speed difference here. He also notes the index size, 214 megabytes. Then he tries a BRIN index, and the query ran in 967 milliseconds, so just a little bit faster than with the B-tree index.
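As a rough sketch of the kind of setup being benchmarked (the exact table and column names in the post may differ):

    -- A simple time series table of sensor readings
    CREATE TABLE sensor_readings (
        id         bigserial PRIMARY KEY,
        reading    float8 NOT NULL,
        created_at timestamptz NOT NULL
    );

    -- Standard B-tree index on the timestamp column
    CREATE INDEX idx_sensor_readings_btree ON sensor_readings (created_at);

    -- BRIN alternative: stores only min/max per block range, so it stays tiny
    CREATE INDEX idx_sensor_readings_brin ON sensor_readings USING brin (created_at);

BRIN works well here because rows are inserted in roughly timestamp order, so each block range covers a narrow slice of time.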
[00:06:01] But look at the size: it went from 214 megabytes down to 32 KB. So that's the thing he's emphasizing here: BRIN can be really good when you have a lot of data, especially if it's time series oriented. That's where BRIN indexes can give you a little bit better performance, but the huge gain is in disk space savings.
[00:06:27] So in this case it's a tiny fraction of the space of the B-tree index. He then says that because BRIN tends to excel with larger amounts of data, he tried BRIN versus a B-tree index with 100 million rows. Again, with no parallel query, just a sequential scan, it's doing the merge sort on disk and it ran in 10 seconds. Switching to doing it in parallel keeps the sort in memory, doesn't go to disk, and runs in 2.6 seconds. Adding the B-tree index brings it down to 1.1 seconds, so definitely faster than the parallel sequential scan, but the index is over 2GB in size. With the BRIN index the query runs in 975 milliseconds. So it's 1.1 seconds for the B-tree index versus 975 milliseconds for the BRIN index; again, the BRIN index is a little faster. But what's amazing is the disk space savings: whereas the B-tree index is over 2GB, the BRIN index is 184 KB. So what's interesting about BRIN indexes is that in certain cases, typically with a lot of sequentially ordered data, they can give you somewhat better performance than a B-tree, but they give you huge savings in terms of disk space. That could be advantageous because you can probably keep a BRIN index cached more easily than a B-tree. So, definitely an interesting blog post, I definitely suggest you check it out.
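If you want to see this kind of size difference on your own data, a quick way to compare, assuming the hypothetical index names from the sketch above, is:

    -- Compare on-disk size of the two indexes
    SELECT relname,
           pg_size_pretty(pg_relation_size(oid)) AS index_size
    FROM pg_class
    WHERE relname IN ('idx_sensor_readings_btree', 'idx_sensor_readings_brin');

    -- BRIN granularity is tunable: fewer pages per range means a bigger
    -- but more selective index (128 is the default)
    CREATE INDEX idx_sensor_readings_brin_32
      ON sensor_readings USING brin (created_at) WITH (pages_per_range = 32);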
[00:08:03] The next post is "Approximation algorithms for your database", and this is from the Citusdata.com blog. Here they're covering two extensions you can use to get approximations, because once you have a lot of data, getting exact answers for things like a distinct count, a top N (say a top five or top ten from a list), or maybe even a median can be expensive. There are two extensions that can help with that. HyperLogLog is an extension you can add to PostgreSQL that gives you approximate distinct counts, and they use it for things like getting a unique count of visitors. If you have some sort of analytics where an approximate count is okay, that's a use case for an extension like this. Then they also talk about TopN, where you can get the top N items of a set of data, in this case a top ten, for example. Again, these are approximations, so they're not 100% accurate, but they give you an answer quickly when you have a large amount of data. So if you're interested in doing something like that and you have a lot of data, definitely a blog post to check out.
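As a small sketch of what the HyperLogLog extension looks like in practice (the table and column names here are just illustrative):

    -- Requires the postgresql-hll extension to be installed
    CREATE EXTENSION IF NOT EXISTS hll;

    -- One hll "sketch" of distinct user IDs per day
    CREATE TABLE daily_uniques (
        day   date PRIMARY KEY,
        users hll
    );

    INSERT INTO daily_uniques
    SELECT created_at::date, hll_add_agg(hll_hash_bigint(user_id))
    FROM page_views
    GROUP BY 1;

    -- Approximate distinct visitors per day
    SELECT day, hll_cardinality(users) AS approx_uniques
    FROM daily_uniques
    ORDER BY day;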
[00:09:17] The next post is "How we use delayed replication for disaster recovery with PostgreSQL". This is something I actually haven't implemented but have thought a lot about. Basically they're using delayed replication to have a relatively recent copy of their data that isn't immediately in sync, so they don't have to do a full restore from a backup. What they're doing is setting up a replica server that stays 8 hours behind their primary database. So it's kept in sync, but it's always 8 hours behind. Where that's interesting is when something bad happens in your primary database and you have a multi-terabyte database; this is where this comes in.
[00:10:02] If something happens, say a table is dropped erroneously or something of that nature, restoring a multi-terabyte database can take a really long time. And the problem with a regular in-sync replica is that the table drop generally happens on the replica almost immediately. The advantage of this approach is that it delays that replication by a number of hours, so if an incident happens, you can pause the replay on that replica, set a recovery target time right before the incident happened, replay the replica up to that point in time, and then correct or grab the data to be able to restore it in your production system. They talk about the main configuration option that does this in your recovery.conf file when you're setting up your replica: recovery_min_apply_delay. They have it set to 8 hours, so this particular replica is always following what the primary database is doing, but it's committing those changes 8 hours behind the primary. They even go over some of the code showing how PostgreSQL implements this and how they have put it in place to be able to do data recovery without having to do a full backup restore. Now, they're not saying this takes the place of a backup; as they say, replication is not a backup mechanism. But if you have multiple terabytes of data and you need to do some sort of data restore, having one of these delayed replicas can be a great way to restore data relatively quickly without having to restore the entire multi-terabyte database, because you already have it there and you just reconfigure the recovery target to the exact time you want to recover the data from. So if you haven't considered using this, it's definitely something to investigate.
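A minimal sketch of what the replica-side configuration might look like on a pre-version-12 setup, where recovery.conf is still used (the host names and paths here are just placeholders):

    # recovery.conf on the delayed replica
    standby_mode = 'on'
    primary_conninfo = 'host=primary.example.com user=replicator'
    restore_command = 'cp /path/to/archive/%f %p'
    # Hold back WAL replay by 8 hours
    recovery_min_apply_delay = '8h'

Then, during an incident, you can stop replay immediately from a SQL session on the replica with SELECT pg_wal_replay_pause(); (pg_xlog_replay_pause() on 9.6 and earlier) while you work out the exact recovery target time you want to roll forward to.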
[00:11:58] The next post is "Webinar: Achieving High Availability with PostgreSQL - Follow Up", and this is from the 2ndQuadrant.com blog. Basically they did a webinar about achieving high availability with PostgreSQL.
[00:12:13] You just click on the link and fill out some contact information, and they give you access to the webinar. They talk about availability in terms of recovery point objective and recovery time objective, setting up replication for high availability, and how PgBouncer can be used. They talk about handling backups, and they cover some of the products that 2ndQuadrant offers, such as their Replication Manager, their Barman backup and recovery tool, as well as going the master-master route with their bi-directional replication. So if you're interested in that sort of thing and achieving higher availability for your PostgreSQL instance, definitely a webinar to check out.
[00:12:56] The next post is "The current state of open source backup management for PostgreSQL".
[00:13:02] So by default we have pg_dump, and we also have pg_basebackup for doing what I call physical file backups. But they also go over some other products that are available, such as Barman by 2ndQuadrant, pgBackRest, pg_probackup by Postgres Professional, and BART by EDB. Basically they go over the features of each and some of their pros and cons. So if you're interested in adjusting how you're doing your backups, or you want to do something different, it's definitely a blog post to check out for a comparison of the different tools that are out there. Now, related to that, the Crunchydata.com blog has put out a post called "pgBackRest - Performing Backups on a Standby Cluster". This is talking about setting up two systems and getting pgBackRest up and running on them. It goes really in depth into all the commands you would need to use to set up the two PostgreSQL instances and to configure pgBackRest to be able to back up using a standby system. So if you're interested in exploring pgBackRest, definitely a blog post to check out.
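If you just want a feel for the standby-backup setup before reading the full walkthrough, the core of it is a pgBackRest configuration that knows about both cluster nodes and has backup-standby enabled. A rough sketch, where the stanza name, hosts, and paths are placeholders and the option names assume pgBackRest 2.x:

    # /etc/pgbackrest/pgbackrest.conf on the repository/backup host
    [global]
    repo1-path=/var/lib/pgbackrest
    backup-standby=y

    [demo]
    pg1-host=primary.example.com
    pg1-path=/var/lib/postgresql/11/data
    pg2-host=standby.example.com
    pg2-path=/var/lib/postgresql/11/data

With something like that in place, pgBackRest copies the data files from the standby while still coordinating the backup start and stop with the primary.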
[00:14:15] The next post is "Exclusive backup method is deprecated - what now?". So we're talking about the exclusive backup method, the method you would use to do a physical backup before pg_basebackup came out: you would execute pg_start_backup with some sort of label before you start copying the files, and then run pg_stop_backup after you're all done. They call it exclusive because you can only have one of those going at a time per instance, since it actually creates a file in your PostgreSQL data directory recording that a backup is in progress and what the label is. Now, the problem they mention with the exclusive backup method is that if PostgreSQL or the operating system crashes during a backup, that file is left around, and, quote, there's no way to distinguish the data directory of a server that crashed while in backup mode from one that's trying to be restored, so you'll get errors when you try to bring the system back up after it's crashed during a backup. He mentions that pg_basebackup overcomes this because the backup label is not written to the data directory but instead added to the backup itself. They also mention that since 9.6 you can do a non-exclusive backup with pg_start_backup and pg_stop_backup; that approach keeps the database session open, doesn't use a file, and in the same session where you started the backup you tell it to stop. But there can be some issues with it, because that same database session has to stay open. The documentation says the exclusive method will eventually be removed, and he says you probably have until 2021 before it's actually gone. That can be a problem because some backup software relies on a pre-backup command and a post-backup command, and those generally don't run in the same database session, so the session isn't maintained. So how do you handle that? Basically he has come up with a set of scripts to do this. So if you potentially want to use their scripts to do a non-exclusive backup of your PostgreSQL database system, definitely a blog post to check out.
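For reference, here's roughly what the non-exclusive variant looks like at the SQL level; the label and the file-copy step are just illustrative:

    -- This session must stay open for the whole backup
    SELECT pg_start_backup('nightly-backup', false, false);  -- third argument false = non-exclusive

    -- ... copy the data directory with rsync/tar/etc. from another shell ...

    -- Returns the backup_label and tablespace map contents, which you
    -- must save into the backup yourself instead of the data directory
    SELECT * FROM pg_stop_backup(false);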
[00:16:40] The next post is "Using parallelism for queries from PL functions in PostgreSQL 10", and they point out that intra-query parallelism was introduced in Postgres 9.6; however, it didn't work within PL/pgSQL functions. So when you ran a query from within a function, you would actually get a sequential scan. But in version 10 it does parallelize it. So again, this is just a good reason to always try to keep up with your versions of PostgreSQL, because it can give you a lot of performance improvements.
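As a minimal sketch of the kind of function this applies to, assuming a hypothetical events table, where the query inside the function body can now use a parallel plan on version 10:

    CREATE OR REPLACE FUNCTION count_events()
    RETURNS bigint
    LANGUAGE plpgsql
    AS $$
    DECLARE
        total bigint;
    BEGIN
        -- On PostgreSQL 10+, this aggregate can be executed with a
        -- parallel sequential scan if the planner finds it worthwhile
        SELECT count(*) INTO total FROM events;
        RETURN total;
    END;
    $$;

    SELECT count_events();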
[00:17:21] The next post is "Parallel queries in PostgreSQL", and this goes through the current state of parallel queries, starting with 9.6: how parallel sequential scan was introduced, how it works in terms of interprocess communication, how to configure the number of workers and the cost estimates, and then nested loop joins, hash joins and merge joins, as well as partitionwise joins and parallel append. So if you want to get more in depth on parallel query execution in PostgreSQL, definitely a blog post to check out.
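If you want to experiment with the worker and cost settings it discusses, here's a quick sketch of the relevant knobs; the values and the events table are only examples:

    -- Per-session experiment with parallel settings
    SET max_parallel_workers_per_gather = 4;   -- workers per Gather node
    SET parallel_setup_cost = 1000;            -- planner cost to start workers
    SET parallel_tuple_cost = 0.1;             -- planner cost per tuple passed to the leader
    SET min_parallel_table_scan_size = '8MB';  -- smallest table considered for a parallel scan

    -- Check whether the plan actually uses a Gather / Parallel Seq Scan node
    EXPLAIN (ANALYZE)
    SELECT count(*) FROM events;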
[00:17:56] The next post is "Why you should use a relational database instead of NoSQL for your IoT application", and this is from the Timescale.com blog. Timescale is an extension for PostgreSQL that lets you use it as a time series database, basically by partitioning time series data into very small chunks. The post is talking about the advantages of using SQL over NoSQL: it offers joins, aggregations, window functions, common table expressions and roll-ups, and it argues that schemas are actually a good thing, as opposed to having no schema and no structure at all. It also talks about reliability. So it's basically advocating using SQL databases for your time series data.
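As a small illustration of the kind of SQL roll-up they're advocating (this assumes the TimescaleDB extension and a hypothetical sensor_data hypertable; time_bucket is TimescaleDB's bucketing function):

    -- Hourly average per device over the last day
    SELECT time_bucket('1 hour', time) AS hour,
           device_id,
           avg(temperature) AS avg_temp
    FROM sensor_data
    WHERE time > now() - interval '1 day'
    GROUP BY hour, device_id
    ORDER BY hour, device_id;

This sort of aggregation, joined against device metadata tables when needed, is exactly what's awkward to express in most NoSQL stores.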
[00:18:48] That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to Scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.