Scaling Out, Planner Estimation, Create Statistics, Stay Curious | Scaling Postgres 103

Episode 103 March 02, 2020 00:13:57
Scaling Postgres

Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, we discuss how to scale out, how the planner estimates, uses of create statistics and investigating PostgreSQL runtime environments.

To get the show notes as well as get notified of new episodes, visit: 

https://www.scalingpostgres.com/episodes/103-scaling-out-planner-estimation-create-statistics-stay-curious/

Episode Transcript

[00:00:00] In this episode of Scaling Postgres, we talk about scaling out, planner estimation, create statistics and staying curious. I'm Creston Jamison, and this is Scaling Postgres episode 103. [00:00:21] All right, I hope you're having a great week. Our first piece of content is "Preparing your Postgres data for scale out", and this is from dev.to. They're discussing the different preparations you would need to make if you want to potentially scale out your database. Now, they talk about two different types of scaling: scaling up, which is basically putting your database on a larger instance, giving it more memory, CPU, disk, et cetera; and scaling out, which is when you add one or more instances and try to scale across multiple databases. They have a little table that covers the ease of doing each, their effectiveness, unique benefits and gotchas. Basically, the easiest is definitely scaling up. It's relatively quick to do; the gotcha is a single point of failure. Scaling out requires more planning up front, but it does have some interesting benefits in terms of keeping the database online. Now, one of the first areas they cover is whether you want read-only or read-write scaling, because read-only is just basically setting up some read replicas, which are relatively easy to do, and we have some tutorials at Scaling Postgres that discuss how to do that. Read-write is harder, because then you're going to actually have to shard your data across different databases. Now, going into sharding, they talk about different patterns you can use. You could do range-based partitioning, and they use the example here where you have a set of last names and you want to put certain ranges of last names into separate database systems. It's a contrived example, but it still demonstrates what range-based partitioning means. But of course, the disadvantage is that you can get hotspots.
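The range-based scheme just described (and the hash-based variant discussed next) can be sketched in a few lines. This is only an illustration; the shard names, letter boundaries and shard count are hypothetical, not from the post.

```python
# Minimal sketches of range-based and hash-based shard routing.
# Shard names, boundaries and shard count are hypothetical.

import bisect
import zlib

SHARDS = ["shard0", "shard1", "shard2", "shard3"]

# Range-based: upper-bound first letters per shard (A-F, G-M, N-S, T-Z).
BOUNDARIES = ["F", "M", "S"]

def route_by_last_name(last_name: str) -> str:
    """Range-based routing: compare the first letter against the boundaries."""
    return SHARDS[bisect.bisect_left(BOUNDARIES, last_name[0].upper())]

def route_by_hash(key: str) -> str:
    """Hash-based routing: a stable hash modulo the shard count."""
    return SHARDS[zlib.crc32(key.encode()) % len(SHARDS)]
```

Note that under range routing every last name in the same letter range lands on the same shard, which is exactly where the hotspot risk comes from; hashing spreads keys more evenly, at the cost of losing range locality.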
Like they mention, there are tons of people with the last name Smith in the United States. [00:02:04] The next type of partitioning you can consider is vertical partitioning. This is basically a service-based approach. Maybe you have some microservices, and each of those has its own separate database. With that, each service can be scaled independently, and each individual database responsible for part of that service doesn't have to be as large. But as you scale, you may eventually get to the point where the database of one of those services is too large and you need to scale it out. Then they go into hash-based partitioning. Basically, you have some sort of value, you create a hash of it, and you assign data to a particular partition based on that hash. Now, I think a lot of solutions like Citus may be using something similar to this, or some sort of partitioning key, in order to do sharding, so I would say this is probably one of the more popular ones. And lastly, they talk about directory-based partitioning. Basically, you have some service that sits between your app and the database and dictates where the data goes. This is probably more complex, because you have to run that separate service that decides where the data should go. And then of course there's difficulty once you actually do decide to shard, because eventually you may want to bring that data back together, and then how do you do cross-shard joins and things of that nature? But this blog post is a pretty good overview of the things you need to consider if you eventually get to the point where you actually want to scale out across multiple database instances. So if you're interested in that, it's definitely a blog post to check out. The next piece of content is "Planner selectivity estimation error statistics with pg_qualstats 2". Quite a mouthful. This is from rjuju.github.io, and he's talking about pg_qualstats, specifically pg_qualstats 2, which is an extension.
And I'll have a link here. It keeps statistics on the predicates found in WHERE clauses and JOIN conditions. So, for example, I believe pg_stat_statements does not record the exact values being used in queries, whereas that is something pg_qualstats tries to track. Now, they had a previous post, which they mention here, about being able to predict what indexes would be good to add, whereas this particular feature is about looking at relationships between columns within a table to estimate whether they need additional statistics. By default, statistics are tracked per column, and they don't track relationships between columns. So, for example, Postgres doesn't track that there are only so many zip codes in a state, or any kind of relationship like that. He developed an example where, even after a vacuum analyze, the planner believed a query would come up with 12,500 rows, but it actually returned zero rows; there were no rows that matched the query. So this is an example of the planner making its estimate by looking at the statistics of each column individually and trying to draw a conclusion about both, because the WHERE clause takes both into consideration. There is a relationship between these columns, but the planner doesn't know that, so it looks at them separately to get some kind of an estimate, and it's very inaccurate, because there are actually zero rows. He goes over the source code and how the planner comes to this conclusion, and looks at the pg_stats view to get that information. And then it gets worse when you try to do a join: in this example, the planner thinks it's going to find 313,000,000 rows where there are actually zero rows. So it can get worse. Now, what pg_qualstats, at least version 2, can do is help detect this problem.
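The arithmetic behind the misestimate he describes can be simulated in a few lines: without extended statistics, the planner multiplies per-column selectivities as if the columns were independent. The row count, distinct count and column names below are made up for illustration, not taken from the post.

```python
# Simulating the planner's independence assumption for a query like
# "WHERE val1 = x AND val2 = y" on a hypothetical 1,000,000-row table.

rows = 1_000_000
n_distinct = 100                      # distinct values per column (assumed)

# Equality selectivity per column: 1 / n_distinct.
sel_per_column = 1 / n_distinct

# Without extended statistics, the planner multiplies the two
# selectivities, as if the columns were unrelated:
independent_estimate = rows * sel_per_column * sel_per_column

# If val2 is fully determined by val1, the conjunction is really only
# as selective as a single column:
dependent_estimate = rows * sel_per_column

print(independent_estimate)           # the planner's guess
print(dependent_estimate)             # closer to reality for correlated columns
```

In the post's example the two columns' values never actually co-occur, so the true row count is zero, and creating extended statistics (CREATE STATISTICS with the dependencies option) is what lets the planner learn that kind of relationship.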
So here's an example of a query you can run against the pg_qualstats table, which is part of the extension. When looking at a column in isolation, the mean ratio, which indicates how accurate the statistics are, is close to one. So just looking at one column gives a good estimate. But looking at one column in relation to another, meaning ANDed with another column as he references here, the mean ratio is huge; it's very far from one. What that means is that these columns would be good candidates for extended statistics, so you can define the relationships between them when they're looked at in combination. And he goes on to discuss extended statistics, where you can actually use CREATE STATISTICS on these two columns so the planner understands the relationship. Now, when you run the query again, you can see it expected one row and got zero rows, so it's much more accurate; it's not, say, hundreds of millions off. With that, he also mentions some slides that Tomas Vondra put together for a talk on the subject of CREATE STATISTICS. And actually, there was a webinar published this week, "Webinar: All you need to know about CREATE STATISTICS - follow up". So this is an hour-long webinar about CREATE STATISTICS. You can just click the link here and register for it, and you get immediate access. So if you want more information about CREATE STATISTICS and how it can help you educate the planner on relationships between columns, it's definitely a webinar to check out. [00:07:35] The next piece of content is actually a YouTube video, "PostgreSQL at low level: stay curious!", and this is from the All Systems Go! YouTube channel.
What he covers here is not so much PostgreSQL itself, but how PostgreSQL runs on an operating system and in containers, and all sorts of different tools you can use to see what kind of behavior you're getting from your Postgres instance at a low level, wherever Postgres runs, whether that's in a Kubernetes cluster or a VM. He talks about all the different tools to analyze it, from perf to strace. So if you want to learn more about the actual environment you'll be running PostgreSQL in, it's definitely a YouTube video to check out. [00:08:20] The next piece of content is "Researching PostGIS slowness (2019 edition)", and this is from 2ndquadrant.com. He's talking about a customer they were working with who had a particular query that was 60 times faster when there was only a difference of 0.001 between two values used in calculating a point. They tried to replicate the environment using the exact versions of PostgreSQL, PostGIS and the proj library, and they couldn't see any difference. Then they tried Ubuntu 18.04, and finally they saw a difference, and what it led to was actually a difference in glibc versions. Once they upgraded to the most recent glibc version, both queries became fast. So this is just something to keep in mind: PostgreSQL is great, but it also relies on other libraries, as do PostGIS and other things in your environment. Be sure to stay aware and keep up on patches and upgrades for all the different libraries you're using in your system to make sure you're getting optimum performance. So if you're interested in this story, go ahead and check out this blog post. [00:09:34] The next piece of content is "Optimizations in GROUP BY vs SELECT DISTINCT", and this is from thatguyfromdelhi.com. He's talking about differences in performance when doing a SELECT DISTINCT on a certain number of columns from a table versus doing a GROUP BY.
And sometimes GROUP BY can give you better performance, because it has some specific optimizations; for example, if a particular column is present, it only uses that column for the group key. He also made an interesting observation that a unique index wouldn't trigger the optimization, but a non-unique one would, and he looked at different variations of choosing different columns and keys to see what the differences were. So it's an interesting blog post to read if you want to learn about specific optimizations you could be using with GROUP BY or SELECT DISTINCT. [00:10:25] The next piece of content is "Creating a PostgreSQL procedural language - Part 3 - Executing user code". This is part three of the series we've covered in previous episodes on building a procedural language using Julia, and this is the next phase, where they're actually executing user code. So if you've been following along, the next blog post is available. [00:10:46] The next piece of content is actually a new website that seems to have been put up recently, called PostgreSQL Life, and it looks like its intent is to interview a PostgreSQL person of the week. As of the 27th of February, they've posted the first interview. Quite interesting, and maybe check it on a weekly basis to see all the different people working in Postgres. [00:11:14] The next piece of content is "Parallel vacuum in upcoming PostgreSQL 13". This talks about the parallel vacuum coming in 13 that we've discussed before. They cover how you can enable it and the different settings that will be available, and they looked a little bit at its behavior. It's a little early, so they had some unexpected results so far. But interestingly, because you have to give vacuum a parallel option when you run it, I wonder if autovacuum cannot run in parallel yet and you have to do it as a manual vacuum. I didn't see any reference to that, but it's just a question I'll have to keep track of as version 13 gets closer to release later in the fall.
And this is from the highgo.ca website. [00:11:56] The next piece of content is on recursive queries. This is from habr.com, and it looks like it's been translated from the Russian original at postgrespro.ru. It's all about recursive CTEs, using WITH RECURSIVE. So if you're interested in this content, it's definitely a blog post to check out. [00:12:15] The next series of posts are all about PostGIS. The first one is "PostGIS Day in STL", meaning St. Louis, and there are a number of presentations about PostGIS that were given there. This is from the cleverelephant.ca blog. [00:12:32] The next piece of content is "Visualizing OSM data in QGIS", and this is from cybertec-postgresql.com. So if you have an interest in doing this kind of mapping, it's definitely a blog post to check out. [00:12:46] The next post is "Setting up SSL certificate authentication with Pgpool-II", and this is from highgo.ca. With version 4 of Pgpool, they added some additional, as they say, secure authentication features, one of them covering SSL. So they talk about how you can generate an SSL certificate and start using it with Pgpool-II. [00:13:12] And the last piece of content is "How to use the SVM machine learning model with 2UDA - PostgreSQL and Orange, Part 2". So again, this is about machine learning, and again, I don't have a background in this, but part two of this blog post is available from 2ndquadrant.com. [00:13:33] That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.

Other Episodes

Episode 97

January 20, 2020 00:13:11

Tips & Tricks, Faster APIs, Full Text Search, pgbouncer_fdw | Scaling Postgres 97

In this episode of Scaling Postgres, we discuss tips & tricks, how to get faster APIs, full text search considerations and pgbouncer_fdw. To get...


Episode 171

June 28, 2021 00:16:45

Practical Partitioning, Understanding pg_repack, Explaining Buffers, Multi-row Contraints | Scaling Postgres 171

In this episode of Scaling Postgres, we discuss practical partitioning, understanding pg_repack, explaining buffers and enforcing multi-row constraints. To get the show notes as...


Episode 283

September 24, 2023 00:16:00

222 Times Faster Analytical Queries With Columnar Storage | Scaling Postgres 283

In this episode of Scaling Postgres, we discuss how to get 222 times faster analytical queries with columnar storage, a Postgres 16 feature review,...
