Episode Transcript
[00:00:00] In this episode of Scaling Postgres, we talk about the rise of Postgres, row count estimates, nulls, and Linux configuration. I'm Creston Jamison, and this is Scaling Postgres episode 44.
[00:00:20] Well, I hope everyone had a great holiday. I know we still have New Year's coming up, so I hope your New Year's is a good one. Again, we're still kind of light in the content department, so there are just four posts I've selected this week. The first is PostgreSQL on the Rise: 2018 Postgres Findings and 2019 Trends. This is from the Severalnines.com blog, and it talks about the rise of Postgres, pulling together several different resources. It's an interesting post to look over. For example, they look at the DB-Engines ranking, which shows PostgreSQL at number four in terms of most used databases and number two among open source databases. What's super interesting is how much its score has increased compared to all the other database systems they track. It also mentions that DB-Engines declared PostgreSQL the database management system of the year for 2017, so we'll have to see what 2018 brings. Even in the Stack Overflow survey, it's in third position among the top used database technologies. They reference the article Why Oracle's Missteps Have Led to PostgreSQL's Moment, along with another one, Has the time finally come for PostgreSQL?, and then they highlight some of their best blog posts. So if you want to see where PostgreSQL is trending, this is a blog post to check out.

The next article is Row Count Estimates in Postgres, published on Medium.com on the pgMustard blog. It basically talks about row count estimates and why they're important, and primarily they're important so the query planner can make good decisions about which plan to use to execute your queries. Postgres keeps track of these estimates by collecting statistics about the data in each of your tables on a periodic basis, usually along with vacuum, although you can run an ANALYZE manually, or a VACUUM ANALYZE manually. They mention a tool they have called pgMustard that helps you diagnose and fix performance issues with your PostgreSQL database, and they flag a query when there's a factor-of-ten difference between how many rows the planner thinks are there based on the statistics and how many rows were actually returned. Then they go into how not to fix your plan: basically, you don't want to resort to tricks to force one index over another or turn off certain features. What you should probably do instead is help the planner out. Increase the frequency at which those statistics are collected so they're more up to date, or if you have jobs that drastically change the data in your tables, maybe you could alter those. If not, you can run ANALYZE or VACUUM manually more frequently, adjust default_statistics_target in the configuration, or set the statistics target on individual columns of a table. Lastly, they follow up with multivariate statistics. These aren't collected automatically; you have to set them up. But if you're aware of a relationship between two pieces of data, like certain cities only appearing in certain states, or certain states only appearing in certain countries, you can define that relationship as statistics starting with PostgreSQL 10. So those are some tricks and tips to hopefully get better plans for your queries.
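To make those last couple of suggestions concrete, here is a rough sketch in SQL. The table and column names (orders, customer_id, addresses, city, state) are made up for illustration and aren't from the post, and the CREATE STATISTICS syntax requires PostgreSQL 10 or later.

    -- Refresh planner statistics for a table on demand
    ANALYZE orders;

    -- Raise the statistics target for one column (the global default_statistics_target is 100)
    ALTER TABLE orders ALTER COLUMN customer_id SET STATISTICS 500;

    -- PostgreSQL 10+: declare that two columns are correlated,
    -- then re-analyze so the extended statistics get populated
    CREATE STATISTICS city_state_dep (dependencies) ON city, state FROM addresses;
    ANALYZE addresses;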
The next post is A PostgreSQL Story About NULL: IS NULL, = NULL and NOT NULL. So basically this is a post about nulls. If you're not familiar with exactly what they are and how to use them, it's a good post to go over. But just to summarize: NULL is undefined. So if you do SELECT 10 = NULL, it's not going to tell you false, because it has no idea what NULL is. It's undefined; it could be ten, it could not be ten. So the comparison just returns NULL itself.
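As a quick sketch of that behavior in psql:

    SELECT 10 = NULL;      -- returns NULL, not false
    SELECT NULL = NULL;    -- also NULL: two unknowns can't be declared equal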
[00:04:34] Even if you ask whether something undefined equals something undefined, well, maybe, maybe not; we don't know, it's undefined. If you want to get an actual true/false value, you have to use SELECT NULL IS NULL, or you can do the IS NOT NULL check. And in that initial example, instead of SELECT 10 = NULL you could do SELECT 10 IS NULL, and it will actually give you a false return value. You can also do this for a row of data, since the ROW constructor lets you build a row of values. Now, I don't necessarily know how useful this would be, but they do have an example where a row with all nulls tests as IS NULL true, while if any of the values are not null, it's going to give you a false value. They also talked about null handling in LIMIT clauses: if you do a LIMIT using NULL, it's basically just going to return all rows; it's not going to enforce a limit. Then they talk about ORDER BY, where you can specify whether the nulls are going to come first or last, so keep in mind where the nulls are going to end up when you're ordering data. And then they talk about aggregate functions, which generally ignore null values, except in the case of count(*). So if you're looking to learn how nulls work inside PostgreSQL, this is definitely a blog post to check out.
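A few of those behaviors sketched out; the scores table and score column here are hypothetical, just for illustration:

    SELECT 10 IS NULL;               -- false
    SELECT ROW(NULL, NULL) IS NULL;  -- true: every field in the row is NULL
    SELECT ROW(1, NULL) IS NULL;     -- false, since one field has a value

    -- Choose where NULLs end up when sorting
    SELECT * FROM scores ORDER BY score DESC NULLS LAST;

    -- count(*) counts every row; count(score) skips NULLs
    SELECT count(*), count(score) FROM scores;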
[00:05:55] The last post is Benchmark PostgreSQL With Linux HugePages. This is a benchmarking post, which I always find interesting for figuring out the best ways to configure your system. But of course, as they mention here, you definitely want to do this in your own environment and for your own application: make changes, then test and ensure that those changes give you a performance improvement rather than a decline. They show their benchmark machine, and I should say this is all about Linux kernel changes, so not so much configuration of PostgreSQL. They do list the PostgreSQL configuration they started with, but their focus was adjusting the Linux configuration parameters. The first thing they do is disable transparent huge pages; their belief is that disabling this feature and defaulting to classic huge pages is always recommended. Then they look at different huge page configurations and show the benchmark results with huge pages set at 4 megabytes and at 1 GB, across different numbers of clients and sizes of databases. Generally, for this setup, they tended to get the best transactions per second with the highest huge page setting. But again, there's a bit of a caveat: the largest database size they went up to was, I believe, 96 GB.
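If you want to see how your own instance is set up before experimenting, here is a minimal check from psql; this assumes PostgreSQL 9.4 or later, where the huge_pages setting exists:

    SHOW huge_pages;       -- off, on, or try (the default)
    SHOW shared_buffers;   -- the memory region that huge pages would back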
[00:07:16] And I regularly work with databases in the terabyte range, so I'm not so sure how that would apply to a terabyte-scale database, because their conclusion here is that deciding on the size of the huge page to use requires a bit of trial and error, but that it can potentially lead to significant TPS gains where the database size is large but remains small enough to fit in the shared buffer. Well, if you're at a multi-terabyte scale, I don't think that's going to fit in the shared buffer of a database system. So that's just something to keep in mind if you look over these benchmarks.

That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to Scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.