CTE Warning, PG 11 Features, Death by DB, Correlation

[00:00:00] In this episode of Scaling Postgres we talk about a CTE warning. Postgres Eleven features death by database and column correlation. I'm creston. Jameson and this is scaling postgres episode 31. [00:00:22] Alright, I hope you're having a great week. Our first article is Be careful with CTE in PostgreSQL and this is from Hakai Benita. Now a CTE is a common table expression which you could also think of as the with clause. So it's basically taking a subquery and putting it earlier in the statement and being able to call it again. So he shows in a simple example of one here with CT as select all from Foo. So CTE is what he's named it and then he can then do select all from CTE where an ID equals something. Now, what he's saying you need to be careful of is a simplistic example here where he has a foo table and he's looking up by an ID that has an index on it and that lookup took less than one millisecond. Whereas using a CTE, of course you're defining all the records in foo so that is going to take a while. But then you do a select all from CTE where the ID equals the same and it suddenly takes 227 milliseconds. So basically what's happening here is the CTE kind of materializes, what you're asking for basically caches it and then can use that query again. And it's not, for example, pushing down this ID into the Cde here. It's essentially materializing. This part that you've done and he references some things from the docs where it says, quote a useful property of with queries is that they are evaluated only once per execution of a parent query. So basically you could use that with query multiple times and it's not going to reexecute it. But again the quote says however, the other side of this coin is that the optimizer is less able to push restrictions from the parent query down into a with query with an ordinary sub query and again with the explained plan. You can see where just the simple query looking for a particular where ID is using a index scan. Using the CTE example, it actually has to do a sequential scan. Now, unlike a CTE just using a basic sub query. [00:02:35] So select all from and then do the sub query here select all from Foo where the ID equals that it is able to push that down. So you can see it does do a sequential scan in this case and you get a query response of less than a millisecond. Now he also mentions Oracle actually looks different in that it is able to essentially inline the CTE and give the same response time whether using it as a CTE or a subquery in quote Oracle is not materializing CTEs by default. [00:03:12] So basically this is just a decision that PostgreSQL has adopted that you just need to be aware of, that it does materialize its CTEs and because of that it can't push down predicates into your CTE from the parent query and if you're needing to do that, just use a subquery to do it. So it is a warning in terms of you should understand how PostgreSQL works and this was a great post that goes through it and I encourage you to check it out. [00:03:41] The next series of posts all talk about features coming in PostgreSQL eleven and our first post is PostgreSQL eleven Something for Everyone. And this is from the LWN net and it's an article contributed by Peter Shogein. My apologies on that pronunciation, but this article goes through a lot of the major features of PostgreSQL eleven and a little bit of the background in it. So of course some of the biggest features are the partitioning changes and we'll go into some additional articles and talk a little bit about this more in detail. But I thought this was probably the best article to read about some of the upcoming changes and the reasoning behind it and why it's happening now versus not earlier and things of that nature. But it goes over the partitioning improvements, parallelism improvements, just In Time compilation of queries. And again, based upon my understanding, this is something that you need to enable when you're actually building from source. So for example, using the Hyphen hyphen with Hyphen LLVM flag there's the introduction of procedures. So not just functions, but procedures where you can actually do commits and rollbacks within them, the introduction of covering indexes. So basically you can add additional columns to an index that doesn't necessarily index byte, but it includes additional columns in essentially the index payload so that you can increase the number of index only queries. And this is a great one here the instant add a column with a default value so not having to rewrite the whole table. So again, I definitely encourage you to read this if you're interested in a lot of the new features. [00:05:27] The next post highlights some additional features, but it's a little bit shorter of some things that were notable to him. And this is postgres eleven. A first look, this is from the Craig Kirstein's blog. Number one is quitting postgres. Typically you had to do a backslash Q, but now you can type Quit or exit. So this is great for new users to using postgres and then a headline fear column addition no more. Essentially this is where adding a column with a not null and a default essentially had to rewrite the whole table. He talks about performance improvements going back to the parallelism and he's highlighting the capabilities of doing a parallel hash join capabilities of doing parallel append. And I believe this comes in cases where you're doing a union and have multiple select queries. It can parallelize that as well as parallel index creation, which again that's a great improvement. He mentions the Just in Time compilation that's possible and he also covers creating statistics which wasn't mentioned in the previous post, but with ten you could start defining multiple columns that you can create statistics across. But with postgres eleven now you can create statistics based on expression indexes. And lastly, he mentions some improvements with PG underscore pre Warm. So again another post talking about the new features coming in postgres eleven. [00:06:56] Now my understanding is that postgres eleven is due to be released essentially any week now and they just released beta number four, so it's definitely going to be released relatively soon. [00:07:08] The next article is actually a PDF of slides that were given as a talk. Again, the title is Major Features Postgres eleven. And this is by Bruce momgian. And again he goes over again the partitioning improvements, the parallelism improvements, the introduction of store procedures, the executor stage compilation which basically is the just in time compiler, preventing table rewrite for alter table add column with non null default. Again, that's a huge one for large databases. He mentions finer grained access control, so there's some additional access control features that have been added you can check out in this presentation in addition to some write ahead log improvements. I believe this is where you can actually increase the size of the wall files that are being generated, which may be helpful for larger databases. Again, he references the allowing the quit and the exit when you're going to be exiting psql. So again this is about 24 slides worth of information. And again, if you're looking to learn and understand more of what's coming in postgres eleven, definitely another piece of content to check out. [00:08:16] The next post is why covering indexes are incredibly helpful. And this is from the Crunchy Enterprise PostgreSQL blog. And essentially this is talking again about a feature that was mentioned in the previous piece of content, but it's dedicated to covering indexes. This is where you're adding additional columns to an index. It's not indexing on that column but essentially it's an additional payload so that when you use the index that data can immediately be used. It doesn't have to go to the heap to pull additional data. So essentially this increases the amount of index only scans you can have without that data exactly having been indexed. So this goes into detail about that and gives a very practical example of using it. So if you're a developer and potentially interested in this feature, this is definitely a blog post to check out. To learn more. [00:09:08] The next post is Partitioning improvements in PostgreSQL eleven and this is from the second quadrant.com blog and again this just goes into more detail of the partitioning improvements that we've discussed. So just to run those through quickly, PostgreSQL eleven has introduced hash partitioning. So if you don't have a natural key to partition by, this allows you to just use a hash to do it. Update can now move rows between partitions if it needs to insert on conflict update now works across partitions or for partition tables. And you also have support for a default partition in a partition table. So if partition table is not necessarily targeted, data can be placed in a default partition. Better DDL support are basically easier to work with partition tables for DDL commands. So for example, you can use create index on a partition table whereas you used to have to do them individually. You can do create unique constraints as well as primary key constraints as well as a partition table can have foreign key constraints. And then he talks about the performance work in terms of pruning the partitions in version ten that only happened at one stage at query plan time. However, with eleven with some improvements it can now do this type of pruning at query plan time when the query parameters are received as well as at each point where one query node passes values as parameters to another node. So this should allow much more queries to target only the partitions that they're interested in. So definitely huge improvements in that area and they also mentioned partitionwise joins as well as partitionwise aggregates. So again, this is a post dedicated to those partition improvements. If you are using partitioning or considering, it definitely a blog post to check out for PostgreSQL eleven. [00:11:03] The next post is death byte database. And this is from the obid GitHub IO blog. And it's basically kind of like a parable and a story of working with someone who had a particular data model for their application and how that data model essentially constrained what they could do and all the workarounds that would have to be done to be able to make it work. They wanted it to, just because of how the database was modeled. So basically the post is emphasizing really thinking about your data model and doing proper database design. Now, related to this, there is a linked YouTube video, it's a presentation that's approximately 45 minutes in length called how to fake a database design by Curtis Poe. And I definitely suggest you checking this out if you're a developer and wanting to learn more about database design. I really like some of his key concepts to keep in mind when you're doing it. So this was a great piece of content to check out. [00:12:01] The last post is Correlation of PostgreSQL columns explained. So basically this is explaining that after you analyze a PostgreSQL table, the distribution statistics are located in the PG stat system view and there's a correlation column. So what is it correlating basically how well does the order you want in the index match the physical ordering of the data on the table? And he describes this as the logical ordering versus the physical ordering and a perfect correlation between that would be a one or a negative one depending upon the order and a zero means there is no correlation. So basically if you're using a sequence for primary ID. That primary key index that generally gets created should have a perfect correlation because all new data coming in has an increased index number that should be perfectly correlated with an ordered index. Whereas if you have random data being inserted into a value and you have an index on that, there's probably no correlation at all between the index and where the data is located in the heap. Now he mentions why should I care about this? Because when you have a good correlation, you're going to get faster response times for your queries. Generally because there's only one place in the heap it has to pull it out and doing a sequential retrieval of that data should be faster than having to do random I O all over the heap area. Now, he does an example here where he's creating two tables, one with a high correlation and a low correlation. So going back where you have an index and just a series of numbers, you see the correlation is one. Whereas when you are randomly ordering data in an additional table he's creating the correlation is essentially close to zero, which means no correlation. And you can see the different execution times when querying those tables. The correlated table returns in one millisecond and the uncorrelated returns in 2.7 milliseconds, so more than twice as slow. Now, are there some things you can do in order to increase your correlation if you're really interested in speed and he mentions some of this at the end of the post. If you need to efficiently scan bigger portions of an index, it's good to keep the table in index order. However, there are no index order tables in PostgreSQL. However, a high correlation column can be maintained in two ways. First, he says automatically where table rows are inserted in logical column order and there are no updates or deletes on the table. So the physical ordering should be identical to the logical ordering. So again, this is the example I mentioned where you have a primary key columns generated by sequences or perhaps with timestamps. And the second way he says is by a clustering. So you can use the cluster command to actually rewrite a whole table so that the physical ordering is identical to the logical ordering. But again, if you're having updates and deletes, it's going to alter the order of that. So if you're really looking to push the performance, this is definitely something you may want to potentially consider, particularly if you have a large analytical database or maybe a data warehouse and you have more flexibility in terms of ordering data in the table, this might be something to consider. [00:15:22] That does it. For this episode of scaling Postgres us. You can get links to all the content mentioned in the show notes. Be sure to head over to Scalingpostgres.com where you can sign up to receive weekly notifications of each episode. Or you could subscribe via YouTube or itunes. Thanks.

Show Notes

Episode Transcript

Other Episodes

Episode 169

Use Bigint, High Availability, Autocommit, Synchronized Sequential Scans | Scaling Postgres 169

Episode 245

ENUMs vs Check Constraints, Faceting With Roaring Bitmaps, Better Scaling, In DB Business Logic | Scaling Postgres 245

Episode 273

Debian / Ubuntu Packaging, ivfflat Indexes, In vs Any, View Logging | Scaling Postgres 273

CTE Warning, PG 11 Features, Death by DB, Correlation | Scaling Postgres 31