Postgres 12 Released, Generated Columns, Parallel Estimates, Change Data Capture

[00:00:00] In this episode of Scaling Postgres, we talk about postgres twelve being released, generated columns, parallel estimates and change data capture. I'm Kristen Jameson and this is scaling postgres episode 84. [00:00:22] Alright, I hope you're having a great week. Our first piece of content is PostgreSQL twelve is released. This is from Postgresql.org website and it was released on October 3. And some of the main areas they talk about in terms of what's been updated is overall performance improvements. So the first is a significant performance and maintenance enhancements to the indexing system and partitioning. So indexing can get up to a 40% space reduction in gain in query performance and it handles workloads better where the indexes are frequently modified. There's also partitioning improvements where you have thousands of partitions and also improves insert and copy performance as well into partitions. They reduce the write ahead log overhead when doing Gist gen and SP Gist index types. You can create covering indexes. The include clause on Gist indexes create statistics now supports most common value statistics just in time compilation happens by default. Also enhancements to the SQL JSON path common table expressions are inlined by default. But again, you can change it whether you want it materialized or not. And then generated columns which we'll discuss in a post a little bit later. [00:01:44] And then some other things interesting mentioned in terms of administration is now you can reindex concurrently, which is huge, and also PG checksums. Now there's a few other postgres that we're going to talk about that discusses postgres twelve. The next one is eight major improvements in PostgreSQL twelve. And this is from Enterprisedb.com. First thing I talked about is the partitioning performance. Again, when you have a lot of partitions that was improved. We saw a post on that in a previous episode of Scaling Postgres as well as in certain copy statements, performance improvements. Again going back to the Btree index enhancements, doing spatial utilization and some of the other ones I just mentioned, talking about the multicolumn most common value statistics so that you can define statistics across columns, the common table expressions or CTEs having the width query being end aligned, and you can define whether you want it to materialize it or not. They mentioned prepared plan control and they mentioned that previous versions of PostgreSQL would use a custom plan five times and on the six generate a generic plan and uses as if it's as good as one of the custom ones. But that behavior can now be controlled through a variable called plan cache mode and this allows you to use a generic plan right away. And they say that this should benefit users who know their parameters are constant and know the generic plan will work. They mentioned the just in time compilation, the checksum control, and that now you can change a cluster from no checksums to checksums without having to dump and reload the data. [00:03:24] You must be offline to do this change. So it sounds like it does require a stop and a start, but again being able to turn them on or off, that would be beneficial. And they mentioned the reindex concurrently. So again, a lot of great changes in version twelve. [00:03:41] The next post again related to the new version twelve feature is generated columns, and this is from PG IO. And the first thing you talk about, what are generated columns? Well they just created a table here with three columns, looks to be a width, a height, and then you can calculate the area, but that's generated always as a width times height and it's going to be stored. So this is stored in the table and whenever the width or height are adjusted, this will automatically update. So you can see here you're doing an update, you're going to set the width to 40 and now that area will automatically be regenerated again. And they mentioned a few points you should know about generated columns. One is persistence. So currently the value of the generated columns have to be persisted and cannot be computed on the flight query time and the stored keyword must be present in the column definition. Two, the expression used to compute the value has to be immutable, so it can depend on other columns but not other generated columns. They can be used in indexes, but the values of generated columns are omitted from the output of a PG dump in the copy table commands. Now they have an interesting example here where they actually use it to make a column to support a full text search. So they have scenes and some information here and they have a body column with the full text of this scene from a play and they generated a column as a TS vector and converted the body to a TS vector. So now that you can easily do full text searches, so that could be a really great use case for generated columns. So if that's of interest to you, definitely a blog post to check out. The next post is freshly baked PostgreSQL twelve and changes are brought to PG Center. So I believe this is a product for monitoring postgres. I'm not that familiar with it, but they mentioned some of the changes that have been made. The first two they mentioned are more progress indicators. So when you're doing an index creation, you can look at PG Stat Progress to check the status of the creation of this index. And when you're doing a vacuumful or a cluster you can use PGSTAT progress cluster. It also mentions that on PGSTAT database there's two new columns that have been added, checksum Failures and Checksum Last Failures, so that you can check on those checksums that you can enable now once you start and stop your database. So again more information regarding changes to the system views to be able to do better monitoring. So if you're interested in that, check out this blog post and this is from Dataegret.com. [00:06:18] The next post is how PostgreSQL estimates parallel queries. And this is from Cybertechn postgresql.com. And they're talking about parallel queries which were introduced in 9.6. And this particular post talks about just parallel sequential queries. So they set up a simple table here with two columns, one with I believe, 10 million different values, and then the other column only contains two different types of values. Now, first they just want to see what non parallel looks like. So they set the max parallel workers per gather to zero and then do the query. So they get a sequential scan and then they want to figure out how it calculates different estimates here. So for example, this is an explained, it didn't run the query. So you're seeing this estimate here and he asks, well, how did it come to that? So he looked at the PG relation size to get the number of blocks. And then he used this calculation taking the current setting of the sequential page cost multiplied by the number of blocks and then the CPU Tuple cost multiplied by, I believe, the number of rows and the CPU operator cost by the number of rows. And you get the exact cost estimate that they come up here for a non parallel scan. Then he did the same thing where he looked at parallel Sequential scan and he shows the formula that is used here to estimate how long it will take to do this particular parallel scan, particularly looking at the number of rows that it estimated here. And they take the number of estimated rows divided by the number of cores plus one minus the leader contribution. That's the leader that's aggregating the results from each of the gather times, the number of cores. So this is a pretty interesting look at how PostgreSQL actually comes up with its estimates, which I was not too familiar with, but if you're interested in learning more about that, definitely a blog post to check out the next post. Fixing Track Activity Query Size in Postgresql.com so they're talking about an issue where in some of the system views, if you have very long queries and some of these system views tell you what queries running for a particular activity, like a PG Stat Activity or PG Stat statement, sometimes those queries get cut off. And how you can prevent that is to modify the Track Activity query size. But unfortunately there are some costs to that. So you probably don't want to increase it by too much and it actually does require restart of the server. But again, if you have a big system, he says here, increasing it is generally a good idea because more information is generally better. So if you've run into an issue with some of the queries you're looking at in system tables being truncated, definitely check out this post and this setting track Activity Query size. And this is from Cybertechn postgresql.com. [00:09:06] The next post is PostgreSQL change data capture with debesium. So this is a post that talks about using Change Data capture. So it's basically using the wall stream and specifically probably logical decoding in order to extract data as it's being produced. So you don't have to rely on triggers or going through logs. Debesium can actually be set up to use logical decoding to read essentially the wall stream to be able to track what data has changed. Now the use case they're talking about here is that maybe you have a couple of relational databases. You can use the Didbesium to look at the wall stream and then send it to this JDBC connector to update an analytics database. So this is a very comprehensive post that goes through all the different commands to set up and get this up and running. So if you're interested in implementing a Change Data capture method or using the beesium, definitely blog post to check out. And this is from Crunchydata.com. [00:10:10] The next post is how to work with Postgres in Go. So if you use Go you would definitely I think find this of benefit to get the most out of working with PostgreSQL. This is mostly about the interaction of Go with postgres and general recommendations and he summarizes a lot of his conclusions at the bottom here. But as you can see it's a very comprehensive post but he gives a suggestion on what postgres driver you should use configure limits for your connection pool size collect connection pool metrics log what's happening in the driver. So a lot of recommendations on using Go with PostgreSQL. So if you're using that tool, definitely a blog post to check out to make sure you get the most performance out of it. [00:10:55] The next post is the Transparent Data Encryption in PostgreSQL and this is from a high Go website and it was recently announced they were looking to target this for PostgreSQL twelve, excuse me, 13. [00:11:09] And this individual says I've been working with a PostgreSQL community recently to develop transparent data encryption. So this goes through kind of the rationale and thinking about it and how to implement it. So if you're interested in that type of content, definitely a blog post to check out. [00:11:24] Another post by Haigo is a guide to create user defined extension modules to postgres. So again, if you want to do extension development, this is a blog post that walks through how you can go ahead and get started creating your own extension in C. [00:11:41] So if that's of interest to you, definitely a blog post to check out. And the final piece of content is trace Query Processing internals with Debugger. So this goes through the whole process of tracing a query through PostgreSQL. So this is again a very comprehensive blog post that goes through describing each steps that the query goes through through PostgreSQL. So if you are looking wanting to look at the internals and how PostgreSQL does its querying. This would be another piece of content to check out. [00:12:13] That does it. For this episode of Scaling Postgres, you can get links to all the content mentioned in the show notes. Be sure to head over to Scalingposgres.com, where you can sign up to receive weekly notifications of each episode. Or you can subscribe via YouTube or itunes. Thanks. This.

Show Notes

Episode Transcript

Other Episodes

Episode 236

Postgres 15 RC2, Shared Buffers, Secure Connections, PGSQL Phriday | Scaling Postgres 236

Episode 309

Will Postgres Pull A Redis? | Scaling Postgres 309

Episode 197

Explain Analyze Buffers, Healthier Postgres, 2021 Review, High Availability Aspirations | Scaling Postgres 197

Postgres 12 Released, Generated Columns, Parallel Estimates, Change Data Capture | Scaling Postgres 84