Episode Transcript
[00:00:00] In this episode of Scaling Postgres, we talk about prepared plans, materialization, recursive CTEs, and safe migrations. My name is Creston Jamison and this is Scaling Postgres episode 68.
[00:00:21] All right, I hope you're having a great week. Our first piece of content is Tech preview: how PostgreSQL Twelve handles prepared plans. This is from cybertec-postgresql.com, and they're talking about prepared plans, where you can basically prepare a plan ahead of time and then execute that plan. But some of the behavior is different from what people expect prior to twelve, and they're making some adjustments in twelve, so this post examines that. He creates a table here with just a serial column as the ID and a name column as text. He inserts about a million rows with the name Hans and only two rows with the name Paul, so an index will be very efficient at finding Paul, but Postgres will use a sequential scan for Hans. So he looks at that here: he goes ahead and creates an index on the name, and when he looks for Hans, you'll see it does a sequential scan, even in parallel, to pull out the data. So it does not use the index even though it exists, because it's more efficient to just scan the whole table to find all the instances of the name Hans. Whereas when he does Paul, it does use the index. This is the optimizer checking the statistics to determine whether so many rows match Hans or Paul that an index would not be worthwhile. And then he looks at the pg_stats view and shows that, as part of these statistics, Hans is listed among the most common values, so there's a high probability it will not use an index when searching for that name; it'll just do a sequential scan. Now, in terms of prepared plans, you can prepare a plan and then execute it, passing in the value you want to test. So we see here that when it executes the plan for Paul, an explain shows it actually does an index scan, and even when using Hans it chooses the appropriate plan; it still does the sequential scan. And he points out that even if we prepare explicitly, we will still get fresh plans at first before a generic plan is created, and a generic plan is made assuming some constant parameters. So this is kind of a judgment call. He notes that you could create a new plan for every execution that considers the current parameter value, which will always lead to the best plan.
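As a rough sketch of the behavior being described (the table and column names here are illustrative, not copied from the post), preparing a statement and then checking which plan each execution gets looks something like this:

    -- A skewed table: one very common name, one very rare name.
    CREATE TABLE t_demo (id serial PRIMARY KEY, name text);
    CREATE INDEX idx_demo_name ON t_demo (name);

    -- Prepare a plan once, then execute it with different parameter values.
    PREPARE name_lookup (text) AS
        SELECT * FROM t_demo WHERE name = $1;

    -- EXPLAIN on the execution shows which plan was chosen for that value:
    EXPLAIN EXECUTE name_lookup('Paul');  -- typically an index scan for the rare value
    EXPLAIN EXECUTE name_lookup('Hans');  -- typically a sequential scan for the common value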
[00:02:48] But then the purpose of preparing that statement, to avoid plan calculation, kind of goes out the window, because you have to calculate the plan every time. Or you could use a generic plan that does not take the parameters into account, but then you may not end up with the best plan. He says that by default PostgreSQL chooses a middle road: it will generate a custom plan during the first five executions of the prepared statement that takes the parameter values into account, and from the sixth execution on it checks whether the generic plan would have performed just as well, and if so it will use that generic plan from that point on. But PostgreSQL twelve introduces a new variable that allows you to control the behavior more explicitly. So for example, you can set plan_cache_mode, and there are a couple of different ways you can set it. If you set it to force the generic plan, you'll see you always get a parallel sequential scan for this query. So again, that's not the best plan for Paul, but you're forcing it to use its generic plan. You can also set it to force a custom plan, or to auto, which is the default value. So definitely a new parameter to be aware of if you want to adjust the default behavior in PostgreSQL twelve once it comes out.
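Once PostgreSQL 12 is out, that setting can be adjusted per session; a minimal sketch, reusing the hypothetical prepared statement from above:

    -- Default: custom plans for the first five executions, then possibly
    -- a generic plan if it performs comparably.
    SET plan_cache_mode = 'auto';

    -- Always build a fresh plan using the actual parameter value:
    SET plan_cache_mode = 'force_custom_plan';

    -- Always reuse the generic plan, skipping per-execution planning:
    SET plan_cache_mode = 'force_generic_plan';
    EXPLAIN EXECUTE name_lookup('Paul');  -- now planned without looking at 'Paul'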
[00:04:08] The next post is Tech preview: improving COPY and bulk loading in PostgreSQL twelve. This is from cybertec-postgresql.com as well, and it covers the COPY command, where you can copy data from a file into a table. Before, you had to load the whole file, but with version twelve they're offering a WHERE clause, so you can actually filter on one of the columns you're importing to define which subset of the data gets loaded. So for example, he has a file with a thousand rows in it, numbered one to a thousand, and he says give me all the rows where the column being loaded, x, is less than five, so it only loads four rows. So as opposed to having to load everything and then select from that table into another area, you can now use a WHERE clause to fine tune exactly how much data you load into PostgreSQL twelve. So definitely a great feature to be added.
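A hedged sketch of what that looks like in PostgreSQL 12 (the file path, table, and column name are made up for illustration):

    CREATE TABLE t_import (x int);

    -- Only rows satisfying the WHERE clause are loaded;
    -- the rest of the file is read but discarded.
    COPY t_import FROM '/tmp/numbers.txt' WHERE x < 5;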
[00:05:09] The next post is Rethinking the database: materialized view as an index, and this is from blog.timescale.com. This is about TimescaleDB, which is an extension for using time series data with PostgreSQL. They're comparing and contrasting indexes versus materialized views. Basically, indexes are updated in real time, so you can use them to track what's going on in the database, whereas materialized views are refreshed on a periodic basis. You usually use them when you have to do aggregations over a lot of data, because if one new data point is added, it's easy to add a reference to an index, but it's harder to recalculate a whole sum or a whole average. There are techniques you can use to do that, but with a materialized view you generally refresh the whole thing. They're taking an approach to treat a materialized view more like an index, because, as they say here, time series data is different. You don't have writes scattered all across the table; data is basically continuously being appended. With that, you can have all of your materialized data up to a certain threshold, knowing that everything before that point is materialized, and then you have the non-materialized data that's not part of the materialization yet, which you can add in on the fly as you need, because you know where that threshold point is. Now, we've seen some other references to ways of getting around the full refresh. There have been a few blog posts in older episodes of Scaling Postgres where Citus Data came up with a couple of different techniques to do essentially the same thing, where you query a summarization table or some sort of materialization and then add in the fresh data from a given point to give you essentially real time analytics. They're talking about the same thing here, but it looks like it's something they're looking to build in as an actual feature in TimescaleDB.
[00:07:12] So it's a pretty interesting thought experiment. And if you're interested in this type of thing, definitely a blog post to check out.
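The general pattern being described, combining pre-aggregated data up to a known threshold with the not-yet-materialized tail, can be sketched in plain SQL. The table and column names below are hypothetical and this is not TimescaleDB's actual implementation, just the shape of the idea:

    -- Pre-aggregated rows up to the last materialization threshold...
    SELECT bucket, device_id, avg_temp
    FROM daily_summary
    WHERE bucket < '2019-07-01'
    UNION ALL
    -- ...combined with an on-the-fly aggregate of the newest raw rows.
    SELECT date_trunc('day', ts) AS bucket, device_id, avg(temp) AS avg_temp
    FROM readings
    WHERE ts >= '2019-07-01'
    GROUP BY 1, 2;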
[00:07:18] The next post is Checking the sequences status on a single pass, and this is from Luca Ferrari at fluca1978.github.io. Here he's talking about a relatively simple function he created called seq_check that basically tells him how long it will be before he runs out of IDs for a given sequence. It looks like he's using integers for the sequence columns, I believe. So here's his function, and it allows you to tell when your sequences are going to run out. So definitely a potentially useful tool.
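I don't have his exact function here, but a rough sketch of the same idea, checking how much headroom each sequence has left using the pg_sequences view (PostgreSQL 10 and later), might look like this; this is an assumption about the approach, not his code:

    -- For each sequence, report its last value and how many IDs remain
    -- before it hits its declared maximum.
    SELECT schemaname,
           sequencename,
           last_value,
           max_value,
           max_value - coalesce(last_value, 0) AS ids_remaining
    FROM pg_sequences
    ORDER BY ids_remaining;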
[00:07:53] The next post is FizzBuzz in both plpgsql and SQL. FizzBuzz is a programmer test, and he explains exactly what it is: basically, print out the numbers from one to 100, and at multiples of three print Fizz, at multiples of five print Buzz, and at multiples of both three and five print FizzBuzz. He did an implementation using plpgsql, which was pretty straightforward programmatically, but then he did it as a recursive CTE, so pure SQL doing it in a recursive CTE, which I found particularly interesting. So if you want to learn more about recursive CTEs and how they could be used, maybe this is a blog post you'll want to check out.
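The pure-SQL version is worth seeing; a minimal recursive CTE sketch of FizzBuzz (not necessarily his exact query) looks like this:

    WITH RECURSIVE numbers AS (
        SELECT 1 AS n
        UNION ALL
        SELECT n + 1 FROM numbers WHERE n < 100
    )
    SELECT CASE
               WHEN n % 15 = 0 THEN 'FizzBuzz'
               WHEN n % 3  = 0 THEN 'Fizz'
               WHEN n % 5  = 0 THEN 'Buzz'
               ELSE n::text
           END AS fizzbuzz
    FROM numbers;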
[00:08:39] The next post is A recursive CTE to get information about partitions. And again, this is from Luca Ferrari's blog. He uses partitioned tables in his database and he wanted to know, say, which partition is growing more or less, and how each partition compares to the others. So there's a hierarchical relationship: you have parent tables and you have the child partitions, and he's using a recursive CTE to collect this information so it can output, for each partition, things like how many tuples and pages there are and whether it is itself partitioned or not. So it gives him a way to get a status of his partitions. And again, this is another example of using the recursive CTEs that SQL provides. So if you want to stay up to date on how to use recursive CTEs, definitely a blog post to check out.
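Again, not his exact query, but the general shape of walking a partition hierarchy with a recursive CTE over the system catalogs is something like this (the root table name is illustrative):

    WITH RECURSIVE partition_tree AS (
        -- Start from the root partitioned table.
        SELECT oid AS relid, relname, 0 AS level
        FROM pg_class
        WHERE relname = 'measurements'
        UNION ALL
        -- Walk down to each child partition.
        SELECT c.oid, c.relname, pt.level + 1
        FROM pg_inherits i
        JOIN pg_class c ON c.oid = i.inhrelid
        JOIN partition_tree pt ON pt.relid = i.inhparent
    )
    SELECT pt.relname,
           pt.level,
           c.reltuples,
           c.relpages,
           c.relkind = 'p' AS is_partitioned
    FROM partition_tree pt
    JOIN pg_class c ON c.oid = pt.relid
    ORDER BY pt.level, pt.relname;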
[00:09:37] The next piece of content is actually a YouTube video. I viewed several videos from RailsConf, which happened recently, and some of them talked about databases and specifically PostgreSQL. This one is RailsConf 2019: Rails db migrate safely, by Matt Ducinik. He talks about how to migrate your database safely. The first recommendation is don't add columns with a default value. Now, version eleven of PostgreSQL makes this almost safe, at least for static default values, but if you're using a version prior to eleven, you definitely need to take this into account: add the new column without a default, go ahead and backfill the data, and then add the default or constraint. The next consideration is don't backfill data inside of a single transaction, basically because you can lock access to the table. If you're going to do it, you need to narrow down the scope of the rows you're going to impact with that backfilling, maybe 100 rows at a time or a thousand rows at a time. He also says don't make schema and data changes in the same migration; the more piecemeal you do it, the safer it generally is. Do add Postgres indexes concurrently, and also drop them concurrently. They also had some downtime where they did a drop and a build in the same migration. He didn't mention this, but you definitely want to add the new index first; you can have more than one index on certain columns, so it's always best to add the index and then concurrently drop the one you're not using. And of course, always important is to monitor and test your database performance. Those were the main takeaways, but it's definitely a good presentation that talks through these considerations with some real life experience of things that actually impact production. So if you want to get more insight into that, definitely a presentation to check out.
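In raw SQL terms, the safer sequence he describes (on a version prior to eleven) looks roughly like this; the table, column, and index names are hypothetical:

    -- 1. Add the column without a default so the table isn't rewritten.
    ALTER TABLE users ADD COLUMN status text;

    -- 2. Backfill in small batches rather than one big transaction.
    UPDATE users SET status = 'active'
    WHERE id BETWEEN 1 AND 1000 AND status IS NULL;
    -- ...repeat for the next id range...

    -- 3. Only then add the default (and any constraint) for new rows.
    ALTER TABLE users ALTER COLUMN status SET DEFAULT 'active';

    -- 4. Build and drop indexes concurrently to avoid blocking writes.
    CREATE INDEX CONCURRENTLY idx_users_status_new ON users (status);
    DROP INDEX CONCURRENTLY idx_users_status_old;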
[00:11:27] The next presentation, also from RailsConf 2019, is When it all goes wrong (with Postgres) by Will Leinweber, who's from Citus Data. This presentation talks about what to do when something has gone wrong. From his experience, generally 95% of the time it's something in the application that caused it, maybe 4% of the time it's autovacuum related, and 1% of the time it's something else, so maybe there really is something wrong with the database or something of that nature. The vast majority of the time it's usually something the application is doing that's causing the problems you're seeing in PostgreSQL. He had a very interesting section where he looks at what your database is doing in terms of CPU, memory, disk, and parallelism, how many connections are connecting to it, and what's active. He basically highlighted each of these in various combinations to suggest what's probably going on. For instance, if you have high disk usage and high parallelism it could be one thing, or if you have just high CPU it could be another. And here he's saying when your memory is high, well, maybe you have some large GROUP BYs going on, or you have high disk latency due to an unusual page dispersion pattern in the workload. So he goes over several of these scenarios and what the reason could be when you see these metrics high in a given combination. He also went over various tools you can use to better monitor your PostgreSQL installation from an OS perspective, things like perf and iostat and pgrep and htop. So if you're interested in that, definitely a presentation to check out.
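For the "how many connections and what's active" part, a simple first look from inside Postgres itself is a query like this (a generic example, not taken from the talk):

    -- Count backends by state: active, idle, idle in transaction, etc.
    SELECT state, count(*)
    FROM pg_stat_activity
    GROUP BY state
    ORDER BY count(*) DESC;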
[00:13:07] The last presentation from RailsConf is Postgres and Rails 6 multi-database: pitfalls, patterns, performance, by Gabe Enslein. He does talk about the multi-database feature a little bit, and he says generally what you're probably going to use it for at first is replicas: split your reads and writes, sending writes to one primary and reads to replicas. But at some point you may want to consider sharding, so you could actually take whole tables and put them on their own database. Of course, if you do that, you now need to think about how you're going to do joins; are you going to do joins in the application? That starts getting really difficult, and at that point you may want to consider something like Citus to actually shard your database across, say, accounts or customers or something of that nature. So not a lot of it was about the multi-database feature specifically, but he also went into a whole discussion about different tools you can use, such as EXPLAIN ANALYZE to find out why your queries are slow, partitioning if you have slow replication or a lot of data in a table, and the importance of connection pooling. So it is a general performance focused talk, not exclusively about multiple databases, but that's definitely a major part of it. So definitely an interesting presentation to check out.
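For the slow-query investigation he mentions, the starting point is simply running the query under EXPLAIN ANALYZE; the table and column here are illustrative:

    -- Show the actual plan, timings, and buffer usage for a suspect query.
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT * FROM orders WHERE customer_id = 42;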
[00:14:32] The next post is Indexes in PostgreSQL - 10 (Bloom), and this is from habr.com. Again, this looks like it was reposted and translated from postgrespro.ru, and they're talking about Bloom indexes in PostgreSQL. Now, I'm not that familiar with Bloom indexes, but it looks like it's a highly compact way to do membership checks of an element in a set, though it does allow false positives, so it has some benefits and negatives. So if you're potentially interested in Bloom indexes, definitely a blog post to check out.
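Bloom indexes ship as a contrib extension; a minimal usage sketch, with an illustrative table and columns rather than anything from the post:

    CREATE EXTENSION IF NOT EXISTS bloom;

    CREATE TABLE t_bloom (a int, b int, c int);

    -- A single compact bloom index covers equality searches on any
    -- combination of these columns, at the cost of possible false
    -- positives that Postgres rechecks against the table.
    CREATE INDEX idx_t_bloom ON t_bloom USING bloom (a, b, c);

    SELECT * FROM t_bloom WHERE b = 1 AND c = 2;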
[00:15:10] The last post is What's new in Crunchy PostgreSQL Operator 4.0, and this is from crunchydata.com. They've released version 4.0 of their operator, which basically, as they say here, quote, extends Kubernetes to give you the power to easily create, configure and manage PostgreSQL clusters at scale. Basically this is using Docker, so if you want to use PostgreSQL on Kubernetes, definitely a blog post to check out.
[00:15:38] That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.