Bloat vs zHeap, Composite Indexes, Unsafe Operations | Scaling Postgres 39

Episode 39 | November 19, 2018 | 00:17:45

Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, we review articles covering bloat vs. zheap, composite indexes, unsafe operations, and being a better developer.

To get the show notes as well as get notified of new episodes, visit: 

https://www.scalingpostgres.com/episodes/39-bloat-vs-zheap-composite-indexes-unsafe-operations-better-developers/


Episode Transcript

[00:00:00] In this episode of Scaling Postgres, we talk about bloat versus zheap, composite indexes, unsafe operations, and better developers. I'm Creston Jamison, and this is Scaling Postgres episode 39.

[00:00:23] Alright, our first piece of content this week is "Dealing with significant Postgres database bloat: what are your options?" This is from the labs.contractually.com blog. Basically, they were having a lot of problems with bloat, I believe due to a lot of different update statements. The first thing this article goes into is: what is bloat? Essentially, due to Postgres's MVCC architecture, whenever a row gets updated a new row version actually gets created, and eventually the old row needs to go away. If you're doing updates so fast that vacuum isn't cleaning those old rows up quickly enough, you essentially get table bloat. So you have data that is no longer valid, but it's still filling up your table. He then talks about what problems bloat causes. Essentially, it can cause poor performance. Their example is a table with 350 million dead tuples but only 50 million active rows, which the tool they use scores as a bloat factor of seven. When the query planner tries to plan queries against this table, it produces poor plans that can lead to slow queries, because it no longer has an accurate picture of how many rows are in the table. In their scenario, they had a query that was taking 14 seconds, but after they got rid of the bloat the query took 37 milliseconds, a 99.7% reduction in execution time. So bloat can really cause some problems. Then they talk about how database bloat typically gets cleaned up. That's vacuum's job, but what if vacuum isn't keeping up, or you're updating so fast that vacuum struggles to keep up? One option is to not update as often; maybe you should switch some of those updates to inserts from your application's perspective. In the scenario they talked about here, what's causing the bloat may be an overzealous background job that's updating records far too frequently, or a lack of rate limiting. Basically, it's something in your application that you need to resolve to avoid so many updates. In particular, if you're repeatedly updating the same record, you'll not only have tons of issues with locking, you'll also create tons of bloated row versions in your table. Of course, once you have a bloat problem, how do you deal with it? One option is VACUUM FULL, but that essentially locks the entire table while it runs, so it's only useful if you can shut off access to the table entirely. Then there are two other tools you can use: pg_repack and pgcompacttable. In the scenario they're discussing here, they used pg_repack. They talk a little bit about how it works, but unfortunately they were also using logical replication to upload data to Redshift, and because their table was so large they ran into some out-of-memory errors and other issues with pg_repack, so they basically had to drop the replication slot while they were doing the repack operation. So this was a great post that goes into bloat: what it is, where it comes from, and how to potentially deal with it. So if you suspect that some of your databases are being impacted by this, this is definitely a blog post to check out.
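If you want a rough sense of whether a table is bloated before reaching for a tool like pg_repack, the standard statistics views give an approximate picture. Here is a minimal sketch (this is not the query from the article, just the built-in pg_stat_user_tables view, whose counts are estimates):

    -- Approximate dead vs. live tuple counts per table; treat the ratio
    -- as a rough bloat signal rather than an exact measurement.
    SELECT schemaname,
           relname,
           n_live_tup,
           n_dead_tup,
           round(n_dead_tup::numeric / nullif(n_live_tup, 0), 2) AS dead_per_live,
           last_autovacuum
    FROM pg_stat_user_tables
    ORDER BY n_dead_tup DESC
    LIMIT 20;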
[00:03:49] The next post is actually also related to bloat, and it's about zheap; this is a YouTube video. Again, this is from the PGCon YouTube channel, and this particular presentation is "zheap: less bloat, fewer writes, and just plain smaller" by Amit Kapila and Robert Haas from PGCon 2018. I've mentioned this YouTube channel previously, but this is one presentation I think you should definitely check out, because I think it represents the probable future of where storage is going for Postgres. zheap is basically a reimagining of how Postgres stores and handles data in order to avoid these bloat issues. In other words, for an update you don't create a new row and then eventually delete the old one; you do more in-place updates. But to handle that, they had to implement essentially an undo log to be able to roll back changes. This is a project they're working on; they don't know whether it's going to land in version 12, 13, or 14. But it kind of represents the future, and since it's related to bloat, I wanted to mention it here. Now, this particular YouTube video is not great - they never show the screen, so I just used it for the audio and followed along with the presentation slides, which I found separately. The whole presentation is interesting, so you should definitely check it out. In the comparison they show, the heap is just the standard heap as it exists now: with eight clients, a particular load was placed on it, and they show the size of a particular table as it grows. Because the new zheap architecture doesn't store all of the transaction visibility information in the table, zheap starts out smaller with all the data loaded. As they run transactions against it - updates and deletes that cause those modifications - you can see the bloat in the standard heap increasing and staying high, whereas with zheap the size stays consistent. And when they bring it up to 64 clients doing operations, again the bloat grows higher in the existing Postgres heap, whereas in zheap it's consistent. So basically it's an opportunity to eliminate these bloat issues and potentially not have to use vacuum anymore, which would be super awesome. They also made note of performance changes - this is something in development, so it's going to change - but the TPS (transactions per second) of zheap was 40% higher than the regular heap in the above tests, and they've seen an improvement of as much as 100% in other tests. So definitely something to keep an eye on and definitely a piece of content to check out.

[00:06:42] The next post is "How to use composite indexes to speed up time-series queries". This is from the timescale.com blog. This is talking about time-series queries, and Timescale is a time-series extension that you can add to Postgres, but I found this post relevant if you're working with any kind of time-series data, where you typically want to look at certain ID columns as well as a time column to examine that information over time. In their example they're using a timestamp, a device ID, a device type, and a store ID, and you're going to want to query those in different ways.
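As a rough sketch of that kind of schema (the table and column names below are hypothetical, not taken from the Timescale post), a couple of composite indexes matching common query patterns might look like this:

    -- Hypothetical time-series table
    CREATE TABLE readings (
        time      timestamptz NOT NULL,
        device_id integer     NOT NULL,
        store_id  integer     NOT NULL,
        value     double precision
    );

    -- Supports "all readings for one device over a time range"
    CREATE INDEX ON readings (device_id, time DESC);

    -- Supports "all readings for one store over a time range"
    CREATE INDEX ON readings (store_id, time DESC);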
And how you create your multicolumn index is very important based upon what questions you want to ask. So this goes over the different questions you might want to ask and how best to structure your multicolumn indexes in terms of column ordering, because the ordering is very important. Now, they do show some explain plans here, but I think some of those are related to the chunking that TimescaleDB does, so that part may be a little less relevant. But if you want to make sure your indexes are being used appropriately, getting the order right is super important, so if you suspect you need to brush up on that, this is definitely a blog post to check out.

The next post is "Safe and unsafe operations for high volume PostgreSQL". This is an old post from back in 2016, but it has recently been updated for the changes in PostgreSQL 11. Basically, it lists the different schema changes you might want to make and whether they are safe to do when you have a lot of data and a lot of activity in your database. For example, adding a new column is safe. Adding a column with a default was unsafe in every version before 11, but now, with PostgreSQL 11's new feature, it is safe, so the post has been updated for that scenario: adding a column with a default and adding a column that is not nullable are both safe on PostgreSQL 11, but still unsafe for prior versions. So this is a good reminder in general about which operations are safe and unsafe for PostgreSQL, and if you need a refresher on that, it's definitely a blog post to check out.
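To make that concrete, here is a minimal hypothetical example (the table and column names are made up). On PostgreSQL 11 and later the default is stored as metadata, so this is a fast change; on version 10 and earlier the same statement rewrites the entire table while holding a heavy lock:

    -- Safe on PostgreSQL 11+; unsafe (full table rewrite) on earlier versions
    ALTER TABLE orders
        ADD COLUMN fraud_checked boolean NOT NULL DEFAULT false;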
[00:09:04] The next piece of content is "Being a better developer with EXPLAIN" by Louise Grandjonc. This is an older video; I actually watched all of the PGDay Paris presentations, which I think took place in March but weren't posted until June, and I'm prioritizing what I thought were the best presentations from there. I would say this is my top one. It basically covers utilizing EXPLAIN and PostgreSQL logging in your development environment to track what queries your ORM is actually running, so you can understand and optimize them better. She covers how to enable the different logging settings in PostgreSQL if you don't want to use application-level query logging, and she goes through how to use EXPLAIN. Then she did something pretty interesting: she wrote a Python version that just pulls raw data from the database and builds the result in Python, and got it running in 1.5 seconds. Then she built a query that uses the features of Postgres to pull that out directly, thinking it would be faster, but it actually ran in 48 seconds, over 30 times slower. She then used EXPLAIN to understand where the time was being spent and what was causing it to be slow. Once she added the necessary indexes, that query completed in eight milliseconds, dramatically faster than the version without indexes, but also dramatically faster than the Python version that did the merges and joins in Python. So if you're a developer, I would say this is definitely a presentation to check out.

[00:10:54] The next piece of content is also from PGDay Paris: it's "Change data capture for a brand new world", again from the PGDay Paris YouTube channel. This goes into a bit of the history of exporting data from a database and potentially loading it into a data warehouse or using it for other purposes. The first tools available were ETL tools or nightly dumps of data, or potentially using the COPY command in PostgreSQL to export data and import it elsewhere. Then that evolved into trigger-based mechanisms, where you apply triggers that capture changes into an audit or history table. But some of the disadvantages of those are that they can be brittle if you're going to be changing the schema of the source tables, and that you're essentially doubling your writes, because every write that goes into a table also needs to go into the audit table. Then they said pretty much what most people use today is the logical decoding feature: basically creating a logical replication slot and then reading it using a tool such as wal2json, which essentially reads the WAL stream and pulls out the information that's needed. They go over the different tools that allow you to do this and talk a little bit about Kafka for when you really want to get sophisticated. So if you need to extract change data from your PostgreSQL database, this is definitely a presentation to check out for the different methods and tools that are available.

[00:12:26] The next piece of content from PGDay Paris is "Constraints: a developer's secret weapon" by Will Leinweber. He talks about the importance, if you're an application developer, of not just using your database as a kind of dumb hash in the sky. For most applications the database schema changes very little relative to application changes, so if you put the appropriate constraints on your tables, they can help you avoid application bugs that cause data issues. For example, he talks about using NOT NULL where appropriate, potentially using unique indexes, and using the appropriate data type - not just text and integers for everything, but Booleans or UUID data types where they fit, basically using the data type for its intended purpose to help constrain the data. He also talks about creating specific constraints, like making sure a price is greater than zero, for example. It's a relatively brief video, but if you feel like you need to add some more constraints to your database, definitely a piece of content to check out.

[00:13:40] The next post is "Triggers: friends to handle with care", and again this is another video from PGDay Paris. It basically covers using triggers, how to set them up, and some caveats regarding them. It doesn't necessarily have information that helps with scaling, but it's definitely interesting if you want to learn more about triggers or use more of them in your PostgreSQL database.

[00:14:08] The next post is "Preventing SQL injections", and this is from the tapoueh.org blog. He talks about some capabilities PostgreSQL has to avoid SQL injections, and one way is basically to parameterize your queries. He talks about using PQexecParams: you separate the query text from the parameters, and if you pass the parameters that way, you essentially avoid SQL injections. The other way to do it is using server-side prepared statements: you prepare the statement you want, with placeholders for the parameters it can receive, and then you execute it, passing in the parameters. Now, one thing to keep in mind, which someone actually mentioned in the comments, is that with PgBouncer - if you've scaled to the point where you're using PgBouncer - you can't really use prepared statements, so at that larger scale you basically have to rely on these types of parameterized queries. If this is a particular concern for you, definitely a blog post to check out.
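To make the prepared-statement approach concrete, here is a minimal SQL-level sketch (the table and parameter are hypothetical); client libraries expose the same idea through their parameterized-query APIs, such as libpq's PQexecParams:

    -- The query text is fixed; user input only ever arrives as a bound
    -- parameter, so it can never be interpreted as SQL.
    PREPARE find_user (text) AS
        SELECT id, email FROM users WHERE email = $1;

    EXECUTE find_user('alice@example.com');

    DEALLOCATE find_user;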
[00:15:19] The next post is "Zero downtime upgrade PostgreSQL 10 to 11", and this is from the evolvemonkey.blogspot.com blog. With 11 coming out, people are planning their upgrade from version 10, and this goes through the different upgrade paths that are available. Number one is pg_dump, but for a large database that would take an enormous amount of time. pg_upgrade is usually the accepted practice, and if you do the upgrade using hard links it can be completed very quickly; his example was a 13 gigabyte database, but even larger databases can be upgraded quickly this way. The other option, thanks to the logical replication introduced in version 10, is to use that method. Now, this is kind of new - he has a caveats section and there's a particular process you can go through to do it, but it's a bit of uncharted territory. So you definitely want to examine this method and check it out, and maybe this blog post can help you make up your mind. Definitely a piece of content to check out.

[00:16:23] The last post is "Five Cool Things I Learned at the PostgreSQL Conference Europe 2018", and this is from the severalnines.com blog. He talks about shared_buffers and how to think about setting them, based on one presentation, as well as some VM settings for the Linux distribution you installed Postgres on. He talks about advanced logical replication, covering the different enhancements around logical decoding, and close-to-zero-downtime upgrades, referencing the previous post. There's a discussion of what's new in PostgreSQL 11 and all the new features, then "zheap: an answer to PostgreSQL bloat woes", which was the presentation I mentioned previously, and lastly parallel query in PostgreSQL and how not to misuse it, which was mentioned in our previous episode of Scaling Postgres. So if you want a different individual's opinion on which presentations were great, definitely a blog post to check out.

[00:17:22] That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to scalingpostgres.com, where you can sign up for weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.
