Toast Impact, zheap Storage, More Performance, pg_rewind Changes

[00:00:00] In this episode of Scaling Postgres, we talk about Toast impact, Zheep storage, more performance in PG rewind changes. I'm Kristen Jameson and this is Scaling Postgres, episode 137. [00:00:22] All right, I hope you friends, family, coworkers continue to do well. Our first piece of content is statement from the PostgreSQL core team on the EDB acquisition of second quadrant. And this is from postgresql.org. And they're talking about the issue where EDB acquired second quadrant, and now I believe it's, up to 60% of the core team is comprised of EDB employees. And as they say, quote, there's been a long unwritten rule that there should be no more than 50% of the membership of a core team working for the same company. So basically, they say they're looking into this and discussing what to do, and they just wanted to put a post up for the community to let them know that they are aware of this and they're looking to see what solutions they could make. So it's interesting that they actually posted this on the PostgreSQL website. So just some information to keep in mind. [00:01:17] The next piece of content is the surprising impact of medium sized texts on PostgreSQL performance. This is from Hakibinita.com. Now, in terms of medium texts, he basically says small texts are very small strings, like names, slugs, usernames emails, et cetera. And you may be using a varcar instead of an actual text. Large text is large content such as a blog post, articles, HTML content, whereas medium texts are things like longer descriptions, comments, reviews, stack traces, things like that. So it's text that's not quite like a full blog post or a full document, but not very short strings of text. That's what he's calling medium text. And basically, this is a great article, talking about Toast and explaining what it is and how it works. So he goes into what Toast is, and that because PostgreSQL has a fixed page size. A Tuple cannot occupy more than one page. Therefore, Toast was come up with as a way to store more than 8 KB within a row. So basically, it overfills into this Toast storage. And he has information here from the docs that basically say the Toast will compress and or move field values out of line until the row value is shorter than the Toast Tuple target, which is typically 2 KB. So whenever a table is created, there's also a Toast table created as well. And he used this technique here to identify the Toast table of a table he created. Then he actually looked to describe the table and you could see that it has a chunk ID column, which is an OID, a chunk sequence column, which is an integer, and a chunk data, which is the raw data. So basically, even the Toast table can exceed that 8 KB. So basically, it splits it up into separate chunks of data, each with a sequence number that references the value that's Toasted now he does an example here where he inserts a small text value into the table. And when you check the Toast table you can see it's still empty. So everything fits within the main table itself. Nothing is toasted. But then when he inserted a longer text string with over 4000 characters, you can see that now the Toast Table is being populated and that it actually spans three rows in this Toast table. So this reference is the same ID. But as this diagram represents, it's split into three rows in the Toast Table, each chunked with a sequence, as you can see here. Now he also mentioned that it can also do compression and the compression kind of depends upon how random the string is. So to look at some examples, he actually created a function called Generate a random string that you can specify the length as well as the number of different characters to use in generating the random string. So he actually inserted a ten kilobyte string and it was split up into six chunks within the Toast Table after this insert, using entirely random strings. But when he actually restricted it to only three different values, therefore it's easier to compress a string with only three values in it compared to a great many values. Inserting the same length, 10 KB is actually only stored in two chunks and only takes up 3 space, so it compressed it by about 60 or 70%. Now what's even more interesting is that when he uses only one character and the same 10 be stored, it compresses it so well it doesn't even store it in the Toast table. Now he does mention for completeness that you can configure the Toast table per table and that includes the Toast tuple target, which is the minimal tuple length after which PostgreSQL tries to move long values to toast, and then the storage, which is the Toast strategy. The default is extended, but it also offers four other toast strategies and basically this varies whether things are compressed or not and whether they're moved out of line or not. And then he gets into Toast performance. And for this example, he creates a small table, a medium table and a large table. And those links are an indication of the string length. And he inserted 500,000 rows into each of these tables using the different length strings. He turned off parallel queries and it's just doing sequential scans and he looks for a single ID in each table. The small table ran in 25 milliseconds, the median table eventually ran in about 173 milliseconds, whereas the large table did it in 37 milliseconds. So the question is why is the median so much larger than both the small and the large? And the reason being is because of where the data is stored. So the small is all stored in line and it's a relatively small table. There's essentially nothing being stored in the Toast. Now, he purposefully inserted a value that just barely fit into the table and did not spill over into the toast for the medium value. So you can see the toast value is zero, but all those medium strings had to be stored somewhere. So they're stored in line in the table. But as a consequence, the table, as you can see, is almost a gigabyte, whereas the smaller one is at 21 megabytes. Then when he did the large one, pretty much all of that spilled into the toast. It's almost 2GB the Toasted table, but yet the size of the actual table itself is large. So when you're actually doing a scan, it's much faster to scan the small table, 21 megabytes, the large table 25 megabytes, but it's a lot slower to scan that as almost 1GB medium sized table. Now, in terms of this analysis, what he was searching on was the ID. So the ID is located here. If he was actually searching on the actual text value in there, you'd get a much different result because you would have to search through this Toast table and that's what he did. Here where he searched through the Toast values. And here you could see the large table doing a search took over 7 seconds. The median got down to eventually 260 milliseconds, whereas the small was 50 milliseconds. So that makes a lot more sense. It's searching based on how large the data is. And then he looked at the impact of adding an index on the ID. And of course, when you query on the ID you get approximately the same speed, less than a millisecond for each one when querying on the ID. But then he tried querying a lot of data for each of the tables using the ID. So using a range, say pulling half the values of the ID out, and here he saw something different. Where the small table was about 60 milliseconds, the medium table took longer at 284 milliseconds and the Toast table was smaller at 70 milliseconds. So I am a little surprised by this result. I would have expected the medium and the large to be similar because the ID index sizes should be the same between all three tables. And because you're selecting all columns, you still have to go to the heap to pull in the values required. I'm a little surprised that the medium takes longer than the large. He says that the database had to read a larger portion of the table, but because it's returning all columns, I would have expected it would have had to have read all the columns of the Toasted table as well to pull back the results. So, definitely interesting finding. But in terms of adjusting to try to get better performance, he suggests adjusting the Toast Tuple target to determine how much gets stored in the Toast table or not. And the next possibility is creating a separate table to store some of these values. But definitely a very interesting post covering toast and how it works and some performance impact you may run into depending on what values you're storing in postgres. So definitely a blog post I suggest checking out. [00:08:47] The next piece of content is Zheep inspecting storage sizes. This is from Cyberduck Hyphen postgresql.com and they're talking about how Zheep handles storage differently with regards to sizing. And in constructing Zheep, which is a different storage mechanism for postgres, they've made the Tuple header much smaller and the alignment has been improved on how data is actually stored at the page level. So first they created a sample table as a temporary table that's about 500 megabytes. And then they created table like that one using the standard heap. So that's the default storage system and inserted 10 million rows into it. Then they did the same thing for the Zheep storage system and inserted 10 million rows as well. Now, what's interesting is that how much slower the Z heap solution was. Now this is still in beta, it's not production ready yet, so that may have some impact. But the standard storage system, it took 7.5 seconds to insert 10 million rows and it took 28 seconds to insert into Zheep. So that's quite a bit of a difference. But what's interesting is that the size difference is about half the size. So if you look at the standard heap, it was about 500 megabytes. Zheep stores it in 250 megabytes. Now, they say this is due to the things mentioned before, is that the Tuple headers are smaller. So more of the transaction information is stored at the page level as opposed to at the Tuple level as well as they do more efficient alignment and padding in terms of organizing the data. Now, what they also mentioned is that this Excels, when you have fewer column tables with many rows, you're going to get a lot more savings as opposed to having many, many columns in the table. Because again, transaction information is stored at the page level as opposed to the Tuple level. And then of course, the hallmark of Zheep, the reason why people are potentially investigating this is to do in place updates. And you can see when an update is done incrementing each row, the size of the table is exactly identical to the previous table. So there's no doubling in size. If you did this with a standard heap storage system, it would double in size because an update is comprised of a new insert and then the old rows vacuuming away. This isn't necessary using Zheep, but the thing you need to keep in mind is that all of this information is kept around in a rollback area, or maybe it's called Undo and there's actually a directory called Undo where it stores all of this information. So even though it's not stored in the table, it's stored elsewhere. Now, as a consequence, rollbacks are handled differently with a heap storage system. All the rows are right there. So when you do an immediate rollback, it is essentially free because the rows are still there, whereas a rollback with Z heap is more intense because it actually has to literally undo everything and put back in place things. So for example, this rollback on Zheep took 41 milliseconds versus about a 10th of a millisecond for the regular heap. So it's a trade off. You have to consider if you move to Zheep, you get in place updates less table Bloat. But if you do have a lot of rollbacks that can happen, it's going to zap your performance. So, definitely interesting post about Zheep. I definitely suggest you check out. [00:12:06] The next piece of content is postgres 13 performance with Open Street Map data. So this is another post about performance of postgres 13. And here he tested versions 1312.4 and he just used the OpenStreetMap data as a source for checking things out, loading some of the data. There wasn't any appreciable difference. Then you run it through the Pgosm and that happened 12% faster than version twelve and 18% faster than version eleven. Then they did some testing with PG bench and they got a 16% faster result with OpenStreetMap postgres data calculations. They talked a little bit about the Beecher index deduplication that's present, and then they also checked the index time creation and you can see that eleven is faster than both, but 13 was faster than twelve. And I suspect that's because of the deduplication, it takes a little bit more time to create the indexes. Perhaps. But as is the hallmark for version 13, for some of these indexes, he's seen a dramatic 70% reduction in size. Now, in terms of performance, he didn't see that big a difference with the size of the data. But he estimates that when you get into millions of rows, then you're going to really see a performance difference with the indexes as the size impact comes into play. But this is another post from Restproof Labs on the performance of postgres 13. [00:13:30] The next piece of content is TPCH performance since PostgreSQL 8.3. This is from secondquader.com, and this is the second in a series of blog posts. The first one looked at OLTP performance. This looks at more of a data warehousing data mart performance for postgres. And again, they covered versions from 8.3 up to version 13. They used a consistent hardware setup to do these tests. So basically it gives you a trend of showing how performance has changed over time for the different versions. And they used a data set that essentially fits into the shared buffers, a data set that can fit into the memory, and the data set that has to be spilled onto disks. So basically three different variants. And they also added some parallelism as well for certain activities. So the data loading, you could tell that over time it's improved. And 13 is maybe not the best, but still pretty good in terms of loading data, loading data with parallelism enabled. It looks like 13 is probably the winner across all the versions, which is good to see. Then they got into the queries and you could see at the 1GB ten gigabyte and 75 gigabyte, both parallel and unparalleled. You can see the trend is for performance to increase over time and 13 looks to be about the best out of all of these. And then they looked at some parallel queries and there were some variance between the different versions. But again, the more recent seemed to overall be pretty much the best out of the different versions. So things are heading in the right way and there's no great regressions for version 13. So another post about postgres 13 performance if you would like to check it out. [00:15:08] The next piece of content is PG Rewind changes in PostgreSQL 13. So this highlights some changes that have been made to PG Rewind. The first one is that you can now write a recovery.com file in the case of versions prior to twelve and version twelve it actually adds a signal file. They do mention some caveats in terms of working with on how to handle things and that generally it writes the connection information to the PostgreSQL auto.com file because that is essentially the last file checked to assign configuration values to postgres, so it chooses to place those there. The next option they're talking about is restore the target wall. So this can tell the target that you're restoring when doing a rewind of a database system to look in the archive log directory to replay any missing wall from there in order to catch back up with the primary. And also talking about changes to better handle automatic crash recovery. So starting an instance in single user only mode to ensure that crash recovery is performed before doing its sync to the primary. So if you're interested in learning about these changes with regard to PG Rewind in postgres 13, definitely check out this blog post from Sqlinfo de. [00:16:23] The next piece of content is data consistency issues in logical replication. This is from Elephanttamer Net. He's talking about the scenario where you have a publisher and subscriber and that the data on the subscriber can still be updated, deleted, inserted into, and that can cause issues with the replication. So he's basically covering some of these scenarios. So scenario one, data inserted on the subscriber side. So generally you're going to have a replica identity or a primary key in order to logically replicate over data. But the issue is that if you insert data into the subscriber table and then you get an insert coming in from the publisher table with the same primary key, you're going to get an error. So duplicate key violates unique constraint and he says replication will stop and wall segments will start to pile on the publisher. So definitely an issue. So basically you need to watch out and avoid for these situations. But the way to get around it is to delete the data for that particular ID. Scenario two is data updated on the subscriber side. So much like you can insert data, you can actually update data. So again, it's an issue to be aware of. Similarly, you can delete data, which could be a problem if data then needs to get updated. And then if you have the log set at a certain setting on debug one, you could get logical replication. Did not find a row for update in the replication target, so that could be a concern. Now then he covers a few different ways where you can resynchronize data for logical replication. The first he says lock the table for writes on Publisher, dump the data and copy this dump to the subscriber truncate table on subscriber, restore a data dump on subscriber and remove the lock on publisher. Or two, exclude the table from the current publication, truncate the table on the subscriber and create a new publication subscription pair. Now, of course, he talks about a very black magic way of doing it. I'll let you take a look at that if you want to examine that, but definitely some things to be aware of when you're using logical replication. [00:18:27] Now, related to that, there's an article called Logical Replication Upgrade in Postgres. This is from Crunchydata.com and he goes through the process of using logical replication to do an upgrade in postgres and they go through the whole process to be able to do that. Now, they didn't mention sequences here, I'm not sure if that gets brought over in the globals, I'll have to take a look. But it's definitely a process to get logical replication up and then choose to basically switch over to it if you want to do an upgrade. Similar to that is an article from several nines.com talking about how to upgrade PostgreSQL eleven to postgres twelve with zero downtime. And in this case, they're using logical replication again. So if you want to look through these two posts to see how you can potentially use logical replication to do an upgrade, definitely check them out. [00:19:18] The next piece of content is using PostgreSQL to shape and prepare scientific data. So this is data you want to analyze, ideally in postgres. And this describes a process of importing the data into postgres and using postgres to manipulate and massage the data to get it in the exact format and data types that you're going to be using for your analysis. So he has a process to run through doing it all in postgres as opposed to using spreadsheets in order to update data values manually. So if you want to check out a more automated way to prepare your data for scientific analysis, you can check out this blog post from Crunchydata.com. [00:19:57] The next piece of content is the WayPG Store Null Value in Record. So this post goes into some of the internals about how postgres stores nulls and some of the consequences with regard to comms when you delete them in that it can take up some more space because deleted columns are considered nulls for all of those values and that has a storage impact. So if you're interested in learning more about the internals in postgres, you can check out this post from highGo CA. The next piece of content also from highGo CA is Free Space Mapping File and Details. So if you want to learn more about their free space map file in postgres, definitely check out this post. [00:20:38] And the last piece of content is the PostgreSQL person of the Week is Damien Glitchard. If you're interested in learning more about Damien and his contributions to PostgreSQL, definitely check out this blog post that does it. For this episode of Scaling Postgres, you can get links to all the content mentioned in the show notes be sure to head over to Scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or itunes. Thanks.

Show Notes

Episode Transcript

Other Episodes

Episode 363

Top Ten Postgres Dos & Don'ts | Scaling Postgres 363

Episode 198

Monitoring Progress, More SQL, Replication Slot Failover, Postgres Contributors | Scaling Postgres 198

Episode 122

Postgres is Fine, Generated Columns, Postgres Inserts, Synchronous Replication | Scaling Postgres 122

Toast Impact, zheap Storage, More Performance, pg_rewind Changes | Scaling Postgres 137