Saving Space, Working with TOAST, Protecting Data, High CPU Utilization | Scaling Postgres 203

Episode 203 February 20, 2022 00:12:11


Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, we discuss how much disk space you can save with int4 versus int8, working with TOAST, protecting data from application users, and finding the root cause of high CPU utilization.

To get the show notes as well as get notified of new episodes, visit:

https://www.scalingpostgres.com/episodes/203-saving-space-working-with-toast-protecting-data-high-cpu-utilization/

 


Episode Transcript

[00:00:00] In this episode of Scaling Postgres, we talk about saving space, working with TOAST, protecting data, and high CPU utilization. I'm Creston Jamison, and this is Scaling Postgres episode 203.

[00:00:23] Alright, I hope you, your friends, family and coworkers continue to do well. Our first piece of content is "How much disk space can you save by using int4 instead of int8?" This is from depesz.com, and he's talking about threads where people debate whether you should use an int4, a basic integer, or just use an int8, a bigint, because the space difference doesn't matter that much. So he actually ran a test: he created a table with just an int8 column and another table with just an int4 column, inserted a million rows into each, and then checked the sizes, and the sizes were identical. Now, he did mention that this is on a 64-bit system. He ran the same test on a 32-bit system, and there the int4 table is a bit smaller. Why is that? It's due to data alignment. On a 64-bit system Postgres aligns data to eight bytes, which makes sense because eight bytes is 64 bits on a 64-bit processor. So essentially each group of column data is aligned to an 8-byte boundary. What that means is an int4 only uses four bytes of that segment, and the other four bytes sit unused as padding. But if you have two int4s in a row, like he did in this experiment, and load the table, you get the exact same size as using one int4 or one int8. If you do an int4 followed by an int8, the int8 spills over to the next 8-byte grouping and you get a larger table, 42 megabytes in his test. Similarly, using int2s or booleans, you're still going to get the exact same size as an int8-only table, because on a 64-bit system the alignment is eight bytes. You also have a number of hidden system columns that keep the table managed and handle data visibility, columns you don't normally see such as tableoid, cmax, xmax, cmin, xmin, and ctid, and each of these has a data cost per row. Now, a lot of people want to leverage this alignment to pack their data as compactly as possible. The problem is that as soon as you add or remove a column, you can break that carefully planned alignment. So I don't think this is a great thing to rely on for, say, OLTP use cases or typical operational data stores. But if you have an analytical database, data mart, or data warehouse, taking this alignment into account and ordering your booleans, integers, and other data types appropriately could save a fair amount of space. He was using ints here, but the same thing applies to floats. If you want to learn more about that, you can definitely check out this blog post.
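For reference, a minimal sketch of that kind of experiment might look like the following; the table names are made up for this example, and exact sizes will vary by platform and Postgres version.

```sql
-- Reproduce the alignment comparison: identical row counts, different column layouts.
CREATE TABLE t_int8      (v int8);
CREATE TABLE t_int4      (v int4);
CREATE TABLE t_int4_int4 (a int4, b int4);
CREATE TABLE t_int4_int8 (a int4, b int8);

INSERT INTO t_int8      SELECT g    FROM generate_series(1, 1000000) g;
INSERT INTO t_int4      SELECT g    FROM generate_series(1, 1000000) g;
INSERT INTO t_int4_int4 SELECT g, g FROM generate_series(1, 1000000) g;
INSERT INTO t_int4_int8 SELECT g, g FROM generate_series(1, 1000000) g;

-- On a 64-bit build, the first three tables should come out the same size,
-- while int4 followed by int8 is larger because the int8 value is pushed
-- to the next 8-byte alignment boundary.
SELECT relname, pg_size_pretty(pg_relation_size(oid)) AS size
FROM pg_class
WHERE relname IN ('t_int8', 't_int4', 't_int4_int4', 't_int4_int8')
ORDER BY relname;
```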
[00:03:11] The next piece of content is "PostgreSQL TOAST compression and toast_tuple_target". This is from ayrat.info, and he's talking about TOAST, the oversized-attribute storage technique. Basically, when a row would exceed the default block size of 8 KB, Postgres stores the oversized data in the TOAST table, and the row in the heap keeps a pointer to it. He talks a little bit about the TOAST structure, but also about the role that compression plays. Sometimes Postgres compresses a value and keeps it in the heap, sometimes it moves it to the TOAST table without compression, and sometimes it compresses it there. There are some parameters you can adjust. For example, there's the TOAST tuple threshold, above which Postgres tries to compress columns to fit the row on the block, and there's toast_tuple_target, which determines what spills over into the TOAST table. By default both are around 2 KB. Then there are per-column storage settings that determine whether a column can be stored in the TOAST table at all. The most common one you'll see is plain, where the column is stored in the heap only and without compression; you'll typically see this for integers, floats, and similar fixed-length types. For text or string data you'll typically see extended, which means the column can be compressed and stored in the TOAST table if necessary.

Now, he did some examination of this with JSONB, looking at the effect on performance, and he's not a fan of using JSONB in general, only for sparse attribute data where values tend to be null. For example, if you have a set of products with a lot of different feature flags or attributes, and many of them are null for any particular product, that could be a good use case for JSONB, where it's just flagging something. But in general there are a lot of disadvantages he mentions. You can't do any type checking, you can't have integrity constraints, and you don't have foreign keys. Writing queries becomes a bit more complex, and you don't have statistics on the keys of a JSON field. You also lose storage efficiency, because now you're storing the keys in each row as well, not just the values. And you can't partially update a JSONB value: if it's stored in the TOAST table, Postgres has to de-TOAST it, make the change, and re-TOAST it, which has a performance impact.

He looked in particular at aggregation performance. With separate columns, taking an average over each of four columns for his data set took around two minutes and 50 seconds. Then he placed that data in a JSONB column and ran the same aggregation against it, and the query took about eight minutes and 40 seconds to execute, so several times slower. So there is clearly a performance hit when you're using JSONB. What's interesting is that it apparently has to decode the JSONB for each aggregation, which is a definite burden, and he goes into an analysis of why that difference exists. He did mention one way you could potentially speed things up a bit: using the new LZ4 compression available in Postgres 14 to get data in and out of TOAST faster. But if you want to learn more, you can definitely check out this blog post.
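If you want to experiment with those knobs, here is a hedged sketch of the relevant commands; the products table and its columns are hypothetical, and the LZ4 option requires Postgres 14 or later built with LZ4 support.

```sql
-- A hypothetical table with a TOAST-able JSONB column.
CREATE TABLE products (
    id   bigint PRIMARY KEY,
    spec jsonb
);

-- Inspect each column's storage mode:
-- 'p' = plain, 'x' = extended, 'e' = external, 'm' = main.
SELECT attname, attstorage
FROM pg_attribute
WHERE attrelid = 'products'::regclass
  AND attnum > 0
  AND NOT attisdropped;

-- Let larger rows stay in the heap before spilling to TOAST
-- (the default target is roughly 2 KB).
ALTER TABLE products SET (toast_tuple_target = 4096);

-- Postgres 14+: compress this column with LZ4 instead of the default pglz.
ALTER TABLE products ALTER COLUMN spec SET COMPRESSION lz4;

-- Or store the column out of line without compression at all.
ALTER TABLE products ALTER COLUMN spec SET STORAGE EXTERNAL;
```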
[00:06:32] The next piece of content is "Safer Application Users in Postgres". This is from Crunchydata.com, and they're talking about a situation where someone sent them a message that said, quote, "we deleted our database". [00:06:44] So basically: help, and help them do a restore. What this article is talking about is that you should probably configure your database so that your application users cannot delete data: they can't drop tables, they can't truncate tables, and, as they cover later, maybe they can't even run deletes. The way they propose setting this up is that you create an administrative user that owns the database or the schema, and then you create an application user. With your administrative user, you create a schema, which they called prod here, and grant usage on that schema to the myapp user. Then they created a table with the admin user and, with the myapp user, tried to delete it and weren't able to. What they did grant is SELECT, INSERT, and UPDATE on all tables in the schema prod to the myapp user. You could also choose to give it CREATE TABLE or ALTER TABLE access, because a lot of application frameworks rely on an application user to make changes to the schema, or you might create a separate user just for those schema migrations, because outside of a schema change an application normally only needs SELECT, INSERT, and UPDATE. In terms of deletion, the way they address it is that you never actually delete data with the application user; you mark it for deletion instead. For example, you add a deleted timestamp column to essentially all of your tables, and when you want to delete something, you just update that timestamp to the current time, flagging the row as essentially deleted. Then in your SELECT queries you only show data where deleted is null. So that's a way to handle deletions without actually deleting the data and without giving that permission to the application user. It's up to you whether you want to run your database this way, but it is definitely a more protective way to run it. If you want to learn more about the process they set up for doing this, you can definitely check out this blog post.
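As a rough sketch of that kind of setup (the role, schema, table, and column names here are illustrative, not taken from the article):

```sql
-- Administrative role owns the schema; the application role only gets usage.
CREATE ROLE app_admin LOGIN PASSWORD 'change-me';
CREATE ROLE myapp     LOGIN PASSWORD 'change-me';

CREATE SCHEMA prod AUTHORIZATION app_admin;
GRANT USAGE ON SCHEMA prod TO myapp;

-- Tables are created by the admin role; the app gets DML without DELETE,
-- TRUNCATE, or DROP.
SET ROLE app_admin;
CREATE TABLE prod.orders (
    id         bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    total      numeric NOT NULL,
    deleted_at timestamptz           -- soft-delete marker instead of DELETE
);
GRANT SELECT, INSERT, UPDATE ON ALL TABLES IN SCHEMA prod TO myapp;
RESET ROLE;

-- The application "deletes" a row by stamping it ...
UPDATE prod.orders SET deleted_at = now() WHERE id = 42;
-- ... and every read filters soft-deleted rows out.
SELECT * FROM prod.orders WHERE deleted_at IS NULL;
```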
[00:08:48] The next piece of content is "5mins of Postgres E5: Finding the root cause of high CPU utilization in Postgres with EXPLAIN and the Linux perf command". This is from pganalyze.com, and they do a review of the "hairy incident" post that we covered in last week's Scaling Postgres episode. They look at it in more depth, talk about a few other blog posts that were linked to it and related to it, and discuss the situation in general. I actually think it's a good second viewpoint on this incident and what happened, and I encourage you to check out this episode as well.

[00:09:25] The next piece of content is "The significance of LSN in PostgreSQL System". This is from highgo.ca, and they're talking about the log sequence number as it relates to the WAL, in terms of identifying exactly where you are in the WAL. He discusses that the reason he was examining this was potentially looking at a shared storage solution for Postgres, so it's an interesting use case, but if you're interested you can check out this blog post.

The next piece of content is "SQLcl to transfer data from Oracle to PostgreSQL or YugabyteDB". This is from dev.to in the YugabyteDB section. This is talking about a very simple way to transfer data from Oracle to Postgres: SQLcl is basically a command-line way to export data to a CSV format. So he does that, but then uses an awk command to make a few changes for Postgres compatibility and then loads the data into Postgres. So if you're looking for a quick and dirty way to do that, you can definitely check out this blog post.

[00:10:28] The next piece of content is "Queries in PostgreSQL: Query execution stages". This is from Postgrespro.com, and it's part of a series of blog posts on the internals of PostgreSQL. This particular post covers query execution in terms of parsing, transformation, planning, and execution, as well as the extended query protocol that allows preparation and parameter binding. So if you want to learn more about the internals of Postgres, definitely check out this blog post.

The next piece of content is "pg_timetable version 4.4 is available immediately". This is from Cybertec-postgresql.com. This is a scheduling tool for Postgres that operates outside of Postgres but uses a Postgres database to keep track of the scheduled jobs. So if you want to learn more about the updated tool, definitely check out this blog post.

[00:11:20] The next piece of content: the PostgreSQL Person of the Week is Henrietta Dombrovskaya. So if you're interested in learning more about Henrietta and her contributions to Postgres, definitely check out this blog post.

The last piece of content: we had another episode of the Rubber Duck Dev Show this past Wednesday. This episode was on how to get started coding. So if you haven't started your development journey yet and are trying to decide where to start, maybe you'd like to check out our show.

[00:11:47] That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to Scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.

Other Episodes

Episode 121

July 06, 2020 00:16:33

Fantastic Data Types, Domain Data Types, Insert Performance, Row Differences | Scaling Postgres 121

In this episode of Scaling Postgres, we discuss fantastic data types, custom domain data types, improving insert performance and how to query row differences....

Episode 72

July 15, 2019 00:17:10

Collation Issues, Fast Loading, Recovery, SCRAM Authentication | Scaling Postgres 72

In this episode of Scaling Postgres, we discuss collation issues, how to load data quickly, recovery changes in Postgres 12 and moving to SCRAM...

Episode 192

November 23, 2021 00:12:39

Foreign Key Indexes, Graph Queries, Linux Huge Pages, Text Column Size | Scaling Postgres 192

In this episode of Scaling Postgres, we discuss if foreign keys should have indexes, how to run graph queries, how to configure Linux huge...
