Insert vs. Update, Select * Performance, Debezium Set Up, Standardizing Data

[00:00:00] In this episode of Scaling Postgres, we talk about insert versus update, select star performance, debesium setup and standardizing data. I'm Kristen Jameson, and this is scaling postgres episode 142. You alright? I hope you, your friends, family and coworkers continue to do well. Our first piece of content is insert only data modeling to smooth peaks on slow disks. This is from Cybertechgresql.com. They're talking about a situation they ran into a client where they had a database server that essentially had slow disks and wasn't that performant, but they had a lot of updates that needed to happen within a particular period of time. And with updates, essentially, you create a new row, and then you have another row that needs to be vacuumed up, left around. And they discussed moving to an insert only model so that they don't have more random I O with updates and also don't have to deal with the vacuum consequences of doing that. And they talk about the benefits of choosing to do more inserts versus updates. So, for example, you get more sequential I O patterns when you're doing inserts. You get a lot fewer full page writes after a checkpointing process, so smaller wall is being generated. You get faster single transactions compared to an update. You also get the action history for free, essentially, because it's all there. And the other thing not mentioned is that essentially you don't have locking, usually an insert, you don't have to do a lock, whereas an update, you always have to. And the thing he noted about his example is that they weren't using fast modern SSD disks, which would be easier to do these updates in the random I O four, and also when the active data set does not fill well into the shared buffers. Now, I mentioned the downsides of doing this is that you do need a more complex schema because now you're inserting all the data changes into a table and eventually you're going to want to update or backfill your main table with the data values. You get a bigger total disk footprint because again, you're storing more data. Every update has its own row. Now, you need to schedule some maintenance work over non peak hours, basically, to update that table. And you can't benefit from heaponly Tuple updates. Then he goes through his testing process of where he was testing out 100 million rows and did some random updates of data coming in into those 100 million rows. And he shows his test system here. Four virtual CPUs, eight gigs of Ram, and he's using just hard disk drives, so not SATA drives, so that's going to impact things even more. So but you can see for doing the update method, it took three milliseconds, whereas the insert only method just took one millisecond, which constitutes about a 70% performance improvement. And even the standard deviation is wider. Eleven milliseconds for updates and six milliseconds for the insert only queries, which constitutes a 50% reduction in the standard deviation. So definitely the inserts went out over updates. Now, I see a fair number of applications where developers are essentially needing to keep a counter of something so that constantly gets updated whenever any data changes. But you need to be very careful of that because if that gets updated, say 1000 times a day, well you've generated 1000 duplicates of that row when you do an update because it does an insert and then it flags that old row that used to exist for eventual vacuum. So essentially however many times you're updating it, you're duplicating that full row however many times and then all those rows have to be vacuumed up. So sometimes it is better and more performant to do inserts of all that new data coming in and then periodically rolling an update into the main table that you're interested in and updating it. But definitely an interesting blog post if you want to compare the difference between using an insert only method to maintain your data versus updates. [00:03:55] The next piece of content is reasons why select Star or Asterisks is bad for SQL performance. This is from Tanopoter.com. Now this is not a postgres specific post. Most of the examples here are based upon Oracle, but he's talking about the reasons why you want to try to avoid select Asterisks or select star if you can. And he even goes into very specific reasons looking at bytes processed and a lot of analysis within Oracle itself on why you should be actually selecting the columns that you're interested in. But the areas he covers are it generates increased network traffic because of course, however many rows you're pulling over, if you're pulling over every single column, that's going to be a lot larger than if you only were choosing specific columns to go from the server to the client. You get increased CPU usage on the client because it has to accept all of this data and process through it all you have some query plan optimizations are not possible. You're also going to increase a server side memory usage. You're going to of course increase the CPU usage on the server side because again, more columns that you have to deal with. Now, hard parsing and optimization takes more time. This may be Oracle specific and cache cursors take more memory in the shared pool. Well, that definitely takes more memory in the shared buffers in postgres. So it's an interesting post describing all the reasons why you really should target the number of columns you want to get the best performance out of your relational database systems. So if that is of interest to you, you can check out this blog. Post next piece of content is setting up PostgreSQL for Debesium. This is from Elephanttamer Net and he goes through the process of setting up Debesium which is a change data capture solution for postgres. So basically you can use Postgres logical decoding to read the wall stream and use debesium to grab updates that are happening and he goes through the process. So first you want to make sure your wall level is set to logical. You want to pick a specific debesium version and he's advocating because of a pretty bad bug to do something greater than 1.3.0. Be sure to check your database logs to make sure that it is connecting successfully. He advocates creating the slot and publication manually as opposed to debesium doing it because he's had some issues where it could silently block all rights to the tables where you don't have a primary key because the publication defaults to essentially replicating everything. And he goes through some of the configuration to debesium and then of course advocates monitoring closely because you don't want to stop replicating from the slot because now you're going to have wall buildup on the primary. So you're going to want to make sure that that logical slot is being consumed on a regular basis by debesium and make sure it doesn't fail or it restarts if it does fail. So definitely a good blog post about how to handle debesium on PostgreSQL. [00:06:46] The next piece of content is using Postgres for Statistics Centering and Standardizing data. Now this is a follow up from a number of posts that were done about data [email protected] and here he is talking about generating the z score for a particular set of data. And this is a means of standardizing or normalizing analysis of the data. And he was wondering how to come up with this calculation for each piece of data. And at first he thought about using a generated column so that right next to the column of interest, you could generate this zscore. But the problem is you need to be taking a mean and the standard deviation of each one and you can't really do that from data outside of the row. And then he covers a number of different options he came up with, including transform columns into the original table, creating a whole new table for zscores, creating a view the way that he seemed to like was creating a new table just for the aggregates that are being generated and then use a trigger on the original table to update the averages and standard deviations that exist in the separate table. So then doing the zscore calculation would be pretty fast. But he ultimately chose another path where he did all the calculations in a new column on the table. So if you want to learn about different techniques to set up a zscore for analysis of your data, maybe you want to check out this post from Crunchydata.com. [00:08:09] The next piece of content is Postgres the batteries included database. This is from Crunchydata.com and he's talking about the idea that everything essentially is included in postgres. So the batteries are included and he's covering these in a few different areas. You have a solid database foundation, meaning it has MVCC. At its core, it has a variety of data types such as range types and timestamps and JSON B columns to be able to handle any sort of data you can throw at it. It uses expressive SQL and has support for things like common table expressions and Windows functions and generating your own functions. It also has full text search bundled into it, that is, as he phrases here from another post, it's just good enough. Then of course, it's extensive geospatial support when you're using it with PostGIS, which is an extension to handle that. Then lastly, he covers all the different extensions that help enhance the PostgreSQL database, from the frequently used PG Stats statements to track statements that are running in your database, to PG partman for doing partition tables, foreign data wrappers, a data science one called Madlib as well as others. So basically this is a quick post of all the different features that make Postgres A batteries included database. So check this out if you're interested. [00:09:29] The next piece of content is PostgreSQL high Availability setup and uptime considerations. This is from Higo CA. Now, they're not talking about specific High Availability solutions, but kind of what are the components of High Availability? Number one, you want to have a way to set up a quorum and avoiding a split brain situation when you have a cluster of multiple PostgreSQL servers, because one's going to be the primary and it's going to have to potentially decide which other replicate takes over in case there's a problem with a primary or loses connection. So you're going to want to be able to define a quorum size, generally three or more end being an odd number, as well as the process to promote a standby. And in connection with promoting that standby, if the primary comes back, you need to be able to handle that situation. So that is stoneth or smith and fencing. So Stoneth is shoot the other node in the head, which means to bring down another instance. Or Smith shoot myself in the head, meaning that you are the primary and you determine there's another one that exists, so you should not continue any longer. And fencing is a way to keep the old primary out of the cluster and prevent it from rejoining. He then talks about indirection, whereas how you connect to the primary, you need to be able to flip that easily. That could be domain name reassignment to a new IP. You could use Load Balancers to do it, or assigning an old primary as virtual IP to the new primary's database system. And also talking about configuring Postgres SQL so that you don't have to do configuration after a switch over. In other words, you want all the parameters to be up and ready to go on the replica so that the switching over is not going to require any sort of reconfiguration or restart of the database. So if you want to learn more about the components of high availability, you can check out this blog post. [00:11:20] The next piece of content is making postgres stored procedures nine x faster in Cytus. This is from Citusdata.com and they're talking about a new enhancement they added to Cytus, which is an extension for postgres that allows you to do scale out of your postgres solution. Now, they already had the concept of distributed tables where you can scale out your tables to multiple nodes. Now they support distributed functions, so you can dedicate a function to run on a particular node. And with that, with a benchmarking workload called a HammerDB, they were able to get a nine times performance using these distributed functions within Citus. So if this is of interest to you, you can check out this post from Citusdata.com. [00:12:06] The next piece of content is combining PG Backrest dedicated repository host and streaming replication. This is from Pgstep GitHub IO and he describes a process of setting up a dedicated backup repository for your backups using PG Backrest. So if you're interested in that and setting it up, definitely check out this blog post. [00:12:26] Next piece of content is that PG Bouncer 1150 is released. This prompt primarily contains enhancements and bug fixes, including more secure authentication, failure reporting and less log spamming when monitoring systems. So if you use PG Bouncer, maybe you want to consider upgrading to this release. Next piece of content is waiting for PostGIS three one grid generators. So this is a new feature coming in 3.1, so if you're interested in doing grid generation, you can check out this blog post from Crunchydata.com. [00:12:58] And the last piece of content is the PostgreSQL person of the Week is Pablo Stele. So if you're interested in Pavel and his contributions to postgres, definitely check out this blog post that does it. For this episode of Scaling Postgres, you can get links to all the content mentioned in the show Notes. Be sure to head over to Scalingposgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube ride tunes. Thanks.

Show Notes

Episode Transcript

Other Episodes

Episode 120

Postgres 13 Beta 2, System Stats Extension, Tuned, Join Strategy | Scaling Postgres 120

Episode 168

Managing Autovacuum, Better JSON, Avoiding Updates, OS Tuning | Scaling Postgres 168

Episode 12

The Future, Slow Counts, Avoid Or, Replication | Scaling Postgres 12

Insert vs. Update, Select * Performance, Debezium Set Up, Standardizing Data | Scaling Postgres 142