Postgres Releases, Useless Vacuum, Isolation Differences, WAL Compression | Scaling Postgres 101

Episode 101 February 17, 2020 00:17:46
Scaling Postgres



Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, we discuss new postgres releases, useless vacuuming, isolation differences between databases, and different ways to compress WAL files.

To get the show notes as well as get notified of new episodes, visit: 

https://www.scalingpostgres.com/episodes/101-postgres-releases-useless-vacuum-isolation-differences-wal-compression/


Episode Transcript

[00:00:00] In this episode of Scaling Postgres, we talk about new Postgres releases, useless vacuum, isolation differences and WAL compression. I'm Creston Jamison, and this is Scaling Postgres episode 101.

[00:00:21] Alright, I hope you're having a great week. Our first piece of content is that new PostgreSQL versions have been released. This is from postgresql.org, and 12.2, 11.7, 10.12 and so on, down through 9.6 and 9.4, have been released. Now, this is the end-of-life release for PostgreSQL 9.4, so if you're still on version 9.4 or even earlier, you should definitely upgrade to one of the more recent versions. There was one security issue mentioned, CVE-2020-1720, where ALTER ... DEPENDS ON EXTENSION is missing authorization checks, so that's yet another reason to upgrade, although you could probably take some mitigation steps to reduce the risk. The post also lists all the bug fixes and improvements in these versions, so definitely review it if you want to learn more. There was also a link to the PostgreSQL roadmap, so if you want to know when they're targeting the minor releases, the Developer Roadmap page shows them, as well as a projection of when the next major release, version 13, will arrive.

[00:01:37] The next piece of content is Useless Vacuuming. This is from Robert Haas' blog at rhaas.blogspot.com, and basically he's talking about a situation where vacuum runs and completes its process, but there are still a lot of transaction IDs that haven't been cleaned up, so it appears to be doing nothing, and he describes how to recognize the problem. Of course, the worst case is the database refusing to accept commands in order to avoid wraparound data loss, where you've essentially run out of XIDs, your transaction IDs, or you're getting warnings that a lot of transactions still need to be vacuumed. He says this situation comes from about four different causes, and he goes into depth on XIDs: why they exist, the roughly 4 billion possible transaction IDs because it's a 32-bit integer, and how that works. The four ways an XID can still be in use are: one, the transaction with that XID is still running, so basically you have a long-running transaction, which is generally a no-no for Postgres; you want to keep transactions as short as you can. Two, another transaction that cares about the status of that XID is still running, so again the long-running transaction problem. Three, the transaction with that XID has been prepared using PREPARE TRANSACTION, but neither COMMIT PREPARED nor ROLLBACK PREPARED has been executed yet. This is the two-phase commit case; it's not prepared statements but prepared transactions, and generally this isn't used much in application development at all, only in specialized use cases, so I would hope you're not running into that. And four, there's a replication slot whose xmin or catalog_xmin is that XID. So basically there's a replication slot that's probably orphaned now, the replica that was using it is no longer consuming from it, and the XIDs keep piling up. Now, he gives a few quick queries to check for cases one and two, the long-running transactions, by looking at the age of backend_xid and backend_xmin in pg_stat_activity.
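As a rough illustration of that kind of check for cases one and two, here is a minimal sketch (not the post's exact queries) that shows which sessions are holding back the XID horizon:

    -- Sessions whose open transaction or snapshot is holding back vacuum (a sketch).
    SELECT pid, datname, usename, state,
           age(backend_xid)  AS xid_age,
           age(backend_xmin) AS xmin_age,
           now() - xact_start AS xact_duration
    FROM pg_stat_activity
    WHERE backend_xid IS NOT NULL
       OR backend_xmin IS NOT NULL
    ORDER BY greatest(age(backend_xid), age(backend_xmin)) DESC;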
If you're using prepared transactions, you can check the pg_prepared_xacts view, and you can check the pg_replication_slots view for an essentially orphaned slot or some other problem with replication. He then goes into monitoring: how you could set up queries against these views, decide on a threshold age, and be alerted when it's exceeded. So definitely valuable advice to follow. Lastly, he follows up with how to recover from the situation. For long-running transactions, you'll need to cancel them; if you have prepared transactions, you need to run COMMIT PREPARED or ROLLBACK PREPARED on the ones with a high transaction age; or if the problem is an orphaned slot, you should go ahead and drop that slot. Then he advocates running a verbose vacuum, not necessarily a freeze yet, just VACUUM VERBOSE, to get things under control, and you can move on from there. So, a really great blog post about useless vacuuming.

[00:05:03] The next post is Isolation Repeatable Read in PostgreSQL versus MySQL. He goes over the concept of isolation, how transactions occurring within the database are isolated from one another, and the four levels defined by the SQL standard. There's Read Uncommitted, where a transaction can see the changes of other transactions before they're committed; PostgreSQL doesn't implement this mode. There's Read Committed, where a transaction sees changes from other transactions as soon as they're committed; this is the default in Postgres. There's Repeatable Read: when a transaction reads back a row that has already been read by a previous query, it must see the same values, even if the row has been changed by another transaction that committed in the meantime.

[00:05:50] From what he says, MySQL follows Repeatable Read as its default. And lastly there's Serializable: a transaction cannot see or produce results that could not have occurred if other transactions were not concurrently changing the data. He then goes over some very interesting differences. Number one, Postgres's default is Read Committed whereas MySQL's default is Repeatable Read, and he gives examples of this. But even if you set Repeatable Read in both Postgres and MySQL, you get different results depending on how statements are executed, so the engine implementations differ, and he shows further examples of these differences in a third and a fourth case. So if you're considering moving from MySQL to Postgres and you rely a lot on transactions, this is definitely a valuable post for looking at how Postgres versus MySQL handle these implementation differences. Very interesting blog post to check out. And I should say, this is from the postgresql.verite.pro blog.

[00:07:07] The next post is Compression of PostgreSQL WAL Archives Becoming More Important. This is from Percona.com.

[00:07:15] Now, he's talking about literally compressing the WAL files with a compression tool such as gzip, or I believe he mentions p7zip here as well, not so much the wal_compression setting in PostgreSQL. There is a wal_compression setting you can turn on in Postgres, which I believe compresses the full-page image writes that go into the WAL, and it's definitely advocated to go ahead and set it: you burn a little more CPU in exchange for disk space savings.
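If you want to try that built-in setting, a minimal sketch of turning it on (assuming a superuser connection) looks like this:

    -- Enable compression of full-page images written to WAL (a sketch; test before production).
    ALTER SYSTEM SET wal_compression = on;
    SELECT pg_reload_conf();   -- takes effect without a restart
    SHOW wal_compression;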
For large database systems, I've generally found it's much better to have that enabled and take a very minor CPU hit in exchange for the disk space savings, as well as potential network transfer savings for WAL files on a highly active database. Now, what he's also covering here is compressing the WAL files as part of the archive_command. He says there are ways to do this using pgBackRest as well as WAL-G, but you can also do it yourself just using the gzip utility: if you typically copy the file to another location, you can use gzip to compress it to that location instead. He also covers 7za, because it achieves really high compression about as fast as possible, and he shows where you can get it and how to install it on Ubuntu and CentOS. Then he talks about restoring: he looks at the different ways you can restore with WAL-G and pgBackRest, because you essentially have to unarchive the file before it can be used again, as well as doing the same with gzip and 7za. With some of these compressors, he was able to get a standard 16 megabyte WAL file down to 197 KB, so extremely high compression with 7za. If you want to conserve disk space and maybe make some of your WAL handling more efficient, definitely investigate this blog post to see if you can add some additional compression to how you handle your WAL files.

[00:09:23] The next post is Configuring work_mem in Postgres, and this is from pgMustard.com. It talks about the work_mem setting, which defines the amount of memory each connection can use for query operations such as sorts and hashes. Now, it's not that each connection uses only up to that amount; a single query can use several multiples of work_mem if it performs multiple sort or hash operations, so you need to keep that in mind as you configure it. By default, as he says here, work_mem is four megabytes, and generally it can be boosted depending on how many connections you have to your database and how much memory you have. Ideally, you want as many query operations happening in memory as possible, so that's the reason to increase it, but you don't want to increase it so much that you run out of memory on the system, given how many connections you have and the complexity of their queries. So generally, if you have an online transaction processing database, you probably want to keep work_mem a little lower because you probably have a lot of connections, whereas with more of an online analytical processing load you'll probably have far fewer users and can ramp up work_mem, because presumably the queries are more complex. He quotes Christophe Pettus, for example, who suggests that 16 megabytes is a good starting point for most people, and he goes into ways you can experiment with different work_mem values in your own session. So if you're interested in that, it's a relatively short blog post to check out.
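To give a flavor of that kind of per-session experiment, here is a minimal sketch; the 64MB value and the orders table are just examples, not recommendations from the post:

    -- Raise work_mem for this session only and check whether sorts and hashes stay in memory.
    SET work_mem = '64MB';
    EXPLAIN (ANALYZE, BUFFERS)
      SELECT customer_id, count(*)
      FROM orders                  -- hypothetical table
      GROUP BY customer_id
      ORDER BY count(*) DESC;
    -- In the plan, look for "Sort Method: quicksort" rather than "external merge Disk".
    RESET work_mem;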
The next post is Quickly Load CSVs Into PostgreSQL Using Python and Pandas. Basically it's showing a quick way to load a CSV into Postgres using Python and the Pandas library. It looks like Pandas mostly makes it easier to create the table, and you can also use it to load the data itself, but he has some options further down for working with larger data sets. Option two is the most interesting to me, because it basically just creates the table, inspecting the data set to infer a schema, and then relies on the COPY command, which is the most efficient way to load data into Postgres, to actually load the data into the table that was generated. So if you use Python, perhaps this is a blog post you'd like to check out.

[00:11:49] The next post is Why Dropping a Column Does Not Reclaim Disk Space (or Better, Why Is It So Fast?). This is from Luca Ferrari at fluca1978.github.io, and he's talking about how, when you drop a column from a table, the space isn't immediately reclaimed. He has an example where he creates a one-column table with 346 megabytes of data in it. He then adds a new column with a large default, and the table grows to almost double the size, 651 megabytes. He looks at some attributes in the system catalogs to check the status of the columns, then drops the column, and it happens super fast, in about 20 milliseconds. So all that data clearly wasn't deleted; essentially Postgres just dropped the reference to it, so you can no longer reference the column in queries. He checks the size and it's essentially unchanged, and if he looks at the attributes he can see a pg.dropped marker for that column, so as far as the system catalogs are concerned it's invisible at this point. Then how do you get the space back? You run a VACUUM FULL, and that brings the table back down to its original size, but if you look at the catalogs you can still see the dropped column's entry, so it looks like it essentially never goes away. He says the only probable drawback is that dropped attributes still count as normal ones toward the column limit of a table. I would think it would be quite rare to hit that, but it's an interesting look at how Postgres works, so if you want to learn more, definitely a blog post to check out.

[00:13:28] The next post is Migrating from MS SQL to PostgreSQL: Uppercase vs. Lowercase. So this is going from Microsoft SQL Server to PostgreSQL, and apparently the SQL Server schema in question has all caps for its table names and column names, which is typically not done in Postgres, but you can use a couple of queries to convert your schema. For example, he uses a query against the system catalogs to build an ALTER TABLE ... RENAME TO statement for each table, renaming it to the lowercase version of its name, and then uses psql's \gexec command, which takes the query you just built and runs each generated statement, once per table. He does mention that this first implementation has an SQL injection issue, and that using quote_ident avoids it, so you'd definitely want to use that version when working with these types of queries. That handles the tables, and then he shows how to handle the columns, giving you a query that goes through each column and renames it to lowercase. So if you have a use case for this, it's a very interesting blog post.
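To give a flavor of that \gexec pattern for the table renames, here is a minimal sketch; it is not the post's exact statement, it assumes the tables live in the public schema, and it uses format() with %I, which quotes identifiers much like quote_ident does:

    -- In psql: build one ALTER TABLE per mixed-case table, then execute each generated statement.
    SELECT format('ALTER TABLE %I RENAME TO %I', tablename, lower(tablename))
    FROM pg_tables
    WHERE schemaname = 'public'
      AND tablename <> lower(tablename) \gexec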
[00:14:49] The next post is How to Automate PostgreSQL 12 Replication and Failover with repmgr, Part 2. This is the second part; the first part is linked here, and it covers a three-node cluster, one primary and two standby PostgreSQL servers, plus a witness node elsewhere to handle the election of which node is primary. This post goes over setting up replication and failover, and it goes into a lot of depth on doing that with repmgr. So if you're interested in setting up clusters this way and you want to investigate repmgr, definitely a blog post to check out. Related to that, there's another blog post covering repmgr: Failover and Recovery with repmgr in PostgreSQL 11. The first one was for version 12; in this one I guess they're still on 11, and it goes through the process of how they set up repmgr for their use case. So, two blog posts to check out if you're interested in using repmgr for cluster management.

[00:15:58] The next post is Creating a PostgreSQL Procedural Language, Part 2: Embedding Julia, and this is from 2ndQuadrant.com. We covered part one previously, where the initial setup was done to build it as an extension, and this part is about actually embedding Julia, because Julia provides an API so that Julia functions can be called from C. They're going to leverage this to execute Julia code from the defined function or stored procedure. He has an example here setting up the next phase of it. So again, it's quite short, but it's the second part of creating an extension that provides a Julia procedural language.

[00:16:40] The next post is Can PgBouncer Survive Transient Network Events? This is from EnterpriseDB.com, and it's the fifth part of their series on PgBouncer handling network outages, again using their EnterpriseDB IP manager to do it. This one covers a lot of the conclusions from what they covered, so if you've been interested in this series, here is the last post.

[00:17:06] And the last piece of content is How to Map a PostgreSQL Array to a Java List with JPA and Hibernate. So if you have an interest in mapping arrays to a Java list, definitely a blog post to check out.

[00:17:23] That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.

Other Episodes

Episode 251

February 05, 2023 00:16:56

Filter Clause, Hypothetical Index, Non-Relational Data, Using TOAST | Scaling Postgres 251

In this episode of Scaling Postgres, we discuss how to use the filter clause, create hypothetical indexes, store non-relational data and manage TOAST. To...

Listen

Episode 176

August 01, 2021 00:18:51

JSONB vs. Window Functions, Security Checklist, Full-Text Search, NFL Time-Series | Scaling Postgres 176

In this episode of Scaling Postgres, we discuss JSONB vs. window functions, a Postgres security checklist, implementing full-text search and using NFL timeseries data....

Listen

Episode 203

February 20, 2022 00:12:11

Saving Space, Working with TOAST, Protecting Data, High CPU Utilization | Scaling Postgres 203

In this episode of Scaling Postgres, we discuss the new releases of Postgres, a hairy upgrade incident, why slowing down can make you go...

Listen