Insert-Only Vacuum, Settings, Sequence Replication, Improvements | Scaling Postgres 111

Episode 111 April 26, 2020 00:16:48
Insert-Only Vacuum, Settings, Sequence Replication, Improvements | Scaling Postgres 111
Scaling Postgres
Insert-Only Vacuum, Settings, Sequence Replication, Improvements | Scaling Postgres 111

Apr 26 2020 | 00:16:48

/

Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, we discuss insert-only vacuums, the Postgres settings hierarchy, sequence replication and desired improvements.

To get the show notes as well as get notified of new episodes, visit: 

https://www.scalingpostgres.com/episodes/111-insert-only-vacuum-settings-sequence-replication-improvements/

View Full Transcript

Episode Transcript

[00:00:00] In this episode of Scaling Postgres, we talk about insertonly vacuum settings, sequence replication, and improvements. I'm Kristen Jameson, and this is scaling postgres episode 111. [00:00:21] All right, I hope you, your family and coworkers are doing well. And for this week, our first piece of content is PostgreSQL version 13 new feature tuning Auto Vacuum on Insert only Tables. Now, you may be thinking why do you need to vacuum insert only tables? Because generally vacuum is used to vacuum up dead rows that have been the result of rows being deleted or updated. So Postgres actually maintains the old rows for a time to make sure that they are visible for any transactions for other users until they're no longer visible throughout the entire system and then they are vacuumed up later. So basically an update results in a new row being added and the old one being retained and then vacuumed later. So why would you need to vacuum insert only tables? And basically the two reasons are actually mentioned there in the little graph here is that because of the transaction ID age that has a limit and vacuum resets, that essentially freezing rows, allowing it to be available and also updating the visibility map. So this is important for insert only tables. So that's essentially what this blog post runs through is this new feature that's being added to 13 that could be advantageous for insert only or essentially append only tables. So they mentioned it adds a new column in insert since vacuum. So essentially number of inserts since vacuum column of the PG stat all tables. And you can adjust an auto vacuum insert threshold and an auto vacuum insert scale factor so you can make it customized. And apparently they even have one for toast tables as well. Next they talk about use case one which I mentioned is to handle a transaction ID wraparound. In any active transactions, there's 2 billion that are allowed. So once you hit that limit, you need to freeze those transactions so that more become available. Now, what normally happens is when the TXID reaches 200 million, it then triggers a special vacuum to handle it. The problem that you can run into now with say, insert only tables is that vacuum doesn't get done in time, which means you need to take the table offline and do a manual vacuum. And also a mandatory auto vacuum will not give up when it blocks a concurrent transaction. So such a blocked operation will block all other access to the table and process and comes to a standstill. And they give you some options of some configuration changes you can set or recommendations when using this feature in version 13. The second case they mention is index only scans. And this goes to the visibility map, which the index only scan needs to consult to determine whether it needs to go to check the heap tables or not. So if you want more index only scans, that don't have to check heap tables, you want to make sure the visibility map stays updated. Essentially this new feature will allow you to do that for more appendonly tables. Now they do say that if a table receives enough updates or deletes so it's not strictly append only, you can set the auto vacuum scale factor relatively low, but with an insert only table that's a bit harder to do. The last case they mentioned is a hint bits on insert only tables. So basically the first query that reads a newly created row has to consult the commit log to figure out if the transaction that created the row was committed or not. Now, if a lot of rows were recently inserted in a table, that can cause performance hit for the first reader. Subsequent readers won't have to check the commit log because they can just look at the hint bit. Therefore, it is considered good practice to vacuum a table after you insert a copy a lot of rows into it. And essentially with this new feature they say PostgreSQL automatically vacuums these insert only tables after large inserts so you don't have to necessarily worry about it. So definitely interesting new feature and if you want to learn more about it, definitely check out this blog post. Now I should add, this is from CyberTech postgresql.com. [00:04:14] The next post where do my postgres settings come from? And this is from Mydbainok.org and they're talking about all the different places where you can make settings changes in postgres. And they have this convenient graph here and it is in a hierarchy so the top take precedence over those lower. So at the bottom you have the default settings. Up above that the start configuration for the postmaster. Up above that is the PostgreSQL comp file. Above that is the PostgreSQL AutoComp file that you can change with an Alter system command. Up above that for each database has configurations that you can set through an Alter database. The roles have their own settings. Through an Alter role, the sessions have their own settings where you just set and then make a particular setting change. And then lastly transactions you can set local within a particular transaction. So this is a very convenient graph to kind of understand where different settings can be made and their hierarchy. You can also consult the PG settings table and it tells you kind of the source of that settings that exists. Then finally he presents this decision matrix in order to try to determine where a particular setting is coming from. So if you're interested in that, definitely check out this blog post. [00:05:29] The next piece of content is can sequence relation be logically replicated? And this is from Heigo CA and they're talking about logical replication where sequences don't get replicated over. So the data tables get replicated over but not the sequences. Now first a address is what is a sequence. So basically when you create a serial column or you have a primary key that has a sequence associated with it. It's a counter that determines what the next new value is for that primary key. That is an integer or a big int or something of that nature. And these sequence values are not replicated. And it says why they don't do it is actually because of how they save things to the wall. So the wall logs sequences every 32 increments. So for example, if you're incrementing by one and you are at 500, what it's going to log to? The wall is 532 and then it's going to wait until 32 more increments pass before it logs again. And again it logs 32 increments ahead. Now, the reason it does this is because sequences are not guaranteed to have no gaps. For example, you can see this if you do an insert and have an error into a table, you can see that next time you have a successful insert, you will have a gap. You won't have those failed inserts. And the reason that they do this is for performance, so you don't have to write to the wall as much. So they say in theory this could be logically replicated and the fact that gaps are okay, having a far forward sequence should be fine. So they actually went through and looked at all the different code here and investigated how to do it. So I'm not going to go into the detail here, of course, but if you want to check this out, it shows you how this potentially could be added. And apparently they may be looking to do an update for Postgres on this. So if this is of interest, definitely check out this blog post. [00:07:23] The next piece of content is Seven Things that could be improved in PostgreSQL. Now this is inspired by the article that we've covered in a previous episode of Scaling Postgres called Ten Things I Hate About PostgreSQL. And it's basically mentioning proposed improvements. So this mentions seven things that could be improved. The first one mentioned is rigid query planning. Now, he's not necessarily mentioning the absence of query hints that were mentioned in the previous blog post, but this is talking about having a more adaptive planner, because basically the planner just looks at the query, the table structure and the statistics that exist, but doesn't make any determination based upon previous query runs in terms of deciding how to do that query. So he's saying having that type of knowledge could be beneficial to help adapt the plans the query planner is coming up with. So that definitely seems interesting. Number two is some levels of automatic tuning. Like for example, postgres mostly does not look at the operating system at all because it runs on so many different ones. It tries to remain operating system agnostic. But there could be benefits of setting some things based upon looking at, okay, what is the memory that exists or the disk space that exists and things that are logged to the logs, could they perhaps make automatic tuning changes? So that sounds interesting on the surface. I'd be a little bit wary of it and would want of course, be able to turn any automatic tuning off, but definitely an interesting proposal. Number three is lost statistics during a PG upgrade. So when you do a PG upgrade, you basically have to run statistics again to repopulate them. And since a lot of the data doesn't really change from an upgrade, it would be advantageous if we could retain those somehow. So I would say this is advantageous, but it doesn't really take even for large databases a long time for just to run some statistics. Number four is the auto vacuum and transaction ID wraparound. Big, huge issue. Definitely agree that making this easier to deal with would definitely be a great improvement. Now things like the Zheep project that they're working on, where they're actually having a separate area and not just inserting a row with an update and then having to vacuum it up later, basically writing to a separate storage structure so that you could handle rollbacks or concurrent visibility on the state of a particular row. Definitely looking forward to seeing what this looks like once it's complete. Number five is disk footprint in Bloat. So basically look for better ways to optimize the storage of data on disk. Now they're saying Zheep could help contribute to this because you'd have less Bloat. They're also mentioning Z store that could be beneficial for ultra large data warehouses. There's also the C store foreign data wrapper. So that's another potential to look at. And this could be advantageous, but again, you can store it on a more compressed file storage system if you so choose to. So that is another way to kind of get this benefit. Number six is out of the box vulnerability against brute force password attacks. So here they mentioned that there's no built in capability to do this, but you can add the Auth Delay Contrib module to do it. So just be aware of that and add it to avoid this potential problem. And seven, no query cache. Now, he does mention this is incredibly hard to do because anything dealing with Caching is hard to do. But he said it generally would be beneficial if you had insight to the fact that no data has changed to be able to return the same query results for the same input. That could definitely be an advantage. So if you're interested in investigating these a bit more, check out this blog post from CyberTech postgresql.com. [00:11:16] The next post is a beginner's guide to basic indexing. And this is from Bannister Me website. Now this is a very basic guide and it's basically how this primarily a Laravel developer, increased the performance of his application by 100 times. Now this is not surprising if you don't have an index. Adding an index gives you, these kinds of huge performance wins. So if you're a developer and wanting to learn more about databases in terms of improving your application's performance, then this is a good blog post to review about how he did it. But essentially he identified indexes that were needed and added them and even looked into whether to make it an index only scan or not. So again, it is on the basic side, but if you're a developer and wanting to learn more about indexing and the performances they can result in, definitely check out this blog post. And I should mention it's not PostgreSQL specific. [00:12:12] The next post is things I wished more developers knew about databases. And this is from a median.com under Software Engineering. Now again, this is not PostgreSQL specific, but more generally talking about different types of database concerns like Acid Network, Uptime, Auto Incrementing. So again, this is more focused for developers than any DBA and considerations to take into account as you're developing applications. So if you're a developer, you've probably gained some insight looking over this post. [00:12:45] The next post is stored. Procedures also have rights. This is from Higo CA, and it's talking about the rights that you can assign to store procedures. So primarily you can assign invoker rights or definer rights. So that procedure can be run based upon who defined it, or it can be based upon who is running that procedure. And when you're creating them, you need to kind of take into account how you want it to operate. So this goes through an example about how to set it up. Two different users, two different procedures that have each of these procedures. So you can see how it works and how you get an error if one has an invoker right, but doesn't have the permissions to the particular table to query. So if you're interested in learning more, definitely check out this blog post. [00:13:32] The next post is automated rapid switchover with a BDR database cluster in Kubernetes. So this is particularly interesting. So using Kubernetes so you have a cluster of PostgreSQL servers in conjunction with BDR, which is their bi directional replication product from second quadrant.com. So basically you have say, three database servers. They are all masters. So it's multi master and it shows how one server can die in that cluster. It can almost immediately switch to one of the active masters and then proceed to rebuild the missing one. And it does this in conjunction with their cloud native BDR operator that helps operate the Kubernetes cluster. They have a little presentation here titled a Hard to Kill Automated Rapid Switchover with BDR Cluster and Kubernetes. So this I find really interesting. Use case with Kubernetes. Now, of course, the things you have to take into account with a multimaster setup is your application development practices need to be different. So there's considerations you need to take with that. But definitely an interesting blog post. [00:14:45] The next blog post is easy. PostgreSQL twelve and PG Admin Four set up with Docker. This is from Crunchydata.com. So this just runs through setting up postgres Twelve and PG Four admin on a Docker container. So potentially maybe you want to use that with Kubernetes as well. The second poster related to Kubernetes, also from Crunchydata.com, is set up and run PostgreSQL Operator client PGO on Windows. Now it's showing you how to do it on Windows, but it's also saying how to set it up after you have the cluster up and running. So if you're interested in doing this on Windows, definitely check out this blog post. [00:15:28] The next post is oracle to PostgreSQL start with Connect by this is from Secondquadrant.com, and they're talking about how Oracle has the Start With or Connect By to give a particular result. And they're showing you how you can do that. Same type of query in PostgreSQL using a with Recursive Cde. So if you're interested in migrating from Oracle to PostgreSQL, definitely a blog post to check out. [00:15:55] The next post is the PostgreSQL Person of the week is Mark Wang. So if you're interested in learning about Mark and his contributions to PostgreSQL, definitely check out this blog post. [00:16:07] And the last blog post is a chat with Luca Ferrari, and this is from Superhighway Dev, and he is another person in the PostgreSQL community. So if you're interested in learning more about Luca, definitely check out this blog post. [00:16:23] That does it. For this episode of Scaling Postgres, you can get links to all the content mentioned in the show Notes. Be sure to head over to Scalingposgres.com where you can sign up to receive weekly notifications of each episode. Or you could subscribe via YouTube or bye. Thanks.

Other Episodes

Episode 8

April 16, 2018 00:15:06
Episode Cover

Scaling Real-Time Analytics, Covering Indexes, 1,500 Upgrades | Scaling Postgres 8

In this episode of Scaling Postgres, we review articles covering real-time analytics at scale, covering indexes in Postgres 11, 1,500 Postgres upgrades and PostgreSQL...

Listen

Episode 43

December 17, 2018 00:09:49
Episode Cover

Logical Replication, Prepared Transactions, Ansible, Partitions | Scaling Postgres 43

In this episode of Scaling Postgres, we review articles covering logical replication, prepared transactions, Ansible Postgres deployment Ansible and tidy partitions. To get the...

Listen

Episode 200

January 30, 2022 00:12:11
Episode Cover

5 Minutes of Postgres, Unnest, SCRAM, Multi-master Review | Scaling Postgres 200

In this episode of Scaling Postgres, we discuss the new series 5 minutes of Postgres, using unnest, switching to SCRAM authentication and reviewing multi-master...

Listen