Schema Change Mistakes, Canceling Statements, pg_rman, Pedantry Removal

[00:00:00] Hello. In this episode of Scaling Postgres, we talk about schema change, mistakes, canceling statements, Pgrman and pedantry removal. I'm Kristen Jameson. And this is Scaling. Postgres episode 217 one. [00:00:23] Alright, I hope you, your friends, family and coworkers continue to do well. Our first piece of content is common DB schema change mistakes. This is from postgres AI, and he's talking about mistakes that frequently happen when you're trying to do a schema change to a database. Now, sometimes these are called DB migrations in different application frameworks. But it's basically you want to add a table, add a column, add an index, you're changing the schema in some way. And typically, as he says here, these changes tend to be incremental. So you perform these changes one step at a time. Typically they're reversible. So application frameworks have the ability to allow you to reverse a change after it's done. And then they're also versionable so you can store them within your source code repository. For example, and this is a great post describing different mistakes that can be made. So he's bucketed these into three primary areas. One is concurrency related mistakes. Either failing to acquire a lock in a particular amount of time, or you acquire a lock but you leave it open for too long, impacting other transactions, or you're updating too many rows at once, which can cause locking and other concurrency issues. The other bucket is mistakes related to the correctness of steps or logical issues. So you have a schema change that happens that's unexpected, or you have a mismatch between what the application code expects and what the database expects, or even just unexpected data. And then a miscellaneous where he's talking about exceeding potentially a statement timeout with a migration. Or you have vacuum behaviors or even primary key links. So this post covers about 18 different cases and we'll just run through them quickly. The first one, case one, is a schema mismatch. So imagine you go to create a table, but then you get an error that says relation already exists. So this indicates there's some sort of problem with your process of applying migrations because something has already ran create table or it was done manually. Basically you would want to try to resolve whatever that issue, how did this get run before it was supposed to be run? And what you don't want to do is case two, using if not exists, because you could use this. But then the problem with this is that it silently fails and you don't know that you have this problem that caused this. Something else created that table before your schema change ran. So basically the appropriate way to resolve this is to find out what caused this to be run in advance or was it run manually. And to remedy that issue as opposed to trying to use if not exists. [00:03:05] Case three is hitting your statement timeout. And he says typically you see this if you run in a testing environment that's not a replica of production. You can be under a statement timeout, but once you move to production, sometimes this gets hit. So you want to do schema changes that you know are going to spend a little bit of time in a production like environment, like have a copy of the production database in some way to be able to test things like that out and they're talking about their product. Actually. The database lab engine allows you to do thin clones of a database for this type of testing. Case four is unlimited massive change. So maybe you're deleting or updating a lot of data in a table and like this query here, that's updating took 76 seconds to do. So that's way too long that you'll want to hold locks on a system typically. And he says some of the problems that could happen, you could run into locking or contention issues just by the number of rows that are being updated. Your checkpointing could happen super frequently given the updated amount of data that needs to be done and that could actually lead to bloat issues. So how you would typically want to update a lot of data at a time is do it in batches and he recommends limiting the actual active work to about 1 second. So maybe you would put this in a loop, do some work for 1 second or approximately that amount of time, then take a break and then apply the change again. And then what you could also do is vacuum between changes as well to make sure that you're avoiding bloat issues and even doing analyzing to make sure that your statistics are kept up to date. Case five is acquire an exclusive lock and wait in transaction. So basically you're doing alter table command, adding a column, but then you do something else that causes a long delay and basically you want to avoid this if you can, anything that requires an exclusive lock because this is a very fast operation, but it requires an exclusive lock, so you want to keep that separate. So basically break out your migration. Usually it's always better to do complex operation in terms of multiple steps, particularly from one step can take a long time to do. And they're using an example here in case six of where say you're copying a lot of data after you add a column. So basically you would want to break those steps out and do the copy command. Maybe you wouldn't want to do that in batches as well. Case seven, waiting to acquire an exclusive lock for long blocking others. So basically you don't want to run into a lock queue where things are waiting for a lock on a table and then you have other locks that are waiting for that lock behind it. So one way to avoid that for your schema migrations, what you want to do is apply a lock timeout and a statement timeout as well. But basically the lock timeout should avoid a queue of locks building up. So basically if a statement is unable to acquire a lock within the lock timeout, it'll go ahead and cancel it. So a canceled migration is usually better than one that would cause locking issues. [00:06:13] Case eight careless creation of a foreign key. So basically it needs to check for consistency in the foreign table to make sure that all the values are there. So the best approach to this type of migration is do it in two steps. The first step is create the foreign key with the not valid option. So basically that creates the foreign key it's present but all the historical values are not valid yet or haven't been checked. And then as a second step you do alter table and validate that constraint and that shouldn't necessarily block but it will validate that constraint and make sure everything's good. [00:06:50] Case nine is careless removal of a foreign key and basically the problems you can run into here is if you're not using a lock timeout. So generally you would always try to want to use a lock timeout for so many of your migrations other than when you're creating indexes to avoid any kind of lock queue issues. [00:07:08] Case ten is careless addition of a check constraint. Again, this is similar to the foreign key adding and what you would want to do is create the constraint as not valid and then as a second migration or a second step, alter the table, validating that constraint and that'll allow you to add that check constraint to the table. 11th is careless addition of Not Null so not nulls are a little bit different in that they can lock up the table to verify things are not null and there's a couple of different ways to handle it. Based upon your version of postgres like before eleven, you don't really have a good way of dealing with it and maybe you would want to go with a check constraint instead of not null on the column. But if you're at eleven there's a way to do that and then if you're on version twelve, the easiest way to deal with it is you go ahead and create a check constraint that's not null and you can create it as not valid and invalidate that constraint. But once that constraint is in place, you can then set not null for the column and version twelve and higher will allow you to apply that relatively quickly. [00:08:12] Case Twelve careless change of columns data type. Now there's also mentioning renaming objects, so I think these are related. So careless change of a column's, data type or renaming the objects. You want to do something similar in this case that he recommends and that is actually creating a new column. Write to both columns, probably using a trigger, backfill the data in batches, switch your application to use the new column and then drop the old column once it's fully done. So I would do that for both the renaming objects and a need to change the column's data type. [00:08:47] So case 13 is careless create index. Basically you never want to create index, you always want to do create index concurrently. Similarly with drop index, you want to drop index concurrently because that avoids locking the table during the index creation. [00:09:04] So case 16 is add a column with the default. And the great thing about this is that if you're on postgres eleven and later, this works without any issue K 17 leftovers of create index concurrently. So this is a downside of create index concurrently is that if it runs into issues or has a failure, it will actually leave around an invalid index and you can actually inspect the table and you can see the index is invalid. And if that's the case, you need to drop the index concurrently and then add it back concurrently. Case 18 is four byte integer primary keys for large tables and basically recommends using the eight byte integer types for those. But this is a great post, covers mistakes related to database schema migrations and gives very good recommendations on how to get around those issues. [00:09:54] The next piece of content help I cannot cancel a PostgreSQL query. This is from CyberTech postgresql.com. And basically he's saying when you're running a query, there are times where you can't cancel it. So what's the problem with that? And he says there is a secret key that's sent with a new connection that you can send a cancel request using the library. I actually haven't used that. What I typically do because I'm the sysadmin is I use PG cancel backend. And if that canceling of the statement doesn't work, then resorting to terminating the whole backend by using PG terminate backend and you just put in the process ID of whatever process you want to terminate. Now the cancel backend sends a SIG int signal and the terminate backend sends a SIG term signal. Now, it doesn't immediately just stop the process or cancel the statement, it actually finishes work before it cancels or terminates. And in the code he says it actually looks for a check for interrupts to check for these, and then it goes ahead and finishes its work. Now some of the reasons that you could get an issue where a query doesn't get canceled is that execution gets stuck in a loop that doesn't contain this check for interrupts. And he says this would basically be some sort of a postgres bug. The other possibility is you're stuck in a third party C function that doesn't have the ability to give control back. And basically this would indicate another bug in this particular function. And then third, it's stuck in a system call that cannot be interrupted. So this may be some sort of problem at the operating system or hardware level. So this kind of example should be relatively infrequent where you can't cancel it, I would hope. Now, he does say do not use kill nine unless you're desperate. Because the issue, when you go to use a kill nine on a process, that process has not had the opportunity to clean itself up. And he says the postmaster detects if one of its child processes did not shut down cleanly, and then it kills all other postgres processes and initiates a crash recovery. And depending upon when the last checkpoint was, that could be a fair amount of time to bring the database back up. So you never really want to do this even for just a postgres process, unless you absolutely have to. And never, never kill the postmaster process basically the main postgres process. Now, he did have a trick here to cancel a stuck query without crashing the server, in that he created a custom C function that basically did an infinite loop. And then he called that function and then he tried to cancel it. It didn't work. He tried to terminate the back end. That didn't work either. What he actually did to that process, he attached a debugger and sent a process interrupts command, and that process for interrupts command enabled that process to stop. Now, how you can create an infinite loop within a C function is you can actually use that check for interrupts function as long as you have it in there, then you can actually cancel this query at some point because the system will check for any interrupts that are happening. But this is an interesting blog post and if you want to learn more, definitely encourage you to check it out. [00:13:18] The next piece of content how to do backup and restore for PostgreSQL using Pgrman this is from Higo CA. Now, from my previous work with Oracle, I know Rman is their main backup and recovery tool, and I haven't really used Pgrman for postgres, but apparently it's very similar. I guess the DSL that it uses is similar to Rman, but this post describes how to use Rman to do a backup and restore of a postgres system. Now, I didn't see a lot in here, nor on the Pgrman site itself, about what are the advantages or disadvantages relative to something like PG based backup, because that's what I tend to use. But if you come from Oracle and you're looking for a utility similar to Rman, maybe check out Pgrman. [00:14:06] The next piece of content, removing some pedantry off postgres. This is from mydba notebook.org, and they're talking about an issue where when you're doing a select and you're looking for a count of all rows from a table, they say that there's no real reason to use count star instead of just count, because when you use row number, you don't have to use an asterisk. And an asterisk means something entirely different when you're selecting all rows from a table, like, it doesn't mean select all rows necessarily. So basically they think it doesn't really need to have this asterisk. So if you run a select count without the asterisks from Test, you actually get an error. So what they did is they went in and made a patch for Postgres so that they removed it so that you can do a select count without the star. So if you want to find out more about how they did this, you can definitely check out this blog post. [00:15:00] The next piece of content. Five minutes of postgres episode 19 speeding Up Sort Performance in Postgres 15 this is from PG Analyze.com. And they're talking about enhancements to postgres 15 that have improved Sort performance. Now, we covered a similar post posted to the Citus Data blog last week. This one covers one that was done in the Microsoft blog, but it's pretty similar in terms of the content. But if you want to learn more about PG Analyze's perspective on this post, definitely encourage you to check it out. [00:15:34] The next piece of content, the PostgreSQL Person of the week is Masahiko Sawada. If you're interested in learning more about Masahiko and his contributions to Postgres, definitely check out this blog post and the Last Piece of Content we did have another episode of the Rubber Duck Dev show this past Wednesday evening. This one was on typed or untyped ruby. So we were talking about the advent of static analysis tools for Ruby Three, but in general, we were also talking about static versus dynamic typing. So if you're interested in that type of content, we definitely welcome you to check out our show. [00:16:09] That does it. For this episode of Scaling Postgres, you can get links to all the content mentioned in the show notes. Be sure to head over to Scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or itunes. Thanks.

Show Notes

Episode Transcript

Other Episodes

Episode 61

Constraint Triggers, Simple Shards, Data Streams, OIDs | Scaling Postgres 61

Episode 27

Multiple Databases, Slow Queries, Sorting Performance | Scaling Postgres 27

Episode 324

Four Million TPS | Scaling Postgres 324

Schema Change Mistakes, Canceling Statements, pg_rman, Pedantry Removal | Scaling Postgres 217