Migrating Millions Of Databases | Scaling Postgres 374

Episode 374 | July 13, 2025 | 00:18:34

Scaling Postgres

Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, we discuss how Jira migrated millions of databases, when SIGTERM does nothing, whether you should ditch vector search, and mastering replication slots.

To get the show notes as well as get notified of new episodes, visit: 
https://www.scalingpostgres.com/episodes/374-migrating-millions-of-databases/

Want to learn more about Postgres performance? Join my FREE training called Postgres Performance Demystified here: https://www.scalingpostgres.com/courses/postgres-performance-demystified/

Episode Transcript

[00:00:00] About two weeks ago on Scaling Postgres, we talked about multi-tenancy. This was an episode by Postgres FM where they talked about multi-tenancy in databases. So this is a practice typically followed by software-as-a-service companies to firewall one customer's data from another customer's data. And at the time I said the only clients I've worked with actually store one table for a given set of data and then segment that table by the tenant, so there's a tenant ID that denotes this data belongs to this tenant. The other way you could do it is to have a tenant per schema or a tenant per database. And I couldn't imagine having a growing SaaS company with hundreds or thousands of databases. [00:00:48] Well, the first blog post this week is talking about a software-as-a-service company that actually has 4 million databases it has to manage. [00:00:58] And be sure to stay until the end for my consulting corner segment, where I'm going to give a few more details about the summer of upgrades and how that's going, because that seems to be taking most of my time. But I hope you, your friends, family and coworkers continue to do well. Our first piece of content is "Migrating the Jira database platform to AWS Aurora". This is from atlassian.com, and they migrated 4 million Jira databases to AWS Aurora. So clearly how they do their multi-tenancy is each customer has their own database that they manage. And they say they've gotten so good at it that migrating databases is just a regular day's work; on average they migrate 1,000 databases every day. [00:01:46] Now, I'm assuming they're trying to rebalance the number of databases on given instances. [00:01:52] I'm assuming that's why they need to migrate that much. They have one database per Jira tenant, and they have millions of Jira databases. Those 4 million databases are spread across 3,000 PostgreSQL servers in 13 AWS regions. Now this is a lot of overhead, but I guess you're trading one set of engineering problems for another. So you have to get really good at operations when you have this many databases, compared to needing really deep knowledge to shard one data set across multiple databases. Oh, and they said the migrations are due to server rebalancing, to help maintain a nice even spread of load. They either do a backup and restore of the databases if they're small, or logical replication if they're larger. Basically they wanted to migrate all of their instances that are plain RDS Postgres to Aurora, and how they did it is they had a multi-AZ RDS Postgres instance with an EBS volume, and they set up replication to an Aurora read replica. So this is a new read replica that syncs to the Aurora cluster volume, apparently. And then at the promotion point, you basically break the link, promoting this Aurora read replica to a writer. And this is the process they used: first they locked out the users for a given database, then they promoted the new database, so Aurora became the active database, and then they unlocked those users. They used AWS Step Functions to orchestrate this procedure. But they did run into an issue with the number of files during the migration. [00:03:30] So if you have a lot of databases, and those databases have a lot of tables or other types of objects in the schema, the migration takes a long time to do the way they wanted to. And AWS actually recommended that they reduce the number of files.
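As a rough aside (not from the article itself), every table, index, TOAST table and sequence in Postgres is backed by at least one file on disk, so a quick catalog query like the sketch below gives you a feel for how many files a single database carries.

```sql
-- Approximate object counts for the current database; each of these relations
-- maps to one or more files on disk (more than one once a table exceeds 1 GB).
SELECT relkind, count(*) AS relations
FROM pg_class
GROUP BY relkind
ORDER BY relations DESC;
```
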
Well, it's hard to just drop a table from a schema. So basically they said, all right, let's cull, or I think they used the term drain, some of those databases from the source RDS instance, do the promote event, and then refill those databases onto the new Aurora cluster. So presumably they drain off these tenant databases to other existing clusters, then do the promote, and then they rebalance and refill them back, and they talk about some of the details with regard to that. [00:04:16] So they had 2,400 database instances, and on those instances they had 2.6 million databases that were part of the conversion. And the total number of database files used by Jira, they said, is over 27 billion, which is insane. [00:04:35] So I'm definitely not one to choose this path versus having all your tenants within one table segmented by a tenant ID column, for example. But it definitely presents interesting engineering challenges, and if you want to learn more about that, feel free to check out this piece of content. [00:04:54] Next piece of content: "When SIGTERM does nothing: a Postgres mystery". This is from clickhouse.com, and the author is a member of the ClickPipes team, I believe, which used to be the PeerDB team that was acquired. So basically they need to move a lot of customer data into ClickHouse Cloud, and a lot of those customers have Postgres. So they need to get their Postgres data into ClickHouse Cloud, and how they do that is CDC, or change data capture. So basically they set up replication slots that decode the WAL into values that can be inserted into an analytical data store such as ClickHouse Cloud. But they ran into a problem with a customer, because they set up a pipe to replicate data and the query was hanging. Basically, trying to create the logical replication slot just hung: its state was active and it was holding up the WAL, but nothing was happening. And when they tried to cancel the backend, or even terminate the backend, it kept running. [00:05:58] And this was a hosted service, so you can't just use a direct kill command; basically you'd probably have to restart the instance to get it to go away. They said they've seen some of this before, but they definitely investigated this time due to the issue. So they were able to get strace output on the backend process running the broken query, and basically the backend is in a loop of sleep calls, one millisecond each, waiting for something. [00:06:26] Now, another thing to note is that this replication slot was created on a read replica. So the problems they're experiencing here only happen when you try to create a replication slot on a read replica, and that capability is new as of Postgres 16. But apparently on the primary you won't have this particular issue. What Postgres does is try to find a consistent start point, and that's relatively easy to do on the primary. [00:06:51] So it just has this tight loop to try and find that consistent point. But that is apparently much harder to find on replicas because of how they apply the WAL, and this blog post goes into a lot of detail about how the primary is different from the replica. But the result is that with this tight one-millisecond loop trying to find a consistent point, the replica can be stuck there for hours waiting, because it doesn't have real-time lock information. [00:07:21] And then the reason it can't be canceled or killed is that this tight loop doesn't check for interrupts.
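For reference, and not from the article itself, these are the standard SQL-level ways to stop a backend; they normally work, but in this case they had no effect because the slot-creation loop never reached an interrupt check (the pid below is made up):

```sql
-- Find the backend stuck creating the logical slot (the filter is illustrative).
SELECT pid, state, wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE query ILIKE '%create_logical_replication_slot%';

-- Ask the backend to cancel its query, or to exit entirely; both rely on the
-- backend eventually checking for interrupts, which this loop never did.
SELECT pg_cancel_backend(12345);
SELECT pg_terminate_backend(12345);
```
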
So he said, all right, first let's at least be able to cancel this process if the creation of the logical slot becomes hung up. So he submitted a patch to do just that, adding a check for interrupts. He said this Postgres patch was accepted, but he didn't mention what version it's in, so I don't know what version it already is in or will appear in. But once it's in, you should be able to cancel the process if it gets into this state. And he says there's another patch someone did to actually introduce a new wait event, because if you remember up at the top, the wait event for this query was active when it really wasn't doing anything. So someone's going to add a new wait event so people can see, oh, it's waiting for this consistent point. And there's more discussion about changing the way that standbys wait for transactions to complete on the primary. Anyway, if you want to get into all the detail of this blog post, I definitely encourage you to check it out. Next piece of content: "Why Cursor is about to ditch vector search (and you should too)". This is from tigerdata.com. Now, a lot of this isn't Postgres specific, but if you are using vectors in your Postgres database, this can be relevant. One thing he says throughout this post is that AI is just search, and up until recently I would have agreed with that. But with the kinds of things it can do now, like actually generating code or performing other agent-like functions, I don't know if that's really the case. I mean, someone could say that the code writing is just search and replace on steroids, or autocomplete on steroids. [00:09:03] But anyway, if AI is just search, and not all this vector DB, RAG, context engineering, prompt engineering stuff, then it's important not to forget lexical search, because really you want to return relevant results, and those could come from an exact or similar lexical search, or from a semantic search. And he says here, of course, similarity does not equal relevance. [00:09:29] Also, not all text has semantic meaning. If you're trying to search through a code base for get user by ID, you don't want find user by email, you want the exact method or attribute. So here lexical search is better than semantic search. And here's a quote: vector search should not be applied to text where semantic similarity is irrelevant. He talks about how Claude Code actually uses more lexical search and less semantic search. I have been using Cursor and Claude Code recently, so I personally found this a pretty interesting read, and feel free to check it out as well. [00:10:08] Next piece of content: "Mastering Postgres replication slots: preventing WAL bloat and other production issues". This is from morling.dev, and it's a good guide to the things to watch out for when you want to set up logical decoding, so you can do some kind of change data capture from your database. One of his first recommendations is to use the pgoutput logical decoding output plugin. You have to choose an output plugin when you're using logical decoding, and he recommends pgoutput. Now, this does come in a binary format and it's not as convenient as JSON, but it is more efficient, and he says it provides fine-grained control over the replicated tables, columns and rows. I don't know if the JSON outputs don't do that. The next recommendation, he says, is to define a maximum replication slot size.
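As a minimal sketch of those two recommendations (the publication, table and slot names here are made up, not from the post):

```sql
-- Cap how much WAL a lagging or abandoned slot may retain; takes effect after a
-- configuration reload (parameter available since Postgres 13).
ALTER SYSTEM SET max_slot_wal_keep_size = '50GB';
SELECT pg_reload_conf();

-- Publish only the tables you actually need, decoded by the built-in pgoutput plugin.
CREATE PUBLICATION orders_pub FOR TABLE orders;
SELECT pg_create_logical_replication_slot('orders_slot', 'pgoutput');
```
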
[00:10:57] So this protects your primary from running out of disk space if your replicas are unable to keep up with the load. Basically, once the slot hits 50 gigabytes, in this case, the database invalidates the replication slot, drops the older WAL segments, and renders the slot unusable. So you need to start over again with your logical replication if this happens. The next recommendation is to enable heartbeats. He says this can be beneficial if you have multiple logical databases on one Postgres host, because there is one shared WAL for the entire instance. But sometimes what that can mean is that all the activity is in one database and there's not enough traffic in the other database to allow its slot to advance. [00:11:44] So the trick is basically to create fake traffic in that second database that allows its replication slot to progress, and he shows a way to do that in Debezium with the code down here. [00:11:56] The next is being aware that you can do table-level publications. You don't have to logically decode the whole database; you can do particular tables, and you can even do specific columns and rows of those tables as long as you're using Postgres 15 or higher. [00:12:12] The next recommendation, he says, is to enable failover slots. This lets you keep track of that slot on a replica as well, so if that replica or hot standby needs to be promoted to the primary, it will already know the last LSN that logical replication decoded. The next thing he mentions is to consider using REPLICA IDENTITY FULL. Basically, if you have data in TOAST columns, when a change happens you don't get the old values that were in the TOAST columns, so you have to do a costly state materialization operation. [00:12:45] So if you need to decode those, you may want to set that table to REPLICA IDENTITY FULL, because then if a row that has TOASTed values gets updated, the WAL will give you a comparison of all the old data and the new data. [00:12:58] And lastly, monitor, monitor, monitor. He goes into a lot of depth on this, with different queries and techniques you can use to keep track of the logical replication as it's happening. So I thought this was a great post and I encourage you to check it out. Next piece of content: "Handling unique indexes on large data in PostgreSQL". This is from Volodymyr Potichek, forgive me if that's incorrect. So he was working on a release where he needed to add new functionality to a feature, remove duplicated entities related to that feature, and then set up a unique index to prevent future duplicates. He said parts one and two went fine, but when he tried to apply the index he got this error message: index row size exceeds the B-tree maximum of 2704 for the index he created. So basically the values in an index need to fit within the B-tree; there's no such thing as TOAST for indexes, and each value that can be stored is limited in size. [00:13:58] So how can you handle this? Well, you can use a hashing function. You can create a B-tree index with a hashing function, basically an expression index, but what he chose to do is actually add a generated column to the table that holds the hash and then apply an index to it. That hash function gives you a short value, but it should hopefully be unique amongst the values in your table if you pick a good hash function. [00:14:27] This blog post goes into a lot more detail about this topic, so feel free to check it out.
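A minimal sketch of that generated-hash-column approach, with a made-up table and columns, might look like this:

```sql
-- Store a short, deterministic digest of the oversized text column; md5() is
-- used here for illustration, and a stronger hash could be swapped in.
ALTER TABLE documents
    ADD COLUMN body_hash text GENERATED ALWAYS AS (md5(body)) STORED;

-- Enforce uniqueness on the digest instead of the value that blew past the
-- B-tree row size limit.
CREATE UNIQUE INDEX documents_body_hash_key ON documents (body_hash);
```
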
Next piece of content: "Active-active replication: the hidden costs and complexities". This is from percona.community, and it's a follow-up to a post where he talked about some good reasons you could use active-active, or multi-master, databases. Here are some bad reasons. Number one is that you want to try to scale write throughput, and he basically said this is probably not the way to go about it. I would say scaling out with multiple writers, not trying to keep them in sync, but actually separating your data into separate shards, like Citus does with their product, or like PgDog or other scaling poolers do, would be the way to go about scaling writes. [00:15:15] You shouldn't do it for performance, because performance may actually be hindered as you try to replicate the data between these multiple databases. [00:15:23] And if you want read performance, it's generally better to have multiple read replicas; if it's actually write performance, then you're back to sharding or splitting up that data. He talked about load balancing, and again, if it's reads, better to use read replicas; if it's writes, that's going to be a little harder to do. And the fourth reason is because other systems did it. But he talks about some of the pains with active-active. One is that conflict resolution is a problem of its own, so it's definitely something you have to deal with when you're using an active-active implementation. Next is infrastructure and networking costs, with all that data constantly being transferred across regions. Next is operational and management overhead, where particular changes can break replication or trigger conflicts, and debugging gets really hard. But check out this blog post if you're interested. [00:16:16] Next piece of content: "So why don't we pick the optimal query plan?" This is from vondra.me, and it follows up on a post he did last week about why Postgres doesn't often pick the optimal plan, giving more details so that people can replicate what he was looking at before. A lot of it comes down to random_page_cost, because that is a lot of the trigger point for what he was looking at: when do you do an index scan versus a sequential scan? And random_page_cost definitely has a huge influence on that. [00:16:46] Now, for his tests he used a random_page_cost of 2.2, and he debates some of the advice that it can be close to 1, like 1.1, which I know is personally what I usually use, but he said arguments could be made that it should be higher. [00:17:03] So like anything, it's definitely something you should test out with your own application and what it's doing. But check out this blog post if you want to learn more. [00:17:11] Last piece of content: "Setting sail: installing Cloud Native PostgreSQL (CNPG)". This is from cybertec-postgresql.com, so if you want a walkthrough of setting up CloudNativePG, you can definitely check out this blog post. [00:17:25] And now for my consulting corner. The summer of upgrades continues. [00:17:30] Someone did ask me a question: what were the rough sizes of the databases, and what kinds of techniques were you using to do the upgrades? So on average the size of the databases is about 13 terabytes across the four different databases, and the median was about 8 terabytes. So some of those four are a little smaller and one of them is fairly bigger, but those are generally the ranges of the databases being upgraded or migrated.
And really, one is being migrated into RDS using logical replication, two of them are being upgraded using a logical replication technique, and then one is in Aurora, where we're planning to use the blue/green deployment that AWS provides. [00:18:11] Other than that, I hope your summer continues to go well. [00:18:14] I hope you enjoyed this episode. Be sure to check out scalingpostgres.com, where you can find links to all the content mentioned, as well as sign up to receive weekly notifications of each episode. There you can also find an audio version of the show, as well as a full transcript. Thanks, and I'll see you next week.

Other Episodes

Episode 364

May 04, 2025 00:20:39
Scaling For High Volume | Scaling Postgres 364

In this episode of Scaling Postgres, we discuss scaling for high volume, why workflows should be Postgres rows, a planner gotcha and ways to...

Episode 365

May 11, 2025 00:20:01
Here Comes Async I/O! | Scaling Postgres 365

In this episode of Scaling Postgres, we discuss async I/O introduced in PG 18 Beta 1, new Postgres releases, PgDog performance and innovative ways...

Episode 75

August 05, 2019 00:15:40
Combining Indexes, Regular Expressions, Parallelism, PostGIS | Scaling Postgres 75

In this episode of Scaling Postgres, we discuss combining indexes, regular expressions, parallelism in Postgres and PostGIS content. To get the show notes as...
