Scale Through Sharding | Scaling Postgres 308

Episode 308 | March 24, 2024 | 00:18:15

Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, we discuss high availability's future, a custom sharding solution by Figma, sharding pgvector queries, and PG17 logical replication failover.

To get the show notes as well as get notified of new episodes, visit: 

https://www.scalingpostgres.com/episodes/308-scale-through-sharding/

Episode Transcript

[00:00:00] Sharding is one of those things you reach for once you've hit the limits of scalability. If you've optimized your queries and statements as much as possible, and you're distributing your reads across many read replicas but still need to keep scaling, sharding is typically looked on as the last resort. Maybe you're doing application sharding, where you break different functions off into separate databases, or customer-based sharding, where certain customers get colocated to a particular shard so that all of their data resides on that shard, making queries for that individual customer easier. We talk about that in a few posts this week, but I hope you, your friends, family and coworkers continue to do well.
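To make the customer-based sharding idea a little more concrete, here is a minimal sketch of one way the routing can work. This isn't from any of the posts covered this week, and the table and column names are hypothetical.

    -- Hypothetical routing table: every customer is pinned to exactly one shard.
    CREATE TABLE customer_shard_map (
        customer_id bigint PRIMARY KEY,
        shard_name  text NOT NULL    -- e.g. 'shard_01', 'shard_02'
    );

    -- Place a new customer on a shard (simple modulo placement, for illustration only).
    INSERT INTO customer_shard_map (customer_id, shard_name)
    VALUES (42, 'shard_' || lpad(((42 % 4) + 1)::text, 2, '0'));

    -- The application looks up the shard first, then sends that customer's
    -- queries to the database where all of their rows live.
    SELECT shard_name FROM customer_shard_map WHERE customer_id = 42;

The appeal is that any query scoped to a single customer only ever has to touch one shard.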
Our first piece of content is PG Friday: Redefining Postgres High Availability. This is from bonesmoses.org, and even though it's a PG Friday post, it was posted just this past week, so I think it might be a late PG Friday post, because I believe they talked about high availability a number of weeks, if not months, back. This is only tangentially related to sharding, but the post is really good for the potential future picture it paints, and I kind of align with the vision it's hoping for. Basically he's saying there are a lot of problems with the solutions that exist today for dealing with high availability, and he calls out repmgr, EFM, Stolon, pg_auto_failover and Patroni. Personally I think it's a hard problem to deal with anyway, but I think it is true that, quote, it's easy to get wrong and even the experts occasionally have trouble here. He brings up Postgres Kubernetes operators as well, and most of them use Patroni to manage their high availability, except for CloudNativePG, which actually uses a lot of Kubernetes functionality. But he says, quote, the operator has to do so much work to maintain state, handle failover, provide services for mutable endpoints, and maintain replicas that the operator YAML definition alone is over 850 KB. So what can we do instead? He's basically looking at what some of the cloud providers have done in terms of breaking apart storage from compute. The thing about handling replicas now is that they all pretty much rely on streaming physical replication, and when you do that, you have to make an entire copy of the database. So if you want five replicas, you're going to have five copies of the database files around, and you can't really scale your compute independently of your storage; one is tied to the other. He's suggesting that if we can break compute from storage, like AWS has done with Aurora, I think Google has done with AlloyDB, and of course Neon as well, they've all broken the compute nodes out from the storage, then you can replicate the storage using new and different tools and just create as many compute nodes as you need to interact with that storage. He thinks that would be a great path forward to handle availability better. The other part of the puzzle is introducing some sort of coordinator node, and this is kind of what Citus does, where a coordinator node receives the request and then distributes it to the worker nodes to do the work. Having some sort of coordinator or endpoint node, the point at which an application talks to the database, means it can route or proxy a statement off to a given compute node to execute it, be it an insert, an update, a delete, or just a select. Once that happens, the Postgres database truly becomes a cluster: a cluster of compute nodes interacting with a common data store. He thinks this would be far easier to manage and more resilient to issues. I thought this was a particularly interesting post. I don't necessarily agree with 100% of it, but I think heading down this particular path would be really interesting, because it would solve a number of problems that exist now. I definitely encourage you to check out this post.

[00:04:22] Next piece of content: How Figma's Databases Team Lived to Tell the Scale. This is from figma.com, and they're talking about their, quote, nine month journey to horizontally shard Figma's Postgres stack. They say their database stack has grown almost 100x since 2020, so over the last four years. Initially they were doing what they're calling vertical partitioning, and looking down here at the image labeled vertical partitioning, I'm assuming they're breaking up a monolithic database into multiple databases, each with a separate role.

[00:05:03] I wouldn't necessarily call that vertical partitioning. This seems more like microservice sharding or application sharding: basically, you're segmenting the duties of the database into multiple databases, each with its own duty. So this one handles users and authentication, this one handles files and images, or maybe you separate files into one database and images into another. But what they really wanted to get to was horizontal sharding, which takes, say, this user table and breaks it up into shards based upon users. Now, I typically see this done by account or tenant, so in a typical multi-tenant application you just separate each tenant onto their own shard. But that doesn't seem to be the path they chose, because they talked about not having a single shard key but a handful of shard keys: they actually sharded some tables by user ID, some by file ID, some by ID. So it already starts to get a little more complicated. Their goal was to shard this way, and when I looked at what they were doing, they're basically rolling their own sharding solution, including building an entirely new proxy system that does query rewriting to the proper destination or the proper database node. And I hate to say it, but as I was reading this post I was thinking, good grief, that's an awful lot of work to do, because I know there are other proxies out there that can route to other databases. Could you simply use some of that open source code and write your own solution? Or, the most obvious one, how hard would it be to use Citus, which is an open source solution for automatically scaling out your database? But then I saw that they said, end to end, it took their team roughly nine months to shard their first table. I'm an engineer, but I'm also a business owner, and looking at it taking nine months to shard a first table, alarm bells are going off for the business owner side of me that there has to be an easier way to do this. They definitely built a lot of technology, and the engineering is respectable, but the business side of me kept saying, why didn't they just use an existing open source solution, or modify one to meet their needs?

[00:07:28] I mean, they probably couldn't do the sharding exactly the way they're describing here, but it seems like it might save an awful lot of work. But definitely feel free to check this out if you want to learn more about how they did it. This post was also covered in Lukas's Five Minutes of Postgres at pganalyze.com, titled How Figma Built DBProxy for Sharding Postgres, and he compared it to how, say, Notion did their sharding, and also talked about Citus and said it's a very good option as well. So feel free to check out his piece of content too.

[00:08:01] Next piece of content: Distributed Queries for pgvector. This is from jkatz05.com. Another way to scale out Postgres is to partition tables and have each partition be a foreign table that's queried via the postgres_fdw foreign data wrapper. That's a way to roll your own scale-out or sharded Postgres, and that's pretty much what Jonathan Katz does here, as a way to get more memory and more CPU resources for running calculations on very large sets of embeddings. Basically, can you do scale-out? So he experimented with it. He got two instances that are essentially the worker nodes and one head node, or you could call it a coordinator node, and configured nodes one and two appropriately. He created some random data on each of the worker nodes and set up the indexes. Then on the head node he set up the database, creating the pgvector and postgres_fdw extensions, created the two worker nodes as foreign data servers, did the user mappings that you need to do, and created a table partitioned by list, where that list is each node. So essentially he created a partitioned table of vectors with one partition that is a foreign table pointing at node one, and then did the same thing, creating another partition that is a foreign table pointing at node two. And he was able to query it and get information coming from both nodes with a single query on the head node. So for test one, it worked. Then he asked, two, should we be doing this? In other words, what's the performance of doing distributed queries like this, and is the recall high? From what he was able to determine, the recall was actually better on two nodes, and he's not quite sure why yet, but the performance wasn't as good. And these are different variations of ef_search.

[00:10:03] One reason the performance was probably not as good is that it was contacting each foreign server serially: it got the results from node one and then went and got the results from node two, rather than doing them in parallel. He also suspected that if the single node was more memory constrained, it would have to go to the network-attached storage more frequently, and that might impact its performance. And it did a little bit, so that did normalize the performance a bit more, but the single node was still more performant. So it looks like this works, but there are some things holding it back from achieving the performance of a single node. You can check out this blog post if you're interested in that.
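As a rough sketch of the kind of setup being described, not the exact commands from the post, and with hypothetical server, table and column names, the head node side looks something like this:

    -- On the head node: pgvector for the vector type, postgres_fdw for the foreign tables.
    CREATE EXTENSION IF NOT EXISTS vector;
    CREATE EXTENSION IF NOT EXISTS postgres_fdw;

    -- Register each worker as a foreign server, with a user mapping for each.
    CREATE SERVER node1 FOREIGN DATA WRAPPER postgres_fdw
        OPTIONS (host 'node1.internal', dbname 'vectors');
    CREATE SERVER node2 FOREIGN DATA WRAPPER postgres_fdw
        OPTIONS (host 'node2.internal', dbname 'vectors');
    CREATE USER MAPPING FOR CURRENT_USER SERVER node1 OPTIONS (user 'app', password 'secret');
    CREATE USER MAPPING FOR CURRENT_USER SERVER node2 OPTIONS (user 'app', password 'secret');

    -- A table partitioned by list, where each partition is a foreign table on one worker.
    -- Each worker is assumed to have a plain table with the same columns.
    CREATE TABLE embeddings (
        node_id   int NOT NULL,
        id        bigint,
        embedding vector(1536)
    ) PARTITION BY LIST (node_id);

    CREATE FOREIGN TABLE embeddings_node1 PARTITION OF embeddings
        FOR VALUES IN (1) SERVER node1 OPTIONS (table_name 'embeddings');
    CREATE FOREIGN TABLE embeddings_node2 PARTITION OF embeddings
        FOR VALUES IN (2) SERVER node2 OPTIONS (table_name 'embeddings');

    -- A single query on the head node now gathers results from both workers, e.g.:
    -- SELECT id FROM embeddings ORDER BY embedding <-> '[...]' LIMIT 10;

One query against the parent table fans out to both foreign servers, which is exactly where the serial, one-server-at-a-time behavior mentioned above comes from.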
Next piece of content: Postgres 17 Highlight: Logical Replication Slot Synchronization. And this is awesome: apparently a commit has added a new slot sync worker to synchronize logical replication slots. This is one of the features I've been waiting for, basically allowing logical replication slots to fail over to a replica when you have a promotion event. So thanks to Amit Kapila. And I should say this blog post is from bdrouvot.github.io, and it's great in terms of showing exactly how this feature works in Postgres 17, which hopefully will be released this fall. Basically, you now create a logical replication slot on the primary, but in the last parameter you indicate true, which is the failover flag, so you want this to be a failover slot, and you can query that slot to see that, yes, failover is true. Then on your standby you point to the primary database and reload the configuration, but you also have to set a new configuration parameter, sync_replication_slots, and make sure it's true. You also have to turn on hot_standby_feedback, so that is a requirement now, as well as having a primary_slot_name. He says you can query pg_replication_slots on the standby and you should be able to see this logical replication slot, that it is a failover slot, and that it is synced. He went ahead and did some activity on the primary, and he can see that the logical slot on the standby is in sync and keeping up with the primary. Then he also created a publication on the primary and subscribed to it from a subscriber node, being sure to create the subscription with failover set to true. Using this, it automatically creates the logical replication slot needed on the primary with failover set to true, and you can see that everything is in sync and data gets transferred. And if you promote a replica to be the new primary, you then have to alter the subscription on the subscriber so it starts following that new primary. So this is an awesome addition, and if you want to learn more, definitely check out this blog post.
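To give a flavor of the steps he walks through, here's a rough sketch. These aren't the exact commands from the post, the slot, publication and connection names are made up, and details could still change before Postgres 17 ships.

    -- On the primary: create a logical slot with the failover flag (the last argument) set to true.
    SELECT pg_create_logical_replication_slot('my_slot', 'pgoutput', false, false, true);
    SELECT slot_name, failover FROM pg_replication_slots WHERE slot_name = 'my_slot';

    -- On the standby (postgresql.conf), in addition to pointing at the primary:
    -- sync_replication_slots = on
    -- hot_standby_feedback   = on
    -- primary_slot_name      = 'standby1_slot'

    -- On the standby: confirm the logical slot is being synchronized.
    SELECT slot_name, failover, synced FROM pg_replication_slots;

    -- On a subscriber: a subscription created with failover enabled makes its
    -- slot on the primary a failover slot automatically.
    CREATE SUBSCRIPTION my_sub
        CONNECTION 'host=primary dbname=app user=repuser'
        PUBLICATION my_pub
        WITH (failover = true);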
Next piece of content: Look ma, I wrote a new JIT compiler for PostgreSQL. This is from pinaraf.info, and he's talking about how Postgres currently uses an LLVM-based JIT compiler. But he says LLVM has a big downside: it's not built for JIT compilation, so the JIT compiler can really add cost. He says, right here, that's why a lot of people, including him, end up turning off the JIT compiler, because most, if not all, queries on his production system won't gain enough from the performance boost to compensate for the LLVM optimizer cost. I'm interested in hearing whether other people are seeing this, or have you turned off the JIT compiler? But he has a potential alternative with a technique called copy-and-patch, which lets you write stencils in C. It's a different way to do on-the-fly compilation, and he actually ran an experiment to see how well it would work. He says this is really early days and far from ready for prime time, but the results he's gotten so far have been promising. Just looking at the median and the standard deviation for a plain select from a table, JIT versus no JIT, I would say everything's within a standard deviation, so no real change, but this new type of JIT takes an insignificant amount of time. Using it for the query with this where clause here, it looks like there's a statistically significant improvement, and more importantly, negligible cost for the JIT compilation. What he says is that even if the JIT is an extra, slightly time-consuming step, it takes so little time that even these queries go a few percent faster. I thought this was particularly interesting, and I definitely encourage you to check it out.

[00:15:01] Next piece of content: Postgres Performance Boost: HOT Updates and Fill Factor. This is from crunchydata.com, and they're talking about heap-only tuple (HOT) updates. If you update a particular column that isn't covered by any index, you can get a heap-only tuple update: it only has to update the heap, and it doesn't have to touch the indexes, because the column you updated isn't indexed. But you also need sufficient free space on the page for that to happen, and that's where fill factor comes in. By default, Postgres tables have a 100% fill factor, so pages are filled completely, but you can create a table with a lower fill factor, and they say the setting ranges from 10 to 100. I remember seeing some previous posts about fill factor where 90 seemed to be the sweet spot, and 80 and 70 started to get worse. This post goes through setting a fill factor if you want to leave enough space for some of these HOT updates, and they even say down here that fill factor settings in the 70, 80 and 90 range can be reasonable, but, quote, there are folks who go down to 50 and even lower, although those are definitely special-case circumstances. They also tell you the system views you can look at to track how many HOT updates are happening. So check out this blog post if you want to learn more.
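As a quick sketch of what that looks like in practice, with a hypothetical table name rather than anything from the post: you set the fill factor on the table and then watch the HOT update counters in pg_stat_user_tables.

    -- Reserve free space on each page so updates can stay on the same page (HOT).
    CREATE TABLE sensor_readings (
        id         bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        device_id  bigint NOT NULL,
        reading    numeric,           -- frequently updated and not indexed
        updated_at timestamptz
    ) WITH (fillfactor = 90);

    -- For an existing table (only affects pages written after the change):
    ALTER TABLE sensor_readings SET (fillfactor = 90);

    -- Compare total updates to heap-only (HOT) updates.
    SELECT relname, n_tup_upd, n_tup_hot_upd
    FROM pg_stat_user_tables
    WHERE relname = 'sensor_readings';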
[00:16:19] Next piece of content: Waiting for PostgreSQL 17: Add RETURNING support to MERGE. This is from depesz.com, and the MERGE capability in Postgres is now getting a RETURNING clause. So when you're doing inserts, updates and deletes as part of a merge, you can now have it return the affected rows to you, and there's even a function called merge_action() that tells you which action produced each returned row. So in one example merge the row returned comes from an insert, and in another the row returned comes from an update. Check out this blog post if you want to learn more.

[00:16:59] Next piece of content: What the hell is transaction wraparound? This is from depesz.com, and it's quite a long blog post, because transaction wraparound can get a little complicated, but he goes through and explains everything about it, including freezing. So if you want to learn more about transaction ID wraparound, you can definitely check out this blog post.

[00:17:21] The last piece of content: there was another episode of Postgres FM last week. This one was on superuser. Nikolay and Michael talk about the superuser privilege that Postgres has, what you can do with it, how it can be dangerous, and they had a long discussion about service providers, whether they provide it or not, and their rationale for doing so. I definitely encourage you to check out this piece of content. You can listen to it here or watch the YouTube video down here.

[00:17:49] I hope you enjoyed this episode. Be sure to check out scalingpostgres.com, where you can find links for all the content discussed, as well as sign up for email notifications of each episode. On the website you can also get an audio version of the show as well as a full transcript. Thanks, and I'll see you next week.
