Episode Transcript
[00:00:00] There's a lot to scaling Postgres, as is evidenced by the number of Scaling Postgres episodes that exist. I used to say that I could never write a blog post about scaling Postgres because it is so vast. So many topics about Postgres are really vast once you get into them. But someone did write a blog post about a roadmap to scaling Postgres, and we will talk about that this week. Also, be sure to stay till the end for my Consulting Corner, where we're going to be talking about a few best practices with foreign keys. But I hope you, your friends, family and co-workers continue to do well. Our first piece of content is "A roadmap to scaling Postgres". This is from kmoppel.github.io, and this is his roadmap. I kind of expected a blog post that was saying okay, first do this, then do this, and then do this. There's a little bit of that in there, but it's a very eclectic mix of things. So he says first you need to have some basic Postgres internals knowledge, which is never too bad to have.
[00:01:03] Next, he talks about data modeling and he does have some recommendations with regard to that. And it is true that the wrong data model can severely limit your ability to scale Postgres. He talks about indexes, and having just the right amount of indexes, but I guess I was looking for a little bit more concrete advice. He says basically don't be afraid of just scaling up your hardware, and that is a path you can go down if you have the budget for it and you need to scale. He mentions increasing statistics when necessary, including adding custom statistics through extended statistics. He mentions configuration tuning. He mentions partitioning, so basically trying to separate your hot data from your cold data. When necessary, look into compression; he does mention the TimescaleDB third-party extension here. And pre-aggregate data when you need to in order to improve performance.
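As a quick illustration of the statistics suggestions, here is a minimal sketch. The addresses table and its columns are hypothetical, just to show the commands involved:

```sql
-- Hypothetical table where city and zip_code are strongly correlated
CREATE TABLE addresses (
    id       bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    city     text NOT NULL,
    zip_code text NOT NULL
);

-- Raise the per-column statistics target when the default (100) is too coarse
ALTER TABLE addresses ALTER COLUMN zip_code SET STATISTICS 1000;

-- Extended statistics teach the planner about the dependency between
-- the two columns instead of assuming they are independent
CREATE STATISTICS addresses_city_zip (dependencies)
    ON city, zip_code FROM addresses;

-- Both kinds of statistics are populated on the next ANALYZE
ANALYZE addresses;
```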
[00:01:58] Use read replicas if that can assist in scaling your app by scaling out the reads. He mentions logical replication replicas, and I think this is sending data over separately for running specific reports. He mentions alternate or hybrid storage engines. He mentions utilizing external caching like Redis, Valkey or memcached. He mentions that sharding is possible and he does mention PgDog, which we covered last week, as well as Citus, as well as Postgres derivatives and other products; he talks about OrioleDB, EDB and YugabyteDB.

But to me this is a very eclectic mix of different things you could do, and I thought about this post and said okay, what would I communicate if I wanted to give a really basic roadmap for scaling Postgres? I actually thought of five different points I would cover. The first would be the data model, which is something he mentions up here as well. For the data modeling I would just go with third normal form or something similar to that. Maybe you have some JSONB fields when you think it would help optimize performance or you just want a key-value column in a table. And at this point, think about partitioning. So if you have data you know you're going to get rid of at a certain point, partitioning is a great use case for that, or if you have streaming event tables. Thinking about partitioning at the data modeling stage could be beneficial in that respect, but it's not necessarily a requirement. The second step would be indexing, and I would say just index your primary keys and your foreign keys, and based upon what your application is doing you probably want to index some dates for performance, so you'll have a few more indexes that way. I really wouldn't worry about getting into all the different index types to start with. Maybe you have a multi-column index if you know the way you're going to be querying that data will always use those columns. Third, I would look into configuration. Now, if your database is being hosted by someone like AWS, they do a lot of that configuration for you. But you do want to change random_page_cost to around 1.1. Otherwise you can look at plenty of guides online to find, say, the top 10 things you can change in your Postgres configuration to get started: things like shared_buffers, effective_cache_size, things like that. The fourth area I would look into would be observability, because these are just some baseline recommendations and you then want real-time feedback from your application on how it's performing. How is the CPU performing given your database size? What are your disk reads and writes, because they could indicate problems. And look at an analysis tool for the different queries coming through. For example, if you're on AWS using RDS or Aurora, they have Performance Insights, but there are also other open source and paid-for tools that can give you similar metrics about the queries going on. Or you could do it old school and just look at pg_stat_activity and pg_stat_statements to tell you what queries are taking the most time in your database, because that highlights the areas you need to change. Sometimes it's only necessary to look at this once a week, or maybe once a quarter, depending on how fast your application is iterating, but that observability piece is definitely important so you know where to focus your efforts. And then my last one, step five, is basically connection pooling.
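To make the configuration and observability steps a little more concrete, here is a minimal sketch. It assumes SSD-backed storage and that the pg_stat_statements extension is available; the values are starting points, not tuned recommendations:

```sql
-- Step 3, configuration: on SSD-backed storage, lower random_page_cost
-- from its default of 4 so the planner is more willing to use indexes
ALTER SYSTEM SET random_page_cost = 1.1;
SELECT pg_reload_conf();

-- Step 4, observability: with pg_stat_statements loaded via
-- shared_preload_libraries, find the queries consuming the most total time
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

SELECT query,
       calls,
       round(total_exec_time::numeric, 1) AS total_ms,
       round(mean_exec_time::numeric, 1)  AS mean_ms
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
```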
So if your application is scaling, you're going to run into that sooner rather than later, so definitely investigate some type of connection pooler. It could be PgBouncer. PgCat was actually mentioned in this blog post with reference to the read replicas, I believe, so that's another connection pooler. Or there are also hosted ones; AWS has RDS Proxy, for example. But I would say that would be my scaling roadmap. If you want to learn more, you can check out this blog post.

Next piece of content: "Life-altering PostgreSQL patterns". This is from mchugh.dev, and we'll just go through this, hopefully quickly, and talk about things that I agree with and maybe some things I don't agree with. So the first thing he says is use UUID primary keys, and this first one I disagree with, because I've never seen more write amplification, especially in the WAL, than with applications using UUID primary keys, because of their random nature. Now if this said use UUIDv7 primary keys, I'd be like, okay, I can see that. The next thing is give everything, I think he means tables, created_at and updated_at columns, which is a practice of my ORM, so that's fine. Next is ON UPDATE RESTRICT, ON DELETE RESTRICT. So he prefers this versus CASCADE, probably because cascade just magically cascades all the changes through, and he has the adage, you know, storage is cheap, recovering data is a nightmare. So I think I could agree with this, but it kind of depends on the use case and the data you're using. Next is use schemas, and I don't necessarily know about this. I mean, a lot of developers, I think, are not as familiar with schemas, so that might confuse them in the database. So I guess personally it depends on the use case. Next is enum tables, which to me just look like lookup tables. So you have a pet kind table and a kind column in the pet table that references it. Sure, that makes sense. Next is name your tables singularly. I don't agree or disagree with that. I would just go with whatever your framework tends to use if you're using an ORM; Ruby on Rails, for example, uses plural, so that's what I tend to use. Next is mechanically name join tables. So basically if you have a person table and a pet table, don't try to name the join table something like pet owner, but just concatenate the two table names together: person pet. I could get behind that. Next is almost always soft delete. So this is having something like a deleted_at column that soft deletes the row, or a revoked_at. Again this goes back to his "storage is cheap, recovering data is a nightmare", which I definitely agree with. And I have specific use cases where I do soft deletes, but not necessarily for everything. Next is represent statuses as a log. So don't just have a single column where you update the value, but have a separate table that tracks that log. That way you can actually get the dates of the transitions. This makes a lot of sense, especially for Postgres, because you would avoid a lot of updates to a table changing the status of a single row. Next is mark special rows with a system ID, and this seemed really odd to me, that you would have special rows in the database, so I don't necessarily agree with that. Next is use views sparingly. So he says they're amazing in some respects, reducing complexity, but they're terrible because, for example, removing obsolete columns requires a drop and recreation. But I definitely agree with using views sparingly. Next is JSON queries, so basically using the JSON capabilities to format complex JSON output that looks like this. So that can make sense.
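As a rough sketch of a few of these patterns side by side, using the hypothetical pet / pet kind example from above, an enum-style lookup table with RESTRICT foreign keys, created_at/updated_at columns, a soft-delete column, and a status log table might look something like this. This is my illustration of the ideas, not code from the post:

```sql
-- Enum-style lookup table
CREATE TABLE pet_kind (
    kind text PRIMARY KEY
);
INSERT INTO pet_kind (kind) VALUES ('dog'), ('cat');

-- Main table with created_at/updated_at, a soft-delete column,
-- and a RESTRICT foreign key to the lookup table
CREATE TABLE pet (
    id         bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    name       text NOT NULL,
    kind       text NOT NULL REFERENCES pet_kind (kind)
                    ON UPDATE RESTRICT ON DELETE RESTRICT,
    created_at timestamptz NOT NULL DEFAULT now(),
    updated_at timestamptz NOT NULL DEFAULT now(),
    deleted_at timestamptz               -- NULL means the row is live
);

-- Represent status as a log rather than a mutable column
CREATE TABLE pet_status_log (
    pet_id     bigint NOT NULL REFERENCES pet (id),
    status     text NOT NULL,
    changed_at timestamptz NOT NULL DEFAULT now()
);

-- "Soft delete" a row instead of removing it, and filter on deleted_at
UPDATE pet SET deleted_at = now() WHERE id = 1;
SELECT * FROM pet WHERE deleted_at IS NULL;
```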
And that was his last recommendation. What's also interesting reading are all the 34 comments on this particular post, where you can see different people agreeing or disagreeing with the points he made above. But what do you think? Feel free to read this through and give your comments.
[00:09:11] Next piece of content: "Making Postgres better with OrioleDB". This is from mistachi.github.io. So OrioleDB is basically a new storage engine for Postgres that pushes the envelope in terms of performance, and it uses an undo log as opposed to storing changes within the actual heap data itself. So basically, as it says, it eliminates the bloat introduced by how updates are handled, the need for garbage collection, which is vacuum, and also transaction wraparound. This is an extension, but in order to use it you still need to apply patches, and he mentions those patches here. He actually did a test with 25 million records on a 20-core system with 64 gigabytes of RAM, using a sysbench-based test from his GitHub site, and the max TPS for stock Postgres looks to be around 3,000, whereas it was around 9,000 or so for OrioleDB. What's more interesting is that the memory used by the Postgres heap was rather high and the CPU usage was around 90%, I guess maybe 85%, whereas with OrioleDB the memory used was very minimal and the CPU used was very minimal as well. So not a ton of detail, but it definitely suggests the promise of what this new storage system can offer. If you want to learn more, definitely check it out.

Next piece of content: there was another episode of Postgres FM last week. This one was on performance cliffs, and in this episode Nikolai and Michael were joined by Tomas Vondra. Performance cliffs are basically: your system is operating well, things are fine, maybe things are gradually slowing down as your system gets a little bit more active, but then suddenly things go haywire and performance drops off a cliff. That's where the name performance cliff comes from; something happened in the system and suddenly that happens. Normally where I've seen this is that enough data has been entered in a particular way that it actually triggers a plan change in Postgres. So given the statistics it has at hand, it decides to change the plan. They mentioned in this episode that essentially adding one more row suddenly causes the planner to behave differently and choose a different plan, maybe going from a linear search to a hash search. But they did discuss a lot about optimizations to the planner, and even potentially at the execution level, what kind of changes could be done to optimize performance and avoid some of these performance cliffs. Now, with regard to a plan change, a lot of times if you keep your statistics up to date as you grow, like extending the amount of data that gets collected for statistics as your data grows in size, that can help avoid some of these performance cliffs because the statistics will be that much more accurate. But Michael made an interesting point that a lot of times people may choose worse average performance to avoid falling off a cliff. One example of this was Nikolai mentioning, you know, always turning JIT off. JIT, or just-in-time compilation, can be pretty good for certain use cases, but sometimes it causes things to be even worse. So a lot of times what he says he does is turn JIT off: he doesn't get the benefit in those cases where it's better, but he doesn't get hurt as badly where the performance gets much worse. The overall performance maybe is a little bit worse, but he doesn't get so much random variability in the performance at times. If you want to learn more, you can listen to the episode here or watch the YouTube video down here.
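For reference, both mitigations mentioned here, collecting more statistics and turning JIT off, are just configuration changes. A minimal sketch, with the values and the orders table picked purely for illustration:

```sql
-- Collect more statistics as data grows, either globally...
ALTER SYSTEM SET default_statistics_target = 500;
SELECT pg_reload_conf();

-- ...or per column on tables where the estimates tend to go wrong
ALTER TABLE orders ALTER COLUMN customer_id SET STATISTICS 1000;
ANALYZE orders;

-- Trade some best-case performance for predictability by disabling JIT
SET jit = off;               -- current session only
ALTER SYSTEM SET jit = off;  -- cluster-wide, takes effect after a reload
```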
Next piece of content: "Time to better know the time in PostgreSQL". This is from boringsql.com, and this is an interesting post all about time, all aspects of it: talking about times and time zones, how to work with time, the use of intervals, the different storage of timestamps, how to get the current time, as well as time ranges and how to work with those. So a comprehensive blog post about this; if you're interested, definitely check it out.

Next piece of content: "A Practical Guide to Taming Postgres Isolation Anomalies". This is from dansvetlov.me, and this is a really long blog post. I dare say it's a mini-book, given how comprehensive and how long it is. So if you want to learn about Postgres and isolation anomalies, this is definitely a mini-book you can consider reading.

Next piece of content: "Creating histograms with Postgres". This is from CrunchyData.com, and this gets into the SQL queries that will allow you to generate histograms. You can see some of the SQL gets quite complex as they build it up. They're basically binning data into particular buckets and then charting those results, and the SQL capability they're using the most is width_bucket, to be able to bucket those different values. But if you want to learn more, you can check out this blog post.

Next piece of content: "pgai: importing Wikipedia into PostgreSQL". This is from cybertec-postgresql.com, and pgai is an extension, I think Timescale developed it, that allows you to do all sorts of AI work within Postgres. So he goes through the process of using pgai to load data from Wikipedia, generate embeddings, and then look at the results. He didn't actually query it yet, so this is basically just loading the data in. I assume at some point he's going to be querying against this data in a future blog post.
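Going back to the histograms post, here's a quick illustration of what width_bucket does. This is a simplified sketch with a hypothetical measurements table, not the exact queries from the Crunchy Data post:

```sql
-- Hypothetical table of numeric readings
CREATE TABLE measurements (value numeric NOT NULL);

-- Bin values between 0 and 100 into 10 equal-width buckets and count each one
SELECT width_bucket(value, 0, 100, 10) AS bucket,
       count(*)                        AS freq,
       repeat('*', count(*)::int)      AS bar   -- crude text chart
FROM measurements
GROUP BY bucket
ORDER BY bucket;
```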
[00:14:54] Next piece of content: "PostgreSQL 18: support for asynchronous I/O". This is from dbi-services.com. It looks like patches are in to support asynchronous I/O, so he wanted to check out its performance. He mentions some of the different configuration parameters, and he did some testing on a t3.large system on EC2. Unfortunately, in his testing he didn't see much of a performance difference from synchronous I/O, but I'm assuming he's using an EBS volume, and I wonder if that is masking some of the performance changes that might have happened. Basically the features have been added, but in the particular example case he built, he wasn't able to see performance differences yet. If you want to learn more, definitely check out this blog post.

And the last piece of content: there was another episode of Talking Postgres, I think it's a monthly podcast, and this one was on open source leadership with Bruce Momjian. So if you want to learn more, you can listen to the episode here or watch the YouTube video.

And now it's time for the Consulting Corner. This one I'm going to talk a little bit about foreign keys. If you remember, when I was talking about my path to scaling Postgres, I said you basically put indexes on your primary keys and your foreign keys. Well, it's very important to put an index on your foreign key column if you're actually making a foreign key relationship with it. For example, let's say you have a products table and an orders table, and there's a foreign key relationship from the orders to the products. If you're going to delete a product, the system needs to find every order where that product exists, and how it does that is through the foreign key, so you're going to need an index on that foreign key column to do the delete efficiently. And this can get even more complex if you're talking about partitioned tables, because usually when I set up partitioned tables I'm using multi-column indexes. Generally the first column of that index is the partition key and the second column is a more unique identifier. So if I'm trying to create a foreign key relationship with a partitioned table, I'm actually going to need to reference two columns so it can find the partition and the unique row it needs to check. Just something you need to keep in mind when you're using foreign keys.
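To sketch that foreign key advice in SQL, here's a minimal illustration. The table names, and the partitioned events/event_notes shape, are hypothetical examples of the patterns described, not anything from a specific client:

```sql
-- Plain case: index the referencing column so a delete on products
-- doesn't have to sequentially scan orders for matching rows
CREATE TABLE products (
    id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY
);

CREATE TABLE orders (
    id         bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    product_id bigint NOT NULL REFERENCES products (id)
);

CREATE INDEX ON orders (product_id);

-- Partitioned case: the referenced table's key must include the
-- partition key, so the foreign key has to reference two columns
CREATE TABLE events (
    created_at timestamptz NOT NULL,
    id         bigint NOT NULL,
    PRIMARY KEY (created_at, id)
) PARTITION BY RANGE (created_at);

CREATE TABLE event_notes (
    event_created_at timestamptz NOT NULL,
    event_id         bigint NOT NULL,
    note             text,
    FOREIGN KEY (event_created_at, event_id)
        REFERENCES events (created_at, id)
);

-- Index the referencing columns for the same reason as above
CREATE INDEX ON event_notes (event_created_at, event_id);
```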
[00:17:10] I hope you enjoyed this episode. Be sure to check out scalingpostgres.com where you can find links to all the content discussed, as well as to sign up to receive weekly notifications of each episode. There you can also find an audio version of the show as well as a full transcript. Thanks and I'll see you next week.