Companion Databases? | Scaling Postgres 292

[00:00:00] So I'm actually recording this the Wednesday before Thanksgiving here in the US. But by the time you're seeing this, Thanksgiving will have already passed. So if you celebrate Thanksgiving in the US, I hope you had a great one for everyone else. I hope you, your friends, family and coworkers continue to do well. Now, before we get to the first post this week, as I've mentioned in previous episodes, Scaling Postgres, my course is available for early adopt and as you can see, the title is Ludicrous Speed Postgres. So basically this is my Performance Optimization course and this is for early adopters. Because the course is not complete yet, it's actively being worked on. The projected release date is January 29 of next year, 2024. But if you want a 50% discount as well as to provide feedback as I am completing the course, I'd love to have you join. The link will be in the show notes on the website, but keep in mind the offer ends the end of day Monday, which is 1127. So if you want to join, you do have a limited time to do that. So you can use the link or you can just go to the Courses section and it will be available here as another course by the time this video is posted. But this week's first piece of content is companion databases. This is the episode of Postgres FM that happened last week and Nikolai and Michael were talking about databases that you use alongside your Postgres database. So maybe you have a redis cluster that you've set up for maybe doing job queues, or maybe you've set up an elastasearch engine for doing searches, or an analytical database, or a vector database for storing embeddings. So they talked about, does this make sense, does it not make sense? And their conclusion I definitely agree with where you can probably go longer than you think just using Postgres for all of these different features. Because Postgres can do full text cert, it can store your embeddings and be able to query them. It can do some analytical workloads. It's just when you're really pushing the envelope of what a single instance can do, that you need to start choosing a different path. So maybe that's using a separate type of data store for your analytics events. Or maybe you need help scaling out and use something like Citus. Or you have a lot of time based data you're working with and want to use something like TimescaleDB, or you just don't have all the search features that you want. Or you need a little bit more performance. Maybe you do need to set up an elasticsearch cluster. It's just the downsides of doing that is that you're going to have to have more management, you're going to have to learn a whole new to set up technologies and how to use them, as well as keep up with data synchronization across all these different database systems. But I thought this was a really good episode and encourage you to check it out. You can listen to the podcast here or you can watch the YouTube video here. [00:02:46] Next piece of content designing your postgres database for multitenancy. This is from Crunchydata.com. They're talking about the three typical approaches for doing multi tenancy in a database. So what that basically means you have multiple accounts that are going to be using the same database and their data needs to have firewalls between them. So data should never cross over one another and get mixed up. The first way to do it is have one database per customer. And this is definitely the hardest to do because imagine you have ten customers and then 100, and then 1000, and then 10,000 if you are so lucky. Well, now you got to manage 10,000 databases, which is no easy feat. So this probably only makes sense if you're going to have gigantic enterprise based customers. The next option is a customer per schema. So you set up a schema for each customer. And this is only a little bit better than every customer getting their own database because you still have to run separate migrations in every schema for every customer. They said if you're using cytis, it does have a few tools to kind of help with some of it. They say you can join across customers for cross tenant analytics. But again, I would think that would be really hard with a lot of schemas. But basically at scale this breaks down as well. The third way is to have what they're calling a tenant discriminator within each table. And that is say, having an account ID or a customer ID or a tenant ID. In their example here, they're showing an.org ID, just whatever Identifier you use for each independent customer of your service. That Identifier needs to be placed in every table so that you can identify who owns what data. And this is a technique that I've seen every software as a service application I've worked on use. I have not seen one that's used a schema or a database per customer. They've all used this technique and they recap the approaches down here. Basically a database per customer, maybe you have tens of tenants, but it's a very high overhead for management. Database per schema you have hundreds of tenants is maybe the max you want to go. You still have overhead of schema migrations, the tenant discriminator. You can have millions of tenants. The disadvantage is the lack of strong tenant isolation. Now, they do say there are a number of libraries that kind of help with this tenant isolation, like Ruby on Rails and Django have different libraries to help you handle multi tenancy. And then lastly, the blog post covers up with Citis as a way to scale out your database once you reach a certain level. But check out this blog post if you want to learn more. Next piece of content is SQL good? This is from Momgm US, and he's referencing a 77 email thread from two years ago where that was titled The Tragedy of SQL. That was saying we could be much further along if we didn't have to use SQL. And we had another alternative query language, and his proposal was using Data Log. Now, with that, I did look up Data Log and here's the Wikipedia entry, and I started looking at the syntax of it and I was like, Is this supposed to be easier? Because colon dash is supposed to mean if so, it's a function that says if X is a parent of Y, then X is an ancestor of Y. Then I looked down at some other syntax and looking at something like this I hate to say it, but I'll take SQL any day just for readability. So SQL may have its problems as they're proposing, and maybe this is a better alternative for particular use cases or for certain reasons, but I would think that the Usability definitely has to increase significantly to have a shot at approaching SQL. But they had other people comment on it. I thought they were some pretty interesting perspectives, but check out this post if you want to learn more. [00:06:40] Next piece of content. Also from Momgm US is transaction block isolation levels. And here he's talking about how isolated your particular transactions are as they're running. So by default, postgres uses read committed. That means every read that's been committed from the time you start a query, you can see everything that's been committed thus far. But in a transaction block, you may not want that to happen. You may want to have a consistent view of the database at the point the transaction started. If that's the case, you can use repeatable reads, so you can change the isolation level to be able to achieve that. Or if you want to assure that everything happens serially and avoid transactions stepping on each other's toes, you could use serializable isolation. Just be aware it has the potential to cancel particular queries. So you need a retry setup in this case. But this is a quick post about different isolation levels you can check out if you're interested. [00:07:37] Next piece of content river a fast, robust job queue for Go and Postgres. This is from Brandyear.org, and he at one time was using a job queue on Postgres and ran into a lot of issues, but apparently it was on maybe version 9.4. So a lot has changed since then. But he's been working with Queues recently and been frustrated with a redis based queue because of its inability to handle transactions. So things would get in different states where a job was admitted to redis, but the data needed for it hadn't been completed yet, so the job would fail and retry, or you have transaction rollbacks. That means the job will never be able to start, so it's continuously failing when trying to start and numerous other issues with that. So he actually wanted to create a transaction based job queue to avoid some of these problems. So he created river for this purpose, and it is based upon Go using PGx driver, and he discusses a little bit about it. Here how it works, as well as some of the enhancements he appreciates in postgres that has happened to make it easier to create queues in postgres number one being skip locked, being introduced in 9.5, having reindexed concurrently in twelve or even partition tables in postgres 14. Postgres 13 handling B tree deduplication better and postgres 14 being able to skip B tree splits. So if you want to learn more about this queue, definitely check out this blog post. [00:09:06] Next piece of content exploring a postgres query plan. This is from Notes Eatonfil.com and this is another tour de force post where he's walking through how a postgres query plan works. So he goes really into detail on this and shows how things work. So if you want this type of depth, definitely encourage you to check out this blog post. Next piece of content postgres plus citis plus partman your IoT database. This is from Crunchydata.com and he's coming up with a scenario saying hey, you want to set up an Internet of Things data set. So basically you have a bunch of sensors you want to monitor, in this case cars. So he's tracking cars, you have the sensor data and the location data and the first thing he does, he wants to shard the data set. So he's going to be using Citus to do this. Now, he does reference Device ID in the text but in the actual code he's saying Car ID. There's no device. ID, so I think they're essentially equivalent. But he runs these two commands to create distributed tables across the different shards. Shows an example query here, but of course this will run across all 32 shards, which is not ideal. So you need to do the same technique you would use in partition tables by whatever key you partition by, you need to query on same thing with sharding, however you're sharding the data, you need to use that same key. So when you specify the specific Car ID you're looking at, it will then be localized to that particular shard and ideally on the particular node that that shard is running on. And as long as your same data is colocated for this particular shard, you can do joins intra node within the same node and not have to do cross node joins which would be much less performant. And then they add a time series component. They've added Pgpartman, although you don't have to use this extension, you can just partition them however you like yourself. And here they're partitioning by a timestamp. The reason being they want to be able to archive older data. [00:11:04] They show example of setting this up and then followed up with being able to do columnar compression with some of the old data to really compress down the older data. But when using Citis for columnar compression, the thing you need to keep in mind is that you can't do updates and deletes on it and logical replication or decoding doesn't work with it. But in terms of compression, they got about a ten x compression when they went to column or storage for the older data. But check out this post if you want to learn more. [00:11:34] Next piece of content. Cytus live query inspection using Cytus stat activity. This is from Cyber Postgresql.com and if you know anything about postgres, of course it has a system view called PGSTAT Activity to look at the current running queries on the system. Well, Cydus has Cytus Stat Activity, so it does the same job, but it does it across the nodes. So you can query I think it's the coordinator node and be able to determine what queries are running on which nodes. So he shows an example of that here. So check this blog post if you want to learn more about that. [00:12:09] Next piece of content. Teach Kelvin your thing. High Performance PostgreSQL for Rails. This is from Andyatkinson.com. And Andrew was on the YouTube channel Teach Kelvin Your Thing, and he talked about his upcoming book, High Performance PostgreSQL for Rails. So you can watch the YouTube episode here, it's about an hour in length, or you can look at the Q and A questions that he posted down below. [00:12:34] Last piece of content learn PostgreSQL second Edition this is from fluca 1978 GitHub IO, and this is the second edition of his book that Luca wrote along with Enrico. So if you want to check out his new editions, you can check out this blog post and even the link for the book. I hope you enjoyed this episode. Be sure to check out Scalingpostgres.com, where you can get the links for all the episodes discussed in the show, as well as the podcast version and eventual transcript of the show. Thanks and I will see you next week.

Show Notes

Episode Transcript

Other Episodes

Episode 284

23 Or 1.1 Times Faster Performance? | Scaling Postgres 284

Episode 281

HNSW Indexes, Vacuuming Bloat, Watch Me Now, Connections | Scaling Postgres 281

Episode 248

Pagination Solutions, Return Modifications, Insert Deletions, How To JSON | Scaling Postgres 248