100K Tenants, Extensions, Query Planning, Crosstabs | Scaling Postgres 19

Episode 19 July 02, 2018 00:17:42
Scaling Postgres

Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, we review articles covering 100,000 tenants, Postgres Extensions, query planning and how to create crosstabs.

To get the show notes as well as get notified of new episodes, visit: 

https://www.scalingpostgres.com/episodes/19-100k-tenants-extensions-query-planning-crosstabs/


Episode Transcript

[00:00:00] In this episode of Scaling Postgres, we talk about 100,000 tenants, extensions, query planning, and crosstabs. I'm Creston Jamison, and this is Scaling Postgres episode 19. [00:00:21] All right, our first article is called Options for Scaling from One to 100,000 Tenants, and this is from the citusdata.com blog. Now, this kind of blog post is interesting because it goes over a story or scenario, presumably one that Citus Data has heard from customers, where someone is scaling their software-as-a-service application; in those, accounts are typically called tenants. So it talks about the scenario where you start off with one customer, and you get another and another and another, and your database needs to scale as well. [00:01:01] And some of the questions they propose you may be thinking about: should we move off of the cloud and back to an on-premise solution, potentially a proprietary database appliance, because you're maxing out a 488 GB memory instance on AWS? Or are you looking at other cloud providers to see who can give you either a better deal or perhaps a larger memory size? And then they talk about: do you want to consider manually sharding at the application layer, to be able to send queries to different database instances, with all the issues that can potentially come from that and how difficult it can be to do? [00:01:49] Or do you want to migrate from a relational database to a NoSQL database, because presumably those scale better? But of course those have a lot of downsides, and of course their ultimate solution is: adopt a solution that takes care of sharding for you at the database layer, and that's pretty much what Citus does. It does sharding of your PostgreSQL instances. So it is definitely a blog post about the Citus Data solution, but there are some other considerations that should be taken into account, or other potential ways of doing it.
So for example, if your limit is the 488 GB of memory, well, your target is to try to keep a cache hit rate of around 99%. [00:02:37] Can you do things to reduce the in-memory working set? So, for example, can you do things like partitioning your table data? Can you create summary tables that allow users of your application to pull back summarized data quite efficiently? That can be a much smaller amount of data versus having all your history that has to be aggregated whenever that type of data is requested. So there are some other steps you can take, using stock Postgres, to push back the point at which you will have to do sharding. Now, I think there is a stage at which you'll have to do it, but one thing this blog post does not go into, which makes sense because they want people to use their platform, is that there are other steps you can take to extend your runway, essentially, before you have to do sharding. But this was an interesting blog post, and I myself was thinking through, okay, what are some of the things I would do before choosing to take the step to do sharding? Or what other options are available other than Citus? And we're going to look at some future blog posts that talk about some of the different solutions using foreign data wrappers, and some things of that nature in some of the upcoming stories, where you could potentially put together a semi-sharding solution. Another way to potentially scale that wasn't mentioned in here, and that people typically do, is to set up multiple read replicas, because generally your master or your primary database takes all the writes, but most applications are read-heavy. So if you have multiple read replicas, that can help, again, extend the runway before the point at which you have to do sharding. [00:04:17] The next post is actually a YouTube video.
Again, this is from Citus Data, it's from Ozgun Erdogan, and the title is SQL, Scaling, and What's Unique About PostgreSQL. Now, this was an interesting title, and I didn't really feel the content reflected what's in the title. Essentially the content, from my perspective, is that he was talking primarily about Postgres's support for extensions and how that really empowers Postgres to fill a lot of use cases. [00:04:55] And they have firsthand experience with this, because I believe initially Citus, their sharding solution, was a fork of Postgres, but then they re-engineered it to be an extension of Postgres. So essentially they tie into the Postgres ecosystem using these extensions, and you can add more extensions to your system to expand its capabilities. So it's a very interesting presentation, and I definitely suggest learning about extensions, because if you like Postgres and you have a particular need you need it to fill, but it's not quite there yet, perhaps you could develop an extension to fill that need yourself. Now, some of the other areas it talked about, in addition to extensions: they brought up the claim that relational databases can't do certain things. The first is processing semi-structured data, and of course the answer to that is Postgres's support for XML data types, JSON data types, and particularly the JSONB data type, the binary storage format for JSON. The second is running geospatial workloads. Well, there's PostGIS, which itself is an extension, and it adds all these functions and capabilities to do geospatial analysis. Then they talk about non-relational data storage, and a lot of this can be handled by extensions or foreign data wrappers. For example, they mentioned, I believe, a foreign data wrapper they worked on that supports column storage. But also, I believe he said that there are over 100 foreign data wrappers.
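To make the foreign data wrapper idea concrete, here is a minimal sketch using postgres_fdw, the wrapper that ships with Postgres; the host, database, credentials, and table names are hypothetical placeholders:

```sql
-- Minimal postgres_fdw sketch; host, dbname, credentials, and table
-- names are hypothetical placeholders.
CREATE EXTENSION postgres_fdw;

CREATE SERVER remote_pg
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'db2.example.com', dbname 'analytics', port '5432');

CREATE USER MAPPING FOR CURRENT_USER
    SERVER remote_pg
    OPTIONS (user 'app_user', password 'secret');

-- Expose a remote table locally; queries against it are shipped
-- to the remote server behind the scenes.
CREATE FOREIGN TABLE remote_events (
    id        bigint,
    tenant_id bigint,
    payload   jsonb
)
    SERVER remote_pg
    OPTIONS (schema_name 'public', table_name 'events');
```

Once defined, a query like `SELECT count(*) FROM remote_events` runs against the remote instance as if the table were local.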
So you can get access to S3, you can interact with Oracle, you can interact with MongoDB, DynamoDB; basically, you can interact with all these other data stores using these foreign data wrappers, bringing data in or potentially pushing data down into those storage systems to be able to work with them. And there's a future blog post coming up in a story or two that talks a little bit about foreign data wrappers and some considerations with regard to them. And then fourth, he talked about scale-out for large data sets. So of course they're emphasizing the Citus Data solution for scaling out. But again, whenever I see this, you also have to remember that it is on the horizon for core PostgreSQL to do sharding or scaling out, again utilizing the concept of foreign data wrappers and pushing the computations down to foreign PostgreSQL instances. But overall, this wasn't too long a YouTube video, so I definitely suggest you check it out. [00:07:36] The next article is Software Engineering Radio episode number 328, featuring Bruce Momjian on the Postgres query planner. So this is an over-hour-long episode that talks about the query planner. I would say about the first 18 minutes were pretty basic, so if you are at an intermediate or advanced level, maybe you want to consider starting to listen around the 18-minute mark; or if you are more of a newbie, you could start from the beginning. But it talks about considerations of the query planner and how it works internally. So basically, SQL is essentially a declarative language rather than an imperative language, and therefore it's the planner's job to interpret what you want to achieve and then turn it into specific instructions, and he goes into how it does that and some considerations.
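As a quick illustration of that declarative-to-imperative translation, EXPLAIN shows the plan the planner settled on for a query; the table and columns here are hypothetical:

```sql
-- EXPLAIN reveals the planner's chosen strategy for a declarative query;
-- the events table and its columns are hypothetical.
EXPLAIN (ANALYZE, BUFFERS)
SELECT tenant_id, count(*)
  FROM events
 WHERE created_at > now() - interval '1 day'
 GROUP BY tenant_id;
```

The output shows whether it picked a sequential or index scan, which aggregate strategy it used, and the estimated versus actual row counts, which is often the first place to look when a query is slow.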
So it's definitely an interesting episode to listen to, particularly if you want to get a little bit more insight into how the, I guess, magic works. [00:08:46] The next post is called Planning Queries Involving Foreign PostgreSQL Tables, and this is from the Walking with the Elephants blog. This is a good complement to the previous radio episode, because it goes into more depth about cost-based optimization in terms of what the query planner is doing. So it's a great article to read after that, to cement your knowledge of what was discussed. And then he goes into the things that make foreign tables different. So you don't necessarily want to constantly collect statistics from these foreign data sources, because that would incur a pretty large cost in terms of network traffic, and you'd have to do it periodically. So he goes over considerations you need to take into account with regard to that. And then he also goes over a few settings, such as use_remote_estimate, and how you set these different parameters and their effect on Postgres's behavior at this stage of development. In one scenario you get a very small planning time, but the execution time is a little longer; but when use_remote_estimate is enabled, you actually get a much longer planning time, while your execution time is shorter. [00:10:10] So there's no ultimate solution for getting optimized plans; it's mostly a blog post about some considerations and things to think about. And potentially a resolution will be coming in a future Postgres version, but it's definitely a blog post about some of the existing issues, or potential gotchas, you have to be aware of if you're going to start using foreign data wrappers. But definitely an interesting blog post to check out. [00:10:37] The next blog post is Static and Dynamic Pivots.
So it's basically talking about a pivot table, or what I've frequently heard called a crosstab table. So for example, maybe you have a table that lists, by year, the number of rainy days, and you want to pivot it so that your column headers are the years and it shows the rainy days beneath them. Historically, in other databases I've used, I've used case statements to achieve this, and that's what he calls a static pivot, because you can use case statements to achieve this in PostgreSQL too, although they also offer the FILTER clause as of PostgreSQL 9.4. So you can use some type of aggregate, like a sum or a count, and then use this FILTER clause to develop a static pivot or crosstab. [00:11:31] And he also mentions there is a tablefunc extension that provides a crosstab function as well, so that is a potential solution you could use. But of course this article also talks about dynamic pivots, and he actually goes into detail about how to potentially do dynamic pivots with your data. So if you are interested in crosstab reports and having them generated on the SQL side, as opposed to trying to do something on your client, it's definitely a blog post to check out. [00:12:03] The next post is Gracefully Scaling to 10,000 PostgreSQL Connections for $35 a Month, Part Two. So last week we talked about part one; this is part two of the article. Again, the article is essentially about PgBouncer, and again it doesn't really explain what the $35 a month refers to, but this second part actually goes more into the configuration of PgBouncer and the main areas you need to be concerned with. Generally, you want to set up your databases section so that you can funnel connections to the correct PostgreSQL instance; they suggest enabling the transaction pooling mode, and then configuring max_client_conn, max_db_connections, and some additional parameters that you want to adjust.
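As a rough sketch of the sections they discuss, a minimal pgbouncer.ini might look like this; the host address and the sizing numbers are placeholder values, not recommendations:

```ini
[databases]
; funnel client connections for "app" to the correct PostgreSQL instance
app = host=10.0.0.5 port=5432 dbname=app

[pgbouncer]
listen_addr = *
listen_port = 6432
pool_mode = transaction      ; the transaction pooling mode they suggest
max_client_conn = 10000      ; client connections PgBouncer will accept
default_pool_size = 20       ; server connections per database/user pair
max_db_connections = 50      ; overall cap on connections to one database
```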
So it's a bit shorter than the previous article, but they are saying there will be a part three coming up. But if you are in the process of configuring PgBouncer, it's definitely an article to check out. The next article is Architecture and Tuning of Memory in PostgreSQL Databases. This goes over a little bit of memory architecture, first talking about how you essentially have your database cluster, and that has one common area of shared memory: your shared buffer pool is located there, your WAL buffers are there, and your commit log is located there, in terms of things that are memory-resident. And then each process (so you can imagine there are multiple processes) has its own temp buffers, work memory, and maintenance work memory. So after describing each of these areas and their purpose, it then goes into, okay, how would you configure these, and gives you recommendations for shared buffers, which is generally 25% of the system memory, how to configure work memory appropriately, the maintenance work memory, effective cache size, and the temp buffers. So not only does it give you a little bit of insight into how memory is structured and works in PostgreSQL, it also gives you suggestions on how to configure it. [00:14:11] The next post is Upgrade Your Partitioning from Inheritance to Declarative, and this is from the Walking with the Elephants blog. Now, they're talking here about how, prior to PostgreSQL 10, the only way to do partitioned tables was through inheritance: you had a parent table, and you manually created the child tables and triggers to direct data to the appropriate partition. However, in version 10 they introduced declarative partitioning. So if you've set up partitioning through inheritance, this blog post goes over how you would want to transition to declarative partitioning, because it just makes the management easier and there's less that you need to do.
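For comparison, this is what the declarative style looks like in PostgreSQL 10 and later, using a hypothetical measurements table; no triggers are needed to route rows:

```sql
-- Declarative range partitioning (PostgreSQL 10+); the table and the
-- date ranges are hypothetical.
CREATE TABLE measurements (
    logged_at timestamptz NOT NULL,
    reading   numeric
) PARTITION BY RANGE (logged_at);

CREATE TABLE measurements_2018_q3 PARTITION OF measurements
    FOR VALUES FROM ('2018-07-01') TO ('2018-10-01');

-- Rows are routed to the matching partition automatically.
INSERT INTO measurements VALUES ('2018-07-15', 42.0);
```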
So this blog post walks through it: taking a backup, starting a transaction, and doing the things you need to do to efficiently alter your partitioning scheme from an inheritance basis to a declarative basis. So it's definitely a great blog post to check out if you have partitioned tables and you want to make the transition or the switch. [00:15:18] The next post is PostgreSQL's Extended Statistics. [00:15:22] So this is a new feature in Postgres 10. Most statistics are collected by default on the basis of individual columns, but you may have cases where there are dependencies between columns. For example, in the table mentioned here, you have a day column, a quarter column, and a year column, where there are dependencies between each of these. So this basically talks about the new way to collect statistics across columns with extended statistics. Basically, you can use CREATE STATISTICS to create dependencies between particular columns, and you can use the special view pg_statistic_ext to look at the data about these statistics. And this should enable the planner to give you better query plans for these particular columns. So if you're wanting to use this new feature, definitely a blog post to check out. [00:16:17] The last article is PostgreSQL Concurrency: Data Modification Language. Now, it mentions concurrency, but a lot of the post is basically about DML, or data modification language. So this is a pretty basic post: it goes into developing a data model, inserting data using INSERT INTO or INSERT ... SELECT, but it also goes into a little bit of the background in terms of how PostgreSQL works under the covers. So they're talking about the UPDATE statement and its impact on concurrency, because PostgreSQL uses MVCC, and how it does row locking; essentially, every update creates a new row, and it then has to vacuum the old one. And then they address the DELETE command and how that works.
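The row-versioning behavior described there is easy to observe with the xmin and ctid system columns; this is a minimal sketch against a hypothetical table:

```sql
-- Every UPDATE writes a new row version; xmin (creating transaction id)
-- and ctid (physical location) change, and VACUUM later reclaims the
-- old version. The accounts table is hypothetical.
CREATE TABLE accounts (id int PRIMARY KEY, balance int);
INSERT INTO accounts VALUES (1, 100);

SELECT xmin, ctid, balance FROM accounts WHERE id = 1;

UPDATE accounts SET balance = 90 WHERE id = 1;

-- Same logical row, but new xmin and ctid values.
SELECT xmin, ctid, balance FROM accounts WHERE id = 1;
```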
So this is definitely a more introductory type post, but if you're wanting to learn more about the data modification language and how it works with PostgreSQL, it's definitely a blog post to check out. That does it for this episode of Scaling Postgres. You can get links to all the content presented in the show notes. Be sure to head over to Scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you could subscribe via YouTube or iTunes. Thanks.

Other Episodes

Episode 274

July 16, 2023 00:14:40

Subquery Performance, Avoid Update Locking, Column Changes, Outage Workshop | Scaling Postgres 274

In this episode of Scaling Postgres, we discuss subquery performance, how to avoid excessive locking when doing updates, how to change a column's datatype...

Episode 8

April 16, 2018 00:15:06

Scaling Real-Time Analytics, Covering Indexes, 1,500 Upgrades | Scaling Postgres 8

In this episode of Scaling Postgres, we review articles covering real-time analytics at scale, covering indexes in Postgres 11, 1,500 Postgres upgrades and PostgreSQL...

Episode 115

May 25, 2020 00:12:50

PGCon Online, Performance Tips, Tracking Counts, Essential Monitoring | Scaling Postgres 115

In this episode of Scaling Postgres, we discuss PGCon going online, application performance tips, ways to track counts and essential areas to monitor. To...
