Multiple Databases, Slow Queries, Sorting Performance

[00:00:00] In this episode of Scaling Postgres, we talk about scaling with multiple databases, detecting slow queries, sorting performance, and using foreign data wrappers. I'm Kristen Jameson, and this is scaling postgres episode 27. [00:00:22] Alright, we have a lot of content to go over this week, but I'll try to move through it quickly. [00:00:29] The first post is Scaling at Instacart, distributing data across multiple postgres databases with Rails. So this is from the company Instacart and this is from the tech Instacart.com blog. [00:00:44] So they had a challenge, basically, they had a lot of data and they were going to be running out of space. So they were going to be hitting a six terabyte limit raised to twelve terabytes before they finished on Amazon RDS. So basically the solution that they actually came up with was segmenting their data into two separate databases. So the post describes it as they have a four sided marketplace, essentially customer shoppers who pick up and deliver the grocers to customers, as well as they work closely with retailers and consumer packaged good companies. [00:01:22] So basically, they examined all the tables in the database and they only found a few dozen tables that were needed by multiple domains. And they're saying, for example, when a customer places orders, the orders are written to the customer's database in the customer domain. However, that same data is also needed in the fulfillment domain. As this diagram demonstrates, they wanted to separate their data into two databases along a domain boundary. So this would be the customer domain, say, and this would be the fulfillment domain, say, and each database would have ownership of specific tables. Now, if you're looking at this, you're probably thinking about microservices. However, they did this a little bit differently. So they used one application that actually talks to both databases, depending upon what table they wanted to fill information in on. So they weren't using a microservices architecture, but they were having separate databases and they would call it to the database based upon the information that they needed. So essentially, they have a monolith application. And this code shows how they're using two database connections with Rails. Now, one of the databases is essentially the master of a particular table, and because they wanted to avoid master to master, they basically assigned this particular database would own this table, and if the other database needed that data, it would be copied over. And then they put some controls in place, as demonstrated in this code, at both the database user level to prevent writes as well as in the application to prevent writing to essentially what should be a readonly table. And then basically they did one way sync between the databases, copying the table information that was needed from one database to another using what they're calling an application level data pump. Now they said they experimented using foreign data wrappers in conjunction with materialized views. But they said, quote, the experiment wasn't successful, but we found materialized view refreshers to be potentially problematic at scale. But it's interesting. I wonder if the foreign data wrappers could have worked. But basically they didn't use foreign data wrappers. They didn't seem to use logical replication, nor did they use logical decoding. Those are some postgres feature they could have used, but they decided to do replication at the application level and they said it gave them greater logical control. And they're using something called Hub, which is their own Pub sub implementation. And they go over some of the things that they're doing whereby they are choosing an eventual consistency model. So eventually those copy read only tables will be brought over to the next database, they do some things to verify data integrity of course and then they went through their whole process of actually migrating a table to reside in this separate database. So this is definitely an interesting approach to take and again it harkens back to looking at microservices where each of those services tend to have their own database but here they have a monolithic application. But I guess due to the constraints they were seeing with RDS, they decided to actually shard their database solution into two or potentially more databases, I'm not sure, but they use that one monolithic application to write two specific tables in each of those databases and then sync data that's needed from one to the other. So definitely an interesting blog post to check out. [00:04:56] The next post is Three ways to detect slow queries in PostgreSQL and this is from the CyberTech Postgresql.com blog and basically number one is using the slow query log. So this is in the PostgreSQL configuration. You can set the log min duration statement and any statement that is longer than this many milliseconds will be logged to the PostgreSQL log. So any queries that happen will show up in there and you can look for slow queries in there and they note here you can also do it for specific databases. So one database could have a setting that's different from another. And then if you happen to see a slow query pop up you would basically use an explain analyze on that query to see what potentially took it so long to run. And what was the query plan? Now, another alternative with regard to query plans that they mention is that you can use auto explain. So this is something that you can load into your session and turn it on and it will automatically log queries that are longer than a specific duration and he says you can even in the configuration do set session preload libraries to enable it. So that way if you have a slow query it will not only log that query but also give an explain plan for what plan was taken to execute that query and the third is checking PG stat statements. So the first options are great for finding slow queries. However, when you have many many thousands or millions of short running queries, but it's because they're running for so long. Even something that's taking 100 milliseconds, if you're running it thousands, millions of times, that could actually be the bottleneck. And if you go to reduce the speed of that query, even though it is pretty fast, if you can shave off 50% of how long it takes to run, that could have a huge benefit if you're running it a lot. [00:06:46] And basically to enable PG Stat statements, you just add it to the shared preload libraries of your postgresql.com file, you create the extension and then it allows you to collect statistics on queries. Now, he doesn't go into depth about using this in this post, but there is a link post about using PG Stat statements in order to track how often these queries are run and the resources used. So, good blog post to check out. The next post is from the same blog. It's PostgreSQL improving sort performance. And this is again from the Cybertech.com blog. He starts off creating some sample data and then determining how to sort. Now, in his example, he sorted a lot of data and it basically had to go to an external disk merge. And he says, of course, the most important performance to change in order to speed up sorts, not to be on disk if there are large sorts, is to increase your work memory. So in this example, he had four megabytes of work memory, he sets it up to 1GB and now that sort happens in memory and it goes from taking six and a half seconds to 2.7 seconds. So definitely work memory is something that you can increase to give more memory to sort operations and other types of aggregations. However, the thing to keep in mind he doesn't necessarily mention here is that work memory is used per user and even sometimes multiple times within a user's query. So you do need to keep a ceiling on that. If you have a lot of connections to your database, if you have a dedicated analytical database that has not that many users, then you can feel free to crank it up. But if you have like a transactional work will load with many, many users or many connections, you're going to want to keep a ceiling on how much work memory you're actually using. And then of course, one of the main ways to improve sorting is indexing. So applying an index where he's using like a limit ten, it blows the doors off the other examples. So definitely having an index to assist with sorting is definitely the way to go. [00:08:52] Now, related to indexes is our next post. So it's indexes on rails. How to make the most of your postgres database. This was from the Carol Galanchak, my apologies, I'm not sure how to pronounce that, but this is just a good overview of indexing in general, particularly for postgres. So he goes over the index types terms of Btree hashes Bren, Gin, gist and then when you want to get to the stage of optimizing queries, how do you do that? Use explain or explain analyze and then he goes on to additional advice in terms of the sequence of the columns in a B tree index matter. In other words, you could do a query that just uses the first part of the index or both, but you can't use a query, won't use it if you're just looking at the last column that's indexed in a multicolumn index. He talks about partial indexes which are actually my favorite performance improvement for queries because used in certain situations they can drastically increase the performance of queries. He talks about expression indexes which have their uses if you're always querying by or always querying in conjunction with a function. And then he goes through about optimizing like queries. He basically says quote forget about Btree indexes for this case, so what do you use instead? You basically use a gen index and taste. And he also says to take advantage of trigram matching provided by the PG trigram extension. And lastly, like the previous post covers ordering and how that can benefit from indexes. And of course, once you have a large database, or even not so large, you always want to add your indexes concurrently. And throughout this he gives this code in Rails so he tells you how to implement it using Rails Migrations. So if you're looking for an index review, definitely a blog post to check out. [00:10:43] The next post is Transactional Data Operations in PostgreSQL using common table expressions. And this is from the Rob Connery blog and he talks about essentially his own store that he has and how he has set up some different tables. And basically when an order comes in, he wants to record the order, record the items in the order so that's a separate table as well as record the downloads that someone can access as well as potentially products. And he goes over and shows you how he tends to like to set up his database in terms of using on delete cascade for foreign keys, trying to avoid nulls where he doesn't want them, when he has a not null, always trying to set a default. And he actually generates random UUIDs in certain cases in certain columns using the PG crypto extension for postgres. Now, what's interesting about this post is he's used a technique I haven't seen used a lot where he actually uses CTEs in order to do transactions. So for example, CTE is a common table expression essentially I like to think of as a many function for a subquery. And the advantages he says is that CTEs execute within a single transaction. So he can do an insert into his order table here with the value he's interested, return all that information and then in this new items essentially he's calling or I should say retrieving the information returned and then using that to insert into his order items table, returning that value. So essentially he can run this transaction with select all from new items and he assures that both get entered and he builds this implementation incrementally. And in this example he says, all right, well, what have you do about the downloads table? And then he shows a downloads implementation just by adding another common table expression. Then he says, okay, what happens if you have multiple order items? So here he's basically showing that example where basically for the insert statement, he's using a select sub query to do it. Now then he goes into all the examples he showed above, he's hard coded the solution and how do you handle that coding? How do you handle data integrity and validation? So what he says is his first inclination is to basically put all of that in a function that you can just call and then pass in the data needed as part of it using a JSON B variable. And another solution he shows is potentially using a prepared statement. So it's an interesting use case. I'm not necessarily wedded to doing this particular type of implementation, but it was something I really hadn't seen before, so I thought it was definitely a good thing to share and show. [00:13:22] The next post is Simplifying recursive SQL queries. So if you didn't get enough learning about CTEs in the previous post, this one goes even deeper because this actually uses recursive CTEs. And this is where a table essentially references itself. So say you have a list of employees and one of them is the manager of another row in that table, and you have like a Manager ID and how best to show who that person manages and queries. Now again, this starts off slow, talking about what are CTEs? And then showing an example of a basic common table expression. And then it goes into what's a recursive CTE basically showing where you do a union with basically have a start point and then at the test at which point it exits. And when you call that, here's the results you get basically print one to and add one to it every time until it exits, once the point is less than ten. So this post actually goes into some of the actual use case he's dealing with. And I'll let you review this post to see if this type of hierarchical use cases is something that you need to work with. Now, we also covered this in a previous episode of Scaling Postgres, and this is from the citedstata.com blog. In their Fun with SQL series. They did recursive CTEs and Postgres. This was from back in May. In this example they're using essentially employees with a Manager ID. So again, if you have a use case where you have a table that has a hierarchy within it, one row reference than another, these are two posts definitely to check out how to work with that using Recursive CTEs. [00:15:01] The next post is Foreign Data Wrappers in PostgreSQL and a closer look at Postgres foreign data wrapper. And this is from the Procona database performance blog. So this goes over what a foreign data wrapper is. It basically enables you to contact a separate database from within postgres and do queries against it or insert or update data. And specifically the Postgres foreign data wrapper enables you to communicate with postgres instances. Now, what I find interesting about this is that the future of Postgres Sharding is probably going to involve these foreign data wrappers as we've mentioned on previous episodes of Scaling Postgres. So I'm always interested to find content that is talking about using foreign data wrappers. And this is a fairly long post that actually goes into how you set it up to communicate with a separate postgres database. They go into some of the advantages of the foreign tables and talk about query optimization and even how to set up writable foreign tables. And then they also discuss issues with regards to pushing down joins predicates aggregates and push down is basically where the work is pushed down to the foreign database server as opposed to the one that you're actually logged into. So if you're looking into foreign data wrappers as a scaling initiative or you need to access other databases from your PostgreSQL instance, definitely a blog post to check out. [00:16:26] Now, related to that is they have PostgreSQL accessing MySQL as a data source using MySQL foreign Data Wrapper. So basically this is another post by the Procona Database Performance Blog and it covers connecting to MySQL from a PostgreSQL server and it goes through all the different options to configure it. So if you use MySQL, definitely a blog post to check out, the next post is actually a YouTube video. And this is from the EuroPython conference YouTube channel and the title is Craig Kirstens or Kirstein's Postgres at any Scale. And this was presented at EuroPython on July 23 to the 29th in 2018 that this YouTube video was recently posted. Now, this presentation has been given at other conferences, but this is the most recent version that's come up in the areas I've monitored. So if you haven't watched it yet, definitely check it out. The latter part is related to Citus Data, which is a Sharding solution for PostgreSQL, but a lot of the advice is directly related to basically the community postgres version. So definitely a YouTube video to check out if you haven't watched a previous version of it yet. [00:17:39] That does it. For this episode of Scaling Postgres, you can get links to all the content presented in the show notes. Be sure to head over to Scalingpostgres.com where you can sign up to receive weekly notifications of each episode. Or you could subscribe via YouTube or itunes. Thanks.

Show Notes

Episode Transcript

Other Episodes

Episode 112

Explain Analyze, Sequence Counts, Best Books, Partition Migration | Scaling Postgres 112

Episode 323

PgBouncer Breaking Change | Scaling Postgres 323

Episode 373

Planetscale Postgres! | Scaling Postgres 373

Multiple Databases, Slow Queries, Sorting Performance | Scaling Postgres 27