Episode Transcript
[00:00:00] In this episode of Scaling Postgres, we talk about Aurora versus Postgres, surprising transactions, write-only and read-only, and indexing advice. I'm Creston Jamison, and this is Scaling Postgres, episode 195.
[00:00:20] All right, I hope you, your friends, family and coworkers continue to do well. Our first piece of content is "Is Aurora PostgreSQL really faster and cheaper than RDS PostgreSQL – Benchmarking". This is from MigOps.com, and they're talking about how they were hearing from some of their customers that they were seeing pretty big CPU utilization spikes on their Aurora instances. And maybe not spikes, but just high CPU utilization that caused them to get bigger instances, which of course has a budgetary impact. So they wanted to do an analysis to benchmark Aurora PostgreSQL versus RDS PostgreSQL. Now, RDS PostgreSQL is basically Postgres that Amazon hosts on your behalf. Aurora PostgreSQL is not PostgreSQL. It doesn't use the Postgres engine, to my knowledge; it's its own proprietary technology that Amazon has developed. But it provides a PostgreSQL interface, so you can configure it that way. It's an entirely separate engine as far as I know, so it can behave differently than Postgres. And they mention here that Amazon indicates Aurora can give you three times the performance of Postgres. But of course the question is, in what situations does that happen? Because in their analysis, they actually saw RDS Postgres outperform Aurora given a similarly sized instance, and the resource utilization of Aurora was much higher.

So the tool they used to do these benchmarks is pgbench, and they found what I think are some interesting findings. They used a db.r5.xlarge for both RDS and Aurora, in the same zone, so the same instance type. The storage is handled a little bit differently: it looks like Aurora auto-scales its storage, whereas you specify it for RDS. They initially started with 900 IOPS, just based upon the size of their storage, but they increased it to 3,000 because when they initially ran their test using pgbench, RDS Postgres was about half the performance of Aurora. When they started to dig in to see what was causing this, they were seeing a number of wait events related to data file reads, WAL syncs, as well as WAL writes. So it clearly looks I/O bound. So they looked at how much their billing was, and they say the Aurora billing was equivalent to about $170, so they went ahead and upgraded the RDS instance to 3,000 IOPS to make the pricing equivalent, I'm assuming. But as soon as they did that, RDS PostgreSQL started outperforming Aurora. So at the four-client, eight-client, and twelve-client loads, across different runs, it outperformed Aurora, and the transactions per second were relatively consistent across the tests. But something else that was interesting is that the CPU utilization was quite different: RDS CPU utilization was maybe around 30% during the runs, whereas it reached almost 70% on Aurora. Similarly with the IOPS utilization of RDS versus Aurora: RDS looks like it was peaking around 4,000 IOPS, which is interesting because I thought they had set it at 3,000, but Aurora was dramatically higher, roughly 40,000 versus 4,000. So about tenfold higher IOPS utilization on Aurora, which of course may result in increased costs. And then they looked at the I/O queue depth, where RDS seems relatively consistently under ten except for some spikes, while it was spiking all over the place on Aurora, with some spikes over 100 and others around the 75 mark. So, definitely an interesting set of tests comparing Aurora versus RDS, or essentially Postgres.
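If you wanted to see for yourself which wait events dominate during a benchmark like this, here's a minimal sketch of the kind of query you could sample while a pgbench run is in progress. The pg_stat_activity view and its wait_event columns are standard Postgres; the specific events you'll see depend entirely on your workload and storage.

    -- Roughly sample what active sessions are waiting on
    -- (the wait_event columns exist in Postgres 9.6 and later).
    SELECT wait_event_type,
           wait_event,
           count(*) AS sessions
    FROM pg_stat_activity
    WHERE state = 'active'
      AND wait_event IS NOT NULL
    GROUP BY wait_event_type, wait_event
    ORDER BY sessions DESC;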
And I frequently tell my clients Aurora is not Postgres; it's an entirely separate engine, so you should expect to see performance differences. The question is whether those differences are going to be better or worse than PostgreSQL. Clearly, using pgbench in this example, it looks to be worse, but maybe there are other benchmarks that would show Aurora outperforming Postgres, just based upon how that database engine has been designed and how it differs from Postgres. But it's definitely an interesting piece of content, and I highly suggest you check it out.

The next piece of content is "What developers find surprising about Postgres transactions". This is from blog.lawrencejones.dev, and he's talking about transactions and the case of repeatedly reading something. A lot of transactions are used to ensure that you can do an atomic update of, say, two tables. For example, you want to decrement a balance in one account and increment it in another; you want those updates to happen within a transaction and either happen or not happen as a whole. You don't want, say, money decremented in one account but not incremented in the other. The example he's looking at here is doing a SELECT within a transaction to get some information. Even though you're in a transaction in Postgres, if another connection has changed the row where the ID equals 'bananas' in the organizations table, when you run the next query it's possible you're going to get different information. The reason is that the default isolation level in Postgres is Read Committed. So as soon as another session's change is committed, you can read it, and it doesn't matter whether you're in a transaction or not. As soon as another connection updates the information you selected, if you go to select it again, you're going to get that up-to-date information. It's not going to return the same data just because you're in a transaction, because it shows you whatever has been committed. Now, this post talks about one way to avoid this, which is to use locks. He shows an example where you essentially lock the row FOR SHARE, and then you'll get the same information when you repeat the query on line three of his example. So basically that is an explicit row lock, as he says. However, there is another way to do this: changing the isolation level. In addition to Read Committed, there's also Repeatable Read. That would enable you to query this data as many times as you want and get the exact same information as it existed at the start of the transaction. It keeps a version of the data around for this transaction so you can retrieve the same information even if another session has committed changes to that record. Now, one disadvantage of changing the isolation level to Repeatable Read is that there is a chance the transaction has to be canceled. If you look at the Postgres documentation on the Repeatable Read isolation level, it is possible to get an error that says "could not serialize access due to concurrent update". So if another session is updating a row and you're also making changes to it within your transaction, Postgres may have to roll back your transaction, and you basically have to start the process over again.
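To make the surprise concrete, here's a minimal sketch of the pattern he's describing. I'm using a hypothetical organizations table with a name column; the post's actual schema may differ. Under the default Read Committed level, the second SELECT can see data committed by another session after the first SELECT, and the FOR SHARE variant is the explicit-lock workaround he shows.

    -- Default READ COMMITTED: each statement sees the latest committed data.
    BEGIN;
    SELECT name FROM organizations WHERE id = 'bananas';   -- returns the old name
    -- ...another session commits:
    --   UPDATE organizations SET name = 'new name' WHERE id = 'bananas';
    SELECT name FROM organizations WHERE id = 'bananas';   -- may return the new name
    COMMIT;

    -- Explicit row lock: FOR SHARE blocks concurrent updates to this row
    -- until the transaction finishes, so repeated reads stay consistent.
    BEGIN;
    SELECT name FROM organizations WHERE id = 'bananas' FOR SHARE;
    SELECT name FROM organizations WHERE id = 'bananas';   -- same data as above
    COMMIT;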
So locking, as he's suggesting here, won't result in that possible failure, but you're potentially reducing your concurrency, because no one can make changes to this record while the lock is held. Whereas with Repeatable Read, all the updates can still happen, so there's no blocking from a lock; however, there are cases where you will hit a serialization error and have to redo your work if you choose this path. But it's definitely an interesting post, and if you want to learn more, I encourage you to check out this piece of content.
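Before moving on, here's what the isolation-level alternative looks like as a minimal sketch, using the same hypothetical table as above. You get a consistent snapshot for the whole transaction, but if you go on to modify a row another session has changed, you need to be prepared to catch the serialization error and retry.

    BEGIN ISOLATION LEVEL REPEATABLE READ;
    SELECT name FROM organizations WHERE id = 'bananas';
    -- Other sessions can still commit changes to this row...
    SELECT name FROM organizations WHERE id = 'bananas';   -- same result as above
    -- ...but updating a concurrently modified row here can fail with:
    --   ERROR: could not serialize access due to concurrent update
    -- in which case you roll back and retry the whole transaction.
    COMMIT;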
[00:08:23] The next piece of content is "Taking advantage of write-only and read-only connections with PgBouncer in Django". This is from EnterpriseDB.com, and they're talking about setting up an environment where your application, specifically Django in this case, but other application frameworks can do this as well, can speak to multiple databases. So they've set up an environment where your application framework or application server speaks to two different databases: a write-only database and a read-only database. Both of those go through PgBouncer, which acts as a proxy: the writes point to the primary database, while the reads go on through HAProxy to speak to one or more standbys, so you're going to get a read from one of the standby databases. This post doesn't go into how to set up the high availability here, but it talks about configuring your application with PgBouncer and HAProxy to be able to send writes to one database and reads to another. The PgBouncer configuration simply sets up a reads database that talks to HAProxy and a writes database that talks to the primary. And then HAProxy receives the connections and sends them to one of the two standby servers using round-robin balancing.
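One simple sanity check you could add once this is wired up, as a minimal sketch assuming the reads and writes database names from the post: connect through each PgBouncer pool and ask Postgres whether it's in recovery.

    -- Run through the "writes" pool (should land on the primary):
    SELECT pg_is_in_recovery();   -- expect false on the primary

    -- Run through the "reads" pool (should land on a standby via HAProxy):
    SELECT pg_is_in_recovery();   -- expect true on a streaming standby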
[00:09:46] And then you configure your application framework to speak to both databases. Basically, you configure your databases such that your primary goes to the writes database and your standby goes to the reads database. In Django in particular, you can set up DB routing so it automatically knows to send reads to the reads database and writes to the writes database, but of course you could also do this with any other framework. So this is a very simple implementation showing how you can separate your reads and your writes, send them to different databases, and then scale out at least your reads. Definitely an interesting blog post; I suggest you check it out.

The next piece of content is "Some indexing best practices". This is from pgMustard.com, and it's a numbered set of best practices. Number one, don't index every column. Definitely true. Every index has a price to be maintained when you're inserting or updating records, and extra indexes can also prevent a lot of heap-only tuple (HOT) updates, because those indexes have to be updated. The second recommendation is to index columns that you filter on. So anything in a WHERE clause, you're generally going to want an index that can access it. Third, only index the data that you need to look up. Here specifically they're talking about utilizing partial indexes when necessary. So if you only ever look at rows in the table where deleted_at is null (when a row gets deleted, that column is updated with the date of the deletion), it is generally a good practice to create the index as a partial index WHERE deleted_at IS NULL. You could also do this with statuses: maybe you only want to index those rows where the status is complete, or where the status is active or inactive. The fourth consideration is to think about other index types. The B-tree index is the standard index type, but if you have text search you need to do, or you have a JSONB column, you're generally going to want to use GIN indexes for that. There are also certain use cases when you have a lot of data where you may want to use a block range index, a BRIN index, and of course spatial data uses GiST indexes. So you can check out the different index types to see how they can benefit you. Five, use indexes to pre-sort data. I was a little confused by this, but I think what they're talking about is that if you pre-sort the data you're going to index, you can get some advantages as long as that data doesn't change after the fact. They link to loading the data in a sorted fashion, and you can avoid a bitmap index scan if the order of the index matches the order of the data in the table. Now, that can work very well for things like data warehouses and data marts, but if you're constantly updating and inserting data, it may not work out as well. Still, it's definitely something you can take advantage of. Six is use multicolumn indexes, but sparingly. I see a lot of performance benefits from using multicolumn indexes, so I may not necessarily agree with "sparingly", but when I do a first pass at indexing a table, I usually don't use multicolumn indexes unless I absolutely know I need them. I basically put indexes on the foreign keys and other columns I know I'm going to be using in a WHERE clause. And the other thing not mentioned here is that an index can also be used for ordering, so you should also consider indexes for columns you order by, because that can definitely lead to some performance improvements.
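To make a couple of those recommendations concrete, here's a minimal sketch with made-up table and column names: a partial index for soft-deleted rows, a GIN index for a JSONB column, and a multicolumn index that also matches a common ORDER BY.

    -- Partial index: only index rows that haven't been soft-deleted.
    CREATE INDEX idx_orders_customer_active
        ON orders (customer_id)
        WHERE deleted_at IS NULL;

    -- GIN index for querying a JSONB column.
    CREATE INDEX idx_orders_metadata
        ON orders USING gin (metadata);

    -- Multicolumn index that can also serve an ORDER BY created_at DESC.
    CREATE INDEX idx_orders_customer_created
        ON orders (customer_id, created_at DESC);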
[00:13:11] Number seven, look after your indexes. So basically, yes, if they get bloated, it's a good practice to reindex them periodically, and in particular to always use REINDEX CONCURRENTLY. So these were some pretty good recommendations, and if you want to learn more, you can check out this blog post.

The next piece of content is "Postgres and JSON". This is from Atlabra.org, and the title wasn't necessarily 100% clear to me about what it was discussing. I think a more appropriate title would have been "Creating an audit trail using Postgres and JSONB fields", because that's essentially what this post is. They created an archive_test table that has a number of columns in it, and then they created a companion delete table. This is going to hold any deletions that happen to the main table. Then they created a function that takes the old data and uses the row_to_json function, so it converts the row of data that's being deleted into JSON and stores it in the delete version of the table, archive_test_delete. Then they create a trigger that executes that archive_record procedure after a delete. The advantage of this is that it can persist through schema changes: if you change the schema of archive_test, it doesn't matter, because the whole row is stored in a JSONB field that they called record. They give an example where they insert three rows into the table, then delete a record, and you can see what gets inserted into the delete table: you basically have a record field with all the values in JSONB. They also showed that you can add a column and it still works as normal; it just adds additional information to the JSONB. Now, what's interesting is their technique for getting data back out of the audit table. They do a select from the table, but then they do a lateral join, and from the record field they use jsonb_populate_record, which basically takes all that information and turns it back into a SQL record. So definitely an interesting way to set up auditing for particular tables.
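Here's a minimal sketch of that audit-trail pattern as I understand it from the post. I'm reusing the archive_test names mentioned above, but the column list and the exact function body are my own assumptions, and I'm using to_jsonb where the post uses row_to_json (the result is the same once it lands in a jsonb column).

    CREATE TABLE archive_test (
        id    bigint PRIMARY KEY,
        name  text,
        price numeric
    );

    -- Companion table that stores deleted rows as JSONB.
    CREATE TABLE archive_test_delete (
        deleted_at timestamptz DEFAULT now(),
        record     jsonb
    );

    -- Trigger function: convert the deleted row to JSONB and archive it.
    CREATE OR REPLACE FUNCTION archive_record() RETURNS trigger AS $$
    BEGIN
        INSERT INTO archive_test_delete (record) VALUES (to_jsonb(OLD));
        RETURN OLD;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER archive_test_delete_trigger
        AFTER DELETE ON archive_test
        FOR EACH ROW EXECUTE FUNCTION archive_record();

    -- Turn the archived JSONB back into SQL rows via a lateral join.
    SELECT d.deleted_at, r.*
    FROM archive_test_delete d,
         LATERAL jsonb_populate_record(NULL::archive_test, d.record) AS r;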
[00:15:19] The next piece of content is "Extracting and substituting text with regular expressions in PostgreSQL". This is from Crunchydata.com, and it describes exactly what it covers: learning to use regular expressions in Postgres. They go over a quick refresher of how regular expressions work, then how you can do searches within the database using the tilde operator, how you can extract specific text using substring with regular expressions, and how to do substitutions. So if you want to learn more, you can definitely check out this blog post.

The next piece of content is "pg_rewrite: PostgreSQL Table Partitioning". This is from Cybertec-postgresql.com, and this is a new open source tool and extension that helps you rewrite a table into a partitioned structure. They show how to get access to the code, how to install it, and how to enable it. You need to make it part of your shared_preload_libraries, and you do need to set the wal_level to give it enough information to set up the partitioning scheme. Then you create the extension within your database, because you need access to the partitioning command. They give an example of a table they wanted to convert, and they made two partitions out of it. Essentially you just run one command, SELECT partition_table(...), where you specify the table you want to partition, the partitioned table you want to move it into, and what the old table should be renamed to. After you run it, you should end up with a structure where the old table has been renamed (it used to be t_number, now it's t_old_number) and the new partitioned table is no longer named t_part_number, because it gets renamed to t_number, the original name of the old table. So essentially it's a swap-in replacement. They said this utility should work without locking things, and you should be able to do it online, but they didn't go into a lot of detail about how it works. Since the source code is open source, you can definitely check that out. So if you're looking to convert a table into a partitioned structure, maybe you want to check out this new utility.
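Going back to the regular expressions post for a moment, here's a quick sketch of the three operations mentioned there: matching with the tilde operator, extracting with substring, and substituting with regexp_replace. The sample strings and patterns are my own, not from the post.

    -- Match: does the value contain a three-digit-dash-four-digit number?
    SELECT 'call 555-1234' ~ '\d{3}-\d{4}' AS matches;

    -- Extract: pull the number out with substring().
    SELECT substring('call 555-1234' from '\d{3}-\d{4}') AS phone;

    -- Substitute: mask every digit with regexp_replace() and the 'g' flag.
    SELECT regexp_replace('call 555-1234', '\d', 'x', 'g') AS masked;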
[00:17:29] The next piece of content is "pg_graphql: A GraphQL extension for Postgres". This is from Supabase.com, and they're talking about a new open source tool they've released called pg_graphql. They do say it is a work in progress, but it's something they're working on and have just released. Basically, it uses Postgres to provide a GraphQL interface. So if you're interested in something like that, you can definitely check out this blog post.
[00:17:56] And the last piece of content: the PostgreSQL Person of the Week is Emre Hasegeli. So if you're interested in learning more about Emre and his contributions to Postgres, definitely check out this blog post. That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to Scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.