Episode Transcript
[00:00:00] In this episode of Scaling Postgres, we talk about BRIN indexes, partitioning enhancements, tips and standbys. I'm Creston Jameson, and this is Scaling Postgres episode 73.
[00:00:20] Alright, I hope you're having a great week. Our first piece of content is "BRIN Index for PostgreSQL: Don't Forget the Benefits". This is from the Percona.com blog, and they're talking about a BRIN index, which is a block range index. It stores ranges of values, so this type of index is optimal for data that gets stored incrementally, for example a table populated by date or by an incrementing primary key. They explain that a BRIN scan returns all the tuples in all the pages in the particular range that's chosen. Basically the index is lossy: it doesn't store an exact representation of the data, it stores ranges, and then it pulls up the matching range and finds the specific data you're looking for within it. So sometimes a search takes a little longer, but the benefit is that the size is incredibly small, and that can help with caching. They created a table, inserted about 8 million rows, and ran a query with no index; it finished in about 1.8 seconds. Then they created a B-tree index, and that query went from 1.8 seconds down to 2.6 milliseconds, so super fast with the B-tree. The size of that B-tree is 171 megabytes. Then they look at the BRIN index, creating it with USING brin. The same index that was 171 megabytes as a B-tree is 64 kilobytes as a BRIN index. That's vastly smaller, less than 0.1% of the B-tree's size, so it has a much higher probability of being cached. The query now finishes in 87 milliseconds. Compared to the unindexed query at 1.8 seconds, or 1,800 milliseconds, that's a pretty good improvement. However, the B-tree still beat it at 2.6 milliseconds, so in this example the BRIN index is about 40 times slower, although that probably depends on the query. You get huge space savings at the cost of not necessarily the best performance. They also have a section on storage and maintenance, noting that you can specify pages_per_range, which is how many pages go into each range. So you can make the index less lossy, and presumably improve performance a little, at the cost of taking up more space. I kind of think of it like a JPEG image: you can have a higher-quality image that's a lot larger, or a lower-quality image that's a lot smaller, and BRIN is kind of that same concept. So if you're interested in potentially using a BRIN index, definitely a blog post to check out.
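To make the comparison concrete, here is a minimal sketch of the kind of test they ran. It is not the Percona post's exact table or data; the events table, its columns and the row count are illustrative.

```sql
-- A table loaded incrementally in time order: the ideal case for BRIN.
CREATE TABLE events (
    id         bigserial PRIMARY KEY,
    created_at timestamptz NOT NULL
);

-- Insert timestamps in ascending physical order, which keeps each
-- BRIN block range tight.
INSERT INTO events (created_at)
SELECT ts
FROM generate_series(now() - interval '10 days', now(),
                     interval '1 second') AS ts;

-- Index the same column both ways.
CREATE INDEX events_created_btree ON events (created_at);
CREATE INDEX events_created_brin  ON events USING brin (created_at);

-- Optional: a smaller pages_per_range (default 128) makes the BRIN
-- index less lossy but larger, the JPEG-quality trade-off above.
-- CREATE INDEX events_created_brin32 ON events
--     USING brin (created_at) WITH (pages_per_range = 32);

-- Compare on-disk sizes, as the post does.
SELECT pg_size_pretty(pg_relation_size('events_created_btree')) AS btree_size,
       pg_size_pretty(pg_relation_size('events_created_brin'))  AS brin_size;
```

Running EXPLAIN ANALYZE on the same range query against each index is the easiest way to see the latency difference described above.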
[00:03:28] The next post is "Partitioning Enhancements in PostgreSQL 12", and this is from the 2ndQuadrant.com blog. They take an example from their previous post on generated columns in PostgreSQL 12, building a data warehouse style example using partitioning. They create a media calendar, breaking out all sorts of different date parts as integers in a single table, and then a time dimension that looks at time of day, hour of day, minute of day and second of day. Then they do a cartesian product, joining these two together and placing the results into a partitioned table. The intermediate table is an hours_to_days table, partitioned by range on the day, and they put all of these values into the partitions as defined here. They're going to be generating 6.3 billion rows. Getting into the first enhancement, it's basically that, quote, you may now have a foreign key relationship where the partitioned table is the child. So here they created a sale table with a foreign key referencing one of the child tables, which is a definite benefit for integrity purposes. They also mention some new introspection tools: pg_partition_tree, pg_partition_ancestors and pg_partition_root. You can do a \d+ in psql to display all the partitions, but these functions actually show you the partition tree. For example, hours_to_days is the parent; the output tells you whether each relation is a leaf and what level it's at, since you can have multiple partition levels, and it shows what the parentrelid is. There's also pg_partition_root to identify the root table, as well as pg_partition_ancestors. The other thing mentioned is that altering a table to attach a partition no longer requires an exclusive lock. Detaching a partition still does, but attaching does not, so that's definitely a benefit. And they list three other benefits. One, the COPY command has reduced a bit of overhead, allowing for faster loading. Two, the tablespace specification for a parent is now inherited by the child. And three, pg_catalog.pg_indexes now shows indexes on partition children. So definitely some great enhancements for partitioning coming in version 12.
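Here is a hedged sketch of those introspection functions and the attach behavior. The measurements tables are made up for illustration; the function names are the real PostgreSQL 12 ones.

```sql
-- A simple range-partitioned table.
CREATE TABLE measurements (
    day     date NOT NULL,
    reading numeric
) PARTITION BY RANGE (day);

CREATE TABLE measurements_2019 PARTITION OF measurements
    FOR VALUES FROM ('2019-01-01') TO ('2020-01-01');

-- Walk the whole tree: one row per relation, with its parentrelid,
-- whether it is a leaf, and its level in the hierarchy.
SELECT * FROM pg_partition_tree('measurements');

-- Find the root of the hierarchy from any partition...
SELECT pg_partition_root('measurements_2019');

-- ...or list all of a partition's ancestors.
SELECT * FROM pg_partition_ancestors('measurements_2019');

-- In PostgreSQL 12, attaching no longer takes an ACCESS EXCLUSIVE
-- lock on the parent (detaching still does).
CREATE TABLE measurements_2020 (LIKE measurements);
ALTER TABLE measurements ATTACH PARTITION measurements_2020
    FOR VALUES FROM ('2020-01-01') TO ('2021-01-01');
```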
[00:06:01] The next post is "Postgres tips for the average and power user", and this is from the Citusdata.com blog. The first tip they mention is actually one I hadn't heard of or used before: in psql you can use Ctrl+R to do a history search. You don't necessarily need to tag your queries; you can just type part of a query and it will pull up the different queries you've run historically, and you can hit Enter to run that command. So that's pretty interesting. But he talks about tagging queries with a comment to be able to find a particular query you're looking for. Then he talks about better psql output: there's a backslash option he covers, and a pager called pspg that you can install that he likes better, and he gives some configuration you can add to your .psqlrc file to enable it. The next tip is getting a stack trace for an error, and he goes through the process of doing that. He also talks about inserting fake data; if you've watched previous episodes of Scaling Postgres, there have been many blog posts about generate_series, and they're talking about it here. They mention a better pg_stat_activity, giving suggestions on how to better read the active statements running in PostgreSQL, as well as a process to manage the multiple versions of Postgres you may be running, using pgenv. So, some tools you could use in your Postgres environment, and another blog post to check out.
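For instance, the tagging trick plus a more readable pg_stat_activity query might look something like this. This is a hedged sketch: the tag format and the column selection are my own choices, not necessarily the post's.

```sql
-- Tag a query with a trailing comment so Ctrl+R history search in
-- psql can find it again later by its tag.
SELECT count(*) FROM pg_stat_user_tables; -- tag:table-count

-- A more readable view of active statements than SELECT * output.
SELECT pid,
       state,
       wait_event_type,
       wait_event,
       now() - query_start AS runtime,
       left(query, 60)     AS query
FROM pg_stat_activity
WHERE state <> 'idle'
  AND pid <> pg_backend_pid()    -- exclude this session itself
ORDER BY runtime DESC NULLS LAST;
```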
[00:07:31] The next post is "Standby in production: scaling application in the second largest classified site in the world", and this is from AvitoTech on Medium. They have a very large Postgres installation, greater than 20 terabytes across more than 100 nodes. This is Avito, the biggest Russian classified site and the second largest classified site in the world. The post is very, very long and pretty comprehensive, and it's basically all about standbys and how they can be used from a high availability and a scaling perspective. He talks about physical replication, which is essentially streaming the WAL files, and touches briefly on logical replication, but primarily it's about physical replication and using standbys to horizontally scale reads. He goes a little into the history and then into specific things they've run into and how they've resolved them. Some of the main things he covers are how to avoid stale reads from a replica and the techniques they use compared to some others, how they implemented caching, some deadlock issues they discovered on a replica, issues related to excessive WAL files and how they handled them, and standby pools. So this is a really comprehensive blog post covering a number of different items they've encountered while scaling Postgres. If you've run into problems with scaling as it relates to standbys, this is definitely one to check out; it covers a lot more than I can go into comprehensively in this episode, but I definitely encourage you to read it.
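On the stale-reads point, one common building block, though not necessarily Avito's exact approach (the post goes into much more detail), is simply measuring how far behind a replica is:

```sql
-- On a standby: how far behind is replay, as a time interval?
SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;

-- On the primary: per-standby replication positions
-- (PostgreSQL 10+ column names).
SELECT application_name, state, sent_lsn, replay_lsn
FROM pg_stat_replication;
```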
[00:09:18] The next post is "The mysterious backend_flush_after configuration setting", and this is from Cybertec-postgresql.com. This is a setting I hadn't heard about or looked into, but they quote Andres Freund as basically saying that if your workload is bigger than shared buffers, you should enable the backend_flush_after parameter for improved throughput and also reduced jitter. So they said, well, let's take a look and try it. Basically it tells the OS how to handle some of the flushing of written data to disk; that's my interpretation in a nutshell. He decided to test this on a four-CPU, eight-gigabyte system with pgbench. With the setting at its default of off, the benchmark came in at 0.67 milliseconds. When he set it to 512 kilobytes, there wasn't a significant change, less than 1%. But increasing it into the megabyte range, he got close to a 4% to 5% improvement in performance. So this is another one of those configuration settings that probably depends on your load and what you're doing. If you want to try to squeeze a little more performance out of your system, take a baseline measurement, alter this parameter, and see whether performance differs for your system and workload. So if you're interested in getting a little bit more performance, definitely a blog post to check out.
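Trying this yourself is straightforward. A minimal sketch follows; the 2MB value (which is also this parameter's maximum) is just an example, not a recommendation from the post.

```sql
-- backend_flush_after defaults to 0, i.e. disabled.
SHOW backend_flush_after;

-- Set it cluster-wide; a config reload makes it take effect.
ALTER SYSTEM SET backend_flush_after = '2MB';
SELECT pg_reload_conf();
```

Then run the same pgbench workload before and after the change and compare, as the post does.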
[00:10:42] The next post is "Suggesting single-column primary keys almost automatically". This is a continuation of a previous blog post by Luca Ferrari, and it's from fluca1978.github.io. He talked previously about having to generate primary keys for a great number of tables in a database that didn't have any. Now he's come up with kind of an automated method to identify which column is probably the best primary key candidate, again by looking at the system catalogs. He has all his code right here, and he's consulting the pg_stats view to determine the good candidates. He goes through his methodology and then runs it to suggest primary keys for different tables. So if you have a particular use case for this, definitely a blog post to check out.
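The core idea can be sketched like this. It is a hedged illustration of mining pg_stats for candidates, not Luca's actual query:

```sql
-- Columns whose statistics say "all values distinct" (n_distinct = -1)
-- and "no NULLs" (null_frac = 0) are plausible single-column primary
-- key candidates. Run ANALYZE first so the statistics are current.
SELECT schemaname, tablename, attname
FROM pg_stats
WHERE n_distinct = -1
  AND null_frac = 0
  AND schemaname NOT IN ('pg_catalog', 'information_schema')
ORDER BY schemaname, tablename, attname;
```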
[00:11:35] The next post is also by Luca Ferrari, titled "Checking PostgreSQL version in scripts". Basically he had a psql script and wanted to check the version as it relates to partitions, because on version 10 he wanted one type of syntax, and on version 11 a slightly different one that used a default partition. So he came up with a method to determine which Postgres version is running, so that he can conditionally choose the syntax during query execution. So if you have a use case for this, definitely a blog post to check out.
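A minimal sketch of the technique, using psql's \if together with server_version_num; this is my own variant and may differ from Luca's exact script:

```sql
-- server_version_num is numeric, e.g. 110005 for 11.5, so versions
-- compare cleanly. \gset stores the result in a psql variable.
SELECT current_setting('server_version_num')::int >= 110000 AS is_v11 \gset

\if :is_v11
    \echo 'Version 11 or later: a DEFAULT partition is available'
\else
    \echo 'Version 10: no DEFAULT partition support'
\endif
```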
[00:12:12] The next post is "Per User Encryption with Postgres", and this is from The Startup on Medium.
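As a preview of the pattern described below, here is a minimal sketch using pgcrypto's pgp_sym_encrypt and pgp_sym_decrypt. The table, data and key are illustrative; in practice the key would come from the application rather than appear in SQL.

```sql
-- pgcrypto ships with PostgreSQL as a contrib extension.
CREATE EXTENSION IF NOT EXISTS pgcrypto;

CREATE TABLE user_secrets (
    user_id bigint PRIMARY KEY,
    secret  bytea
);

-- Encrypt on the way in; the key is supplied by the application,
-- for example from an environment variable.
INSERT INTO user_secrets (user_id, secret)
VALUES (1, pgp_sym_encrypt('my sensitive data', 'app-side-key'));

-- Decrypt on the way out with the same key.
SELECT pgp_sym_decrypt(secret, 'app-side-key') AS plaintext
FROM user_secrets
WHERE user_id = 1;
```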
And this is a pretty simple post to start off with in regards to Postgres, but he's talking about basic encryption you can do using pgp_sym_encrypt and pgp_sym_decrypt to encrypt and decrypt data within the database. He talks about how you can define an environment variable, probably on your application server, and then encrypt and decrypt data without having the keys on the database server. He takes it a step further and talks about sandwich encryption, where, for example, your users have their own key and the server has its own key, and you double-encrypt the data with both of those. That way even you can't decrypt the data; only the user with their own decryption key can do it. So if you have a need to do that, definitely a blog post to check out.

The last post is "pgBackRest S3 configuration", and this is from pgstef.github.io. pgBackRest is a separate backup tool for PostgreSQL, and you can configure it to use S3 to store and retrieve data; this blog post shows you how to do that. Now, it doesn't actually use S3: he uses a tool called MinIO, which is a server compatible with Amazon S3's object storage. But nonetheless, he goes through and shows you how to configure pgBackRest to use it, runs a backup with pgBackRest, and shows how to get it all running. So if you're interested, definitely a blog post to check out.

That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to Scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you could subscribe via YouTube or iTunes. Thanks.