Column Performance, BRIN Win, Unique and Null, Parallel Distinct | Scaling Postgres 224

Episode 224 July 18, 2022 00:13:57
Column Performance, BRIN Win, Unique and Null, Parallel Distinct | Scaling Postgres 224
Scaling Postgres
Column Performance, BRIN Win, Unique and Null, Parallel Distinct | Scaling Postgres 224

Jul 18 2022 | 00:13:57

/

Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, we discuss how too many columns in a table can affect performance, at what point BRIN indexes win over btree, and Postgres 15 supporting unique nulls and parallel distinct queries.

To get the show notes as well as get notified of new episodes, visit: 

https://www.scalingpostgres.com/episodes/224-column-performance-brin-win-unique-null-parallel-distinct/

View Full Transcript

Episode Transcript

[00:00:00] Speaker A: In this episode of Scaling Postgres, we talk about column performance. Brin when unique and null in parallel distinct. I'm creston. Jameson and this is scaling postgres episode 224. I hope you, your friends, family and coworkers continue to do well. Our first piece of content is column order in PostgreSQL does matter. This is from Cybertechyphenposgresql.com and it's talking about column order. But more specifically, I think it's talking about when you have too many columns, it can really cause an impact with performance in certain cases. Now, the order comes in when you have a lot of columns on the table and you're looking for something specific in one of the later columns. But we'll get into this post. So basically, they had some clients that had some very broad tables. So he simulated one. He's using Generate series and the Geexec command to create a really wide table, essentially 1500 columns. So that's a lot of columns. And then he inserted a million rows into this table for doing these tests. Now, the size of the table was about 4GB, so not incredibly large. Now, first each just wanted to get account all the rows from this particular table and it returned in about 400 milliseconds. So there's no index. It's just doing account of the rows that exist there. And it's doing it in a parallel sequential scan here. Now, when you choose the first column and do the same thing, you have something similar. It's a little bit slower because it does need to check for nulls for the data in this column. So as opposed to being 416 milliseconds, this one was 432 milliseconds, but still not bad. But look what happens if you select the hundredth column where you're doing account on that. Now suddenly it's twice as slow. It's 857 milliseconds. So I was pretty surprised it gets that drastic at only 100 columns. And then when it goes out to the 1000 column, it takes over eight and a half seconds to return account from those rows. So that's crazy. That's 20 times slower than before. Now, he says why this is happening is that basically he has to pull back a row and then use the data types of that table to figure out where the data is located within that row to be able to return it. Now, this can be relatively easy to do if there's a bunch of integers because they're essentially just four bytes. But with text based ones like variable character data types, the computation gets a little bit long. And having to do that for every row that you're counting really adds up. So basically, as he says here, having that many columns in a table quote does create some non trivial overhead. So just be careful about making your tables too wide because you may run into performance issues. But definitely check out this blog post if you want to learn more. Next piece of content, postgres indexing. When does Brin win this is from Crunchydata.com and he's talking about brin indexes which are basically a block range index. And as the name implies, it indexes ranges of values. Unlike a B tree index, the typical index that indexes every value, a brin index does not do that, it does ranges of values. Now as a consequence it's going to be a lot smaller, this type of index. And one of the disadvantages you can imagine it's harder to pull out a distinct value, whereas with a Btree it's quite easy to pull out a distinct value, whereas with a brin you have to pull back the range and then find the individual value you're looking for. Now the other consequence is that it does really well with sequential data, but not so great with random data. So he has some examples here. If you have like a log table with a timestamp and you put a brin index on the timestamp where you're pulling back ranges of data, that could be a good use case or different types of sensor measurements or GPS tracking points, et cetera. Now he goes into describing a little bit about the brend index and how it's structured. But basically there's a value range that index holds and tells you where in the pages to find those values. So again, the more that your values are grouped together physically in the heap, the more efficient a brin index will be because you don't want it to have to go to multiple pages across the heap to find a particular range because then the index becomes very ineffective. Now, he created a table here where he had some random data and some sequential data and a value and he created four different indexes to Btree to brin and for each type he did the random column and the sequential column on the table. Now again, because the brin index is only ranges and not distinct values, you can tell how much smaller the brin is compared to the B tree. Essentially it's 1000 times smaller in the example here. So the brin random size is 24 KB whereas the B tree random size is 21 megabytes. So it's 1000 times smaller, which is phenomenal. But let's look at some of the performance. So I think this was a million row table and he's pulling back different numbers of rows from that table. So 100 rows, 1000, 10,000, 10,0000 rows. So how many rows you're pulling back from the table. Now the B tree, whether it's random or sequential, the timings are pretty much similar but the sequential is more performant by maybe a twofold difference in some of the cases. But it's pretty similar, but no more of a difference than that. So be tree, whether random or sequential, the sequential is a little bit faster but that's pretty much the result. Now, the random brin index is just terrible. Its performance is worse than any of the B tree and it's consistently worse because it's essentially just like doing a sequential scan, it doesn't help at all. Now the brin sequential scan, as noted before, because it's indexing ranges, it can't pull out distinct values as fast. So actually the B tree is about 20 times faster when pulling back 100 rows versus the brin index. At 1000 rows, the B tree is only about twice to four times as fast and it's at the 10,000 rows you're trying to return Mark, where the brin is pretty much in the pack and matching what the B tree does. And then it starts performing better with row counts above 10,000, like the 100,000 it beat both B tree examples. And then keep in mind this is with an index that's 1000 times smaller than a B tree. So given the size, these are great results. And you can tweak the number of pages in a given range. In the brin index, by default it's 128. But he did test down to four different performance levels. So maybe by tweaking it you could get a doubling at most, maybe a tripling in some performance for low row count levels. But you can check out this table for yourself. So basically in the conclusion, brin indexes are great when you have an insert only pattern and the data is being added to the heap in a sequential manner. So think of a timestamp and they're great for very large tables where you're wanting to pull back large amounts of rows. If you're wanting to pull back a single row or ten rows out of a million a billion rows, the B tree is going to be much more efficient at doing that. But if you're pulling back like a year's worth of data, or a month or a week out of a table with millions billions of rows, then a brin could give a lot of advantages. But if we want to learn more, definitely check out this blog post. The next piece of content, postgres 15, improves unique and null. This is from Rustprooflabs.com and normally when you create a unique index or a unique constraint across columns, you can insert multiple nulls. And in this example that he created, he has a unique constraint on val two. And as you can see, you can insert as many null values as you want to into val two. And it's fine because a null by definition means an unknown value, so you don't know if it's unique or not. So that's acceptable. But what's coming in postgres 15 is that you can now use this additional clause called nulls not distinct. So what that means is consider null to be the same value, which of course means you can only insert it once. So in this example here, using that nulls not distinct, you can insert one null value of course, but when you try to insert a second one, it's going to result in a duplicate key violation. So now, depending upon your use case, you can choose to consider nulls to be the same or to be different. If you want to learn more, check. [00:08:41] Speaker B: Out this blog post next piece of content. [00:08:44] Speaker A: Introducing PostgreSQL 15 working with Distinct this is from Procona.com and coming in the next version of Postgres is the ability for distinct to operate in parallel. So in this example here, they created three different types of tables a one column table, a five column table and a ten column table, inserted a fair number of rows into each one and then did a distinct from all the rows in each of these tables. And they actually tested in every version since Postgres 9.6. And as you can see, Postgres 15s performance is better than any prior version and at the ten column level it's more than double the performance of even postgres 14. And when you actually look at the explain plan, you can see that it's actually doing the distinct in parallel. Further, you can also adjust the max parallel workers per gather to even get even greater performance. So if you want to learn more about this feature, definitely check out this blog post. The Next Piece of Content five Minutes of Postgres Episode 26 new in Postgres 15 null handling in unique Constraints in parallel distinct. And this is from Pganalyze.com and Lucas covers the previous two blog posts that I covered here Scaling Postgres. So if you want his perspective on it, you can definitely check out this piece of content. The Next Piece of Content PostgreSQL 14 Internals this is from Postgrespro.com and this is an announcement of a book called PostgreSQL 14 Internals that's been written. However, it's in Russian but it is being translated into English and part one of the book is complete already and it's freely available here as a PDF. So if you're interested in learning more about specifically the internals of Postgres, definitely encourage you to go ahead and check out this PDF. The Next piece of Content PG Agent a Postgres exporter for Prometheus focusing on query performance statistics. This is from Corout.com and they're talking about a Prometheus which does metric exporters to monitor different metrics of your system. Now, there exists a well known open source one called the Postgres Exporter that collects all sorts of statistics about postgres to be able to query or to present to users, maybe using Grafana. But they had some issues with it because they wanted to ask specific performance questions such as a CPU intensive query is affecting other queries, or a server has reached max connections because most of the connections are stuck in idle and transaction. Or an I O intensive query is affecting other queries, or a query is blocking other queries by holding an exclusive lock. Now, they assessed that the number of changes required to get all of this information from the Postgres exporter would be, as they say, quote, too massive. So what they did is they decided to build their own exporter called PG Agent so this blog post talks about that and also goes into some of the queries that they developed to be able to focus in on these performance pain points. And in the example here, they're showing a top query time per second for different queries query I o time per second, a report of client connections in some of their state and also looking at different locks, and they give you a quick start on how you can get started using this new prometheus exporter. So if you want to learn more about this new open source tool, definitely. [00:12:06] Speaker B: Check out this blog post, the Next. [00:12:09] Speaker A: Piece of Content state of PostgreSQL, how to contribute to PostgreSQL and the community. This is from Timescale.com. They're going back to the State of Postgres survey and they're revealing questions with regard to contributing to postgres and looking at the number of people that have contributed to postgres and in what way, along with some particular quotes with regard to it. And then it goes into how you could potentially get started contributing to postgres. So if you're interested in that, you. [00:12:35] Speaker B: Can check out this blog post, The. [00:12:37] Speaker A: Next Piece of Content petrone and PG Backrest combined. And this is from Pgstaff GitHub IO. And this post talks about if you're using Petroni to create a cluster of postgres machines, how you can use PG backrest to essentially backup that cluster. So if you want to learn more about how to do that, definitely check. [00:12:56] Speaker B: Out this blog post, The Next Piece of Content. [00:12:59] Speaker A: The PostgreSQL Person of the Week is Charlie Batista. If you're interested in learning more about Charlie and his contributions to postgres, definitely. [00:13:07] Speaker B: Check out this blog post and the. [00:13:10] Speaker A: Last Piece of content. We did have another episode of the Rubber Duck Dev show this past Wednesday evening. This one was actually interview with Brittany Martin, who is the host of the Ruby on Rails podcast. So we talked a little bit about podcasting, a little bit about project management and some other topics. So if you're interested in a developer interview format, we welcome you to check out our show that does it. For this episode of Scaling Postgres, you can get links to all the content mentioned in the show notes. Be sure to head over to Scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or itunes. Thanks.

Other Episodes

Episode 25

August 13, 2018 00:13:14
Episode Cover

generate_series, PG11 Partitioning, GPU Queries | Scaling Postgres 25

In this episode of Scaling Postgres, we review articles covering graphing with generate_series, partitioning in Postgres 11 and GPUs for queries. To get the...

Listen

Episode 107

March 30, 2020 00:10:48
Episode Cover

Showing Plans, Atomic, Configuration, Migrations | Scaling Postgres 107

In this episode of Scaling Postgres, we discuss how to show live query plans, the importance of atomic operations for scaling out, configuration options...

Listen

Episode 270

June 18, 2023 00:14:16
Episode Cover

Data Skew, Monitoring, Remote Access CSV, Documentation | Scaling Postgres 270

  In this episode of Scaling Postgres, we discuss how to handle data skew, tools for Postgres monitoring, how to load data remotely and Postgres...

Listen