100x Optimizations, Auditing, Document Schema Designs, Checksums | Scaling Postgres 206

Episode 206 | March 13, 2022 | 00:17:51
Scaling Postgres

Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, we discuss 100x optimizations, how to audit table activity, how to design document schemas, and how to set up table checksums.

To get the show notes as well as get notified of new episodes, visit: 

https://www.scalingpostgres.com/episodes/206-100x-optimizations-auditing-document-schema-designs-checksums/

Episode Transcript

[00:00:00] Hello. In this episode of Scaling Postgres we talk about 100x optimizations, auditing, document schema designs and checksums. I'm Creston Jamison, and this is Scaling Postgres episode 206.

[00:00:23] All right, I hope you, your friends, family and coworkers continue to do well. Our first piece of content is "How we optimized PostgreSQL queries 100x". This is from towardsdatascience.com, and he's talking about some optimizations he did for his application that led to up to 100x performance improvements. Sometimes it was 10x, sometimes it was two to five, but he goes through several different things he did to improve query performance. Now, he makes a note here that some of the queries he was optimizing had a lot of joins, three to ten, and yielded up to a million rows, so that's quite a lot. And he had to do it within specified UX time limits, because I believe this is a web application.

So here are some of the things that he discovered. One is that in his cases a left join can sometimes be more performant than an inner join, particularly when some columns are correlated. Just changing an inner join to a left join wasn't sufficient; he also had to include the correlated columns. So for example, the two tables being joined here share the same repository ID, and when he added that condition, the left join gave greater performance because the planner chose a different plan. The inner join was often using a nested loop, whereas the left outer join avoided that and did some things that could be cached better. So that was definitely a pretty good win.

The next area he looked at is hashing VALUES. So instead of using an IN with a bunch of IDs, he says 9,000 IDs, he used ANY with a VALUES list, and that resulted in a greater performance improvement. Sometimes this worked, sometimes it didn't, but when it worked he was able to get another ten to 100-fold improvement. So this looked a little bit hit or miss, but it is possible.

The next area he looked at was extended statistics. He had an example query where the repository ID was in one particular set of values and the merge commit ID was in another particular set of values, and those columns were highly correlated. So he created extended statistics on those two columns using the CREATE STATISTICS command, and as long as you're using PostgreSQL version 13 or greater you can add these. That resulted in the planner doing hash joins as opposed to nested loops, again resulting in about a ten to 100x speed-up.

The next area talked about is primary keys. Initially they were using a natural primary key, but the source of that primary key unfortunately was external, and suddenly that source decided to change the format, so they had to redo all of their keys. What was interesting is that when they changed it to a simple integer versus a text value, they noticed a two to five times performance improvement with joins on those columns. Something we'll notice in a few posts this week is that integers are generally ideal for join and key performance.

The next area he covered was CLUSTER, and this is a command that allows you to order the data in the table close to how it typically gets queried. Now, the unfortunate side effect of CLUSTER is that it locks everything out of the table, so it's not something you can run on a regular basis, clearly. I would probably only use it for a data mart or data warehouse purpose, not anything that's constantly being acted upon in a transactional manner; in that case I probably wouldn't use CLUSTER. But it can give you a performance improvement by aligning the data to how it's being queried, and he was noticing a two to five x improvement when using CLUSTER, so not as much as 100, but it gives you a little bit.

The next area he talks about is plan hints. He was using the extension pg_hint_plan, which basically works around what the planner decides to do: you give hints that say, okay, do an index only scan here, or just use a regular index scan, along with a number of other hints to the planner about how to execute the query. Interestingly, for this part he mentioned the different changes that were made but never mentioned the performance improvement, so I imagine it wasn't that significant, but he didn't really say. This is a pretty long blog post, and I definitely encourage you to check it out to see what he discovered in terms of performance improvements.
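As a rough sketch of what a few of these techniques look like in SQL (the table, column and index names below, such as pull_requests and repository_id, are hypothetical stand-ins rather than the article's actual schema):

```sql
-- Rewriting a large IN list as ANY over a VALUES list, which can let the
-- planner build a hash over the constants instead of checking a long list.
SELECT *
FROM   pull_requests
WHERE  node_id IN (101, 102, 103);                 -- imagine thousands of ids

SELECT *
FROM   pull_requests
WHERE  node_id = ANY (VALUES (101), (102), (103));

-- Extended statistics telling the planner that two columns are correlated,
-- so it stops treating their selectivities as independent.
CREATE STATISTICS pr_repo_commit_stats (dependencies)
    ON repository_id, merge_commit_id
    FROM pull_requests;
ANALYZE pull_requests;

-- CLUSTER physically reorders the table to match an index, but it takes an
-- exclusive lock, so it is best reserved for data-mart style workloads.
CLUSTER pull_requests USING pull_requests_repository_id_idx;
```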
[00:05:00] The next piece of content is "Postgres Auditing in 150 lines of SQL". This is from supabase.com, and they're talking about setting up an auditing system. They have an example table here called members that just has an ID and a name, and what they came up with is a relatively simple API where you can just call this function, enable_tracking, and it automatically starts auditing the transactions on that table. Basically, if something gets inserted, updated or deleted, it will track that, and it looks similar to the layout here. You have a unique record identifier; they're using a UUID for this. When a record is inserted, the record ID will of course be there. When it's modified, both the record and the old record will be there. And when it's deleted, you'll only see the old record referenced in these columns. As for the rest of the columns, it records which operation it was (insert, update or delete), a timestamp, the table OID (they did this for more efficient querying, because again an integer is more efficient than text for querying or indexing), the table schema and table name, and then what was in the record and what was in the old record. So you can see the changes from an edit, what gets newly inserted, and what gets deleted.

Now, in terms of constructing this, they created a separate schema to store it in, called audit, and then they created the table we just looked at to hold the results. They created an index on the timestamp, because they figured that's going to be queried pretty frequently: what time did a certain change happen? And they actually used a block range index, which is great for timestamps that are not going to be adjusted. It's basically an append-only table, which is what this is; each change in any audited table just gets inserted constantly into that audit table. So they went with a BRIN index on the timestamp column, and then they created another index on the table OID. They didn't do it on the table name, they did it on the OID, so that's how they expect you to query this table.

Now, in terms of creating that unique record identifier, they actually used a UUID. How they did that is they combined the table OID plus the primary key to generate the UUID, so that the identifier will always be the same for a particular row, because primary keys shouldn't be changing; that row's identity will change if you change the primary key. So this is the function they use to identify the primary keys, and then another function to actually consume those and build the UUIDs; you can look at this function here to see how that works. Then they added an index on the record version and the old record version. And this is the generic function that will work for any table: it basically reads the values, outputs them, and inserts into the record_version table in the audit schema. So that's the function that gets fired. But if you've used triggers, you know there's a trigger that actually fires a function. What we just described was the function; the triggers get added by this enable_tracking function. So this is the actual API that turns tracking on or off. When it's on, it adds those triggers, which fire and insert into the audit table; when it's disabled, it turns them off, and you can see those functions here. And that's pretty much it. They said they did it with two lines of code to spare, so they did it in less than 150 lines.

Now, one caveat they do mention is that this will be a hit to performance, because you're placing triggers, and any change to that table is going to fire these triggers to do inserts into this other table. So you are going to have a performance hit with doing this, and they do say for a higher write frequency you can look at the extension pgAudit. Then lastly, they also have an extension they created, called supa_audit, so you're able to install that extension and just start using the code here. But if you want to learn more, check out this blog post.
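To give a sense of the pattern without reproducing the post's roughly 150 lines, here is a minimal sketch of trigger-based auditing along the same general lines: an audit schema, an append-only log table with a BRIN index on the timestamp, and one generic trigger function attached to the example members table. The UUID record identifiers and the enable/disable API are left out, so treat this as an illustration of the technique rather than the Supabase implementation.

```sql
-- Minimal sketch of trigger-based auditing (not the post's actual code).
CREATE SCHEMA IF NOT EXISTS audit;

CREATE TABLE audit.record_version (
    id           bigserial PRIMARY KEY,
    op           text        NOT NULL,              -- INSERT / UPDATE / DELETE
    ts           timestamptz NOT NULL DEFAULT now(),
    table_oid    oid         NOT NULL,              -- cheaper to index than the name
    table_schema name        NOT NULL,
    table_name   name        NOT NULL,
    record       jsonb,                             -- new row for INSERT/UPDATE
    old_record   jsonb                              -- old row for UPDATE/DELETE
);

-- The log is append-only, so a BRIN index on the timestamp stays tiny.
CREATE INDEX record_version_ts ON audit.record_version USING brin (ts);
CREATE INDEX record_version_table_oid ON audit.record_version (table_oid);

-- One generic trigger function works for any audited table.
CREATE OR REPLACE FUNCTION audit.insert_update_delete_trigger()
RETURNS trigger
LANGUAGE plpgsql AS $$
BEGIN
    INSERT INTO audit.record_version
        (op, table_oid, table_schema, table_name, record, old_record)
    VALUES (
        TG_OP,
        TG_RELID,
        TG_TABLE_SCHEMA,
        TG_TABLE_NAME,
        CASE WHEN TG_OP IN ('INSERT', 'UPDATE') THEN to_jsonb(NEW) END,
        CASE WHEN TG_OP IN ('UPDATE', 'DELETE') THEN to_jsonb(OLD) END
    );
    RETURN COALESCE(NEW, OLD);
END;
$$;

-- "Enabling tracking" on a table then amounts to attaching the trigger.
CREATE TRIGGER audit_members
    AFTER INSERT OR UPDATE OR DELETE ON members
    FOR EACH ROW EXECUTE FUNCTION audit.insert_update_delete_trigger();
```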
[00:09:07] The next post is actually related to the previous one, because it was the one covered in Five Minutes of Postgres. This is "Five minutes of Postgres, Episode 8: Postgres auditing, table triggers with Supabase vs. the pgAudit extension". He compares and contrasts this a little with the pgAudit extension and how it relies on sending information to the Postgres log files rather than to a table, so there are no triggers. But the burden with pgAudit is the amount of logging that gets done, so you have to be prepared to handle that increase in log usage for your Postgres installation if you're going to be doing a lot of auditing with pgAudit. I definitely encourage you to check out this week's episode as well.

[00:09:51] The next piece of content is "Database Design Using Documents". This is from endpointdev.com, and he's talking about a scenario where you have a table of websites that you're going to be scanning, and you want to record the URLs of the pages that you've scanned. He has an example schema here, and the goal of the article looks to be finding the smallest possible size for storing this information, so it doesn't really look at performance; that's one caveat to keep in mind. Generally a smaller table will be faster to scan and smaller indexes will be faster at retrieving data, but it all depends on how you're querying. In this example, though, he's looking at size.

He says, okay, let's take this schema and change it to a normalized design where we're not using natural keys but integers instead, which should allow our largest tables to be very narrow, so not recording a lot of information per row. [00:10:49] So he did that. The next scenario was a composite type, so he created a composite type, although you could have done this without one and just used an array of text strings for the URLs: basically the website page and then each URL in an array, so I think you could have done it that way. Next he looked at a JSON column and then a JSONB column to again compare the sizes. Looking at the results, the original table with the natural keys was 436 megabytes, the normalized table where he made it more narrow was 387 megabytes, and the composite type with the array was 324. The reason he gives for this size difference is the per-row overhead when you have many rows: the normalized table has many rows and there's overhead for each of those rows, whereas the composite type with the array has far fewer rows, but each row is much larger. Then he looked at the JSON and JSONB columns, which came out even a little smaller than that array, which I'm a little surprised by. He was saying there is some overhead to using a composite type, so I wonder if a straight-up array would have been the smallest version. But it's an interesting investigation into different methods of setting up a schema and the resulting size differences. Again, to me it all depends on what queries you're going to be running as to which format is more efficient: based on how you're querying, would it be more efficient to have the normalized tables or the JSON format? To me that's more important than the size differences. But this is an interesting piece of content, and I definitely encourage you to check it out if you're interested.
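If you want to run this kind of size comparison on your own data, the measurement itself is simple. A sketch, assuming two hypothetical layouts of the same scanned-URL data (one row per URL versus one JSONB document per website):

```sql
-- Layout 1: normalized, one narrow row per scanned URL.
CREATE TABLE website_page (
    website_id integer NOT NULL,
    url        text    NOT NULL
);

-- Layout 2: one document per website, all URLs in a JSONB array.
CREATE TABLE website_doc (
    website_id integer PRIMARY KEY,
    urls       jsonb   NOT NULL     -- e.g. ["/", "/about", "/pricing"]
);

-- After loading the same data into both, compare total on-disk footprint
-- (table, TOAST and indexes together).
SELECT pg_size_pretty(pg_total_relation_size('website_page')) AS normalized_size,
       pg_size_pretty(pg_total_relation_size('website_doc'))  AS jsonb_size;
```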
[00:12:38] The next piece of content is "PostgreSQL: Creating checksums for tables". This is from cybertec-postgresql.com, and they're talking about checksums, which basically output a fingerprint based on a file. But instead of doing it for a file, like the md5sum checksum they show here, they're asking: can we do it for a whole table, or even particular rows? That's what they do in this post. The plan is to hash each row and then combine those hashes. In order to do that, they wanted to make sure things were in a proper order, so they wanted to use an ordered set, and they show an example of one here. This is a case where ordering becomes important: when you're calculating an average, order doesn't matter, but if you're trying to find a median, the middle value, the order matters. And for what they're trying to do, a checksum built upon checksums, it becomes absolutely critical. So they use the MD5 function that's available in Postgres and create a custom aggregate, which they call their MD5 aggregate, designing a function that basically hashes a row, takes the next row, hashes that into the result, and so on until it runs out of rows. They created their test table here, and using this new aggregate function over the whole table, they are able to get a single MD5 hash. Now, they didn't say how long this took to run; I imagine it might take a while, doing an MD5 hash of every row. But they also show how you can get hashes for subsets of the data using this example query. So it's definitely an interesting technique for comparing, say, two different tables to see if they match. If you're interested in learning more, definitely check out this blog post.
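A simplified version of that idea (not the post's exact code) is a custom aggregate that folds each row's text representation into a running MD5 hash, with an ORDER BY inside the aggregate call so the result is deterministic; the table and column names here are placeholders:

```sql
-- Fold the next row's text into the running hash.
CREATE OR REPLACE FUNCTION md5_step(state text, row_text text)
RETURNS text
LANGUAGE sql IMMUTABLE AS
$$ SELECT md5(state || coalesce(row_text, '')) $$;

CREATE AGGREGATE md5_agg(text) (
    SFUNC    = md5_step,
    STYPE    = text,
    INITCOND = ''
);

-- One fingerprint for the whole table; the ORDER BY keeps it stable
-- regardless of the physical row order.
SELECT md5_agg(t::text ORDER BY id) AS table_checksum
FROM   t_demo AS t;

-- Or one fingerprint per subset, e.g. to compare slices of two tables.
SELECT grp, md5_agg(t::text ORDER BY id) AS group_checksum
FROM   t_demo AS t
GROUP  BY grp;
```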
[00:14:33] The next piece of content is "PostGIS vs. GPU Performance and Spatial Joins". This is from crunchydata.com. There was a post this week about someone using a GPU-assisted platform to plot out parking infractions in the city of Philadelphia, using a custom Python execution engine. And he said, I wonder how PostGIS with Postgres would compare calculating the same data. So that's what he did here. He set up his system, downloaded the publicly available data and loaded it in, set up his tables, made the appropriate changes to get them working, added some indexes, and set up this query. But he noticed it wasn't using all of his parallel workers, so he adjusted min_parallel_table_scan_size, and with that he was able to get a parallel execution plan. Running four workers, he was able to process the data in 24 seconds. Now, the custom GPU solution did it in 8 seconds, so he said, sure, we didn't beat it, it was three times as slow. But the thing about it is, that was a custom-designed solution, so it doesn't have all the overhead that PostGIS and Postgres add; on the other hand, with that overhead you get a lot of power. For example, you can now come up with other queries about this data, publish it, and remotely query it using their pg_tileserv and pg_featureserv extensions. So I thought that was a pretty interesting blog post and comparison.

[00:16:09] The next piece of content is "CQL Filtering in pg_featureserv". This is from crunchydata.com, and they're talking about an enhancement they've made to pg_featureserv, which is an extension for accessing PostGIS data over the web. They've added CQL capabilities, which is a common query language related, I believe, to geographic information systems. It lets you set this up and do querying with it, so if you're interested in that, you can check out this blog post.

[00:16:40] The next piece of content is "Queries in PostgreSQL: Statistics". This is from postgrespro.com, and it is a very in-depth document about statistics. So if you're interested in learning more about that, definitely check out this blog post.

[00:16:56] The next piece of content: the PostgreSQL Person of the Week is Ibrah Akman. If you're interested in learning more about Ibrah and his contributions to Postgres, definitely check out this blog post. And the last piece of content: we did have another episode of The Rubber Duck Dev Show this past Wednesday. This one reviewed the top six through ten web application security risks according to OWASP. So if you're interested in that, I definitely encourage you to check out our show.

[00:17:26] That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.

Other Episodes

Episode 145

December 20, 2020 00:14:06

PgMiner Botnet, Collation Index Corruption, postgresql.conf, Custom Data Types | Scaling Postgres 145

In this episode of Scaling Postgres, we discuss the PGMiner botnet attack, how collation changes can cause index corruption, managing your postgresql.conf and implementing...

Episode 138

November 01, 2020 00:16:56

Scaling Connections, TimescaleDB, Time-series Gaps, Red & Golden Signals | Scaling Postgres 138

In this episode of Scaling Postgres, we discuss scaling connections, the release of TimescaleDB 2.0, how to find time-series gaps and monitoring RED &...

Episode 199

January 24, 2022 00:12:17

Shaped Sample Data, Version Changes, Missed Bottlenecks, Indexes for Newbies | Scaling Postgres 199

In this episode of Scaling Postgres, we discuss how to generate shaped sample data, track changes between Postgres versions, identify missed bottlenecks and use...
