Performance Impressions, Mystery Solving, GiST Performance, K-Anonymity | Scaling Postgres 89

Episode 89 November 11, 2019 00:15:42
Performance Impressions, Mystery Solving, GiST Performance, K-Anonymity | Scaling Postgres 89
Scaling Postgres
Performance Impressions, Mystery Solving, GiST Performance, K-Anonymity | Scaling Postgres 89

Nov 11 2019 | 00:15:42

/

Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, we discuss Postgres 12 performance impressions, mystery solving, GiST and SP-GiST performance and k-anonymity.

To get the show notes as well as get notified of new episodes, visit: 

https://www.scalingpostgres.com/episodes/89-performance-impressions-mystery-solving-gist-performance-k-anonymity/

View Full Transcript

Episode Transcript

[00:00:00] In this episode of Scaling Postgres, we talk about performance impressions, mystery solving, gist performance, and K anonymity. I'm Kristen Jameson, and this is scaling postgres episode 89. [00:00:21] Alright, I hope you're having a great week. The first piece of content is PostgreSQL version twelve, initial query performance impressions. And this is from Cyberdeck. Hyphen postgresql.com. And they basically have done some performance checks that they mentioned that they've done with previous releases for version twelve. And they're comparing version eleven to version twelve, predominantly using PG bench and I believe a number of other analytical queries that they've come up with. So they go through their testing methodology, what the different setups that they used, the hardware software, the configuration options that they set, and then they get into the results, of course. Now they had a scale indicator here. This indicates whether a particular query or work was small enough to just fit in the buffers, whether it would consume more of the ram of the machine, and whether something was so large it would probably spill onto the disks. So those are the different scales that they've done here. And then they have the meantime for version eleven, version twelve, and then what the meantime difference was, as well as these standard deviations. And there were some cases where twelve was faster. Those would be the negative percentages here, but overall it was a little bit slower than eleven. So he took a total average and it was 3.3%. But it was interesting looking at some of the queries here. Now, a lot of these are very fast, like less than one millisecond, so it's hard to gauge speed at that type of resolution. But even some of the analytical queries here, so this one was 0.8% slower, this one was 1% faster, this one was 8% slower and then 3% slower. So definitely looks like it just looking at the analytical queries that he had developed, it's a little bit slower. And the other interesting thing mentioned was the inserts being slower like this is 20% slower for the PG bench default insert, and the PG bench extra indexes insert is 12% slower. Now, we mentioned that perhaps this is due to the space improvements to the index, maybe it takes a bit more time to get the data inserted. But what I also found interesting is the standard deviation is huge. Like this is 177%. So looking at the average time, it's 0.4 milliseconds for this insert. But the standard deviation was 0.2. So that's a huge standard deviation. The same thing with the extra indexes. It's zero five in terms of the speed of it, but the standard deviation is zero two. So a huge range in terms of the inserts. So not quite sure what that is. And he doesn't have any theory from what I looked at in here. Now, of course, he does mention these are predominantly artificial tests, so you have to take everything with a grain of salt. But definitely perhaps investigate the insert performance as you're using version twelve to see if there's any differences from version eleven. So overall, very interesting blog post and I encourage you to check it out. [00:03:40] The next post is making mystery solving easier with Auto Explain. [00:03:45] So they had a situation where they were working with a customer and they tried all these variations to get better performance after migrating from Oracle to PostgreSQL. So they said they adjusted shared buffers, tried swapping the JDPC drivers they were using, they perhaps thought it was the replica, but just using the primary database had no impact. So what they did is they set the log men duration statement to zero, to log all the queries and get some timing. And it looked like it was an update statement. So we took one of the queries and ran it through Explain Analyze and didn't see an issue. Now, I don't know if it's the exact query that was hitting the database as we'll see lower down here, but they took one of the example queries, ran it through Explain Analyze and it was using the index. There was seemingly no problem. So what they did is they used Auto Explain, which is an extension that you can put in your shared preload libraries and you can configure it. They set the long men duration to be zero and to turn on Analyze. So it gives you the actual real count as well as the cost estimates. [00:04:55] And when doing that, they noticed that the filter that was being applied, it was converting the integers into doubles. So it would take the column, make it a double afloat, as you were a double float, and then take the number that was input, make it double precision as well, as well as the second column and the second value. [00:05:18] So of course this caused a sequential scan and it was no longer using the index. And they finally traced it down to a third party interface, was adding an execute update method for database connections and basically it was, as they say here quote, basically converted all the int arguments to double precision which prevented it from using the index that was there. So basically the additional software between the actual application and the database system was adding this to it and causing performance problems. So definitely something to be aware of. And also a tool, an extension, Auto explained to be able to try and help diagnose it. I wonder though, with this, the logman duration statement, I wonder if it was possible to capture more of it, to be able to diagnose it earlier without having to use the Auto Explain. But definitely something to keep in mind if you suddenly run into performance problems. And again. This is Richard [email protected]. [00:06:19] The next article is PostgreSQL twelve. Implementing kneeest neighbor space partition generalized search tree indexes. This is from secondquader.com and basically they're looking at performing a nearest neighbor. So a simple linear distance using this operator here and they're seeing what you can get the best performance with based upon indexes. So the first they didn't use any indexes at all and they just retrieved five results. The closest location to these coordinates with the data that they had on hand, it basically took looks like 2 seconds to be able to return the data that you see here. And then they added a Gist index to the location column and it took about three minutes to build. But after they did that, when they ran the query, it ran in one millisecond, so definitely a lot faster with the Gist index. And then they tried an SP Gist index again adding it to the location. And this one ran in 00:35 eight milliseconds, so even twice as fast. And they have a table of the comparisons here, so unindexed, I'm going to ignore the estimates, but the actual query time was 2.5 seconds. With a Gist index it was about a millisecond 0939, and with the SP Gist index it was 00:35 eight. In addition, the index size was smaller with the SP Gist index compared to just the Gist index. And of course the creation time of it, it took a minute and a half. So half the time of a Gist index. So definitely if you're using K nearest neighbor searches or looking for location distances, definitely use the SP Gist index as opposed to just a Gist index. So if you want to learn more about how they set up the data and did this test and the results definitely in blog post to check out the next post is achieving K anonymity with PostgreSQL. So this is talking about the PostgreSQL anonymizer and this is from the blog Tadeen. Net and they say that they use various different strategies to anonymize data to make it easier to look at without being able to identify a single individual. Like they mentioned using randomization faking, partial destruction, shuffling, noise addition. Well, now they're talking about generalization. So in terms of generalization, it's taking certain data and making it more general. Like for example, in the case of a zip code, give a zip code range. So it's not specific, but it's a zip code range. They live somewhere within this area, or a birth date, maybe you give a range of years that that birth date would fall, but not the actual birth date. Or you could even say the year that they were born would be another way to give something more general. So they give an example here of using this generalization technique where you have an SSN, a first name, zip code, birth disease, and then what they did is redacted the first name, got rid of the well, I should say they got rid of the Social Security number, redacted the first name, so it was removed. Then they used ranges for the zip code and ranges for the birthday, but then kept the disease so you could analyze this disease based upon the ranges of birth and zip code without being able to identify the specific individual that had a given disease. And then they talked about being able to do the K anonymity, which as they say here is an industry standard term used to describe the property of an anonymized data set. So basically it gives you the probability that someone can identify an individual from the set of data that it's given. So here you can specify which columns are being used as an indirect identifier. So in this case the zip code and the birth. And you can run this function to give you a K factor. So in this case it's three, which means for any value of the zip code and the birth, you will find at least three individuals in the generalized patient table. So any one grouping has at least three individuals. Now, as you get closer to one, that means that you can identify a given individual based upon those combinations. So the more columns you have in a table, for example, you would be more likely to drop down your K to where you could actually identify an individual from the data given. So basically you want to keep that value above one to keep the data anonymized. So definitely interesting blog post and an interesting tool given the different logs that are being enacted with regard to privacy. So definitely a blog post to check out the next post is Gzip in PostgreSQL, and this is from Cleverelephant CA Blog by Paul Ramsey and he's talking about a very simple extension that he set up that supports doing Gzip. Now this is not something to use to compress the data you're typically storing in PostgreSQL, but it just does as he says here, a binding of Zlib to SQL. So basically if you have a process where you need to extract data from PostgreSQL and then zip it up for like consuming as a part of an API or you're receiving Gzip data that you want to insert into PostgreSQL, this extension gives you a means to do that. So if you're interested in it definitely a blog post. And a new extension to check out the next post is PostgreSQL Query Plan Visualizer. So when you're analyzing a query plan, it gives the representation to you in text, but there's actually a PostgreSQL Explain Visualizer tool that's been developed, it's an open source, so you could install it yourself, but on this actual blog post they did that and they're calling it the Explain Plan Visualizer. So you could paste your plan in here, place the query that resulted in this plan. It will give you a graphical representation of it. So presumably something easier to read. So if you're wanting to look into that, maybe check out this blog post as well as maybe the separate PEV tool that you can install on your website. [00:12:36] The next post is Streaming Replication setup in PostgreSQL twelve how to do it right. And this is from the highGo CA Blog and this blog post just basically goes through how to set up streaming replication in PostgreSQL twelve. Given the changes that have been made to the recovery.com file, the fact that it's gone and they're now using the two different signal files. So this is another piece of content if you need to set that up. Second related post is replication failover with PG rewind in PostgreSQL twelve. Again, with the recent changes to PostgreSQL twelve, this goes through the process of replication failover or with PG Rewind. I call it failing back. So once you've failed over to a new primary, so have a replica be the new primary. If you want to fail back to the old primary, PG Rewind helps you accomplish that process and they go through how to do that in detail. So if you're interested in that, definitely a blog post to check out. [00:13:36] The next post is Configure HAProxy with PostgreSQL using built in pgSQL Check. Now, we've covered the previous article related to this HAProxy using X in ITD, and I believe they didn't use pgSQL Check in that case. They use a custom script to do it, which to me looking at that post seems like a better way to do it compared to this one. But this one their objective was to use the built in pgSQL Check, but in order to do that, it doesn't really do authentication to verify that it can connect successfully in order to check whether a database system is in replication mode or not. So they actually set it up to use this tool using modifications to the pghba. Comp file, which I'm not necessarily a fan of, but it is a way to do it to use it. And this blog post explains how to do these changes so that you can do failover switchover procedures using HAProxy and the built in pgSQL Check tool. So if you're interested in doing that is the blog post to check out. [00:14:45] Next post is Introduction and how to etcd clusters for Petrone. So if you use Petrone and want to use etcd for your consensus holder, this is a pretty long comprehensive blog post that explains etcd and how to set it up for that purpose. [00:15:02] And then the last post is PostgreSQL twelve generated columns. And this is from fluca 1978 GitHub IO, and it's another blog post that talks about the new generated column features in PostgreSQL twelve that does it. For this episode of Scaling Postgres, you could get links to all the content mentioned in the show Notes. Be sure to head over to Scalingpostgres.com where you can sign up to receive weekly notifications of each episode, or you could subscribe via YouTube or itunes. Thanks.

Other Episodes

Episode 247

January 11, 2023 00:14:25
Episode Cover

Generate Test Data, Faster Archiving, Date Statistics, Useless Indexes | Scaling Postgres 247

In this episode of Scaling Postgres, we discuss how to generate test data, how WAL archiving is faster in PG15, how to set date...

Listen

Episode 273

July 10, 2023 00:15:35
Episode Cover

Debian / Ubuntu Packaging, ivfflat Indexes, In vs Any, View Logging | Scaling Postgres 273

  In this episode of Scaling Postgres, we discuss how Debian & Ubuntu package Postgres, how ifflat indexes work, in vs any performance and how...

Listen

Episode 167

May 30, 2021 00:18:42
Episode Cover

Interpreting Explain Analyze, Bulk Loading Data, Limiting Update & Delete, pg_buffercache | Scaling Postgres 167

In this episode of Scaling Postgres, we discuss how to interpret explain analyze, the best ways to load data in bulk, how to limit...

Listen