Statistics Speed Up, Synchronous Commits, TLS Updates, Avoiding Cursors | Scaling Postgres 128

Episode 128 August 24, 2020 00:16:56

Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, we discuss speeding up performance with statistics, setting synchronous_commit, updates to TLS settings and how to avoid cursors.

To get the show notes as well as get notified of new episodes, visit: 

https://www.scalingpostgres.com/episodes/128-statistics-speed-up-synchronous-commits-tls-updates-avoiding-cursors/


Episode Transcript

[00:00:00] In this episode of Scaling Postgres, we talk about a statistics speed up, synchronous commits, TLS updates and avoiding cursors. I'm Creston Jamison, and this is Scaling Postgres episode 128.

[00:00:22] All right, I hope you, your family, friends and coworkers continue to do well. Our first piece of content is "How we used Postgres extended statistics to achieve a 3,000x speedup". This is from Affinity.co. They had a query that was giving them poor performance, and when they ran EXPLAIN ANALYZE on it, they noticed that the statistics led the planner to expect one row as the result of a join, but when the query was actually run it pulled over 2,000 rows and took a lot longer than anticipated, on the order of minutes. So whenever there is a difference between what the planner thinks the results should be and the actual results, you want to check out your statistics, and that's basically what they did.

Now, this post goes into the basics of statistics: Postgres keeps statistics on each column, based on a sample, and it doesn't keep cross-column statistics by default, which we'll come back to in a second. You can set the default statistics target for the whole database, but you can also set it for individual columns, as they indicate here. Extended statistics let you define additional statistics that describe relationships between columns. In the example they use, you have artists, albums and songs, with a relationship between each of them: a song belongs to a particular album (maybe more than one), and that album belongs to an artist (maybe more than one in that case too). But given a particular song, there is a much smaller number of albums and artists it can belong to, so the distribution isn't random; there is correlation between the columns. When you have that, in Postgres 10 and above you can create extended statistics that capture that fact so the query planner can take it into account. They give an example of the command, creating statistics on album ID and artist ID from the songs table.

Now, one thing they also suspected is that they were using an inefficient join, so they review nested loop joins, merge joins and hash joins, and they note that a nested loop can run into problems if there are too many rows in the outer relation, because the lookup against the inner relation has to be executed once for every outer row. So I believe they were hoping that, with updated statistics, the planner would potentially choose a different join strategy. What they observed is that the most costly step in their EXPLAIN ANALYZE output was looping over an index access on one particular index, which ran over 13,000 times. Even if that index access is fast, it's going to take forever if you have to do it over 13,000 times. Plus, there's a problem whenever the planner's estimates and the actual counts differ, like we saw with the row counts. So what they did is, quote, "maxed out per-column statistics" for all the columns involved.

[00:03:27] I think these are relatively generic, but they're org ID and list ID in one table and ID and entity attribute ID in the other table, and they created dependency statistics for each of these as well. As a result, they got a different query plan that finishes in 42 milliseconds.
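As a rough sketch of the kind of commands being described, using the article's songs example (the statistics target, the statistics name and the column names here are assumptions for illustration, not copied from their post):

    -- Max out per-column sampling for the columns involved (10000 is the maximum target)
    ALTER TABLE songs ALTER COLUMN album_id SET STATISTICS 10000;
    ALTER TABLE songs ALTER COLUMN artist_id SET STATISTICS 10000;

    -- Tell the planner these columns are correlated (extended statistics, Postgres 10+)
    CREATE STATISTICS songs_album_artist_deps (dependencies)
        ON album_id, artist_id FROM songs;

    -- Resample so the planner actually sees the new statistics
    ANALYZE songs;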
That 42 millisecond plan is vastly faster than the two-minute one, and therein lies the 3,000-fold speed increase. Now, it's interesting that they didn't mention how much improvement they would have gotten just from maximizing the per-column statistics versus also adding the extended (correlation) statistics, because I would assume that increasing the sampling of each column on its own would let the planner produce better estimates; I'm just not sure how much additional benefit the extended statistics provide. That would have been an interesting thing to know. But if you run into a scenario where the planner expects one result from a query and the actual results are quite different, check out your statistics to see if you can increase how much is being sampled, or look for correlations between columns that may get you better performance. If you're interested in learning more, go ahead and check out this blog post.

[00:04:40] The next piece of content is "PostgreSQL synchronous_commit options and synchronous standby replication". This is from Percona.com, and they're talking about the synchronous_commit setting. First, they talk about WAL propagation. The WAL is the write-ahead log: the log of all activity happening in the database in terms of object changes as well as inserts, updates and deletes. They walk through a flow diagram that assumes one local PostgreSQL system and a replica, the remote PostgreSQL system. First, WAL records are written to the WAL buffers in memory. They are then written to the operating system page cache, and on some interval they are flushed to the actual disks of the local system. Meanwhile, if you have replication set up, the WAL sender ships the records to the WAL receiver, which does a remote write into the page cache on the remote system. At some interval those records are flushed to disk there, and usually some time later the changes are applied to the remote database itself so the inserted data becomes visible to readers.

With this flow you can determine which synchronous_commit setting you want to use. When the setting is off, essentially you're just writing to the WAL buffers and you have no guarantee that anything actually makes it to disk. That's clearly fast, but you're opening yourself up to data loss even when operating on a single server, so if you care about your data, you probably don't want to choose off. The next option is local, which ensures the WAL records actually get flushed to disk, so everything is safe on disk, at least on that local system. The next option is remote_write, which ensures the WAL is written to at least the OS page cache on the remote system. That gives you a fast acknowledgment, so there's little slowdown, but if the replica fails you can't be sure the records made it into its WAL files. The on setting ensures the WAL gets flushed to disk on the replica, and this is the default setting for Postgres, so you have to change it if you want a different synchronous_commit behavior. And the last option is remote_apply. This takes the longest but is the most durable: it ensures that what was written on the local database has been applied on the remote database, all the way through, so it's available for selects. So it's the most durable but has the greatest performance penalty.
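As a minimal sketch of how this setting is applied (this assumes synchronous_standby_names is already configured; without a synchronous standby the remote guarantees don't apply):

    -- Accepted values, from least to most durable: off, local, remote_write, on, remote_apply
    ALTER SYSTEM SET synchronous_commit = 'on';   -- cluster-wide setting; 'on' is also the built-in default
    SELECT pg_reload_conf();                      -- reload so the change takes effect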
And then the next section of the article explains how you can change the scope of synchronous_commit. You can of course set it at the cluster level, but there are multiple other levels as well. For example, you can do a SET LOCAL of the setting, which applies it at the transaction level and lets you customize how durable a particular transaction should be. You can also set it at the session level or at the user level, so you can have certain users where synchronous_commit is off and other users where it is on. So you can definitely tailor this so that certain transactions or certain work is more durable, and other work is faster and less durable. If you're interested in learning more about how to adjust your synchronous_commit settings, definitely check out this blog post from Percona.com.

[00:08:08] The next piece of content is "TLS related updates in PostgreSQL 13". This is from Highgo.ca, and they're talking about additions to postgresql.conf that let you define a minimum and a maximum SSL/TLS protocol version. In previous versions of Postgres the default minimum was TLS version 1.0, which is not considered secure anymore; in Postgres 13 it's raised to 1.2. These settings let you tailor your TLS configuration to what is acceptable for your environment. The next thing they mention is that minimum and maximum protocol versions can also be specified on the client side, so you can pass in these values when you connect with a client such as psql. The next area they mention is channel binding when using SCRAM. We've covered this in a previous episode of Scaling Postgres, but this covers how you can set up channel binding so that, from the client's perspective, you can authenticate the server you're connecting to. And they also mention a new SSL password option on the client side. So definitely some great changes to take into account with PostgreSQL 13.

[00:09:18] The next piece of content is "Oracle to PostgreSQL: Cursors and Common Table Expressions". This is from 2ndQuadrant.com. I actually found this quote interesting: "In a 23 year career with PostgreSQL, I have only actually found a need to use cursors twice and I regret one of those." He's basically communicating that with Oracle you apparently use a lot of cursors when programming, whereas with PostgreSQL you don't tend to do that, and there are other ways to achieve the same thing. He talks a little about how cursors can be expensive compared to relying on set-based logic, essentially using the code that's already been written for pure SQL rather than programming the row handling yourself. As an example, he uses generating hierarchical data: rather than using a cursor, he uses a common table expression with the WITH RECURSIVE feature to generate the hierarchy that you might otherwise have tried to produce with a cursor or some similar method, along the lines of the sketch below. If you're interested in learning more about this, definitely check out this blog post from 2ndQuadrant.
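As a minimal sketch of that pattern (the table and column names here are assumptions, not taken from the article):

    -- Walk a hierarchy with a recursive CTE instead of a cursor
    WITH RECURSIVE org AS (
        SELECT id, manager_id, name, 1 AS depth
        FROM employees
        WHERE manager_id IS NULL            -- start at the top of the hierarchy
        UNION ALL
        SELECT e.id, e.manager_id, e.name, org.depth + 1
        FROM employees e
        JOIN org ON e.manager_id = org.id   -- repeatedly join children onto the rows found so far
    )
    SELECT * FROM org ORDER BY depth;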
[00:10:29] The next piece of content is "Eight fascinating things you probably didn't know PostgreSQL can do". This is from EnterpriseDB.com, and it's an eclectic set of SQL tricks and tips you can use with Postgres. I'm not going to run through all of these in detail, so you should definitely consult the blog post, but they cover utilizing whole-row references, comparing a selection of columns, hard-coded tables, custom config parameters, the fact that booleans can stand alone, converting column data types for free, finding which rows belong to which underlying partition, and the fact that tables are types. So if you're interested in working a bit more in depth with Postgres, definitely check out this list of tips and tricks, including some I hadn't really seen before.

[00:11:22] The next piece of content is "Updating the PostgreSQL root.crt file". The scenario is that you have SSL certificate authentication set up between two different Postgres servers, perhaps because you're using the Postgres foreign data wrapper, as they mention here. So you have server certificates and server keys, client certificates and client keys, and your own certificate authority that generated the root.crt file and signed all of these certificates. What happens when the root certificate comes up for expiration, and what's the best way to handle it? What you don't want to do is recreate the root certificate, regenerate the keys and certificates for a single server, and overwrite them there, because you'll run into problems: the other server is still using the old root certificate. He says the correct way to do it is to generate a new CA certificate and add it to the same root.crt file, so that both the old and the new CA certificates are trusted. Then you generate your new certificates signed by the new CA, which lets you update all of your certificates. If you want to know a bit more about the process, definitely check out this post from 2ndQuadrant.com.

[00:12:41] The next piece of content is "PostgreSQL GROUP BY expression". This is from CyberTec-PostgreSQL.com, and they're talking about GROUP BY. You can group data by the values in a column; here they group by region, or by whatever the first column is, which is a basic GROUP BY. A GROUP BY expression means you define an expression to group by instead. In this case they group on whether production is greater than 9,000 (I think this is barrels of oil), or they use a CASE statement in the GROUP BY. And that's what I've used a lot of this for: CASE statements in a GROUP BY in order to get particular slices of data. The post continues by showing how to use that, as well as following up with grouping sets. So if you want to learn more about the capabilities of GROUP BY, definitely check out this post from CyberTec-PostgreSQL.com.

[00:13:34] The next piece of content is "Using Postgres Row-Level Security in Python and Django". This is from pganalyze.com, and they're talking about implementing Postgres's row-level security. In this scenario your application actually needs a database role per application user, because you want the security enforced in the database itself. In their example they have a salespeople table, and for each salesperson they create a role; they happen to derive it from the ID in the salespeople table, and there's probably a better way to do that, but that's what they use for this example. Each of these roles is a member of a salespeople group role, so you can grant permissions to that group. Then they enabled row-level security so each role can only access the rows it should, along the lines of the sketch below.
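A minimal sketch of that database-side setup, with the table, role, column and policy names assumed for illustration (the post derives role names from salesperson IDs; the policy condition here is a simplified assumption):

    -- One group role, plus one role per salesperson that is a member of it
    CREATE ROLE salespeople;
    CREATE ROLE salesperson_1;
    GRANT salespeople TO salesperson_1;
    GRANT SELECT ON sales TO salespeople;

    -- Restrict each salesperson to their own rows
    ALTER TABLE sales ENABLE ROW LEVEL SECURITY;
    CREATE POLICY sales_own_rows ON sales
        USING (owner_role = current_user);   -- assumes a column holding the owning role's name

    -- The application connects as a common user that is a member of these roles
    -- and switches identity before querying:
    SET ROLE salesperson_1;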
To act as a particular person, they use the SET ROLE command. With Django, the application uses a common database user to connect and pull the data, but when it needs to act as a particular user, it issues SET ROLE before running the command. After showing how the database side works, they then show the implementation in Django itself. The main things they have to do are: after inserting a new salesperson, create the corresponding role and grant it membership in the salespeople role; and whenever a query runs, set the role to that person. Now, I don't necessarily think I would use row-level security for something like a web application unless it's for internal use only and the users already have database accounts for other systems, so it makes sense that they each have a database user set up for a specific purpose; otherwise, to me, this tends to get very complicated. The other thing they mention here is performance: queries with Postgres's row-level security feature don't perform as well as queries without it, so if you're interested in scaling Postgres, this may be an option you want to avoid. But if you're interested in learning more about row-level security and a basic setup for it, you can check out this post from pganalyze.com.

[00:15:41] The next piece of content is "How to set up PostgreSQL on an IPv6-enabled network". This is from Highgo.ca, and it goes into a lot of detail about how to set up IPv6 for PostgreSQL. So if you're interested in doing that, check out this blog post.

[00:16:00] The next piece of content is a webinar, "Similarity Queries in PostgreSQL", as a follow-up. This is a webinar on similarity queries, and there's about a ten-minute YouTube video introducing what the webinar covers, so you can preview it here, and if you're interested, click through to register for the full webinar.

[00:16:20] And the last piece of content is that the PostgreSQL Person of the Week is Charles Clavadetscher. So if you're interested in learning more about Charles and his contributions to PostgreSQL, definitely check out this blog post.

That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to Scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.

Other Episodes

Episode 83

September 30, 2019 00:13:14

Postgres 12, Window Frames, Index Types, Synchronous Replication | Scaling Postgres 83

In this episode of Scaling Postgres, we discuss the release of Postgres 12, how to use window frames, the different index types, and when...


Episode 103

March 02, 2020 00:13:57

Scaling Out, Planner Estimation, Create Statistics, Stay Curious | Scaling Postgres 103

In this episode of Scaling Postgres, we discuss how to scale out, how the planner estimates, uses of create statistics and investigating PostgreSQL run...


Episode 215

May 15, 2022 00:17:13

New Releases, AlloyDB, Time Zones, Connecting to Postgres | Scaling Postgres 215

In this episode of Scaling Postgres, we discuss parallel server-side backup compression, IO in Postgres, parquet files and the new pg_stat_monitor extension. To get...
