Interpreting Explain Analyze, Bulk Loading Data, Limiting Update & Delete, pg_buffercache | Scaling Postgres 167

Episode 167 | May 30, 2021 | 00:18:42

Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, we discuss how to interpret EXPLAIN ANALYZE, the best ways to load data in bulk, how to limit rows affected by UPDATE or DELETE, and using pg_buffercache.

To get the show notes as well as get notified of new episodes, visit: 

 https://www.scalingpostgres.com/episodes/167-interpreting-explain-analyze-bulk-data-loading-limiting-update-delete-pg_buffercache/

Episode Transcript

[00:00:00] In this episode of Scaling Postgres, we talk about interpreting EXPLAIN ANALYZE, bulk loading data, limiting update and delete, and pg_buffercache. I'm Creston Jamison, and this is Scaling Postgres episode 167.

[00:00:23] All right, I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is "How to interpret PostgreSQL EXPLAIN ANALYZE output". This is from cybertec-postgresql.com. They're talking about how you can interpret EXPLAIN ANALYZE. First they cover what it is: EXPLAIN gives you the query plan that the planner intends to run for any query you send to Postgres. So you can just do EXPLAIN and then the query you're going to run, and it shows you the plan it intends to take. When you use ANALYZE, as in EXPLAIN ANALYZE, it will actually run that query. So, as they say, be cautious if you're going to use it with INSERT, UPDATE, or DELETE, because it's actually going to perform those operations. In addition to ANALYZE, which runs the query, you can also request output on the buffers, indicate whether you want it verbose or not, include settings, and, in later versions of Postgres, show what WAL usage is occurring. As they say here, the best way to call it is EXPLAIN (ANALYZE, BUFFERS) and then your statement, and they also recommend turning track_io_timing on to get more I/O detail in the EXPLAIN ANALYZE results. They give an example of what it looks like when run against a pretty simple query, and it shows the output you see here. It gives the cost estimates from the planner: the cost to return the first row, the cost to return all rows, how many rows will be returned, and the width. When you're using ANALYZE, it also gives you the actual time. Each of the arrows indicates a node that was processed, so a more complex query can have many different nodes, and each of those has a timing indication of how long it took to run. You may have more than one loop, so the timing needs to take into account how many loops a node ran. You can see where data was filtered out by a filter condition, and you'll get different nodes showing up depending on whether index scans were used or not. Then he shows that with the BUFFERS option it reports how many buffers were read from, and if you enable the track_io_timing setting, you also get the I/O timing.

He then talks a little bit about how to read it, and of course the question is what to focus on in EXPLAIN ANALYZE output. To me, the most important advice is to find the nodes where most of the execution time is spent. Basically, you're looking for the bottleneck: which node is taking the longest, so you can ask how to optimize it. I think the next two points are related to this. If you're looking at that slowest node and you notice that the number of rows the planner expected differs significantly from the actual result, by a factor of ten or more, they say that is very often a cause of bad performance. I haven't seen this a lot; I've seen it when there's something going on with ANALYZE. For example, the database or that particular table is not being analyzed effectively, maybe because a problem with vacuum is causing the analyze step to not happen as frequently. Or your sampling, like your statistics target, is set too low based on how many rows you have in the database. But that's a cause you can alleviate by analyzing the data in the table more frequently or adjusting the statistics target for that table.
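For reference, here is a minimal sketch of the invocation recommended above, plus the statistics adjustment just mentioned. The table, column, and statistics value are hypothetical illustrations, not taken from the post.

-- Hypothetical table and column, for illustration only.
SET track_io_timing = on;  -- richer I/O numbers in the output (requires appropriate privileges)
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders WHERE customer_id = 42;

-- If row estimates are far off, sample that column more aggressively and re-analyze:
ALTER TABLE orders ALTER COLUMN customer_id SET STATISTICS 1000;
ANALYZE orders;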
And then, of course, for long-running sequential scans, you could probably add an index to remove some of those filter conditions and make it more efficient. But again, it all starts with asking what the slowest node is and focusing your attention there. Even if you have an enormously long query, just look for that slowest node; if you improve it, you should improve the overall query time. Just keep in mind that you can have multiple loops, so when you're looking at it, make sure you check how many times that particular node was processed. They also mention that there are some GUI tools available if you don't like interpreting the text output. [00:04:38] There's explain.depesz.com, where you can just paste in your plan and it focuses you in on the slowest part, and there's also Dalibo's explain analyze visualizer that you could use. So it's a great post talking about EXPLAIN ANALYZE, and I encourage you to check it out.

[00:04:50] The next piece of content is "PostgreSQL: bulk loading huge amounts of data". This is also from cybertec-postgresql.com. They're talking about best practices when loading a lot of data. Normally you might start with single-row inserts, but that is incredibly slow due to all the overhead of inserting one row at a time. You can make that faster with multi-row inserts: rather than inserting one row with each statement, you do INSERT INTO a table VALUES and have multiple rows inserted per INSERT statement. That is faster than single-row inserts, but of course the fastest is COPY. Here they looked at some sample data and saw an improvement from roughly 81 seconds down to 2.6 seconds just by using COPY as opposed to individual INSERT statements, and a multi-row INSERT statement would land somewhere in between.

The other thing to check when you're loading a lot of data is that you're not checkpointing too frequently. This is something your database will tell you about when you have checkpoint_warning set up, because it will warn you if checkpoints are occurring more frequently than you want. What you can do here, he says, is raise max_wal_size, which will push out those checkpoints, as well as adjust checkpoint_completion_target to something like 0.9. Basically, the checkpoint is what writes from memory to the actual heap on disk.

The other thing you can do when loading data, if you're loading a bunch of data into the database to process aggregations and then insert the results elsewhere, is to use an unlogged table; that is a great use case for one. With an unlogged table, nothing is written to the WAL, so if the system crashes you're going to lose all that data, but they're great for temporary work, and temporary tables are also not logged. So you could load your data into an unlogged or temporary table very quickly, do your processing, insert the data into the ultimate destination, and then drop that table.
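As a rough sketch of that staging pattern, assuming a hypothetical table, file path, and configuration values that are not from the post:

-- Hypothetical staging table; unlogged, so nothing here is written to the WAL.
CREATE UNLOGGED TABLE staging_events (id bigint, payload text);

-- Multi-row insert: fewer statements and round trips than one INSERT per row.
INSERT INTO staging_events VALUES (1, 'a'), (2, 'b'), (3, 'c');

-- COPY is faster still for large data sets (server-side path shown; psql users would use \copy).
COPY staging_events FROM '/tmp/events.csv' WITH (FORMAT csv);

-- Loosening checkpointing during the load (example values only, set in postgresql.conf):
--   max_wal_size = '16GB'
--   checkpoint_completion_target = 0.9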
The other two points to consider when importing data concern what else the database system has to do while you're inserting. First, do you have indexes? Every index needs to be updated, so it's much faster to drop the indexes, insert all of the data, and then recreate the indexes after the fact, if you can do that. The other question is what triggers are running: it's also generally faster to drop triggers, insert all the data into the table, and then reapply those triggers. I know dropping indexes and triggers isn't possible in every use case, but if you can do it, you will definitely see a performance improvement. They also mention optimizing column order for bulk loading: putting the larger fixed-size data types first, followed by variable-length data types at the end, lets you compact the table and allows for faster loading. And lastly, he mentions two tools that can also help you do loading if you're interested: pgloader and pg_bulkload. So if you're interested in loading data quickly, maybe you want to check out this post.

[00:08:02] The next piece of content is "Simulating UPDATE or DELETE with LIMIT in Postgres: CTEs to the rescue". This is from Crunchydata.com, and they're talking about wanting to limit a DELETE or UPDATE operation. It would be great if you could do DELETE FROM some table LIMIT 10,000. And frankly, I'd love to have this, because if I'm going to run a delete and I expect one row, it would be great to just say LIMIT 1 to know that, no matter what, I'm only going to delete one row. But sadly, that's not part of the SQL standard, nor is it part of Postgres. A way you can do it is to use a SELECT to grab the rows you need and then delete based on it, and you can use a CTE for that, which this post describes. You could also just use a subquery, because the SELECT can use the LIMIT clause as well as the ORDER BY clause to get exactly what you need.

[00:08:59] Here's the example they're using: they create a CTE, or WITH clause, define it as "rows", and SELECT something from a table, maybe an ID, [00:09:11] limited to ten, so you're only going to get ten IDs from this table. Then you say DELETE FROM the table WHERE that ID column, or whatever it is, is IN the IDs from the CTE. You could place that SELECT directly in the DELETE as a subquery without a CTE and it would work the same way; so you don't strictly need a CTE, you could just use a subquery, which this has to use anyway. They show how to use this with DELETE, and you can also use it with an ORDER BY clause just by putting it in the SELECT; they show it with UPDATE as well as ORDER BY. So if you ever want to limit the updates or deletes you're processing, maybe you want to check out this post from Crunchy Data.
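A minimal sketch of that pattern, with hypothetical table and column names:

-- Delete at most 10,000 rows, oldest first (names are illustrative).
WITH rows AS (
    SELECT id
    FROM events
    ORDER BY created_at
    LIMIT 10000
)
DELETE FROM events
WHERE id IN (SELECT id FROM rows);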
[00:09:56] The next piece of content is "Memory inspection through pg_buffercache". This is from fluca1978.github.io, and pg_buffercache is an extension that looks at the shared buffers and what's being stored there. What he's done is create a number of functions that help you quickly retrieve useful information from it. He links to a SQL file, and if you run it, it creates a separate schema called memory and then creates these functions, which basically run certain queries. So you can just run a SELECT against the memory.f_memory function and it will return the total number of shared buffers, how much is currently used, and how much is free. It also includes a number of utility functions, such as checking whether pg_buffercache is installed, et cetera. And it lets you show memory usage bucketed into very high, high, mid, low, and very low usage by database, so you can check what is using the buffer cache across the multiple databases in your cluster. You can check it by table, to see how much of a table is in the buffer cache, as well as a cumulative view by table. So this is a great set of utility functions for analyzing the buffer cache, and if you're interested in that, definitely check out this blog post.

[00:11:17] The next piece of content is "New Features in PostgreSQL 14: Bulk Inserts for Foreign Data Wrappers". This is from Percona.com. They're talking about a new feature in 14 that changes how foreign tables work so that inserts don't just happen one row at a time. Because you're dealing with a foreign server that's probably connected over the network, doing bulk loads one insert at a time, as we saw with the first post, is going to be very slow. If you can batch them, and there is a new batch_size option that can be defined for a foreign table, you can send more than one row at a time, which speeds things up. Here they went from about 4,600 seconds for a set of inserts down to about 800 seconds with a batch size of ten, so maybe about six times faster using this new batch method. So if you're interested in learning more about this feature and an example of how it's used, definitely check out this post.
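A quick sketch of how that option can be applied with postgres_fdw; the server and table names are hypothetical, and the batch size is just an example value:

-- batch_size can be set on the foreign server or on an individual foreign table.
ALTER SERVER remote_server OPTIONS (ADD batch_size '100');
ALTER FOREIGN TABLE remote_measurements OPTIONS (ADD batch_size '100');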
[00:12:18] The next piece of content is "Enhancements to the target_session_attrs database connection parameter in PostgreSQL 14". This is from postgresql.fastware.com. They're talking about the enhancement coming in 14 that adds options for target_session_attrs. In versions prior to 14, when a connection is made you could only say whether you wanted a read-write connection or any type of connection, so there wasn't a lot of flexibility. A patch in 14 adds these options, so you can tell your PostgreSQL client to connect only to a database that is read-only, connect only to a primary, connect only to a standby, or prefer standbys and, if none are available, go ahead and connect to any other type. This gives a lot more flexibility, and it helps the PostgreSQL client act as a high-availability point, because you can point it at multiple servers and it will choose one based on what you've specified. So if you want to learn more about this feature and see the example of how it works that they show in this blog post, definitely check it out.

[00:13:36] The next piece of content is "Tracking the refresh performance of materialized views with mv_stats in PostgreSQL". This is from Ongres.com, and they're talking about a new extension they developed called mv_stats; basically it is a statistics extension for tracking materialized view processing. It gives you an indication of when a materialized view was last modified, the time of the last refresh, how many times it's been refreshed, how long the refreshes took, as well as min/max refresh times. So if you use a fair number of materialized views and you want to track statistics about them, you may want to check out this post and this new extension.

[00:14:21] The next piece of content is "Speeding up PostgreSQL recovery with pgBackRest". This is from seiler.us, and he's talking about a problem where he had a replica that had trouble catching up with the primary, and it looked to be because WAL recovery was taking too long. Because he's using pgBackRest, I'm assuming the archive was stored directly on a file storage system, maybe something like S3, and it was trying to pull down one file at a time. pgBackRest has a number of settings to speed that up: it does so by enabling asynchronous archiving (archive-async) and defining how many files to pull down ahead of time so they're on the local system, ready to be replayed against the database. By having more files in place, you're downloading files in advance of when they're needed and storing them in the spool path, and this makes recovery faster. Now, this covers pgBackRest, but the same practice applies to normal WAL replay in Postgres: if you're trying to pull WAL files one at a time from something like S3 or Google Cloud Storage, you're going to want to batch-pull WAL files ahead of time so they're on the replica's disk, ready to be replayed immediately. If you're interested in more details, definitely check out this blog post.

[00:15:41] The next piece of content is "Using Postgres Row-Level Security in Ruby on Rails". This is from pganalyze.com, and it's a good overview of how row-level security works and how you can enable it. They show an example where you enable row-level security for a particular table, create a role, and then create a policy. This one is for account managers on the accounts table, such that managers can only view rows where the supervisor equals the current user. You can also define policies based on a session variable, which is what they did in their other implementation: they set a customer ID and can then insert data for that customer ID; otherwise, if it's not set, they get a row-level security policy violation. So basically, this is a way to build multi-tenant apps using row-level security. Not only do they show how it works at the Postgres level, they also show how to set it up in Ruby on Rails, an application framework for Ruby, to get it working. Now, I've never really used this because of the performance implications, and they do have a whole section on that. There have been a lot of unknowns about it, so I've never felt there's been enough visibility from a performance perspective to trust using it. It definitely works, though, so if you have a high-security application that is not as performance-sensitive, maybe you would want to explore using row-level security, but there are definitely performance hiccups that could potentially happen. If you're interested in learning more, definitely check out this blog post.
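A minimal sketch of the two policy styles described above; the table, role, policy names, and session variable are hypothetical stand-ins for what the post shows:

-- Managers can only see rows they supervise.
ALTER TABLE accounts ENABLE ROW LEVEL SECURITY;
CREATE ROLE account_managers;
CREATE POLICY manager_accounts ON accounts
    FOR SELECT TO account_managers
    USING (supervisor = current_user);

-- Multi-tenant style, keyed off a session variable set by the application:
-- SET app.current_customer_id = '123';
CREATE POLICY tenant_isolation ON accounts
    USING (customer_id = current_setting('app.current_customer_id')::bigint);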
[00:17:25] The next piece of content is "Postgres 14 highlight: Fun with Hashes". This is from paquier.xyz, and he's talking all about hashes in Postgres, going into a bit of the history of how MD5 support happened, as well as SCRAM-SHA-256, and how those different implementations are done, some internally in Postgres and some relying on the presence of OpenSSL. From my understanding, he has been refactoring how this is done to make things easier to maintain and more consistent for the future. So if you're interested in that, definitely check out this blog post.

[00:18:05] And the last piece of content: the PostgreSQL Person of the Week is Georgios Kokolatos. So if you're interested in Georgios and his contributions to Postgres, definitely check out this blog post.

That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to Scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.

Other Episodes

Episode 224

July 18, 2022 00:13:57

Column Performance, BRIN Win, Unique and Null, Parallel Distinct | Scaling Postgres 224

In this episode of Scaling Postgres, we discuss how too many columns in a table can affect performance, at what point BRIN indexes win...

Episode 181

September 06, 2021 00:21:58

Connection Scaling, Cached Sequences, Message Queuing, Harmful Subtransactions | Scaling Postgres 181

In this episode of Scaling Postgres, we discuss connection scaling, how to cache your sequences, a message queuing implementation and the harms of subtransactions....

Episode 319

June 09, 2024 00:12:05

When Postgres Development Stopped! | Scaling Postgres 319

In this episode of Scaling Postgres, we discuss a time when Postgres development stopped, two new extensions pg_lakehouse & pg_compare and the upcoming event...