Explain Analyze, Sequence Counts, Best Books, Partition Migration | Scaling Postgres 112

Episode 112 May 04, 2020 00:16:44
Scaling Postgres

Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, we discuss how to use explain analyze, the problem with counting sequences, the best books to learn Postgres and how to migrate to declarative partitioning.

To get the show notes as well as get notified of new episodes, visit: 

https://www.scalingpostgres.com/episodes/112-explain-analyze-sequence-counts-best-books-partition-migration/


Episode Transcript

[00:00:00] In this episode of Scaling Postgres, we talk about EXPLAIN ANALYZE, sequence counts, best books, and partition migration. I'm Creston Jamison, and this is Scaling Postgres episode 112.

[00:00:20] Alright, I hope you, your family and coworkers are doing well given the situation we're still encountering. Our first piece of content is "Optimizing SQL: Step 1, EXPLAIN Costs and Plans in PostgreSQL, Part 2". This is from highgo.ca, and as you can tell, this is part two of the post about EXPLAIN. We covered the first part a number of weeks ago; it explained how EXPLAIN works and even showed how you could build a query based upon an EXPLAIN plan. This one goes into ANALYZE and talks more in depth about how to interpret some of the output. In terms of their setup, they just have a big table and a small table, although the big table only has 100 unique rows and the small table is a bit smaller. The first thing they talk about is costs, and the fact that PostgreSQL uses a cost-based optimizer. It takes into account some of the configuration settings, like the sequential page cost (seq_page_cost), along with the statistics that exist for a given table, as well as indexes and everything else, and it comes up with a cost estimate for each operation. So, for example, it looks at the different operations here: how long would it take to do a hash, a sequential scan, a sort? And it comes up with a cost estimate for each one. You'll see that there are two different values listed. The first is the estimated startup cost: how long it takes to get to the point of doing this step, meaning whether anything needs to happen beforehand. In this plan, the first things it can start doing are the sequential scans, so their startup cost is essentially zero. The second is the estimated total cost for the action, along with the number of rows it anticipates and the amount of data (the width) it has to work with. So basically you look to see where the greatest cost is coming from when you're doing an EXPLAIN or an EXPLAIN ANALYZE to identify what should be the slowest part. In the next section they talk about EXPLAIN ANALYZE, which actually runs the query. When you add ANALYZE, the query actually executes, whereas plain EXPLAIN doesn't run the query and only gives you an estimate. I really like using ANALYZE because it runs the query and gives you actual results, so here you see the actual values, and it gives you the time in milliseconds it took to run a particular action. So this is the actual timing for one stage, and this is the actual timing for another stage, and it again tells you how many rows and how many loops it has to go through, because there are certain types of queries where it has to do more than one loop. For example, here it has to do this materialize step 100 times based upon the query, the data, and how it's laid out. With this information, you can then look at your EXPLAIN ANALYZE plan, see what is taking the longest time, and that is the area you would need to optimize. So if you want to learn more about EXPLAIN ANALYZE, definitely check out this second post from highgo.ca.
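As a quick illustration of the difference described above, here is a minimal sketch; the table, the data, and the plan output shown in the comments are purely illustrative, not taken from the post.

-- Hypothetical table with 100 distinct category values
CREATE TABLE items (id bigserial PRIMARY KEY, category int, name text);
INSERT INTO items (category, name)
SELECT i % 100, 'item ' || i FROM generate_series(1, 100000) AS i;
ANALYZE items;

-- Plain EXPLAIN: estimates only, the query is not executed
EXPLAIN SELECT * FROM items WHERE category = 42;
-- Seq Scan on items  (cost=0.00..1791.00 rows=1000 width=21)
--   Filter: (category = 42)

-- EXPLAIN ANALYZE: the query runs, so you also get actual times, rows, and loops
EXPLAIN ANALYZE SELECT * FROM items WHERE category = 42;
-- Seq Scan on items  (cost=0.00..1791.00 rows=1000 width=21) (actual time=0.030..9.500 rows=1000 loops=1)
--   Filter: (category = 42)
--   Rows Removed by Filter: 99000
-- Planning Time and Execution Time are reported at the end of the output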
[00:03:31] Now, a great complement to the previous post is "How to use EXPLAIN ANALYZE for planning and optimizing query performance in PostgreSQL", and this is from the enterprisedb.com blog. They take it from the perspective of a customer who contacted them saying that a particular query wasn't using an index, it was doing a sequential scan, and wondered why that was. So they used EXPLAIN to figure out why: there was an existing index, and they basically just needed to rework the query in order to start using an index scan again. Then it goes into the fact that, of course, the planner uses a cost-based approach. What's interesting about this post is that it actually tells you how the cost is derived when just using EXPLAIN, and you can see the formula here: the number of blocks times the sequential page cost (seq_page_cost, a PostgreSQL configuration setting), plus the number of records times the CPU tuple cost (cpu_tuple_cost, another configuration variable), plus the number of records times the CPU cost of applying the filter (cpu_operator_cost, another configuration variable). Using these, it calculates the exact cost shown for this sequential scan. So for those of you interested in the details, it shows exactly how the cost-based optimizer comes up with the numbers it comes up with. Then they talk a bit about statistics and how they are involved in the estimates as well. They also cover EXPLAIN ANALYZE and how it actually runs the query, so you get actual timing for each step of the process that a query runs through. Finally, they have a summary table of all the different scan types and joins that are available. So again, if you want to learn more about EXPLAIN ANALYZE and how it works in PostgreSQL, these are definitely two blog posts to check out.

[00:05:25] The next piece of content is "Speeding up count(*): Why not use max(id) - min(id)?" This is from cybertec-postgresql.com. They had done a previous post on speeding up count(*), and in the comments someone asked, why don't you just use max(id) minus min(id)? The problem is that it will return the wrong answer. Now, I mentioned this on last week's episode of Scaling Postgres when talking about sequences and how they can have gaps. If you try to insert a row and hit an error, or insert a row and then roll back, the next sequence value you get will be higher than what you expect: there will be a gap. Sequences only guarantee that each value is higher than the previous one; they do not guarantee the absence of gaps, so you're going to have gaps, and you can't use them to get an exact count. They go through an example where they try to insert a row that would have received an ID of four but roll it back, and you can see the gap that results from the rollback. Now, depending on how well you know your table, this could be used as a very rough estimate, and it depends on how many errors and rollbacks you tend to have on that table. So I wouldn't necessarily advocate it, but it could give you a rough ballpark sense of how many rows are in the table if you want an estimate without counting every row. Then again, it may be easier to just use the statistics tables to get a rough estimate of the count. So if you want to learn more about this, definitely check out this blog post from cybertec-postgresql.com.
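As a quick illustration of the gap problem and the statistics-based estimate mentioned above, here is a minimal sketch with a hypothetical table; the specific values are for illustration only.

-- Hypothetical table using a sequence-backed id
CREATE TABLE t (id bigserial PRIMARY KEY, payload text);
INSERT INTO t (payload) VALUES ('a'), ('b'), ('c');   -- ids 1, 2, 3

BEGIN;
INSERT INTO t (payload) VALUES ('d');                 -- would have received id 4
ROLLBACK;                                             -- id 4 is consumed by the sequence and never reused

INSERT INTO t (payload) VALUES ('e');                 -- receives id 5

SELECT max(id) - min(id) + 1 AS guessed_count, count(*) AS real_count FROM t;
-- guessed_count = 5, real_count = 4: the gap makes the max/min approach wrong

-- A rough estimate from the statistics instead (kept current by autovacuum/ANALYZE)
SELECT reltuples::bigint AS estimated_rows FROM pg_class WHERE relname = 't';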
[00:07:04] The next piece of content is actually a YouTube video called "My Favorite PostgreSQL Books". I believe it was a webinar on the EnterpriseDB YouTube channel, and it goes over a set of books, each for a different type of person. If you're more of a developer, it suggests one book; if you're more of a beginner, it suggests another; or there's one if you're looking for a cookbook of sorts. I think it's about four or five books they mention, and they give you a summary and their opinions on each one. So if you're looking for more educational material about PostgreSQL, maybe check out this webinar.

[00:07:42] The next post is "How to migrate from inheritance-based partitioning to declarative partitioning in PostgreSQL". This is from 2ndquadrant.com. PostgreSQL 10 introduced native, or declarative, partitioning; prior versions used what is called inheritance-based partitioning. This post walks you through the process of migrating from one to the other, and I'll just go over the overview here. First, they create a new table with the same columns as the parent table of the existing partitioning scheme. Then, for each child table in the inheritance scheme, they detach the child table from the old parent using ALTER TABLE with NO INHERIT and attach it to the new parent using ATTACH PARTITION. Next, they create any needed indexes on the new parent table, and they mention that if matching indexes already exist on the child tables, they won't be rebuilt, which is definitely convenient, so that step should happen pretty quickly. You may also need to recreate views, functions or other objects. Then you drop the original table used for inheritance partitioning and, lastly, rename the new parent table to replace the old one. They also mention that when doing this you would definitely want to back up your tables and do everything within a transaction: start with BEGIN, do all the steps, and conclude with a COMMIT, of course. So if you need to do this, definitely check out this post for a technique to migrate from inheritance-based to declarative partitioning.

[00:09:18] The next post is "Using PostgreSQL for JSON Storage". Now, this is a relatively short post, but it talks about using JSON in PostgreSQL. It gives an example of some JSON you might have and stores it in a column they happen to call json_content, ideally with a GIN index to make queries faster. They show how you can extract certain data from it, such as the syntax to get the user's last name, and how to use the WHERE clause to pull out a particular record based on its content. So again, it's very brief, but if you want a quick intro to using JSON, check out this blog post. This is from crunchydata.com.
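As a rough illustration of that JSON pattern, here is a minimal sketch; the table structure, the sample document, and the queries are assumptions for illustration, not copied from the post.

-- Hypothetical table storing documents in a jsonb column named json_content
CREATE TABLE docs (id bigserial PRIMARY KEY, json_content jsonb);
CREATE INDEX docs_json_content_idx ON docs USING gin (json_content);

INSERT INTO docs (json_content)
VALUES ('{"user": {"first_name": "Ada", "last_name": "Lovelace"}, "active": true}');

-- Extract the user's last name as text
SELECT json_content -> 'user' ->> 'last_name' AS last_name FROM docs;

-- Pull out records matching a piece of content (containment can use the GIN index)
SELECT * FROM docs WHERE json_content @> '{"user": {"last_name": "Lovelace"}}';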
[00:10:05] The next post, again very brief, is how to set the application name for the PostgreSQL command line utility, and this actually applies to other applications as well. In this example, if you try to set the application name using a psql variable, it doesn't work; it actually needs to be part of the connection string. So here you see application_name= with a specified name, and then SHOW application_name will display it. It also says you can use the PGAPPNAME environment variable before making the connection. So if you have a particular application that you want identified by an application name within PostgreSQL, which is convenient for logging, you can use one of these two methods to do that. And this is from LFAs IO.

The next piece of content is "A Tale of Password Authentication Methods in PostgreSQL". This is from 2ndquadrant.com, and it walks through the evolution of password authentication in Postgres. At first it started with just a password in clear text. That wasn't great, so the next step was to encrypt it at least a little bit using the Unix crypt function, but that still had disadvantages, including clear-text issues. So they moved to MD5, which brought some benefits but also some negatives. The method they now suggest people use is SCRAM, and they talk about all its advantages and how it is superior to the previous methods. So if you want to learn a little more about the evolution of passwords in Postgres and the benefits of using SCRAM, definitely check out this blog post.

[00:11:42] The next piece of content is "Oracle to PostgreSQL: ANSI outer join syntax in PostgreSQL". This is from 2ndquadrant.com as well. There is some specific syntax that Oracle sometimes uses for doing joins, and this post explains how you would do the same thing in PostgreSQL. For example, the Oracle version uses the (+) sign, whereas in PostgreSQL you would just use the ANSI-standard RIGHT JOIN or RIGHT OUTER JOIN syntax. It also covers how to do a full join, again using the FULL JOIN syntax in PostgreSQL, and even how to do a cross join. So if you need to migrate from Oracle to PostgreSQL and want some migration techniques, check out this blog post from 2ndquadrant.com.

[00:12:28] The next posts are "My Favorite PostgreSQL Extensions, Part One" and "My Favorite PostgreSQL Extensions, Part Two". The first one they list is the Postgres foreign data wrapper, postgres_fdw, which allows you to connect to external Postgres servers and query data from them. The second one they mention is pg_partman: if you want to partition tables, pg_partman makes the management of those partitions much easier. In the next post, the first one mentioned is pgAudit, so if you want to set up auditing to track inserts, updates, deletes, etc. in your PostgreSQL instance, you can check out this extension. The next one they mention is pg_repack, which helps you reclaim space from tables while the database is online. Normally the only way to reclaim space is with a VACUUM FULL, but that locks the whole table; pg_repack uses a technique that lets you rebuild tables while they remain online and reclaim the bloat in them. The last one they recommend is HypoPG, which lets you set up hypothetical indexes to see how advantageous they would be if you added them, by running EXPLAIN with them in place, so it helps you get a sense of whether particular indexes would help particular queries. If you want to learn more details about these extensions and how to use them, you can check out either of these two blog posts from severalnines.com.
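As a quick illustration of the hypothetical-index idea mentioned last, here is a minimal sketch using HypoPG; the table and column names are assumptions for illustration.

-- Hypothetical table and column names
CREATE EXTENSION IF NOT EXISTS hypopg;

-- Register a hypothetical index; nothing is written to disk
SELECT * FROM hypopg_create_index('CREATE INDEX ON orders (customer_id)');

-- Plain EXPLAIN (without ANALYZE) will consider the hypothetical index in its plan
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;

-- Remove all hypothetical indexes when you are done experimenting
SELECT hypopg_reset();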
[00:14:05] The next piece of content is "How the Citus distributed query executor adapts to your Postgres workload". This is from citusdata.com. They're talking about Citus, which is an extension for Postgres that helps you shard across multiple Postgres servers, and specifically about the open source version, so this executor is included in it. Now, I'm not that familiar with Citus; I've never used it. But I like this blog post because it gives some insight into where Postgres sharding may be heading, discussing some of the issues they were dealing with, some of the use cases, and how they adapted their solution to handle those use cases better. So if you want more insight into sharding and where the future may lead, this is a very interesting blog post to check out. The next piece of content is that the PostgreSQL Person of the Week is Melanie Plageman. So if you're interested in learning more about Melanie and her contributions to Postgres, definitely check out this blog post.

[00:15:07] The next piece of content is "A Deep Dive into PostGIS Nearest Neighbor Search". This is from crunchydata.com, and it is a deep dive into nearest neighbor search using PostGIS. So if you're interested in learning more about that, check out this blog post. The next post is "Intersecting Tracks of Individuals - MobilityDB". This is from cybertec-postgresql.com, and it follows up on a post about the contact tracing that's been discussed recently. This post talks about setting up data structures within PostgreSQL so that they are MobilityDB-enabled; MobilityDB is an extension for PostgreSQL, built on top of PostGIS, that makes some of this easier. They mention setting up trips based on the initial mobile points and then discovering intersections with an infected individual to retrieve possible contacts. So if you want to go more in depth on this, definitely check out this blog post.

[00:16:07] The last piece of content is "How to use the tree machine learning model with 2UDA - PostgreSQL and Orange, Part 4". This is part four of the machine learning posts from 2ndquadrant.com.

[00:16:21] That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to scalingpostgres.com where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.

Other Episodes

Episode 48

January 27, 2019 00:14:41

Microsoft Acquires Citus Data, Split WAL, Maintenance Work Mem | Scaling Postgres 48

In this episode of Scaling Postgres, we review articles covering Microsoft acquiring Citus Data, split WAL files, maintenance_work_mem and logical replication. To get the...


Episode 44

December 31, 2018 00:08:15

Postgres Rise, Row Count Estimates, Nulls, Linux Config | Scaling Postgres 44

In this episode of Scaling Postgres, we review articles covering the rise of Postgres, row count estimates, nulls and linux configuration. To get the...


Episode 202

February 14, 2022 00:12:49

New Postgres Releases, A Hairy Incident, Slow Down To Go Faster, Loadable Module Archiving | Scaling Postgres 202

In this episode of Scaling Postgres, we discuss the new releases of Postgres, a hairy upgrade incident, why slowing down can make you go...
