Int to BigInt, Resistance is Futile, Optimizer Statistics, Advanced Query Optimization | Scaling Postgres 153

Episode 153 February 21, 2021 00:12:24
Int to BigInt, Resistance is Futile, Optimizer Statistics, Advanced Query Optimization | Scaling Postgres 153
Scaling Postgres
Int to BigInt, Resistance is Futile, Optimizer Statistics, Advanced Query Optimization | Scaling Postgres 153

Feb 21 2021 | 00:12:24

/

Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, we discuss how to convert an integer primary key to a bigint, how container resistance is futile, how the optimizer uses statistics and how to set manual statistics.

To get the show notes as well as get notified of new episodes, visit: 

https://www.scalingpostgres.com/episodes/153-int-to-bigint-resistance-is-futile-optimizer-statistics-advanced-query-optimization/

View Full Transcript

Episode Transcript

[00:00:00] In this episode of Scaling Postgres, we talk about int to big int primary keys. Resistance is Futile optimizer statistics and advanced query optimization. I'm creston. Jameson. And this is scaling postgres episode 153. You all all right? I hope you, your friends, family and coworkers continue to do well. Our first piece of content is actually a YouTube video. And it's called Changing your Huge tables. Data types in production. This is from Fostim 2021 in the PostgreSQL dev room. And this is actually on the channel of Jimmy Angelacos. And he's talking about changing data types on huge tables in a production database. So do it without impacting production. The example that he used for a lot of the scripts that he shows in this presentation are where you have an integer primary key, and you need to make it a big int because you're about to run out of integers. So he goes through the whole process and shows you these scripts. Basically, you first add a new big int column and then use triggers to start copying the data from that integer to the big int. And then you set up a job or process to backfill the data in that big int column to make sure that it matches what's in the int column. And once everything is in sync, then as a part of one transaction, you do a series of DDL to rename the column, drop the old column, set up the new primary key, et cetera. But this technique can, of course, be used for other data types. He just is using the example of where you have an integer that's about to run out, and it's the primary key of a table. So if you want to learn more about this technique that he's describing, definitely check out this YouTube video. [00:01:48] The next piece of content is deep PostgreSQL thoughts. Resistance to containers is futile. This is from Crunchydata.com in the first part of the post he's talking about. There's some people that think containers basically aren't ready for primetime, particularly for deploying your databases. And this post talks about how trying to resist this trend is probably futile, and we're heading down that path anyway. And he talks about what a container is and even running PostgreSQL. And some operating systems, they already place certain things in C groups, which is kind of what containers use to do some of their containerization. So you're kind of already doing that. It's just making it more formal when you use a container. Now, he does list some considerations you need to address when you're dealing with containers. And running postgres. Number one is the out of memory killer. We covered that post last week that he [email protected]. The next is storage, because a lot of containers are designed to be ephemeral. So how do you handle persistent storage? How do you handle restarts and emotion? And this is basically talking about the orchestration layers and work is being done in that area. And they're talking about the Crunchy data operator that they have developed and then custom deployments. That means someone has set up an automated container orchestration, but then they want to make tweaks to it and sometimes that causes issues those tweaks. Now, personally, I haven't deployed PostgreSQL in a container. I don't see a reason to for my use cases or for a lot of customers I work with, I don't see a reason for them to use Containers. It seems like the push towards containers is because an organization is already using them, say for their application deployment. So they just kind of want to normalize everything. But personally, I don't see a great pressing need using containers with Postgres. But like anything, it depends on your use case. So if you're considering using containers with PostgreSQL, you may want to check out this post from Crunchydata.com. [00:03:45] The next piece of content is PostgreSQL Analyze and optimize their statistics. And this is from CyberTech Postgresql.com and he's talking about statistics and how they are used in regards to completing queries. So it goes through the general process of processing SQL in that it goes through a parser. It goes to a traffic cop that sends utility commands this way, but queries down to the rewrite system which then prepares the query for the optimizer. The optimizer consults statistics in order to determine the best path for the executor. Now, he has an example here where he creates a table using Generate series and he uses an explain plan to see what the optimizer is planning to do. And then he looks at the PG Stats table, which is actually where these statistics are stored. He shows you all the different columns, then he shows an example of what some data looks like and it looks at these statistics per column. So you're looking at the ID column here and it gives a histogram of different bounds as well as other information that it uses to determine the best way to approach querying this column. Then you see a name column here and it's storing something a little bit different. It doesn't have a histogram a different bounds because pretty much everything is hans in this column. So it's presenting it the most column value and their frequency from the statistics that it analyzed. So if you want to learn more about statistics and how PostgreSQL uses them, you can definitely check out this post. [00:05:15] The next piece of content. PostgreSQL creates statistics, advanced query optimization. This is also from CyberTech Postgresql.com and continuing the discussion of statistics, this is where you can actually use the command Create Statistics to create dependencies between columns. So normally these statistics are per column, but the Create Statistics allows you to define relationships between columns. Now, some of those relationships could be between the country and the language that is spoken. Those are highly correlated. Or there could be relationship between the zip codes within a state or the states within a country or provinces within a country, those are correlated. But in his example, he used a Generate series and he defined a correlation between the X and Y column in that the Y column is always just 50,000 greater than the x column. And then he did a query looking at this and when doing an explain, it expected to get a million rows. But when it actually did an explain analyze, it actually only pulled out 10,000 rows. So there is a correlation between X and Y that when you're assessing these statistics based on per column and doing calculations on the number of rows that you expect to be there, you're going to get differences. But if you then use Create Statistics, give it a name and he's using the indistinct, there's a number of different statistics you can create. He's using indistinct between these two columns. Then when you do an explain, you get the properly estimated 10,000 because it did identify a correlation between these columns when using Create Statistics. So if you want to learn more how to do that, you can check out this post. [00:06:57] The next piece of content is querying JSON data in PostgreSQL. This is from Aaronboss Dev and he's talking about using JSON data in postgres. Now this is a relatively simple post. He has a basic set of JSON here that he's working with and he goes through how you can pull it out as part of a select using the arrow syntax. And this returns JSON and then he looks at the arrow notation with a double arrow which basically returns text instead of JSON. And with that you can do where queries to determine particular values in the JSON and also how you can nest responses to navigate through the JSON to return the exact data you want to in a select clause. He then covers the containment operator so you can use it in a where clause to see does this record contain a particular type of JSON as well as the question mark operator, which helps you to determine if there are any keys that match within a particular set of JSON. So it's a pretty basic post, but if you're wanting to get started using JSON with PostgreSQL, definitely check out this post. The next piece of content Query Optimization in postgres with PG Stat statements. This is from Crunchydata.com and this is a pretty basic post again, and it's talking about PG Stat statements. It talks about how to get it set up. Basically it's part of the contrib module. So as long as you have that installed, you need to add PG Stat statements to your shared preload libraries and then create the extension for each database you want to use it in. He shows what the view looks like, but once you get it up and running, you can start tracking the statements that are being run against postgres. And he gives an example of a report so you can look at your queries, taking the most time to run or queries that are being run very frequently. He also covers you can reset the statistics at a particular point if you want to look at fresh data after you've created an index or done some schema changes. And he also covers how you can configure it by just seeing how many statements you're going to be recording in PG Stat statements, as well as the type of statements you want to track. And of course, once you have identified a statement you want to improve, you would then use Explain, explain, analyze, et cetera to figure out how to optimize that query. Now, he also mentioned you can enable I O statistics so you get something for the block read time and block write time. But this is not enabled by default, so you need to adjust the track I O timing parameter in the postgresql.com file, but you need to be aware that this could be a performance burden on your system, so you want to make sure that's not going to negatively impact your system. And he did talk about a utility that I actually haven't heard of before is PG Test timing. So it can actually look at your timers and help you determine if you're potentially going to be impacted by enabling this parameter in your system. So if you want to learn more about PG Stat statements and how it works to analyze your queries, definitely check out this post. [00:09:52] The next piece of content is why partition OpenStreetMap data. This is from Rustproof Labs, and he's talking about a decision process he's going through on whether he should partition data that he's received from OpenStreetMap into his postgres database. So he's using PostGIS to analyze this. And right now, he's juggling different schemas as he loads data in to do analysis. And he's wondering if converting this to using partitions would make things easier. Now, this is the first post that kind of talks about how he's working now, basically loading data and identifying it in a particular schema and then potentially having to change those schemas. And then sometimes he wants to combine data, which requires unions, which kind of brings source some different issues. So he's really wanted to look at partitioning to see if this can improve his workflow. Now the second post is partition. OpenStreetMap data in PostGIS. So this is a second post going into more detail about how he's planning to do it and how the implementation would follow. Now, he hasn't determined that he's 100% going to do this, but these are posts that are walking through the process about how he's considering doing it and if it would make sense for him. Now, a third post is coming. It's not ready yet. Perhaps next week's episode. We'll have the third post to determine whether he's decided to go forward with this or not. But these two posts are a great review of when you decide to do partitioning. It is a burden to set up and maintain, so you want to make sure that the benefits outweigh the cost. So definitely check out these posts if you're interested. [00:11:28] The next piece of content ArcGIS feature service to PostGIS the QGIS Way so this is a post about GIS and working with different services to get it integrated into PostGIS. So if you're interested in that, you can check out this post from Crunchydata.com. And the last piece of content is the PostgreSQL person of the Week is Valeria Kaplan. So if you're interested in learning more about Valeria and her contributions to postgres, definitely check out this blog post that does it. For this episode of Scaling Postgres, you can get links to all the content mentioned in the show. Notes be sure to head over to Scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube itunes. Thanks.

Other Episodes

Episode 297

January 07, 2024 00:20:38
Episode Cover

Postgres In 2024 | Scaling Postgres 297

In this episode of Scaling Postgres, we discuss hopes for Postgres in 2024, whether you need foreign keys, incremental backups, and five ways of...

Listen

Episode 177

August 09, 2021 00:16:28
Episode Cover

Planner Workarounds, Grouping Sets, Lateral Join, Not-So-Easy | Scaling Postgres 177

In this episode of Scaling Postgres, we discuss ways to work around the planner, working with grouping sets, using a lateral join and not-so-easy...

Listen

Episode 307

March 17, 2024 00:14:22
Episode Cover

Collation Conundrum | Scaling Postgres 307

In this episode of Scaling Postgres, we discuss whether we should be using bytewise or linguistic ordering by default, how to transform data in...

Listen