Forced Sequential Scans, Table Renaming, Terminology, PGSQL Phriday | Scaling Postgres 240

Episode 240 | November 06, 2022 | 00:19:17
Scaling Postgres

Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, we discuss how sequential scans can be forced, the best way to rename a table without downtime, different Postgres terminology and the PGSQL Phriday blogging event.

To get the show notes as well as get notified of new episodes, visit: 

https://www.scalingpostgres.com/episodes/240-forced-sequential-scans-table-renaming-terminology-pgsql-phriday/


Episode Transcript

[00:00:00] Hello. In this episode of Scaling Postgres, we talk about forced sequential scans, table renaming, terminology and PGSQL Phriday. I'm Creston Jamison, and this is Scaling Postgres episode 240.

[00:00:23] All right, I hope you, your friends, family and coworkers continue to do well. Our first piece of content is "Forcing Sequential Scans on PostgreSQL Using Large Integers". This is from code.jeremyevans.net, and he develops a tool called Sequel that provides a DSL, or domain-specific language, for sending SQL to a database from Ruby. He discovered an issue that has implications for a potential denial-of-service attack. Basically, when an integer value in an SQL query is outside the range of an integer or a bigint, Postgres treats that value as a numeric. That might be a little confusing, but imagine you have a WHERE clause containing a number larger than a bigint can hold. What does Postgres do? It converts the literal to a numeric, and then, to do the comparison, it converts the integer or bigint column to a numeric as well. The problem is that the query will no longer use an index on that column, because the column was indexed as an int or a bigint, and the index can't be used with a different type. So essentially this causes a sequential scan. He has an example here where he does a generate_series of about a million records, adds an index on the single integer column, queries it with a bigint value and gets an index-only scan, which makes sense. But when he uses a number larger than a bigint, both sides are converted to numeric and the query does a sequential scan. So if you have a billion-row table and someone is able to send numbers larger than an integer at it, they could trigger a bunch of sequential scans, either intentionally or unintentionally. But no matter what, as he says here, this is a potential denial-of-service vector. Now, he goes through some different ways to avoid this. One is to use bound variables. Another is raising an exception instead of running the query. A third way is explicit casting, but there seem to be a lot of potential issues with doing that. A fourth way he explored is quoting, but even that doesn't work all the time. So it looks like bound variables are the ideal way to go, or manually checking the value and raising an exception. The thing to note is that this isn't really considered a Postgres bug; it's basically how it was designed. He also goes into more detail on whether you're potentially impacted, because some other languages could be affected as well, for example by switching a number to double precision or using floats. For the most part, he doesn't see the other databases being as vulnerable as Postgres, and he has some ideas on how that could potentially change, and then describes how he ultimately changed his library. Now, of course, I was interested in this from a Ruby on Rails perspective, because that's the application framework I use. By default, Rails uses bound variables, essentially prepared statements, for its ORM, Active Record. So if it sees a number larger than a bigint, it never issues the query; it just returns an exception saying that the record can't be found. So there's no risk of this happening.
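To make the failure mode concrete, here is a minimal sketch along the lines of the test described in the post. The table and column names are just placeholders, and the plans you see may vary with your settings, but the type conversion behavior is what the article describes:

    -- Throwaway test table with an indexed integer column.
    CREATE TABLE big_table AS
    SELECT i AS id FROM generate_series(1, 1000000) AS i;
    CREATE INDEX ON big_table (id);
    VACUUM ANALYZE big_table;

    -- A literal that still fits in bigint: the integer index can be used,
    -- so this should plan as an index (only) scan.
    EXPLAIN SELECT id FROM big_table WHERE id = 9223372036854775807;

    -- One larger than the bigint maximum: the literal is parsed as numeric,
    -- the column is cast to numeric for the comparison, the index no longer
    -- applies, and the plan falls back to a sequential scan.
    EXPLAIN SELECT id FROM big_table WHERE id = 9223372036854775808;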
But the thing to keep in mind is that a lot of people who use PgBouncer with their Ruby on Rails applications, or with other application frameworks, have to turn off prepared statements, and then you essentially lose that protection. So I actually did a test in Rails where I turned off prepared statements in my database configuration, and it did issue the queries with the larger-than-bigint values. So that setup would be susceptible to this type of potential denial-of-service issue, and it's definitely something to watch out for. Again, it looks like the ideal solution is to use bound variables, but if you can't do that for some reason, maybe because you're using PgBouncer or your ORM doesn't support them, then you should be manually checking user input and raising an exception if it's outside acceptable bounds. So definitely check out this blog post if you want to learn more about that. This also happens to be the subject of the next episode of Five Minutes of Postgres, episode 42, "A surprising case of very large integers causing a sequential scan". This is from pganalyze.com, and it covers this topic in detail as well, so if you want to learn even more, I definitely encourage you to check out Lukas's post too.

[00:04:54] Next piece of content: "Postgres: Safely renaming a table with no downtime using updatable views". This is from brandur.org. He's talking about wanting to rename a table, something people frequently avoid doing because of all the potential moving parts. But with Postgres there's a relatively easy way to do it, if you can get by with it. Basically, within one transaction, you rename the table to its new name and then create a view with the old table name that points to the new table. You can still use the old table name and still do your inserts, updates and deletes; it should all work through that view. Then later on, once things are working, you can start using the new table name in all of your code and drop the view. This is a great way to easily rename a table, so if you need to do that, definitely check out this technique and this post.
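As a rough sketch of the technique he describes, with hypothetical table names standing in for your own, the rename and the compatibility view go into a single transaction:

    -- Hypothetical names; the old name lives on as an updatable view.
    BEGIN;
    ALTER TABLE orders RENAME TO customer_orders;
    CREATE VIEW orders AS SELECT * FROM customer_orders;
    COMMIT;

    -- Simple single-table views like this are automatically updatable, so
    -- existing code can keep running INSERT, UPDATE and DELETE statements
    -- against the old name. Once everything references customer_orders
    -- directly, drop the compatibility view:
    -- DROP VIEW orders;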
[00:05:49] Next piece of content: "Postgres Insider Terminology". This is from crunchydata.com, and it covers different terminology used within the Postgres community. When you're talking about databases, you frequently hear the word tuple. What does that mean? It basically means a row or a record in a table. You may also hear record or row, which likewise mean a row in a table, but in Postgres you typically see record used for what's returned from functions, as in "this returns a record", and you frequently see row when constructing a value, for example with the ROW() constructor in certain statements. They also talk about arrays and about relations. Typically a relation is considered a table, but it can also mean a view, and the results of queries are relations as well; basically, a set of tuples can be considered a relation. Then the post gets a little more into the Postgres code, where there are references to a target list, which is generally the list of columns in a SELECT query, and restrictions, which are generally what the WHERE clause specifies.

[00:06:56] It also talks about schemas, which are basically namespaces, and to see what your schemas are you can select everything from pg_namespace. Then there's a discussion of the term pages and what that means, as well as TOAST, which, if you're not familiar, stands for The Oversized-Attribute Storage Technique. So this is an interesting post covering different Postgres terminology; check it out if you're interested.
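For a quick taste of two of those terms, here is a tiny sketch you can run against any stock Postgres database; nothing in it is specific to the article:

    -- ROW() builds an anonymous record value on the fly.
    SELECT ROW(1, 'abc', now()) AS example_record;

    -- Schemas are namespaces, tracked in the pg_namespace system catalog.
    SELECT * FROM pg_namespace;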
[00:07:19] Next piece of content: "PGSQL Phriday #2: PostgreSQL Backup and Restore". This is from andreas.scherbaum.la. He's kicking off the next round of PGSQL Phriday, a community blogging event, and this one is on backups and restores. So that's the kickoff post, and the next set of posts I'll cover address that topic. The first one, "Backups for Postgres - PGSQL Phriday #2", is from rustprooflabs.com. In terms of his backup and restore philosophy, his first line of defense is actually his streaming replication replica: basically, if anything happens to the primary database, you just promote the replica. And for nearly all of my clients, that is the first line of defense as well. You promote the replica and hopefully at most you lose a few seconds of data that didn't get replicated for some reason. But in terms of actually doing backups, they like using pgBackRest, and you'll see this as a trend in all of these posts: a lot of them promote pgBackRest for managing their backup and restore process. The next post, from mydbanotebook.org, is "How to do proper backups". It lists some problems with the default tool that comes with Postgres, pg_basebackup. Basically, it doesn't do what she feels a full-featured backup tool should do, and she thinks it should be named pg_clone_cluster instead. She does say Barman exists, but her preference is pgBackRest, like the previous poster mentioned. The next post, "Validating Backups: PGSQL Phriday #2", is from scarydba.com, and he basically says backups don't matter, it's the restores that matter. I can agree with that: no matter what backup method you use, what you really want is to be able to restore. The next post, "Adventures in PostgreSQL Backups", is from softwareandbooz.com, and he also advocates that restores are more important than the backups themselves. Definitely agreed, but he also mentions extensions. Whenever you want to bring your server back up, you need to bring it back to a consistent state, so you want to make sure all your extensions and all the requisite packages are in place to restore that database to its fully functioning level. Now, some of the other posts mentioned Ansible; that's the way I choose to be able to rebuild a full database server from scratch, if that ever becomes necessary, and it's also the technique I use to build up replicas. One thing that wasn't mentioned in these posts is automated restores. I actually have a few clients that do test restores on a regular basis, or have entirely automated it with a cron job on a dedicated server that takes the backup and automatically restores the database to a particular point in time. So those are other techniques you can use as well.

[00:10:05] Next piece of content: "pg_basebackup: could not set compression worker count: unsupported parameter". This is from rhaas.blogspot.com. He's talking about an issue where, when you're running pg_basebackup with Postgres 15, you could run into a "could not initiate base backup" error saying it could not set the compression worker count to 4 because of an unsupported parameter. Apparently this happens if you've set the compression, either on the server or the client, to use Zstandard. So basically you need to ensure that the Zstandard library, libzstd, is present for these features to work. This is a quick post just to let everyone know that, and if you still have problems, make sure the version you're installing is at least 1.5.0. So definitely check out this blog post if you want to learn more about that.

[00:10:56] Next piece of content: "B-tree vs. BRIN: two options for indexing PostgreSQL data warehouses". This is from cybertec-postgresql.com, and it compares B-tree and BRIN indexes. Now, while reading this I was actually confused by one of their results, and when I tried replicating it I could not. Basically, they created a table with a sorted integer column and a random integer column and populated it with 5 billion rows, so quite a lot of data. In my test I only did 100 million, because I didn't want to wait for a long table generation or index build, but I don't think that should change the outcome. They then created B-tree indexes on the id_sorted column and on the id_random column. Selecting a count from the table where the sorted column is within a range of about 10 million returned in 358 milliseconds, so pretty quickly, and it did a parallel index-only scan, which makes sense. They then ran the same query against the id_random column, with the range adjusted a bit to return roughly the same amount of data, since random values make that harder to match exactly. But there they got an atrocious result: EXPLAIN with BUFFERS showed a huge number of heap fetches, which didn't make sense to me, because the index is ordered and the query shouldn't need to go to the heap at all. This is what I couldn't replicate: I built the same table, not at the same data size but with the same indexes, and the result for id_random was identical to id_sorted - a parallel index-only scan, which is what I would expect. So I don't quite understand this result, because intuitively the index is ordered and you're not asking for any other columns; it should do a parallel index-only scan, which is what I got when I tested it on Postgres 14, not 15. But then they tried BRIN indexes. The thing to know about BRIN is that they are block range indexes. They are super small because they only store summary information per range of blocks, so when you look up data, the index knows a range of values, and to find specific rows within that range it has to go to the heap. Essentially, a BRIN index is used via a bitmap index scan, and it always has to visit the heap to find the specific rows, because it only operates on ranges. With that, they actually saw worse performance from the BRIN index compared to the B-tree. That's not always the case - sometimes BRIN is faster - but I think the reason it was slower here is that it has to go to the heap, whereas the B-tree query on the sorted column was an index-only scan, and you basically can't get an index-only scan with a BRIN index because it only tracks ranges.

So I think that's the reason why it's slower, or at least it explains what's going on. Now, there's another post, from back in July, called "Postgres Indexing: When Does BRIN Win?", from crunchydata.com. They demonstrated that at a small number of rows the B-tree is going to trounce the BRIN. They were not using index-only scans; they were pulling another column, so it was the typical pattern of scanning the index and then going to the heap to get the data - no index-only scans like in the previous post. What they showed is that at small row counts the B-tree index will definitely win, because it knows exactly where those rows are. The advantage starts to shrink as they went up to 1,000 rows; at 10,000 rows the B-tree and the BRIN index are near even, and at 100,000 rows the BRIN starts winning, and it basically goes up from there. The more rows you're going to access, the more the BRIN wins, because it's faster to scan that tiny index and then go to the heap for the specific values than to use a huge B-tree index. That's another difference between B-tree and BRIN: a BRIN index is super small because it only indexes ranges, whereas a B-tree indexes every value. So definitely check out these posts if you want to learn more about B-tree and BRIN indexes.
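If you want to poke at that trade-off yourself, something along these lines is enough to compare the index sizes and plans. The table and index names are made up, and the scale is much smaller than in either article, so treat it as a sketch rather than a benchmark:

    -- Made-up names, small scale; a sorted integer column like the articles use.
    CREATE TABLE facts AS
    SELECT i AS id_sorted FROM generate_series(1, 10000000) AS i;
    VACUUM ANALYZE facts;

    -- A B-tree indexes every value; a BRIN index only stores summary (min/max)
    -- information per block range, so it is far smaller.
    CREATE INDEX facts_btree ON facts USING btree (id_sorted);
    CREATE INDEX facts_brin  ON facts USING brin  (id_sorted);
    SELECT pg_size_pretty(pg_relation_size('facts_btree')) AS btree_size,
           pg_size_pretty(pg_relation_size('facts_brin'))  AS brin_size;

    -- With the B-tree available this can be an index-only scan; drop it and the
    -- BRIN plan becomes a bitmap index scan plus heap fetches, since block
    -- ranges alone can't identify the exact rows.
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT count(*) FROM facts WHERE id_sorted BETWEEN 1000000 AND 2000000;

    DROP INDEX facts_btree;
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT count(*) FROM facts WHERE id_sorted BETWEEN 1000000 AND 2000000;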
[00:15:18] Next piece of content: "New Public Schema Permissions in PostgreSQL 15". This is from enterprisedb.com. They're talking about the public schema changes in 15 that we've covered in previous episodes of Scaling Postgres: basically, ordinary users can no longer create any object they want in the public schema. But existing databases don't get this behavior automatically, even after you upgrade. To get it for an existing database, you need to run two commands: alter the public schema's owner to pg_database_owner, and revoke create on the public schema from PUBLIC. That essentially sets your existing database up the way a newly created database would be. If you want to learn more, definitely check out this blog post.

Next piece of content: "PostgreSQL: Are All NULLs the Same?". This is from percona.com, and they're talking about DISTINCT and nulls. Typically a null is an unknown value, so two nulls could represent distinct values, or they could not. Historically, nulls have all been considered distinct from each other, but there is new syntax in Postgres 15 that lets you say NULLS NOT DISTINCT, so all nulls are treated as one value for that purpose. We've covered this in a previous episode of Scaling Postgres, but if you want to learn more, you can definitely check out this blog post as well.
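Putting those two items into runnable form, here is roughly what the commands look like. The first two statements are the ones mentioned above; the subscribers table in the second half is purely a hypothetical example to show the new Postgres 15 syntax:

    -- Give an existing database the same public-schema behavior as a freshly
    -- created Postgres 15 database.
    ALTER SCHEMA public OWNER TO pg_database_owner;
    REVOKE CREATE ON SCHEMA public FROM PUBLIC;

    -- New in Postgres 15: treat all NULLs as equal for uniqueness purposes,
    -- so only one NULL email would be allowed (hypothetical table).
    CREATE TABLE subscribers (
        id    bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        email text
    );
    CREATE UNIQUE INDEX subscribers_email_uniq
        ON subscribers (email) NULLS NOT DISTINCT;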
Next piece of content: "Securing Patroni REST API Endpoints, Part One". This is from percona.com. They're talking about how you can set up usernames and passwords for the Patroni REST API, as well as HTTPS encryption, to protect your Patroni install. So if you're interested in that, check out this blog post.

Next piece of content: "Data Loading in Postgres for Newbies". This is from crunchydata.com. They're talking about all sorts of different ways you can get data into your Postgres database. The main one is definitely importing CSV with the psql \copy command, but there are also ways to do it with JSON, particularly using the jq operating system command. In their example they just load it into a single JSONB table, and then perhaps you transfer that data elsewhere with an insert into another table, selecting from the existing one. And typically that is how I load data: I load it into some sort of temporary table, because you frequently need to transform the data or you have data quality issues. You load it into a temporary table that is not as strict, and then you insert it into its final destination, doing all the transformations and data validations needed. They also talk about different GUI ways to get data in, ways to dump and restore data, and using foreign data wrappers to connect to another database to insert data from it, or even to run a query to generate the data you want. So if you want to learn more about all these different techniques, definitely check out this blog post.

Next piece of content: there was another episode of Postgres FM this week. This one was on version control for databases, so if you want to learn more about that, you can definitely listen to the episode or check out the YouTube link here.

The next piece of content: the PostgreSQL Person of the Week is Adrian Nara. If you're interested in learning more about Adrian and his contributions to Postgres, definitely check out this blog post.

And the last piece of content: we had another episode of the Rubber Duck Dev Show this past Wednesday evening. This was on where you put your business logic in your application framework. We tend to use Rails, so we discussed that, as well as where some other application frameworks tend to put it. If you're interested in that type of long-form developer content, we welcome you to check out our show.

That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.
