Hash Indexes, Int Float Numeric Types, Postgres 14 Features, Data Science

[00:00:00] In this episode of Scaling Postgres, we talk about hash indexes, int float and numeric postgres 14 and data science. I'm Kristen Jameson and this is Scaling Postgres, episode 148. [00:00:21] All right, I hope you, your friends, family and co workers continue to do well. Our first piece of content is reintroducing hash indexes in PostgreSQL. This is from Hakibanita.com, and they're talking about hash indexes. Now, it says reintroduced, and they've had them for a while, but as of version ten, they actually really became more usable. So the first thing this post describes is what is a hash index? And you basically take a hashing function and apply it to a value, and that hashing function then places that value within a discrete number of buckets. So they have an example of one hash function is doing a modulo or looking at a remainder of a number, and you're putting these values in, say, one through ten, and then whatever the remainder is from three, it will be placed into that bucket. So one goes into bucket one, but also four goes into bucket one because four divided by three has a remainder of one. And also, whatever this number is, you're going to have that many buckets. Now, the hash indexing function within postgres looks like it maps it to a 32 bit integer. So about 4 billion possible hash codes. So essentially those would be the buckets now, because there are a certain number of buckets, and you can see that when you have three, you have a lot of collisions, meaning the same value goes into the same bucket. Presumably that will happen less often with a 32 bit possibility, but collisions do occur, and it discusses kind of how they handle them. And they have what they call hash index overflow pages that then look to find out what the actual value is. So it's not sufficient to just go to the bucket. You need to go to what the exact value is. And then it also mentions there's also splitting that can occur where it needs to actually split a bucket into two buckets. So that's also a possibility. And then they get into actually testing these out. So they're doing a test using a URL shortener. So they're going to have just an integer ID. The key is essentially what the short URL is, and then the full URL here. So they're going to be placing Btree indexes on both the key and the URL as well as hash indexes. And the first thing that they looked at was the index size at different points of inserting data. And the first thing that you'll notice is that the hash index usually seems to be about half the size of a B tree index. So it has spatial efficiency on its side. And because it's a hashed value that it's storing, if you have a very large amount of data you're trying to index, it probably makes more sense to use a hash index versus a b tree index for that. Another thing you can notice is that it doesn't grow in a linear fashion like the b tree index, but it has the stepped approach and that's that hash splitting that happens. They also mentioned the hash index size is not affected by the size of the index value. That's what I mentioned before, when you have a column with a lot or long pieces of information and a hash index could be more efficient and lastly, it's not affected by the selectivity of the index value. Now one thing they mentioned here, the difference between these b tree indexes was actually due to the deduplication that version 13 introduced. And if you actually look in version twelve, you'll see an identical b tree index size across the row values tested, and also that the hash value is still about half the size of the b tree index. Now, they did some reindexing to see if they could make some difference and it does minimize the size difference between the two indexes, but it is still the case that the hash index is smaller. Then they talk a little bit about hash index fill factor which you can adjust, although they say it's default to 75 and between the values of 50 to 100, it doesn't seem to be that much variance, so I would probably just stick with the default. They looked at insert performance to determine how fast inserts can happen and while they were both similar, the hash index had a little bit of an edge and they're testing about 9.7 versus 10.8, so maybe there's about a 10% difference. Then they looked at select performance, so selecting out different rows and here the difference seemed to be a little bit larger, although they did say it's a very slight difference. Looking at the timing, it looks to be about a 30% 35% difference. So the hash index performed these selects in 00:59, whereas it was zero 92 for a b tree index. So the hash index looks a little bit smaller, a little bit more performant for select queries, but it does have of course limitations. You can't create a unique index as a hash index, you can't index across multiple columns the way you can with a b tree index and you can't create sorted indexes. So it really works best for getting individual values. Of course you can't use it for range groups and you can't use it in order by clauses. Again, due to that sorting issue. So their conclusion is that the hash index is usually smaller than a corresponding b tree index. The hash index select and insert performance can be better than a b tree index. The hash index removed many of the restrictions in PostgreSQL ten and is now safe to use. And the hash index has many restrictions that limit its use to very specific use cases. So if you're interested in learning more about hash indexes, potentially using them in your projects, check out this blog post. [00:05:42] The next piece of content is PostgreSQL Int, four versus float, four versus numeric. [00:05:48] And this is from Cybertechsql.com and it's looking at these different numeric data types. So they created a table with the different values in here and then they did some comparisons. And basically a lot of this post talks about when wanting to work with money. Of course you would never want to use a float because it uses approximations to do its calculation using the floating point unit of your CPU. So it's not going to give you an accurate count, but more of an approximation when using floats. So you'll always want to use integers or numerics for any kind of money that you're working with. Now of course one consequence of this is say using numerics as opposed to floats is that you do have a performance impact. So averaging the integer in their example here took about 255 milliseconds. Averaging the float, which is probably using the floating point unit in the CPU took 267 milliseconds, whereas doing the numeric took about 368 milliseconds. So much more of a difference compared to the previous two. And then they give you a little chart showing the differences. So if you're calculating any type of currency, generally you would want to stick with an int or use a numeric if you're wanting to maintain decimals. So if you want to learn more about these data types and their performance, you can check out this post. [00:07:08] The next piece of content waiting for PostgreSQL 14 report progress of copy commands. This is from Dep.com. They're talking about that. There's a new view for you to monitor copy progress called PGSTAT progress copy. So if you start a copy command, you can start watching the copy progress and it will output the number of bytes processed and the number of lines processed for a copy operation. So this looks to be a great addition to Postgres 14. The next post, also from the same site, is waiting for PostgreSQL 14 add idle session timeout. So this is adding an idle session timeout. Now there already exists idle in transaction session timeout. So if you're in a transaction in idle, it will actually cancel that session for you if it hits this timeout. Whereas this is just you have something using a connection and if you want to drop that connection, if it's not actively being used, you have an idle session. You can use this timeout to kill those sessions. So this is a great way to release resources back to the database. So this looks like it will be added in Postgres 14. [00:08:16] The next piece of content is our predictive analytics in Data Science work using PostgreSQL. And this is from Crunchydata.com. And it looks like this is the final post in the data science series that's been done by Crunchy data. This is the final one where they're using the model to do predictions. So they're talking about numer models and using R to predict the probability of fire. So if you want to check out this data science post and cover all of the items here, all running within postgres doing data science analysis, definitely check out this post and the previous series talking about using data science with PostgreSQL. [00:08:53] The next piece of content is the mapping of OID and rel file node in PG. This is from Haigo CA, and they're talking about how the OID of an object in the PG class table maps to a table called Rel file node, and also how those change in the architecture of them. So if you're interested in learning how postgres works under the covers as it were, you can check out this post. [00:09:17] Another post, also from Higo, CA, is how PostgreSQL executes sequential scans with the help of table access method APIs. So in postgres twelve they released the table access method API so that others can use it, and with it they've looked at and analyzed the functions for doing sequential scans. So if you're interested in learning more about that, you can check out this post. [00:09:42] The next piece of content is Sysbench in Memory postgres postgres is boring. This is from Small Datum blogspot.com, and here they're talking about an analysis they did of postgres eleven through 13 and basically reports no regressions for the items that he discovered, hence I believe why he calls it boring. But if you want some more detailed analysis of performance for some lower level workloads, you can check out the results of his tests here. [00:10:12] The next piece of content is how to run some tasks without user intervention at specific times. This is from Dep, and essentially the first item he mentions is just using Cron. So using Cron you can start the psql session and send any data you want into it to say, query against the database or create a table or do a backup or whatever. So essentially Cron is the main one he goes toward and he talks about how he uses it with the postgres scripts he wants to run. Now, some people don't have access to the underlying operating system to use Cron and they're looking for a scheduling tool. So he also mentions the new scheduler tool called PG Timetable. So that's an option and he discusses using it and some use cases for it. And another one that I'm also familiar with is PG Cron, which I believe is being added as an extension that's being built into say, the Amazon RDS instances. So therefore that would be readily accessible for you to use to schedule tasks. So if you want to learn more about scheduling tasks in postgres, definitely check out this blog post. [00:11:19] And the last piece of content is the PostgreSQL Person of the Week is Gunner Nick Bluth. So if you want to learn more about Gunner or Nick, go ahead and check out this blog post that does it. For this episode of Scaling Postgres, you can get links to all the content mentioned in the show notes. Be sure to head over to Scalingposgres.com, where you can sign up to receive weekly notifications of each episode. Or you can subscribe via YouTube. RyTunes. Thanks.

Show Notes

Episode Transcript

Other Episodes

Episode 95

max_locks_per_transaction, Logical Replication, Backup & Recovery, Pgbouncer | Scaling Postgres 95

Episode 364

Scaling For High Volume | Scaling Postgres 364

Episode 58

Security Concerns, Fast Counting, Indexing, Trends | Scaling Postgres 58

Hash Indexes, Int Float Numeric Types, Postgres 14 Features, Data Science | Scaling Postgres 148