20 Times Faster Text Search & Ranking? | Scaling Postgres 286

Episode 286 October 15, 2023 00:12:59
20 Times Faster Text Search & Ranking? | Scaling Postgres 286
Scaling Postgres
20 Times Faster Text Search & Ranking? | Scaling Postgres 286

Oct 15 2023 | 00:12:59

/

Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, we discuss a new extension that promises substantially faster text search and ranking, an AI content storm, how to work with money and the fastest way to stop Postgres.

To get the show notes as well as get notified of new episodes, visit: 

 https://www.scalingpostgres.com/episodes/286-20-times-faster-text-search-ranking/

 

View Full Transcript

Episode Transcript

[00:00:00] Could Postgres work like Elastasearch and get you 20 times faster searching and ranking? We'll explore that in this episode of Scaling Postgres. But before we do that, I hope you, your friends, family and coworkers continue to do well. So the first thing I wanted to mention this week is that I am working on a postgres course and it's going to focus on performance optimization because, you know, scaling post. Basically, I'll be teaching the framework I use when I engage with clients to help them make postgres faster or reduce the query or the statement burden on their database so it operates more efficiently. And I'm mentioning this because I'm looking for some feedback. If this course sounds like something you might be interested in, I'd like to have a Zoom call with you. So if you're interested in doing that, you can head over to Scalingpostgres.com Zoom and just send me your email and we can coordinate a time to have about a 15 or 20 minutes discussion. But let's get started with this week's content. And that is pGBM 25 elastic quality. Full text search inside postgres. This is from Docs Paradb.com and they're talking about a new extension that provides a new type of index called BM 25, which is the algorithm that's used by modern search engines to calculate relevancy scores and search results. So presumably this is the same type of algorithm used in Elastasearch. Now they're focusing on full text search which uses the TS vector type and they're saying it has two issues. One is performance and that when tables grow to millions of rows, a single full text search can take several minutes. I haven't really seen that unless they're talking hundreds of millions of rows up to a billion. But then of course, I'm always looking for faster way to do things. The second thing they mention is that Postgres has no support for operations like fuzzy search, which it does, just not in the full text search area. But it doesn't do relevancy tuning or relevancy scoring. So already I'm kind of like, okay, we'll see what the rest of this covers. But this new extension and index type will be, as they say here, 100% postgres native with no dependencies on some sort of external search engine. That's great. It's built on top of Tantavi, which is a Rust based alternative to Apache Lucine. It says it's faster than the full text search that exists now and it also supports fuzzy search, aggregations, highlighting and relevancy tuning. And it does real time search as well. In other words, you don't have to re index things. Now apparently what made this possible is that Tantavi was created as a Rust based alternative to Apache Lucing and then Postgres got Pgrx, which made it possible to build extensions in Rust. So you put those two together and they came up with this new extension to be able to do Elastasearch like searches. So they're tying into Postgres using the index access method API as well as the operator API. And as a result you can search for a query strings within a table and they even designed a way to give instructions to Tantev's Mini query language here. So they did do some benchmarkings on their site here and they indicate that the build times are faster than the TS query, which is in red. But also the performance looks very impressive, although I haven't heard minutes to return a query from Full Text Search when indexes are properly set up. But this one here at a million rows is returning in just over 3 seconds with the TS query. You can see some of the benchmarks here. I took a look at it, nothing jumped out at me, I'd probably have to install it and play around with it. But if this relationship holds as you get to millions of millions of rows, then the Parade DB results look quite significant. Now in terms of installing it, you can download the Paradb docker file and you may be wondering what Parade DB is. Well, that's basically their postgres solution dedicated to searching, kind of like Citus is dedicated to scale at postgres and Timescale is dedicated to time series databases. This one is an index focused one, but going back to installing it, you can get it as a docker container or you can do it self hosted. So you basically install Rust in the Cargo Pgrx using this method here. Then you get clone the code initialize, I think it's called the Create, install it and then create the pGBM 25 extension. So I find this super interesting, but what do you think? Is this something you think you would like to check out? Do you think it would work better for Full Text Search cases? I do have a portion of my clients that are also using Elastasearch alongside postgres, but if you can put it all in one database, that would seem really advantageous to me because I know some clients are only using it for a couple of simple search. But check out this content if you want to learn more. Next piece of content. AI ignites the rain. Content. Storm is back. This is from Superbase.com and apparently they have a Content Storm where they coordinate with a number of content producers to say hey, let's create content on this one subject and this one was dedicated to AI or artificial intelligence. So they actually had 30 creators participate in this Content Storm and all the content was published this past Tuesday, October 10. And you can see all the different content here. Now, all of this content is of course related to Subabase who did this blog post along with primarily PG Vector and how that's being used together. But if you are looking for AI content, definitely bookmark this blog post to go over all this content. Next piece of content postgres the Next Generation investing in the Next Generation of Committers this is from Redmonk.com and they're talking about the committers to the postgres code base and particularly how a lot of the core development team has the average age of probably around 50. And basically, even though that's fine and I know I'm in around that range, it's definitely important to keep building the community, particularly for the future. And I just thought it was pretty interesting, this discussion on the community's status and things to potentially improve about it. So you can check out this blog post if you're interested in that next piece of content. Working with Money in Postgres this is from Crunchydata.com and they're talking about things you should or should not use when you're working with Money. The first thing not to use that they say is basically the Money data type, primarily because it's tied to the database locale setting. So that could cause you a lot of problems down the line. So it's not really the best data type to use. Also floats. You never want to use floats because they're highly inaccurate. I mean, it's fine for graphical representation like games, or maybe geospatial work because the accuracy does not have to be super, super high. But with Money you don't want to lose a cent or even a partial cent, so you don't want to use those. The two candidates you should use are integers. And now it's normally used if you have an amount that's 33 dollarsfifty cents, they store 3350 in the integer column. Basically they move the decimal point however many places over and just store the full amount in the integer. But you need to have the currency as a separate column along with it so you can know what calculations are needed. Now, the downside of that is, of course, you must always have a calculation to know what the true amount in the database is. Meaning if it's 3350, you know you need to divide it by 100 to get the proper dollars and cents. If you're dealing with United States dollars, the other option is numeric, where you can actually specify the precision and the scale. So how much to the left of the decimal point and how much to the right of the decimal point. And be sure to choose a scale factor that's appropriate for whatever fractions you want to support. And they also talk about different functions. So if you're interested in this, you can definitely check out this blog post next piece of content. Pgvector versus Pine Cone. Cost and performance. This is from Superbase.com. They're comparing the Pgvector extension to Pine Cone, which is a dedicated vector database, and they did some performance tests, which of course you have to take with a grain of salt. But using the HNSW index on Pgvector compared to Pine Cone, the performance looks like it was over ten times faster. So that's a significant difference. Even the IVF flat performed better than Pine Cone in their example here, they also increased the number of pods, but PG Vector in the HNSW index still performed better than Pine Cone. Then they tried the fastest pod type, but it results in decreased accuracy. So zero 94 compared to zero 99 or zero 98, I think even higher. But once they match the accuracy for the HNSW index, it looks like it still performed well, just a little bit less than twice as fast. So another good post to check out if you're considering using Postgres for AI work, so you can see how all the different knobs can be tuned to get the best performance out of different indexes. Next piece of content stop and start postgres faster. This is from Postgres FM. This was the episode they had last week. And here Nikolai did a solo show where he was talking about this and he talked about when you're stopping you could have issues with ongoing connections that haven't been dropped or maybe wall that has not been archived yet. But the main reason I think are checkpoints. So basically whenever you shut down the database, you need to do a checkpoint to take everything that's in memory and then to the wall and reach a consistent state on the data files on the disk. And I like what he said in this because he suggested taking a manual checkpoint before the shutdown procedure and maybe even doing one or two checkpoints, because maybe you do a checkpoint and it takes ten minutes to do well, do a checkpoint then immediately. So you get as much work done as possible before you start the shutdown procedure, because I haven't really seen a lot of content mentioning that, but it's something I've done running checkpoints before shutdown to make sure as much of the checkpointing process is done as possible. And then similarly to speed up starts, it's basically make sure that you gracefully shut down that will allow you to start faster, because if you have a crash scenario, you have to recover from it, you have to read all this wall to get back up to a consistent state. So as long as you shut down on a clean state, the startup procedure should be pretty fast. But you can play the episode here or watch his video on YouTube here. Next Piece of Content Hot Updates and Brin indexes in Postgres 16 this is from Pganalyze.com, and apparently another update to postgres 16 is that hot updates can be done even with Brin indexes. So hot updates are basically heap only Tuple updates, so you're only updating the heap, you don't have to change any indexes. And apparently Brin indexes were blocking heap only Tuple updates, whereas in postgres 16 now updates can happen even though you have a Brin index on the table. So check out this piece of content if you want to learn more. Next Piece of Content PG Reset Wall when to Reset the Wall in PostgreSQL this is from Cyprtech Hive and Postgresql.com basically I would say probably don't ever run it. So, as he says here, the wall helps you reach a consistent state along with the data files. Or if you have shared buffers in the data file, you can reach a consistent state. So anytime you're resetting wall files, you're going to be throwing away some of that consistent state. And I like his very important rules here one always back up what you have before running this. And really I would have probably put this first, although the backups are important, but use it as a means of last resort. In other words, this is the last thing you should do. The third is do not call PG reset wall carelessly definitely not. And expect data loss. So this is for when you just can't get your database running, in my opinion. And maybe it can allow you to get access to some of the data. But if you want to learn more about it, definitely check out this blog post. Next piece of content Explore Practical PostgreSQL substring use cases with examples this is from Techbuddies IO and they're talking about the substring function in Postgres and all sorts of different ways you can use it, especially with a lot of the different regular expression options that they cover here. So a lot of different examples on how to use it. So if you're interested in that, you can check out this piece of content. Next piece of Content set up Geocoder with PostGIS and Tiger Line this is from Blog Rushprooflabs.com and he's talking about setting up the US census Bureau Tiger Line data set along with PostGIS to do geocoding of addresses. So this is the setup portion and then the second blog post actually does the geocoding. So if you're interested in that, you can definitely check out this blog post. And the last piece of content there was a null presentation update that was done by Bruce Momgium and the title was Nulls Make Things Easier question Mark, which I like the question mark and how he says, quote humorously my major recommendation was to liberally use not null column constraints where appropriate to prevent their existence. Fewer nulls, fewer problems. And I agree, I try to use as few nulls as possible. But if you want to learn more, you can definitely check out this blog post. [00:12:25] I hope you enjoyed this episode of Scaling Postgres. Be sure to head over to Scalingpostgres.com where you can check out the links of all the content that I covered this week, as well as transcripts, as well as the podcast and the video replay. And if you are interested in the course that I'm developing with regard to performance, please do check out Scalingpostgres.com Zoom and submit your email and we can coordinate a time to have a Zoom call. Other than that, I will see you next week. [00:12:54] Our.

Other Episodes

Episode 40

November 25, 2018 00:08:37
Episode Cover

Materialized, PgBouncer Auth, Implicit vs Explicit, JIT | Scaling Postgres 40

In this episode of Scaling Postgres, we review articles covering materialized views in Rails, pgbouncer auth_user, implicit vs explicit joins and JIT. To get...

Listen

Episode 270

June 18, 2023 00:14:16
Episode Cover

Data Skew, Monitoring, Remote Access CSV, Documentation | Scaling Postgres 270

  In this episode of Scaling Postgres, we discuss how to handle data skew, tools for Postgres monitoring, how to load data remotely and Postgres...

Listen

Episode 229

August 22, 2022 00:11:49
Episode Cover

Postgres Playground, PG14 Internals, DB Corruption, Anti-Join | Scaling Postgres 229

In this episode of Scaling Postgres, we discuss a new Postgres playground, a book about PG14 internals, how to corrupt your database and using...

Listen