Indexing Documents, GIN Indexes, Data Science, Generated Columns

[00:00:00] In this episode of Scaling Postgres we talk about indexing documents, gin indexes data science and generated columns. I'm Kristen Jameson and this is Scaling Postgres episode 71. [00:00:21] Alright, I hope you're having a great week. Our first piece of content hint is indexing documents for full text search. This is from Dave's Postgres blog at Pgsnake Blogspot.com. So here he said his boss asked about the new full text search in MongoDB that apparently they've recently released and he was saying maybe we should do something similar for Postgres. And he said, well we can do that already. So he basically did in relatively little lines of code. He says 36, but of course there's comments and some other things, but basically it shows you relatively easily how to index documents for full text search. The first thing he does is create a table, and this is for specifically indexing HTML documents. So he has a path title body, a TS vector for doing the searches, creates a gen index on the column he's going to be searching the TS vector. He then creates a trigger that's going to generate the TS vector whenever data is changed. So upon an insert or update, it will update this column. And then he has some Python code here that basically goes through a list of files, extracts the data and inserts it into the table he created. And then lastly, he does a search using this query to look for the keyword trigger within the documents that he indexed. And he also ordered it by the rank and then limited it to 20 results. And you can see those here. So not a lot of code to create a full text search when you're indexing a number of documents. [00:02:03] The next post is PostgreSQL's indexes gen, and this is from Louisemeda.com and it's basically describing the gen index or the generalized inverted index. And of course the first thing she mentions here is they're generally used to index arrays, JSON, B fields and TS vector fields for the full text search as we've just seen. So this explains how they work at a pretty detailed level. She uses examples of crocodiles and teeth for doing her examples. But the key thing to take about the gen structure is that it is basically a B tree index for getting all the keys and values, but it doesn't store every instance of a value the way a B tree does. But what it's storing is essentially the key values and storing those only once. So one reference for every value and then from there it creates a list or even an additional b tree index to find where all those occurrences happen. So for example, she has a diagram here and there's a value of six. Well that six only exists in one place, there's no other six in a normal B tree index. If you had ten different sixes, you would have in the leaf pages, the ten one index entry per row in the table, whereas here it just stores a single instance of it, but then an additional list or a tree that identifies all those instances. So here's an example of it where you have a value that goes to the leaf in the posting tree here or posting list. So this is what makes it very effective for identifying the occurrences of a word in a document or multiple documents, or finding the occurrence of a value within an array of columns. So if you want to learn more about how a gen index is structured, definitely a blog post to check out. Now, related to that is another blog [email protected] called PostgreSQL's Indexes Gen Algorithms. Now, this covers how the gen index is used, so how it does searching from this phase of scanning keys, scanning the pending list, scanning the main index, and then updating the bidmap. And then she also goes into the process of inserting into the gen index and what's required for doing that, as well as deleting from the gen index. So again, if you want to get more in depth about how a gen index works, definitely another blog post to check out. [00:04:36] Continuing our theme of indexes. The next piece of content is deep dive into PostgreSQL's indexes webinar q A. This is from the Percona.com blog. So they did a presentation or a webinar. [00:04:51] This first post. They have some of the questions that occurred during the webinar, but they have a separate link here with the slides and the recording, and it actually has a PDF. And they cover things such as all the different index types from Btree, hash, Brin, gin, Gist, et cetera, as well as talking about partial indexes, expression indexes, as well as things like index only scans. So if you want a more general review of indexing, definitely a post to check out. And they have this slide here, 22 of the presentation where they're talking about the different index types and use cases for each. Now, what I found interesting is that they don't list Gin as a full text search option, which is generally the one that I tend to use. I tend to use Gin as opposed to Gist. And even looking at the PostgreSQL documentation for version eleven where they're talking about the Gin and Gist types they actually list. Now this wasn't always the case, but they've started listing Gin indexes as the preferred text search index type. But they talk about there is some advantages and certain use cases for using Gist, but just something to keep in mind as you're choosing your index types. The next post is PostgreSQL Meets Data Science and AI. So this is an opinion piece, but I actually think it's pretty relevant. So the first section here they're talking about in terms of for data science or AI purposes, are you going to use something like PostgreSQL versus CSV files versus commercial databases? And they've run into issues with certain clients where their data quality within a CSV file is really bad. Like for example, he says quote, let me quote 3% was stored as a string. So with something like that it's quite hard to compare different data. And he also included this quote here if you're doing artificial intelligence, 80% of your effort has to go into data preparation. Only 20% are the real model. And he continues in short, you cannot seriously expect a model to learn something about interest rates if the raw data does not even ensure that the interest rate is actually a number. So in general, if you're going to be cleaning up data, you might as well put it into a system such as relational database system such as PostgreSQL where you can ensure data integrity and data consistency. And he talks about what system is better for sorting data. All the experience that PostgreSQL has versus writing your own Python scripts. What about filtering? Again tailor made for relational databases as opposed to trying to develop your own script to go through CSV files. So basically what he says makes a lot of sense and he actually goes into classical SQL analytics versus machine learning. And so many people are using the term machine learning where all they're wanting to do is just add up numbers or make a small prediction about what will happen. And basically you can look at historical data and extrapolate you don't need to use some of these very complex machine learning tools. But he does go over some considerations if you are wanting to do data science and machine learning or AI on some tools you can use potentially in conjunction with PostgreSQL. So, definitely an interesting blog post I encourage you to check out. [00:08:16] The next post is how the CIS benchmark for PostgreSQL eleven works. This is from Crunchydata.com and again the CIS is the center for Internet Security and they have a benchmark for establishing your PostgreSQL instance when running on CentOS that it is secured following a set of best practices. So they've updated the benchmark for Levin and they talk about the various areas that it covers in terms of installation and patches, directory file permissions, logging, monitoring and auditing, user authentication, access controls and authorization, connection and replication, as well as PostgreSQL settings and special configuration considerations. [00:08:58] Like some of the updates they did for eleven are actually in reference to the SSL passphrase command where you can actually define where you can get the SSL passphrase for the private key when starting PostgreSQL on the server. So if you're interested in securing your PostgreSQL setup, definitely a blog post and a benchmark to check out. [00:09:21] The next post is generated columns in PostgreSQL twelve, and this is from secondquadrant.com. And they're talking about the new feature called Generated columns, where you can actually define a column that will be generated based upon existing data in the table. In the example here, they actually used a concat function to concatenate an address to a delivery address that you can just simply run and use that will output the address as follows here basically broken out appropriately. Now, they actually had to use their own Concate function due to some potential mutability that can happen because some string functions are locale dependent. So they actually had to generate their own function to do this type of concatenation. But he said as another use case for it is that perhaps you want to generate in a data warehouse example, separate columns for month, day, year, quarter, day of the week, things of that nature, all from a single date. And generated columns are a good use case for that. So whenever the raw data is updated, that generated column is regenerated and he says you can't directly update the data, but of course you can index it. And then finally he has a very comprehensive set of functions for splitting out this type of data in a data warehouse use case. So if you're interested in using generated and columns in this way, definitely blog post to check out. [00:10:53] The next post is PostgreSQL interval date, timestamp and time data types and this is from secondquadrant.com and basically they're covering all the different data type as it relates to dates and times. So they cover the basic data types, talk about also talking about ranges, the different styles that you can use for defining your dates and times, performing arithmetic on them. Basically a very comprehensive post showing different ways of using date times and intervals. So if you're interested in that, definitely a blog post to check out. [00:11:30] The next piece of content is actually a PG Bouncer update called Afraid of the World. [00:11:39] The main feature added for PG Bouncer one point ten is adding support for enabling and disabling TLS 1.3 as well as a number of bug fixes. So if you use PG Bouncer, definitely an upgrade to check out. [00:11:55] The next piece of content is actually a vimeo video called Database as API with PostgreSQL and Massive JS by Diane Faye. So this talks about using PostgreSQL with a node environment. Now many node users use MongoDB, but if you are wanting to potentially use PostgreSQL, Massive JS is a library you can use for interfacing with PostgreSQL. So if you're a node user, this is a potential video to check out to see if you want to potentially use PostgreSQL with it. [00:12:29] The last piece of content is actually a ton of videos in the last week or so have been posted to the PGConf Russia YouTube channel. Now, a lot of these are in Russian and even the presentations, but they do have English dubbed versions for some of them and I believe some of them are in English. Like for example, this presentation is. So if you're interested in additional video content, perhaps this is a channel you want to check out to see if there are any videos you'd like to review that does it. For this episode of Scaling Postgres, you can get links to all the content mentioned in the show notes. Be sure to head over to Scalingposgres.com, where you can sign up to receive weekly notifications of each episode. Or you could subscribe via YouTube or itunes. Thanks.

Show Notes

Episode Transcript

Other Episodes

Episode 90

Chaos Order, Pub Sub, Petabyte Scale, PgBouncer Connections | Scaling Postgres 90

Episode 136

Stat Tracking, Temporary Tables, pg_squeeze, pg_auto_failover | Scaling Postgres 136

Episode 345

Over 5X Faster Postgres? | Scaling Postgres 345

Indexing Documents, GIN Indexes, Data Science, Generated Columns | Scaling Postgres 71