Episode Transcript
[00:00:00] In this episode of Scaling Postgres we talk about optimizing trigram, search, replication, review, logical improvements in timescale, investment. I'm Kristen Jamison and this is Scaling Postgres, episode 204.
[00:00:24] I hope you, your friends, family and coworkers continue to do well. Our first piece of content is Optimizing postgres text search with Trigrams. This is from Alex Klebizus. My apologies for mispronouncing that. So this is a very significant post and you can see here it's a 37 minutes read and it has this comprehensive table of contents here on the right covering all the different areas. But given it's the first post that I'm showcasing, you should know that that means it's a good post. So I'd highly suggest reading it. And he starts off with some very basic simplistic implementation, runs in about 360 seconds, but he gets it down to just over 100 milliseconds. So he goes through the process of doing a fuzzy base search using trigrams and showing you how to accelerate that. So first he says, okay, what is text search? It's basically you have one input and it's going to search through one or more columns in a particular table. Now why postgres? And basically for a lot of search cases, and I hear this from other places, but postgres is probably good enough. If it's not good enough you're going to have to get something more sophisticated like elasticsearch and have someone manage it, et cetera. Then it covers what are trigrams. So it's basically separating a word into three character sequences. So for example, hello would be represented by these trigrams here. Now if you're a little confused here, I think some of the double quotes are missing from some of these, but that's the basic gist. And there's actually a postgres function called Show underscore Tgrm that you can put in a text string and it will give you all of the different trigrams for it. Now he's not addressing full text search because, well, to my understanding you can't really do a fuzzy search with that. And full text search is not useful when you have something that's not natural language based. So for example, if you have code or if you have some sequence of numbers like he says, product SKUs and email addresses, et cetera. Now he talks about the test environment he set up. The Amazon data set he's using is a review data set. So we place this into a single table called Reviews and he put the data in its own columns. And he only used five columns from this data set, but it was about a gigabyte in size. And then he uses Explain Analyze with buffers to check the performance of things. So before he gets into his first query, he talks about the trigram operators and you have a similarity function as well as two different operators he's using. So similarity gives you a numeric function that determines how similar some trigrams is to something else. And he has this great query here that shows the comparison between two different sets of trigrams what their intersection is, what the union is, what the similarity function is. The percent operator, which basically returns true above a certain default threshold which is, say, zero three here. And then the less than hyphen greater than gives you the distance between two sets of text. And that's useful for doing your ranking or doing your sorting. So he's searching for the name Michael Lewis within the reviews, and here's the query he's using. So he's using the percent operator to find those that match a certain threshold of similarity. So only pull those records and then just pull the top ten ordering or sorting by their similarity. And with a query like this, you get results that look like this. So you get the score listed, here what was returned and the review ID, and this ran in a minute and 34 seconds, so not great. Then he did a fuzzy search and that gave results similar to this. So he's not really happy with the 94 seconds. So he says, okay, how can we speed that up? And the first thing he's reaching for is of course, indexes. So I typically for text search use a gen index, and actually that's what the postgres documentation talks about. But you can also use Gist indexes, and he actually uses the Gist index because Gist supports both filtering and sorting, so you can easily do that sorting operation to get the similarity if you use a Gist index. And also with the Gist index, he's using a newer feature called Siglen, which is the signature length, and you can think of it as how it builds. The index is the level of granularity, so if you have a high Siglin, the index will be larger, but you'll actually be able to target a smaller portion of the index. So less scanning should be involved to do a given search, whereas if it's smaller, it's going to be covering more of the index when it does a search. But that index overall size should be smaller. So we actually test a signature length of 64 and 256. Now using the Gist index, with the shorter Siglin it went down to 4.5 seconds, and with the 256 signal in, it went down to just under 2 seconds. So that's the first set of improvements. Now he also went into doing an exact search and then a trigram search. So using the like operator, he says, well, just give me an exact search to see how fast that can happen. And doing that exact search on a single column returned in six milliseconds, and the fuzzy search returned in ten milliseconds, so definitely super fast. So his thought process here is, hey, maybe we do an exact search to pull out those. Because he's only looking for the first ten, you might not need to do a fuzzy search. So you do the exact search, get what you need then you'll only do the trigram search if you need to. So that's a thought process he went through here about potentially doing that. So now up to this point he's just been searching for a single column. So now he wants to do all the columns and he did a few implementations of this. He tried doing unions querying on each column that really wasn't very efficient. He then tried doing an or so this column or this column or this column and that efficiency was actually even worse. So finally he said, okay, I'm going to put all the columns together in the index. So he's basically using an expression index. So he's concatenating the different fields together and searching on that. The other thing he discovered there's a set of functions for word similarity. So we were talking about the similarity index. Well this is a word similarity index and what he discovered, it's immune to different sizes of text comparisons in terms of matching similarity. So for example, if you do this fuzzy search of Lewis against Lewis, it has pretty good similarity, but once you start having a very large text string like you're concatenating them together, the similarity starts being hurt by that, whereas it maintains it with the word similarity. So he's switching over to use essentially that function and some slightly different operators. So he creates his just index using the Siglin of 256 and he concatenates all the different columns together. He wants to be part of that functional index. And then this is a query that actually queries it. And you'll see this is a slightly different operator that's using the word similarity operator and also this uses the word similarity as well. Again, this is for ranking and this is for filtering. Again, this is the trigram searching portion. And then he did a like implementation for the exact search and the results look good and here's what the performance times look like. So using an exact only search, just using like for looking for the exact name, it returned in 37 milliseconds. The trigram search for the exact name was 39 milliseconds. So it's super close, it's almost identical. Whether you're doing Ilike or a trigram search, it's identical. So I don't even know if you need to first do that exact search and then try trigram. I would just go with this implementation. Same thing with a fuzzy search.
[00:08:20] The exact search was 87 milliseconds, whereas the trigram search was 113 milliseconds. So I would probably just go with the trigram search and not worry about this exact search. But basically this is how he gets it down to 100 milliseconds across four columns. Now, one thing he didn't mention, but I wonder if you could do this. These queries are quite verbose and I wondered if instead you could use a generated column feature of Postgres where you generate a column that contains this concatenated information and then you put the gist index on that. So essentially you would get rid of this code here, this code here and all the concatenation code you could get rid of because you're just querying what your user gave to a particular column. But this was a super impressive post and I definitely suggest you check it out.
[00:09:13] The Next Piece of Content five minutes of postgres episode Six optimizing postgres text search with Trigrams and just indexes. This is from Pganalyze.com and this is a follow on to that post that I just discussed. So if you want to add a little bit more insight into what this post is communicating highly suggest you check out this episode. They also talk a little bit more about just indexes in general if you want to learn more about how those work.
[00:09:39] The next piece of content replication Review this is from Enterprisedb.com and this kind of goes through the process of replication and how it was added to postgres. Basically, in the beginning there were only wall files that existed for the write ahead log for crash recovery purposes. So it was never for replication. But they had the idea, hey, maybe we could use these to build and keep a Replica postgres database in sync. That's when they started having this process called log shipping to keep a Replica in place. And then with postgres nine they started introducing streaming replication where using a wire protocol it could keep a Replica up to date. You didn't have to do the log shipping method. Then over the different versions they added Cascading replication replication slots to help prevent Replicas from falling out of sync. They added different views as well as synchronous replication. And in version 9.4 they added logical decoding so you could decode the wall files that are being transferred over and follow it on with logical replication in version ten. So this post is talking about we have a lot of these pieces in place but if you want to actually create a cluster that's aware of what is the master, who are the Replicas, what state are they in, you actually have to query each database to get an understanding of the topology of the cluster. So it's basically stating an argument. It would be great to have this type of tool to understand the state of the database in general. Now separate tools have done things to try and serve that role. Like he's talking about EDB's bi directional postgres Extension, which of course is the this is the EDB blog but other tools are trying to do that as well, such as Petrone for example. But this is a great post talking about where we've been with replication and where we can potentially go to the future to build a system with more high availability. So definitely suggest checking out this blog post.
[00:11:46] The next piece of content. Logical replication decoding improvements in PostgreSQL 13 and 14. This is from Procona.com and this is a follow on from the post where they were discussing petroni has added some features to address the problem with the logical replication slot failover not happening, or there's no way for it to happen now in a standard postgres database. They actually built that process into Petrone failover. Well, he's talking about some other improvements that have made logical replication easier to use. Number one, there were some problems with some memory usage and disk usage, especially when you have long transactions or a long load time that could impact logical replication. Well, they made some changes in version 13 to improve this by spilling over some things to disk and then sometimes people were having even on 13, some replication delays and CPU load taking care of that spilling to disk and there's been some changes to improve that. So one of the main things in 13 is they added a logical decoding workmen to allow you to allot more memory dedicated to handling logical decoding and not have to spill it to the disk when it runs out of memory. And then secondly, they added the ability to stream in process transactions in postgres 14. So if you have a really long transaction happening, you can actually start streaming it to Replicas before it's actually committed. And we've discussed different posts about that happening, but they cover it in this post as well and they actually show you how to set that up. So if you're actually wanting to do streaming logical replication or have that capability for in process transactions, when you create your subscription, you go ahead and state that with streaming on and that will go ahead and turn on streaming. Or if you have an existing subscription, you can actually alter it to set streaming on. And then there's a couple of different system views that have been added. One is to check the progress of a copy which is with PGSTAT Progress copy, which is important when you're initially setting up logical replication as well as additional columns to PGSTAT replication slots because this allows you to track what transactions are spilling to disks, the size of it, as well as what's streaming. So what's being streamed to other Replicas. So definitely some great additions to Postgres. And go ahead and check out this blog post if you want to learn more.
[00:14:06] The next piece of content Year of the Tiger $110,000,000 to build the future of data for developers worldwide. This is from Timescale.com and they just received $110,000,000 investment. So that's quite significant. And I believe all told, this organization has received $180,000,000 in funding, if that's what they're talking about here. So you can read more of the detail here. But this is significant because of course they are an extension of Postgres, so they rely on base postgres and these funds should help them contribute more to the open source project, making Postgres better itself as well as their solution. So it's been an interesting number of years where, for example, Microsoft has purchased Cytus for their scale out technology and they're doing more with regard to time series functions as well. And now Timescale has received this significant investment. But if you want to learn more, definitely check out this blog post next piece of Content pgosc Zero Downtime schema Changes in PostgreSQL this is from Jayon Dev. He's talking about a new tool that he's developed called Pgosc, which stands for Online Schema Changes. And basically it's a tool that he says is based upon the implementation design of Pgrapak. So what he's done is that any schema change, you're going to remove a column, you're going to add a column, you're going to change a column. What it does is it just doesn't execute those potentially unsafe changes on an active database. It actually creates a whole new table and then uses transactions to flip over to it. Now, I think this is a great idea and can work for small tables, but if you have a multi terabyte table or even really high hundreds of gigabytes, I don't see how this is really going to work that well, because I have some clients that have a multi terabyte table and there's not even enough disk space to replicate the whole table. So we can't even do, for example, PG repack operations on it unless we were to actually bring up a whole nother volume to do that process. He designed this so that engineers wouldn't have to think about carefully doing migrations without causing issues with the database. But I don't necessarily see this as the solution to that. But if you have relatively small tables, maybe this could work. Now this is relatively new, so I wouldn't call it production ready yet, but it's an interesting concept. If you want to learn more, you can check out this blog post, the Next Piece of Content google Cloud PostgreSQL under PG Watch Two Monitoring this is from Cyberposgresql.com and he's talking about how, yes, you can manage some of your cloud managed databases with third party tools. So for example, they are using a Google Cloud instance of postgres to be able to monitor it using PG Watch Two. And basically, instead of having some monitor on the instance itself, it basically calls in to the database connection to collect the necessary data. And he goes over the process of doing this. And why would you potentially want to do it? I mean, maybe you have a multicloud infrastructure and you ideally have one tool that you can add more customization to to do monitoring as opposed to relying upon each cloud vendor's solution. But if you want to learn more how to do that and get it set up, definitely check out this blog post, the Next Piece of Content. The PostgreSQL Person of the week is Julia Google. If you're interested in learning more about Julia and her contributions to postgres, definitely check out this blog post and the Last Piece of content. We had another episode of The Reproductive Dev show this past Wednesday. This episode was on how to learn a new code base. So if you've just been hired and come to a new organization, what's the best way to get started? Or, for example, maybe you're a consultant and you're seeing a new code base for the first time. What are some techniques you can use to get familiar with it? So if you're interested in that type of content, you can definitely check out our show.
[00:18:17] That does it. For this episode of Scaling Post, you can get links to all the content mentioned in the show notes. Be sure to head over to Scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or itunes. Thanks.