Episode Transcript
[00:00:00] In this episode of Scaling Postgres, we talk about automatic indexing, function pipelines, WITH HOLD cursors and query scans. I'm Creston Jameson, and this is Scaling Postgres episode 188.
[00:00:23] All right, I hope you, your friends, family and coworkers continue to do well. Our first piece of content is "Why is it hard to automatically suggest what index to create?" This is from Depesz, and it's a response to a post we covered in Scaling Postgres a week or two ago about pganalyze releasing an index advisor for Postgres. That advisor is posted live so people can paste in their query and some schema information, and it gives a prediction of the best index to use. He tried it with a table layout like this, proposed this query here referencing four different columns from the table, and it recommended a single four-column index. But there are a lot of issues with this, and he reaches basically the same conclusion I did: it's not quite ready for prime time. It can do basic things, but a lot more work would be needed to reach the sophistication required to predict what indexes are really needed. For example, you could put an index on each column, or maybe six different two-column indexes, or four different three-column indexes. The advisor also doesn't take selectivity into consideration, meaning how unique the values in a column are or how frequently they occur; cardinality is another way to phrase that. He works through an example here and discusses how this affects whether you would want an index at all. Basically, if a column has high cardinality, meaning a lot of different unique values, it's a very good candidate for an index. But if a column only has, say, four different values, putting a plain index on it doesn't make a whole lot of sense when you're talking about millions or even billions of rows. However, that could be a good candidate for a partial index, where each of the four values gets its own index using a WHERE clause in the index definition. He has a number of conclusions down here that are very good to follow if you're thinking about the best way to apply an index, and the point of the post is that all of this makes it very hard to automatically design indexes for a given set of data. We'll probably get there eventually, but it's still going to be a while. Some of his recommendations: if the selectivity of a condition is low, meaning almost all the rows match, there's no point putting that column in the index. And if you have a condition that you always use, in this example WHERE status equals done, you can put it in the WHERE part of the index definition.
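To make that concrete, here's a minimal sketch of a partial index on a hypothetical orders table; the table and column names are just for illustration, not from the post:

```sql
-- Hypothetical table: millions of rows, but status has only a few distinct values.
-- A partial index covers only the rows you actually query for,
-- so it stays small and selective.
CREATE INDEX orders_pending_created_at_idx
    ON orders (created_at)
    WHERE status = 'pending';

-- This query can use the partial index because its WHERE clause
-- matches the index predicate.
SELECT id, created_at
FROM orders
WHERE status = 'pending'
  AND created_at > now() - interval '1 day';
```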
[00:03:17] He also advises not to add too many columns to an index, because that makes it larger and less efficient. In particular, if you have a two-column index and adding a third column doesn't significantly reduce the number of rows returned, it may make more sense to stick with the two-column index and let the planner filter out the rest by looking at the heap. Another thing he mentions concerns the greater-than or less-than conditions he had in his query: an index can only be used efficiently for one inequality at a time, and generally you want that column at the end of the index. So you could, for example, have an index on column A, then column B with the greater-than or less-than condition, and then put the D column, the status, in the WHERE clause of the index definition. This is a great post to help educate you on how to design your indexes, as well as a discussion of how hard it is to automatically determine what indexes should go on a particular table. But I think we'll eventually get there, because all of these pieces of information are just more inputs to feed into a model to make better decisions. It's just not quite ready for prime time.
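As a rough sketch of that ordering advice, again with hypothetical table and column names, such an index might look like this:

```sql
-- Equality column first, the inequality (range) column last,
-- and the always-present condition pushed into the index predicate.
CREATE INDEX t_a_b_done_idx
    ON t (a, b)
    WHERE status = 'done';

-- A query shaped like the one in the post:
SELECT *
FROM t
WHERE a = 42
  AND b > 100
  AND status = 'done';
```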
[00:04:33] The next piece of content is "Function pipelines: building functional programming into PostgreSQL using custom operators". This is from the blog at Timescale.com, and they're talking about a very interesting extension they've just released. Because it's a separate extension, you can use it on any version of Postgres. They have the TimescaleDB extension, so clearly you could use it there, and they say it's available now in kind of a beta state, but you can also download the extension separately and use it on any version of Postgres. Basically, it uses a combination of custom data types, custom operators and custom functions to enable some interesting time-series calculations. For example, they wanted to do a specific type of data analysis that they describe in the post, looking at variances over time for particular devices, and to get that you had to write a query like this with a window function and a subquery, and it is not the most readable query. Clearly you can do it, but using this new extension with these pipeline functions, the query becomes much simpler: you create a pipeline of functions that your data flows through. For example, the timevector here is a custom type that holds a timestamp and a value together. You pass it through a sort, then through a delta to get the differences from the previous value. These are essentially functions you're just passing data through, which is why it's called a pipeline. Then you take the absolute value of the resulting deltas and sum them up, grouped by device ID. I found this really interesting if you do a lot of data analysis work, because it's much easier to read this and reason about what's happening than looking at the raw SQL up here. Of course, this is not built into PostgreSQL and you can't use it everywhere, but if your job requires a lot of analysis work and these types of functions could assist you (I believe they said they've developed 60 or 70 different functions), you might want to check out this extension, because what's possible is quite interesting, particularly for time-series data, which is what these functions are designed for. They list the different categories of functions they've come up with: unary mathematical, binary mathematical, compound transforms, some lambda elements. So definitely an interesting blog post, and if you're interested in learning more, I definitely encourage you to check it out.
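As a rough illustration of the pipeline style described in the post (the -> operator and function names follow the post's description, but treat this as an approximate sketch rather than documentation, and the table and column names are placeholders):

```sql
-- Instead of a window function plus subquery, the pipeline reads left to right:
-- build a timevector, sort it by time, take deltas between consecutive values,
-- take absolute values, then sum them, one result per device.
SELECT device_id,
       timevector(ts, val) -> sort() -> delta() -> abs() -> sum() AS volatility
FROM measurements
GROUP BY device_id;
```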
[00:07:25] The next piece of content is "WITH HOLD cursors and transactions in PostgreSQL". This is from Cybertec-postgresql.com, and they're talking about transactions in PostgreSQL and cursors. Generally, a cursor operates within a transaction, and once that transaction is done, the cursor is no longer needed and is removed. A cursor basically lets you fetch one or several rows at a time from a query rather than getting all the data at once, so you can do different operations with it. In the example here, they have a PL/pgSQL code block that looks at the information_schema tables to delete certain tables from a schema, but you can also use cursors directly in SQL. What they're talking about is that, again, cursors normally live within a transaction, but there is an option called WITH HOLD that keeps the cursor in existence after the transaction completes. You need to be cautious and close it, though, because otherwise it will keep using resources until it is closed. I personally haven't used a lot of cursors in my work; I haven't found them necessary for my application. But if you want to learn more about cursors and how they're used, particularly with transactions, definitely check out this blog post.
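For reference, the basic syntax looks like this; the table name is just an example:

```sql
BEGIN;

-- WITH HOLD keeps the cursor alive after the transaction ends.
DECLARE big_scan CURSOR WITH HOLD FOR
    SELECT id, payload FROM events ORDER BY id;

COMMIT;                  -- the cursor survives the COMMIT because of WITH HOLD

FETCH 100 FROM big_scan; -- grab rows in batches outside the original transaction

CLOSE big_scan;          -- important: release the resources it holds
```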
[00:08:44] The next piece of content is "Analyzing scans in PostgreSQL". This is from Arctype.com, and they're talking about how Postgres query plans work, specifically the different scan types that are possible in Postgres. They take a particular query and look at the scans it produces. The first scan they cover is, of course, the sequential scan, where you're just scanning the table straight through: it's not using an index, it goes directly to the heap, directly to the table, and reads all the data. The next one is an index scan, where an index has been applied, and they show how an index scan works once you're looking for a particular value in that index. Then they cover the index-only scan, where the data being returned is entirely contained in the index itself. Then they talk about queries on textual data, where generally you'll see a bitmap heap scan along with a bitmap index scan, and they explain how those work in tandem to retrieve the data, particularly for text values. They also cover BitmapAnd and BitmapOr, which can be relevant when doing OR queries, for example. And lastly they cover parallel scans, which basically use multiple workers to scan over a table and return the data faster when a sequential scan is needed. So if you're interested in learning more about the scan types in Postgres query plans, you can definitely check out this blog post.
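If you want to see these scan nodes for yourself, EXPLAIN shows which one the planner chose; here's a minimal sketch with hypothetical table and column names:

```sql
-- No index on the filter column: expect a Seq Scan (or Parallel Seq Scan).
EXPLAIN ANALYZE
SELECT * FROM users WHERE email = 'a@example.com';

-- After adding an index, the same query typically switches to an
-- Index Scan, or a Bitmap Index Scan feeding a Bitmap Heap Scan.
CREATE INDEX users_email_idx ON users (email);

EXPLAIN ANALYZE
SELECT * FROM users WHERE email = 'a@example.com';

-- Selecting only indexed columns can produce an Index Only Scan.
EXPLAIN ANALYZE
SELECT email FROM users WHERE email = 'a@example.com';
```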
[00:10:12] The next piece of content is "Useful queries to analyze PostgreSQL lock trees", also known as lock queues. Basically, when Postgres takes a lock on, say, a row, and then another process comes through and needs to update that row, it has to wait. That creates a queue of processes that need to access the data but are blocked by a lock. You can also call these lock trees, because if yet another process comes along wanting to update that same row, it has to wait behind the second process until that one is able to do its update and is allowed through. You can sometimes run into these in your database system and need to diagnose them, and the pg_locks view is the place to look. But working with that data can be difficult given how quickly it changes as queries run through the system. He has drawn on a number of different resources here, all of them very good, to come up with what he thinks is the best query for analyzing your lock queues. It's quite a long query, and he talks about how it works and the important points of it, but he mainly wanted to share the query he uses to help trace lock queue issues. So if you're interested in that, you can check out this blog post. The next piece of content is "Projecting monthly revenue run rate in Postgres". This is from Crunchydata.com, and he's talking about developing a query to calculate a projected monthly revenue run rate as well as an annual run rate. He goes through each of the steps of this relatively complex query, how he built it up, and how it does the proper calculations for, say, a subscription business. So if you want to get some more query experience, you can definitely check out this blog post.
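Circling back to the lock queue topic for a moment: short of that full query, a simple built-in starting point is pg_blocking_pids(), which reports which backends are blocking a given process. A minimal sketch:

```sql
-- For every session that is currently blocked, show who is blocking it.
-- pg_blocking_pids() is built into Postgres 9.6+; no extension needed.
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       state,
       wait_event_type,
       query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;
```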
[00:12:02] The next piece of content is "Postgres text search: balancing query time and relevancy". This is from Sourcegraph.com, and he's talking not about full text search but about the trigram extension, pg_trgm, and how they use it. Because they work with source code, they use trigrams instead of full text search: full text search basically searches on words, whereas trigram search breaks text into overlapping sequences of three characters, and that's generally the type of search they use. He shows that when you do a trigram search, you can get a set of relevant results back relatively quickly. You can also write the query so that it ranks the results by relevancy. But the problem, he says, is performance: the query that doesn't ask for a ranking runs in about 80 milliseconds, whereas the one that returns a ranking runs in about 7 seconds, so quite a bit slower. And he says that unfortunately, due to how it's designed, there's no real way around it. But he does mention something noted down here, where someone proposed using a RUM index. This type of index is larger than, say, a GIN index, but it's designed to be faster specifically for ranking. They posted a link in that post showing an example where, yes, the GIN index is smaller, but the RUM index carries more information to help with ranking. So basically, if you use the pg_trgm extension and you want more ranking or relevancy, maybe you'd want to try a RUM index, or read this post to see the issues he was dealing with.
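Here's a small sketch of trigram search with ranking using the standard pg_trgm operators; the table and column names are hypothetical, and this is not necessarily Sourcegraph's exact query:

```sql
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- A GIN index over trigrams makes the % (similarity) operator fast.
CREATE INDEX files_path_trgm_idx ON files USING gin (path gin_trgm_ops);

-- Find paths similar to the search term and rank them by similarity.
SELECT path, similarity(path, 'server/main.go') AS rank
FROM files
WHERE path % 'server/main.go'
ORDER BY rank DESC
LIMIT 10;
```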
[00:13:42] The next piece of content is "Does QGIS work with Postgres 14?" This is from elephanttamer.net, and it's a super short blog post that basically says yes, it does. Apparently the Postgres 11 to 12 upgrade caused some issues, but with 14 he has not discovered any issues so far.
[00:14:03] The next piece of content is "Upgrade Amazon RDS and Amazon Aurora PostgreSQL version 9.6". This is from aws.amazon.com, and basically they're communicating that version 9.6 of Postgres on these hosted platforms will be classified as end of life at the beginning of next year, which is appropriate because 9.6 is no longer receiving releases from the Postgres community.
[00:14:29] The next piece of content: the PostgreSQL person of the week is Ivan Panchenko. So if you're interested in Ivan and his contributions to Postgres, definitely check out this blog post.
[00:14:40] And the last piece of content we had another episode of the Rubber Duck Dev show this past Wednesday where we discussed practicing continuous integration and development. So if you're interested in learning more about that topic, you can check out this piece of content. Our upcoming show will be about how many third party libraries you should use in your application code.
[00:15:02] That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to Scalingpostgres.com, where you can sign up to receive weekly notifications of each episode. Or you can subscribe via YouTube or iTunes. Thanks.