Episode Transcript
[00:00:00] In this episode of Scaling Postgres we talk about time series queries arrays, ranges in C triggers. I'm Kristen Jameson. And this is scaling postgres episode 77 one.
[00:00:20] Alright, I hope you're having a great week. Our first piece of content is post trivial time series examples and this is from CyberTech Postgresql.com and they're talking about doing time series queries in PostgreSQL. Now the first thing they talk about is loading time series data the easy way and they're just showing an example where you can actually use copy from a program and just pull data from a website and load it into PostgreSQL. And the first very simple example they mentioned here is Lag. So this is a window function. So basically you can pull what the last value is for a given column and you're choosing the column that preceded it ordered by the year. So for example, 1966 looks at the previous column by year which is here and places it as the function specifies here. So it gives you the last year's set of data. And then with that you can actually use a function where you just take the primary column production minus this lag function and it basically gives you the difference from the previous year to the current year and you could make a percentage if you want to, et cetera. So it's just basically a very simple way to deal with time series data and you can even do computation such as correlations because there is a core function. So you can take two different values and see how they correlate. So how does production correlate to consumption for example, and this is by country, so you can see how well they correlate production versus consumption. So a very brief post, but a couple of very simple window functions. What's interesting about this week is there's a lot of content for developers. So moving on to the next one is also by CyberTech hyphen postgresql.com is time series exclude ties, current row and group. So the first thing they did here is they prepared a set of data using generate series and they have an ordinality column that's basically an incrementing integer and then a day column that increases by day and then defining by week in integer. Now the first example they looked at is let's look at a sliding windows which is moving over time series in SQL. Now they use an array aggregate function in order to show you what makes up the window. So you do a window functions over order by the day and the rows between one preceding and the one following. So as a good example, looking at the second row here, it's going to show the current row, the one preceding and the one following. So again, the array aggregate column just shows you where that window is and you can see as it moves to three, it has three in the preceding and the following row, et cetera. And you can include more if you want, but he just did one preceding, one following. This is not useful in and of itself. But now you do calculations, you can do averages, sums mins maxes over that window series. The next area talks about excluding the current row. So it does the same window range, but it excludes the current row. So here on the second row, you could see it includes the previous and the next row, but not the current one, et cetera. And it continues on. And you'll notice at the beginning and the end, it only has one value. And similarly up here it only has two values because there is no previous row to row one and there is no row after row 14. So the window will only have two values in it. And then there's excluding ties. It says here, quote, exclude ties, excludes any peers of the current row from the frame, but not the current row itself. The next area is going into excluding entire groups from a windowing function. So we'll see here, this is a window that covers essentially five rows, two preceding, two following and the current one, and excludes whatever this week value is in the group. So of course, since there's only ones in the window for the first two rows, nothing shows because that group is excluded. However, it does show the twos because they start showing up once the window reaches the third column here. So essentially this will be included. And then once you get to this row, these two twos will be excluded, but these two twos will be included, but that will be excluded. And then when it switches to a two, it will then start showing these two ones that are still in the window and the last area they cover is distinct. Now, what they say here is that in PostgreSQL, the current version, there's no way to use distinct as part of a windowing function and that PostgreSQL will error out. And he shows the error here. Distinct is not implemented for window functions. And he says what you have to do is filter the duplicates on a higher level so you can use a subselect, enroll the array, remove the duplicates and assemble the array again. So he gives an example of this implementation if you wanted to do something like a window function using distinct. So these are a lot of great queries for getting a better understanding of window functions for working with time series data. So if you do that, I definitely encourage you to check these two pieces of content out. The next piece of content is how to get the first or last value in a group using group by in SQL. This is from Hackibinita.com. So again, this is another piece of content that I feel is a great example for developers. Now, they have a table here where they have data stored in essentially a JSON B field and it looks like a time log of when certain events happen, so certain amounts of money are deposited, withdrawn, and an available credit is set at a periodic time. Now, they had to do a calculation to get what the total balance was for the account, as well as what the last credit row was. Now, I find this layout a bit unorthodox, but this was the data that he had. And to get the balance, you just pull out from that JSON B field, the delta balances and do a sum and by account you can get what the current balance is. But the problem was finding this last credit entry. Now, he said in Oracle that you can use a last function to be able to do it in the manner described here. But when he tried to do this in PostgreSQL, he couldn't really do it with a group by. So you could remove the group by and do it with a window function this way to get it, but it's still not exactly what you need. And he says, quote PostgreSQL doesn't have the built in function to obtain the first or last value in a group using group by. Now he did some examples using SQL, but then he came up on this, what he's calling an array trick. So using this function he developed an array and found the max value of the array and filtered it only where the type is credit set, and then grabbed the second element of the array, which is the credit. And that got the answer that he was looking for. And it was a very efficient query plan. So this was a pretty interesting trick and perhaps you could use it in some of your SQL queries or similar to the previous post where they were manipulating arrays to be able to do distinct counts. Perhaps you could use rays in different ways to do queries that you need to do.
[00:07:48] The next post is range types in PostgreSQL and Gist indexes. It's from Alibabloud.com and they're talking about someone was using MySQL to be able to search for whether a particular value existed in a range and they decided, this individual decided to give it a try using PostgreSQL and using their different range types. So the post goes into describing the different range types that can be set up. So here's all the different types of range types and different functions that can be used against them. And it goes over a lot of different examples of building certain range types and eventually discusses using a gist index. And in this case he's using a Btree gist to be able to get the query performance that he was looking for. And in the process he created an exclude constraint, inserted a series of test data, performed his query, and then looked at the execution plan, which looks like a pretty simple plan, so I assume he's happy with the performance. It didn't give too much of a comment with reference to it or how performance compared to the MySQL version. But if you have a use case that could use range types, this would be a good post potentially to look at and review how you might be able to use them in your implementation.
[00:09:05] The next post is a faster lightweight trigger function in C for PostgreSQL and this is from Procona.com and they had done some previous examples of using C for triggers to give more performance. And this gives an example of doing a simple audit with a C trigger and see what kind of performance that could give. Because from what they state here, a lot of people use triggers for auditing purposes like to insert into a table when other parts of a table have changed, or another example is just updating as they say, an insert timestamp or an update timestamp using triggers. So they're wondering how fast they could make it. So basically they developed a C function that is shown here.
[00:09:48] Now, one thing that they did mention is that they used at number to specify which column is going to be updated. So not the actual name of the column, but this was more performant. Like choosing a particular column of the table is more performant than using this SPI underscore F number function to get the column Adventures bytes name. So that's one caveat they mentioned here. And then once you have this C function developed, go ahead and make it into an extension that you install in your database and then create the function that the trigger will use using the C language. Create your trigger before insert or update for each row, execute the function you defined. And then for benchmarking purposes, they created a Plpg SQL function. So this is what you would normally do in this case. So they wanted to see what the performance difference was and over multiple runs you could see that without the trigger. Here's the baseline adding the C trigger only resulted in just about a 12% performance hit, whereas the Plpg SQL trigger resulted in 80% performance hit. So as you're looking potentially scale your database, maybe using more C functions as extensions could help increase the performance of your database. Now again, the caveats mentioned here is they did hard code the attribute number or essentially the column number.
[00:11:12] When they dropped down to using the actual name and used this SPIF number function, the performance wasn't as good. It was about between the performance of these just so, just something to keep in mind. But if you're looking to eke out every bit of performance, perhaps this is a path you may want to explore.
[00:11:29] And the last post is waiting for PostGIS Three st tile envelope ZXY. This is from Crunchydata.com and again with the upcoming postgres Three, they are mentioning all the different features and this is yet another one. So if you are interested in PostGIS three, definitely a blog post to check out that does it. For this episode of Scaling Postgres, you can get links to all the content mentioned in the show notes. Be sure to head over to Scalingposgres.com, where you can sign up to receive weekly notifications of each episode. Or you can subscribe by YouTube rightunes. Thanks.