Episode Transcript
[00:00:00] Hello. In this episode of Scaling Postgres, we talk about aggregate filters, insert-only tables and vacuum, out-of-sync sequences, and functional indexes. I'm Kristen Jameson, and this is Scaling Postgres episode 180.
[00:00:23] All right, I hope you, your friends, family and coworkers continue to do well. Our first piece of content is "Fast, Flexible Summaries with Aggregate Filters and Windows". This is from Crunchydata.com, and they're talking about a situation where, say, you have some sales data that's broken out into three categories, A, B and C, it looks like, in different columns, and you want to know what percentage of sales is attributed to different values within those columns. Now, one of the first ways they did it was using a CTE, but that essentially has to access the table three times to find out what percentage of sales came from rows where there were B's and what percentage came from rows where the value of A is greater than 900. So that's not a very efficient way to do it, and it's quite long as well. Another way to do it, which I have tended to use in the past, is case statements. So you can say: where this column is B, then show the value, otherwise zero, and you just sum it up. That's a pretty simple way to think about it and it can give you the answer pretty efficiently. However, what they're talking about here is that you can also use a FILTER clause, and any aggregate function can take one. So they do the sum of the value, but then they add a filter and specify where C equals B, doing essentially the same thing the case statement does, or they filter where A is greater than 900. So this is a very efficient way to write it and a very clear way to read it, and they say it runs faster than the other solutions. I don't know how much faster it would be than the case statement, but they said it does run faster, so that's another benefit of this technique. And they say this applies to all aggregate functions; here they use the FILTER clause with a standard deviation aggregate function. So pretty much any aggregate function can take this type of filter. Using aggregate filters is a great way to make your SQL code more readable and potentially more performant, so you can check out this blog post if you want to learn more.
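To make the syntax concrete, here is a minimal sketch of an aggregate FILTER query in the spirit of what they describe; the table and column names are my own assumptions, not the ones from the post:

    -- Total sales plus filtered breakdowns in a single pass over the table.
    SELECT
        sum(value) AS total_sales,
        sum(value) FILTER (WHERE c = 'B')  AS sales_where_b,
        sum(value) FILTER (WHERE a > 900)  AS sales_where_a_over_900
    FROM sales;

Each FILTER clause restricts only the rows fed into that particular aggregate, which is why one scan of the table can replace the three separate queries from the CTE approach.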
[00:02:24] The next piece of content is "Insert-only tables and autovacuum issues prior to PostgreSQL 13". This is from Crunchydata.com, and they're talking about a scenario where a client had a query that was slowing down a lot on the primary and even more on a replica. Normally the query returned in less than one millisecond, but on the primary it got up to around ten milliseconds and on the replica it got up to 300 milliseconds, and it was resolved by simply doing a vacuum. So they wanted to investigate why that was the case. Now, because they were doing physical streaming replication, all the statistics should be the same; pretty much everything should be identical with regard to the data. What would be different is that, of course, the shared buffers would contain different things on the replica compared to the primary, because it's presumably receiving different queries. So that might account for some of the slower response times on the replica. But the reason for the slowness that a vacuum resolved is that this query was apparently using an index-only scan, and index-only scans rely on the visibility map being up to date. One of the jobs of vacuum is to update the visibility map so Postgres knows which pages are all-visible, and that's what makes index-only scans super efficient. So then the question came: okay, why wasn't vacuum running? And what was happening is that this was an append-only table; essentially, they never deleted or updated data in it. Now, when you have that, an auto-analyze does kick off to keep your statistics up to date, but autovacuum doesn't run because essentially there's no cleanup work to be done. In version 13, they made changes so that insert-only activity also triggers a vacuum. So that's a benefit of being on 13: it handles these insert-only use cases where vacuum wasn't running, and you can even adjust it with a parameter called autovacuum_vacuum_insert_threshold. So basically they mentioned the solutions to this type of problem: the first option is, of course, to upgrade to Postgres 13. The second option is to run vacuums manually on a periodic basis; you could use cron or any other scheduling tool to do that. The third option is to alter the table storage parameters. Now, they mentioned modifying autovacuum_freeze_max_age; I would think you could also alter the scale factor and the vacuum threshold for the table to get it to kick off and update the visibility map, but that's the option they mentioned here. So basically, if you run into this type of issue with append-only tables, you can do the upgrade, run vacuum more frequently, or alter some individual table parameters. If you want to learn more, you can check out this blog post.
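As a rough sketch of what the non-upgrade options might look like (the table name events and the threshold value are hypothetical, just for illustration):

    -- Run a manual vacuum periodically (e.g. from cron) so the visibility map stays current.
    VACUUM events;

    -- Or, on PostgreSQL 13 and later, tune the per-table insert threshold for autovacuum.
    ALTER TABLE events SET (autovacuum_vacuum_insert_threshold = 100000);

The right threshold depends on your insert volume; the point is just that insert-only activity can now trigger autovacuum on its own.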
[00:05:19] The next piece of content is "Fixing out-of-sync sequences in PostgreSQL". This is from Cybertec-postgresql.com, and they're talking about cases where the IDs in a table are out of sync with its sequence. So when you create an auto-incrementing column by creating a table with a serial column (or a bigserial), as they do here, it creates the table and makes the ID auto-incrementing because it creates a sequence object for it as well. When you insert data into the table, it automatically puts the correct value in the ID based on what the sequence says, and if you insert a second row, just specifying the payload column, it automatically fills in ID two. But you can also insert data manually and specify the ID yourself, setting it to whatever value you want. Now, at this point, you are out of sync with the sequence. The sequence thinks the next value should be three, but you've already inserted a three. So when you go to insert the next row with just the payload column, it actually tells you, hey, there's a duplicate key violation because key three already exists. Now interestingly, I didn't see this in the blog post, but essentially that's just a single failure, because at this point the sequence believes the next value should be four. So you could easily retry the insert and it should succeed, but it would be inserted as ID four. So in terms of getting out of sync being a problem, the sequence kind of knows how to handle it itself: it tries inserting and it will fail, but it will go ahead and try the next number the next time. And of course, the way to avoid this is to not specify the ID when you're doing an insert; just specify the other columns. But if for some reason you want to fix these, they actually created a new extension called pg_sequence_fixer. What it basically does is look at your tables, find the max ID in each, and set the sequence to that max ID, and you can even add a safety margin, where they actually set it 1,000 higher than the max value in the table. So I've never had a really big issue with sequences being out of sync because they're kind of self-correcting; they'll eventually find the next value to use, but it would produce some failures along the way. If you want a way to resolve it using a tool, you can check out this blog post and their pg_sequence_fixer extension.
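If you wanted to resync a single sequence by hand rather than install the extension, the usual pattern looks something like this; the table and sequence names here are hypothetical:

    -- Point the sequence at the current max ID so the next nextval() returns max + 1.
    SELECT setval('t_demo_id_seq', (SELECT max(id) FROM t_demo));

The extension essentially automates this across your sequences, optionally adding that safety margin on top.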
The next piece of content is "Index Only Scan on Functional Indexes". This is from dev.to in the AWS Heroes section, and they're talking about a scenario where you want an index-only scan to work with a functional index. A functional index is what you need when you query the data through a function; say you want to look up the username using the upper function. A normal index on just the username won't be used because the query goes through a function; you literally have to create the index with that function in it for the index to be used. But what if you just want the username back? You'd like an index-only scan to retrieve just the username, but Postgres actually won't do one; as you can see from the query here, it doesn't use an index-only scan, because Postgres looks for the username column in the index and it isn't there, only the function result is. What you can do is create a covering index, basically including the username column as a payload. So here they created the index on the function and then included the username as a payload, and when you run that same query just selecting the username, you will get an index-only scan. So that's a convenient way to query just this data more efficiently. Now, he said what you can also do in Postgres 12, of course, is use the new generated columns feature. You could create a separate column of uppercased usernames that is kept up to date by the system itself, essentially like a trigger mechanism, and then you can just use a basic index; you don't have to create a functional index to avoid this type of problem. So this was a pretty brief post on how you can get more index-only scans when you want to query based on a function.
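Here is a minimal sketch of both approaches described; the users table and column names are assumptions for illustration, not taken from the post:

    -- Covering index: upper(username) is the key, username rides along as payload.
    CREATE INDEX users_upper_username_idx
        ON users (upper(username)) INCLUDE (username);

    -- This can now be answered with an index-only scan (once vacuum has set the visibility map).
    SELECT username FROM users WHERE upper(username) = 'ALICE';

    -- Alternative on Postgres 12+: a stored generated column with a plain index.
    ALTER TABLE users
        ADD COLUMN username_upper text GENERATED ALWAYS AS (upper(username)) STORED;
    CREATE INDEX users_username_upper_idx ON users (username_upper);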
[00:09:34] The next piece of content is "PostgreSQL Database Security: OS Authentication". This is from Percona.com, and they're talking about operating-system-based authentication. In a previous post they covered internal authentication, where Postgres manages the passwords itself and you can create and change them there. This one talks about credentials existing within the operating system and how the database can use them to do authentication. The first option available is ident, which uses an ident server running on the machine on a particular TCP/IP port to do that type of authentication. The second option is PAM, the pluggable authentication modules, which you can set up to authenticate against the operating system. And the third option is peer, which is the one I tend to use most frequently; it basically uses a Unix socket and maps the operating system user to a database user to let it log in and run queries. So if you want to learn more about operating-system-based security for Postgres, definitely check out this blog post.
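As a quick illustration of what peer authentication looks like in pg_hba.conf (this is a generic example, not a line taken from the post):

    # Local Unix-socket connections: use the OS user name reported by the kernel
    # and require it to match the database user being requested.
    local   all   all   peer

With this in place, an OS user such as postgres can run psql over the socket without a password, as long as a matching database role exists.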
[00:10:43] The next piece of content is "How to get advisory lock in shell?". This is from Depesz.com, and he's talking about a scenario where he has a lot of maintenance jobs running against a database server and he doesn't want them to overlap. Maybe he's doing a repack operation, particular dumps of the database, or other data migrations, and he doesn't want them to essentially step on each other's toes, so he wants some form of locking. His plan was to use the advisory lock feature. This is basically an application-level lock: you specify a particular number, say this is this particular lock, and nothing else should be able to take it while the job is running. Now, for his particular jobs he wanted to hold the lock open for a period of time, so he developed a simple script that takes the lock and keeps it open, using the watch functionality, to set up this locking within a shell script. If you want to learn more about how he implemented this, you can definitely check out this blog post.
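As a rough sketch of the advisory lock functions involved (the lock key 12345 here is arbitrary; the jobs just have to agree on it):

    -- Block until we hold application-level lock 12345, then do the maintenance work.
    SELECT pg_advisory_lock(12345);
    -- ... run the job on this same connection; the lock is held until released or disconnect ...
    SELECT pg_advisory_unlock(12345);

    -- Or skip instead of waiting when another job already holds the lock:
    SELECT pg_try_advisory_lock(12345);  -- returns true if acquired, false otherwise

The important detail for a shell script is that the lock belongs to the session, so the connection has to stay open for as long as the job runs, which is what his script arranges.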
[00:11:53] The next piece of content is "Announcing pg_dbms_job in PostgreSQL for Oracle DBMS_JOB compatibility". This is from Migops.com, and according to this post, MigOps does a lot of conversions from Oracle to Postgres and likes to use open source tools to do it. They've used various scheduling tools such as pgAgent, pg_cron, and pg_timetable, but Oracle has its own DBMS_JOB scheduling package, and there were some features in it that none of these other tools could replicate. So they created their own extension that mimics a lot of those features, which allows an Oracle to Postgres migration that uses DBMS_JOB to simply use this extension; it's called pg_dbms_job. If you want to learn more about its capabilities and how it could potentially help you with an Oracle migration, you may want to check out this post.
[00:12:51] The next piece of content is "Tuning the Pgpool-II Load Balancer". This is from Highgo.ca, and they're talking about using Pgpool-II not only as a connection pooler but also as a load balancer, and how you can choose session-based load balancing, where once a connection is made and a session is established from a client, it always goes to a particular standby. So, for example, the green client always goes to standby one and the red client always goes to standby two, based on when that session was created. Of course, I should mention that this is for read load balancing; the writes always go to the primary. Alternatively, you can choose statement-level load balancing, where SELECT statements are distributed across the existing standbys. This post talks about these two forms of load balancing and which makes the most sense based on your usage, so if you're interested in that, you can check out this blog post.
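As a hedged sketch of the pgpool.conf settings involved (check the Pgpool-II documentation for your version before relying on these):

    # Enable read load balancing of SELECTs across standbys.
    load_balance_mode = on
    # Distribute on a per-statement basis instead of pinning a session to one standby.
    statement_level_load_balance = on

Leaving statement_level_load_balance off gives you the session-based behavior described above.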
[00:13:50] The next piece of content is "Advanced PostgreSQL Features: A Guide". This is from Arctype.com, and it talks about a number of different advanced features of Postgres. It's a bit of a simpler post, giving a very basic introduction to a variety of features. They cover table inheritance, which Postgres offers; non-atomic columns, by which they basically mean arrays or even JSONB fields where you can store more than one value within a column; window functions, which of course a lot of relational database systems support; its support for JSON data; its full text search capabilities; its views capabilities, of course also available in many other relational databases; and lastly, geospatial data, which you can handle with the PostGIS extension. So if you want to learn more about these types of advanced features, you can check out this blog post.
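As a tiny illustration of the non-atomic column idea (the table and column names here are made up):

    -- An array column and a JSONB column both store multiple values in a single field.
    CREATE TABLE articles (
        id      bigserial PRIMARY KEY,
        tags    text[],
        details jsonb
    );

    -- Find rows whose tags array contains 'postgres'.
    SELECT id FROM articles WHERE tags @> ARRAY['postgres'];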
[00:14:51] The next piece of content is a YouTube video on TimescaleDB compression from the TimescaleDB YouTube channel, specifically talking about compression as it exists within TimescaleDB. TimescaleDB does compression as part of its column store capabilities, and this presentation, over an hour in length, goes over how it works, how it's implemented, and how you could use it for different use cases, with some examples. So if you're interested in potentially using TimescaleDB, I highly encourage you to check out this presentation.
[00:15:26] The next piece of content: the PostgreSQL person of the week is Daniele Varrazzo. So if you're interested in learning more about Daniele and his contributions to Postgres, definitely check out this blog post. And the last piece of content: we did have another episode of the Rubber Duck Dev Show this past Wednesday, where we talked about application infrastructure performance. So if you're looking for more developer-related content in a long-form format, maybe you would like to check out our next presentation on Wednesday night at 08:00 p.m. Eastern Standard Time, where we will be talking about modern web application front ends.
[00:16:04] That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to Scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.