Episode Transcript
[00:00:00] In this episode of Scaling Postgres, we talk about a PgBouncer release, query parameter data types, lateral joins and security invoker views. I'm Creston Jameson and this is Scaling Postgres episode 280.
[00:00:23] All right, I hope you, your friends, family and coworkers continue to do well. We don't have a lot of content this week, so I'm actually doing something slightly different where I'm introducing basically a video blog post covering lateral joins. But there's still a handful of posts this week. So our first piece of content is PgBouncer 1.17.0 is released. This is from Pgbouncer.org, and it has some new features as well as fixes. So now a database definition can list multiple hosts, and PgBouncer will contact those hosts in a round-robin fashion to handle statements. So that's interesting. They also increased the maximum length of passwords, as well as fixed support for OpenSSL 3. And you can definitely click the link here to review the full change log.
[00:01:15] The next piece of content is Query parameter data types and performance. This is from Cybertec-postgresql.com, and they're talking about parameters that you can set up for your SQL statements. Now, normally you do this within a language, and basically you parameterize a section of the WHERE clause, typically. So if you're looking at account one, you pass in ID one; if you're looking at account two, you pass in ID two. Now, the benefit of this is that the statement can be planned in advance, so it just has to be executed when the parameters come in. That can result in a speed boost, and it also helps for security reasons, because separating your parameters from the actual query statement helps avoid SQL injection. Now, for the frontend/backend protocol they're talking about here, the client actually sends a prepare message, a bind message and an execute message. But you can easily do this in two statements within psql: you do a PREPARE, giving it a name, the data types you expect to pass in, and the parameterized query with $1, $2, $3, et cetera, for the parameters. Then you just EXECUTE that statement name you declared, pass in the value, and it will run your statement. Here's an implementation using Java, but basically you prepare the statement including parameters, you set what each value should be, and then you execute it. Now, why this can become important for performance is data types. So, for example, you'll notice that the data type is specified here as a long in Java. Well, if that mismatches what the index is expecting, you're not going to get an index hit. And he shows this example here where a bigint is different from a numeric type; they're not considered to be the same. So in this example, the column is a bigint, and you can pass in an integer and it can convert the integer to a bigint just fine, and it's going to do the index scan.
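Just to make that concrete, here's a minimal sketch of the psql side of things, using a hypothetical accounts table with a bigint id column (the table and names are mine for illustration, not from the post):

```sql
-- Hypothetical table, just for illustration
CREATE TABLE accounts (id bigint PRIMARY KEY, balance numeric);

-- Declare the parameter type and the parameterized query up front
PREPARE get_account (bigint) AS
    SELECT * FROM accounts WHERE id = $1;

-- Run the prepared statement, passing in the value for $1
EXECUTE get_account (1);
```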
[00:03:21] There is also an unknown type, so you can set a parameter as unknown and the system will try to infer what the type is. Using that method, it's able to identify that, yes, this number is some type of integer, so it does do the index scan. But when you specify it to be a numeric, and in Java they're using a BigDecimal type, it interprets those as not matching and actually converts both sides to numeric, which there is no index for. So you get a sequential scan instead of an index scan.
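Here's a hedged sketch of that data type effect in plain SQL, reusing the hypothetical accounts table from above; the exact plans will depend on your data and statistics:

```sql
-- Parameter typed to match the bigint column: the index can be used
PREPARE by_id_bigint (bigint) AS
    SELECT * FROM accounts WHERE id = $1;
EXPLAIN EXECUTE by_id_bigint (42);   -- typically an index scan on the primary key

-- Parameter typed as numeric: bigint = numeric casts the column side to numeric,
-- so the bigint index can't be used and you typically get a sequential scan
PREPARE by_id_numeric (numeric) AS
    SELECT * FROM accounts WHERE id = $1;
EXPLAIN EXECUTE by_id_numeric (42);
```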
[00:03:54] So this can definitely result in poor performance. Their recommendation is to either set the data type appropriately or leave it unknown and let the system infer it; it usually does a pretty good job of that. But if you want to learn more, you can definitely check out this blog post.
[00:04:11] The next piece of content is something that I've actually done: using a lateral join to improve performance. So you can think of this as a video blog post. In the system I was working with, there is a set of gifts, and there can be payments applied to those gifts; in other words, the gifts are paid off over multiple payments. So there's a gifts table with an amount and a month, kept as a simple integer to keep this example simple. And then there's a payments table that has the gift ID as a foreign key, along with the amount of the payment and the month that the payment came in.
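Roughly speaking, the two tables look something like this; this is a simplified sketch with column names I picked for the example, not the exact production schema:

```sql
CREATE TABLE gifts (
    id     bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    amount integer NOT NULL,
    month  integer NOT NULL            -- month kept as a simple integer for the example
);

CREATE TABLE payments (
    id      bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    gift_id bigint  NOT NULL REFERENCES gifts (id),   -- foreign key back to the gift
    amount  integer NOT NULL,
    month   integer NOT NULL
);
```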
[00:04:44] So I just inserted five rows into the gifts table: a $1,000 gift, a $2,000 gift, $3,000, $4,000 and $5,000 gifts, all in the same month. And then I inserted a number of payments, five payments per gift. So the $1,000 gift had $500 payments, the $2,000 gift had $5200 payments, et cetera, and they were made over different months. Now, the report that needs to be generated looks at all gifts by month, adding up the total of those gifts as well as the payments that have been made against them. So a simple implementation where you just take the gifts, do a left outer join to the payments, then group by the gift month and sum the gift amount actually won't work. That's going to double count your gift amount, because the left outer join to the payments duplicates those gift rows. And you can't do something like a simple DISTINCT on the amount; it would work in this case, but not when you're talking about thousands and thousands of gifts in a system with similar amounts. So the solution I first came up with, which gave the correct answer but ultimately ended up being slower, is doing a subquery in place of the payments table. What it does is query the payments table, group the rows being retrieved by the gift ID and sum the amount. So basically you get one row from the payments table that can be matched to a row in the gifts table. When you do it this way, you're not going to duplicate the gift rows and you're going to get the correct amount. But this started getting slow. Now, the main reason it started getting slow in the actual production implementation is that there were hundreds of thousands of payments. And when looking at that payments table, even if you're selecting by a particular account, you can't really narrow it by a date range, because payments come in at any time and you want all payments for the gifts. So it started getting really slow to calculate this value and then join it to the gifts table. But the solution I came up with is using a lateral join. As you can tell, the query is very similar, but I'm doing a lateral join to something close to the same subquery, so it's not significantly different. What this does is essentially a for loop for every gift row that's being selected. So it goes in and adds up each payment row for each gift row analyzed. If we look at the subquery plan, it basically runs a query on the payments table, runs a query on the gifts table and then does a hash join between them on the hash condition to match everything up and give you the results. Whereas in the lateral plan, it's querying the gifts table, but then it queries the payments table for each gift row. So that's why in this plan you can see it's doing loops of five. If I had ten gifts, it would be doing loops of ten; if I had 100 gifts, it would be doing loops of 100 to pull all the payments for that particular gift ID. Now, ultimately this ended up being faster. This lateral version of the query, for my implementation, actually gave me ten times better performance. That's not always going to be the case, or sometimes it may be even more significant; it depends on how your data is structured, how many rows you're having to loop through and how narrowly you can define the subquery. But this lateral join was an alternative that worked for me. Let me know if you have a different idea of how to make this performance even better.
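For reference, here's an approximation of the two query shapes I described, written against the sketch tables from earlier (the real queries had more columns and filtering):

```sql
-- Subquery version: aggregate all payments per gift first, then join and group
SELECT g.month,
       SUM(g.amount) AS gift_total,
       SUM(p.paid)   AS payment_total
FROM gifts g
LEFT JOIN (
    SELECT gift_id, SUM(amount) AS paid
    FROM payments
    GROUP BY gift_id
) p ON p.gift_id = g.id
GROUP BY g.month;

-- Lateral version: run the payment aggregation once per selected gift row
SELECT g.month,
       SUM(g.amount) AS gift_total,
       SUM(p.paid)   AS payment_total
FROM gifts g
LEFT JOIN LATERAL (
    SELECT SUM(amount) AS paid
    FROM payments
    WHERE gift_id = g.id
) p ON true
GROUP BY g.month;
```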
[00:08:27] The next piece of content is Waiting for PostgreSQL 15 - Add support for security invoker views. So this is a patch, and in terms of functions, they talk about security definer functions, where you can basically determine whether security is going to be checked as the person who's running the function or the person who created the function. But this actually applies to views, and in version 15 it looks like they added a feature where you can choose to run the view as the person who created the view, which is how it works today, or as the person who actually invokes the view. So for example, with the standard or old-style view, the data is accessed as the person who created the view. You don't need to give table permissions to another user, and they're able to query the view even though they're not the person who created it. But you can now set security_invoker = true for the view, and then it's going to check the table permissions of the person who selects from that view. And in this case it says permission denied for table source_data. So this is a pretty interesting addition, and you can check out this blog post if you want to learn more.
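Here's a small sketch of what that looks like, assuming a hypothetical source_data table like the one mentioned in the post:

```sql
CREATE TABLE source_data (id integer, payload text);

-- Traditional view: table permissions are checked as the view's owner
CREATE VIEW v_owner AS SELECT * FROM source_data;

-- New in PostgreSQL 15: permissions are checked as the role selecting from the view
CREATE VIEW v_invoker WITH (security_invoker = true) AS
    SELECT * FROM source_data;

-- A role with SELECT granted on the views but not on the table can read v_owner,
-- while selecting from v_invoker fails with "permission denied for table source_data"
```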
[00:09:38] The next piece of content is 5mins of Postgres episode 10: max_wal_size, Postgres full page writes and UUID vs BigInt primary keys. This is from Pganalyze.com, and they covered the post that we covered last week on Scaling Postgres talking about max_wal_size. They go into a little bit more depth about what happens after the checkpointing process, how many full page writes can be made to the WAL and how this can impact performance. And it can particularly impact performance if you're using UUIDs. The reason is that after a checkpoint, whenever any byte of data on a page is changed, that whole page is written to the WAL if you have full page writes enabled. Now, the issue with UUIDs is that they're inserted randomly across the pages, so you're going to get a ton more WAL writing happening with a UUID, whereas an incremental integer is written to the same page incrementally and then the next page, et cetera. So you're going to get less WAL writing volume with bigints. And the post he was referencing showed a 20-times difference in data size between bigserial incremental IDs and a random UUID. But if you want to learn more about that, definitely check out this episode.
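If you want to picture the two table designs being compared, here's a hedged sketch; gen_random_uuid() is built in from Postgres 13 onward, and the 20x figure comes from the post they reference, not from this sketch:

```sql
-- Incremental bigint key: new rows land on the same "rightmost" index and heap pages,
-- so fewer distinct pages are dirtied (fewer full page writes) after a checkpoint
CREATE TABLE events_bigserial (
    id      bigserial PRIMARY KEY,
    payload text
);

-- Random UUID key: inserts scatter across the index,
-- dirtying many more pages and generating more WAL with full_page_writes on
CREATE TABLE events_uuid (
    id      uuid PRIMARY KEY DEFAULT gen_random_uuid(),
    payload text
);
```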
[00:10:58] The next piece of content is Installing pgpool-II on Debian/Ubuntu. This is from B-peng.blogspot.com, and it appears that the official Postgres repository for Debian/Ubuntu, the apt repository, now has pgpool-II. So this blog post runs through how you can get it installed using this apt repository.
[00:11:23] The next piece of content: the PostgreSQL person of the week is Michael Christofides. If you're interested in learning more about Michael and his contributions to Postgres, definitely check out this blog post. And for the last piece of content, we had another episode of the Rubber Duck Dev Show this past Wednesday. This episode was on how to best organize the tests for your applications. So if you're interested in that type of content, we definitely welcome you to check out our show.
[00:11:50] That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to Scalingpostgres.com, where you can sign up to receive weekly notifications of each episode. Or you can subscribe via YouTube or iTunes. Thanks.