Episode Transcript
[00:00:00] In this episode of Scaling Postgres, we talk about recursive queries, lateral joins, pagination, and JSONB arrays. I'm Creston Jamison, and this is Scaling Postgres episode 127.
[00:00:21] Alright, I hope you, your family, friends and coworkers continue to do well. Our first piece of content is that new PostgreSQL versions have been released: 12.4, 11.9, 10.14, 9.6.19, and 9.5.23, as well as Beta 3 of version 13. Now, there are two security issues. One is an uncontrolled search path element in logical replication, and the second is an uncontrolled search path element in CREATE EXTENSION. So be sure to check those out to see if you need to upgrade sooner rather than later, depending on how you use Postgres. And then at the bottom here, they mention all of the different bug fixes and improvements that were done. So definitely review and see when you want to upgrade your version of Postgres.
[00:01:07] The next piece of content is understanding recursive queries in PostgreSQL. So this is looking at recursive queries, and the way you do them in Postgres is via CTEs. First they review common table expressions and basically convey a CTE as a view that only exists within the scope of a single query, because you can usually write a CTE as a subquery instead. Then they cover the syntax of a recursive query: you use WITH RECURSIVE, the name of your CTE, and then its definition. The definition is two queries combined with a UNION or a UNION ALL. The first one is the non-recursive branch and the second one is the recursive branch that references the CTE itself, so therein lies the recursion. And then lastly you SELECT * FROM the CTE you've defined. They also go a little over how recursive queries are processed internally using a working table, so it really works through things procedurally. But then it goes into an example. And of course, a lot of recursive examples use people who are managing people in a hierarchy, so they're using that here. You have this manager or MGR column that defines who the manager of each person is. What they want to find out is all the subordinates of person 7566. So the non-recursive branch selects the employee where the employee number is 7566, and the recursive branch selects from the table, joining it to the CTE where the manager ID equals the employee ID. And then they show the complete query building out the CTE, and the results are what you'd expect. Next they show how you can add computed columns during the recursion as long as you use UNION ALL. Here, in each pass of the recursion, they add one to a level column they're defining, so you can see a first-level subordinate and a second-level subordinate. Then they show defining the path, for example being able to output that Jones supervises Ford, who supervises Smith. Basically, in the recursive branch they append the next employee name onto a path column they've defined. And then lastly, they cover how this compares to doing it in Oracle, and they do mention that Oracle does not conform to the SQL standard but has its own way of doing this. They also cover generating a Fibonacci sequence using recursion. So if you want a refresher on recursive queries, definitely check out this post from cybertec-postgresql.com.
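To make the shape of this concrete, here is a minimal sketch of the kind of query described, using the classic EMP-style table with empno, ename, and mgr columns; the exact column names may differ from the post's query.

```sql
WITH RECURSIVE subordinates AS (
    -- non-recursive branch: start with the manager in question
    SELECT empno, ename, 1 AS level, ename::text AS path
    FROM emp
    WHERE empno = 7566
  UNION ALL
    -- recursive branch: pull in everyone managed by a row already found,
    -- incrementing the level and appending to the path as we go
    SELECT e.empno, e.ename, s.level + 1, s.path || ' -> ' || e.ename
    FROM emp e
    JOIN subordinates s ON e.mgr = s.empno
)
SELECT * FROM subordinates;
```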
[00:03:38] The next post is iterators in PostgreSQL with lateral join. A lateral join, as they say here, means the right-hand part of the join is applied to every record in the left-hand part of the join. They use the example of generate_series, where you're generating a series of numbers from one to four, so in the left column you'll see 1, 2, 3, 4. But then they do a CROSS JOIN LATERAL to another generate_series that goes from one through whatever the value is in the left-hand column. So when the left column is one, it generates a series from one to one, giving you just one row. When the left column is two, it generates two rows, because the lateral join executes the right-hand series once for that row, going from one to two. When you have three in the left-hand column, it generates three rows; when you have four, it generates four rows, and so on. Now, they use functions as examples here, including one showing movie recommendations based on a user ID and returning a name and a rank, but you could also do this as a subquery. They also show picking the five most popular streams by zip code. So this was a relatively brief post giving a good explanation of lateral joins, and they even reference an online class that Crunchy Data has put together to help understand it better. So if you want to learn more about lateral joins in PostgreSQL, check out this blog post from Crunchydata.com.
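As a quick sketch, this mirrors the generate_series example described above:

```sql
-- Each row from the left series drives one execution of the right series,
-- so a left value of n produces n output rows.
SELECT a.n AS left_value, b.m AS right_value
FROM generate_series(1, 4) AS a(n)
CROSS JOIN LATERAL generate_series(1, a.n) AS b(m);
```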
[00:05:18] The next post is keyset pagination, so this is basically how to do keyset pagination: paging through a data set. Now, there was a previous post, and the way he structures his blog, it appears on the same page as this one. That earlier post is called Pagination Tips and covers a couple of different ways to do it, but I would probably only choose keyset pagination, which he goes into more detail on here for performance reasons, because using offset with a limit performs really poorly as you get to larger and larger offsets. How he does it is basically remembering the last product ID that was shown and then selecting records greater than that product ID with a limit. Now, the reason he uses a limit of eleven when he wants to display ten is to know whether there's a next page, so the application knows whether to show a next-page link. Then he talks about a method using a CTE to be able to page backwards, potentially. But there's usually a disconnect between good performance and doing things strictly by pages, because it gets very complicated: what happens if a record you're looking at is deleted, or a new record is inserted in its place? How would you handle that in the paging? Do you need to cache values in the application? So there's a lot to consider when doing pagination. Now, I'm also going to reference two other blog posts that cover this in a lot of detail and have their own recommendations. One is a post from around this time last year called Pagination with Relative Cursors, and essentially it's keyset pagination. Basically, they're saying you want to avoid doing limit with offset when you have large offsets because the performance is just horrible; they show that once you have an offset of 100,000 it takes over 2 seconds, and it continuously gets worse the larger the offsets you use. Now, they're calling it relative cursor pagination, but it's the same technique: you're using an ID and a limit, so you're selecting the products above a particular ID. They mention the disadvantage that it's hard to go to a specific page. You can go forward and backward, but you aren't really able to jump to page three or page six, for example, because you're always moving forward or backward. They also show a way to sort and skip records by something other than the ID, and they have a technique for how to do that. Now, there's even a third post related to this that I think may be five-plus years old: Paging Through Results from use-the-index-luke.com, and he covers different databases. For PostgreSQL he again shows that the least efficient approach uses an offset and a limit. Now, he uses the SQL-standard syntax FETCH NEXT 10 ROWS ONLY, but it's the same technique of selecting greater than or less than some particular date with a limit. And he says you can use the row values syntax in Postgres to consider both a date and an ID at the same time. He shows in a great graph at the bottom how the performance with offset gets progressively worse as you go up in page numbers, whereas the performance using what he calls the seek method, or this keyset pagination, stays pretty good no matter how far into the pages you get. So if you're interested in designing an efficient way of paginating in your application, definitely check out these blog posts.
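As a minimal sketch of the technique, assuming a hypothetical products table (the table and column names aren't from the posts):

```sql
-- Keyset pagination: remember the last id shown, then select past it.
-- Fetching 11 rows to display 10 tells you whether a next page exists.
SELECT id, name
FROM products
WHERE id > 42               -- the last id from the previous page
ORDER BY id
LIMIT 11;

-- Row values syntax: page by date with id as a tie-breaker.
SELECT id, name, created_at
FROM products
WHERE (created_at, id) < ('2020-08-01', 42)
ORDER BY created_at DESC, id DESC
LIMIT 11;
```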
[00:08:40] The next piece of content is working with a JSONB array of objects in PostgreSQL. Now, this is an application developer describing how to work with JSONB in their data, and he uses a number of different JSONB functions to do it. His example is purchases with an array of products in a JSONB field. Now, I wouldn't necessarily advocate this particular data model myself, but as an example of how to manipulate the results using all the different JSONB functions, this was a great post to read through. So if you're interested, check out this post from levelup.gitconnected.com.
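Here is a minimal sketch of working with a JSONB array of objects; the purchases table and its contents are hypothetical, not taken from the original post:

```sql
CREATE TABLE purchases (id int, products jsonb);
INSERT INTO purchases VALUES
  (1, '[{"name": "widget", "price": 10}, {"name": "gadget", "price": 25}]');

-- Expand the array into one row per object and pull out fields.
SELECT p.id,
       prod ->> 'name'             AS product_name,
       (prod ->> 'price')::numeric AS price
FROM purchases p
CROSS JOIN LATERAL jsonb_array_elements(p.products) AS prod;
```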
[00:09:24] The next piece of content is running multiple PgBouncer instances with systemd, and this is from 2ndQuadrant.com. So PgBouncer runs as a single process, and it only uses one CPU core. So if you have a multicore machine, a single PgBouncer is not able to use all of the cores. Now, what people typically do is set up multiple PgBouncer services on different ports, so you have to run different services to be able to leverage all the CPUs. However, this post has a technique for running multiple PgBouncer instances on a common port without having to do that. They cover an example of the unit file for the systemd configuration as well as the PgBouncer configuration he's using, and there's a so_reuseport setting you can enable to run multiple instances on the same port. He modified the PgBouncer services using the systemd template system, and after making these modifications, you're able to start PgBouncer service one, start PgBouncer service two, and he says they don't even have to be numbers; they could be some other identifier that you use. Now, he says the problem with this particular setup is that the Unix domain sockets can't be shared, but he has a solution for that: you can set up socket activation, which is supported in PgBouncer 1.14, and he shows the configuration for that. What that gives you is a common port for all services to connect to on that system, and then per-instance TCP/IP and Unix domain sockets for administration and monitoring. So basically, when you start the services, each instance gets its own port for administration, but they can all share the same connection port to the database. They show that production traffic can just connect to the common port, whereas if you want to administer a particular PgBouncer instance, you can use the port defined when you started the service. So this is a pretty interesting technique that I'm definitely going to keep in mind if I'm going to be setting up multiple PgBouncer instances on a single server. So if you're interested in more details, definitely check out this post from 2ndQuadrant.com.
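As a rough sketch of the general idea, not the post's exact configuration: the so_reuseport setting requires PgBouncer 1.12 or later, and the file paths and instance names below are hypothetical.

```ini
; /etc/pgbouncer/pgbouncer.ini (sketch) -- so_reuseport lets multiple
; instances bind the same listen port
[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
so_reuseport = 1

; /etc/systemd/system/pgbouncer@.service (hypothetical template unit);
; start instances with: systemctl start pgbouncer@1 pgbouncer@2
[Unit]
Description=PgBouncer instance %i

[Service]
ExecStart=/usr/bin/pgbouncer /etc/pgbouncer/pgbouncer.ini
```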
[00:11:38] The next piece of content is Partitioning Improvements in PostgreSQL 13. This is from Highgo.ca, and they first talk about the history of all the improvements that have been made to partitioning across the different versions. For 13, they talk about partitionwise join improvements, partition pruning improvements, as well as BEFORE row-level triggers on partitioned tables. So if you want all the details and example queries for these improvements, definitely check out this post from Highgo.ca.
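For the trigger improvement specifically, here is a minimal sketch (the table and function names are hypothetical): PostgreSQL 13 allows a BEFORE row-level trigger on a partitioned table, as long as it doesn't change the partition key.

```sql
CREATE TABLE measurements (city text, reading numeric, taken_on date)
    PARTITION BY RANGE (taken_on);
CREATE TABLE measurements_2020 PARTITION OF measurements
    FOR VALUES FROM ('2020-01-01') TO ('2021-01-01');

CREATE FUNCTION normalize_city() RETURNS trigger AS $$
BEGIN
    NEW.city := lower(NEW.city);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- Before PostgreSQL 13, this would fail on a partitioned table.
CREATE TRIGGER normalize_city_trg
    BEFORE INSERT ON measurements
    FOR EACH ROW EXECUTE FUNCTION normalize_city();
```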
[00:12:11] The next post, also from Highgo.ca, is PostgreSQL high availability: the considerations and candidates. So they're looking at high availability and proposing what people generally want when they talk about a high availability PostgreSQL solution. They came up with these requirements or considerations, as well as what high availability means; for example, there should not be any single points of failure in the system, there should be continuous health monitoring of backend servers and systems, and there should be reliable failover in case of failure. They also pick four high availability solutions to evaluate against these requirements. The first is Pgpool-II. The second is PostgreSQL Automatic Failover, or PAF. The third is repmgr, the Replication Manager. And the fourth is Patroni. So they cover a very basic summary of each of these solutions, and in future blog posts they're going to examine how each one meets the high availability needs of Postgres. So if you're interested, you can check out this post and look forward to the future ones covering each solution.
[00:13:19] The next post is consensus-based failover with Pgpool-II. So this is talking about failover and how you can do it with Pgpool-II based on its watchdog feature, using consensus to decide when to fail over, and showing the different values you can configure for it. So if you're interested in doing that, definitely check out this blog post.
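As a hedged sketch, these are the kinds of watchdog settings involved; the names below are from Pgpool-II 4.x, and the post and documentation are the authoritative reference.

```ini
; pgpool.conf watchdog settings for consensus-based failover (sketch)
failover_when_quorum_exists = on
failover_require_consensus = on
allow_multiple_failover_requests_from_node = off
```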
[00:13:39] The next post is building a recommendation engine inside Postgres with Python and Pandas. This was an interesting post where they build a basic recommendation system, but do it within PostgreSQL, using functions to do the work. If you're interested, check out this post from Crunchydata.com.
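This is not the post's actual code, but a minimal sketch of the general approach, assuming the plpython3u extension and pandas are installed and using a hypothetical ratings table:

```sql
CREATE EXTENSION IF NOT EXISTS plpython3u;

-- Recommend the highest-average-rated products the user hasn't rated yet.
CREATE OR REPLACE FUNCTION recommend(target_user int, n int DEFAULT 5)
RETURNS TABLE (product_id int, score double precision) AS $$
    import pandas as pd
    rows = plpy.execute("SELECT user_id, product_id, rating FROM ratings")
    df = pd.DataFrame([dict(r) for r in rows])
    if df.empty:
        return []
    seen = set(df[df.user_id == target_user].product_id)
    candidates = df[~df.product_id.isin(seen)]
    top = candidates.groupby("product_id").rating.mean().nlargest(n)
    return [(int(pid), float(s)) for pid, s in top.items()]
$$ LANGUAGE plpython3u;

-- Usage: SELECT * FROM recommend(42);
```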
[00:14:01] And the last piece of content: the PostgreSQL Person of the Week is Michael Brewer. So if you're interested in learning more about Michael and his contributions to PostgreSQL, definitely check out this blog post. That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to Scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.