Lateral Joins, Generating JSON, dbcritic, Fetch First Rows | Scaling Postgres 174

Episode 174 | July 19, 2021 | 00:15:54

Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, we discuss understanding lateral joins, generating JSON with Postgres, the dbcritic utility, and fetch first rows vs. limit.

To get the show notes as well as get notified of new episodes, visit: 

 https://www.scalingpostgres.com/episodes/174-lateral-joins-generating-json-dbcritic-fetch-first-rows/

Episode Transcript

[00:00:00] Hello. In this episode of Scaling Postgres, we talk about lateral joins, generating JSON, dbcritic, and fetch first rows. I'm Creston Jamison, and this is Scaling Postgres, episode 174.

[00:00:23] Alright, I hope you, your friends, family and coworkers continue to do well. Our first piece of content is "Understanding LATERAL joins in PostgreSQL". This is from cybertec-postgresql.com, and they're talking about lateral joins. Now, I really like this blog post because of how they explain lateral joins. They start with an example: you have a SELECT statement that you would send to Postgres that selects whatever columns from a particular table called tab. Well, you can think of this as a for loop in a programming language. Basically, for every instance of tab, assuming it's an array of something, you're going to loop over it and grab data from each instance of the array. So what is a lateral join? A lateral join is basically a nested loop. You want some set of data, and then, for each instance in that array, or row in that table, you want additional items of data from another table. They have the following example where you have a set of products, and they set up 1,000 products with particular prices. They also set up a wish list table and inserted three individuals, each with a desired price for a particular product. They show what the 1,000 products basically look like and what the wish list looks like. Going back to thinking of a SELECT statement as a loop, essentially you are going to loop over each item in the wish list, and then you're going to loop over the products, looking for the cases where the price is below a certain value and returning three of those items. And that's essentially what the lateral join does. So you are selecting from the wish list table, then you're doing a lateral join to this query, or subset of a table: a query on the product table where the price is lower than the desired price from the wish list, ordering by the price and applying a limit of three. So each wish list entry is shown along with three products priced below the target price. Basically, the first three columns come from the wish list and the remaining columns come from the product table. And that's all a lateral join is: it's basically a nested loop. And when you do an EXPLAIN of the plan, you can see that it is indeed doing a nested loop. That's all that a lateral join is, and that's what this post covers. It kind of makes me think, should they perhaps have given lateral joins a different name in the SQL standard? But if you want to learn more, check out this blog post.

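To make that nested-loop idea concrete, here is a minimal sketch of the kind of lateral join described above. The table and column names (product, wishlist, desired_price and so on) are assumptions based on the description, not the exact schema from the Cybertec post.

    -- Hypothetical tables matching the description above.
    CREATE TABLE product (
        id    serial PRIMARY KEY,
        name  text NOT NULL,
        price numeric NOT NULL
    );

    CREATE TABLE wishlist (
        id            serial PRIMARY KEY,
        username      text NOT NULL,
        desired_price numeric NOT NULL
    );

    -- For every wish list row (the outer loop), the LATERAL subquery (the inner loop)
    -- returns the three cheapest products below that row's desired price.
    SELECT w.username, w.desired_price, p.name, p.price
    FROM wishlist AS w
    CROSS JOIN LATERAL (
        SELECT name, price
        FROM product
        WHERE price < w.desired_price
        ORDER BY price
        LIMIT 3
    ) AS p;
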
[00:02:54] The next piece of content is "Cut Out the Middle Tier: Generating JSON Directly from Postgres". This is from crunchydata.com, and they're talking about using functions that exist within Postgres to return JSON directly to your application. So, as opposed to generating the normal output that you typically receive, you output it as JSON. They have an example here where they have employees and departments, so employees belonging to departments, and the first function they show is called row_to_json. You can give it a single row of data. Here they're selecting the row where the employee ID is one, and it will output each of the columns in that row as JSON. It then does something interesting with the geometry column, where it actually nests some JSON, because they have a custom type set up that casts it to JSON, so it does this automatically.

Now, what you're generally going to want to do is output rows of JSON, essentially an array of JSON objects, and that's what json_agg, the JSON aggregate function, does. So here, a subquery selects the employee ID and the name from the employees table where the department ID equals one, and then doing json_agg on that output gives you the JSON you see here: an array of, essentially, two JSON objects.

[00:04:15] And you can even nest the output. So here's an example using a CTE where they're selecting the data they want from the employees table, then joining it to the departments, and including in the output an aggregate of the employees for each department. Then the last section of the CTE aggregates all of the departments that are being selected. So you can see it generates the JSON shown here, where you have departments and each employee in those departments for each object. Now, the main benefit of doing this is speed, because if you're using an application framework such as Django or Rails, those usually use ORMs, and there's a lot of overhead in the ORM generating objects for each returned piece of data before returning it to the user. Going directly to JSON is much faster and should give you better performance. So if you're interested in that, definitely check out this blog post.

[00:05:19] The next piece of content is "dbcritic: Constructively criticizing your Postgres schema using Idris". This is from Channable, and they're talking about how they wanted to help audit how they were setting up their database schema and give guidance on some best practices. So they created a solution called dbcritic. I've heard of critics for other languages, such as RubyCritic, which basically criticizes how you're writing Ruby: you run it against your code and it gives you a report. Well, this does it for your Postgres schema. Right now it only checks two areas. The first one is having an index on your foreign keys. The importance of this comes when you go to delete something, in this example an ad. If you want to delete one ad and it is referenced by a foreign key, Postgres is going to need to look up the ad statuses by that ad ID to delete those as well. Without an index, that's going to be slow, and it'll be really slow if you need to delete multiple ads that each have multiple statuses, so the time increases dramatically. But if you have an index on that ad_id column, it should be able to delete those much faster. That's one of the things dbcritic checks. The other one is mistakenly using timestamp when you wanted the timestamp type that includes the time zone, timestamptz. So basically it audits your schema to make sure that all of your timestamps are timestamptz. Those are the areas that dbcritic covers now, and they have integrated it into their CI pipeline. It's an interesting project, but it is written in Idris, which they say is a functional programming language similar to Haskell. I've never used it before, so I think that would probably be the only thing that would give me pause about using this at this time. But these types of audits should be relatively easy to implement in other languages. So if you're interested in auditing your Postgres schema, maybe you can take some of the suggestions from this post and implement them yourself.

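For the two checks dbcritic currently performs, the corresponding schema fixes are straightforward. Here's a rough sketch, where the ad / ad_status table names come from the example described above and everything else is assumed:

    -- A foreign key without a supporting index: deleting an ad forces a scan of
    -- ad_status, so index the referencing column (built concurrently to avoid
    -- blocking writes).
    CREATE INDEX CONCURRENTLY ad_status_ad_id_idx ON ad_status (ad_id);

    -- Prefer timestamptz over plain timestamp; here the existing values are
    -- assumed to have been stored as UTC.
    ALTER TABLE ad_status
        ALTER COLUMN created_at TYPE timestamptz
        USING created_at AT TIME ZONE 'UTC';
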
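And going back to the Crunchy Data piece, here is a minimal sketch of row_to_json and json_agg, again with assumed table and column names rather than the post's exact schema:

    -- One row as a single JSON object.
    SELECT row_to_json(e)
    FROM employees AS e
    WHERE e.employee_id = 1;

    -- Many rows as a JSON array of objects.
    SELECT json_agg(t)
    FROM (
        SELECT employee_id, name
        FROM employees
        WHERE department_id = 1
    ) AS t;
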
[00:07:24] The next piece of content is "PostgreSQL: LIMIT vs. FETCH FIRST ROWS WITH TIES". This is from cybertec-postgresql.com, and they're talking about how Postgres lets you use LIMIT to return only a certain number of rows from a query, but the actual SQL standard syntax is FETCH FIRST ROWS. So for example, on this query over a table with seven rows, you could say LIMIT 3 and it returns three rows. You could also use FETCH FIRST 3 ROWS ONLY, and it will return exactly the same thing LIMIT would. So it is more verbose, but it is the SQL standard. Now, they also cover WITH TIES. This is something you cannot do with LIMIT but you can do with FETCH FIRST ROWS. When you specify FETCH FIRST 3 ROWS WITH TIES, it will not just give you the first three rows, it will also include any additional rows that are equal to the last one. If you look at the table they set up here, the value three appears twice and the value four appears twice. So when you fetch only three rows without ties, it's going to fetch the first three, but when you fetch with ties, it will fetch four of them. And if you then add another row with the value two and do FETCH FIRST 3 ROWS WITH TIES, it will give you one, two, two, because it won't retrieve the threes anymore. Now, as long as you're ordering by just the ID, when you fetch the first four rows with ties, you're going to get that repeated value, so five rows. But if you order by the ID and then a random column, you're only going to get four rows, even though you're using WITH TIES. So that's something to keep in mind: the ORDER BY clause is very important when using WITH TIES. This is another great blog post, so go ahead and check it out if you're interested in learning more.

[00:09:20] The next piece of content is "What happens to logical replication after running pg_upgrade?". This is from elephanttamer.net, and they're talking about a scenario where you have a large, active database that's doing logical replication to another database, and maybe even doing some logical decoding. So it's very active, a lot of queries and statements are being run, and you want to upgrade it to the next version. The issue you will encounter is that if you need to use pg_upgrade to do that upgrade, because that's generally the fastest way to do it, pg_upgrade will destroy all the replication slots. So you're going to lose any logical replication going on, and it's critical to have those replication slots back in place before you allow any more traffic at all to the main database system. So he's worked out a process to do these types of upgrades. The first step is to block all traffic to the primary database, and he does this by altering the pg_hba.conf file, or you could alter your firewall settings. The second step is to wait until the replication lag is zero and then drop the subscription on the subscriber, which also drops the publisher's replication slot. Then you do your upgrade on the subscriber, do the upgrade on the publisher, and then, on the subscriber, create the subscription again, but with copy_data set to false, because the data should already exist there. Then you do a smoke test to make sure that logical replication is working: you insert some data, make sure it gets replicated, and go ahead and remove it. Then you can create any other logical replication slots that are required, and finally you unblock the write traffic. So this is a process that worked for him.

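Expressed as commands, the subscriber-side portion of that procedure might look roughly like this; the subscription, publication and connection details are placeholders, not the post's actual values:

    -- On the subscriber, once replication lag is zero:
    DROP SUBSCRIPTION my_sub;  -- also drops the replication slot on the publisher

    -- ... run pg_upgrade on the subscriber, then on the publisher ...

    -- Recreate the subscription without re-copying data that is already present.
    CREATE SUBSCRIPTION my_sub
        CONNECTION 'host=publisher dbname=mydb user=replicator'
        PUBLICATION my_pub
        WITH (copy_data = false);
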
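And circling back to the LIMIT versus FETCH FIRST post above, the syntax looks like this, using a small hypothetical table t ordered by an id column:

    -- These two return the same three rows.
    SELECT * FROM t ORDER BY id LIMIT 3;
    SELECT * FROM t ORDER BY id FETCH FIRST 3 ROWS ONLY;

    -- WITH TIES also returns any additional rows that tie with the last row,
    -- so the ORDER BY clause determines what counts as a tie.
    SELECT * FROM t ORDER BY id FETCH FIRST 3 ROWS WITH TIES;
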
The author of that upgrade post also mentions the ability to use ALTER SUBSCRIPTION to set a new replication slot instead, but he had some issues getting that working with his upgrade procedure. So if you use logical replication and need to upgrade your database, maybe you want to check out this post and follow his suggested procedure.

[00:11:24] The next piece of content is "Secrets of psql, a Great Tool". This is a YouTube video on the San Francisco Bay Area PostgreSQL Users Group channel, and it's a great presentation about using psql, the command line tool, because if you're going to be using Postgres a lot, it's a tool worth learning all the ins and outs of, and this presentation does a really good job of covering that.

[00:11:48] The next piece of content is "How much data goes into the WALs?". This is from fluca1978.github.io. This post first covers a lot about the WAL, basically the write-ahead log of Postgres that records all data changes before they eventually get written to the actual data files. He talks about the WAL files themselves, how they're set up, and how they have log sequence numbers within them that reference different areas of the log, including some utilities you can use to inspect it. So this is a good post for explaining a lot of that to begin with. But then he goes on to talk about how WAL gets generated: when you insert data into a table, you can see how much WAL is generated for different types of inserts, and how, if you have an unlogged table, you're basically not going to get any WAL generated from changes made to it. So if you want to learn more about the WAL and how it works, definitely check out this blog post.

[00:12:47] The next piece of content is "Improved logging by libpq in PostgreSQL 14". This is from postgresql.fastware.com. They're talking about enhancements in Postgres 14 for tracing client-server communication, including timestamps that have been added to help you debug different issues in that communication. So if you're interested in that, definitely check out this blog post.

[00:13:16] The next piece of content is "First contact with pg_filedump". This is from highgo.ca, and they're talking about using a utility to inspect Postgres's data files. For example, he created a table, inserted three rows into it, identified where the underlying file is by using the pg_relation_filepath function, and then ran pg_filedump against that file. You can then inspect the header block for that table, look at individual data items, and see the different parameters and how they work for inspecting Postgres. So if you're interested in learning more about the internals of Postgres, you can definitely check out this blog post.

[00:14:00] The next piece of content is "Logical Replication of In-Progress Transactions". This is from amitkapila16.blogspot.com, and in Postgres 14 they released some enhancements to be able to start sending data to a logical replication subscriber before a transaction is fully committed. Generally, in versions earlier than 14, you needed to wait until the commit happened before information started transferring to subscribers, but if you have a lot of activity going on in a transaction, that's less than ideal. So they've added a way to start streaming changes before they're actually committed in Postgres 14 and higher.

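On the subscriber side, this behavior is controlled per subscription; here is a minimal sketch with placeholder connection, publication and subscription names:

    -- PostgreSQL 14+: stream large in-progress transactions to the subscriber
    -- before they commit, instead of waiting for the commit.
    CREATE SUBSCRIPTION my_sub
        CONNECTION 'host=publisher dbname=mydb user=replicator'
        PUBLICATION my_pub
        WITH (streaming = on);
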
And this blog post walks through a description of the feature, all the changes they had to make in order to make it work in Postgres, as well as some of the contributors to it. So if you're interested in that, you can check out this blog post.

[00:14:54] The next piece of content: the PostgreSQL person of the week is Jean-Christophe Arnu. So if you're interested in learning more about Jean-Christophe and his contributions to Postgres, definitely check out this blog post. And the last piece of content: this past week, The Rubber Duck Dev Show did an episode on pair programming. So if you want to learn the whens, whys and hows of pair programming, you can check out that episode. This coming week we're going to be discussing object-oriented versus functional programming, so be sure to join us on Wednesday at 8:00 p.m. Eastern Standard Time.

[00:15:29] That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.

Other Episodes

Episode 279

August 27, 2023 00:15:59

TPS Benchmark, Partition-wise Join & Aggregate, Partitioning Billions, Postgres 16 Features | Scaling Postgres 279

  In this episode of Scaling Postgres, we discuss how pgbouncer can impact a TPS benchmark, partition-wise join & aggregate performance, partitioning a table with...


Episode 315

May 12, 2024 00:19:31

Hacking On Postgres is Hard! | Scaling Postgres 315

In this episode of Scaling Postgres, we discuss how hacking on Postgres is hard, a notifier pattern for using Listen/Notify, using histograms for metrics...


Episode 134

October 04, 2020 00:14:10

Community Acquisition, Space Saving Terabytes, WAL Archiving, Vacuum Analyze Tips | Scaling Postgres 134

In this episode of Scaling Postgres, we discuss a company acquisition within the Postgres community, how to save terabytes of space, setting up WAL...
