Recursive CTEs, DB Access Patterns, Dates & Times, Cluster

[00:00:00] In this episode of Scaling Postgres, we talk about recursive CTEs database access patterns, dates and times and cluster. I'm creston, Jameson. And this is scaling postgres episode 116. [00:00:22] All right, well, with all the craziness this going on, I hope you, your coworkers and family are safe and I hope you're having a great week. Our first piece of content is Recursive SQL queries with PostgreSQL. And this is from the Towardsdatascience.com blog. And basically they're talking about recursive CTEs or common table expressions. And he first talks about why you would want to use a recursive query. And he gives a few examples. Where you have first is where you have self referential data. So for example, you have an employee table and one of the columns indicates who is the manager of that person or some other type of relationship. Or you have a category that belongs to another particular category, but it's referencing it within the same table. The next example is you actually have a tree of information, some sort of taxonomy system, links between articles and things of that nature. Now, what I really like about this post is it really starts off from a basic standpoint and explains everything going on. I would say it's one of the best recursive CTE posts I've seen in a while. Now first they just discuss what a CTE is showing you the with statement here to basically compartmentalize or kind of do a sub query to make it easier to reason about queries you need to do. And then he shows the self referential employee table here where you have a Manager ID that it points to the primary key of the person who is that employee's manager. And he inserts some sample data here. You can see the CEO has no Manager ID, whereas all of these others are managed by other people in the chart. And then he goes into recursion basically how you use a recursive CTE. And he gives a basic example here where it has one, which is the base case, and then does a union all for selecting the other rows. And when you run this query, you could see it prints out these ten rows. Now then he uses the example of his employee table that he created to get a manager tree. And so he uses the base case of Alice, the VP, and then unions it with this query that increases the level here. He starts it at one, but he increases it each time to see who that person manages. And you can see the results of this recursive CTE down here where you see those staff who are on the second level, those staff that are on the third level, and he even gives a convenient diagram so you can actually see what this output means graphically. And then he even does a test doing degrees of separation between two employees. And lastly, he continues on and looking in terms of graphs. So defining the relationship that exists between multiple points. He does that in the Recursive CTE sets up here and it gives this recursive output defining all the different interactions between different points, or I should say the path between different points. So if you're wanting to learn more about Recursive CTEs, I highly suggest checking out this blog post. [00:03:32] The next piece of content is the Troublesome Active Record Pattern and this is from Calpatterson.com. Now there is a Ruby library called ActiveRecord, but what he's referring to here is the actual Active Record pattern and he mentioned the definition here in Martin Fowler's definition, basically converting a whole programmable object into the primitives that are stored in a database table as a row. So basically he has issue with the pattern itself, not explicitly say the Ruby on Rails Active Record library, although he does have issues with ORNs themselves because he does mention here issues with object relational mappers. Now, some of the problems that he sees is that typically by default it retrieves the entire author object. Now actually that is what happens with an SQL query. If you're using select star or select Asterisks, you're going to be pulling all of the columns from that table, so much as you should narrow down what you retrieve. If you're doing it programmatically through an Orn, you should only choose the particular columns of the information that you need. Then he mentions the entire book for each object has to be retrieved because he's doing it as a nested loop here. But of course some Orms you can use Joins and you can literally send queries that look like this. And he mentions the third nested loops prelude, much parallelism. So yes, there are a lot of problems, but reason I brought this post up is because I still use an Orm because it's efficient, but I know how to use it to be able to generate the SQL that looks like this efficiently. So you don't want to just use an Orm blindly, but understand how to use it so that you can restrict how many columns you're pulling back so that you can do a join to a table and you don't have to pull more data than you need. So basically in my opinion, you can use an Orm but you need to use it intelligently, you need to know how the database works and use it efficiently. Now he also mentions here the issues with transactions where you get a record, you change the number of copies in stock and then you save it. So of course there has been no locking going on. So another process could have already updated this or increased the stock and now you're putting a wrong value in the value retrieved earlier because there was no lock on it. And how he has an example here of how you would want to implement this. You can even do it in a particular update query to do this using raw SQL and there's a way to do that communicating through the orms if you need to. Then he also discusses Restful APIs as well. So while this post isn't PostgreSQL specific, he mentions things that you need to be aware of and use intelligently. If you're using some of these tools, such as orms that make certain operations more efficient, you just want to know how to use them correctly to avoid performance and data problems. [00:06:41] The next piece of content is PostgreSQL date and Time data types Explained. So this basically just goes over date and data types and how they work. Now that first covers a timestamp and the difference between timestamp with time zone and timestamp without time zone, and how most people encourage using timestamp with time zone. And it gives you the code here because it handles time zone things for you and you can make your programs easier to work with. Now, you don't always have to have a set session here. You can do it within the query itself to define what the time zone is. But for his demonstration purposes, he just shows it here. And if you have a timestamp with the time zone included, it can do the time zone translation for you. But they also have the timestamp without time zone if you wanted to rely on, say, your application framework to do the time zone conversions. But he covers all about timestamps and then he goes into dates which exclude time. And he shows how you can compare a timestamp to a date in cases where it's true, in cases where it's false, and how you can potentially convert between them. Then he goes into times of a day and intervals and following up with time ranges. Now, if you want a refresher on using dates and times with postgres, definitely a blog post to check out. [00:08:01] The next piece of content is Cluster improving PostgreSQL performance. This is from CyberTech Postgresql.com and they're talking about Cluster, which is a tool that reorders the data based upon an index. So the reason why you would potentially do this is because physically ordered data on the disk, a lot of times it's easier to do lookups or faster and more efficient to do lookups. Now for this, he created a test table and did a generate series of 10 million rows. Then he did a second table that randomized the order of it. So the tables were the same size and include the exact same data. They are just in a different order. He created an index on each of the tables and you can even see the random index takes longer to generate because data is mixed all around on the disk. He did a vacuum and then looked and see how long it would take to read the blocks of the database. Now, all of this was pretty much stored in memory, so you're not seeing some of the disk access costs here, but you can see that the sequential one on disk was about twelve milliseconds, whereas the one that's random is about 13 milliseconds. Again, that's because it didn't need necessarily to go to disk, but look at the buffer shared hit ratio. It only had to access 130 buffers when it's sequential, where it had to access over 18,000 when it was randomized. Now if you then cluster based upon that index, so we ran the cluster command and now it's not still not quite as fast, about 13 milliseconds still, but the read is down to 135. So basically if you know your table is typically accessed by a particular index, it can be beneficial to use cluster to order those on the disk. Now the disadvantage of cluster is that it doesn't maintain that order on the disk. If you do a lot of updates, it's going to start spreading them around again and you'll need to run cluster again. And it's also something that cannot be done while the table is active. So it needs a lock on the table. But if you're looking to boost performance, maybe for as he mentions here, maybe more so of an analytic database or where you have more static data, then it could be beneficial. [00:10:22] The next post is is postgres running and this is from Mydbainotebook.org and it's just a post of what you can check to see is postgres running. Because if there's a problem that you can't connect, that's probably the first thing you're going to look for. And he just has this very quick PS command to look and see what the postgres processes are. And once you know what processes are running, you can actually reference the directory issued for the postmaster to kind of find where the PostgreSQL comp file is because to connect to it you're going to need to know the port. Now you also need to consult if there's any other comp files that are included in postgresql.com as well as the auto.com file. So that's something to keep in mind. And then for completeness this is for Red Hat and others, but he also mentions for Debian and others as well what that looks like when you do the PS command and even shows a little quick said command to be able to find what the port is in the file. But again, you need to make sure that there's not included files or it's in the auto.com file as well. And then how do you start and stop postgres? He referenced some of the Debian tools that are available such as PGLS clusters to list out clusters as well as how to start, stop and restart them, and then as well as system D as well. So, pretty quick blog post, but if you work with different PostgreSQL instances, definitely some quick commands to determine is postgres running. [00:11:57] The next piece of content is actually a YouTube channel. It appears they've posted a number of videos for the Southern California Linux expo on their YouTube channel, and there's about five or six or so videos that have PostgreSQL. So for example, there's using PostgreSQL for data privacy and security declarative caching with Postgres and Redis. No excuses. PostgreSQL Security PostgreSQL Worst Practices enterprise Authentication with PostgreSQL linux I O internals for PostgreSQL Administrators welcome to Postgres Twelve, et cetera. So if you're interested in video content, definitely check out this YouTube channel. [00:12:42] The next post is creating a PostgreSQL procedural language. Part Five returning Results this is the of course, fifth in series. That second quadrant.com has been working through setting up Julia as a programming language to use within PostgreSQL. So if you've been following this series, the fifth post is available. [00:13:02] The next piece of content is the new PostgreSQL person of the Week is Bruce Momgen. So if you want to learn about Bruce's contributions and work in PostgreSQL, definitely check out this blog post. [00:13:15] The next post is Spatial Tile serving with PostgreSQL functions. This is from Crunchydata.com, and again they're following along with the web services that they've created to work with Postgres and how you can generate a spatial tile serving. So if you have an interest in this content, definitely check out this blog post. [00:13:38] The next piece of content is PostgreSQL GSS API authentication with Kerberos. Part One how to set up Kerberos on Ubuntu so this looks to be the first part of a three part blog post talking about authenticating to PostgreSQL using Kerberos. Now, this first post doesn't cover PostgreSQL, but it covers setting up Kerberos on Ubuntu. So if you have an interest in that, maybe you want to check out this first post. [00:14:09] And lastly, PG. Timetable advanced, PostgreSQL, cron like, scheduler released. This is from CyberTech Postgresql.com, and they've released the second version of their PG Timetable tool because they made some changes to how data was stored. So they want to go ahead and do that and explain how this new tool works. So if you're interested, definitely check out this blog post that does it. For this episode of Scaling Postgres, you can get links to all the content mentioned in the show notes. Be sure to head over to Scalingpostgres.com where you can sign up to receive weekly notifications of each episode, or you could subscribe via YouTube or itunes. Thanks.

Show Notes

Episode Transcript

Other Episodes

Episode 205

Indexing With tsearch, Tuples Not Yet Removable, Wicked Problems, Pattern Matching | Scaling Postgres 205

Episode 350

Seven Years of Scaling Postgres | Scaling Postgres 350

Episode 38

Parallel Query, Real Time Analytics, Killed Index Tuples | Scaling Postgres 38

Recursive CTEs, DB Access Patterns, Dates & Times, Cluster | Scaling Postgres 106