Episode Transcript
[00:00:00] In this episode of Scaling Postgres, we talk about Redis versus Postgres, the hard quadrant, optimizing joins, and materialized views. I'm Creston Jamison, and this is Scaling Postgres episode 170.
[00:00:23] Alright, I hope you, your friends, family, and coworkers continue to do well. Before we get started with our first piece of content, I wanted to mention that I will be starting a new live show called the Rubber Duck Dev Show, and it will be all about software development. It'll be me as well as Chris, so if you're looking for more developer-focused content, maybe you want to check it out. The last link in the Scaling Postgres show notes will take you to the channel trailer. This will be a live stream show where we can accept questions. The first show is scheduled for this Wednesday evening at 8:00 Eastern Standard Time, so be sure to join us if you're interested.
[00:01:07] Our first piece of content is "Do you really need Redis? How to get away with just PostgreSQL". This is from Atomicobject.com, and he's talking about certain use cases where people tend to reach for Redis but where Postgres can handle the job as well. The first use case he mentions is job queuing. Frequently you have multiple worker boxes that run asynchronous jobs, typically for a web application, and a lot of times Redis is used for this purpose. However, you can use PostgreSQL, most notably with SKIP LOCKED. Here's an example: within a transaction, a CTE selects the next pending job using FOR UPDATE SKIP LOCKED, and then an update marks it as running. Doing it this way pulls only one job and skips any rows currently locked by another process, because presumably you have multiple boxes with multiple processes that are processing jobs, so it prevents them from stepping on one another. Now, he says at some level this does break down and the locking becomes less efficient, but for most of the apps he's worked with it has been sufficient. One reason Redis has an advantage here is that it doesn't need to coordinate locks on the data at all: it's single threaded, so there's only ever one thread accessing the data it needs to work with. The next use case is application locks. You can establish a lock on something using what Postgres calls advisory locks. I've actually known some clients that ran a dedicated Postgres instance whose only responsibility was advisory locks. It didn't really store data for them; it was a separate instance just to use this feature. The next use case is pub/sub, being able to publish and subscribe to notifications. PostgreSQL has the LISTEN/NOTIFY capability, so you could use this in lieu of Redis. Rough sketches of all three patterns follow below. Now, Redis does still have a lot of advantages, and some of the things he mentions are caching data with particular TTLs and storing and manipulating ephemeral data. I'll also add one thing where Redis excels compared to Postgres: if you want to update a single value, say a count or a running total, very rapidly, it's much better to do that in Redis. In Postgres, an update generally doesn't happen in place but creates a new row version, so you could have a huge bloat problem if all you're doing is updating one value super fast. In that use case, Redis would probably be better. The interesting thing is that with Postgres eventually supporting different storage engines, once something like zheap is ready, maybe you could create a table that uses zheap and then you wouldn't have that type of bloat issue when updating a value really fast. This blog post was an interesting comparison of Redis versus Postgres and how Postgres has features that people have typically used Redis for, so I encourage you to check it out.
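First, the job queue pattern. This is a minimal sketch of the approach described above; the jobs table, its columns, and the status values are hypothetical stand-ins, not taken from the original post:

    -- Hypothetical jobs table for illustration.
    CREATE TABLE jobs (
        id         bigserial PRIMARY KEY,
        payload    jsonb NOT NULL,
        status     text  NOT NULL DEFAULT 'pending',
        created_at timestamptz NOT NULL DEFAULT now()
    );

    -- Claim the next pending job. SKIP LOCKED means concurrent
    -- workers skip rows already locked by another worker instead
    -- of blocking on them.
    WITH next_job AS (
        SELECT id
        FROM jobs
        WHERE status = 'pending'
        ORDER BY created_at
        LIMIT 1
        FOR UPDATE SKIP LOCKED
    )
    UPDATE jobs
    SET status = 'running'
    FROM next_job
    WHERE jobs.id = next_job.id
    RETURNING jobs.id, jobs.payload;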
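Next, application locks via Postgres's advisory lock functions, which take an arbitrary application-chosen key; the key 42 here is made up for the example:

    -- Try to take an exclusive application-level lock on key 42;
    -- returns true if acquired, false if another session holds it.
    SELECT pg_try_advisory_lock(42);

    -- ... do the work that must not run concurrently ...

    -- Release the lock when finished.
    SELECT pg_advisory_unlock(42);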
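And pub/sub via LISTEN/NOTIFY. The channel name and payload are invented for illustration:

    -- Session A subscribes to a channel.
    LISTEN job_events;

    -- Session B publishes; subscribed sessions receive the payload
    -- asynchronously once the notifying transaction commits.
    NOTIFY job_events, 'job 123 finished';

    -- Equivalent function form, handy for parameterized payloads.
    SELECT pg_notify('job_events', 'job 123 finished');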
[00:04:25] The next piece of content is "The hard quadrant". This is from Momjian.us, and he's talking about how he frequently has to work with locks and sharing and the complexity inherent in that. Going back to the previous post, Postgres has multiple processes that can access the data at any one time, and it handles that by locking rows or tables or things of that nature. Redis, being single threaded, doesn't need any locks: a single worker thread accesses the data it needs and retrieves it, so there's nothing to lock. By its nature, Postgres is going to be more complex in that area. He references a slide from a talk that he feels expresses the reason for the complexity. The slide shows a four-quadrant image with two axes: whether data is unshared or shared, and whether it's mutable or immutable. If data is immutable, it doesn't matter whether it's shared or unshared; there's no need for synchronization because the data never changes. If data is mutable, like when you want to change someone's email address, but nothing is shared (as in Redis), there's still no synchronization; it's pretty easy to handle. But if multiple processes need to access mutable data, you get a lot more complexity. So it's the shared, mutable quadrant that causes the trouble, and whenever you can make something immutable or unshared, it becomes easier to work with. I thought this was a very interesting concept, and I encourage you to check out this blog post as well as the talk.
[00:06:17] The next piece of content is actually a YouTube video, "PostgreSQL Optimizer Methodology". This is from Robert Haas, and it was posted on the CMU Database Group channel. Specifically, he's talking about optimizing joins and how the optimizer works through choosing the proper join for a query, including the table statistics gathered by ANALYZE and ways you can actually fool the optimizer depending on the queries you send it. This was a very interesting deep dive into how the optimizer works, and if you're interested in that, I highly encourage you to check this video out.
[00:06:54] The next piece of content is "Creating and refreshing materialized views in PostgreSQL". This is from Cybertec-postgresql.com, and he's talking about materialized views. A standard view is a virtual table: it never stores the data it presents, and its query runs dynamically every time it's referenced. A materialized view, by contrast, actually stores the result of the query as a table. For example, he creates a base table, inserts a number of rows into it, and then creates a materialized view from it. The data in the materialized view is very small, just two rows, but it is saved, so queries against the view don't have to go back to the main table the way a standard view would. The thing to keep in mind is that a materialized view is just another table, so you're going to want to vacuum and analyze it on a regular basis; for example, right after creating it, you might run a VACUUM ANALYZE. Because it is a standard table, you can also use the USING clause, which specifies the storage format. Right now there is just the heap, but in the future, with something like zheap or zedstore, you could eventually store it in a different format. You can also set table parameters; his example actually disables autovacuum for the view. You can create it in its own tablespace, and you can modify a materialized view with an ALTER statement. And then, of course, the most important operation is refreshing the materialized view, because its data is static as of when it was created or last refreshed; it is never updated automatically. You do that with REFRESH MATERIALIZED VIEW and the view name. The problem is that this locks the materialized view, so typically you would want to refresh it concurrently. He does mention that in order to refresh concurrently, you need a unique index on the materialized view, so be sure to create one if you want that option. This is a great post all about materialized views, and there's a rough sketch of the workflow below. If you're interested in learning more, definitely check out this blog post.
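Here's a minimal sketch of that lifecycle. The sales table, its columns, and the view name are hypothetical stand-ins for the examples in the post:

    -- Materialize an aggregate so it isn't recomputed on every query.
    CREATE MATERIALIZED VIEW sales_summary AS
        SELECT region, sum(amount) AS total
        FROM sales
        GROUP BY region;

    -- It's just a table, so keep it vacuumed and analyzed.
    VACUUM ANALYZE sales_summary;

    -- A unique index is required before refreshing concurrently.
    CREATE UNIQUE INDEX ON sales_summary (region);

    -- Rebuilds the data without locking out readers of the view.
    REFRESH MATERIALIZED VIEW CONCURRENTLY sales_summary;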
[00:09:15] The next piece of content is "Fun with pg_checksums". This is from Crunchydata.com. Checksums are meant to detect corruption of the Postgres data files on disk. By default they are not enabled, unfortunately, and historically you had to enable them at the point you created the cluster. I haven't frequently worked with clients that have them enabled, and right now even the different operating system installers won't turn them on unless you specifically ask. There is some performance impact to enabling them that you'll want to evaluate, but if you're concerned about the integrity of your data files, you would definitely want them on. This post goes through the process: you can enable checksums after your database has been created and is running, but you do have to shut the database down and then let pg_checksums run through all the data files. On a terabyte-sized database, that can take quite a bit of time. The post walks through enabling checksums and even tracks the timing with the time command to see how long it might take. They also mention a patch proposed for Postgres 15 that may allow you to apply checksums piecemeal: shut down the database, run pg_checksums for a while, stop it, start the database again, and then continue later to incrementally finish the job. We'll see what happens with that. An even better feature would be the ability to lock a given file, checksum it, and release the lock while the database stays running, but I haven't heard of anyone working on a patch for that yet. If you want to learn more about pg_checksums, you can definitely check out this blog post.
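Enabling checksums happens offline with the pg_checksums command-line tool, but from SQL you can at least verify the current state and watch for failures. A small sketch; note that the per-database failure counters require Postgres 12 or later:

    -- Shows 'on' if the cluster was created with (or converted to) checksums.
    SHOW data_checksums;

    -- Checksum failures detected at read time, tracked per database.
    SELECT datname, checksum_failures, checksum_last_failure
    FROM pg_stat_database;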
[00:11:11] The next piece of content is "About cross join in PostgreSQL". This is from Cybertec-postgresql.com. They're talking about cross joins, which produce the Cartesian product of two tables. So a cross join of a three-row table with a two-row table gives you six rows of data. If you use the old join syntax with a comma in the FROM clause, you're going to get a cross join by default unless you add a WHERE clause, like WHERE a.id = b.id. That can be pretty dangerous if the tables have a lot of rows: because the result is the product of the row counts, a cross join of a million-row table with another million-row table would give you a trillion rows back. So a lot of times a cross join is not what you want, and they mention that the safest way to avoid doing one accidentally is to always use explicit join syntax: INNER JOIN, CROSS JOIN, LEFT OUTER JOIN, et cetera. There are some legitimate use cases, though; they talk about a lateral cross join and joining with a single value, where maybe you do want a cross join. There's a small illustration of the comma pitfall below, and if you want to learn more, you can check out this blog post. The next piece of content is "Postgres 14 highlight: Monitoring for COPY". This is from Paquier.xyz, and they've introduced a new view called pg_stat_progress_copy that tracks the status of a COPY while it's in progress. So if you're copying data into or out of Postgres and it's going to take a particularly long time, you can now track its progress with this view; there's a sample query below. If you want to learn more about this, definitely check out this blog post.
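Here's that comma pitfall in miniature. Tables a and b with a handful of rows are made up for illustration:

    CREATE TABLE a (id int);
    CREATE TABLE b (id int);
    INSERT INTO a VALUES (1), (2), (3);
    INSERT INTO b VALUES (1), (2);

    -- Comma syntax with no WHERE clause: a cross join, 3 x 2 = 6 rows.
    SELECT * FROM a, b;

    -- Explicit syntax makes the intent unmistakable.
    SELECT * FROM a CROSS JOIN b;            -- deliberate Cartesian product
    SELECT * FROM a JOIN b ON a.id = b.id;   -- inner join, no accident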
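And to watch a long-running COPY from another session on Postgres 14 or later, a query along these lines works against the new view:

    -- One row per backend currently running a COPY;
    -- bytes_total is 0 when the total size isn't known.
    SELECT relid::regclass AS table_name, command,
           bytes_processed, bytes_total, tuples_processed
    FROM pg_stat_progress_copy;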
[00:13:07] The next piece of content is a YouTube channel. I'm highlighting the Percona YouTube channel again because, over the last week, they've posted a number of new videos, some on Postgres. So if you're interested in more video content, check out this channel.
[00:13:24] The next piece of content is "Promoting a specified node in Pgpool-II". This covers a new feature in Pgpool-II. Historically, you would run a primary and some standby nodes behind Pgpool-II, but you could never explicitly tell it which node to promote; it would automatically choose. With this new feature in 4.3, you can specify the exact node you want to promote. If you're interested in learning more about that, check out this blog post.
[00:13:59] The next piece of content is "How and why to become a PostgreSQL contributor". This post runs through how you can become a contributor to PostgreSQL development, and he walks through the steps: identifying your motivation and why you would want to do it, learning about the Postgres development process, identifying the first patch you want to submit, and how to contribute while avoiding common mistakes. If you're interested in becoming a contributor or learning more about the process, definitely check out this blog post.
[00:14:31] And the last piece of content is that the PostgreSQL Person of the Week is Josh Berkus. If you're interested in learning more about Josh and his contributions to Postgres, definitely check out this blog post. That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to Scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.