Episode Transcript
[00:00:00] In this episode of Scaling Postgres, we talk about using JSON, procedures, concurrency issues, and Kubernetes. I'm Creston Jameson and this is Scaling Postgres, episode 21.
[00:00:21] Alright, so we have a lot of content this week. I'm going to try to move through it maybe a little bit quicker than normal, but I'm definitely going to try to cover what I feel are the best pieces of content for this week. So, starting off we have Asynchronous Notifications in Postgres, and this is from the Citizen428 blog. This post is basically talking about the NOTIFY and LISTEN capabilities of Postgres. It's basically a pub/sub: NOTIFY is a way to publish some event, and LISTEN is a way to subscribe and wait for those events. And this is great because I believe it's written by a developer, he covers how to set this up with a particular application, and it's presented in a very clear fashion. So, for example, he's focused on a table of orders that just has a few columns, and he wants to set up a notification when a new order comes in. So he creates a notify_event function, and the first thing the function does is check whether the operation is a DELETE. If it is, he captures the old record that was deleted; otherwise he grabs the new record. From this he builds a JSON payload: he takes the record, runs it through a row_to_json function, and stores it under a data key; he also records the action, whether it was an INSERT, UPDATE, or DELETE, and the table that was impacted. Then he performs a pg_notify to an "events" channel, as it were, sending the payload he built as text. Now, on the orders table, he creates a trigger that calls this function, so for each row it executes the notify_event procedure. Then he sets up a LISTEN on that events channel and does an INSERT with some values, and you can see what the asynchronous notification payload looks like: the table he defined as orders, the action as INSERT, and all of the data that was inserted in the JSON payload.
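To make that concrete, here's a minimal sketch of the kind of trigger function he describes. This isn't his exact code, just an illustration of the pattern: check TG_OP, build a JSON payload with the table, action, and row data, and publish it on an "events" channel with pg_notify.

```sql
CREATE OR REPLACE FUNCTION notify_event()
RETURNS trigger
LANGUAGE plpgsql AS $$
DECLARE
  rec     record;
  payload json;
BEGIN
  -- For deletes, the row is only available as OLD; otherwise use NEW.
  IF TG_OP = 'DELETE' THEN
    rec := OLD;
  ELSE
    rec := NEW;
  END IF;

  -- Build a payload with the table name, the action, and the row as JSON.
  payload := json_build_object(
    'table',  TG_TABLE_NAME,
    'action', TG_OP,
    'data',   row_to_json(rec)
  );

  -- Publish it on the "events" channel as text.
  PERFORM pg_notify('events', payload::text);
  RETURN NULL;  -- return value is ignored for AFTER triggers
END;
$$;

CREATE TRIGGER orders_notify_event
AFTER INSERT OR UPDATE OR DELETE ON orders
FOR EACH ROW EXECUTE PROCEDURE notify_event();
```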
[00:02:28] Now of course, it gets more interesting when you're listening from a separate process. In this example, he's using Ruby with the Sequel gem, simply setting up a listener that loops and waits for new notifications on that LISTEN channel. Basically, when an insert happens, you get a JSON notification printed out: the payload gets printed for every insert done to that table, in this separate process that's running. And then to make it even more interesting, he actually adds a front end. Again, he's using Ruby, so he's chosen to create a simple Rack app with Faye, and essentially his configuration enables him to print to the console when a particular event happens. Now, if you're using this with your application, you can envision that when an event comes in you could trigger, say, a JavaScript event that appears in your application. So this wasn't too long a blog post, but he really goes into each of the steps you would need to do, and how simple it can be to set up this kind of pub/sub event system using Postgres's LISTEN and NOTIFY features.
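Any client that holds a connection can subscribe the same way. Here's a minimal sketch of what that looks like straight from psql, assuming the trigger above is in place; the orders column shown is hypothetical, and the post itself does this from a Ruby/Sequel client rather than psql.

```sql
-- Session 1: subscribe to the channel the trigger publishes on.
LISTEN events;

-- Session 2: any write to the table fires the trigger.
INSERT INTO orders (customer_id) VALUES (42);  -- hypothetical column

-- Back in session 1, psql reports the notification the next time it runs a
-- command, with the JSON payload built by the trigger, something like:
--   Asynchronous notification "events" with payload
--   "{"table" : "orders", "action" : "INSERT", "data" : {...}}" received.
```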
[00:03:41] The next post is Audit Logging Using JSONB in Postgres, and this is from Denish Patel's blog, pateldenish.com. He talks about different ways of doing DML auditing in PostgreSQL, so tracking what inserts, updates, deletes, etc. are happening. One method he mentions is to use a trigger that then inserts the changed data into a separate table. Another way is to turn on a high level of logging in PostgreSQL. But he actually comes up with an approach that is similar to the last post we looked at: using JSONB fields to store the content of the data that was changed. So first he creates a simple users table tracking basic user information. He then creates a separate schema called audit and a table in that schema called users_audit that has a timestamp for when it happened, what the operation was, the username that did it (again, similar to the previous post), and then he actually records what the row looked like before the modification and what it looked like after the modification. He then creates a function: if it's an INSERT, he uses to_jsonb on the new row and stores it in the after column; if it's an UPDATE, he again uses to_jsonb, the old row goes in the before column and the new row goes in the after column; and in the case of a DELETE he just takes what the old row was and puts it in the before column, because there's no "after" with a delete. He then, of course, creates a trigger that runs this function for each row. And then you can look at what this audit table looks like: he inserts some rows and then does a SELECT from the users_audit table where, in the after column, the user ID is 101, and you get to see the before and after for an insert and which user performed it. Then he goes into creating GIN indexes to make access more efficient. So definitely an interesting take on how to do auditing using JSONB fields, and pretty simple to do if you're looking to do that.
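Here's a minimal sketch of that setup, assuming a hypothetical users table with just a couple of columns. It's not his exact schema, but it shows the audit schema, the before/after JSONB columns, the to_jsonb calls per operation, and a GIN index on the after column.

```sql
-- Hypothetical table being audited.
CREATE TABLE users (
  id         bigint PRIMARY KEY,
  email      text,
  updated_at timestamptz DEFAULT now()
);

CREATE SCHEMA IF NOT EXISTS audit;

-- One audit row per change: when, what operation, who, before and after images.
CREATE TABLE audit.users_audit (
  audit_ts  timestamptz NOT NULL DEFAULT now(),
  operation text        NOT NULL,
  username  text        NOT NULL DEFAULT current_user,
  before    jsonb,
  after     jsonb
);

CREATE OR REPLACE FUNCTION audit.users_audit_trg()
RETURNS trigger
LANGUAGE plpgsql AS $$
BEGIN
  IF TG_OP = 'INSERT' THEN
    INSERT INTO audit.users_audit (operation, after)
    VALUES (TG_OP, to_jsonb(NEW));
  ELSIF TG_OP = 'UPDATE' THEN
    INSERT INTO audit.users_audit (operation, before, after)
    VALUES (TG_OP, to_jsonb(OLD), to_jsonb(NEW));
  ELSE  -- DELETE: there is no "after" row
    INSERT INTO audit.users_audit (operation, before)
    VALUES (TG_OP, to_jsonb(OLD));
  END IF;
  RETURN NULL;  -- AFTER trigger, return value is ignored
END;
$$;

CREATE TRIGGER users_audit_trigger
AFTER INSERT OR UPDATE OR DELETE ON users
FOR EACH ROW EXECUTE PROCEDURE audit.users_audit_trg();

-- GIN index so containment queries like  WHERE after @> '{"id": 101}'  stay fast.
CREATE INDEX ON audit.users_audit USING gin (after);
```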
The next post is also related to JSON, and it's a video: an introduction to JSON data types in PostgreSQL. This is about a 40-minute presentation from one of the principal contributors to the JSON functionality in PostgreSQL, and I found the video pretty informative. It does take a while to get up to speed, though, so if you want the introductory information, I suggest watching the whole video. Roughly, from the zero to five-minute mark he talks about JSON structure in general, JavaScript Object Notation. From about the five to ten-minute mark, he covers what features Postgres has added for JSON support over the past number of versions, over the last few years. From about the ten to fifteen-minute mark, he gives specific advice and suggests use cases for it. And then starting around the fifteen-minute mark, he gets into actual code, using the functions and the kinds of things you can do with them. So, definitely a video I suggest you check out if you want to learn more about the JSON capabilities in PostgreSQL. And I should say this is from the 2ndQuadrant.com blog.
[00:07:12] The next post is Using procedures for batch geocoding and other batch processing. So for a while PostgreSQL has had functions, and we've just seen examples in the content I've presented where a function does a certain amount of work. Well, in version 11 of PostgreSQL, they're offering procedures. So what's the difference? Functions do all their work and commit, or if there's a failure, essentially everything rolls back. You can't do intermediate work, commit it, and then roll back to some other state if there's a problem. Procedures, however, support issuing COMMIT or ROLLBACK commands within the procedure. Now, I had never thought of this before, but as this post points out and discusses, that is a great feature for doing batch processing. Because if you have a really long-running task, the problem with a function is that it can easily fail and then you need to restart it from scratch; essentially it's all or none. Whereas with a procedure you can incrementally commit work that's been done and then come back to it if you need to, or re-execute it and correct issues. So, definitely an interesting blog post to check out if you do this type of batch processing or geocoding with your application. And this is from the Postgres OnLine Journal blog.
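As a rough illustration of why that matters for batch jobs, here's a minimal sketch, not taken from the post: the table and the stand-in "geocoding" update are hypothetical. The point is the COMMIT after each batch, so a failure partway through doesn't throw away the work already done.

```sql
-- Hypothetical work queue: rows that still need geocoding.
CREATE TABLE addresses (
  id       bigserial PRIMARY KEY,
  raw      text    NOT NULL,
  geocoded boolean NOT NULL DEFAULT false
);

CREATE OR REPLACE PROCEDURE geocode_in_batches(batch_size int DEFAULT 1000)
LANGUAGE plpgsql AS $$
DECLARE
  rows_done int;
BEGIN
  LOOP
    -- Stand-in for the real geocoding work on one batch of rows.
    UPDATE addresses
       SET geocoded = true
     WHERE id IN (SELECT id
                    FROM addresses
                   WHERE NOT geocoded
                   LIMIT batch_size);

    GET DIAGNOSTICS rows_done = ROW_COUNT;
    EXIT WHEN rows_done = 0;

    COMMIT;  -- allowed inside procedures (PostgreSQL 11+), not inside functions
  END LOOP;
END;
$$;

-- Procedures are invoked with CALL rather than SELECT.
CALL geocode_in_batches(500);
```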
[00:08:34] The next post is a beginner's guide to PostgreSQL's UPDATE and autovacuum, and this is from the Cybertec-postgresql.com blog. This goes back to how Postgres handles updates, and how, in their experience dealing with clients, some applications don't take into account that an update actually creates a new row and retains the old row, so that other connections can still see that information until everything has been committed. As a consequence, if you have a lot of updates going on, you can really increase the number of rows in your table. That is, until VACUUM comes around to clean up the dead rows. But even VACUUM can have issues if you have a very high update rate, particularly with long-running transactions. So for example, say a first connection starts a transaction, does a DELETE, and keeps running, doing things. A second connection comes in and can still see the rows, even the ones that have been deleted. And if you run a VACUUM at this stage, it doesn't remove anything, because nothing has been committed. So essentially VACUUM has to wait until those changes are committed before it can actually clean up and potentially make space available. They have a quote here, basically a rule of thumb: long transactions can delay cleanup and cause table bloat. Basically, your table keeps expanding in size even though the live data is actually relatively small. Plus, you need to take into account that just running a VACUUM or an autovacuum doesn't necessarily shrink the size of the table: it makes space available for reuse, but it doesn't literally compact the table. You actually need to do a VACUUM FULL, which locks all access to the table, or use one of a couple of different tools that can compact it while access is happening. Lastly, it also goes into the dangers of having a particular row that receives massive numbers of updates per second, which can really hammer the system and prevent VACUUM from doing its job unless you tune it very aggressively. In some cases it's a lot easier to, as they say here, quote, avoid UPDATE altogether and instead change your data structure so that you're predominantly doing inserts. If you're doing predominantly inserts, you don't have to worry as much about concurrency, and you don't have to worry as much about VACUUM keeping up. And if you take this technique, then typically you would want to partition that data by some sort of date; then you can simply drop the old tables when you need to, as opposed to storing all of this data and doing frequent updates. So if you're unfamiliar with how Postgres's UPDATE, VACUUM, and autovacuum work together, in terms of planning how you want your application to function, this is definitely a blog post to check out.
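As a rough sketch of that insert-mostly, partition-by-date idea, here's my own illustration rather than the post's code, with hypothetical table and partition names, using declarative partitioning.

```sql
-- Append-only events table, partitioned by date (PostgreSQL 10+).
CREATE TABLE events (
  event_time timestamptz NOT NULL,
  payload    jsonb
) PARTITION BY RANGE (event_time);

CREATE TABLE events_2018_07 PARTITION OF events
  FOR VALUES FROM ('2018-07-01') TO ('2018-08-01');

CREATE TABLE events_2018_08 PARTITION OF events
  FOR VALUES FROM ('2018-08-01') TO ('2018-09-01');

-- New data is only ever inserted, never updated in place.
INSERT INTO events (event_time, payload)
VALUES (now(), '{"action": "signup"}');

-- Retiring old data is just dropping a partition: no dead rows to vacuum
-- and no table bloat, unlike a mass DELETE or UPDATE.
DROP TABLE events_2018_07;
```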
Now, related to that, the next blog post is Modeling for Concurrency, because it goes hand in hand with what we were discussing previously. This is from the tapoueh.org blog, and it follows his PostgreSQL Concurrency: Isolation and Locking series, where he talks about how best to model for concurrency. Basically, he reaches the same conclusion the previous post did. In his scenario he's using tweets as part of his series, and when you want to track how many retweets or favorites are happening, to avoid concurrency issues you actually want to use inserts as opposed to updates. So he proposes a structure for converting from updates to inserts, which gives a more practical example of how to do the switch discussed in the previous post: focus on using inserts as opposed to updates, because that definitely helps with concurrency, and also with table bloat and autovacuum issues. Now, related to this, if you're doing a lot of inserts, you're going to be creating a lot of data, and when you go to query those many, many rows, it can potentially take a while to retrieve results. So in the next post he talks a little bit about that: he covers computing and caching, and again, this is from the tapoueh.org blog. With regard to caching, he talks about some approaches you can use. Views, he says, are a way to efficiently query exactly what you want; however, a view always goes to the raw tables, so not a lot of caching is involved there. He does, however, talk about materialized views. This is where you create what is essentially a separate summary table from your view, and you can refresh it on a periodic basis. That's what he means by caching: you refresh a materialized view and essentially have cached data, to make queries of, say, tweet counts more performant. So these are two blog posts on concurrency, related to the previous post, that can give you some ideas on how to better model your data.
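A minimal sketch of that pattern, not his actual schema, with hypothetical table and column names: counts are recorded as inserted rows rather than updated counters, and a materialized view provides the cached aggregates.

```sql
-- Instead of  UPDATE tweets SET retweet_count = retweet_count + 1 ...
-- record each action as its own row (insert-only, no contended counter row).
CREATE TABLE tweet_activity (
  tweet_id   bigint      NOT NULL,
  action     text        NOT NULL,  -- e.g. 'retweet' or 'favorite'
  created_at timestamptz NOT NULL DEFAULT now()
);

INSERT INTO tweet_activity (tweet_id, action) VALUES (42, 'retweet');

-- Cached summary of the raw activity rows.
CREATE MATERIALIZED VIEW tweet_counts AS
  SELECT tweet_id,
         count(*) FILTER (WHERE action = 'retweet')  AS retweets,
         count(*) FILTER (WHERE action = 'favorite') AS favorites
    FROM tweet_activity
   GROUP BY tweet_id;

-- A unique index lets the refresh run without blocking readers.
CREATE UNIQUE INDEX ON tweet_counts (tweet_id);

-- Re-run periodically (cron, a job scheduler, etc.) to refresh the cache.
REFRESH MATERIALIZED VIEW CONCURRENTLY tweet_counts;
```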
The next post is actually a two-parter. The first one is Should I Run Postgres on Kubernetes? Part One, and the second is Should I Run Postgres on Kubernetes? Part Two. The first one covers why you would potentially want to do it. And just to run over the summary: basically, if you're already using Kubernetes for your application, then it can make sense to do this for Postgres as well, because it helps you adopt a more DevOps workflow where each development team owns its database. And perhaps these databases are relatively small; he makes a reference to zero brain cell admin for small databases. In other words, if you have pretty small databases for an application, like he says, the vast majority of Heroku's hosted databases are less than 1GB, well then you want to make it as easy as possible to manage these databases, and perhaps Kubernetes is the way to do that. You just need to deal with the state issue, because you need to maintain state within the Kubernetes instance or instances for your database. He also says one thing about running Postgres on Kubernetes is that you could potentially have an easier path to high availability, because it has a number of built-in features to handle coordinating multiple PostgreSQL instances. Now, in the second post he talks about some of the downsides. Number one, if you're not a Kubernetes shop, you probably wouldn't want to do it, because you have a whole other level of infrastructure to learn.
[00:15:13] The second is that everything is alpha. Kubernetes is still moving fast and there's potential risk of things not working, so if you're looking for a stable environment, it's probably not the best thing to move to. The other thing he mentions is performance: even though Kubernetes is supposed to be fairly performant, if you're looking to squeeze the most performance out of your database, there's still an overhead to running it there. And then of course, he talks about how large databases running in a Kubernetes environment are a little bit more difficult to work with, and you may not want to choose that path. So these are two interesting blog posts to check out related to using Postgres with Kubernetes, and I definitely suggest checking them out if you're considering doing it.
[00:15:59] The next post is Gracefully Scaling to 10,000 (10k) PostgreSQL Connections for $35 a Month, Part Three. Now, in previous episodes of Scaling Postgres, I talked about part one and part two, and the series is basically about using PgBouncer to scale your connections for Postgres. In this third part, he addresses how to do it using Kubernetes and Helm. So, basically not running PostgreSQL on Kubernetes, but PgBouncer. I mention this just to close out the post series; it talks about setting up this newer technology, Helm, along with Kubernetes, to be able to run multiple PgBouncers in front of your PostgreSQL instance. So again, if you're interested in Kubernetes as it relates to PostgreSQL and PgBouncer, it's definitely a blog post to check out. And this is from the FutureTech Industries blog on Medium.
[00:16:56] The last three posts are all from the Severalnines.com blog. The first post is Custom Trigger Based Upgrades for PostgreSQL. I actually found the intro part of this the most interesting. It says: first rule, you do not upgrade PostgreSQL with trigger-based replication. Second rule, you do not upgrade PostgreSQL with trigger-based replication. Third rule, if you upgrade PostgreSQL with trigger-based replication, prepare to suffer, and prepare well.
[00:17:34] And then the other great part here, he says, is that there must be a very serious reason not to use pg_upgrade for upgrading PostgreSQL. And that's basically what I tend to use.
[00:17:47] If you can't afford more than a few seconds of downtime, then use pglogical. And I have presented in previous episodes of Scaling Postgres how, I believe, 2ndQuadrant has used pglogical to do a version upgrade of their own systems. But then, if you can't use pglogical, use the next option he lists, and if you can't use that, use the one after that. So it's definitely good advice to take. And then at the end, they talk about how you could, if you really had to, use trigger-based upgrades for PostgreSQL, and that is basically what this post walks through. So, definitely an interesting post to check out, and good advice to take.
[00:18:24] The next post from their blog is PostgreSQL Log Analysis with pgBadger. So this is a way of analyzing your logs. The first thing they talk about is how to set up PostgreSQL to log as much information as possible, essentially turning on all the relevant logging, and then how to use pgBadger to basically track the status of your PostgreSQL instance. So if you don't already have a solution for that, this is definitely a blog post to check out.
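For flavor, here's a sketch of the kind of verbose logging a pgBadger setup typically relies on. These are standard postgresql.conf parameters, shown here via ALTER SYSTEM, and not necessarily the exact list from the post; settings like logging every statement can be expensive, so tune the thresholds for your workload.

```sql
-- Requires superuser; writes the settings to postgresql.auto.conf.
ALTER SYSTEM SET log_min_duration_statement = 0;   -- log every statement with its duration
ALTER SYSTEM SET log_checkpoints = on;
ALTER SYSTEM SET log_connections = on;
ALTER SYSTEM SET log_disconnections = on;
ALTER SYSTEM SET log_lock_waits = on;
ALTER SYSTEM SET log_temp_files = 0;               -- log all temp file usage
ALTER SYSTEM SET log_autovacuum_min_duration = 0;  -- log all autovacuum activity

-- Reload so the new settings take effect without a restart.
SELECT pg_reload_conf();
```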
And the last post from their blog is Integrating Tools to Manage PostgreSQL in Production. So again, related to their pgBadger post, they talk about different tools you can use for management and deployment, how to do backups and the different backup scenarios that are available, and they specifically go into repmgr to talk about some of its capabilities. So definitely another blog post to check out.
[00:19:18] That does it for this episode of Scaling Postgres. You can get links to all the content presented in the show notes. Be sure to head over to Scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.