Partitioning, Logical Replication Upgrade, Columnar Compression, HAProxy Connections | Scaling Postgres 88

Episode 88 November 04, 2019 00:14:34



Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, we discuss partitioning, logical replication upgrades, columnar compression and HAProxy connections.

To get the show notes as well as get notified of new episodes, visit: 

https://www.scalingpostgres.com/episodes/88-partitioning-logical-replication-upgrade-columnar-compression-haproxy-connections/


Episode Transcript

[00:00:00] In this episode of Scaling Postgres, we talk about partitioning, logical replication upgrades, columnar compression and HAProxy connections. I'm Creston Jamison. This is Scaling Postgres episode 88.

[00:00:22] Alright, I hope you're having a great week. The first piece of content is "Webinar: PostgreSQL Partitioning Follow Up" from 2ndquadrant.com. This is a webinar about partitioning with PostgreSQL, particularly a number of the new features that came with version 12. Most of them relate to performance, but others add more foreign key capabilities and other enhancements for referential integrity. This was a great piece of content, though a little long: the webinar replay was over an hour and a half, I believe, with some questions at the end, but it is filled with a lot of content. It explains partitioning, how you can set it up, things that work well, and some things that don't quite work yet. For example, REINDEX doesn't apply to a partitioned table: you can run it on individual partitions, but not on the parent and not concurrently. Similarly, when you create an index on a partitioned table, it isn't created concurrently on the partitions; you would have to do that manually. This presentation has a lot of great information, so I definitely encourage you to check it out. You can just click the link here to view the complete recording; you do have to fill out a contact form, but then you're able to watch it immediately.

[00:01:47] The next post is "Upgrading Postgres major versions using logical replication" from cybertec-postgresql.com. They had a previous post covering different ways to do version upgrades with PostgreSQL, and this one uses logical replication. The first thing it discusses are the benefits: minimal downtime, because once you have the new system set up and replicating you can choose to switch over at any point you want; flexibility, because you can target just the data you want to replicate (it's logical replication, not binary replication); safety, in that you can always go back to the prior data if you need to; and plenty of opportunities to test and make sure everything is okay. The cons: quite a few steps to set it up, it works per database rather than per cluster, it can take a long time for big and highly active databases, and large objects need to be exported and imported manually. Then it goes through the process: turn on logical replication on the old primary, make sure all the tables have a primary key or, if not, set a REPLICA IDENTITY (there's a little query so you can identify any tables that don't meet that criterion), set up the new database, and then export the schema as well as the roles. One way to do that is a pg_dumpall with --globals-only plus a pg_dump of the specific database with --schema-only. During this type of upgrade you ideally want to keep your schema in a static state. Then you create a publication on the old database and a subscription on the target database, and they include a warning: just be aware that this creates a replication slot on the old database.
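As a rough sketch of the setup steps just described (the database, publication, subscription and connection details below are placeholders, not taken from the article):

-- On the old (source) database: find tables without a primary key; these need a
-- REPLICA IDENTITY set before logical replication can replicate updates and deletes.
SELECT n.nspname, c.relname
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r'
  AND n.nspname NOT IN ('pg_catalog', 'information_schema')
  AND NOT EXISTS (
        SELECT 1 FROM pg_constraint p
        WHERE p.conrelid = c.oid AND p.contype = 'p');

-- On the old database, once roles (pg_dumpall --globals-only) and the schema
-- (pg_dump --schema-only) have been loaded into the new cluster:
CREATE PUBLICATION upgrade_pub FOR ALL TABLES;

-- On the new database; note that this creates a replication slot on the old server:
CREATE SUBSCRIPTION upgrade_sub
  CONNECTION 'host=old-host dbname=mydb user=replicator'
  PUBLICATION upgrade_pub;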
One caveat about that replication slot: if the connection breaks down or replication stops for any reason while the primary keeps running, WAL files can start to accumulate on the old database, so keep that in mind. Then check the replication process: make sure there are no errors, the replication slot is active, and all tables are actively replicating on the subscriber side; they mention that srsubstate in pg_subscription_rel should equal 'r' for all tables. After that, go through your own verification process, and then it's switchover time. In their particular case they shut down the old instance and start the old database in read-only mode. I don't think that's strictly a requirement, but it's definitely the safer way to make sure you have all the data, because you could choose to start using the new database at whatever point you want. You'll have to work out the details of the actual switchover: do you want downtime, and how can you make it as short as possible? Once you're done, you just drop the subscription on the new database. So this blog post walks through the process of upgrading your database using logical replication. If you want to explore this possibility, it's definitely one to check out.

[00:05:09] The next post is "Building columnar compression in a row-oriented database" from the timescale.com blog. They've added native compression to TimescaleDB, and their technique has resulted in 90% to 96% compression, which is pretty dramatic. Postgres, of course, is a row-oriented relational database system; it doesn't natively store data in a columnar format, so I was interested in what they did here. Even if you're not interested in TimescaleDB, I particularly liked the technique they describe, so I'll scroll down to some of the more technical details. You typically have row-oriented data: a timestamp, a device, a status code and a temperature for this set of data. What they do is clump rows together and store them in an array format, so the data is stored by column: all the timestamps are converted into essentially a one-row array, as are the devices and the status codes, and the position within each array identifies which row a value belongs to; a given row's values always sit at the same position in every array. By doing it this way, they can use a compression method suited to each data type to get dramatically higher levels of compression: one method for timestamps, another for integers, another for floats. Simply reducing the number of rows also reduces the per-row storage overhead. They use some other techniques as well, like storing min and max ranges per segment and keeping much of this data in TOAST in compressed form. For example, alongside the compressed timestamp and temperature arrays they store the min and max timestamp, so queries can skip irrelevant segments much faster. They also include some performance benchmarks: the compression did very well, and in a lot of cases queries even performed better than on uncompressed data. I found this blog post really interesting and fascinating.
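To make that row-to-column transformation more concrete, here is a small, made-up illustration in plain SQL of collapsing row data into per-column arrays with min/max metadata; it only conveys the idea and is not TimescaleDB's actual implementation:

-- Ordinary row-oriented readings.
CREATE TABLE measurements (
  ts          timestamptz,
  device_id   int,
  status_code int,
  temperature float8
);

-- One "segment" row per device: each column becomes an array, positions line up
-- across the arrays, and min/max timestamps let queries skip whole segments.
CREATE TABLE measurement_segments AS
SELECT device_id,
       array_agg(ts          ORDER BY ts) AS ts_arr,
       array_agg(status_code ORDER BY ts) AS status_arr,
       array_agg(temperature ORDER BY ts) AS temp_arr,
       min(ts) AS min_ts,
       max(ts) AS max_ts
FROM measurements
GROUP BY device_id;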
We've covered other blog posts that talk about using ZFS or other compressed file systems to reduce the size of the database and, in some cases, boost performance, but this is another very interesting method: actually converting the data into a more column-based format. So if you're interested in learning more about that, it's definitely a blog post to check out.

[00:08:07] The next post is "PostgreSQL application connection failover using HAProxy with xinetd", a follow-on from their post on simple application failover using libpq features. The idea is to connect through HAProxy, which then routes you to the read-write primary or to read-only replicas, and you can use HAProxy's load balancing features. It works in conjunction with xinetd, the extended Internet services daemon, which can listen on custom ports and respond to requests by executing custom logic. In this case they have a bash script that gets called to check the status of a database instance: it runs a query asking whether the server is in recovery. If it's in recovery, it returns a 206 status code, which marks it as a standby; a 200 classifies it as a primary; and any other value returns a 503 Service Unavailable. HAProxy runs this check against each database so it can route connections to the proper location, and they show how to configure HAProxy to use this xinetd service. So if you're interested in using HAProxy to manage connections between a primary and its replicas, and potentially to handle failover, it's definitely a blog post to check out.

[00:09:36] The next post is "Virtual computed columns in PostgreSQL 12". Again, this is about the new generated columns feature, and it's another developer's take on it. He's primarily interested in using it for text search over JSON: taking a JSON blob he received, extracting, say, the title, and putting it into a generated column, then going a bit further and converting it to a tsvector. He mentions some things to be aware of: you need to specify a text search configuration such as 'english', because expressions in generated columns must be immutable, and since the default configuration can change, the single-argument form of to_tsvector is not immutable. Even a function such as concat doesn't work with GENERATED ALWAYS; you have to use the concatenation operator, the two pipes (||), to bring text together if, as in his case, you want to search on both the name and the title. So it's a relatively simple post covering generated columns for a text search use case. Related to that, he has a second post called "Fine Tuning Full Text Search with PostgreSQL 12" that goes more in depth. He continues to use generated columns but digs deeper into searching itself: he lays out a scenario with three different individuals wanting to do different types of searches for a conference, then talks about parsing out the data, using different dictionaries to get different search results, applying weights to keywords, and ranking the results.
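As a hedged sketch of what such a generated, weighted tsvector column can look like (the table name and JSON keys are invented for illustration, not the author's exact schema):

CREATE TABLE talks (
  id     bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  body   jsonb NOT NULL,
  title  text GENERATED ALWAYS AS (body ->> 'title') STORED,
  -- The two-argument to_tsvector('english', ...) is immutable, and || joins the
  -- weighted vectors, which is why this works where concat() and the
  -- one-argument form do not.
  search tsvector GENERATED ALWAYS AS (
           setweight(to_tsvector('english', coalesce(body ->> 'title', '')), 'A') ||
           setweight(to_tsvector('english', coalesce(body ->> 'name',  '')), 'B')
         ) STORED
);

CREATE INDEX talks_search_idx ON talks USING gin (search);

SELECT id, title,
       ts_rank(search, websearch_to_tsquery('english', 'postgres')) AS rank
FROM talks
WHERE search @@ websearch_to_tsquery('english', 'postgres')
ORDER BY rank DESC;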
This is a very comprehensive post that covers a lot of detail if you want to learn more about full text search. What's also interesting is the discussion in the comments, where a few individuals weigh the benefits of storing the data in generated columns versus computing it dynamically with expression indexes in PostgreSQL. So if you're going to check out this post, definitely be sure to check out the comment section as well. And I should mention, these posts are from the rob.conery.io blog.

[00:12:03] The next post is "Monitoring PostgreSQL Databases using PMM", from percona.com. PMM is Percona Monitoring and Management, an open source tool set. It comes as a Docker image, and the post goes through the process of installing it; it looks like it uses Grafana and Prometheus to monitor your PostgreSQL instance. It covers the whole installation and setup process, including the different permissions required. So if you are looking for another monitoring tool for PostgreSQL, it's definitely another one to check out.

[00:12:41] The next post is "Benchmark Partition Table - 1", from Embeniaarnings Blogspot.com. With all the new partitioning features, they've actually altered pgbench, and I believe this is for PostgreSQL version 13. It's a patch that adds partitioning options to the pgbench benchmarking tool, letting you choose the partitioning method and the number of partitions, and the post runs some tests and looks at the results. So if you're interested in using pgbench with the upcoming version 13, it's definitely a blog post to check out.

The next post is "Postgres CASE Statement Basics by Example", from the Pinopoly blog. It's a very simple post that covers the CASE statement, which is basically a conditional expression in PostgreSQL and the SQL standard in general. If you want a refresher on that, it's a blog post to check out. And lastly, "How to Query with PostgreSQL Wildcards like a Pro". This is another simple post on the different ways you can query using the wildcard characters: the percent sign, which matches any sequence of characters, and the underscore, which matches a single character. It also covers LIKE and ILIKE, as well as using the lower function to search the database while ignoring case. So if you're interested in that type of content, it's definitely a blog post to check out.

[00:14:11] That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.
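As a quick illustration of the wildcard matching mentioned above (the table and values are made up for this example):

-- '%' matches any sequence of characters, '_' matches exactly one character.
SELECT name FROM users WHERE name LIKE 'Jo%';   -- 'Jo', 'Joe', 'John', ...
SELECT name FROM users WHERE name LIKE 'J_n';   -- 'Jan', 'Jon', but not 'John'

-- ILIKE is the case-insensitive variant; lower() achieves the same effect and can
-- take advantage of an expression index on lower(name).
SELECT name FROM users WHERE name ILIKE 'john%';
SELECT name FROM users WHERE lower(name) LIKE 'john%';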
