Episode Transcript
[00:00:00] In this episode of Scaling Postgres, we talk about active-active federation, memory speed, table rewrite causes and the PostgreSQL shibboleth. I'm Creston Jameson and this is Scaling Postgres, episode 146.

Alright, I hope you had a great holiday and I hope you're having a great new year. And as always, I hope that your friends, coworkers and family continue to do well. Our first piece of content is "Active-active PostgreSQL federation on Kubernetes". This is from Crunchydata.com and they're talking about a process of essentially creating a set of federated PostgreSQL instances where all of them are essentially primaries or masters. Now, for this post you can actually ignore the Kubernetes part, because the interesting part is how you maintain these active instances, and they're using a combination of logical replication and partitioning. So what they did is they set up three different PostgreSQL clusters or three different instances. Now, their example here uses Kubernetes, but you don't have to use it that way. Basically you have three different instances: they called one Hippo East, one Hippo Central and one Hippo West. Now, I'm going to skip ahead to the partitioning scheme because that helps you understand a little bit about what this post is talking about. So they created a hippos table, and they're using a UUID for every ID in it. They're doing this to avoid using sequences, because sequence state doesn't get carried over with logical replication, and if you're doing this type of federation it's easier to just use a UUID for your IDs. Then they set up a default partition they called hippo_default, and then partitions for east, central and west. So for each geographical server they set up a separate partition, and each instance only inserts data into its own partition. So East will only insert into the east partition, the Central server will only insert data into the central partition, and the West will only insert into the west partition. Now, they did some additional steps here where they set up a way to identify each cluster: they actually insert the data into the default partition and then a trigger moves it into the proper partition based upon the node name. But if you have a way to insert the data directly into the right partition, like your application knows it's only inserting data into, say, Hippo East, I don't think you need the additional steps they've laid out here; basically each instance inserts its data into its own partition. And then what you do is set up the sharing, where you create a publication on each of the nodes. So for example, you create a publication on Hippo East for the east partition, you create a publication on Hippo Central for the central partition, and then you have every node subscribe to every other node's publication. So Hippo Central needs to subscribe to East and West, Hippo West needs to subscribe to East and Central, and of course Hippo East subscribes to Central and West. Once you get this all working, all the data is synchronized across all the different nodes, placed into the proper partition on each node. He shows an example here where he queried East, and you can see all the data from the different nodes because he's querying the parent table, and you can see that it is properly placed in all of the different partitions.
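To make the mechanics concrete, here is a minimal sketch of that scheme in SQL. The table and partition naming follows the post, but the column list, publication and subscription names, and the connection string are my own assumptions for illustration (gen_random_uuid() is built in as of Postgres 13; earlier versions need the pgcrypto extension):

```sql
-- Run the table and partition definitions on every node.
CREATE TABLE hippos (
    id         uuid NOT NULL DEFAULT gen_random_uuid(),  -- UUIDs instead of sequences
    node_name  text NOT NULL,                            -- which node inserted the row
    value      text,
    created_at timestamptz NOT NULL DEFAULT now(),
    PRIMARY KEY (id, node_name)
) PARTITION BY LIST (node_name);

CREATE TABLE hippo_default PARTITION OF hippos DEFAULT;
CREATE TABLE hippo_east    PARTITION OF hippos FOR VALUES IN ('east');
CREATE TABLE hippo_central PARTITION OF hippos FOR VALUES IN ('central');
CREATE TABLE hippo_west    PARTITION OF hippos FOR VALUES IN ('west');

-- On Hippo East: publish only the partition this node writes to.
CREATE PUBLICATION pub_east FOR TABLE hippo_east;

-- On Hippo Central and Hippo West: subscribe to East's publication
-- (and likewise for every other node's publication).
CREATE SUBSCRIPTION sub_east
    CONNECTION 'host=hippo-east dbname=hippo user=replicator'
    PUBLICATION pub_east;
```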
And if you want to set up another node, you just set up another node: define a new node name for it, add a new partition for that node to the partition scheme that exists across all the nodes, set up a publication for it on that node so its partition gets distributed, and then subscribe to it from the other nodes. And they describe how to do that here. They do discuss some other alternatives. For example, you could use a sequence instead of a UUID by defining that, say, on one node the sequence starts at one and increments by three, on the next node it starts at two and increments by three, and on the third it starts at three and increments by three (a sketch of that follows below). But that's going to be a problem when you try to add new nodes; it becomes very complicated very quickly. The only downside they mention to the UUID is that you may run into a collision, so you're going to have to plan for that or figure out how to address it. But this was a very interesting post on how you could set up federated PostgreSQL with multiple master nodes. So if you're interested in that, definitely check out this post from Crunchydata.com.
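For reference, here is roughly what that staggered-sequence alternative might look like; the sequence name and the step of three are just illustrative, and you can see why adding a fourth node later would force you to re-space every sequence:

```sql
-- A hedged sketch of the staggered-sequence alternative (three nodes).
-- On Hippo East:
CREATE SEQUENCE hippos_id_seq START WITH 1 INCREMENT BY 3;  -- 1, 4, 7, ...
-- On Hippo Central:
CREATE SEQUENCE hippos_id_seq START WITH 2 INCREMENT BY 3;  -- 2, 5, 8, ...
-- On Hippo West:
CREATE SEQUENCE hippos_id_seq START WITH 3 INCREMENT BY 3;  -- 3, 6, 9, ...

-- Each node then uses its local sequence for its own inserts, for example:
-- INSERT INTO hippos (id, node_name, value)
--     VALUES (nextval('hippos_id_seq'), 'east', 'example');
```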
[00:04:46] The next piece of content is "PostgreSQL 13 Benchmark: Memory Speed vs. Transactions Per Second". This is from Crunchydata.com and they're examining what happens when you increase the memory speed of a particular instance from 2133 MHz, basically asking the question: does memory speed matter in terms of transactional throughput? Looking at the scaling with different numbers of clients, you can see the more clients you have, the bigger the difference gets, but on average, from his analysis, it looks like about a 3% improvement from the faster memory. So not a huge improvement, but it can give you a little bit. If you're interested in this analysis, go ahead and check out this post.
[00:05:31] The next piece of content is "What PostgreSQL SQL causes a table rewrite?". This is from thatguyfromdelhi.com and he has presented a great table here that shows you which operations require a table rewrite, meaning you're essentially going to use roughly double the disk space of the table while the command runs. For example, if you alter the table and set it to a new tablespace, it's going to rewrite that whole table, so you're going to use up twice the disk space of that particular table. Or if you alter the table and set it to be logged or unlogged, both of those operations require a full table rewrite. And he's also listed this by version, from 9.5 to 13. As you can see here, the great improvement that was made in version 11 is that adding a column with a default no longer requires a table rewrite, so that was a great enhancement. So as of right now, I've mentioned setting the tablespace and setting logged or unlogged; of course when you cluster, that's reordering the data, so that rewrites the whole table, as does truncating the table, although truncate really doesn't use up double the disk space, it basically, as he says, just rewrites the relfilenode. He also lists VACUUM FULL and, of course, altering column types. So when you alter a column type, say going from an int to a bigint, that requires a table rewrite, and altering a column's type to text requires a table rewrite. So this was a great table and blog post talking about SQL that can cause a table rewrite.
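As a quick reference, here are a few of the statement forms that table covers; the table, index and column names below are just placeholders:

```sql
-- Statements that rewrite the whole table (placeholder names):
ALTER TABLE measurements SET TABLESPACE fast_ssd;        -- move to another tablespace
ALTER TABLE measurements SET UNLOGGED;                    -- or SET LOGGED
ALTER TABLE measurements ALTER COLUMN id TYPE bigint;     -- e.g. int -> bigint
CLUSTER measurements USING measurements_pkey;             -- reorder rows by an index
VACUUM FULL measurements;

-- Since PostgreSQL 11, this no longer forces a rewrite:
ALTER TABLE measurements ADD COLUMN status text DEFAULT 'new';
```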
[00:07:06] The next piece of content is "The Shibboleth of PostgreSQL". This is from Cybertec-postgresql.com, and he's talking about a shibboleth, which Wikipedia defines as any custom or tradition, usually a choice of phrasing or even a single word, that distinguishes one group of people from another. They're basically talking about the Postgres community and how to write and pronounce PostgreSQL. The accepted ways to say it are "Postgres" or "PostgreSQL", pronounced post-gres-Q-L. Ways you should not pronounce it include "Postgre" or "Postgres-SQL". So if you're interested in learning more about that, you can check out this post.
[00:07:54] The next piece of content is a YouTube channel: PGDay San Francisco. For the last month or so they've been posting videos and they've got five on the channel now, presumably with more coming because they seem to be posting about one every week. It has content such as explaining EXPLAIN, advanced data types, vacuum through pictures, et cetera. So if you're interested in this type of video content, you can check out this channel on YouTube.
[00:08:22] The next piece of content is "Checkpoints in PostgreSQL", and they're talking about what a checkpoint is. This is from Highgo.ca. Basically, when data is changed in Postgres it's written to the WAL, and on a periodic basis a checkpoint is done that takes all the different changes that have happened in the WAL, or that still reside in memory, and flushes those changes to the actual database files. So it's making a consistent point where the data on disk essentially matches what's in memory. He says what a checkpoint does is: a WAL record of the checkpoint is written to the log, all dirty pages of data are written to disk, the checkpoint is written to the pg_control file, and then all the data files are synchronized. And he says what triggers a checkpoint: reaching max_wal_size, the limit on the maximum amount of WAL you want to maintain; hitting the checkpoint timeout, basically a timer expires and it needs to do a checkpoint; a backup being started with pg_basebackup or pg_start_backup being issued; a database server shutdown; or an actual CHECKPOINT command executed by an administrator. Then it goes over some of the ways that checkpointing can be controlled: through defining max_wal_size and checkpoint_timeout, which were mentioned up here, as well as checkpoint_completion_target, which controls how the checkpointing work is spread out over time, and checkpoint_warning, to let you know if checkpoints are happening, say, too frequently. So if you're interested in learning more about checkpoints, check out this post from Highgo.ca.
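As a rough illustration, these are the settings and the manual command being described; the specific values here are just examples, not recommendations from the post:

```sql
-- Inspect the current checkpoint-related settings:
SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('max_wal_size', 'checkpoint_timeout',
               'checkpoint_completion_target', 'checkpoint_warning');

-- Example (illustrative) values; a config reload is needed to apply them:
ALTER SYSTEM SET max_wal_size = '2GB';
ALTER SYSTEM SET checkpoint_timeout = '15min';
ALTER SYSTEM SET checkpoint_completion_target = 0.9;
SELECT pg_reload_conf();

-- Force a checkpoint manually (superuser):
CHECKPOINT;
```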
[00:09:54] The next piece of content is "How to install and configure PostgreSQL (Debian/Ubuntu) for developer use", part one and part two. Part one covers general installation. Now, of course, you could just install the package from the distribution's package manager, but this post walks through actually using the Postgres repository, which gives you access to all the different Postgres versions that exist, for doing your installation. It also discusses how to get it set up, how to connect to it using psql, and the different configuration you might want to do to be able to connect as a user. The second post covers a lot about configuring the system, such as altering log settings, defining a wal_level, enabling archive_mode as well as an archive_command, and looking at pg_stat_statements and auto_explain to understand what your queries are doing. Now again, this is for development servers, not necessarily production servers, but if you're interested in that, you can check out these posts from Depesz.com. Also from the same site, three posts were done on what's coming in version 14 of Postgres. The first is about pg_stat_statements, where you can now track the time at which all statistics were last reset. Previously, when you did a reset of pg_stat_statements it wasn't recorded when that happened, but in 14 you can get that information from the pg_stat_statements_info view, where you can look at the stats_reset column to know when it was reset. So that's a great addition. The next post is about support for multirange data types. In addition to supporting a single range within a value, you can now have multiple ranges, separated by commas inside curly braces. So that's an interesting implementation. Now, they did mention it doesn't work with indexes yet, but it's still a ways off until 14, so that may get implemented before version 14 goes live. And the last post is about allowing subscripting of hstore values. In addition to using the arrow operator (->) to pull out a value, you can now use subscripting with square brackets. Now, he said this doesn't actually work with indexing yet, but you can use it to pull out values. A sketch of these version 14 features follows below. So if you're interested in learning more about that, you can check out these posts.
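Here is a rough sketch of what those three Postgres 14 features look like in use; this assumes a server running the (then in-development) version 14, and the literals are placeholders:

```sql
-- When pg_stat_statements statistics were last reset (new view in 14):
SELECT stats_reset FROM pg_stat_statements_info;

-- Multirange types: several ranges in one value, written with curly braces:
SELECT '{[1,5), [8,12)}'::int4multirange;

-- Subscripting of hstore values (alongside the existing -> operator):
CREATE EXTENSION IF NOT EXISTS hstore;
SELECT ('a=>1, b=>2'::hstore)['a'];   -- same result as: ... -> 'a'
```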
[00:12:12] Next piece of content is "Postgres, PL/Python and SciPy/NumPy for processing images". So this post from Crunchy Data is about using a Django app that uses some of these Python tools to process images along with Postgres. So if you're interested in that, you can check out this post.
[00:12:31] The next set of blog posts are all from Devrim Gündüz's PostgreSQL blog, and they're talking about the RPM packages that are used on Red Hat, Fedora and CentOS. So the first post is what is new in the PostgreSQL 13 RPMs, and he covers all the different changes that have been made for the new ones. The second post covers LLVM issues with the PostgreSQL Yum repository on CentOS 8, and basically ways to address that. The third post is the RPM repository for unsupported PostgreSQL releases and distros, and basically how to get access to those.
[00:13:08] And the fourth is installing PostGIS 3.1 and PostgreSQL 13 on CentOS 8. So if you're interested in any of these posts about running Postgres on CentOS, Red Hat Enterprise Linux or Fedora, definitely check them out.
[00:13:23] The next piece of content is "Loading data into PostGIS: an overview". This is from Crunchydata.com and it covers numerous ways that you can import data, namely what they call vector data, into PostGIS and of course Postgres. It covers using command line tools such as shp2pgsql and ogr2ogr, as well as using raw SQL, and some GUI options such as QGIS. So if you're interested in that, you can check out this post from Crunchydata.com.
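For the raw SQL route, loading a geometry generally looks something like the following; the table name, SRID and coordinates are made up for illustration and assume the PostGIS extension is installed:

```sql
CREATE EXTENSION IF NOT EXISTS postgis;

-- A placeholder table with a point geometry column (SRID 4326 = WGS 84):
CREATE TABLE landmarks (
    id   bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    name text,
    geom geometry(Point, 4326)
);

-- Insert a feature using well-known text:
INSERT INTO landmarks (name, geom)
VALUES ('Example point', ST_GeomFromText('POINT(-122.4 37.8)', 4326));
```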
[00:13:56] Next piece of content is query load balancing in Pgpool-II. So this discusses how to do query load balancing in the Pgpool-II load balancer. So if you're interested in that, you can check out this post.
[00:14:10] And the last piece of content is timeouts in Pgpool-II connections. This is from pgsqlpgpool.blogspot.com and it describes how you can set up these timeouts.
[00:14:22] That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to Scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.