Postgres 16 RC1, Bi-Directional Replication, All About Parameters, Foreign Keys & Polymorphism | Scaling Postgres 280

Episode 280 September 03, 2023 00:21:09
Scaling Postgres

Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, we discuss the release of Postgres 16 RC1, implementing bi-directional replication, different ways you can set up Postgres parameters and how to handle polymorphism and foreign keys.

To get the show notes as well as get notified of new episodes, visit: 

https://www.scalingpostgres.com/episodes/280-postgres-16-rc1-bi-directional-replication-all-about-parameters-foreign-keys-polymorphism/

 


Episode Transcript

[00:00:00] In this episode of Scaling Postgres, we talk about Postgres 16 release candidate 1, bi-directional replication, all about parameters, and foreign keys and polymorphism. I'm Creston Jamison, and this is Scaling Postgres episode 280.

[00:00:25] Hello. All right, I hope you, your friends, family, and coworkers continue to do well. Well, I'm starting to plan out how to potentially do interviews, but I would like to get your feedback on who you would like me to interview. Would you be interested in talking with developers who use Postgres on a regular basis? Would you be interested in talking with people who actually commit to Postgres, or representatives from a company, to learn more about a technology? Go ahead and let me know in the comments. Or, if you're part of the email list, you can simply reply to my weekly email and let me know who you think it would be good to interview.

But for this week, the first piece of content is "PostgreSQL 16 RC1 Released." This is from postgresql.org. And there's only one change from beta 3, where they fixed a performance regression in COPY. So pretty much it should be unchanged from Postgres 16 beta 3. And it looks like, if there are no issues, it could be made generally available as soon as September 14. So that's only about two weeks away. So that's great news. But if you're eager to get your hands on Postgres 16, go ahead and download the release candidate and give it a try.

[00:01:36] Next piece of content: "Bi-directional replication using origin filtering in PostgreSQL." This is from postgresql.fastware.com. And this is something that was mentioned last week in Scaling Postgres by a separate blog post, where they were talking about their favorite things in Postgres 16, and they mentioned something leading to multi-master replication. I hadn't really heard that much about it, but this post actually goes into it. So basically, when you set up logical replication, you set up a publisher and you set up a subscriber, and apart from any filters you define on the columns and the rows, all that data just gets transferred. But there's a new feature that tracks origin, which is to say: did this data originate on the local system, or did it come in via logical replication from another node? And now, when you create a subscription, you can specify whether you want that origin to be any, so it just copies all the data (that's the current default), or you can specify none, which means, quote, "the subscription will request the publisher to only send changes that don't have an origin." So these are changes that originated on the publisher system and did not come from anywhere else. And they have some diagrams to explain that better and actually show you the syntax of how you could set this up. But where it gets interesting is that you could set up multi-master replication doing this. So you have a node one primary and a node two primary. On node one you set up a publication, and you set up a subscription to subscribe to the publication on node two. And then on node two you set up a publication, and you set up a subscription there to subscribe to the changes from node one, and you set the origin to none for both of these. So basically, the only changes that get transferred from node one to node two are those that originated on node one. And he has the syntax necessary to set up a structure like this.
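As a rough sketch of what that two-node setup can look like (the table name, connection strings, and publication and subscription names here are hypothetical, and it assumes the same table already exists on both nodes):

```sql
-- On node 1:
CREATE PUBLICATION pub_node1 FOR TABLE accounts;

-- On node 2:
CREATE PUBLICATION pub_node2 FOR TABLE accounts;

-- Back on node 1: subscribe to node 2, asking only for changes that
-- originated there (origin = none), and skip the initial copy.
CREATE SUBSCRIPTION sub_from_node2
    CONNECTION 'host=node2 dbname=app user=repuser'
    PUBLICATION pub_node2
    WITH (copy_data = false, origin = none);

-- On node 2: the mirror image.
CREATE SUBSCRIPTION sub_from_node1
    CONNECTION 'host=node1 dbname=app user=repuser'
    PUBLICATION pub_node1
    WITH (copy_data = false, origin = none);
```

Because each subscription only asks for changes without an origin, a row inserted on one node replicates to the other but is not sent back again, which is what prevents the infinite loop.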
So you create publications on each of the nodes, and then you set up the subscriptions, again specifying origin equals none, and then he shows how you can insert data into one node and it won't create an infinite loop replicating itself. And he shows how you can even add a third primary node. But because this is so new, there are definitely limitations. The first is that adding a new primary node that has existing table data isn't supported. So my assumption is that it has to do the full copy from whatever it's replicating from to collect all the data. But the second limitation is that with copy_data = true and origin = none, where data is copied directly from the publisher, it's not possible to know the true origin of that data. And then it follows up with cautions and notes, where he says to watch out for conflict issues, watch out for certain restrictions with regard to this, and be careful handling backups and restores. So it's definitely early days, but it is very interesting that Postgres core could support a full-featured multi-master setup. These are like the first stages to get that set up. So if you're interested in that, definitely check out this blog post.

[00:04:39] The next piece of content: "PostgreSQL Parameters: Scope and Priority Users Should Know." This is from percona.com, and they're talking about all the different parameter settings you can make in Postgres and the levels at which you can make them. It actually goes through twelve different levels, and I definitely haven't used that many, but let's go ahead and see what they've listed here. The first is compile-time parameter settings. So at the point at which you're compiling, you can make certain changes. I've never used that because I perceive them as more low-level controls. Second is data directory initialization-specific parameter settings. So these are the things you can change when you're creating your Postgres cluster. One they mention here is data checksums. You can also set the encoding and locale settings at the same time. So I definitely do this when I'm setting up clusters. Third is PostgreSQL parameters set by environment variables. So when you're starting Postgres, you can actually set environment variables like PGDATA. I really haven't used this a lot, if at all. Four is configuration files, and this is the one I of course use the most. This is the postgresql.conf file, although, as they talk about here, I generally like to include separate files and directories to do the configuration and override what's there, as well as the postgresql.auto.conf file, which is set by the ALTER SYSTEM command. And you generally don't want to manually change that file; just use the ALTER SYSTEM command to make changes to it. Five is command line arguments to the postmaster. So this is where you can pass some of these configuration variables on the command line. I haven't needed to do this, but they say that various utilities take advantage of it. For example, quote, "Patroni passes some of the most critical parameters as a command line argument," so you can do it this way. Six is database-level settings. So this is altering the settings of a specific database. Here they're altering the database and setting max_parallel_workers_per_gather to 4. So you can have that set per database. And then they talk about context. What that means is that some settings can't be set at the database level or the user level or the session level; they can only be made for the whole cluster.
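As a rough sketch of what a few of these levels look like in practice (the database, role, function, and values here are hypothetical examples, not taken from the article):

```sql
-- Cluster-wide, written to postgresql.auto.conf (don't edit that file by hand)
ALTER SYSTEM SET work_mem = '64MB';
SELECT pg_reload_conf();  -- reload so the new value takes effect

-- Database-level setting
ALTER DATABASE reporting SET max_parallel_workers_per_gather = 4;

-- User-level, and a database/user combination
ALTER ROLE app_user SET statement_timeout = '30s';
ALTER ROLE app_user IN DATABASE reporting SET work_mem = '128MB';

-- Session level, then transaction level
SET work_mem = '256MB';
BEGIN;
SET LOCAL work_mem = '512MB';
COMMIT;

-- Object (function) level
ALTER FUNCTION refresh_stats() SET work_mem = '256MB';
```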
So an example of a setting that only makes sense with the scope of the full cluster is max_connections, or changing TCP/IP settings. Seven is user-level settings. So for a user, you can set specific parameters. Eight is a database and user combination. So you could say that this particular user, when using this database, gets these particular configuration parameters. Nine is parameters set by the client connection request. So when you make a connection, you can specify some parameters in the client that you're using. Ten is session-level settings. So in your console, when you're connected via psql, you can just SET a certain configuration variable. Eleven is transaction-level settings. So within a transaction, you can use SET LOCAL to set particular configuration values. And twelve is object-level settings, which you can actually do as part of a function. So you can specify what configuration variables should be set for a particular function when it's run. So there are a lot of different places to set these, but the main place to go to understand where something is set is definitely the pg_settings system view. And you can see it here, where they're using pg_settings to identify many of the different configuration variables. It has a source column telling you where the variable was set, as well as, if it came from a file, which file and what line number. So, very useful. But if you want to learn more, check out this blog post.

[00:08:16] Next piece of content: "Conditional foreign keys and polymorphism in SQL: 4 Methods." This is from cybertec-postgresql.com. They're talking about polymorphism, where you have, call it, a parent class of vehicle, and from vehicle you have a child class of motor vehicle, which is itself the parent of a subclass, car, while a bicycle has a parent of vehicle but not of motor vehicle. And when you want to reference that in code, so you have a customer here that has a contract, you can just specify vehicle, and all of these different types of objects are vehicles because that is the ultimate parent class. Now, the problem with this is if you want to implement foreign keys, because this relationship doesn't really work with foreign keys. So as they show down here, you have a contract table and it has a vehicle id, but what should it reference? Should it reference the bicycle table or the car table? Because those are the ones that actually contain data; they actually don't have a vehicle table. Well, this blog post gives you some solutions for how to deal with this, because you cannot directly write a conditional foreign key, meaning that if there were a vehicle type and the type was car, then it references the car table, or if the type is bicycle, then it references the bicycle table. That doesn't really exist. So how do you handle it? The first method they propose is modeling conditional foreign keys with multiple foreign keys. Basically, your contract table would have a bicycle id column referencing the bicycle table and a car id column referencing the car table. Of course, if you have many more tables, you're going to have to have many more columns to deal with that, and you'll have to set constraints to make sure that at least one is not null, which is what they're doing here.
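A minimal sketch of what that first, multiple-foreign-keys approach can look like (the tables and columns here are simplified illustrations, not copied from the post):

```sql
CREATE TABLE car     (id bigint PRIMARY KEY, license_plate text);
CREATE TABLE bicycle (id bigint PRIMARY KEY, frame_number  text);

CREATE TABLE contract (
    id         bigint PRIMARY KEY,
    car_id     bigint REFERENCES car (id),
    bicycle_id bigint REFERENCES bicycle (id),
    -- one common variant of the constraint: exactly one reference must be set
    CHECK (num_nonnulls(car_id, bicycle_id) = 1)
);
```

Every additional vehicle type means another nullable column and an updated check constraint, which is the downside mentioned above.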
The second method of dealing with it is storing all the reference data in a single table. So basically you just have a vehicle table, and it has all the different columns that you need. Now, if you still want to reference the car table and the bicycle table, you can do it through views, which they're showing here, and of course your contract table then just references the vehicle id in the common vehicle table that's up here. Method three is still using a single table, but all of those additional columns that are specific to cars or bicycles you would put in a JSONB field. So that gives you fewer columns, but it does make the SQL not as clean as individual columns, and it's also harder to maintain data constraints for each column. The fourth method they propose is pointing the foreign key in the other direction and faking it. So you actually set a reference to contract from the car table or the bicycle table, but as they say here, that's not really right. And then lastly, they do mention that PostgreSQL does have table inheritance and you could try implementing this with it, but it doesn't really work. Of course, an option they didn't mention here is just not to use foreign keys when you have a polymorphic relationship like this. Or, if you really need it, maybe you fake it with triggers or constraints to maintain that referential integrity. For example, Ruby on Rails has the ability to support polymorphic relationships, and it maintains an id and a type column for your models, or classes, that are set up that way. But of course, when using that feature, I'm not using foreign key relationships with it. But if you want to learn more, definitely check out this blog post.

[00:11:39] Next piece of content: "Open source bait and switch: Licensing and beyond." This is from momjian.us. And he's talking about an issue where companies start off building tools that are fully open source, and then, as the company and the tools they've built have success, eventually they start changing their licensing model to a less open source one. And this post is actually a response to something written by [email protected], where he's talking about open source bait and switch licensing and beyond. And he's talking about Akka, and he had a lot of issues with its transition from an Apache 2.0 license to a BSL license, where basically the source code is available, but there are a lot of restrictions on how you can use it. And he definitely talks a lot about this issue. Now, related to this, Peter Eisentraut released a blog post called "All kinds of licenses," talking about all the different projects and the different licenses that are available. And he references the change that HashiCorp recently made, where they're moving from a more open source license to the BSL as well. So this is definitely a resource to look at some of these. And then this article from thenewstack.io talks about HashiCorp's licensing change as well, and links to a discussion where members of the community are actually forking Terraform, which I think is their primary tool or product, to make it independent of a company. So here's a link to that, opentf.org, where they're going to be forking Terraform and maintaining its open source nature, as opposed to following HashiCorp, which is changing to a Business Source License. Now, you might be wondering, what does all this have to do with Postgres? And the reason I find this interesting is because Postgres is open source at this time; it's a community that maintains it.
But I think we always have to be aware of these different licensing changes that happen, particularly at companies. We need to be aware that this could happen as well if those companies decide to change their licensing paradigm. But I thought this was an interesting development and wanted to share it with you.

Next piece of content: "Scaling Vector Data with Postgres." This is from crunchydata.com, and he has a number of recommendations on how to start working with larger data sets of vectors for AI work. His first recommendation is physical separation of the data. So his recommendation is, if you have an application database, actually create a separate database to store all of your vector data, because you're unsure how much you'll be using and what the load is going to be; it's going to be a very dynamic, changing environment. So go ahead and keep that in a separate database. And he says if you really must access it from your application, you can always use a foreign data wrapper or just a separate connection to access that data. Second is to definitely consider performance before you scale up your hardware. So take the time to tune queries and get indexes right. And pgvector, the extension, just came out with an entirely new index type and different ways that you can set it up. Thirdly, he talks about categorizing and separating data appropriately, and they even have an example in their Postgres AI tutorial that you can walk through that covers best practices in terms of separating and categorizing this data. The next area he covers is caching, because once you have a particular result, you can go ahead and cache it and just redeliver it, as opposed to having to generate it anew each time. And there are two different ways to do caching: one he calls pre-caching, which is caching at storage time, and the other is post-query caching, which is caching after running a query for a customer. And he talks about different ways to handle that. His last recommendation is dimensionality reduction. So, for example, OpenAI has vector embeddings with roughly 1,500 dimensions, but if you're able to reduce that dimensionality, you can actually get much better performance, and he goes through the process to do that. So if you're using AI and you're starting to scale up your use of vectors, I definitely encourage you to check out this blog post.

The next piece of content is "Pgvector 0.5.0 Feature Highlights and HOWTOs." This is from jkatz05.com, and he's talking about the pgvector 0.5.0 release. He primarily covers the new index type, HNSW, which is Hierarchical Navigable Small Worlds. We talked about this in a previous episode of Scaling Postgres, but one thing I'll emphasize is that it depends on the type of embedding data you have. In that post, generally, IVFFlat indexes were faster but had lower recall; but when you're talking about OpenAI-style embedding data, the new HNSW index actually has much better performance than IVFFlat and seemingly higher recall, although it does take a while to build, he said. So definitely, I think that's the main feature of this release. And he talks about the index build options that are available, as well as how to use the new index in pgvector.
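A minimal sketch of what creating one of these HNSW indexes looks like with pgvector 0.5.0 (the table, column, and parameter values are hypothetical; real OpenAI embeddings would use 1,536 dimensions rather than 3):

```sql
CREATE EXTENSION IF NOT EXISTS vector;

-- Tiny hypothetical embeddings table (3 dimensions just to keep the example short)
CREATE TABLE documents (
    id        bigint PRIMARY KEY,
    embedding vector(3)
);

-- HNSW index using cosine distance; m and ef_construction are the build options
CREATE INDEX ON documents
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

-- Nearest-neighbor query by cosine distance that can use the index
SELECT id
FROM documents
ORDER BY embedding <=> '[0.1, 0.2, 0.3]'
LIMIT 10;
```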
The other thing he mentioned is that this release also has improved performance of distance functions. So for these types of calculations, like Euclidean distance or cosine distance, the Euclidean one had a rather dramatic increase in performance, maybe one-third faster, whereas the cosine improvement was more negligible. I mean, it was an improvement, but not by much. But also, someone said they saw greater performance gains on ARM64 systems. And the last big feature he mentions is the parallelization of IVFFlat index builds, and he shows you how to configure and do that. So definitely check out this blog post if you want to learn more.

Next piece of content: "Data Redundancy With the PostgreSQL Citus Extension." This is from percona.com. Citus is known for scaling out Postgres, so you can shard your data and place the sharded tables on specific nodes to scale out your database and give it better performance. But he says you could also turn it around and get greater redundancy by doing this as well. So he says, quote, "much like you can tune a hardware RAID for either performance or redundancy, the same can be said for Citus data sharding." So here's an example where he set up four Citus nodes and maintained two copies of each sharded table across the nodes. So if you lost any one node, you would always still have your data. You can think of it like RAID 10 for a disk array; essentially, your entire database system becomes redundant. And he actually goes through the code to set this up in LXD containers as a proof of concept. So I find it super interesting that you can do that, and if you're interested in it, definitely check out this blog post.

[00:18:15] For the next piece of content, there was another episode of Postgres FM last week. This one was on decoupled storage and compute. So they were talking about products like RDS, Aurora, Google Cloud's AlloyDB, and Neon, where they're separating the storage layer from the compute layer, and whether this is something that Postgres should develop the capability for. And I actually always thought this would be cool to do. For example, I have some clients that have multi-terabyte databases, and the issue when you want to scale up and down is that you generally have to duplicate your storage. So for every new replica you want to add, you have to add that whole storage solution. And it would be great to have the one storage layer and then be able to increase or decrease the number of compute nodes needed at a given point in time. Now, your storage system would also have to be able to dynamically handle increased and decreased load with regard to that, but I think that would be an interesting capability to have. And clearly other products are building this capability: Aurora, AlloyDB, Neon, et cetera. But if you want to learn more about that, you can listen to the episode here or watch the YouTube video down here.

Last piece of content: "Top 10 Postgres Management Tasks." This is from crunchydata.com. The first one is to add a statement_timeout. So they give an example where you alter your database and set the statement_timeout to 60 seconds, so any statement running longer than that will time out. This protects your database from long-running queries, so it's a great thing to have in place. Second is to confirm you have enough memory, and you do that by checking your cache hit ratio. If it's not 98-99%, you should generally consider moving up to a bigger instance size or increasing the memory. But they do note here that for warehouse or analytical workloads, you're probably going to have a lower cache hit ratio, just because you have so much data for those types of workloads.
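As a rough sketch of those first two checks (the database name is hypothetical, and this cache hit query is one common way to approximate the ratio, not necessarily the exact query from the post):

```sql
-- Tip 1: cap long-running statements for a database at 60 seconds
ALTER DATABASE app_production SET statement_timeout = '60s';

-- Tip 2: approximate the table cache hit ratio; roughly 98-99% is the usual target
SELECT round(100.0 * sum(heap_blks_hit)
             / nullif(sum(heap_blks_hit) + sum(heap_blks_read), 0), 2) AS cache_hit_pct
FROM pg_statio_user_tables;
```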
Number three is to check your shared_buffers; the recommendation they have here is to set it to between 15% and 25% of your machine's total RAM. Four is to use SSL/TLS for data in transit. Five is setting up backups. Six is to stay on top of Postgres releases and upgrade frequently. Seven is implementing pg_stat_statements to track your queries. Eight is adding indexes, and nine is checking for unused indexes. And ten is to review your connection settings. So how many connections are running to the database, and are you running out of them? But check out this blog post if you want to learn more details.

[00:20:44] That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.

Other Episodes

Episode 253

February 19, 2023 00:16:25

Unlogged Tables, pg_stat_io, Type Constraints, Text Types | Scaling Postgres 253

In this episode of Scaling Postgres, we discuss working with unlogged tables, the new pg_stat_io feature, handling complex type constraints and choosing the best...


Episode 90

November 18, 2019 00:12:01

Chaos Order, Pub Sub, Petabyte Scale, PgBouncer Connections | Scaling Postgres 90

In this episode of Scaling Postgres, we discuss chaos order, pub sub, petabyte scale analytics and PgBouncer connections. To get the show notes as...


Episode 284

October 01, 2023 00:18:59

23 Or 1.1 Times Faster Performance? | Scaling Postgres 284

In this episode of Scaling Postgres, we discuss whether something is 23 times faster or 1.1 times faster. We also discuss the release of...
