PG 14.4 Release, Subscription Skip, Using Schemas, Open Source Citus | Scaling Postgres 220

Episode 220 June 19, 2022 00:18:57
PG 14.4 Release, Subscription Skip, Using Schemas, Open Source Citus | Scaling Postgres 220
Scaling Postgres
PG 14.4 Release, Subscription Skip, Using Schemas, Open Source Citus | Scaling Postgres 220

Jun 19 2022 | 00:18:57

/

Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, we discuss a PG 14.4 release, how to skip transactions with logical replication, how to use schemas and Citus going fully open source.

To get the show notes as well as get notified of new episodes, visit:

https://www.scalingpostgres.com/episodes/220-pg-14.4-release-subscription-skip-using-schemas-open-source-citus/

View Full Transcript

Episode Transcript

[00:00:00] In this episode of Scaling Postgres, we talk about a PG 14.4 release subscription skips using Schemas and open source Citus I'm Creston Jamison. And this is scaling postgres episode 220. [00:00:18] You all right? I hope hope you, your friends, family and coworkers continue to do well. Our first piece of content is PostgreSQL 14.4 released. This is from Postgresql.org, and this has been an issue we've discussed in the last couple of episodes of Scaling Postgres where there is a bug since the release of Postgres 14 that can affect indexes that are built concurrently or reindexed concurrently. So there is a potential for a corruption issue to happen. So with this release, it does resolve that issue and they again restate. Here how you can check for corruption using the PGAM Check capability. But if you do have corruption, what you basically need to do is rebuild those indexes by just re indexing concurrently now, because now it should work successfully. And you can do that individually, by database, by Schema. And there's even this re IndexDB command you can use that does the entire cluster concurrently. And you can specify a number of jobs to do it faster. But if you want to learn more, go ahead and check out this blog post, the next piece of content addressing logical replication conflicts using Alter subscription skip. This is from PostgreSQL Fastware.com, and there was a previous article released about two weeks ago talking about a way to skip over transactions during logical replication. And in that post they talked about using PG Replication Origin Advance, and this does have some risks associated with it. However, there's this new feature being added to PostgreSQL 15 called Alter Subscription skip, and it actually makes it easier and safer to skip over particular transactions. So this is definitely an interesting blog post to check out. Now, this is very similar to the one that was presented Scaling Postgres a couple of weeks ago. So I would probably go back and review probably episode 218, or you could just look at the previous blog post with regard to this. But basically they set up a subscription. The source table did not have a unique constraint on an ID. They put a unique constraint on the destination. So basically they're going to hit an error where they're going to send a duplicate, and they put disable on error true for the subscription. So basically the subscription will disable itself if it runs into an error. So they process some transactions on the source database. All these IDs were entered. It hits the conflict on the destination database, and you can only see that five and one were inserted, all the rest were not. So they're going to continue sending data to it by skipping over a transaction. Now, the logs give you some important detail in that it tells you where the transaction finished. So this is the LSN of the finished transaction. So that's what you want to use in the alter subscription skip. So basically you reference your subscription and you give it the LSN that it finished at. Now if you look at the subscription at this point, you see the subscription skip LSN is located here and you can see that the subscription is disabled because the sub enabled says false. You then go ahead and enable the subscription. And now when you look at that table, you'll see it skipped over the transactions that were inserting more rows, but it just inserted the final row here. So it did skip over one transaction. And when you look at the subscription again, you can see it's enabled and that there is no skip LSN. And then you can see information in the logs about this being completed successfully and then they talk about the safeguards that were added to make this a lot safer than trying to advance the LSN yourself. So definitely a great feature coming in postgres 15 and I wonder why they didn't mention that in the previous post that hey, you can use this technique, but hey, here's a better way to do it. But definitely check out this blog post if you want to learn more. The next piece of content using Postgres schemas, this is from Aaronoelis.com and they're talking about schemas. So basically Schemas is basically a namespace for objects in a, you know, tables indexes can fit within a particular schema, within a database and you can have multiple schemas. So they're just like namespaces. And it's interesting, I haven't seen a lot of schema use with Postgres clients. Almost all the consulting clients I deal with, they are only using the public schema. Whereas my experience with Microsoft, SQL Server and Oracle, there's a lot more schema use. I think part of it is I think Oracle creates a schema for each user. So it's basically already has this namespace schema area for each user. But I haven't seen something similar with Postgres and maybe because it can get a little bit complex as you'll see here, the permissions to be able to set up a schema and access it the way that you would potentially anticipate. When you create a table, you can namespace it with the schema followed by a dot and you can have the same name for different tables as long as they are in different schemas. So they created a private schema, named it example and then you have a public schema named it example. The other thing to consider when you're working with schemas is the search path. So it tells you what schemas it's going to be searching in to find those objects of interest and you can change your search path and it describes a little bit how to do that here. And then even once you have a search path, you can also still just explicitly give the fully qualified name to be able to access those particular objects. Then it talks about even though you create a new user create a schema for it, as they're doing here, and giving authorization to that user to use that schema. You can create objects in there and that user will own them, but even the Postgres user can go in and create objects and now the Postgres user owns it. And if that user that was created tries to access the object created by Postgres, you get a permission denied error. So you have to grant certain privileges to the users for the different schemas to make sure it's working as you expect. So this blog post goes into that and probably reflects maybe why in the wild I haven't seen that many postgres installations using schemas other than the public one. But if you want to learn more, definitely check out this blog post next piece of content. Cytus Eleven for Postgres goes fully open source with query from any node. This is from Citusdata.com and they're talking about the release of Cytus Eleven. So again, this is a Postgres extension that allows you to do scale out postgres. So it lets you have multiple nodes and have a coordinator node that can then query data that's on separate nodes. So it allows you to scale out one database across several database servers. Now, with this release, the newest feature is being able to query from any node. Before, you had to do the query from a coordinator node, but now you can query from the individual data nodes as well, apparently. But one thing that makes this really significant in my eyes is that everything in the Citus extension is now fully open source. So previously they had two versions. They had the open source version and what they called their Enterprise version. And of course the open source version or the Community Edition had less features than enterprise release, but apparently they've added all those features into the open source version. So to me this is fantastic. And they mentioned some different features here that have been enabled for the open source version. Talking about rebalancing shards. So you can rebalance data between your data nodes using logical replication to avoid blocking writes. So basically you can rebalance shards without downtime multi user support, tenant isolation for multitenant applications, fine grained control over inter node authentication, routing internal connections via connection pooler, and performance optimizations for data loading. And of course this blog post goes into each of these features and details and how to do the upgrade. But if you're looking for a scale out solution, Cytus appears to be now fully open source. So if you're interested in that, definitely check out this blog post. [00:08:31] The next piece of content case insensitive pattern matching in PostgreSQL. This is from Cyber. Copy and postgresql.com. They're talking about searching for text and you want that search to be case insensitive. You want to ignore the case. So the easiest way to do that is just to lower both the column you're searching and the search string you're receiving so that they will be the same case. Now personally the way I prefer to deal with this is at the application level. If I know I want everything lowercase like an email address, I go ahead and force it from the application side to just make everything lowercase. That way I don't have to worry about a function on the column I'm using and then I don't have to worry about functional indexes, I just have to lower what I'm received from the user, for example. But that's the way I prefer to deal with it. But when you do this technique you're definitely going to want to use a functional index. But he says there are some disadvantages of this and that it can take a while to convert this whole value to a lowercase. So there's a little bit of a speed difference with that. Now he didn't talk about indexes in this post so I guess he was just doing all of his speed tests without indexes. The next way to deal with it is the sitext extension or the case insensitive text extension. So basically it's an extension you install and it becomes a data type CI text. Now he says the performance can get bad for longer values because internally it basically calls lower and then a different correlation for it. And regular express matching is not case insensitive and you have to use the case insensitive operator explicitly. But I mean that's not much of a downside. The other technique he explored is using case insensitive ICU correlations. But to do this you have to have postgres built with ICU support. So it won't work if you haven't built postgres with that. In most instillations I encounter haven't done this. So he explores this but it actually has problems with doing a case insensitive pattern match. Doesn't really work and then he goes into the explanations of why that is given different languages. Particularly he focuses on german here and it looks to me like very difficult issue to deal with. So it doesn't look like that's a great route to go necessarily. But then he gets into some performance checks. So he looks at a query looking for an exact match, a like match, we're changing the collation and doing an I like. And this other looks like a regular expression check here. And in all the cases pretty much the Citex or the CI text wins. Except for this latter one here. But it didn't lose by that much. So again, ignoring indexes, it looks like the CI text may be the more efficient way to go for doing case insensitive text, but you can run into an error if you're going to try to use these ICU correlations to do it. But if you want to learn more, definitely check out this blog post. [00:11:27] The next PostgreSQL fuzzy text search. Not so fuzzy to fuzziest. This is from the [email protected] and he talks about all the different ways to do fuzzy searches from least fuzzy to most fuzzy and he even talks about how to enhance their performance by what kind of indexes you can use. The first example he gives it simple pattern matching. So everybody should be familiar with this using like or I like with a wild card to try and match something in the database. The other way you could do this is using a regex. So you could use the similar keyword that gives a little bit of a more friendly way to do a reg x expression or you could use the Tilde symbol to kind of give the possex syntax. Now to improve performance, the only way you can improve it is if you do a prefix search. So basically you're doing a left anchored search and your wildcard is at the end. That's the only type of search you can get a speed up for with indexes. And basically you use a standard Btree index but you can use text pattern ops or the varcar pattern ops operator. Next he looked at something that's more fuzzy ish which is text search vectors. And basically this is the full text search feature of postgres. So you're converting your text to a TS vector and you're taking your query and converting it to a TS query. So basically it breaks up things into words, potentially simplifying words. That's why he calls it fuzzy ish and it does searches to try and find matches with them. Now of course with a full text search your gen index is going to give you the most speed up you can use. It just index two, but I tend to use gen indexes. The next area fuzzier are trigrams. So basically you take your text and you break it up into three character sequence, hence a trigram and you can do matches against that using similarity to find how similar your query is to what's being queried. There's also word similarity functions and strict word similarity functions. And with that you can use just in gen indexes to speed up the performance of the searches. With that, the next fuzzier one is Levenstein distance. So it's a different algorithm to be able to calculate similarity basically from, I believe what he said, turning one string into another to calculate the distance. But unfortunately with this one there's no way to give a performance improvement to it. So it's going to just be slow without an index possible. The next one that he calls very fuzzy is phonetic similarity. So using functions like soundx or MetaPhone or demetaphone to look for things that sound similar. And you can speed up these methods using a function based index because essentially these are just functions. So this is a pretty interesting blog post that shows all the different methods. You can do fuzzy searching with it or similar searching in postgres. So check out this blog post if you're interested. [00:14:28] The next piece of content quick and easy postgres data compare this is from Crunchydata.com, and they're talking about a technique of where you have, say, a production database and a reporting database and you want to make sure that the data between them matches. How could you do that? And the technique that they use to do this is actually a foreign data wrapper. So they basically enabled to query both of them and put the results in one table. And how they did it is in this one table that's collecting the results. They give a source name that identifies where the data came from either the production database or reporting database, the ID column to identify the row. And then they take all the data columns that exist in that row other than the ID or the primary key and hash it to an MD five hash. And basically if the hashes are identical, the row is identical. And you can see the results of that here. So where the MD five S are identical, it's an equal value. Where it's different, essentially have a difference. But you can also have instances where a row is not in a target or the row is not in the source and it gives you the query that you would want to do against these tables to set this up. And again, the primary way that they're doing this is just using foreign data wrappers so that one database can query another. But if you want to learn more about how to do this, definitely check out this blog post. [00:15:50] The Next Piece of Content five Minutes of Postgres episode 22 reducing AWS Aurora I O Costs with Table Partitioning and Understanding Partition Pruning this is from Pginalyze.com and he's talking about a blog post that was put on the AWS blog where they were showing how they could significantly reduce aurora IO costs by using table partitioning. And I think the costs dropped up to a third. Now, this post actually came out last week, but I didn't cover it. The reason being is because at least with IO, that's kind of the differentiator of aurora and postgres. So I didn't see this as something that you could replicate with just open source postgres. Now you can definitely do partitioning and it uses partition pruning, and there are definitely advantages, but a little bit of apples to oranges to me. But in this episode he covers that and he also discusses what is partition pruning, how is it beneficial if we want to use it with postgres. So if you're interested in that, you can definitely check out this blog post. [00:16:55] The Next Piece of Content oracle versus Postgres transaction Control Statements this is from Megops.com, and if you're looking to convert from Oracle to Postgres, you'll definitely see that the transaction control statements behavior is a little bit different in terms of how transactions are begin and committed and things of that nature. So this blog post goes over those differences. So if you're interested in that, definitely check out this blog post. [00:17:21] The Next Piece of Content handle empty strings when migrating from Oracle to PostgreSQL this is from Amazon.com, and they're giving some advice with regard to how to handle empty strings when you're going from Oracle to postgres. And they say here quote Oracle databases treat Null and empty strings as the same, whereas with postgres they're different. So Null means unknown, and an empty string is a string of link zero, as they say here. So if you want to learn about the differences and how to manage them as you're transitioning from Oracle to Postgres, you can check out this blog post, The Next Piece of Content. The PostgreSQL Person of the Week is Vignesh Warren C. If you're interested in learning more about Vignesh Warren and his contributions to Postgres, definitely check out this blog post and the Last Piece of Content we did have another episode of the Rubber Duck Dev Show this past Wednesday evening. This one was on Raptors, which are actors for Ruby, so they're the active model of concurrency. They were introduced in Ruby 30 as being experimental, and we talk about them and look at their performance. So if you're interested in that type of content, definitely welcome you to check out our show. [00:18:32] That does it. For this episode of Scaling Postgres, you can get links to all the content mentioned in the show notes. Be sure to head over to Scalingposgres.com, where you can sign up to receive weekly notifications of each episode episode, or you can subscribe via YouTube or itunes. Thanks.

Other Episodes

Episode 286

October 15, 2023 00:12:59
Episode Cover

20 Times Faster Text Search & Ranking? | Scaling Postgres 286

In this episode of Scaling Postgres, we discuss a new extension that promises substantially faster text search and ranking, an AI content storm, how...

Listen

Episode 319

June 09, 2024 00:12:05
Episode Cover

When Postgres Development Stopped! | Scaling Postgres 319

In this episode of Scaling Postgres, we discuss a time when Postgres development stopped, two new extensions pg_lakehouse & pg_compare and the upcoming event...

Listen

Episode 11

May 07, 2018 00:12:13
Episode Cover

Serializable, JSON & JSONB, Fast Data Loading, PG11 Features | Scaling Postgres 11

In this episode of Scaling Postgres, we review articles covering serializable, JSON & JSONB data types, fast data loading with Ruby and Postgres 11...

Listen