Collation Conundrum | Scaling Postgres 307

Episode 307 March 17, 2024 00:14:22

Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, we discuss whether we should use bytewise or linguistic ordering by default, how to transform data in Postgres, the benefits of a transaction_timeout, and how to enforce join order.

To get the show notes as well as get notified of new episodes, visit: 

https://www.scalingpostgres.com/episodes/307-collation-conundrum/

 


Episode Transcript

[00:00:00] I hate to admit it, but collations are one of those things that I learned just enough to get by. Kind of the same thing for time zones as well. So if you're anything like me, you may appreciate the first post I cover this week, because it gives some insight into collations and some of the decisions we should be making when setting up our databases. But I hope you, your friends, family and coworkers continue to do well. Our first piece of content is "Using binary-sorted indexes". This is from postgresql.verite.pro, and he's referencing a post he made previously where he alluded to covering the content in this post, namely that when you have libc changes, usually because the operating system is upgraded to a new version, sorting can change and you have to reindex all of your indexes. Otherwise you can run into corruption problems. Similarly, if you do an ICU upgrade and you're using ICU collations, that will require reindexing all of your text-based indexes. But he says if you are using a collation or sorting that uses a bytewise method as opposed to a linguistic method, then you don't need to do that. So for example, when you're doing bytewise sorting, a capital B appears before a lowercase a. Linguistically that's not very good, but that's how the bytewise comparison is handled, like in C, for example, whereas the linguistic comparison sorts a lowercase a before a capital B, because a should always come before b, no matter what the case is. And he shows a comparison of the different orders here. His assessment is that linguistic order is easier to use and more readable for humans, and you can do range searches, meaning when you're looking for strings within certain bounds, it's easier to do using this order, whereas binary order excels at performance, portability, real immutability, and you get LIKE prefix searches. But I would argue that with the B-tree indexes you can create using text_pattern_ops, you can pretty much get a lot of the performance as well as LIKE prefix searches. So really it's just about immutability and portability, whereas linguistic ordering just makes sense for humans, essentially. [00:02:32] So he takes a look at where you can set these and what decisions you can make. I'm actually going to go in reverse order, because I like to think of it from the top down. So basically, when you create your Postgres cluster, you can define what collation you're going to use, and how you initialize that cluster sets the template1 database, so that all future databases use that particular collation or locale you've specified. But you can also change it per database, so it inherits what the cluster has unless you change it in the database. And then further, you can define the collation on particular columns of data within tables, and that can be different from the default that has been set. Now, he says the different collations you can use are C, POSIX, which is just an alias for C, or C.UTF-8, which is not exactly like C but attempts to mirror it as closely as possible. [00:03:29] And he asks the question, why do most installations not use binary sort? And that's because initialization of the cluster is usually not done with the locale argument, so it basically relies on whatever the local language is set to.
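To make those three levels concrete, here is a minimal sketch (my own illustration, not code from the post; the database, table and column names are made up):

    -- Cluster level: the locale chosen at initdb time becomes the default
    -- for template1 and every database created from it, e.g.:
    --   initdb --locale=C -D /var/lib/postgresql/data

    -- Database level: override the cluster default (template0 is required
    -- when the collation differs from the template's).
    CREATE DATABASE appdb LOCALE 'C' TEMPLATE template0;

    -- Column level: a single column can use a different collation.
    CREATE TABLE users (
        id   bigint PRIMARY KEY,
        name text COLLATE "C"
    );

    -- text_pattern_ops gives a B-tree index LIKE-prefix capability even
    -- under a linguistic database collation.
    CREATE INDEX users_name_pattern_idx ON users (name text_pattern_ops);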
So how people tend to get their collations is: they have a language set on their operating system, and that essentially becomes the locale when you create the database cluster. That's then inherited by the databases you create, which determines what the collations for the columns will be, and thus the indexes. So, for example, when I initialized my database on my system, it gave me en_US.UTF-8. And as a consequence, whenever my libc version changes, if there is a sort order change, I need to reindex all of my indexes. So he's asking a very good question: how often do we need this type of sorting? Could we get by just using C collations so that we can avoid this problem? [00:04:31] But he says even if we start going down that path, C or POSIX has its own issues to deal with, namely that it's primarily designed for English sorting, so it's pretty much the US-ASCII character set. Any time it has to deal with another language, or accents, or other types of characters, it can get confused. For example, just using the initcap function, it's clearly capitalizing the wrong thing, and you can run into similar behavior with the upper and lower functions as well. So he says the C.UTF-8 collation tries to alleviate some of that, and as you can tell, initcap does work more accurately here, but it has its own issues. And he advocates, if you want full support, you should definitely use the version in glibc 2.35, which was released in February of 2022, so you need to make sure it's current enough. And it still has, as he says, some OS-dependent issues as well. And then even if we did this and moved to using bytewise ordering with C or C.UTF-8, if we needed to sort more linguistically, we would now have to remember to apply a linguistic collation to the columns we want sorted that way when we create them. [00:05:46] And I actually think that might be more of a burden. But I will say this: ever since I started doing logical replication upgrades, you don't have to worry about libc or ICU collation changes. Basically you're just copying the data from one place to another, and if the sort changes subtly, well, it does so on an entirely new database. So maybe your sorting behavior is slightly different because of a version change, but you're not going to break all your indexes and have to reindex a 100 gigabyte or terabyte index if you have a really huge table. So I frankly think that's probably the better way forward. Go ahead and keep the specific locale for your language, whether that's en_US.UTF-8 or whatever it may be, and just rely on logical replication for major upgrades, or for cases where you want to change the underlying operating system. But I thought this was a great blog post, giving more insight and understanding into collations and how they work, and I definitely encourage you to check it out. [00:06:50] Next piece of content: "Transforming and analyzing data in PostgreSQL". This is from redgate.com, and he's talking about how, instead of doing the classical extract, transform and load, where you extract data, transform it in some presumably intermediate system, and then load it into a destination system, why don't we do more extract, load, transform? So basically extract the data, load it into Postgres, and do your transformations there. And how easy or hard is that to do? So he goes through a process where he grabbed data from Advent of Code 2023, specifically day seven, and he's wanting to take data that looks like this.
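For reference, the puzzle input he's describing (reconstructed here from the publicly posted Advent of Code 2023 day 7 example, so treat it as illustrative) is a hand of cards followed by a bid on each line:

    32T3K 765
    T55J5 684
    KK677 28
    KTJJT 220
    QQQJA 483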
So these represent playing cards. It's a hand of five playing cards; for example, this one is two kings, a six, and two sevens. And the second part is actually a bid value that someone is making, to solve a particular puzzle. Basically he wanted to transform this into a better format. I'll show you the end result here, where he has a column for each hand, each card that was there, the position of that card in the hand if that's relevant, and then the bid. So this blog post talks about that transformation of the data and all the different functions he's using. He uses regexp_split_to_table and split_part to split things up, some regular expression matches, some string_to_table functions as well, some window functions to do some of the ordering, and a CROSS JOIN LATERAL, basically to produce the output he was looking for. So if you want to hone your SQL skills, definitely check this out and get to learn some of the functions he used to transform this data. [00:08:41] Next piece of content: there was another episode of Postgres FM last week. This one was on transaction_timeout, and they're talking about a new feature that has been committed to Postgres 17. Hopefully it will make it through to the end, but this allows you to time out transactions, as opposed to, for example, just statements. And Nikolay in particular definitely advocated for having some sort of default statement and/or transaction timeout set for Postgres: set it at 30 seconds, he was saying, or 60 seconds, or some sort of default. The only downside I see to that is that, with that set out of the box, if you go to create an index on a large table, you're going to hit that timeout and the index creation fails. So I don't know if that's a great first experience. It would definitely be a teaching moment: why can't I create my indexes? Because even creating them concurrently will time out. Historically, I've actually not liked global statement timeouts because they impact me, the DBA, but I think I kind of agree with them that there should be some sort of global timeout. But for the session that I'm connecting with, either remove the timeout or extend it so you can execute some of the long-running maintenance tasks that are needed, since even validating a constraint, which presumably should not do a bunch of locking, will run into some of these timeouts. But definitely another interesting show from them; I encourage you to listen to it or watch the YouTube video down here. [00:10:09] Next piece of content: "Enforcing join order in PostgreSQL". This is from cybertec-postgresql.com, and he's talking about how, when you are joining tables (he created just five basic tables here and joined them all together), the planner will not necessarily follow your explicit join order. It will use what it knows about statistics and the table layouts to determine whether it should join, say, x5 to x3 first, then to x2 and then to x4, so it can change the entire order. And as a consequence, you'll see that the planning time continues to increase the more tables you add, because it's trying to figure out the best way to join them together. But he says this behavior can be changed, and that is with join_collapse_limit. By default, I think it's set to eight, so up to eight tables it will do that. But if you set it to one, then it's basically going to follow the explicit join order that you've set, as in the sketch below.
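A minimal sketch of scoping that to a single query (my own example, not the post's exact code; the x1 through x5 tables and their id columns are assumptions):

    BEGIN;
    -- SET LOCAL lasts only for this transaction; with join_collapse_limit = 1
    -- the planner keeps the textual join order instead of reordering it.
    SET LOCAL join_collapse_limit = 1;
    SELECT *
      FROM x1
      JOIN x2 ON x2.id = x1.id
      JOIN x3 ON x3.id = x2.id
      JOIN x4 ON x4.id = x3.id
      JOIN x5 ON x5.id = x4.id;
    COMMIT;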
And as you can see, the planning time dropped, becoming maybe five times faster than it was before, because the planner is not making any decisions; it's just taking the join order that you've specified. [00:11:21] Now, as they say here, it's definitely not recommended to set this at a global level, or probably even at a session level, but if there's a particular query you need to adjust this for, that's where they would potentially advocate changing it. But definitely check this blog post out if you want to learn more. Next piece of content: "Speed up PostgreSQL pgvector queries with indexes". This is from eversql.com, and it's another post on the pgvector extension and IVFFlat and HNSW indexes. But again, I always like seeing these because they help me get more familiar with this area, because if anything, I think we're going to be using more of this kind of data going forward. And as you can tell, this is quite a comprehensive blog post, so there's definitely more you could learn from it. [00:12:11] Next piece of content: "How to use pgbench to test PostgreSQL performance". This is from dev.to in Francesco Tisiot's section, and this is just a basic post on how to get started using pgbench for benchmarking. So check this out if you're interested in that. Next piece of content: "Trying out CloudNativePG: a novice in a Kubernetes world". I would say that's probably my byline as well, definitely a novice in the Kubernetes world. This is from enterprisedb.com, and he's basically going over getting started with it, getting it up and running, and connecting to it externally. I mention that because this was a little bit of a hang-up I had: okay, it's all running now, how do I connect to it? But check out this blog post if you want to learn more about that. The next two are also related to CloudNativePG, the operator created by staff from EDB to run Postgres in Kubernetes. There are other operators, of course, but these posts are written, I believe, by the main contributor to CloudNativePG, who covered setup in recipe one, which was in last week's episode. In this one he looks at the default resources that are set up with it, and then the third one talks about superuser access. Basically there is none, but if you happen to need it, he talks about how to enable it here. And the last piece of content: "Managing time series data using TimescaleDB-powered PostgreSQL". This is from percona.com, and they're showing how you can take the TimescaleDB extension and set it up in your own database system to start using it for time series data. So if you're looking for a guide on how to get started with it, you can definitely check out this piece of content. [00:13:55] I hope you enjoyed this episode. Be sure to check out scalingpostgres.com, where you can find links to all the content mentioned, as well as sign up to receive weekly notifications of each episode. There you can also find the podcast audio of the episode as well as a full transcript. Thanks, and I'll see you next week.

Other Episodes

Episode 47

January 20, 2019 00:11:16

pgBouncer, Postgres 11 Gems, DocumentDB, JSON | Scaling Postgres 47

In this episode of Scaling Postgres, we review articles covering pgBouncer, Postgres 11 gems, DocumentDB similarities and JSON capabilities. To get the show notes...


Episode 254

February 26, 2023 00:13:14

PG Edge Cache, Postgres & OpenAI, citext to Collations, Compression Options | Scaling Postgres 254

In this episode of Scaling Postgres, we discuss PG Edge Cache, using Postgres & OpenAI, migrating citext to case-insensitive collations and PG16 compression options....


Episode 155

March 08, 2021 00:17:16

Citus Open Source, Time Series Performance, Subscripting Updates, target_session_attrs | Scaling Postgres 155

In this episode of Scaling Postgres, we discuss Citus 10 open source, time series performance in native Postgres, using subscripting for updates and new...
