Multi Column, Multi DB, Dos & Don'ts, RUM Indexes

[00:00:00] In this episode of Scaling Postgres, we talk about multicolumn, multidatabases do's and don'ts and run indexes. I'm Kristen Jameson and this is Scaling Postgres episode 65. [00:00:20] Alright, I hope you're having a great week. Now our first article is multicolumn indexes, and this is from Medium in the PG mustard area. And they're talking about an index over multiple columns and usually how you would start indexing your database or your tables or your application would be putting on single column indexes. But then if you start getting performance issues, a lot of times you move to multicolumn indexes. And this is kind of a counterpoint in that there's kind of no magic bullet and it talks about some disadvantages. So they talk about when malduk column indexes fall flat and that's predominantly when you're needing to search on the second column in a multicolumn index or the third column or the first and third thin column of a multicolumn index, then it won't really work. And they go into some clear examples about this and why it won't work. And they even say how to spot some inefficient efficient index use. And that's when you have like a high amount of rows removed by the filter, when you look at the query plane, when you run Explain or Explain Analyze. And they say when you're sticking with just single column indexes, it basically can do a scan on each of those single column indexes and then do a bit scan between them. And that can work in a lot of cases. There's just cases where multicolumn index definitely excels. So if you want a refresher on what cases you may want to move to a multicolumn index, definitely a blog post to check out the next post is managing multiple databases in Rail Six. And this is from the Citusdata.com blog. And with the new Rail Six they have the ability to support talking to two different database systems. I'm sure other frameworks have that, but this is calling out Rails because this could be advantageous if you have a primary PostgreSQL database and then a Replica PostgreSQL database. Here they call it a follower and basically you would direct all data changes to the primary and then all essentially read only queries to the Replica or the follower. And they talk about how to configure Rails to set this up to work in Rail Six, and how a lot of times, basically you just tell Rails to automatically utilize the reading role, which is hitting the Replica for getting head requests and probably leave post patch put things of those other http verbs for just hitting the primary database. So if you use Rails or another web framework and you want to look into how they're offering the ability to talk to multiple databases, definitely a blog post to check out. [00:03:01] The next post is don't do this. This is from the Wiki postgresql.org. Now I saw this post earlier, but heard some mention of and I'm going to mention another blog article related to it. I'm not quite sure who authored this, but it does have some, I would say, opinionated choices on why you should and should not use different features. Now, some of them are clear in terms of my opinion, but some of them not so much. Like, for example, a lot of ones that definitely disagree with are don't use the psql capital W or include the password, don't use between. [00:03:40] I agree with I always have to take a step back and think about it, usually with dates, I definitely do what they're suggesting here, or at least timestamps, but like, things that's different is that I frequently do use timestamp with that time zone just because of the framework that I tend to use. As that framework offer different options. I might use timestamp including the zone with it, but a lot of that is just due to the framework that I'm using. Text storage. Basically they're saying you should mostly use Vericare or text and rarely use Vericare in, although I find it beneficial. And they say that this is when you should do it is using a limit when you actually want to limit the text. And a lot of times with my columns I do want to limit because I don't want people to be able to insert data into the database in an unbounded fashion. A lot there's one here that says don't use serial. But again, a lot of the frameworks they're using the serial data types as opposed to the new identity columns. Now, hopefully the frameworks that I use will be switching over, but as of now they are not. So it's an interesting blog post and has a lot of advice to look into, but I guess I wouldn't necessarily take it as gospel. And again, for everyone, they say when should you? I guess some of these when should you? I would say what is your framework or are there things that you're using like an application framework that makes it more difficult to kind of do some of the recommendations in here? I guess I don't have a high difference of opinion with a lot of these, but a lot of the reason I do do some of these things is because of the application framework and what capabilities it offers. [00:05:22] The next post is Table Inheritance what's it good for? And I believe this is kind of mentioned because in here it says don't use table inheritance, essentially. And it says when should you with regard to table inheritance? Never, almost. But here this blog post by Perspectives on Ledger SMB is talking about a use case that they felt inheritance is good. Basically. [00:05:47] They had some sort of objects and those objects had notes for each of those types of objects. So they were using a parent notes table and then did an inheritance to like an invoice notes table and then other different types of notes table and they found that this particular inheritance feature works for them. I wasn't 100% wedded on the idea as I read through this blog post as something that I would do, but I could see it as a use case where you want to make the decision to do it. Definitely the different developers need to be on board to understand how this works, but it's definitely a use case I could see where table inheritance could possibly be useful. So if you're interested in looking into that, definitely blog post to check out. [00:06:32] The next post is using docker Hub PostgreSQL images and this is from second quadrant postgresql.com and they're talking about a docker Hub and it has images for PostgreSQL. And basically they started using it for a particular use case and they're thinking that a lot of people are going to use these because it's the PostgreSQL image on docker Hub. But what I found interesting about this, they talk about different warnings that you should consider. Like number one is it's ephemeral by default. And that's the thing that has me nervous about using containers with PostgreSQL is making sure the data just doesn't go away. And apparently this docker image on Docker Hub is ephemeral by default. So you can essentially, as he says here, quote by default, these images make it very easy to lose your data forever. So I myself haven't looked into using Docker with PostgreSQL and part of the reason is the fear of this losing the data. And then another one he talks about is data safety. And here's a warning docker's default storage configuration on most installs may perform poorly and isn't necessarily crash safe. The docker documentation says as much. So again, more warnings about this. And then he talks about how the locale and encoding and collation is set up, that it's using root by default, how to handle connections between containers and some other information to help you understand and use the PostgreSQL container on Docker Hub. So if you're interested in looking into that, definitely a blog post to check out. [00:08:01] The next post is actually presentation. It's Streaming Replication the Basics. And this is from PG Day in Belgium, 2019 by Stefan Forkat. And it's at Pgsth. GitHub IO. [00:08:14] And this talks about streaming replication with PostgreSQL and it talks about the basics but it definitely goes on a little bit further than that because this presentation is 54 pages in length. He goes over the basics of Wall and how streaming replication works, how to set it up, but then he also goes into issues such as monitoring, so how you monitor using PG Stat replication, the PG Wall receiver, how to do failover, promote the different databases and even failback using PG Rewind. So it does start off with the basics, but then it goes in much more deeper about all the issues related to replication, even mentioning synchronous replication. So if you're interested in getting an overview of streaming replication and then going much more in depth. Definitely a blog post to check out. [00:09:07] The next blog post is Indexes and PostgreSQL part eight, Rum. This is from Haber.com and I believe this was posted on Postgrespro Ru and it basically explains the Rum indexes. So it's kind of like the, as they say here, quote the next generation gen has been called Rum. So it's the next version of the Rum index because typically you use gen with data types like for text search, for JSON, B queries, for arrays, things that have multiple values in the data type. But they're saying there are certain limitations in gen and basically Rum has been developed to get around those. Now, it's not offered as a part of the core PostgreSQL install, but you can install it as an extension. And it talks about the use case here. [00:10:04] Now, one of the things they mentioned here is they actually added more information to help things with ranking and make it more efficient, better for tech search with Rum indexes. And at least one disadvantage here is like looking at the size. A gen index for something that they were looking at was 179 megabytes, whereas the Rum index was 457 megabytes. But basically it makes search much faster. There was a presentation that I found from Postgres Open 2016, and I'll include this YouTube channel link in the notes, but it talks about create index using Rum and by Oleg Bartunov. And he's describing the Rum index and he was seeing performance improvements like six to ninefold in text searching because, again, the gen index has problems with ranking because it doesn't have enough information in it to do that. And it can also do some things like also order by date time when you're searching on a particular text query. So if you use a lot of full text search in PostgreSQL, these are two posts to check out to see if you may want to look into using the Rum index. [00:11:14] The next post is a glance at PG Cron to automatically schedule database tasks. And this is from Luca Ferrari at fluca 1978 GitHub IO. And he's talking about something that was developed by Citus as a way to store scheduled jobs in PostgreSQL using PG Cron. So this is an extension that you can install and it has a Cron like syntax and you can set up to run database jobs directly within your database system. [00:11:45] I'm probably going to still be sticking with Cron, but if you have a use case where you actually want to store the jobs in the database itself to complete on a periodic basis, definitely a blog post to check out. [00:11:57] The last post is an overview of Sharding in PostgreSQL and how it relates to mongoDBs. And this is from Percona.com blog. Now this is related to a YouTube video that was shared a few weeks ago in Scaling Postgres, but it talks about first partitioning in PostgreSQL and particularly declarative partitioning. And then it goes into Sharding with a concept of you're storing one of these partition tables in a totally separate server instance and using a PostgreSQL foreign data wrapper to actually communicate with that secondary instance that has that partition table on it and it goes through the code and shows you how you can do this today. And they also give some advice on when does it make sense to partition a table and then when should you take the next step, potentially and look at sharding. Now, again, there are solutions that do this out of the box, like Cybus, but this is where you're just doing it using the PostgreSQL tools that are available today in essentially the community edition. And of course, the biggest consideration they talk about is how you shard the data, because ideally you want the queries to all take place on that one shard. Like for example, if you have a system where your customers are set up by accounts, maybe you would want to shard by account, so that when you're processing one customer's data, it all happens on one of those database shards. So if you're wanting to look into PostgreSQL for sharding, even though it's not here yet and there's some disadvantage with regard to it, this is a blog post to check out if you're interested in that. And it also does a brief comparison of MongoDB, which has essentially replication and sharding kind of built into it on how it compares to PostgreSQL and what you can do today. [00:13:45] That does it. For this episode of Scaling Postgres, you can get links to all the content mentioned in the show notes. Be sure to head over to Scalingposgres.com, where you can sign up to receive weekly notifications of each episode, or you could subscribe via YouTube or itunes. Thanks.

Show Notes

Episode Transcript

Other Episodes

Episode 149

ARM Tests, Tips & Tricks, Hierarchical Structures, Benchmarking Framework | Scaling Postgres 149

Episode 243

IN vs ANY, Ghost Conditions, Percentage Calculations, Variadic Unnest | Scaling Postgres 243

Episode 0

Citus Open Source, Time Series Performance, Subscripting Updates, target_session_attrs | Scaling Postgres 155

Multi Column, Multi DB, Dos & Don'ts, RUM Indexes | Scaling Postgres 65