Episode Transcript
[00:00:00] In this episode of Scaling Postgres, we talk about enums versus check constraints, faceting with roaring bitmaps, better scaling in DynamoDB, and business logic. I'm Creston Jamison, and this is Scaling Postgres episode 245.
[00:00:20] All right, I hope you, your friends, family and coworkers continue to do well. Our first piece of content is "Enums vs. Check Constraints in Postgres". This is from Crunchydata.com, and they were discussing an issue that was brought up in their internal Slack channel, where someone asked if anyone actually uses enums out in the wild. The surprising conclusion was that they basically don't. If you're not familiar, they show you how to create an enum. You basically make your own type: you do CREATE TYPE, say, order_status AS ENUM, and then you can use that as a data type within your tables. You can see that they created a column called status with the data type of order_status. So that's how you do enums, but apparently they're not used that much in the wild, and frankly, I have never used them in Postgres either. Now, I've handled this in different ways, but the easiest way, and the one they propose, is to use check constraints. You can quite easily define your table and just say these are the valid values for this particular column. I know the framework I tend to develop in, Ruby on Rails, has its own way to create an enum, but it essentially just creates an integer column in the table for, say, the status field. So there's no way to run an SQL report and get the actual data; you have to go through the object model in Ruby on Rails to do it. That's not a good solution, whereas I think using a check constraint is a very good solution compared to these different flavors of enums. But if you want to learn more about that, you can check out this blog post.

Next piece of content: "Faceting large result sets in PostgreSQL". This is from Cybertec-postgresql.com.
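Going back to the Crunchy Data article for a moment, the two approaches they compare look roughly like this. This is a minimal sketch; the status values and table names here are illustrative, not necessarily the post's exact example:

```sql
-- Enum approach: a custom type with a fixed set of allowed values.
CREATE TYPE order_status AS ENUM ('pending', 'shipped', 'delivered');

CREATE TABLE orders (
    id     bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    status order_status NOT NULL
);

-- Check-constraint approach: a plain text column with the allowed
-- values enforced by a constraint instead of a custom type.
CREATE TABLE orders_v2 (
    id     bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    status text NOT NULL
        CHECK (status IN ('pending', 'shipped', 'delivered'))
);
```

One practical difference: growing an enum means ALTER TYPE ... ADD VALUE, while the check constraint can simply be dropped and re-created with a new value list.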
They're talking about a situation where, say, you have an online store, or you're searching an inventory of something like vehicles with different attributes associated with them, and you want to see the counts: say, how many hats are blue or red or green, or what type of clothing is available. So they created a schema with a document table where you can place particular items in a category, and each document has tags you can associate with it, along with some other data. Basically, you want to see the distribution across all of this by type, category, size, tags, and start and end timestamps. Now, they have a bit of a discussion about facets: there are categorical variables, which typically have a limited set of small values; there are continuous variables, which have a wide range, in which case you may want to bucket particular ranges; and they also talk a little bit about composite variables. But let's say you wanted to render this type of information for each of these facets of the search. One solution is to do multiple queries, where you count all the rows in the document table grouped by type. But you would need to run a query like that for every facet you wanted to measure, so that's not a very efficient way to do it. One way to handle this is to use a lateral join, so that with one query you can get all those results back. But it's still not very fast, because you essentially have to read through the entire table to pull back the data you need. They said there is a way to make this faster, and the key is roaring bitmaps, which I haven't heard a lot about, but apparently there's a Postgres extension for them, and it looks like the extension is called pg_roaringbitmap.
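To make the two approaches above concrete, here's a rough sketch, assuming a documents table with text-typed type and size columns; this is illustrative and not necessarily the post's exact query:

```sql
-- Naive approach: one separate query per facet.
SELECT type, count(*) FROM documents GROUP BY type;
SELECT size, count(*) FROM documents GROUP BY size;
-- ...repeated for every facet you want to display.

-- Lateral join: one statement, one result set for all facets.
-- The inner subquery references f.facet, which is what LATERAL allows.
SELECT f.facet, v.value, v.cnt
FROM (VALUES ('type'), ('size')) AS f(facet)
CROSS JOIN LATERAL (
    SELECT CASE f.facet WHEN 'type' THEN type ELSE size END AS value,
           count(*) AS cnt
    FROM documents
    GROUP BY 1
) v;
```

This collapses the round trips into one, but as noted above, it still scans the whole table once per facet, which is why it remains slow on large datasets.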
But what this post actually did is take that and create their own extension called pg_faceting, to make the implementation easier. Then they rewrote the query like this, and it resulted in a significant improvement in performance, because it uses an inverted index to index all of these different values. If you look at the GitHub page for pg_faceting, under the "How fast is it?" section, they show that the lateral join approach with a parallel scan took 18 seconds; restricted to a single core, it was 222 seconds. But using the pg_faceting extension on a single core, the equivalent of that 222-second run, the results returned in 155 milliseconds. So that's an enormous performance improvement using this new type of roaring bitmap indexing. If you need to present data in this fashion, I would definitely encourage you to check this out, as well as the next piece of content, which is "Roaring bitmaps and pg_faceting: fast counting across large datasets in Postgres". This is from pganalyze.com, and this is the article that Lukas covered this week; he includes a lot more information on it. So if you want to learn more about this, I definitely encourage you to check out his episode covering this topic as well.
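As a sketch of why bitmaps make this fast, here is the underlying idea using pg_roaringbitmap directly rather than pg_faceting. The function names (rb_build, rb_and, rb_cardinality) are from that extension's API as I understand it, and the schema is illustrative:

```sql
CREATE EXTENSION IF NOT EXISTS roaringbitmap;

-- Inverted index: one compressed bitmap of matching row IDs
-- per (facet, value) pair.
CREATE TABLE facet_index (
    facet text,
    value text,
    ids   roaringbitmap,
    PRIMARY KEY (facet, value)
);

INSERT INTO facet_index VALUES
    ('color', 'blue',  rb_build(ARRAY[1, 3, 5, 8])),
    ('size',  'large', rb_build(ARRAY[3, 5, 9]));

-- "How many blue AND large items?" becomes a bitmap intersection
-- plus a population count; the base table is never scanned.
SELECT rb_cardinality(rb_and(c.ids, s.ids)) AS cnt
FROM facet_index c, facet_index s
WHERE c.facet = 'color' AND c.value = 'blue'
  AND s.facet = 'size'  AND s.value = 'large';
```

Counting a facet's whole distribution is then just one rb_cardinality per stored bitmap, which is the kind of work that turns hundreds of seconds into milliseconds in the benchmark mentioned above.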
[00:05:15] Next piece of content: "ISO Better Scaling, Instacart Drops Postgres for Amazon DynamoDB". This is from thenewstack.io, and I was looking through this article at some of the reasons why they decided to make the transition, and I noticed a few points. One where they said, quote, "as busy as the daytime/evening hours are is equivalent to how quiet the nighttime/very early morning hours are". So basically, there's one time of day where they're very, very busy and one time where they're not busy at all. And the main area where they're exploring this is notifications. So presumably they have a ton of notifications during the daytime and evening hours, and it drops off a cliff at nighttime and very early morning. The key is that Postgres didn't scale based on demand, and that's something DynamoDB could do. The other point is that Instacart's main concern about DynamoDB was cost, not latency or scaling requirements. So basically it seemed to be not a performance reason, but: how can we get the lowest cost, and how can we scale down resources that aren't used during particular times of the day? Now, upon reading this, I also thought about other providers that are trying to separate compute resources from storage resources. If you were able to do that, you could keep your storage resources consistent but scale your compute resources up and down to potentially save money. At this time, it seems like only third-party providers are doing this. But I wonder if, in the future, Postgres will have an open source solution for this type of concept: being able to dynamically scale the resources your database needs up and down over the course of a day, without having to pay for one large server sized for peak capacity. Because what about all the other times it's not operating at peak capacity? Essentially, you're not utilizing all the resources of that system and you're spending more money.
So I wonder whether, in the future, some open source project will be able to implement something like that. But if you want to learn more details, you can check out this article.
[00:07:21] The next piece of content is the Rubber Duck Dev Show episode 68, "Should You Store Business Logic in Your Database?". This is our weekly show, and this time we discussed a database topic. We basically discussed the spectrum: some people consider the database just a dumb data store, while other people want to put all sorts of business logic, functions, and procedures in the database. Not so much today, but historically that was definitely done a lot in application development, although the pendulum seems to be swinging back if all you want to write is front-end application code and have Postgres, the database, consume and push out JSON. We discuss, on that continuum of no business logic versus the majority of the business logic living in the database, what makes the most sense from a performance perspective and from a maintenance perspective. So if you're interested in learning more about that, we definitely welcome you to check out our show.

Next piece of content: "Debugging Postgres WAL events with pg_walinspect". This is from Postgres.com, and there have been utilities called pg_xlogdump and pg_waldump that let you take individual WAL files and analyze them. Well, now, as of Postgres 15, there's pg_walinspect, a new extension that allows you to use SQL to query the contents of the WAL. You give it a start LSN and an end LSN, and you can analyze the contents of the WAL just by running SQL queries against the database. This is super interesting, because I've used pg_waldump before to analyze WAL files, and having this right in the database, to be able to put a load on it and do the analysis within SQL, would have been very beneficial. But if you want to learn more about that, you can check out this blog post.

Next piece of content: "The bountiful world of Postgres indexing options".
This is from kmoppel.github.io, and he's talking about a post we actually covered last week in Scaling Postgres, which compared using two individual indexes for querying on two columns versus a composite (compound) index covering both columns, and how the composite index gave much better performance. He tried to replicate that and then said, hey, let's try different index types. So he asked: what about using a covering index to get the second column in? He also examined hash indexes, and he publishes his results here. The covering index comes fairly close to the composite index's results, but merging two hash indexes was even worse than merging two B-tree indexes. If you want to learn more about these tests, you can definitely check out this blog post.

Next piece of content: there was a final blog post posted for PGSQL Phriday. This one is "What is the PostgreSQL Community to me?", from Andreas Scherbaum. So if you want his perspective on the Postgres community, we definitely welcome you to check out this blog post.

Next piece of content: "Postgres support for JSON is ten years old". This is from Crunchydata.com, and it's a rough overview of JSON within Postgres with some practical advice regarding storing, querying, and filtering it, as well as query performance based on the indexes you use. So if you want to learn more about that, you can definitely check out this article.

Next piece of content: there was another episode of Postgres FM this week. This one was on timestamps, so if you want to learn all about timestamps in Postgres, you can definitely check out this episode.

And the last piece of content: the PostgreSQL Person of the Week is Aleš Zelený. So if you're interested in learning more about Aleš and his contributions to Postgres, definitely check out this blog post.
For this episode of Scaling Postgres, you can get links to all the content mentioned in the show notes. Be sure to head over to Scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.