222 Times Faster Analytical Queries With Columnar Storage | Scaling Postgres 283

Episode 283 | September 24, 2023 | 00:16:00

Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, we discuss how to get 222 times faster analytical queries with columnar storage, a Postgres 16 feature review, the birth of a feature and fuzzy search.

To get the show notes as well as get notified of new episodes, visit: 

https://www.scalingpostgres.com/episodes/283-222-times-faster-analytical-queries-with-columnar-storage/

Episode Transcript

[00:00:00] How would you like to have 222 times faster analytical queries? [00:00:06] This is something that Hydra has shown is possible. But before we get into that, I hope you, your friends, family and coworkers continue to do well. And also be sure to watch till the end, because I actually changed my outro for the first time in over five years. But let's take a look at Hydra. This is from hydra.so, and this blog post is announcing its general availability. It is an open source extension, kind of like Citus, kind of like TimescaleDB is now, but it focuses on column storage for analytical workloads, although it says it also supports hybrid workloads as well. This particular post goes over what column storage is. It basically stores columns first, meaning the same type of data is all stored together. In the OLTP example here, it's easy to pull out one record of information. So if you want all the information about Joe, that's easy to get; all the information about Felicia, that's easy to get. But columnar-based storage stores similar data types together. So if you look down here, the order of the data shows everybody's name first, then their IDs, then the colors, et cetera. That makes it fast for counts, sums and averages over a particular column of data. Plus you can also get really efficient compression of that data, which can lead to further speed improvements. So, like I said, Hydra is an extension, and by default the tables it creates are columnar, but you can create standard heap tables using the USING heap clause, and you can convert them back and forth. I don't know how efficient that would be if you have hundreds of millions of rows, but it says you can do it. Now, there is a five minute video included on the page, and I'm going to focus on one point where they did a count of all the rows in a heap table with 100 million rows in it and then did the same count against a columnar table, and the columnar table was 222 times faster. But if you notice this query, I don't think there are any indexes on it, and it's not narrowing down any data at all. So I think this might be a very specific use case where you get this type of immense performance. And then he showed something that is about 600 times faster, where he did a sum of one column, an average of another column, and a count, and it was over 600 times faster using the columnar storage compared to the heap storage. And in the blog post, they have a link to benchmarks. This is a link to benchmark.clickhouse.com, looking at ClickBench, which is a benchmark for analytical DBMSs. And for the comparison listed down here, they have Hydra, TimescaleDB with compression, Citus, Aurora, TimescaleDB presumably without compression, and standard PostgreSQL. And at least for the queries that this benchmark is testing, Hydra beat every other solution here. Then if we look at the GitHub page, they have a quick start. So you could use their Hydra service with a free tier, or you can run Hydra locally using a Docker container. But I wanted to focus on the FAQs here, where they say: why is Hydra so fast? It's basically columnar storage, query parallelization, vectorized execution, column-level caching, and tuning Postgres. So there are a lot of things they're doing to get all of that speed, and of course there are compromises for doing that. But they also say: what operations is Hydra meant for? It's basically great for doing aggregates (count, sum, average), WHERE clauses, and bulk inserts, updates and deletes.
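To make that heap-versus-columnar distinction concrete, here's a minimal sketch of the kind of table setup being described; the table and column names are my own, and it assumes the columnar access method provided by Hydra is installed. Per the post, columnar is Hydra's default, so the USING columnar clause below is just being explicit:

```sql
-- Columnar table: Hydra's default, shown explicitly here.
CREATE TABLE events_columnar (
    id        bigint,
    username  text,
    amount    numeric
) USING columnar;

-- Standard row-oriented (heap) table for comparison.
CREATE TABLE events_heap (
    id        bigint,
    username  text,
    amount    numeric
) USING heap;

-- The kind of whole-table aggregate where columnar storage shines:
-- no indexes, no WHERE clause, just scanning a column or two.
SELECT count(*), sum(amount), avg(amount) FROM events_columnar;
```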
So the key is bulk. They intend for you to load data in a bulk fashion; this isn't meant to replace your transactional database. And they say here what columnar is not meant for: frequent large updates, because those don't really work well with the column-oriented structure of the data, and small transactions. So this is not going to replace your typical transactional database. There are also some unsupported features at this time, such as logical replication, and columnar tables don't typically use indexes.
Now, as a contrast to this, Cybertec-postgresql.com did a blog post on Citus row store versus column store in PostgreSQL, and they discussed the difference between row store and column store and the focus on analytical use cases. By default Citus creates heap tables, but you can specify USING columnar to use their column-based storage. He is using the scale-out capabilities of Citus to create distributed tables and loaded in some time series data. The other thing I should mention is that columnar storage does tend to save on space. He witnessed five times smaller storage for the columnar-based data; again, similar data types, I believe, can compress better than with heap row storage. And the same size savings can be seen in Hydra and of course TimescaleDB as well. Now, his counts aren't as impressive. He only sees about a three times faster query between heap storage and columnar storage, but he wasn't using any WHERE clauses or anything like that; he was still just doing a count from a table, the way Hydra was doing. And then on a single column, he could even count the values, group by it and order by it, and it still gave better performance, although not as big as before. So I always take these performance improvements with a grain of salt and recommend testing out these solutions with your particular workload to see what kind of performance benefit you can get. But it's great to see this new option in the community to support analytical workloads. And if you do a lot of those types of queries, maybe you want to check out Hydra or even the columnar capabilities of Citus or TimescaleDB. But let me know in the comments if this is something that would be of potential interest to you.
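As an aside, if you want to measure that kind of space saving on your own data, here's a minimal sketch, assuming a columnar access method such as Citus columnar or Hydra is installed; the table names are hypothetical:

```sql
-- Make a columnar copy of an existing row-store (heap) table.
CREATE TABLE metrics_columnar (LIKE metrics_heap) USING columnar;
INSERT INTO metrics_columnar SELECT * FROM metrics_heap;

-- Compare the on-disk footprint of the two representations.
SELECT pg_size_pretty(pg_total_relation_size('metrics_heap'))     AS heap_size,
       pg_size_pretty(pg_total_relation_size('metrics_columnar')) AS columnar_size;
```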
Next piece of content: discussing PostgreSQL, what changes in version 16, how we got here and what to expect in future releases. This is from postgresql.fastware.com, and out of all the Postgres 16 feature posts that have been coming out, I think this is the best one and covers the most. Now, he does a little bit of a historical review, as you can see here, as well as a look at Postgres 17. But for the 16 features he really covers everything. I'm not going to go into everything, but he talks about all the different changes to logical replication; enhancements to the storage system, where we got the potentially up to three times faster COPY as well as the pg_stat_io view; new SQL features, including adhering to the new SQL/JSON standard as well as the collation changes; additions to security and different privileges; and general miscellaneous performance improvements. But of course, what also interests me about this post is what's potentially coming in Postgres 17, and there are a lot of exciting improvements coming to logical replication in particular. At least I hope they can get most of these in by version 17. The first is DDL replication. This is huge: being able to send table schema changes to a subscriber. The next is replication of sequences. This is a long time coming, because particularly if you want to use logical replication for upgrades, you always have to bring over the sequences, so this is another huge one. Then there's synchronization of replication slots to allow failover. This is another hugely important one, because once you fail over, you basically have to start logical replication from scratch again. But there are some other interesting ones too, like incremental backup; just a lot of interesting changes. So if you're interested in that, definitely check out this blog post. In addition, if you're liking this content, feel free to like the video, and if you're not subscribed yet, hit the subscribe button.
[00:07:47] Next piece of content: Active-Active in Postgres 16. This is from Crunchydata.com, and they're talking about what historically has been called multi-master replication, so basically having two primary databases running and synchronizing changes between each other. This uses the logical replication changes in 16 that we discussed before, where you can now track the origin and basically only synchronize data that hasn't come from another replica, and that is what enables this active-active replication to happen. So he talks about the origin filter that's now present in the communication between them, to be able to discern where a change originated. He sets up a sample environment, inserts some data and gets logical replication set up with the publishers and the subscribers, emphasizing origin = none to be able to support active-active, and then does a replication test where he updated some data and it updated appropriately in each system. But this is so new that you can't really say we have multi-master replication today, so there are a lot of things to be careful about and to confirm are working as intended. He presents an example where you could actually get conflicts between the two that will never be resolved on their own, so basically you need to go in and fix issues like that manually. So again, I like where this is going, but it's definitely still early days. Check out this blog post if you want to learn more.
[00:09:13] Next piece of content: JSON updates in Postgres 16. This is from Crunchydata.com, and this covers changes to Postgres to match the new SQL/JSON standard. He shows things like checking whether a value IS JSON, as well as JSON_ARRAY, JSON_ARRAYAGG, JSON_OBJECT and JSON_OBJECTAGG, and he goes through examples of each of these and use cases for them. I'm not going to go through all of this, but it's definitely a great resource if you are looking for the changes in Postgres 16.
[00:09:44] Next piece of content: Postgres 16 is released. This is from pganalyze.com, and this is Lukas's take on the changes in Postgres 16, so I encourage you to check out his piece of content with regard to that. And lastly, there are the slides that Bruce Momjian did for a presentation on the major features in Postgres 16. So these are some additional resources to check out the new features in Postgres 16.
Next piece of content: Grouping digits in SQL. This is from peter.eisentraut.org, and this was a very interesting blog post where he covers, as he says here, the story of a feature. So this is how this feature came to be, and the feature is where you can put a delimiter in your numbers to make them easier to read.
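Here's a quick illustration of the kind of literal he shows (my own example):

```sql
-- Postgres 16 accepts underscores as digit group separators in numeric literals.
SELECT 10_000_000 AS readable_number;   -- returns 10000000
```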
So there are two underscores in there, but Postgres still reads it as 10 million. And he says normally you would propose this as a patch to Postgres, but sometimes that's a bit of a struggle, because the project doesn't really want to make changes that might go against what may be coming in the SQL standard. So he actually turned it on its head and decided to, quote, get this into the SQL standard first. So first they got this into the SQL standard, and after that was done, he covers the issues of actually implementing the feature, where Postgres would do some crazy things. For example, it would parse SELECT 123abc as selecting the number 123 with a column alias of abc, which is a little bit crazy to me, but apparently that's how the parser worked. So they actually had to make changes to adjust that, as well as a number of other issues, including having to deal with some SQL/JSON issues. So it's definitely a fascinating post on how a feature got into Postgres. Check it out if you're interested.
[00:11:28] Next piece of content: Hired versus fired? Fuzzy search in PostgreSQL. This is from Cybertec-postgresql.com. They're talking about fuzzy search, and particularly some of the functions available in the fuzzystrmatch extension. He just set up a table, inserted five rows into it and tested some different functions. The first one is soundex, which is supposed to match words that sound phonetically similar, and he demonstrates that when you put it through the soundex function, you can get "fired" returned even though you misspelled it. So that's one way fuzzy string search will work, and this will work with an expression index, so you could generate the soundex value as part of the index build. Then he looked at the levenshtein function, which basically determines how many letters you have to change to get from the word in the system to what was in the query, and he shows an example of doing that here. And then lastly he looked at metaphone, which he says is similar to soundex, but you can define the length of the output. So you can see that it takes a string and converts it into what it sounds like, and you specify the length; this is a four-character encoding of the phrase and this is a six-character encoding of the phrase, and then of course you can compare them to each other. But check out this blog post if you want to learn more.
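Here's a rough sketch of those three functions with my own example values, assuming the fuzzystrmatch extension is available:

```sql
CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;

-- Soundex: phonetic codes match even with a misspelling.
SELECT soundex('fired') = soundex('fyred');   -- true

-- Levenshtein: number of single-character edits between two strings.
SELECT levenshtein('hired', 'fired');         -- 1

-- Metaphone: phonetic encoding with a maximum output length you choose.
SELECT metaphone('hired', 4), metaphone('hired versus fired', 6);
```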
Next piece of content: there was another episode of Postgres FM last week. This one was on logical replication. They talked about logical replication in general; they didn't spend too much time on the Postgres 16 features, but spent a lot of time talking about logical replication in upgrade scenarios, to upgrade a database. They even called out a Twitter thread where, quote, GitLab upgraded multi-terabyte, heavily loaded clusters with zero downtime, essentially using logical replication. So if you want to listen to that episode, you can check it out here, or watch it on the YouTube channel down here.
[00:13:20] Next piece of content: Enter the matrix: the four types of Postgres extensions. This is from tembo.io, and he's talking about how extensions pretty much fall into four buckets: either they require being loaded or they don't, and either they require running CREATE EXTENSION or they don't. And there are different reasons for those use cases and for how you typically get and enable extensions. Apparently they're working on some features to be able to load any extension in Postgres easily. So I don't know if the intent is some type of library installer. Like, for example, I use Ruby, and they have gems, and installing a Ruby gem is quite easy: you just install it using a toolkit called Bundler. So I don't know if that's where they're going with this, but how I install extensions today is basically to check: okay, does it need to be in shared_preload_libraries, as they mention here? If it does, put it there and do a restart of Postgres. And then secondly, do I need to run CREATE EXTENSION for it or not? That's basically how I install extensions now. But this post goes into a lot of detail about extensions, so if you're interested in that, check out this blog post.
And the last piece of content: Citus: seven commonly used advanced SQL tools. This is from Cybertec-postgresql.com. He identified seven commonly used tools for Postgres and looked at how they work in Citus. The first thing he talks about is naming Citus databases, and basically it's difficult to create a new database within Citus; the takeaway he posts is, quote, just use the postgres database and you'll be okay. Two is loading data using COPY, and it works just fine to use COPY in Citus. Three is advanced SQL using ordered sets, and the use case for this is if you want to find the median, not necessarily the average or the mean, but the actual middle value, and for that you have to order the set of data. This does work in Citus. Four is window functions, and he was able to get a window function working in Citus. Five was grouping sets, and those didn't really work; he ran into an error, but he was able to accomplish the same thing using UNION ALL. Six is using triggers with Citus, and what he found out is that it's not possible to use triggers on distributed tables. So definitely some things to keep in mind if you're exploring using Citus.
I hope you enjoyed this episode. Be sure to check out scalingpostgres.com for all the links in this episode, as well as the video, the podcast replay and a transcript of the show. Thanks, and I'll see you next week.

Other Episodes

Episode 243

November 27, 2022 00:10:25

IN vs ANY, Ghost Conditions, Percentage Calculations, Variadic Unnest | Scaling Postgres 243

In this episode of Scaling Postgres, we discuss how PG15 helps reduce replication lag, how to get the parameters used in prepared statements, PostGIS...

Listen

Episode 51

February 18, 2019 00:15:58

Value of SQL, Window Functions, DB Migrations, Data Storage | Scaling Postgres 51

In this episode of Scaling Postgres, we review articles covering the value of SQL, window functions, scaling database migrations and efficient data storage. To...

Listen

Episode 92

December 02, 2019 00:13:17

Book Sales, B-tree Boost, More Postgres 12, Using pgBackRest | Scaling Postgres 92

In this episode of Scaling Postgres, we discuss book sales, boosts to b-tree indexes, more Postgres 12 features and how to setup and use...

Listen