2 To 4 Times Faster With Vectorization | Scaling Postgres 291

Episode 291 November 19, 2023 00:18:15
2 To 4 Times Faster With Vectorization  | Scaling Postgres 291
Scaling Postgres
2 To 4 Times Faster With Vectorization | Scaling Postgres 291

Nov 19 2023 | 00:18:15

/

Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, we discuss Timescale's performance improvements from adding a query vectorization pipeline, doing blue-green deployments for databases, using reserved_connections and two improvements to null handling.

To get the show notes as well as get notified of new episodes, visit: 

https://www.scalingpostgres.com/episodes/291-2-to-4-times-faster-with-vectorization/

 

View Full Transcript

Episode Transcript

[00:00:00] I don't normally choose a non open source post for the first article that I cover on scaling postgres, but I had to make an exception in this case because I thought the engineering was quite interesting and educational. But before we get into that, I hope you, your friends, family, and coworkers continue to do well. Now, once again, at the end, I'm going to have some more information on my postgres optimization course that's coming up, so be sure to stay till the end to learn more about that. But the first piece of content is teaching postgres new tricks, SIMD vectorization for faster analytical queries. This is from Timescale.com, and what they've done in Timescale DB version two point twelve is added a vectorized query pipeline that makes use of SIMD or single instruction multiple data vectorization. So let's kind of walk through what that means. So Timescale has had columnar storage, and I think an earlier release this year they added high levels of compression to it as well. And they also offer a hybrid solution where you get recent data at the row level, but then older data is put in this compressed columnar storage, and they show an example of what they mean by columnar storage. A typical row based disk output looks like this. You have a row with each column of information, but in column based storage you store data a column at a time. So all the device IDs are in, say, a dedicated file. All of the timestamps are in a dedicated file, so it makes it much harder to find all the information for one record in this format. But it's great for doing aggregations and averages, calculations that are taking place over a particular column or counts. Now what's interesting about this, if you look at it, the data type is the same, so the device ID has all the same data type, same data type for the timestamp. So if you're going to be doing any kind of operations on a particular column, it will be the same operation. And because of that they could leverage this single instruction, multiple data. Maybe I should just call this SIMD, if you know what I should call it. Put it in the comments, please. And CPUs have SIMD instruction sets, and it was developed in the late ninety s, and it was predominantly used for media work with audio and video, or of course video games. And as opposed to taking two values, conducting an operation to get a new value, the CPU can work with vectors. That's what it means by vectorized. So you're converting multiple values into a single vector or an array of values, and you're doing the same operation on all of those values in parallel. And the CPU has a way to do this which was needed for the media processing and video games. So they wanted to develop a pathway of using this for their comma stored based data, because it's all the same data type and they can conduct the same operation, but just do it in parallel. So they developed this new vectorized query execution pipeline and they go into more details about it, and you can read up about that if you want to learn more. But they're talking about the different stages. So they have an example query here. And basically they devised ways to do vectorized decompression of the data because it's typically already compressed or column storage way to do vectorized filtering. So in the where clause to do that, do the expressions that they run, and this is for some of their time, bucket functions, and then vectorized aggregation, which is doing the sum of the value and adding it all up. They had these kinds of results that are shown here. So they have just select the sum of a value from a metrics table. And the other one, they're just adding a filter, a where clause to it, where a metric is greater than some value. So in the first example for the sample of the data they had, it took 2 seconds to do this with row based storage, doing columnar compression dropped it down to 1.4 milliseconds the same query. So compression did give an advantage and then vectorizing that dropped it all the way down to 353 milliseconds. So from the point of row base to the vectorized aggregate it was 5.8 times faster, with the vectorized portion making it about four times faster. And in the query below, having a filter in does slow it down a bit more. The row based is actually faster than without the filter, but it's slower in some cases with a columnar storage and being vectorized. But still just the vectorization was about twice as fast. So I found this super interesting. Just the technology involved in doing it definitely encourage you to give it a read. And if you have a need for column storage and analytical queries, maybe you want to check out timescale. [00:04:50] Next piece of content there was another episode of Postgres FM last week. This one was on blue green deployments. So Nikolai and Michael were talking about a blog post on AWS about how RDS and Aurora now offer a blue green deployment procedure. And they have a link to that post right here. I personally kind of find Bluegreen being a little bit of a misnomer, because looking at the blog post, they're essentially doing logical replication to do an upgrade. So basically creating a new instance, getting it logically replicated with the current instance, and then doing a switchover procedure to it, which is what I've done for a number of customers recently to do their upgrades. But now it looks like RDS and Aurora come out of the box with being able to do something like that. And Nikolai and Michael discussed that. They seemed a little bit iffy on the blue green terminology as well. And I think Michael was also saying that normally you think about supporting rollbacks in that case, so if there's anything that's happened, you're able to roll back, whereas with the solution they're proposing, that doesn't seem as easy to do. And Nikolai gave some of his perspective on doing logical replication based upgrades as well. So definitely encourage you to check out their podcast, which you can listen here, or check out the YouTube video down here. [00:06:08] Next piece of content Connection Management in PostgreSQL Reserved Connections this is from cybertechn postgresql.com. They're talking about the new feature in 16 that allows you to reserve additional connections to be able to use the database when for everyone else they've run out. So if you look at Max connections, you just join show Max connections and by default it's set to 100. Well, for the longest time there have been super user reserved connections and the default is three, meaning that the system keeps three connections around at all times to allow a super user to connect to the database to do whatever they need to do. Now what that means is there's actually only 97 connections available for all other users, but the new feature is reserved connections. So this allows you to set up reserved connections for people who are not super users. By default it is set at zero, but that means if you set it to ten, well now as opposed to 97 connections being available for everyone else, it will be 87. So you have to subtract the super user Reserve connections and reserve connections from the Max connections to see how many are normally available. And they show how to set that up here. Basically you just create a user, it's creating Joe, and then you're granting the PG user reserve connections to Joe. So now Joe has the capability to always connect to the database. But check this post out if you want to learn more. Next piece of content Two exciting PostgreSQL features to improve null handling. This is from PGRs. Net. The first thing it covers is null values in unique columns and that null is not equal to null. Therefore, a unique column can still have multiple null values. But what if you don't want that? Well, in Postgres 15 there's a new feature called Unique nulls, not distinct, and that allows only one null to be present in the column's values. The next feature is ensuring a single column has a value. Now what he's referring to here is when you have, say, a polymorphic table where you have a notifications table, and there are many types of entities that can have notifications. For example, a company, an order, a user, and ideally you just want one of these references populated for each row. So this is a company notification, this is an order notification. [00:08:31] Now historically you could do a constraint check that the company ID is not null or the order ID is not null, or the user ID is not null. But that just makes sure that at least one is populated. But it doesn't prevent multiple ones being populated. Like there's an order and a user ID with the same notification, which in his case he didn't really want. But there is a function called Num nulls and num non nulls, and you can create a check constraint using the non non nulls function and include the different columns and say it must be equal to one. And now you get an error message if you try to insert a value in all of these. So this row insertion would have failed. So if you want to learn more, definitely check out this blog post. [00:09:16] Next piece of Content top eight PostgreSQL extensions you should know about. This is from timescale.com and we'll just take a look at the list here. And their first is PostGIs, which I'm a bit surprised by. The number one that I am used to seeing is PG StAt statements, but PostGis is definitely one of the most popular extensions. Number two, they do have PG Stat statements. Number three, they have PG crypto. So I could see that I haven't used it in a lot of projects. And actually there was a post maybe not in the last weeks, but week before last that someone was complaining about PG crypto. And I think some of the key management capabilities I know what I tend to do is actually use encryption keys on the application servers and encrypt the data there before storing in the database. So I haven't used PG crypto in my projects. Number four is PGPartman. So a lot of people like using this for partitions. I tend to like to roll my own solution, postgres foreign data wrapper Yep, I could definitely see that for being able to access an external postgres database. PG vector definitely. With all the AI startups, this is super popular nowadays, HDR, although ever since JSON Veek came out. Why is anyone using HDR anymore? I don't know. And eight is PGPCRe, which I've never heard of and I've never used, but it provides perl compatible regular expressions. So I think probably the last two of these are definitely not super popular. Maybe it should have been the top six, but one thing I'm surprised by is it didn't have something like PG trigrams. That's something I know I've used a lot and other clients have used as well, but check out this post if you want to learn more about it. [00:10:59] Next piece of content GitLab's challenge with Postgres Lightweight Lock Lock Manager Contention this is from PGinalyze.com and this is Lucas's five minutes of postgres, and he's covering a post from about seven months ago where GitLab was having some issues with the lock manager and basically I guess they were doing a lot of joins or had a lot of indexes or maybe even partitions and a lot of activity, which is probably preventing the fast path to happen in query planning and maybe execution. So he goes over their post and some insights that they found out. He even includes some potential solutions for dealing with this. So he says in Postgres 16, according to Andres, where some of the behavior was improved with lightweight lock lists, so that's faster. So moving to 16 could help alleviate some of this problem. The other thing he mentioned is using prepared statements more heavily because this issue tends to happen mostly at planning time. So if you could reduce your planning time by using prepared statements, that would help. So maybe if you use some of the newer versions of PG bouncer or other poolers like PGCAT that support transaction pooling with prepared statements, that could help alleviate some of this contention too, if you're seeing it. But check out this blog post if you want to learn more. Next piece of content vector indexes and postgres using PGvector IVF flat versus HNSW this is from Timbo IO and this is a post about the PGVector extensions, talking about the different indexes IVF flat for large language model work as well as the hierarchical Navigable Small Worlds Index. So you could definitely review this part for general education. And then at the bottom they do a comparison of them to give you kind of a decision table you could use that's shown here where if you're looking for short build times, definitely go with the IVF flat versus HNSW. If you're looking for the smallest size, IVF flat tends to be smaller than the HNSW. If you're looking for speed, HNSW is definitely it, IVF flat is slower, and the change in recall. This is kind of a proxy for accuracy upon updates. It gets really far off with IVF flat, whereas HNSW tends to remain consistent in its recall. Also, from other posts we've looked at, the recall tends to be higher in HNSW two, but check out this post if you want to learn more. Next piece of Content A complete guide to creating and storing embeddings for PostgreSQL data this is from timescale.com. This is more PG Vector AI related content, but what I found was interesting is the scenario that they were proposing here or the use case. So they had a blog and they really didn't want to change any of the table structure or alter it too much. So they came up with this architecture for creating a semantic search solution where you have your blog table and the recommendation would be apply a trigger to it that updates a blog work queue table so basically any change that happens here recorded in this work queue table. To know a change was made, you need to do something about it. Then you would have an embeddings job that would pull work from the queue, create the embeddings and update the blog embedding table so I thought there was an interesting architecture to add embeddings to existing applications without altering the way the existing application works, and they are showing it in the same database. But there's no reason why you couldn't put this blog embedding table on its own dedicated database, which I've seen recommendations for, I think crunchy data tends to say to put the embeddings in a separate database. But I thought this was interesting and they go into the Python code to get it set up in this example. Now there's a related post to this, and that's this post how we designed a resilient vector embedding creation system for PostgreSQL data. They repeat a lot of the first post here, but then get into some of the actual database functions related to it. So if you want to learn more, definitely check out this blog post. Next piece of Content PostgreSQL 17 Reset Shared statistics at once with PG Stat Reset Shared this is from dbiurfaces.com. Now PGSTAt Reset Shared has worked in versions before Postgres 17, but you have to put in what you want to reset. So here you're resetting the PG rider or the archiver and the stats for each of these areas, the PG stat wall, the PG stat IO get reset based upon that. Well now in 17, if you give it an empty parameter, so you just do that, it actually resets all the stats. So check this blog post out if you want to learn more. [00:15:42] Next piece of Content fun with Postgres floats, positioning and sequencing this is from Crunchydata.com. Now this contains spoilers about how he solved the 2022 Day 20s challenge of the advent of code. So this is quite a long time coming. If you want to check it out, feel free to review this post. [00:16:01] Last piece of Content Kubernetes running the Cybertuck PostgreSQL Operator this is from Cybertechsql.com and it seems crazy that there are so many operators nowadays for postgres. So EDB has one they're supporting. There's the Procona operator, the crunchy data operator, the Cybertech operator. I'm sure there are other ones. This is an area I would kind of like to have less choice available in terms of running postgres and Kubernetes, but they talk about how to get started setting up minikube to get started with the operator. So definitely check out this blog post if you want to learn more about that. And before we go, I do have a few things to mention about the postgres optimization course I'm planning. So there is a courses section on the Scaling Postgres website Now that has the free course, I set up the PostgresQL Performance starter kit. Well, starting on Black Friday, which is November 24, I will be posting the course there that has information about it and how you can become an early adopter of the course if you want to for a 50% discount. But I will be sending out an email announcement as well on Friday, November 24. So if you want to receive that announcement, go ahead and sign up for the email [email protected]. And just to reiterate a few things about the course, it is focused on performance optimization, so not so much on configuration, and I intend to create as massive a database as I can. So we're talking multiple terabytes with billions of rows for some of the tables to try to get a real world sense of how to optimize queries in that type of environment. It won't just be logging into the course and watching videos and being done with it. It's going to be released a module at a time and there will be some live sessions associated with it, so I really want to make it as engaging as I can. So look for the announcement this coming Friday. [00:17:54] Hope you enjoyed this episode. Be sure to check out ScalingPostcres.com where you can find links for all the content covered as well as the podcast version as well as transcripts of the show. Thanks and I will see you next week.

Other Episodes

Episode 92

December 02, 2019 00:13:17
Episode Cover

Book Sales, B-tree Boost, More Postgres 12, Using pgBackRest | Scaling Postgres 92

In this episode of Scaling Postgres, we discuss book sales, boosts to b-tree indexes, more Postgres 12 features and how to setup and use...

Listen

Episode 161

April 18, 2021 00:18:11
Episode Cover

Scaling Lessons, Synchronous Commit, Connection Management, BLOB Cleanup | Scaling Postgres 161

In this episode of Scaling Postgres, we discuss 5 years of scaling lessons, working with synchronous commit, managing connections and cleaning up BLOBs. To...

Listen

Episode 257

March 19, 2023 00:19:16
Episode Cover

Adopting PgCat, Time Bins, work_mem Settings, Bad Constraints | Scaling Postgres 257

In this episode of Scaling Postgres, we discuss Instacart adopting PgCat, binning or bucketing your data by time, the best settings for work_mem and...

Listen