23 Or 1.1 Times Faster Performance? | Scaling Postgres 284

Episode 284 | October 01, 2023 | 00:18:59

Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, we discuss whether something is 23 times faster or 1.1 times faster. We also discuss the release of Timescale Vector and whether you can build a queue on Postgres.

To get the show notes as well as get notified of new episodes, visit: 

https://www.scalingpostgres.com/episodes/284-23-or-1-times-faster-performance/

 


Episode Transcript

[00:00:00] So, you know, in last week's episode I showcased a really large performance improvement, but I did say to take it with a grain of salt. Yeah, you should do that. But before we get into that, I hope you, your friends, family, and coworkers continue to do well. I also want to thank everyone who gave feedback on changing the intro, basically removing it or coming up with a faster one. It seems pretty much universal that it was good to get rid of the intro, so thank you.

The first piece of content actually looks back at what I covered last week: the general availability of Hydra, and specifically the ClickBench benchmark. Thanks to a viewer, Guillermo, so shout out to him, who emailed to say that seemed like an enormous performance difference, and pointed out that you can click the tuned checkbox here to see what a tuned Postgres looks like. If I uncheck it, this is the result presented last week, where Hydra pretty much exceeded the performance of all these other database systems in this benchmark. But once you check PostgreSQL (tuned), it becomes the second fastest. It's not ahead of Hydra, but the difference is a mere ten or eleven percent, so it's not as huge a performance gain. And looking at the rest of these, I wonder, are they untuned as well? I don't see a tuned option for them.

But then the question is, what does "tuned" actually mean? If you go to the ClickBench source code, you can see all they're doing for tuned PostgreSQL is increasing shared_buffers to 8GB, which is something everyone should be doing, sized according to your system's memory; increasing max_parallel_workers to 16, which is fine for this type of workload; increasing max_parallel_workers_per_gather to 8, which is also fine; and increasing max_wal_size to 32GB. So nothing too extreme.

And if we look at the details for each query, which you can see by hovering over it here, there are certain queries where Hydra, in this first column, really performed poorly versus tuned PostgreSQL, which is this one, whereas there are other queries where Hydra outperforms PostgreSQL. So it's really a case-by-case basis: what the query is asking for determines what performs better or worse. But looking at the overall performance of this benchmark, the difference is only maybe around eleven percent, which is why you always have to take performance gains with a grain of salt. I feel like reporting performance is a bit like finding an accountant: when you're looking for an accountant, presumably you need to ask them, what does two plus two equal? And the correct answer is not four, it's "well, what do you want it to equal?" There are so many ways to alter performance findings, and even here, I'm sure there are issues with this benchmark. The truth is you have to do your own proof of concept, with your own data and your own queries, and see how different solutions perform to find what works best for you. But again, shout out to Guillermo for contacting me about this.
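For reference, here's a sketch of what those ClickBench "tuned" settings amount to. The parameter names are the standard Postgres ones, but treat the values as benchmark-specific rather than universal recommendations:

```sql
-- Roughly the ClickBench "tuned" PostgreSQL settings (requires superuser).
-- shared_buffers needs a server restart; the others only need a reload.
ALTER SYSTEM SET shared_buffers = '8GB';  -- size this to your machine's RAM
ALTER SYSTEM SET max_parallel_workers = 16;
ALTER SYSTEM SET max_parallel_workers_per_gather = 8;
ALTER SYSTEM SET max_wal_size = '32GB';
SELECT pg_reload_conf();
```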
But the true first piece of content for this week is "How we made PostgreSQL a better vector database." This is from timescale.com, and as a heads up, it's a 24-minute read, so there's no way I can cover everything in this blog post appropriately.

This is a new product they're calling Timescale Vector. I don't see that they've released an open source extension for it, so I don't know if it's open source or only available through their cloud-based Timescale product; that wasn't clear to me. But what's interesting is that they've developed a different index type they're calling DiskANN, for approximate nearest neighbor search, which they say can even outperform pgvector's HNSW (hierarchical navigable small world) indexes, the index type that saw a huge improvement in the most recent pgvector release. They indicate from their tests, which, as you know, you have to take with a grain of salt, that it's about 30% faster. The blog post goes over this product, and they say it's also been optimized for hybrid time-based vector search queries.

So what is this new index type? It's focused on being disk based. Many vector indexes are built essentially for an in-memory database, but this one has been optimized for SSDs, and it was created based on Microsoft's DiskANN. They indicate it has different trade-offs and unique advantages compared to HNSW, so there may be certain use cases where HNSW wins and others where DiskANN wins. They also say the index supports quantization, which is a way of shrinking the actual index size. As an example, they show that the pgvector HNSW index and the Timescale Vector index were approximately the same size, but with quantization it drops to about 10% of that size, so that definitely helps.

They talk a little about their benchmarking methodology, and note that they saw this performance benefit in both single-threaded and multi-threaded modes. Lastly, they talk about index build time. As mentioned in a previous episode of Scaling Postgres, the pgvector HNSW index takes a really long time to build, while IVFFlat builds are shorter; Timescale Vector seems to be somewhere in the middle, and quantization adds a little more time to the build as well. They then cover how this new index type works, going over some graphs and describing it.

Now, related to this post is another from timescale.com about a Python library for using PostgreSQL as a vector database in AI applications. So if you're looking for a library to help you work with this, they've released that as well. If this is of interest to you, definitely check out these two blog posts.
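The post doesn't show the Timescale Vector API itself, but since it benchmarks against pgvector's HNSW index, here's roughly what creating one of those looks like; the table and column names are made up for the sketch, and the tiny dimension is just to keep it runnable:

```sql
CREATE EXTENSION IF NOT EXISTS vector;  -- pgvector 0.5.0+ for HNSW

CREATE TABLE documents (
    id        bigserial PRIMARY KEY,
    body      text,
    embedding vector(3)  -- tiny dimension for the sketch; real embeddings are much larger
);

INSERT INTO documents (body, embedding) VALUES
    ('first doc',  '[1, 0, 0]'),
    ('second doc', '[0, 1, 0]');

-- HNSW index for cosine distance; m and ef_construction are the
-- build-time knobs (these values are pgvector's defaults)
CREATE INDEX ON documents
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

-- Approximate nearest-neighbor search, ordered by cosine distance
SELECT id, body
FROM documents
ORDER BY embedding <=> '[1, 0.2, 0]'
LIMIT 5;
```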
[00:06:09] Next piece of content: "Choose Postgres queue technology." This is from adriano.fyi, and he's endorsing using Postgres's LISTEN and NOTIFY capabilities, along with FOR UPDATE SKIP LOCKED, to build your queues, arguing you don't necessarily need one of these other sophisticated queuing products: if you're already familiar with Postgres, use it for this purpose. This repeats the theme that Postgres is good enough to do a lot of these things.

He talks a little about the background job landscape, and many of the solutions I'm familiar with use Redis. There's also Apache Kafka, RabbitMQ, and Amazon SQS, which is a service, of course. And if you look at different background job libraries, there's Sidekiq for Ruby, Resque for Ruby, RQ for Python, Hangfire for C#, and Asynq for Go, and I think pretty much all of these use Redis. He acknowledges Redis is great, but it's another piece of technology you have to integrate into your environment if you're not already using it. So he's advocating that you can just use Postgres, again using LISTEN/NOTIFY and FOR UPDATE SKIP LOCKED.

The other thing he mentions is that he's built his own background job library in Go that he's calling Neoq. What he chose to do is not tie it directly to a particular backend, meaning you don't have to use Redis; he set it up so that, as he says here, you can switch between queues without changing any application code. You can choose an in-memory implementation, a Postgres implementation, or a Redis implementation. Personally, I rather like this approach where you can choose your queuing technology. For my own application, I never started using Sidekiq because I didn't want to introduce Redis as another dependency and have to worry about backups, testing restores, and all of that, so I just use a Postgres table for most of my queuing. I thought this was a great post and encourage you to check it out. It was also discussed on Hacker News, where a lot of people mentioned doing simple implementations in Postgres, along with the expected counterpoints and pushback. You can check out that discussion too if you're interested in learning more.
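As a refresher on the pattern the post endorses, here's a minimal job-queue sketch built on FOR UPDATE SKIP LOCKED; the table and column names are illustrative, not from the article:

```sql
-- Minimal Postgres job queue using FOR UPDATE SKIP LOCKED
CREATE TABLE jobs (
    id         bigserial PRIMARY KEY,
    payload    jsonb NOT NULL,
    status     text NOT NULL DEFAULT 'pending',
    created_at timestamptz NOT NULL DEFAULT now()
);

-- A worker claims the oldest pending job; SKIP LOCKED means concurrent
-- workers never block on, or double-claim, a row another worker holds
BEGIN;
SELECT id, payload
FROM jobs
WHERE status = 'pending'
ORDER BY created_at
LIMIT 1
FOR UPDATE SKIP LOCKED;

-- ...perform the work, then mark the claimed job done before committing...
UPDATE jobs SET status = 'done' WHERE id = 1;  -- use the id returned above
COMMIT;
```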
Next piece of content: "All the ways to capture changes in Postgres." This is from blog.sequin.io, and this is a product that basically lets customers interact with a local database while it syncs changes with a separate API, specifically the Salesforce and HubSpot APIs. So if you use one of those products, instead of integrating directly with the API, you talk to a database in your environment and it translates your database changes into API calls to those services: you can pull data down, insert data, and update data. This blog post describes how they explored different solutions for building that.

The first one they tried was LISTEN/NOTIFY along with FOR UPDATE SKIP LOCKED. The issues they had were that it's at-most-once delivery, you only receive notifications from the moment you start listening, so if your session breaks you may miss messages, and the payload size is limited to about 8,000 bytes.

The next solution they looked at was polling the table, so checking for changes in the table itself. But they couldn't really handle deletes that way. They could work around that with an ON DELETE trigger updating a separate table, but even then you're not going to get good diffs, so for these reasons it's not something they pursued.

They looked into replication, basically doing logical decoding from the WAL stream to get the information they needed. It looks like this could have worked, because nothing stopped them in their tracks, but they didn't like the complexity of having to learn a new replication protocol.

The next option was capturing changes in an audit table: any change to the main tables also updates an audit table with the information, and that way you can get the diffs and whatnot. So this potentially could have worked. They also explored foreign data wrappers, but again, that didn't fulfill all of their requirements. Overall, this blog post is a great summary of all the different change data capture options you have in Postgres.
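To make the logical decoding option they considered more concrete, here's a minimal sketch using the built-in test_decoding plugin; production CDC tools typically use pgoutput or wal2json instead, and the `accounts` table is just a demo stand-in:

```sql
-- Requires wal_level = logical in postgresql.conf (and a restart)
CREATE TABLE accounts (id bigserial PRIMARY KEY, name text);  -- demo table

SELECT pg_create_logical_replication_slot('cdc_demo', 'test_decoding');

-- Make a change somewhere...
INSERT INTO accounts (name) VALUES ('new customer');

-- ...then read and consume the decoded changes from the slot
SELECT lsn, xid, data
FROM pg_logical_slot_get_changes('cdc_demo', NULL, NULL);

-- Drop the slot when finished; an abandoned slot retains WAL indefinitely
SELECT pg_drop_replication_slot('cdc_demo');
```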
Now, they have a second blog post, linked here, called "We had no choice but to build a Postgres proxy," and this fulfilled all the requirements they were looking for. Much like PgBouncer is a proxy that can essentially speak Postgres, they built their own solution that is wire-compatible with the Postgres protocol. As they explain, when they started they were doing change data capture, which is what the previous post was about, and that captures things after the fact: writes to the API happen asynchronously. They questioned whether that was good enough, because it can create a whole host of problems; how do you roll back something that's already committed in the database because the API had issues doing an update? So they decided they wanted something synchronous, but what would that look like?

They had a few requirements: they needed to support INSERT, UPDATE, and DELETE; they wanted to support RETURNING clauses for their inserts; they wanted a commit to translate to a single API request; and errors must surface during the commit, so if the API operation fails, the client should receive a Postgres error.

The first thing they looked at was synchronous replication between two Postgres systems, where a change is only committed if it's committed on both. Could they set up an application that acts like Postgres and does synchronous replication with the database the customer is updating? But they had issues supporting the RETURNING clause, and in a failure case they would have had to figure out exactly how Postgres emits errors for these situations. The next option was foreign data wrappers, but they would have had to write their own, and because a lot of customers use hosted Postgres, where you can't install arbitrary foreign data wrappers, that wouldn't work either. So they went with the proxy: something that sits in the middle, receives the change, sends the API request, and only updates the database if the request succeeds. Essentially, it had to learn to speak the Postgres wire protocol, and in the end that was the solution they were looking for. I found these last two blog posts really fascinating in terms of figuring out how to build this kind of solution with Postgres, and I encourage you to check them out.

Next piece of content: "Evolution of logical replication." This is from amitkapila16.blogspot.com, and it's another great post on how logical replication has evolved over the years, as well as the changes in Postgres 16. As a quick review: Postgres 16 can prevent loops in bi-directional replication, which supports active-active logical replication. Logical decoding can now be performed from a standby server, which is huge. Operations can be performed with the table owner's permissions, so instead of using the privileges of whoever created the subscription, you can have the table owner perform the operations on a table. Non-superusers can create subscriptions, which is great for people using Postgres as a hosted service. Large transactions can be applied in parallel; you enable this with streaming = parallel on the subscription, and they have seen performance gains of 25% to 40% with it. Logical replication can copy tables in binary format, but to use that, both the publisher and subscriber need to be on version 16 or higher. And lastly, indexes other than the primary key or the replica identity can be used on the subscriber, so you can actually define those. Another good post covering all the different features of logical replication.

Next piece of content: "Postgres 16: a quick perf check." This is from kmoppel.github.io, and he typically does a quick performance review of new Postgres releases. He did one for 16, and the results were, he said, not really interesting: the difference was maybe 1.2%, some things a little higher, some a little lower, but on the whole not much of a difference. You can check this blog post out if you want to learn more.

Next piece of content: there was another episode of Postgres FM last week. This one was all about Postgres 16: they discussed the most important new features, what it means for users, and whether and when you should upgrade. You can listen to the podcast or watch the YouTube video down here.

[00:15:17] Next piece of content: "Rolling the dice with PostgreSQL random functions." This is from crunchydata.com. They're talking about the random() function, which returns essentially a float between 0 and 1.0. You can use it along with generate_series(), as they do here, to create several random numbers, and you can produce numbers in a given range by multiplying, adding, or even subtracting different values. You can even produce random words, as long as you provide a dictionary of words to pull from, indexing into an array as he does here, and you can order by random values as well. He also covers a new function in Postgres 16, random_normal(), which gives you random data in a normal distribution; he shows an example of using it, and you can see the resulting data follows a normal distribution. If you want to learn more about this, definitely check out this blog post.

[00:16:16] Next piece of content: "Functions to validate user input." This is from fluca1978.github.io. He's talking about two new functions in Postgres 16. One is pg_input_is_valid(), which helps you validate, say, timestamps, returning true or false. If something is invalid, you can get the details with pg_input_error_info(): give it the same input parameters and it tells you essentially what's wrong. These are great if you're building functions or procedures, so check this out if you want to learn more.
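For a quick taste of those two Postgres 16 validation functions (the input string and type here are just examples):

```sql
-- Valid input: returns true
SELECT pg_input_is_valid('2023-09-30', 'date');

-- Invalid input: returns false instead of raising an error
SELECT pg_input_is_valid('2023-02-30', 'date');

-- Same arguments, but returns message/detail/hint columns describing the problem
SELECT * FROM pg_input_error_info('2023-02-30', 'date');
```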
Next piece of content: "Four improvements in Postgres 16 that you might not know about yet." This is from pganalyze.com, and this is Lukas's Five Minutes of Postgres, so you can check it out to learn more about the content covered. Lukas also produced a piece last week on EXPLAIN (ANALYZE, BUFFERS) and interpreting shared buffers hit in nested loops. He covers how you can get a really high shared buffers hit count in EXPLAIN (ANALYZE, BUFFERS) when a nested loop join is used: however many times that nested loop runs, it hits the same buffers again and again. So it doesn't necessarily mean more disk access, just the same memory being read multiple times. Check out this piece of content if you want to learn more about that.

Next piece of content: "Setting per-user server variables." This is from momjian.us, and he shows three different ways you can set parameters for users. The first is setting a parameter at login with ALTER ROLE ... SET for a given role. The second is creating a SECURITY DEFINER function, where you can explicitly control what value someone can set a particular parameter to; he shows how to do that in this function here. And third, as of Postgres 15, you can GRANT SET on a particular parameter to a user, allowing them to change, for example, the log_statement parameter, as in this case. Check out this quick post if you want to learn more.

The last piece of content is "Using traditional calendars with ICU." This is from postgresql.verite.pro. He's talking about how Postgres uses the standard Gregorian calendar, but the ICU library, available through the icu_ext extension, lets you work with many more traditional calendars, such as the Buddhist, Chinese, and Coptic calendars, et cetera. If you're interested in learning how to do that, definitely check out this blog post.

[00:18:37] I hope you enjoyed this episode. Be sure to check out scalingpostgres.com, where you can find links to all the content discussed in the show, as well as get access to the audio podcast and the transcript. Thanks, and I will see you next week.
