Rust for Extensions, Timescale vs. Postgres, Uninterrupted Sharding, Data Flow | Scaling Postgres 234

Episode 234 September 25, 2022 00:16:20
Rust for Extensions, Timescale vs. Postgres, Uninterrupted Sharding, Data Flow | Scaling Postgres 234
Scaling Postgres
Rust for Extensions, Timescale vs. Postgres, Uninterrupted Sharding, Data Flow | Scaling Postgres 234

Sep 25 2022 | 00:16:20

/

Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, we discuss using rust for Postgres extensions, performance comparisons of TimescaleDB vs. Postgres, uninterrupted writes when sharding in Citus and the Postgres data flow.

To get the show notes as well as get notified of new episodes, visit: 

https://www.scalingpostgres.com/episodes/234-rust-for-extensions-timescale-vs-postgres-uninterrupted-sharding-data-flow/

View Full Transcript

Episode Transcript

[00:00:00] In this episode of Scaling Postgres we talk about Rust for extensions, timeline versus postgres, uninterrupted sharding and data flow. I'm creston. Jameson. And this is scaling postgres episode 234. [00:00:24] Hope you, your friends, family and coworkers continue to do well. Our first piece of content is Postgres. ML is moving to Rust for our 20 release. This is from Postgresml.org, and I'm not that familiar with machine learning, but what I did find interesting about this post was this is an extension for Postgres, and they're moving to Rust from I believe they were using python before, and they chose this route because they did some testing for calculating how fast a dot product could be produced. And if you look here, the performance of Rust compared to Python or plpgsql or SQL was Astounding and they showed the different test scenarios that they went through to come up with this. If you're curious, Blas is basic linear algebra subroutines. But based on this performance they went ahead rewrote their extension using Rust as opposed to Python and they are using an extension that I've covered in previous episodes of Scaling Postgres. They're using the PGx extension, which is a great way to package up extensions and deploy them. And that is written in Rust as well, apparently. But look at these performance improvements from version one to version two of this extension. You can hardly see the bars in this chart for version one, whereas the performance gain in 2.0 is immense from this transition to Rust. And that follows through on a lot of these other graphs as well. The improvement is quite remarkable. So if you use this extension, it's definitely something to look at. And if you're considering extensions, maybe you would want to write it in Rust too, because maybe some of the functions that you would rely on would be better performant. But if you're interested, definitely check out this blog post. [00:02:08] Next piece of content PostgreSQL plus TimescaleDB 1000 x faster queries, 90% data compression and much more. This is from Timescale.com and part of this I would call a marketing post for TimescaleDB, but it does have performance comparisons between stock postgres and postgres with the Timescale extension. And frankly I would prefer to go this route versus some other solution like you can only get in a hosted environment like AWS or Google maybe. I like the fact that they're using Stock postgres and develop Timescale as an extension to it. And I would say the other post that we're going to look at that does it the same way is Citus. They are using stock postgres and they have an extension Citus that allows them to do essentially their scale out. But it should be emphasized that this is for time series queries, this immense performance improvement over stock postgres. And it's not for all queries. One of them got up to 1000 times faster, some of them are smaller and we'll take a look at that in a second here. Now, how they achieved this is by a number of enhancements they've added to this extension. One is handling continuous aggregates as opposed to doing something like a materialized view. Another is their immense compression that requires less data transfer up to a 90% or 95% compression ratio with certain data. Part of that may be due to the column or format that they can utilize, but if you're able to compress your data size down sufficiently so that you're writing a lot less or reading a lot less from the disk, that'll give you better performance in and of itself. Now, I should say a lot of these are optimized for time series queries, so think of append only data with a date timestamp. If your data doesn't necessarily follow this, maybe some of these features wouldn't be of benefit to you. And here's where their query latency deep dive is. And you can see one of them was 1000 fold improvement, but there were some fours, some fives, some seven performance improvements, so not everything was a thousand fold faster. One query was, but they found relative improvements across the board. Now, what I really found interesting, if you scroll down a little bit further, it talks about their ingestion rate. And Timescale had done a post on this a number of years ago using a different version of Postgres, and it showed the ingestion rate of postgres dropping off a cliff at a certain point, whereas now it seems relatively consistent. They're showing a little bit less than timescale, but that's fine. Basically, it shows that Postgres eventually gets around to improving and coming close to matching the performance of a lot of these other add on solutions like Timescale or Citus. So presumably they see what the market is doing and eventually move to it. It's just with open source software, it takes a little bit longer than a company can do on its own, but I think this is expected. So if you need this kind of raw performance today, go ahead and get a Timescale, go ahead and get a Citus or another solution. And eventually it seems that the open source version of Postgres will catch up in various different areas to it. But if you're interested in learning more about that, definitely check out this blog post. The next piece of content that I alluded to is Citus 11.1 shards your postgres tables without interruption. This is from Citusdata.com, and they have added a number of new features to Citus 11.1 that basically allow you to move the shards all around without blocking rights to the database so things can still operate normally. So you can distribute postgres tables without blocking rights. So that's basically targeting a table and say, okay, now scale it out to multiple shards. You can isolate tenants without blocking rights. So my assumption with this is that you place a tenant in its own shard. Like if you have a really, really large tenant that's taking over a particular shard. You can dedicate a shard to it, I believe. Third is increase shard count by splitting shards without blocking rights. So again, this is moving the shards all around transparently without blocking rights as the cluster is running and rebalance the cluster in the background without having to wait for it. And finally, this is a big significant part that I find appealing about Timescale and Citis is that they're extensions. It doesn't take them very long to catch up to the most recent version of Postgres. So with this release they are supporting PostgreSQL 15 beta four, and they've mentioned lower down that they are going to be doing a recent release of Cytus once postgres 15 is finally released. And as a reminder, this is all open source, all their features are all open source now. Or you can of course get their cloud hosted solution. Same thing with Timescale DB as well as far as I understand it. But if you want to learn more about either one of these solutions and their enhancements, definitely encourage you to check out these blog posts. [00:06:53] Next piece of content postgres Dataflow this is from Crunchydata.com and this was a super interesting post where he watches the flow of a statement from the application layer all the way down to essentially the data on the disk and how many layers of caching it goes through. So he mentions all the different ones, but it basically comes down to this really complex diagram that he mentions at the end here, where you have all these different layers that it goes through and at each layer pretty much you can encounter caching. So there's browser with potential caching. There's Edge or web cache or proxy servers. There's the web server itself that can cache things. You can do things at the application layer that goes through an application pooler, then a database pooler like PG bouncer and then to the individual postgres backends which consult the shared buffers in the memory. And if there's not a hit there, it could hit the OS cache, which is what Linux uses. If any memory is still available, it uses it as a disk cache. And then even once you get down to the actual hardware you can have on device disk cache and then eventually the physical storage. So this was a great post that breaks down all the different levels that statements pass through as they're being processed in postgres from a high level view. So if you're interested in that, definitely recommend checking out this blog post. [00:08:12] The next piece of content PostgreSQL sequences versus invoice Numbers this is from Cybertechup and Postgresql.com and they're talking about using sequence numbers potentially for invoice numbers. And their requirements for the invoice numbers is that they're constantly increasing and there are no gaps. Now of course, once you hear no gaps you definitely don't want to use a sequence because sequences frequently have gaps. So how do you avoid the gaps and that's what he covers here. And he goes through several different implementations. The first way he does not recommend doing it is doing an insert into the invoice, selecting the max ID from the invoice table and then add one to it that'll give you the next invoice. Now, the issue with this implementation is, as he mentions here, concurrency. So if someone does the select at the same time trying to insert the value, they could potentially insert a duplicate value which will of course fail because of the primary key constraint that's on the invoice ID. So you wouldn't want to use this route unless you wanted to rescue from that. The next scenario to avoid that type of problem is to actually lock the table. But this is a super heavy lock and access exclusive lock and that's going to hinder reads to the table. So you definitely don't want to do that. That'll definitely slow things down. But he actually has two ways that are viable solutions to this. The first solution is to use serializable transactions so that ensures that only one transaction can be completed at one time. And basically you start a transaction begin transaction with the isolation level of serializable, and then you simply do your insert into the invoice table, selecting the max invoice number from that table and you won't run into any conflict issues as long as you're doing serializable. But he did mention another alternative, and that's creating a dedicated ID table so that you can minimize locking. You essentially only have one value in this table and you just update that value. Now you could run into lock constraints potentially because there is only one row in this table, so I don't know how that would perform versus serializable, but it's another type of implementation. And what you do is use a CTE and you basically update that value, return what the value is and do that insert into the actual invoice table. So this could be a potentially viable implementation, although you may have lock issues. I guess my personal opinion, I'd probably go with a serializable if you're looking for something that's gapless. But if you want to learn more, definitely check out this blog post next piece of Content PostgreSQL 14 Internals part Three this is a third release of the free PDF book PostgreSQL 14 Internals. And this one focuses on, as they see here quote, the diverse world of locks. So if you're interested in getting this type of free content, you can download the ebook from here. And this is from Postgrespro.com, the next piece of content. Fun with postgres Functions this is from Crunchydata.com and they're covering a variety of functions that are available in postgres that you may not be familiar with. He covers an assortment of date functions so you can do date calculations within the database without having to worry about doing something at the application layer. And he goes through several different use cases. He talks about text functions, so how to manipulate text and combining first names and last names into a full name. Doing a knit cap for the first letter of a particular name. It talks about aggregate functions and different ways you can aggregate the data to return what you need as well as different math functions. So if you're interested in learning more about that, you can definitely check out this blog post next piece of content. What is lateral? What is it for and how can I use it? This is from Depeche.com and they're basically talking about using lateral joins and how I think about a lateral join. It's basically an internal for loop you can do to a separate table of the table that you are querying. So for example, he's querying an events table and this is essentially what's happening in an inner for loop. It's counting up how many rows exist for this given value in the separate table he's consulting. So if you want to learn more about that, definitely check out this blog post next piece of content. Column Lists and logical replication publications and overview of this useful PostgreSQL feature. This is from PostgreSQL fastware.com. They're talking about an enhancement coming into postgres 15 where you can essentially do filtering at the publication level for logical replications. So essentially you can send less data to these subscribers. Now, they talked about the Where feature that basically reduces the rows that are sent to a subscriber. This one focuses on the lists of columns that are sent to the subscriber and it's called a column list. So basically you can define the specific columns that you actually want to send over logical replication for a given publication. So if you want to learn more about this feature, definitely check out this blog post. [00:13:12] The next piece of content, pipeline Mode in Psychopg and this is from Blog Dalibo.com, and they're talking about the release of Psychopg Three. One that supports the pipeline mode of postgres. So it basically allows you to talk to the postgres in an asynchronous fashion. Normally when you send a query, the client has to wait until a response comes back, but with Pipeline Mode you can send multiple queries over the same connection and eventually you'll start getting the results returned to you when you're ready for that. And they've added this functionality to essentially the main postgres client for Python. They're talking about where this is particularly useful when you're doing updates, deletes or inserts, you're trying to stack up together. If you're doing a select, it may not make that much sense because you need to do something with the data that you get back. Although if you're running multiple selects to try to render some type of page, there could be an advantage with Pipeline Mode with that. But this is a great enhancement and if you want to learn more about it, definitely check out this blog post as well as five minutes of Postgres episode 36 optimizing postgres query throughput in Python using Pipeline Mode and Psychopg Three One. So Lucas covers this post in depth and goes over a lot more detail, so check his episode out as well. [00:14:29] Next piece of content using the Timescale gem with Ruby. This is from Idea Me, and if the Timescale post and all the performance gains are appealing to you and you're using Ruby or Ruby on Rails, maybe you would like to check out this post to see how you can get started using the Timescale extension for it. And he actually has done some live coding on Twitch and the episodes are here, so definitely check that out if you're interested. [00:14:57] Next piece of Content how to make Pgpool Two leader switch over seamless on AWS updating the route table. So definitely check out this blog post if you want to use PG Pool Two for this purpose. [00:15:10] Next piece of content. There was another episode of Postgres FM. This episode was on Index Maintenance where they say how do we know if or when we need it and what we need to do? So you can listen to the episode here or click here to look at the YouTube video. [00:15:27] Next piece of content. The PostgreSQL Person of the Week is Peter Smith. If you want to learn more about Peter and his contributions to Postgres, definitely check out this blog post and the last piece of content we did have another episode of The Rubber Duck Dev Show this past Wednesday evening. This one was on rails and postgres scaling with Andrew Atkinson. So if you're looking for additional scaling ideas and you're okay listening to our long form podcast, we definitely welcome you to check out our show. [00:15:56] That does it. For this episode of Scaling Postgres, you can get links to all the content mentioned in the show notes. Be sure to head over to Scalingposgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or itunes. Thanks. [00:16:15] Our channel.

Other Episodes

Episode 16

June 11, 2018 00:11:57
Episode Cover

Storing Bits, Autovacuum, Assertions, Logical Recovery | Scaling Postgres 16

In this episode of Scaling Postgres, we review articles covering efficiently storing bits, autovacuum, setting up assertions and logical replication recovery. To get the...

Listen

Episode 37

November 05, 2018 00:17:35
Episode Cover

Postgres Goodies, Materialized vs Rollup, Column Oriented | Scaling Postgres 37

In this episode of Scaling Postgres, we review articles covering Postgres goodies, materialized vs. rollup, column oriented data and parallel scans. To get the...

Listen

Episode 70

July 01, 2019 00:12:57
Episode Cover

Default Security, Max Connections, Progress Reporting, Migration | Scaling Postgres 70

In this episode of Scaling Postgres, we discuss the default security, max_connections and pooling, progress reporting and migrating to Postgres. To get the show...

Listen