TPS Benchmark, Partition-wise Join & Aggregate, Partitioning Billions, Posgres 16 Features

[00:00:00] In this episode of Scaling Postgres, we talk about postgres TPS benchmark partitionwise join and aggregate partitioning billions and postgres 16 features. I'm Kristen Jameson and this is Scaling Postgres episode 279. [00:00:24] All all right, I hope you, your friends, family and coworkers continue to do well. Well, I wanted to thank everyone who took the time to respond to my call for recommendations for scaling postgres, in that I was thinking about doing some changes to the format of the show and I was looking for feedback, and a number of people did make comments in the video as well as send thoughts and recommendations via the email list. So thank you so much for that. I really appreciate it. And the recommendations ranged from some of what I suggested make sense in terms of maybe having some interviews or maybe doing a question of the week type thing. But then there were also people that says no, don't change anything, because they really like the focused format of the show. That's basically everything you need to know in a week, in ten or 15 minutes or so. So I'm definitely going to give this some thought. I may do a survey as well, I'm not quite sure yet, but at this point I'm thinking keeping the core show as it is, except reducing it down to just the top ten blog posts, but keeping all the links, of course, for all the content I found, and maybe just incorporating a question of the week to get feedback from viewers as well. And if I do interviews, I might follow someone's recommendation to have that as a separate video. So that way I don't have to feel the burden of having to get an interview every week. But it's kind of a special event whenever I get one, and I'll post it to the channel, but I'm still thinking about that and I'll provide more information in future episodes. But again, feel free to leave a note in the comments or reply to the email if you have any other thoughts with regard to that. But the first piece of content is EDB postgres advanced server 15 TPS benchmark. This is from enterprisedb.com. Now this is their advanced server, which of course is based on open source postgres. So it's definitely not postgres, the community edition. But what I found interesting is that most of this post was actually talking about PGBouncer and how important it can be. So they were basically analyzing TPS performance with and without PG bouncer in a couple of different configurations. So they have the systems they set up here, they have the example database host and their PG bouncer host their configuration information. And there's two different pieces of information plotted on each of these graphs. So basically the blue points go together and the red and green go together. So the dark blue is TPS with PG bouncer. And in their test series you could see it maintained over 750 tps for the particular test that they're doing. Whereas when they didn't have PG bouncer, the TPS was looks like 200 or less. So it was four times faster with PG bouncer. Now you'll notice at the top here it says with connection overhead. So that means they're including the connection overhead with doing these transactions. So it's not just that a connection was established and then they just did transactions, they made the Creston of the connection part of this test. And I should also mention that they're running PG bouncer in session mode. Now, the red and the green plots are latency, the green is with PG bouncer and the red is without PGBouncer. So as you can see, once you get up to a thousand connections, the latency without PGBouncer is huge, again about four times the latency compared to PG bouncer. So this is definitely suggesting if you're having a lot of connections with the database, PG bouncer definitely eases the burden on the database if you use it. Now, this next plot does it without connection overhead. So presumably you have connections that are being maintained and it's just putting transactions through. And here the results normalize quite a bit where it may be so close PG bouncer or not, it may be in the margin of error. The only exception is you have probably three times higher TPS throughput when you're not using PG bouncer going directly to the database, and you only have 50 connections. So at 5100, 150 connections, it looks like not using PG bouncer gives you more throughput. But again, this is assuming you're not including connections with it. But if your application, for example, has its own pooler and it's going to maintain those connections, then maybe this is something more like you will see not having a connection overhead, but definitely interesting pieces of information. And then they just ran the test for a duration of ten minutes to see how things changed with connection overhead and without connection overhead. And there's a little bit of randomness to the results, but I didn't see anything else too much out of the ordinary with the previous charts, although that spike without PG bouncer when you're going directly to the database has disappeared after that ten minute point. So maybe you can't really say that that is a sure thing you're going to find. And then the next charts actually looked at. Should you have PG bouncer on the database host or on its own server? Now, without knowing the cpu utilization, I'm not sure we can gain too much information from this, because how loaded was the system when PG Bouncer was running on the database host? Clearly on its own system it's not going to run into contention with the database, but they plot those results here. So definitely running PG bouncer, even in a session based mode, allows much higher transaction throughput, seemingly due to the overhead of connecting to the database a lot. So if your application has that type of profile, you should definitely consider using a connection pooler to help out and give you better transaction throughput. Next piece of content how partition wise joins and aggregates improve query performance this is from pganalyze.com and this is a post we did cover. I think it was two weeks back on scaling postgres, and that was the episode that I went through 24 plus different blog posts, and this particular one that Lucas is covering, I blew through super quickly. But he goes into much better detail about what this post was covering, and specifically the performance improvements that they were looking at was enabling partition wise joins. And as Lucas says, this is disabled by default because when you enable it you incur relatively high impact to the planning time. So if you have a very short query that doesn't need to do a lot of calculations, it seems the postgres team has said, well, let's not introduce this as a burden, but it's an option you can turn on. And he shows how you can turn it on basically per session and then run the queries that have that benefit from it. So overall your query performance will be better, the planning time will be increased, but the execution time hopefully will be reduced. But Lucas goes into a lot more detail about the blog post and some other information with regard to partition wise joins and aggregates, and I definitely recommend you check this out. Next piece of content presenting partitioning billions of rows at SFPug, August 2023 this is from Andy atkinson.com and this is a little over an hour YouTube video where he did a presentation for the San Francisco PostgresQL Users group and talked about his experience at his company partitioning a couple of billion row table while avoiding downtime of the application. Now for this particular one, they were using PG slice, which is a utility that has some tools to be able to do partitioning without downtime. The caveat with it from his presentation is that it only really works for a pendonly table. So if your table is receiving deletes and updates, you can't really use this particular technique. This tool is not the best option to do your live migration partitioning, and I found it interesting at about the 20 minutes mark he makes the comment that so many people say that the reason they were doing partitioning was for better data management, so they were retaining a lot of data and they wanted to start only retaining, say, two years worth of data. So again, it's much easier if you partition your data into multiple tables to be able to drop the oldest partition periodically. So they were partitioning monthly and then at the end of every month after two years. So the 25th month got removed from the database by just being able to drop the table. But he did say he hoped for better performance, but that wasn't the main goal or the reason that they were doing partitioning. And what I thought was really great about this talk is that he talks about the successes but also the issues that they ran into, because a lot of times those have a much higher benefit for learning. But definitely encourage you to check this out if you're interested. [00:08:25] Next piece of content Postgres 1614 Cool new features this is from cyber hyphen postgresql.com and we'll just run through the features. The first one is everybody's favorite. You no longer need an alias for sub queries in from, so I definitely agree. I'm looking forward to that. Although it's mostly just an irritation. I usually just put an x at the end of it if I ever realize that's the issue. Two is improved performance for vacuum freezing, so looks like 16 will be able to freeze all rows in a block, so that should lead to some better performance. Three is new system viewpg stat IO and we've talked about this a lot in previous episodes of scaling postgres on better IO tracking with regard to what's going on with your database system. Four is new predefined role PG maintain so basically now you can easily add someone to this group to allow them to do vacuums or analyze or refresh materialized view or do reindexes. Five is make the create role property of a user secure. So basically it was easy for a user to become a super user with this capability, but now this has been adjusted so you can't do that. Six is first tentative support for direct I o via the debug I o direct parameter. So this is kind of on the path to direct I-O-I guess it's a little cool, but I don't think there's anything that a general user would use with this in version 16. Seven is regular expression matching for database and role names in pghBa.com. That's a great feature. Eight, enhanced support for ICU correlations, so they continue to add more capabilities for more easily using ICU coalitions, and the support just keeps growing with that. Nine, streaming replication standby servers can now be logical replication publisher servers, so this just allows you greater flexibility in building out your logical replication infrastructure. Ten, logical replication now allows cycles with disabled statements. So apparently this is on the road to achieving some type of multimaster replication, and it enables you to disable sending statements from logical replication to prevent, I guess, circular logic, or they call it an echo chamber problem. So that's interesting. Eleven, in copy from you can specify a string that makes postgres insert the column default value, so that's good. Twelve, specify integer constants in Hex, octal or binary. 13, allow the client library LibPQ to specify acceptable authentication methods. That's great for security. And then 14, use the LZ four NZ standard compression methods for PG dump. So that's awesome, that's great. But if you want to learn more about that, definitely check out this blog post. [00:11:01] Next piece of content there was another episode of Postgres FM last week. This one was on self managing. So basically this is what you need to be aware of when you're managing your own system. So you're not using rds. You basically have brought up a server, or you have a local server and you want to install postgres on it. How do you manage it? So they discuss the practicalities of doing so, as well as some of the managed service style tooling available, and they have a whole recommendation down here of different ones. So you can listen to it up here, or you can watch the YouTube video here. Next piece of content PGSQL Friday twelve what excites you about Postgres 16? This is from software and booze.com and it looks like the next PGSQL monthly blogging event is on Postgres 16. So check out this blog post and the guidance with regard to blogging for this event. Next piece of content introducing PG later asynchronous queries for postgres inspired by Snowflake. This is from Timbo IO and this is a new extension that enables you to send a query to the database and the database processes it asynchronously. It says it uses PGMQ, which is a messaging queue system as a part of it as well. So it starts a background worker, produces the results and stores it in a JSON B column so it can be retrieved. So I have an example of using it here. First you install extension, of course. Then you initialize it to create a job queue and some metadata tables. And then you in queue the job you want to run using the syntax here. And later when it's complete, you can use this command to fetch the results back. And they have a whole set of features they're looking to potentially build out for this as well. So if that's of interest to you, definitely check out this blog post. [00:12:43] Next piece of content postgres subquery power tools, ctes, materialized views, window functions, and lateral join this is from crunchydata.com. And really these are all different tools, I would say help you build subsets of data. So the first thing they cover is what is a subquery or what is a subselect. It's basically a select statement that you put within params, typically in your where statement. So here you're selecting particular skus from the products table and you want to display product orders where the SKU is in that list. Now you can also achieve this through joins as well. And sometimes, like we've seen in previous episodes, those can actually be more efficient depending on what you're doing in the subquery. Next they cover is postgres view. And a view is basically a virtual table. It doesn't really exist, but it's a definition of a query that you can run, essentially calling on that virtual table. So you can think of it kind of like doing a subquery type role. And I should mention under each of these, they have their own perspective on when you would use these particular tools. Next is a materialized view. So as opposed to being virtual, it's an actual table. So you create this materialized view. The benefit is that now you can add indexes to it and you can refresh it on a periodic basis as they show the command to do that here. Next is a common table expression. So these are like sub queries, but it looks more like you're defining a function. So with huge savings you could think of this as a function and then it's called down here to produce the results of that function. So I think for this reason a lot of programmers really like using ctes. Next one is recursive CTE. So these are really beneficial when you're looking to deal with hierarchical data. So you can recursively have a CTE call itself to produce the data that you need. Next is a window function. So basically this is a function that operates on a specific window of data, so only maybe the row above and below each row do a particular calculation based upon it. And this one here, they're just doing a summary by this queue, and the last one is a lateral join and how I like to think of it, it's a for loop that runs for each row you're pulling back. So for each row it's going to run this query against it to give you whatever calculation or whatever information you're looking for. So if you want to learn more about this, definitely check out this blog post next piece of content the state of the database landscape this is from software and booze.com, and this is another database survey, but it's not postgres specific, it's for all types of databases. So if you want to participate in this one, the link to the survey is right here, and the last piece of content is configuring PostgreSQL and LDAP using start TLS. This is from procona.com and they're talking about how you can configure your postgres server to authenticate with an LDAP service, and specifically using TLS to secure that communication between postgres and the LDAP server. So if you want to learn more about that, definitely check out this blog post that does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube rightunes thanks.

Show Notes

Episode Transcript

Other Episodes

Episode 95

max_locks_per_transaction, Logical Replication, Backup & Recovery, Pgbouncer | Scaling Postgres 95

Episode 369

Move Off Of RDS? | Scaling Postgres 369

Episode 168

Managing Autovacuum, Better JSON, Avoiding Updates, OS Tuning | Scaling Postgres 168

TPS Benchmark, Partition-wise Join & Aggregate, Partitioning Billions, Posgres 16 Features | Scaling Postgres 279