Are You Hurting Your Performance? | Scaling Postgres 371

Episode 371 June 22, 2025 00:21:08

Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, we discuss how the way you use the database can impact its performance, the performance of pg_search, fast data loads, and some downsides of different upgrade methods.

To get the show notes as well as get notified of new episodes, visit: 
https://www.scalingpostgres.com/episodes/371-are-you-hurting-your-performance/

Want to learn more about Postgres performance? Join my FREE training called Postgres Performance Demystified here: https://www.scalingpostgres.com/courses/postgres-performance-demystified/

Episode Transcript

[00:00:00] Sometimes when it comes to performance, you could be your own worst enemy. What that means is that how you use the database can actually dictate how it performs. So if you do a ton of manual locking of things, you could impact the database's performance. [00:00:17] So the first blog post this week is actually talking about a better way to do locking for updates under most conditions. [00:00:27] Also, be sure to stay till the end of the show where, in my consulting corner, I'm going to talk about some downsides of some upgrade methods for Postgres. But I hope you, your friends, family and co-workers continue to do well. Our first piece of content is "SELECT FOR UPDATE considered harmful in PostgreSQL". This is from cybertec-postgresql.com, and SELECT FOR UPDATE has been around for a while, but this post talks about a better way to do it. Now, the reason you would want to use a SELECT FOR UPDATE is to try to avoid race conditions, meaning you want to select some data and you don't want any other session to update that data before you send your update. The way you write that code is you start a transaction, you select the data that you want to change and say FOR UPDATE. That essentially locks that row until you send your update statement and commit the transaction, so no other session can alter that row in the meantime. But he says, quote, "but attention, the above code is wrong," and I think he means wrong for performance. In other words, there's a better way to do it to avoid too much locking. Then he talks about foreign keys, particularly when you're talking about referenced rows. When you do an insert into a child table, it puts a FOR KEY SHARE lock on the referenced row in the parent, and this action is actually blocked by SELECT FOR UPDATE transactions. So he shows that example here: you start a transaction, you select from the parent table FOR UPDATE, and then when you try to insert into the child table in a second session, it actually blocks, it hangs. So what should you do? Rather than FOR UPDATE, you should actually use SELECT FOR NO KEY UPDATE. As long as you're not updating a key value, meaning something that's part of a foreign key relationship or a primary key, you can use NO KEY UPDATE, and that actually does not block these types of transactions. So FOR NO KEY UPDATE should give you better performance. [00:02:28] And this is definitely something easy to forget, particularly if you're interacting with an ORM that has magic behind it that does these things in Postgres. [00:02:39] So make sure you're using the right features of your ORM to be able to send these types of explicit row locks. [00:02:46] And he says you can do this unless you plan to delete a row or modify a key column, a primary key, a foreign key, etc. But if you want to learn more, definitely check out this blog post.
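For illustration, here is a minimal sketch of the pattern, using a hypothetical parent/child schema rather than the article's exact example:

-- hypothetical schema
CREATE TABLE parent (id bigint PRIMARY KEY, name text);
CREATE TABLE child (id bigserial PRIMARY KEY, parent_id bigint REFERENCES parent (id), payload text);

-- Session 1: lock the row you intend to update, while still allowing FOR KEY SHARE locks
BEGIN;
SELECT * FROM parent WHERE id = 1 FOR NO KEY UPDATE;
-- ... application logic ...
UPDATE parent SET name = 'new name' WHERE id = 1;  -- no key columns modified
COMMIT;

-- Session 2: this insert takes a FOR KEY SHARE lock on the referenced parent row,
-- so it proceeds immediately; it would hang if session 1 had used FOR UPDATE instead
INSERT INTO child (parent_id, payload) VALUES (1, 'some data');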
[00:02:57] Next piece of content: "Comparing native Postgres, Elasticsearch and pg_search for full text search". This is from Neon.com, and this is mostly a text post comparing the different solutions for doing full text search in Postgres. [00:03:12] First they talk about the problems with Postgres full text search, which I would definitely call a batteries-not-included experience. You have to do a lot of work to get it set up to query exactly how you want: you have to establish the tsvectors, write your queries with tsqueries, and maybe concatenate various pieces of content together and introduce ranking as part of that if you need it. But the worst part about it is that performance degrades dramatically at scale. [00:03:42] So they have an example here: a query that runs in 50 milliseconds on 10,000 rows suddenly takes multiple seconds to return a result at 10 million rows. [00:03:53] The second thing they mention is that it suffers from limited functionality when using a GIN index, compared with something equivalent to a BM25 index, which allows really fast ranking and fuzzy matching as well, because Postgres full text search is basically exact string matching. And in the same batteries-not-included vein, the development experience is a little cumbersome to work with. Then they move on to Elasticsearch, and for a lot of the negatives mentioned above it is a better experience. It uses the BM25 search engine to get really fast ranking and fuzzy matching, and the performance is more consistent as data sizes grow. But the cost of Elasticsearch is that now you need an entirely separate database to manage and everything that comes along with it. Then they talk about an alternative to these, pg_search, which they call the best of both worlds, because it actually embeds a search engine powered by Tantivy within Postgres, it creates BM25 indexes, and it gives you performance and features very similar to Elasticsearch. [00:05:02] So to create this BM25 index you just say USING bm25 and you indicate the columns you want to create that index on, and then you do a search with the triple at-sign (@@@) operator. So, for example, look where the title matches "postgres" and the content matches "search". Then they talk about the numerous features, but what's interesting is the performance. [00:05:25] They actually did some benchmarks of a 10 million row data set, and they said pg_search consistently outperformed Postgres full text search, being 20 to 1,000 times faster. For example, looking at top-N ranked results, it took 81 milliseconds in pg_search, whereas it took 38 seconds with native full text search. I know this post is from Neon.com, but pg_search is an extension that you can add to any Postgres database. So if you're interested in this, I definitely encourage you to check out this blog post to learn more.
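For illustration, a rough sketch of what that can look like with the pg_search extension. The table, the columns, and the key_field option here are assumptions based on the extension's general usage, and exact syntax can vary by version:

CREATE EXTENSION IF NOT EXISTS pg_search;

CREATE TABLE articles (id bigint PRIMARY KEY, title text, content text);

-- BM25 index over the columns you want searchable
CREATE INDEX articles_search_idx ON articles
USING bm25 (id, title, content)
WITH (key_field = 'id');

-- query with the @@@ operator, e.g. title matching 'postgres' and content matching 'search'
SELECT id, title
FROM articles
WHERE title @@@ 'postgres' AND content @@@ 'search'
LIMIT 10;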
[00:05:59] Next piece of content: "The non-effect of primary keys on bulk data load performance". This is from enterprisedb.com, and he's talking about how the normal way to do a bulk load is to drop all of the indexes and constraints from the table, load all the data into the table, and then re-add all of the constraints and indexes, because usually that's a faster way to load data. One thing you don't want to do is load data into the table leaving everything in place, because that takes forever to run. But what surprised him is that it's actually pretty efficient to still leave a primary key on, because he was looking at EDB's data migration service and it actually leaves the primary keys in place when migrating data. And what's interesting is that when I do logical replication upgrades, I do leave the primary keys on, essentially the primary key indexes, when I'm doing the initial data load, and then I add all the other indexes at the end. So it's very similar to what he's describing here. And he did a test with 300 million rows: doing it the recommended way, in other words removing all indexes and all constraints, it took 10 minutes to load the data and 20 minutes to build the primary key index, so that's 30 minutes in total. But when he left the primary key in place, it took 20 minutes to load the data and of course zero to build the primary key, so essentially 20 minutes in total, which was 10 minutes faster. And he tried all different ways to say, surely this is incorrect, but basically he couldn't show that dropping the primary key index actually made things faster. So when you're looking to bulk load data, maybe it makes sense to leave the primary key on for that table. But check this out if you want to learn more. Next piece of content: "PostgreSQL and DuckDB: the perfect analytical pairing". This is from MotherDuck.com, and they're talking about DuckDB and Postgres and putting them together. [00:07:56] There are actually three ways you can do this type of integration, and I wasn't familiar with all of these. The first way is using DuckDB's postgres extension. [00:08:07] So this is actually using DuckDB, but installing a Postgres extension in it. So this is not an extension you install in Postgres, this is an extension you install in DuckDB. [00:08:17] What that does is allow you to connect up to Postgres to get raw data to do the analysis within DuckDB. So this is how you install the postgres extension within DuckDB, then you load it, then you do this ATTACH command to attach an entire database, and you essentially give it a Postgres connection string, and then you can run your queries against it. Another option is you can do a push-down scan of an individual query, and this is how you do that, and again you include the Postgres connection string. [00:08:49] The good stuff about this is that it's very simple to do: you don't need to modify your Postgres server at all, this is just another connection to it. Another is flexibility: you can run DuckDB wherever you want, you know, on your laptop, on-prem, in the cloud. Another is isolation: the Postgres database doesn't have too much of an additional analytical load on it; it's basically just pulling some additional data from it. Next is that you get consistent reads, because it's using transactional snapshots, and easy exports. The trade-offs are that you now have a network bottleneck transferring the data to DuckDB, there's limited support in the push-down, and you have a performance ceiling because the format is not something that's native to DuckDB. So that's interesting. The other option is using the pg_duckdb extension in Postgres. So here you install pg_duckdb in Postgres, so you can see CREATE EXTENSION pg_duckdb, and now you can use the DuckDB engine to query particular tables. Again, it's still using row-store tables; column-store tables would be better for it, and maybe there's a way you can create those. But the other benefit is that now, from within Postgres, you can also read Parquet files to do analysis that way, or even Iceberg files, by installing a DuckDB extension to be able to read Iceberg data. So that's interesting too. [00:10:09] Now, one thing they say is critically important is you don't want to run this on a primary. You probably want to run this on a replica so that DuckDB doesn't take too many resources away from essentially your primary database. So they say, best practice: install and use pg_duckdb on a dedicated Postgres read replica.
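For illustration, rough sketches of the two extension-based approaches. The connection strings, table names, and column names are placeholders, and details like the duckdb.force_execution setting are from my recollection of the pg_duckdb docs, so check the current documentation for exact syntax:

-- Option 1: run inside DuckDB, using its postgres extension
INSTALL postgres;
LOAD postgres;
ATTACH 'dbname=appdb host=10.0.0.5 user=analytics' AS pg (TYPE postgres);
SELECT count(*) FROM pg.public.orders;   -- query an attached Postgres table
SELECT * FROM postgres_scan('dbname=appdb host=10.0.0.5 user=analytics', 'public', 'orders');   -- push-down scan of one table

-- Option 2: run inside Postgres (ideally a dedicated read replica), using pg_duckdb
CREATE EXTENSION pg_duckdb;
SET duckdb.force_execution = true;   -- route eligible queries through the DuckDB engine
SELECT customer_id, sum(total) FROM orders GROUP BY customer_id;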
And they showed one TPC-DS query with a 1,500x speedup in testing when using the DuckDB engine compared to using Postgres to pull essentially the same data. [00:10:38] They do talk about a MotherDuck integration as well, so that may be of interest to you. But the pros of using the pg_duckdb extension are definitely analytical performance. [00:10:48] You have data locality, because all the data is right there in Postgres. You can do hybrid queries; I'm assuming that means you can pull from Parquet files as well as data in the database. And you get columnar access, because you can access Parquet and Iceberg files right from within Postgres. In terms of the cons, there is a resource risk, which is why they say use a dedicated read replica. You do need to install the extension on Postgres, so if you're hosting your database somewhere that doesn't allow this extension, you won't be able to use it. And there's a little bit of operational complexity. Then the other option they mention is essentially doing ETL or change data capture to move the data out of Postgres, using logical decoding to move it into another destination. They talked about Supabase as a service that offers some of this; I'm assuming MotherDuck does as well. But if you're wanting to get more into DuckDB and more analytical queries, you may want to check out this piece of content. [00:11:46] Next piece of content: "PostgreSQL active-active replication: do you really need it?" This is from percona.community, and last week we talked about a blog post saying that AWS open sourced an active-active solution, which is, I mean, I'll call it a multi-master setup, to be able to replicate data between, for example, different regions using eventual consistency. And he did say that he has not found any example of anyone actively using active-active, not to say there aren't people doing it. But a key question he has is: do you really need it? And that's what this post covers: what he believes are some of the good reasons. The first one is business continuity across regions. So you have extreme high availability needs, for example five nines of availability. Having your data accessible in multiple different regions is definitely a benefit of this type of technology, but you also have to be aware of the downsides of the eventual consistency you'll have to deal with. Another reason to potentially use it is that you get write availability during regional failures. So if one country or one region's area goes down, all the other regions are unaffected by that one region being down. And he lists some different cloud infrastructure outages that have happened recently, or even name resolution and routing issues. [00:13:04] Another reason you may want to do it is that you're actually migrating from legacy architectures where distributed write semantics are already built in. So if you have something like that, this might be a use case for it. Next, he says application performance can benefit from having a local experience, because this helps avoid latency if you have a globally distributed application. So if someone is directly connecting to a server in their region, that could give them better performance, as opposed to having to jump to another continent to actually get access to their data. And lastly, maybe you need local high availability in disconnected or semi-connected environments, in other words, where connections are constantly dropping and breaking. This might be a good fit for that.
But if you want to learn more, definitely check this out. Next piece of content: "Preserve optimizer statistics during major upgrades with PostgreSQL v18". This is from cybertec-postgresql.com. They're talking about the new enhancement in Postgres 18 that actually maintains the optimizer statistics. He's saying, be aware, there are the pg_stat_* system views; those are not what's being transferred. What is being transferred is the optimizer statistics, for example things from the catalog tables like pg_class, pg_statistic and pg_statistic_ext_data. He's considering these essentially the optimizer statistics. And now you can actually export these using pg_dump; you just include the statistics option. And because pg_upgrade can use pg_dump, it essentially uses that to transfer the statistics during an upgrade. Now, one thing that is not brought over right now is the extended statistics, but they have a solution for that: after you do a pg_upgrade, you can run vacuumdb --all --analyze-only --missing-stats-only, and it will find all the stats missing in the new database and go ahead and refresh those. But if you want to learn more about this new feature, definitely check out this blog post. [00:15:09] Next piece of content: "How I dropped the production database on a Friday night". This is from Vents. [00:15:16] This isn't necessarily performance related, but it's always an interesting kind of piece to read, and I always think about how I would have done things slightly differently. So basically they set up their database with a lot of foreign key constraints and made them cascade on delete. And then a user had a problem: they couldn't update their user profile. He says, "Simple fix, I thought. I'll just delete their user record and recreate it with the correct fields." But the problem is the cascade delete then deleted, presumably, all the data that user owned, so essentially everything vanished. They mentioned they were using row level security as well. I don't know how relevant that was; maybe that's the reason for the foreign key constraints, but definitely not good. The worse thing is this happened on a free plan where they had no backups. But he thought, well, maybe if I switch to a paid plan they are actually doing the backups and the data will be there. And he was right. The downside is that he did lose 22 hours of data, but he was able to recreate that by backfilling critical records using the application logs. So presumably he got the most important data back. [00:16:26] Anyway, I found this interesting; feel free to check it out. [00:16:30] Next piece of content: there was another episode of Postgres FM last week. This one was on Mean versus P99. They're essentially talking about an article Michael wrote last week, which I covered on Scaling Postgres, where he came up with a way to calculate an approximate P99 using information in pg_stat_statements. They talked a lot about the benefits of looking at things like P99 and P95, which basically tell you the performance of your slowest executions. So if you're looking at, say, query latency and you only look at the mean, there could be a wide distribution of timings for that particular query, whereas if you look at the P99 you're essentially looking at the slowest 1%, or with P95 the slowest 5%. So you know how bad things can get.
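As an illustration of the idea, here is a rough query against pg_stat_statements that surfaces the mean alongside a crude mean-plus-two-standard-deviations upper bound; this shows the approximation concept being discussed, not an exact P99:

-- requires the pg_stat_statements extension (exec-time column names per recent versions)
SELECT
    calls,
    round(mean_exec_time::numeric, 2)                          AS mean_ms,
    round(stddev_exec_time::numeric, 2)                        AS stddev_ms,
    round((mean_exec_time + 2 * stddev_exec_time)::numeric, 2) AS approx_slow_tail_ms,  -- rough stand-in for a high percentile
    round(max_exec_time::numeric, 2)                           AS max_ms,
    left(query, 60)                                            AS query_start
FROM pg_stat_statements
ORDER BY mean_exec_time + 2 * stddev_exec_time DESC
LIMIT 10;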
[00:17:17] They talked about how pg_stat_monitor does offer P99s and things of that nature, but they were unsure of the performance cost of that, because any time you're monitoring something at a detailed level, there is a performance cost to doing that monitoring. They talked about whether they think pg_stat_statements should incorporate it or not. And they had a fair number of statistical discussions on whether you should, say, add one or two standard deviations to the mean to get an approximate gauge of how widely something is distributed. But they also talked about how they would really like histograms of performance. [00:17:50] So, a lot of talk about observability and statistics as it relates to those. If you're interested, you can listen to the episode here or watch the YouTube video down here. [00:17:59] Next piece of content: it looks like most of the talks from pgconf.dev have been posted to this YouTube channel, so if you're looking for more video content, you can definitely find it here. [00:18:10] And the last piece of content: "Speed without building: the modern PostgreSQL for the analytical and agentic era". This is from Tiger Data, which was formerly Timescale. So essentially this is the announcement that Timescale is now Tiger Data. If you're ever looking for Timescale and you find Tiger Data, it's the same thing now. [00:18:34] They give a little retrospective of their company and how it's grown, but basically this is the announcement that they've decided to change their name. So if you're interested in that, you can definitely check it out. And now for my consulting corner. [00:18:47] Again, last week I was saying how this summer seems to be the summer of upgrades, and I'm doing three different upgrades for different clients. [00:18:57] Some involve using Blue/Green deployments in AWS, others are actually migrating into AWS, and another is doing a logical replication upgrade from one EC2 instance to an upgraded database on another EC2 instance. And what I've found surprising so far are some of the limitations of the Blue/Green deployment that AWS offers. If you want to try to do a rollback, I haven't been able to find a great way to do that. For MySQL they actually posted a document about how you can do a rollback, but no such document exists for Postgres, and I haven't been able to find a way to do it. The other thing that's a restriction, or at least concerning, is that it seems to require very specific configuration for the different instances and the cluster itself, meaning max_wal_senders needs to be of a certain size, along with different parameters affecting logical replication. Now that's fine, but normally I try to work within the constraints of what exists on the system today to do the logical replication, so I don't have to do a reboot. But Blue/Green deployment basically says, okay, your configuration is wrong, we're going to do a reboot, and this is just to set up the green cluster. So even if we don't rely on Blue/Green deployment to change those settings, we would still need a reboot so they match by the time we want to set up the green cluster. [00:20:21] I mean, it is great: it's able to create a new cluster super fast, because I think it's using copy-on-write technology to create the green cluster, which is awesome, and everything's automated. But there are some downsides that directly impact availability of your currently running production cluster.
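As a small illustration, these are the kinds of logical-replication-related settings worth checking on the source instance ahead of time, since raising them requires a restart; the idea that Blue/Green checks these specific parameters is my assumption based on general logical replication requirements, not an AWS-documented list:

-- check current values; increasing any of these requires a restart
SHOW wal_level;               -- needs to be 'logical' for logical replication
SHOW max_wal_senders;
SHOW max_replication_slots;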
Anyway, we're working through the best way to deal with that. But have you had any experience with Blue/Green deployments on AWS? Please let me know in the comments. [00:20:50] I hope you enjoyed this episode. Be sure to check out scalingpostgres.com, where you can find links to all the content discussed, as well as sign up to receive weekly notifications of each episode. There you can also find an audio version of the show, as well as a full transcript. Thanks, and I'll see you next week.
