Index Overhead? | Scaling Postgres 399

Scaling Postgres

Episode 399 | January 11, 2026 | 00:19:18

Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, we discuss the overhead of indexes, a Postgres year in review, an intro to row locking and table renaming issues.

To get the show notes as well as get notified of new episodes, visit: 
https://www.scalingpostgres.com/episodes/399-index-overhead/

Want to learn more about Postgres performance? Join my FREE training called Postgres Performance Demystified here: https://www.scalingpostgres.com/courses/postgres-performance-demystified/

Episode Transcript

[00:00:00] The best practice for loading a lot of data into a Postgres table is to basically remove its indexes, or make sure they don't exist, load the data, and then apply your indexes, because it can take a really long time to update all of those indexes while the data is loading. [00:00:19] So generally it's faster to load data without indexes and then apply them at the end. [00:00:25] And even when your database is at a steady state, it makes intuitive sense that the more indexes you have, the slower inserts and updates will be. Well, the first blog post this week actually takes a look at measuring some of that. But I hope you, your friends, family and coworkers continue to do well.

Our first piece of content is "What is index overhead on writes?" This is from depesz.com, and he was taking some measurements of what we were just talking about. He created a table, added 20 bigint columns to it, and loaded a million rows with random values for those bigints. Then he ran a battery of 22 tests: one with no indexes, one with just a unique index on the ID, one with a unique index on the ID and a normal one on c1, then the same thing adding an index on c2, and so on. Now I should note this is using COPY, so the million rows were written to a file and then that file was loaded using COPY. But here are the results: for each of the tests he shows the table size, how long it took to load, and the load speed, which I believe is just the total table size divided by the time to load. As you can see, the fastest is of course no indexes. With the unique index on the ID it's a little slower, then it drops by almost half when you add another index, probably because of the random I/O, and the performance keeps dropping from there, although once you get above seven or eight indexes, each additional one doesn't slow things down that much more. So he said speed definitely decreases, and the decrease is not linear, it's worse than linear, although I think it's a hard drop-off at the beginning and then each additional index hurts incrementally less. He also tested just one very wide index on 10 columns, and the load speed was 50,000 kilobits per second, so that was significantly faster than 10 separate indexes. [00:02:38] So apparently a multi-column index covering more columns is more efficient than many individual indexes, and maybe that's because of the random I/O of building those separate indexes compared with building just one, but that's just my theory; he didn't actually mention that. Making some of the indexes partial definitely sped things up, because there's less data to update. Now, he did say this only applies to B-tree indexes, and for other index types like GIN or GiST you're going to see different performance. And he says you can test this out yourself by creating a table, copying data in, and getting the performance of it, then recreating it, adding the index of interest, copying the data in again, and seeing how long it takes (a rough sketch of that kind of test follows below). I don't know if we can draw definitive conclusions about inserts and updates and things of that nature from this, but in terms of loading data into a Postgres table I think it definitely is relevant, so feel free to check it out if you want to learn more.
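As a rough illustration of that do-it-yourself test, here is a minimal sketch in SQL. The table and column names are my own, not taken from the post, and it uses INSERT ... SELECT with generate_series rather than COPY from a file just to keep it self-contained:

-- Scaled-down sketch of an index-overhead test (illustrative names, not from the post).
CREATE TABLE load_test (
    id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    c1 bigint,
    c2 bigint
);

-- First pass: load a million rows with no secondary indexes and note the
-- timing (turn on \timing in psql).
INSERT INTO load_test (c1, c2)
SELECT (random() * 1e9)::bigint, (random() * 1e9)::bigint
FROM generate_series(1, 1000000);

-- Second pass: empty the table, add the indexes of interest, and load the
-- same amount of data again to compare how much slower it is.
TRUNCATE load_test;
CREATE INDEX ON load_test (c1);
CREATE INDEX ON load_test (c2);
INSERT INTO load_test (c1, c2)
SELECT (random() * 1e9)::bigint, (random() * 1e9)::bigint
FROM generate_series(1, 1000000);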
Next piece of content: there was another episode of Postgres FM last week. This one was on "Postgres year in review 2025," and they covered all things Postgres in the year 2025. They did talk about Postgres 18 and the different features it added, of course, like UUID version 7, the ability to better handle lock manager contention, and the new configuration options for async I/O. Nikolay mentioned that async I/O actually did result in faster backups, so I wonder if that is something people are experiencing: are your backups faster on Postgres 18? They mentioned all the sharding projects that are being worked on, including PgDog, Multigres, and Neki, which is a project from PlanetScale that is again similar to Vitess, like Multigres is. [00:04:29] They did talk about the hacking incident that happened at the US Treasury. They mentioned how Oracle cut jobs at MySQL, and it seems, at least in 2025, a lot of resources have been pulled toward the Postgres side as opposed to MySQL. PlanetScale is a perfect example of this, where it seems they have gone all in on Postgres with Metal, and now they're building a Postgres version of Vitess. [00:04:53] And you see so many companies coming up with Postgres-compatible versions of their own databases. So they talked about Amazon Aurora and their DSQL, and Microsoft Azure coming out with HorizonDB. I think they also mentioned DocumentDB, which is MongoDB-compatible using a Postgres backend. [00:05:14] Google has AlloyDB as well. [00:05:17] The other thing they mentioned is that with all the AI and LLM advancements that are happening, there's a greater need for these to communicate with databases, and therefore database cloning, or thin cloning, has really come to the forefront. A lot of platforms like Neon and others, and I believe postgres.ai, can do some of that as well. They talked a little bit about pgvector and how this year people were hitting limits on it: you can't really have a billion vectors in an index, only millions, so what does that look like going forward? There have been some advancements on the analytics front with enhancements to DuckDB and other utilities that can read, say, Parquet files. And of course there were two major acquisitions: Snowflake acquiring Crunchy Data and Databricks acquiring Neon. But a lot of topics were covered, so if you want to learn more, you can listen to the episode here or watch the YouTube video down here.

Next piece of content: "Databases in 2025: A Year in Review." This is from cs.cmu.edu, and this is Andy Pavlo. Apparently he does one of these yearly, and it's not just Postgres-related, but of course the first topic he mentions is that the dominance of PostgreSQL continues. A lot of the things he mentions cover some of the things that were mentioned in Postgres FM's episode: he talks about the release of 18 and some of its features, and about the acquisitions that were mentioned. He also talks about major vendors doing distributed PostgreSQL, from the perspective of the large tech giants having customized solutions like HorizonDB, AlloyDB, and Amazon's DSQL, as well as some of the smaller players doing sharding such as PgDog, Multigres, and Neki. He talks a little bit about the current commercial landscape with all the different vendors working on Postgres. [00:07:16] And he talks about MCP for every database: so many vendors are coming out with MCP connections to their servers for LLMs to be able to access them. [00:07:28] And we'll look at a blog post in a second that kind of summarizes some of my thoughts with regard to it.
But he also mentioned some other databases in his post, so if you're interested, definitely check this one out.

Next piece of content: "The state of Postgres MCP servers in 2025." This is from dbhub.ai, and when they say a picture is worth a thousand words, this is definitely it. [00:07:50] It's a contrast of two pictures. You have PostgreSQL here on the left, with massive numbers of Postgres users, while the number of people actually building Postgres is rather small. [00:08:03] Now contrast that with what's happening with Model Context Protocol, or MCP, servers: you have very few users of MCP at this point, but a massive amount of companies and energy behind building MCP servers. And I'm actually someone who hasn't walked through the MCP door for Postgres yet, because I haven't seen a compelling use case. With my use of LLMs, I'm using them to do coding, especially to learn languages I don't necessarily know but need to do some work in. But for my interaction with the database, I already have frameworks in place that I use. [00:08:45] So I don't see any speed advantages to using an MCP server yet. I'm kind of waiting for something that's going to make a big difference to me, but I haven't necessarily seen it yet. But if you use an MCP server for Postgres and you're getting advantages from doing it, please let me know in the comments below. This post talks a little bit about the current state of MCP servers, so you can check it out if you want to learn more.

Next piece of content: "PostgreSQL: an introduction to row locking." This is from dbi-services.com, and it is exactly what it says: it's introducing you to row locking. He's basically creating a table, doing an insert, and then querying pg_locks to see what kinds of locks are taken based upon what you do. So when you just query pg_locks, you actually take locks yourself: you can see a relation lock in ACCESS SHARE mode, as well as a virtual transaction ID lock, and these are basically coming from your own backend. [00:09:44] Now this is a really good introduction because he starts introducing multiple sessions: he does some work in one session in an open transaction, then you examine it from a second session, and he even introduces a third session at one point. [00:09:58] So you can see how the different rows in pg_locks get created, and he explains what Postgres is doing in each of these steps. It's going to be very difficult to explain this in a video or audio format, but I do encourage you to look at this blog post if you're looking for an introduction to how row locking actually works; a minimal two-session sketch follows below, but definitely check this one out.
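Just to give a flavor of that kind of experiment, here is a minimal sketch, assuming a simple throwaway table (the names here are mine, not the post's):

-- Session 1: create some data, open a transaction, and update a row without committing.
CREATE TABLE lock_demo (id int PRIMARY KEY, val text);
INSERT INTO lock_demo VALUES (1, 'a'), (2, 'b');

BEGIN;
UPDATE lock_demo SET val = 'a2' WHERE id = 1;
-- leave this transaction open

-- Session 2: look at the locks held by other backends.
SELECT locktype, relation::regclass, virtualxid, transactionid, mode, granted, pid
FROM pg_locks
WHERE pid <> pg_backend_pid()
ORDER BY pid, locktype;

-- Session 2: now try to update the same row; this will block on the row lock
-- until session 1 commits or rolls back.
UPDATE lock_demo SET val = 'a3' WHERE id = 1;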
Next piece of content: "PostgreSQL table rename and views: an OID story." This is from databaserookies.wordpress.com, and he said they had to populate a very large table with a new UUID column. It needed to be NOT NULL and have a default, and of course that means you need to backfill it for all the existing rows. [00:10:41] Now, they couldn't necessarily just do an ALTER TABLE to add the column, set the default to uuidv7(), and make it NOT NULL. [00:10:49] Why? Because uuidv7() is a volatile function, so Postgres can't use the fast default path and has to rewrite the whole table to fill in the column. So basically they opted for a different approach where you create an entirely new table, optionally unlogged, copy the data from the old table to the new table, and then swap the table names. Now, if this is an active table, this is really hard to do too, because you're going to be missing data. You really need to set up triggers and do it in a different way, but apparently an active table wasn't too much of a concern here. So what they did is basically create a table LIKE the existing table, add the new UUID column to the new table with NOT NULL and a default of uuidv7(), insert all the rows from the existing table into the new one, and then at the end, in a transaction, swap the table names. The problem is, when they did it, they got an error. [00:11:47] Why? Because there was a view referencing the existing table, and this happens because views work against the OID of the table, not the actual table name. And it's not just views; it can affect all these other things mentioned here, like grants and privileges, row-level security policies, triggers, foreign keys, logical replication publications, statistics objects, and functions that reference the table. [00:12:12] So basically the best way to handle it, in my opinion, and option one here, is to just recreate the views. But another scenario they mention is that you can avoid table swapping if you do batch updates. And generally, if you have an active table, you're normally going to want to put triggers on it that keep the current data updated, then backfill the old data, and at some point swap the tables over. But remember those dependencies. If you want to learn more, check out this blog post.

Next piece of content: "The hidden cost of invalid indexes in Postgres: yes, even on Supabase, RDS and Neon." This is from postgres.ai, and they're saying that when a CREATE INDEX CONCURRENTLY or REINDEX INDEX CONCURRENTLY fails, Postgres leaves behind an invalid index. [00:12:55] The problem is that these can be harmful to your database health. Now, you can check for invalid indexes by looking in pg_index where indisvalid is false, so they have a query here to check for it, and they say these can cause numerous issues. [00:13:13] Now, I do have a problem with how they are doing their test in this example, because they set up the table, inserted some data into it, and created the index, but then to make an invalid index they simply changed indisvalid to false while leaving indisready as true. The problem is this is a very specific state for an index: I believe it means the index is still tracking changes, but it's not being used for queries. So of course the whole system is keeping this index up to date; it's being vacuumed, it's going to appear in WAL, and all the different things they mention here. But a lot of times, in my experience, invalid indexes come about when they go to get created and then fail for some reason, so often they are not in a ready state and they're not being kept up to date. And I actually did a test like this where I set a statement timeout of 10 milliseconds and then tried to create an index concurrently on a million-row table. [00:14:19] It pretty much immediately fails because it hits the statement timeout, but then whenever I was inserting or updating data, that index block count wasn't changing like they show here. [00:14:31] What they did in this test is install the pageinspect extension and check how many leaf items existed in the index: it was 100. [00:14:41] And then once they added a new value and ran the count again, it was 101.
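For reference, the detection query and the kind of pageinspect check they describe would look roughly like the following; the index name is a placeholder of mine, and the exact queries in the post may differ:

-- Find invalid indexes and whether they are still marked ready for inserts.
SELECT n.nspname            AS schema_name,
       c.relname            AS index_name,
       i.indrelid::regclass AS table_name,
       i.indisready
FROM pg_index i
JOIN pg_class c ON c.oid = i.indexrelid
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE NOT i.indisvalid;

-- Count items on a B-tree leaf page (block 0 is the metapage; for a tiny
-- index, block 1 is both the root and the only leaf page).
CREATE EXTENSION IF NOT EXISTS pageinspect;
SELECT count(*) FROM bt_page_items('my_invalid_index', 1);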
Well, when I did this test, my index was in such a state that it didn't have any leaf items, and whenever I inserted data the leaf items didn't change. Now, if I did a \d to describe the table in psql, I could see the index there and it said it was invalid, but it was in such a state that it wasn't tracking any changes at all. So it definitely depends on what state your invalid index is in, but if you have one that is presumably in this ready state, it's actively tracking changes. It needs to be maintained on every write, exactly as they say here, it needs to be processed by vacuum, it does generate WAL, and it will block HOT updates from happening. So you may not have a valid index on the column, but now you can't do heap-only tuple updates. It can also pollute your statistics and cause planner overhead and lock contention. Now, I don't know, even the index that I created in an invalid state may still have planner overhead; I didn't check that specific thing. And of course, if you have an invalid index that you created concurrently, you can't just re-run the create because it won't work; you actually have to drop that index, ideally concurrently, and then create it again concurrently. But if you want to learn more about this, definitely check out this blog post.

Next piece of content: "Dissecting PostgreSQL data corruption." This is from credativ.de, and they're talking about the great benefit in Postgres 18 where checksums are enabled by default. But if you do happen to run into a situation where you have a corrupt database and you realize it's corrupt, maybe from doing pg_dumps or from specific queries failing, this blog post talks you through how you could potentially, maybe not resolve the corruption, but bypass it to at least get the system into a functional state, sometimes using the extension pg_surgery. So if you are brave enough to do something like that, you can definitely check out this blog post.

Next piece of content: "Waiting for PostgreSQL 19: implement ALTER TABLE MERGE/SPLIT PARTITIONS commands." This is from depesz.com. This was introduced back in Postgres 17, but it never made it into the final release, so being able to merge partitions and split partitions is back. [00:17:08] But he did say this is a quite naive implementation, which works in a single process and holds an ACCESS EXCLUSIVE lock on the parent table during all operations, including the tuple routing. So this is quite heavy on the lock side and definitely can't be recommended for large partitioned tables, particularly under load, but it at least gets the feature out there, and then maybe others can add enhancements to reduce the level of locking. [00:17:39] So again, it offers merge and split capability. If you want to learn more about this, you can check out this blog post.
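For a sense of what this looks like, here is a rough sketch based on the syntax from the earlier Postgres 17 iteration of this patch; since this is still in development for Postgres 19, the exact commands could change before release, and the table and partition names are my own:

-- A range-partitioned table with two partitions (illustrative names).
CREATE TABLE events (id bigint, created_at date) PARTITION BY RANGE (created_at);
CREATE TABLE events_2025_h1 PARTITION OF events
    FOR VALUES FROM ('2025-01-01') TO ('2025-07-01');
CREATE TABLE events_2025_h2 PARTITION OF events
    FOR VALUES FROM ('2025-07-01') TO ('2026-01-01');

-- Merge the two partitions into one (takes ACCESS EXCLUSIVE on the parent).
ALTER TABLE events MERGE PARTITIONS (events_2025_h1, events_2025_h2) INTO events_2025;

-- Split it back into two partitions.
ALTER TABLE events SPLIT PARTITION events_2025 INTO
    (PARTITION events_2025_h1 FOR VALUES FROM ('2025-01-01') TO ('2025-07-01'),
     PARTITION events_2025_h2 FOR VALUES FROM ('2025-07-01') TO ('2026-01-01'));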
Next piece of content: "PGACM is here!" This is from hdombrovskaya.wordpress.com, and this is an extension she created. PGACM stands for Postgres access control, and its intent is to be a framework that lets you create schemas with predefined roles and privileges that are independent of one another. So basically she wanted a way to achieve complete isolation between different projects using the same database, and to allow authorized users to manage access to their data without having superuser privileges. So if you have a need for something like that, maybe you can check out this utility.

And the last piece of content: "Quick and dirty loading of CSV files." This is from depesz.com, and this is a tool he calls pgcsvloader. It very simply accepts a CSV file and immediately creates a table based on that CSV file: it automatically figures out the delimiter, the quote character, column names, and data types, and it will load the data into the default database. So check this out if you're interested in a utility like that.

I hope you enjoyed this episode. Be sure to check out scalingpostgres.com, where you can find links to all the content mentioned, as well as sign up to receive weekly notifications of each episode. There you can also find an audio version of the show, as well as a full transcript. Thanks, and I'll see you next week.

Other Episodes

Episode 6

April 02, 2018 00:10:42

Terabyte Scale, Permissions, Fast Column Adds, pgBouncer | Scaling Postgres 6

In this episode of Scaling Postgres, we review articles covering terabyte scale & analytics, database permissions, fast column adding coming to PostgreSQL 11 and...

Episode 84

October 07, 2019 00:12:37

Postgres 12 Released, Generated Columns, Parallel Estimates, Change Data Capture | Scaling Postgres 84

In this episode of Scaling Postgres, we discuss the release features of Postgres 12, generated columns, parallel estimates and change data capture. To get...

Episode 116

June 01, 2020 00:14:01

Observer Effect, Partition Management, Tuple Freezing, Hung Transactions | Scaling Postgres 116

In this episode of Scaling Postgres, we discuss the observer effect with explain analyze, partition management, tuple freezing and sources of hung transactions. To...
