Episode Transcript
[00:00:00] With a lot of Postgres features, the amount of performance benefit you get depends on how you use them. Some ways of using them can actually hurt your performance, whereas other ways can give you a performance boost. Well, we're going to be talking about some of that today.
[00:00:18] Also, be sure to stay till the end for my consulting corner, where I talk about AI giving index recommendations. But I hope you, your friends, family and co-workers continue to do well. Our first piece of content is "The hidden cost of PostgreSQL arrays". This is from boringsql.com, and he's talking about array types, where you basically store more than one value within a single column in Postgres. So you can have integer arrays, text arrays, pretty much any data type can go in an array, although he mostly talks about integer arrays here. Now, he says to be cautious about using this as a document model for your database. So be careful you're not straying too far from the relational data model that Postgres depends upon, because once you start using arrays, you may compromise some referential integrity.
[00:01:15] And the key indicator he says to watch out for: if you are storing an integer that exists as a primary key in another table, you might want to reconsider using arrays for that use case, because that should really be its own table with foreign key relationships to maintain referential integrity. Although I will admit I have violated that guidance in some instances where I thought it was warranted, because arrays are, as he lists here, super storage efficient, even more so than a JSON object. Whenever you need to create a relation table in a framework, you normally have to give it its own primary key if you're not using natural keys, and you typically also have a created-at date and an updated-at date. Now, do you need those columns? Not necessarily, you can get around them. But if all you need is a reference to an ID, an array is a much more compact way to store that information. But in general I do agree with his two statements here.
[00:02:18] Arrays are for data that shares the same life cycle as the parent row, not for relationships spanning different tables. Now, first he mentions some gotchas, one being that arrays don't have to start at 1.
[00:02:31] Now, by default, Postgres arrays do start at 1, but you can create one with arbitrary bounds. So his recommendation, if you're using them, is to always use array_lower and array_upper to get the bounds.
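Here's a minimal sketch of what that looks like, just as an illustration:

-- Postgres array literals default to 1-based indexing, but you can declare
-- arbitrary bounds with the '[lower:upper]={...}' literal syntax.
SELECT ('[0:2]={10,20,30}'::int[])[0];                 -- returns 10

-- array_lower / array_upper report the actual bounds of dimension 1.
SELECT array_lower('[0:2]={10,20,30}'::int[], 1),      -- 0
       array_upper('[0:2]={10,20,30}'::int[], 1);      -- 2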
[00:02:45] He also says that when you're creating a table with an array column, you'd normally expect pretty strict typing, but that's definitely not the case with a multi-dimensional array. For example, with this 2D matrix he's created here, you'd expect the structure to be enforced, but it actually isn't. You can insert a regular one-dimensional array and it inserts just fine. If you want that kind of enforcement, you're going to have to use a check constraint, and he shows an example of creating one here. He does note that uniformity is enforced at every nesting level, so sub-arrays of different sizes are rejected, and he shows an example of that as well. He then talks about accessing arrays and how some care must be taken. For example, one representation of an array is an accessor, whereas another is a constructor, and because array access has forgiving behavior, it can be difficult to actually find bugs. But what he says is probably more confusing for developers is that Postgres treats multi-dimensional arrays as a single matrix and not an array of arrays. And to avoid unexpected results when you're coming at this type of array with a programming-language mindset, you actually have to unnest, slice, and then re-aggregate the results, so a bit of a pain. He says you could also try converting it to JSONB, but that has its own issues to deal with. Frankly, he says, if you need multi-dimensional arrays, you may want to consider just storing them in JSONB, because in his assessment it will do exactly what you expect.
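For a flavor of that kind of enforcement, here's a hedged sketch of a check constraint (not the article's exact example; table and column names are my own assumptions):

-- The [][] in the declaration is not enforced by Postgres itself, so a
-- check constraint pins the shape down to a 2x2 matrix.
CREATE TABLE grids (
    id     bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    matrix int[][] CHECK (
        array_ndims(matrix) = 2
        AND array_length(matrix, 1) = 2
        AND array_length(matrix, 2) = 2
    )
);

INSERT INTO grids (matrix) VALUES ('{{1,2},{3,4}}');   -- passes
INSERT INTO grids (matrix) VALUES ('{1,2,3}');         -- rejected by the check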
[00:04:21] Now, he then discusses indexes and how a B-tree index is insufficient for arrays, because there are multiple values in the column. What you really want to use is a GIN index. And then, in terms of actually querying the data so that you use the index, you want to make sure to use the right operators.
[00:04:40] So arrays and GIN indexes have specific operators that need to be used. There's the containment operator, the at sign followed by a greater-than sign (@>), and there's also the overlap operator, which is the double ampersand (&&).
[00:04:54] Containment asks whether the array contains all of the specified values, while overlap asks whether it shares any of them. Next he talks about the trade-off of GIN indexes, which is that they are slow to update. By default, a GIN index uses a fast-update mechanism, where new entries are added to a pending list and then merged into the main index later, during vacuum operations.
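Here's a minimal sketch of the index and the two operators (table, column and index names are assumed):

CREATE TABLE posts (
    id      bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    tag_ids int[] NOT NULL DEFAULT '{}'
);

CREATE INDEX posts_tag_ids_gin ON posts USING gin (tag_ids);

-- Containment: rows whose tag_ids include ALL of the listed values.
SELECT * FROM posts WHERE tag_ids @> ARRAY[3, 7];

-- Overlap: rows whose tag_ids include ANY of the listed values.
SELECT * FROM posts WHERE tag_ids && ARRAY[3, 7];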
[00:05:21] So it's definitely important to keep any table that has a GIN index on it well vacuumed. But there is some customization you can do with GIN indexes: you can turn this fast update off, which means updates to the index will be slower, because it no longer uses the pending list and instead makes changes to the main index in real time.
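As a sketch, continuing with the assumed index name from above, that's just a storage parameter:

-- Disable the pending-list behavior so entries go straight into the main
-- GIN index structure (slower writes, no merge work left for vacuum).
ALTER INDEX posts_tag_ids_gin SET (fastupdate = off);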
[00:05:43] Now, why would you want to do that? You would do this if you have a read-heavy workload. If you want the fastest read performance and you're okay with slower updates, you can turn fastupdate off. But if you want good update performance and you're okay with selects being a little slower, you can leave the default, where the fast-update path is enabled. And I would say the biggest thing to watch out for is what he mentions here under storage and modification, which is that, quote, if you need to modify a single element of an array, Postgres must copy and rewrite the entire row.
[00:06:17] So arrays really aren't great if you're going to keep appending values; that's going to be relatively slow and cause bloat, and it gets even worse if the column has to be TOASTed because it has grown so large. Now, he mentions you can switch the compression from the default pglz to LZ4 compression, and that helps. But really, the best practice with arrays is to make as few modifications as you can. If you have 10,000 integers and you're just going to insert them into a row once and never modify them, that's great. But be cautious about updating an array too many times, because you're going to run into performance issues.
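A hedged sketch of those two points (table and column names are assumed):

-- Switch the TOAST compression for a large array column to LZ4 (PostgreSQL 14+).
ALTER TABLE readings
    ALTER COLUMN samples SET COMPRESSION lz4;

-- Appending one value still copies and rewrites the whole array value
-- (and any TOASTed data behind it), so avoid doing this in a hot update path.
UPDATE readings
SET samples = array_append(samples, 42.0::float8)
WHERE id = 1;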
[00:07:06] Now, he acknowledges that he's basically been listing everything that's wrong with arrays, but they do have benefits. As was alluded to, they are very efficient for storing data. And he says the fastest way to insert 5,000 rows isn't a loop in your application, and it's definitely not a single insert with a multi-row VALUES list; it's actually passing the data in as arrays and unnesting it: an array of sensor IDs, an array of float values, an array of timestamps if you're tracking measurements, for example. He also mentions there's an intarray extension that makes working with integer arrays a lot easier, with great functions like sort and uniq, whereas it would take multiple steps to do the same thing without them. So it makes sorting and finding unique values in arrays much easier. The extension also has a specialized query syntax that simplifies complex Boolean logic, and he shows an example of that here. But this is a great blog post, and if you want to learn more, I definitely encourage you to check it out.
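If I'm reading that right, it's the unnest-based bulk insert pattern; here's a minimal sketch (table, columns and values are my own assumptions, with literal arrays standing in for bind parameters sent from the application):

CREATE TABLE measurements (
    sensor_id   int         NOT NULL,
    value       float8      NOT NULL,
    recorded_at timestamptz NOT NULL
);

-- One array per column; unnest() expands them into rows server-side.
INSERT INTO measurements (sensor_id, value, recorded_at)
SELECT *
FROM unnest(
    ARRAY[1, 1, 2]::int[],
    ARRAY[20.1, 20.4, 18.9]::float8[],
    ARRAY['2024-01-01 00:00', '2024-01-01 00:05', '2024-01-01 00:10']::timestamptz[]
);

-- And with the intarray extension, sorting and de-duplicating is one call.
CREATE EXTENSION IF NOT EXISTS intarray;
SELECT uniq(sort(ARRAY[3, 1, 2, 3, 1]));   -- {1,2,3}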
[00:08:10] Next piece of content: "Dealing with integer overflow and sequence-generated primary keys". This is from cybertec-postgresql.com. Now, the general recommendation for sequence-generated primary keys is to use bigints, because even in the most extreme cases you don't have to worry about running out of them. But if you have been using plain integers, you're at real risk of running out of them nowadays, because it's not just about having that much data: if you are inserting and deleting rows, you burn through sequence values as well. Now, you might think it's easy to just change the column with an ALTER TABLE, giving it the table name and altering the column type to bigint. You can definitely do that, and it does change the column from an int to a bigint. The problem is, as he says here, quote, PostgreSQL locks the table in ACCESS EXCLUSIVE mode, which blocks all concurrent access to the table. So reads, writes, everything stops.
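The naive approach (table and column names are assumed) is a single statement, but it takes that ACCESS EXCLUSIVE lock and rewrites the table, blocking all concurrent reads and writes while it runs:

ALTER TABLE orders ALTER COLUMN id TYPE bigint;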
[00:09:05] So that's usually not workable on a production system, but there is a way to do it in a multi-step process without causing too much exclusive locking.
[00:09:16] In the next section he talks about monitoring for integer overflow, and he has an example query you can use to check for it. Although, if you have identified a primary key that's getting close to the limit, you're going to want to check foreign keys as well, because you're probably going to have to convert those to bigints too. He then gets to the better way to transition from an int to a bigint with minimal downtime. The first step is to add a new bigint column and a trigger, and that trigger basically updates the new bigint column with the value of the primary key whenever a row is inserted or updated.
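A hedged sketch of that first step (table, column, trigger and function names are all assumed):

ALTER TABLE orders ADD COLUMN new_id bigint;

CREATE FUNCTION orders_sync_new_id() RETURNS trigger AS $$
BEGIN
    NEW.new_id := NEW.id;   -- mirror the integer primary key into the bigint column
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER orders_sync_new_id
    BEFORE INSERT OR UPDATE ON orders
    FOR EACH ROW
    EXECUTE FUNCTION orders_sync_new_id();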
[00:09:58] The next step is to backfill the new column for existing rows, ideally in batches, with vacuums in between. Now, I've done this using shell scripts, using a programming language, or even as a procedure in PostgreSQL.
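As a sketch of the procedure approach (names and batch size are assumed; vacuuming between batches would be driven from outside, for example by the shell-script variant, or left to autovacuum):

CREATE PROCEDURE backfill_new_id(batch_size int DEFAULT 10000)
LANGUAGE plpgsql
AS $$
DECLARE
    rows_updated bigint;
BEGIN
    LOOP
        UPDATE orders
        SET new_id = id
        WHERE id IN (
            SELECT id FROM orders
            WHERE new_id IS NULL
            LIMIT batch_size
            FOR UPDATE SKIP LOCKED
        );
        GET DIAGNOSTICS rows_updated = ROW_COUNT;
        EXIT WHEN rows_updated = 0;
        COMMIT;   -- commit each batch so locks are released and vacuum can keep up
    END LOOP;
END;
$$;

CALL backfill_new_id();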
[00:10:30] The next step, he says, is to create a not-null constraint and a unique index on the new column. Although, to avoid the not-null dance, frankly, I would probably think about adding the column as NOT NULL with a default of zero in the first place. That way I don't have to do the series of steps to get to not-null, because you can always drop the default of zero later, but you definitely need the unique index. And then, in one transaction, you finalize the new primary key.
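Roughly, that final transaction looks something like this (all object names are my own assumptions; it presumes the NOT NULL is already in place and leaves sequence ownership, defaults, and any foreign keys out for brevity):

BEGIN;
SET LOCAL lock_timeout = '5s';

ALTER TABLE orders DROP COLUMN id;                  -- the old integer primary key
DROP TRIGGER orders_sync_new_id ON orders;
DROP FUNCTION orders_sync_new_id();
ALTER TABLE orders RENAME COLUMN new_id TO id;
ALTER TABLE orders ADD CONSTRAINT orders_pkey
    PRIMARY KEY USING INDEX orders_new_id_idx;      -- the unique index built earlier
COMMIT;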
[00:10:44] So you set a lock_timeout, you drop the current primary key column, you drop the trigger and the function, you rename the new column to what the old column was called, you add the primary key using that unique index, and you commit. So if you want to learn more about that, definitely check out this blog post.
[00:11:06] Next piece of content: "Idle session triggers a transaction wraparound". This is from rendercse.medium.com. So what would cause a transaction wraparound from an idle session? The reason is that it involved a temporary table, and temporary tables reside in backend-local memory rather than shared buffers, so autovacuum cannot access or process them, which means the transaction IDs the temporary table was holding onto cannot be cleaned up and reused. So definitely something to watch out for, and check this out if you want to learn more. One thing you could also do to help mitigate this is set an idle session timeout, so that any idle sessions are closed after that timeout.
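A quick sketch of that setting (idle_session_timeout is available in PostgreSQL 14 and later):

-- Close any session that sits idle outside a transaction for too long.
ALTER SYSTEM SET idle_session_timeout = '30min';
SELECT pg_reload_conf();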
[00:11:51] Next piece of content: "Database transactions". This is from planetscale.com, and it's a relatively basic topic, database transactions, but I thought the graphics explaining how transactions work were super interesting. They talk about row versioning (MVCC) in Postgres and how that works, as well as the different isolation levels that exist. They do cover both MySQL and Postgres features, so just be aware of that, but I thought this was a super good resource, so check it out if you're interested. Next piece of content: "PostgreSQL as a graph database: who grabbed a beer together". This is from cleoba.com, and he's talking about graph databases, which basically represent things in terms of nodes and links.
[00:12:42] So the nodes are the objects or entities, if you want to call them that, and the links, or relationships, are the edges of the graph. These are great for representing relationships and exploring the connections between data. He's looking at a beer social network, and he shows how you would model it with an entity-relationship diagram in Postgres, a relational database, versus how you would model it in a graph database. So you would have different nodes, a user, a check-in, a beer, a brewery, a venue, and then different relationships between those nodes. And he shows an example of the graph model here, with the nodes and the relationships between everything.
[00:13:24] Now, he's actually building this using the Apache AGE extension, which lets you create a graph and nodes within Postgres and then define each of the relationships.
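A hedged sketch of getting started with Apache AGE (the graph name, labels and properties here are my own assumptions, not the article's):

CREATE EXTENSION IF NOT EXISTS age;
LOAD 'age';
SET search_path = ag_catalog, "$user", public;

-- Create a graph, then add two nodes and a relationship between them.
SELECT create_graph('beer_social');

SELECT *
FROM cypher('beer_social', $$
    CREATE (u:User {name: 'Alice'})-[:CHECKED_IN]->(b:Beer {name: 'Pale Ale'})
$$) AS (result agtype);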
[00:13:35] Now, what's interesting about a graph representation of data is, of course, the relationships. He looks at finding paths, and he compares SQL with Cypher, the graph query syntax, for pulling out the data. The SQL approach is a recursive CTE that looks like this, very long and complicated, whereas the Cypher approach is this.
[00:13:58] So the Cypher version is probably a tenth of the number of lines of code to extract the same information.
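For a flavor of what a Cypher query looks like through Apache AGE (this is not the article's query; the labels and relationship names are assumed), here's a sketch that finds pairs of users who checked in at the same venue, filtering on names so each pair is only returned once:

SELECT *
FROM cypher('beer_social', $$
    MATCH (a:User)-[:CHECKED_IN_AT]->(v:Venue)<-[:CHECKED_IN_AT]-(b:User)
    WHERE a.name < b.name
    RETURN a.name, b.name, v.name
$$) AS (user_a agtype, user_b agtype, venue agtype);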
[00:14:05] Now, with this Apache AGE extension (AGE stands for "A Graph Extension"), all of the data gets stored internally as Postgres tables, so you can still query those tables with regular SQL if you want, and I believe it can take advantage of indexes as well. He does compare Apache AGE to pgRouting, and they have somewhat different use cases: pgRouting is great for shortest-path calculations, things like vehicle routing and the like, whereas Apache AGE is great for finding patterns in relationships. But if you want to learn more, definitely check out this blog post. And the last piece of content: "Why AI workloads are fueling a move back to Postgres". This is from thenewstack.io, and it covers some of the things that were mentioned in the Postgres FM episode on the State of Postgres 2025, and how AI has apparently driven a lot of people back to Postgres, and also back to running their own systems as opposed to using a hosted provider. They use the phrase BYOC, bring your own cloud, or your own compute, because a lot of these AI workloads in particular require a lot of IOPS and a lot of bursty performance, which can sometimes be hard to get in standard cloud infrastructure. They also talk about the importance of cloning and branching for running multiple experiments with AI. So if you want to learn more about this, definitely check out this article.
[00:15:33] And now it's time for my consulting corner. I want to talk very briefly about using AI to get index recommendations, because I've noticed, as different people have come to me and said, hey, I think we need to create these indexes, that in some of those cases they've used AI to come up with the recommendations. And I've noticed the AI frequently recommends very wide indexes, meaning many columns in the index, say an index covering four or five columns, which I hardly ever do, because that's definitely an update burden for the index whenever you're inserting or updating data, and it's just a larger index too. And I think it may be a case of people not providing enough context.
[00:16:16] Because whenever I am assessing which columns to include in an index, I'm also looking at the cardinality of each column. If you're including a status column in an index, is that actually beneficial? If it only contains three distinct values, it's not going to reduce the data significantly, and it's usually better to have fewer columns in the index and let Postgres do a final filter on the heap before returning the data. What I do is look at the table, what indexes exist now, and what queries we want to run against it. Just by knowing the cardinality and which columns are in the WHERE clause, I can pretty much determine, even before running tests against the database, which indexes the planner will try to use and which indexes would be better. So if you are going to ask AI for index recommendations, you definitely need to provide as much context as possible to get the best answers out of the LLMs.
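A hedged sketch of that cardinality check (table, column and index names are assumed):

-- n_distinct comes from the planner statistics; negative values mean a
-- fraction of the row count rather than an absolute count.
SELECT attname, n_distinct
FROM pg_stats
WHERE schemaname = 'public'
  AND tablename  = 'orders';

-- A three-value status column usually isn't worth a slot in the index;
-- a narrower index plus a final filter on the heap is often the better trade-off.
CREATE INDEX CONCURRENTLY orders_customer_created_idx
    ON orders (customer_id, created_at);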
[00:17:22] So, are you using AI or LLMs for index recommendations? Let me know in the comments below how that's been working for you. I hope you enjoyed this episode. Be sure to check out scalingpostgres.com, where you can find links to all the content mentioned, as well as sign up to receive weekly notifications of each episode. There you can also find an audio version of the show, as well as a full transcript. Thanks, and I'll see you next week.