Episode Transcript
[00:00:00] If you have a lot of time series data, that can really stress the database, because you have data that's constantly flowing in, taking up more and more disk space. So the scaling challenges associated with this can get kind of interesting. Well, we're going to talk about that this week. Also, be sure to stay till the end for my consulting corner, where we talk about how to optimize queries on low cardinality fields. But I hope you, your friends, family and co-workers continue to do well.

Our first piece of content is "Time series considerations", and this is the most recent episode from Postgres FM. So they did a whole show about time series data. First they had the discussion of what is time series data, because there's a group of people that say, well, all data is time series. And yes, that's true, because a lot of times application frameworks automatically add created_at and updated_at fields to every table in your database, so for every piece of data you usually know when it was inserted or updated. But there's another group of people that say time series is a more specific thing: if you're always presenting the data over time, so temperature plotted over a period of time, or sales plotted over a period of time, maybe you would call that time series data. Another way to think about it is that for any fresh or new data you want as much detail as possible, maybe down to the second or the minute, but once that data is a little older, say a month or a year old, you never really look at the detail; you only want the aggregations 99% of the time. So basically each individual data point is more important the more recent it is and less important the older it is. And of course time series data is typically append-only. You can do updates to it, but the vast majority of the data is append-only; you never really change it once it's recorded and logged.

Now of course they addressed, as I think Nikolai said, the elephant in the room, which is quite a pun with Postgres: TimescaleDB, because it's specifically designed for working with time series data. And he said the two features of that extension that are most important to him are, number one, the compression and, number two, the continuous aggregate feature. He said it's difficult, but you could build a continuous aggregate system yourself; achieving the levels of compression that they're able to would definitely be more of a challenge. And they talked about a few things you can do to optimize or speed up your time series data. One is, with regard to your primary keys, you would definitely want to use either a bigint or a time-ordered UUID such as UUIDv7. You don't want to use a random UUID like version 4, because you'll have tons of WAL writes, so that's not a great choice for a primary key. They also mentioned you'll probably want to partition, because with time series data that older data becomes less important, so those partitions allow you to archive it, and it helps keep your hot, more recent partitions efficient for inserts. And of course you're going to want to choose a time-based partition key. Now, if that happens to be hard to work with, you could also do two levels of partitioning if you wanted to get fancy with hierarchical partitioning: first hash partitioning, say by some sort of account ID or tenant ID, and then time-based partitions within each of those.
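To make that two-level partitioning idea a bit more concrete, here's a minimal sketch of what it could look like, assuming a made-up measurements table with a tenant_id and a recorded_at timestamp (the table, column names and partition counts are just for illustration, not something from the episode):

    -- Level 1: hash partition by tenant so data is spread across a fixed
    -- number of partitions.
    CREATE TABLE measurements (
        id          bigserial,                 -- a bigint key; a UUIDv7 would also work
        tenant_id   bigint NOT NULL,
        recorded_at timestamptz NOT NULL,
        value       double precision,
        PRIMARY KEY (id, tenant_id, recorded_at)  -- must include all partition key columns
    ) PARTITION BY HASH (tenant_id);

    -- Level 2: each hash partition is itself range partitioned by time
    -- (repeat for the remaining modulus/remainder values).
    CREATE TABLE measurements_h0 PARTITION OF measurements
        FOR VALUES WITH (MODULUS 4, REMAINDER 0)
        PARTITION BY RANGE (recorded_at);

    -- One monthly time partition inside that hash partition.
    CREATE TABLE measurements_h0_2025_01 PARTITION OF measurements_h0
        FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');

Archiving a whole month then becomes a matter of detaching or dropping one partition rather than running a huge DELETE against the raw data.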
They did emphasize the importance of compression, but you can try to get something similar to those benefits if you develop your own roll-up tables to essentially roll up or bin data into certain groups. That becomes much faster to query, you have less data to deal with, and it's basically a way to pre-aggregate data as opposed to always querying the raw, detailed data. And other than your standard B-tree index type, they also talked about BRIN indexes and how those could be highly advantageous for time series data, because if you're doing a lot of aggregates over time ranges and the data is inserted roughly in order without a lot of updates happening, that BRIN index should be very efficient and also very small given the amount of data it can cover. But if you want to learn more, I definitely encourage you to check out the episode here or watch the YouTube video down here.

Next piece of content: "Indexing vectors in PostgreSQL". This is from cybertec-postgresql.com, and this is a follow-on post from the one where he loaded Wikipedia data into Postgres. Well, now he's applying indexes to be able to query against it. So he covered the two index types that pgvector offers: one is HNSW indexes and the other is IVFFlat indexes. So he has the embeddings stored here, and basically he needs to place an index on the embedding column, and it looks like he's dealing with 41 million rows. Before creating the indexes, he increased his maintenance work memory pretty significantly, to 32 gigabytes given the size of his machine, and set his max parallel maintenance workers to eight. But even with that high maintenance work memory, when he started creating his HNSW index, he received a warning that the HNSW graph no longer fits into maintenance work memory after so many tuples and that building will take significantly more time. So this build took half a day; it took over 12 hours, which is a really long time. He says you can of course monitor the index creation, and the CPU seems super busy on the system, so it really is doing a lot of work. And he noticed the single function being run over and over is basically doing an inner product, and he says this is a lot of the reason why you see GPUs being used for AI work, because you can parallelize a lot of this work more easily with a GPU. He looked at the index size that was generated, and it was 77 gigabytes, which is huge compared to the original data, which was 13 gigabytes. So once you do the embeddings and then the index, it definitely seems quite large. Then he took a look at the IVFFlat index, and it ran much faster, in only 38 minutes. The index was still pretty large at 63 gigabytes. But what he shows later is that the accuracy of the HNSW index is higher (or actually they call it recall instead of accuracy) compared to the IVFFlat index. I also think that HNSW can better handle data changes like doing an update, whereas with IVFFlat you have to rebuild it, but he didn't actually mention that in this blog post. But if you want to learn more, check out this blog post.

Next piece of content: "Mini Summit 3 transcript: APT Extension Packaging". This is from justatheory.com, and this is a presentation that Christoph Berg gave, with a video and slides, talking about how the APT packaging system is done for Postgres. So if that's of interest to you, you can check out those resources or even just look at the text transcription of it right here.
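Going back to the roll-up table and BRIN ideas from the time series discussion, here's a minimal sketch of what that can look like, assuming a hypothetical sensor_readings table (all of the names here are made up for illustration):

    -- Raw, append-only time series data.
    CREATE TABLE sensor_readings (
        sensor_id   bigint NOT NULL,
        recorded_at timestamptz NOT NULL,
        reading     double precision NOT NULL
    );

    -- A BRIN index stays tiny because it only stores min/max summaries per
    -- block range, which works well when recorded_at arrives roughly in order.
    CREATE INDEX ON sensor_readings USING brin (recorded_at);

    -- Roll-up table holding one pre-aggregated row per sensor per hour.
    CREATE TABLE sensor_readings_hourly (
        sensor_id    bigint NOT NULL,
        bucket       timestamptz NOT NULL,
        avg_reading  double precision,
        max_reading  double precision,
        sample_count bigint,
        PRIMARY KEY (sensor_id, bucket)
    );

    -- Periodically roll up the previous hour (for example from a cron job).
    INSERT INTO sensor_readings_hourly
    SELECT sensor_id,
           date_trunc('hour', recorded_at) AS bucket,
           avg(reading), max(reading), count(*)
    FROM sensor_readings
    WHERE recorded_at >= date_trunc('hour', now()) - interval '1 hour'
      AND recorded_at <  date_trunc('hour', now())
    GROUP BY 1, 2
    ON CONFLICT (sensor_id, bucket) DO UPDATE
      SET avg_reading  = EXCLUDED.avg_reading,
          max_reading  = EXCLUDED.max_reading,
          sample_count = EXCLUDED.sample_count;

Dashboards and reports can then read from the hourly table, and only touch the raw data when they genuinely need second-level detail.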
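And for the pgvector indexing post, here's roughly what those index builds look like, assuming a hypothetical wiki_articles table with an embedding column (the table name, vector dimension and distance operator class are my assumptions; the post may well have used different ones, such as an inner product operator class):

    -- pgvector extension plus a table with an embedding column.
    CREATE EXTENSION IF NOT EXISTS vector;
    CREATE TABLE wiki_articles (
        id        bigserial PRIMARY KEY,
        title     text,
        embedding vector(1536)
    );

    -- Session settings similar to what the post used for large builds;
    -- size these to your own machine.
    SET maintenance_work_mem = '32GB';
    SET max_parallel_maintenance_workers = 8;

    -- HNSW: slow to build and large on disk, but better recall, and it
    -- handles ongoing inserts and updates gracefully.
    CREATE INDEX ON wiki_articles USING hnsw (embedding vector_cosine_ops);

    -- IVFFlat: much faster to build; recall depends on the lists setting
    -- (and ivfflat.probes at query time), and the clustering is fixed at
    -- build time. Use vector_ip_ops instead for inner product distance.
    CREATE INDEX ON wiki_articles USING ivfflat (embedding vector_cosine_ops)
        WITH (lists = 1000);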
[00:07:14] Next piece of content: "Waiting for PostgreSQL 18 – non-text modes for pg_dumpall, correspondingly change pg_restore". This is from depesz.com. Historically, pg_dumpall only output plain text; it basically called pg_dump in text mode for all the databases it was dumping. But with this enhancement you can now dump in custom format, making the dump smaller, and I think you can even do the directory format as well. As a reminder, pg_dumpall is the cluster-level backup that backs up users, roles and other cluster-level information as well as every database, whereas pg_dump only does a single database. But it looks like they're adding more features to pg_dumpall to match what pg_dump already has, and as a consequence they're having to update pg_restore as well. So if you want to learn more, check out this blog post.

Next piece of content: "Waiting for PostgreSQL 18 – add modern SHA-2 based password hashes to pgcrypto". So SHA-2 basically means it looks like pgcrypto is getting SHA-256 and SHA-512, as listed here, in Postgres 18. So this is definitely a welcome enhancement if you use these within Postgres.

Next piece of content: "pg_dump gets a speed boost in PostgreSQL 18". This is from commandprompt.com, and the speed boost is based on the number of objects being dumped. They had an example here where someone was dumping 20,000 schema objects, and the performance from 17 to 18 sped up by what looks like about 30%; well, they said a 28% reduction in time. So depending on how often you dump, that could be advantageous for you.

The next piece of content is actually a presentation, "User, user, who is the user?". This is from nomgm Us, and it talks about all the different options available for Postgres user management.

And the last piece of content is "Boosting Postgres EXPLAIN". So apparently Robert Haas has added a feature that allows external modules to provide additional information to the EXPLAIN command, so he says this may allow him or other extensions to write out additional information with regard to EXPLAIN output. If you want to learn more, you can definitely check out this blog post.
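To make that pgcrypto enhancement a little more concrete, here's a minimal sketch of how the new SHA-2 based hashes could be used with crypt() and gen_salt(). The algorithm name I'm passing to gen_salt() ('sha512crypt') is my assumption about what Postgres 18 exposes, so check the pgcrypto documentation for the exact identifiers:

    CREATE EXTENSION IF NOT EXISTS pgcrypto;

    WITH account AS (
        -- Store only the hash, never the plain-text password.
        SELECT crypt('correct horse battery staple', gen_salt('sha512crypt')) AS stored_hash
    )
    SELECT stored_hash,
           -- Verification: crypt() reuses the salt embedded in the stored
           -- hash, so equality means the supplied password matches.
           stored_hash = crypt('correct horse battery staple', stored_hash) AS password_ok
    FROM account;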
And now it's time for the consulting corner, and this time I'm going to talk about what you can do if you have a lot of queries hitting low cardinality columns.

[00:09:40] So low cardinality means that a column you need to select data from doesn't have that many distinct values. So imagine a status column that has maybe three values in it, or five values. A B-tree index is not going to be very efficient at pulling out those values, because they're so common, and by their nature you're going to be pulling out a lot of rows every time it gets queried. Well, what I usually like to do in this case is consider partial indexes. So if there's a status column with three different values, then I might consider making three independent partial indexes, one for each status type. And usually that will include some other piece of information, maybe a user ID or an account ID or some sort of transaction ID along with it. So the index is on the transaction ID where the status is 1, and depending on the distribution of those statuses, it can be really efficient for some of them if they're not highly represented in the data, because that's going to make that partial index much smaller. And if you need to query that particular status, it will return very quickly. Another way this can be advantageous is if you have multiple low cardinality columns that you're trying to query on. Because if you have a query that's looking for three different values in one column, one value in another, and one value in yet another, and maybe that query is super common, well, you can create a partial index specifying exactly those conditions in the WHERE clause, so that particular query becomes super fast because it's only indexing the rows with that data. And again, you follow this up with another column you're selecting by, maybe an account ID or user ID, and that can make things a lot faster than trying to create a multi-column index over all of these low cardinality columns. Now, this can give you some great performance benefits, but one of the downsides is that it gives you less flexibility, because if suddenly you need to change how the application operates and you need to add another status or remove one, you're going to have to remember to adjust these indexes to take that into account. If you had a multi-column index, you wouldn't have to do any of that. But this is something I like to keep in my toolkit when I'm considering optimizing different client systems, and hopefully it can benefit you as well.
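Here's a minimal sketch of that partial index approach, using a hypothetical orders table (the table, columns and status values are made up for illustration):

    CREATE TABLE orders (
        id         bigserial PRIMARY KEY,
        account_id bigint NOT NULL,
        status     text NOT NULL,
        region     text NOT NULL,
        priority   int  NOT NULL
    );

    -- One small partial index per status value, each paired with the
    -- column you actually select by.
    CREATE INDEX orders_pending_by_account   ON orders (account_id) WHERE status = 'pending';
    CREATE INDEX orders_shipped_by_account   ON orders (account_id) WHERE status = 'shipped';
    CREATE INDEX orders_cancelled_by_account ON orders (account_id) WHERE status = 'cancelled';

    -- A partial index tailored to one hot query that filters on several
    -- low cardinality columns at once; only the matching rows are indexed.
    CREATE INDEX orders_hot_query
        ON orders (account_id)
        WHERE status IN ('pending', 'shipped', 'cancelled')
          AND region = 'eu'
          AND priority = 1;

    -- A query whose WHERE clause matches the index predicate, plus the
    -- indexed account_id, can then be served by that small partial index.
    SELECT *
    FROM orders
    WHERE status IN ('pending', 'shipped', 'cancelled')
      AND region = 'eu'
      AND priority = 1
      AND account_id = 42;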
[00:12:04] I hope you enjoyed this episode. Be sure to check out scalingpostgres.com where you can find links for all the content discussed, as well as sign up to receive weekly notifications of each episode. There you can also find an audio version of the show as well as a full transcript. Thanks and I'll see you next week.