Is the Future Immutable? | Scaling Postgres 356

Episode 356 March 09, 2025 00:15:21

Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, we discuss immutability and Postgres, the upcoming Postgres Conference in Orlando, extended statistics and usage of default partitions.

To get the show notes as well as get notified of new episodes, visit: 
https://www.scalingpostgres.com/episodes/356-is-the-future-immutable/

Want to learn more about Postgres performance? Join my FREE training called Postgres Performance Demystified here: https://www.scalingpostgres.com/courses/postgres-performance-demystified/


Episode Transcript

[00:00:00] To me, Kubernetes has always seemed to be cutting-edge technology. I mean, I know now with application containers people have been using Kubernetes for quite a while, but I still recall David Heinemeier Hansson, who created Ruby on Rails, struggling with Kubernetes for some of their applications, and they decided to ditch it because of its complexity. For that reason, I haven't delved too much into it at this point, because one thing that typically comes with the cutting edge is that you're at the bleeding edge, where you can actually get hurt. And I've heard stories of people struggling to get Postgres working reliably on Kubernetes. This was even before operators, and sometimes even after operators. But the operator that seems to be gaining the most traction is CloudNativePG, and they've definitely fully embraced operating Kubernetes the way it's designed. Now, there are still some rough edges around it that they continue to iron out, but we're going to talk about that a little bit in this first blog post. And be sure to stay till the end of the show, where I have my consulting corner and I talk about default partitions. But I hope you, your friends, family and coworkers continue to do well.

Our first piece of content is "The immutable future of PostgreSQL extensions in Kubernetes with CloudNativePG". This is from gabrielebartolini.it. Basically, one thing about Kubernetes is that your containers are immutable, so they want that to carry through all aspects, and they want CloudNativePG to adhere to that. Now, a downside of this is that one of the greatest strengths of Postgres, as they say, is that it is extensible. So if you're going to be adding an extension, this introduces mutations, essentially changes, so it makes it not immutable. But this post makes an announcement that they have gotten features into Kubernetes and Postgres to make immutable extensions a reality.
And basically that means being able to add an extension to Postgres via images without having to mutate the base Postgres image. Because right now, if you want certain extensions in your Postgres container, you have to build them in when you create that container; they can't be added after the fact, because essentially that would be a mutable change. [00:02:27] So the first thing they're adding in Postgres 18 is an extension_control_path to define additional directories and control files for extensions. And when you combine that with the dynamic_library_path, you can have all the code for extensions located outside of the container image for Postgres, which is great. So that's an enhancement in Postgres. But to Kubernetes they added an image volume feature that is expected to reach beta soon. So this is another indication that they're still ironing out all the rough edges of using Postgres on Kubernetes. But this feature allows you to mount a container image as a read-only and immutable volume. So then, as long as Postgres extensions are packaged as OCI-compliant container images, you should be able to add any extensions you want to your Postgres container. And they added features into CloudNativePG that allow you to, under the spec's postgresql definition here, add a list of extensions. So here they are adding the pgvector extension with a particular image. So then presumably, if you have OCI containers for your extensions, you can just iterate the three or four or five different extensions you want to use in your Postgres installation, and it works. And again, they emphasize here, quote, this feature is not available yet out of the box. So basically this is something coming down the pike. So I haven't jumped all-in to Postgres on Kubernetes yet. I'm still waiting and watching all the improvements that are being made to the ecosystem and the operators, and as they get better and better, then I'll definitely feel more confident about jumping into it.
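To give a feel for the declarative style described above, here is a rough sketch of what a cluster definition with an extensions list could look like. This is an approximation only: the feature is not yet generally available, and the image reference here is a placeholder, not a real published artifact.

```yaml
# Sketch of a CloudNativePG Cluster with a declared extension.
# Field names approximate the announcement; the image reference is a placeholder.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: cluster-example
spec:
  instances: 1
  postgresql:
    extensions:
      # Each extension ships as its own OCI image, mounted as an
      # immutable, read-only volume rather than baked into the base image
      - name: vector
        image:
          reference: ghcr.io/example/pgvector:pg18   # placeholder reference
  storage:
    size: 1Gi
```

The key idea is that the base Postgres image stays untouched; each extension image is mounted read-only, and extension_control_path plus dynamic_library_path let Postgres find it.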
But if you want to learn more, check out this blog post.

[00:04:19] The next piece of content is actually not a blog post, but in some other information I read I saw that the Postgres Conference 2025 is coming up, and I said, wait, it's in Orlando. Now, I live outside of Orlando, Florida, and personally I've never attended a Postgres conference. Not because I didn't want to, it's just that I'm not a big fan of travel, given the cost and the time associated with it. But given that it is literally, you know, quote unquote, in my backyard, I can't not attend. So I have registered and I will be attending the 2025 Postgres Conference. If anyone who watches the show sees me and wants to say hi, feel free to go ahead and do that. So I guess I'll see you there.

[00:05:03] Next piece of content: there was another episode of Postgres FM last week. This one was on extended statistics. So by default, Postgres defines statistics on each column of each table in the database. It does not collect statistics that relate two columns to each other. So there can be correlations between data where having these statistics known can help improve query plans. But then of course the question comes: well, when do you need it? And basically, if you look at an explain plan and the row estimates are wildly off when you're running a particular query, maybe with a GROUP BY in it, you can add extended statistics that may help the planner choose a different plan to more efficiently run that query. But what was very interesting is that Nikolay said he has never used this feature in his consulting work. And frankly, that's the exact same thing for me. I have never used extended statistics to resolve a query performance problem that I was faced with. There was always a different way to resolve the issue that was more performant. I have on occasion tried to use them, and it did improve the row estimates, but ultimately it didn't change the plan. I couldn't find a better plan with that more accurate information.
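For reference, creating extended statistics uses standard CREATE STATISTICS syntax. A minimal sketch, using a hypothetical table where two columns are strongly correlated:

```sql
-- Hypothetical table where city and zip_code are strongly correlated
CREATE TABLE addresses (
    city     text,
    zip_code text
);

-- Tell the planner to track the functional dependency between the columns
CREATE STATISTICS addresses_city_zip (dependencies)
    ON city, zip_code FROM addresses;

-- Extended statistics are only populated on analyze
ANALYZE addresses;

-- Compare the planner's row estimate to the actual rows returned
EXPLAIN ANALYZE
SELECT * FROM addresses
WHERE city = 'Orlando' AND zip_code = '32801';
```

Besides dependencies, you can also request ndistinct and mcv statistics kinds, depending on the correlation you are trying to capture.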
So then it's not much use to have that overhead with it. But I believe it was Michael who made this interesting insight: with OLTP applications, which is a lot of what I help optimize, you try to get super fast, efficient, short queries, usually with specifically dedicated indexes, and those are probably more effective for giving you the best performance, as opposed to creating extended statistics. But maybe that is more advantageous if you have an OLAP environment, where you have many different types of queries running, maybe where you don't have as many indexes, so you're okay with the query running a little bit longer because it's a report or whatever. But in lieu of indexes, maybe you could add some statistics to define relationships between columns. Maybe these extended statistics help the planner choose better plans in general. Because I think Michael did also mention that he saw a blog post where someone said, hey, we used extended statistics and it really helped this query's performance. But he looked at it with his tool pgMustard, and he said, oh, but there's also this other way you could have gotten similar performance if you had added an index. But I did think it was an interesting insight how Nikolay never really used extended statistics. But maybe that's because with OLTP maybe there's not as much of an advantage, whereas with OLAP maybe there's a bit more. But if you want to learn more, you can listen to the episode here or watch the YouTube video down here.

[00:07:51] Next piece of content: "Waiting for PostgreSQL 18: Virtual generated columns". This is from depesz.com, and we did announce this last week or the week before that virtual generated columns look like they are going to be getting into Postgres 18. However, I did think something interesting he said was down here, where he said the limitation of this is that you can't have an index on a virtual column, because most of my generated columns right now are tsvector fields.
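To illustrate the current state of things, here is a sketch of the stored-generated-column approach for text search that the post contrasts with virtual columns (table and column names are my own illustration):

```sql
-- Today: a STORED generated column materializes the tsvector on disk
CREATE TABLE posts (
    id     bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    body   text,
    search tsvector GENERATED ALWAYS AS (to_tsvector('english', body)) STORED
);

-- The index goes on the stored column
CREATE INDEX posts_search_idx ON posts USING gin (search);

-- Queries stay simple because the expression lives in one place
SELECT id FROM posts
WHERE search @@ to_tsquery('english', 'postgres');
```

A VIRTUAL generated column would avoid storing the tsvector data, but as the post notes, indexes on virtual columns are not supported, which is exactly the limitation being discussed.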
And that's the example he's using here, because he says right here, you know, it would be absolutely awesome to use a virtual column for text search and be able to add an index to it, because right now you have to use a stored generated column and then add an index to it, and you have to store that additional data. Now, yes, you can use an expression index, but that means you have to have a very complex definition for the index and then a very complex WHERE clause every time you want to query it, because the query and the index must match. Whereas if we could make it a virtual column, all the complexity would be in there just once, and you could add an index to it. Then your queries and your index should be relatively easy to write. So that would be a great feature if they could add it in. But if you want to learn more, you can check out this blog post.

[00:09:15] Next piece of content: "Waiting for PostgreSQL 18: Allow EXPLAIN to indicate fractional rows". Now, when I read this, I was like, fractional rows? What are you talking about? You can't have fractional rows. However, what they're talking about is the output, meaning the explain plan output. When you have multiple loops going on, so you can see loops=4 here, this rows count actually gets rounded up or down, so it's not really accurate at this point. But by allowing fractional rows, they can say that out of four loops, 0.25 rows were returned for each loop, because essentially one row was returned across those four loops. So I'm not sure if it's still confusing that you do have to multiply loops by the rows. But it is better than it is currently, where it tells you it returned zero rows when it actually returned one row. So this is definitely better. I don't know if there's still a point of confusion there, though, but check it out if you want to learn more.

Next piece of content: "Validating data types from semi-structured data loads in Postgres with pg_input_is_valid".
This is from crunchydata.com, and I actually haven't used this before. pg_input_is_valid is a function where you give it your data and say what data type it should be, and it returns true or false. Like, is this an integer? Yes it is, even if it's in a text format. And this is available in Postgres versions 16 and 17. It even sees whether certain data can conform to JSONB or not. But I do want to highlight something mentioned here in terms of loading data into Postgres, and this is the way I always do it: always load everything in as text, and then use a separate process to move it into the destination tables, converting the data types as necessary, because you never know what state the incoming data is going to be in. Even with exports from other databases, where presumably the data types are firm, I still always do it this way, because there are always edge cases that pop up when trying to load data from other sources. But check this blog post out if you want to learn more.

Next piece of content: "Postgres Query Plan Visualization Tools". This is from pgmustard.com, and they themselves make a plan visualization tool. But this blog post lists some of the others and some pros and cons of them, so you can check it out if you're interested.

[00:11:36] Next piece of content: "PostgreSQL unlogged tables and backups". This is from dbi-services.com, and they had a client who said, hey, we made a dump of a database and restored it to the primary. The problem is some of the tables are missing on the replica of this primary. So they looked more into it, and the problem was that they were unlogged tables. So when you create an unlogged table, you get greater write performance, but nothing is logged to the WAL. But if you do a dump of that primary database, those tables still get exported. So the unlogged tables were in the backup and were restored to the primary, but that data was not carried over to the replica, because it is not WAL-logged, and you need WAL logging in order to pass data to a replica.
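A quick sketch of the behavior described above, with a catalog query you can use to spot unlogged tables before they surprise you (table name is illustrative):

```sql
-- Unlogged table: faster writes, but no WAL means the data
-- never reaches physical replicas (and is truncated on crash recovery)
CREATE UNLOGGED TABLE staging_events (
    id      bigint,
    payload jsonb
);

-- Find any unlogged tables in the database
-- (relpersistence = 'u' marks unlogged relations)
SELECT relname
FROM pg_class
WHERE relpersistence = 'u' AND relkind = 'r';
```

Note that pg_dump still exports unlogged table data, which is why a restore to the primary looks complete even though the replicas never see those rows.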
So just something to keep in mind.

Next piece of content: "PG 17: new random functions". This is from postgresonline.com. The random function has been enhanced so that you can easily give it two parameters, a minimum and a maximum, and it will give you a random value between them, and you can even do it with numeric types as well. So I guess that's another benefit of being on Postgres 17. And the last piece of content: there was another show on the Postgres Cafe. This is a podcast, and this show was about contributing to open source. So if you're interested in that, you can definitely listen to this podcast. And now it's time for the consulting corner. So I'm going to be talking about default partitions.

[00:13:03] So personally, I don't really like using default partitions. I would much rather have an insert or an update fail and trigger an exception, as opposed to the row silently being placed in the default partition. Because the default partition's purpose is to hold data that doesn't fit into any range of the other partitions. So basically it gets unexpected data. So maybe they should call it the unexpected partition. But a fair number of the clients that I've worked with haven't been checking their default partitions, and sometimes there's data in there and they're surprised by this, because again, it's kind of a fail-silently scenario: well, how did that data get in there? What is it? Now, of course, the advantage of default partitions is that you don't get exceptions or errors shown to users if data gets inserted in there. But the important point is that you should try to minimize the amount of data in there. So number one, you should be scanning all of your default partitions to make sure that they're empty, and if they do have data in them, that should be bubbled up in your reporting somehow, and you should address it, because it's probably unexpected.
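A minimal sketch of the scenario: a range-partitioned table with a default partition, and the kind of emptiness check worth bubbling up in monitoring (table names are illustrative):

```sql
-- A range-partitioned table with a default partition
CREATE TABLE events (
    id         bigint,
    created_at timestamptz
) PARTITION BY RANGE (created_at);

CREATE TABLE events_2025_03 PARTITION OF events
    FOR VALUES FROM ('2025-03-01') TO ('2025-04-01');

CREATE TABLE events_default PARTITION OF events DEFAULT;

-- A row outside every defined range lands silently in the default partition
INSERT INTO events VALUES (1, '2030-01-01');

-- Monitoring check: this should normally return zero
SELECT count(*) FROM events_default;
```

Without events_default, that insert would raise an error instead, which is the fail-loudly behavior I generally prefer.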
The other reason to monitor it is that it can actually give poorer performance, because if you look at the partition range defined by the default partition, it's basically the inverse of all your other partitions. So if you have, heaven forbid, a thousand partitions, which I wouldn't recommend, basically the definition of the default partition is anything that doesn't fit into those thousand partitions. So sometimes the planning performance for default partitions can be really poor. The other thing to watch out for is that if you're doing, say, date-time partitions, and some of your methods for partition exclusion use greater-than or less-than in a WHERE clause to try and exclude partitions, that default partition is always going to show up. But do you like using default partitions or not? Let me know in the comments.

[00:15:02] I hope you enjoyed this episode. Be sure to check out scalingpostgres.com, where you can find links for all the content discussed, as well as sign up to receive weekly notifications of each episode. There you can also find an audio version of the show, as well as a full transcript. Thanks, and I'll see you next week.
