Postgres Conference 2025 | Scaling Postgres 358

Episode 358 March 23, 2025 00:23:37

Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, I discuss my experience attending Postgres Conference 2025 and cover some of the highlights of the conference. Interestingly, at least from my perspective, the most well-attended talks seemed to cover pgvector, analytics involving DuckDB, and partitioning.

To get the show notes as well as get notified of new episodes, visit: 
https://www.scalingpostgres.com/episodes/358-postgres-conference-2025/

Want to learn more about Postgres performance? Join my FREE training called Postgres Performance Demystified here: https://www.scalingpostgres.com/courses/postgres-performance-demystified/

 

Episode Transcript

[00:00:00] Speaker A: So we're going to do something a little bit different this week. I have been at Postgres Conference 2025, therefore I have had no time to review the blog posts or content that I do on a weekly basis. So what I'm going to do is put that on hold and instead talk about my conference experience for the show. So basically, after every day I did a recording of what the experience was like. Hopefully you will find this interesting, but I hope you, your friends, family and co-workers continue to do well. So the first day when I walked in and started meeting people, the very first person I talked to not only lives in Florida like I do, but they also live in the same city in Florida. And not only that, they live off of the same street in terms of subdivisions. And both of us were like, are you kidding me? But apparently there's a fair number of people who work with PostgreSQL in the community in Florida. But with that, I will say the best part of attending the conference is, of course, the people and being able to meet everyone. I saw some people who watch the show, so thank you very much for that, greatly appreciate it. Oh, and with regard to that, I just realized that while I was at the conference, I hit 10,000 YouTube subscribers. So thank you to everyone who's subscribed and watches the show as well. So definitely the best part about going is the connections that I will hopefully build upon after the conference. I didn't learn a whole lot of new stuff, but the reality is that I have been doing Postgres for 15 years, so it gets harder to find things I'm interested in learning more about. But in terms of the first day sessions, there were talks by sponsors of the conference. So the first one was Profit is the Reward, Not the Goal.
So this was more of a, I guess, professional development type of talk, emphasizing how people are assets and shouldn't be called resources, because assets are something you invest in and resources are something you consume. So you definitely don't want to be a resource that's consumed if you're a person. And they also went into depth on knowing your purpose. Now, I should say with all of these presentations, I think they are posting at least the slides, and I think the slides along with the audio are going to be posted as well, so you can check that out. The next presentation was about PostgreSQL at Amazon, and it was basically the history of what Amazon has done with regard to PostgreSQL in terms of releasing RDS for Postgres, as well as all of their scale-out solutions that are available. And then Google had their presentation, again talking about their scale-out solutions with regard to AlloyDB as well as Spanner. And then the next one was Code Your Way Out of Burnout. This was part of the professional track, and a lot of working lifestyle presentations were given at the conference. In this particular presentation they had, I think, eight to ten different metrics that you rated from zero to six, and it gave you a score indicating your propensity to become burned out with your job. Thankfully I was at the lowest level, so I think I'm going to be pretty good for a while. And then we started the different tracks. I went to Unleashing Postgres Performance: Troubleshooting Techniques for Common Use Cases in RDS and Aurora PostgreSQL, and this talk was pretty interesting because they said, well, if you have a CPU bottleneck, these are the things you should do. If you have a memory bottleneck, these are the things you should do. If you have a storage bottleneck, these are the things you should do.
And then they gave some demos of, hey, there was this customer that encountered this problem and this is how we resolved it, and they created the tables and showed you right there the solution to the different problems. But one thing I am going to look more into that I haven't used yet is pg_proctab, something they mentioned that can give you some system-based statistics. So that's one nugget I got out of that presentation. The next talk I went to was Ensuring High Availability in Self-Managed PostgreSQL with Patroni. So this was an AWS presentation, but they talked about different ways of handling high availability, starting from first doing log shipping, then moving on to physical streaming replication, then maybe synchronous replicas, and then, if you want to do automated failover, bringing in Patroni. Now I've always been fearful of Patroni given the problems that can happen. And they gave a demo of using Patroni with EC2 instances. I said, well, that is some pretty slick stuff, but I still want the presentation of these are the things that can go wrong with Patroni and how to get out of them before I actually give it a try. And then I went to Beyond PostgreSQL 17: 7 DBA Workarounds for Enhanced Management, and they talked about doing zero downtime upgrades using things like logical replication, using plpgsql_check to find dependencies between objects in your procedures or your functions. They mentioned having automation with regard to Postgres, like Ansible playbooks or a Postgres tuner extension by EDB. They talked about being able to remove bloat from your tables using pg_squeeze or pg_repack. They talked about doing different workload analysis using wait events that can be found in pg_stat_activity and other monitoring tools. They talked about a job scheduler in pg_cron and also cross-environment schema cloning. And they mentioned pg_dump and pg_restore, or a tool that actually does schema cloning that EDB provides.
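The zero downtime upgrade approach mentioned here builds on Postgres's built-in publication and subscription commands. A minimal sketch, where the table names and connection string are hypothetical, not from the talk:

```sql
-- On the old (source) cluster: publish the tables to migrate.
CREATE PUBLICATION upgrade_pub FOR TABLE orders, customers;

-- On the new (target) cluster, after restoring the schema:
CREATE SUBSCRIPTION upgrade_sub
  CONNECTION 'host=old-primary dbname=app user=replicator'
  PUBLICATION upgrade_pub;

-- Watch lag until the subscriber catches up, then cut over.
SELECT subname, received_lsn, latest_end_lsn FROM pg_stat_subscription;
```

Listing the tables explicitly, rather than using FOR ALL TABLES, keeps the flexibility to add or drop tables from the publication later.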
In terms of day two, I went to the Postgres journey to distributed SQL, or DSQL, from Amazon, and they talked about their focuses: scale, consistency, resiliency, georeplication, as well as SQL compatibility. Then they mentioned some other products in the space, such as Google Spanner, Yugabyte, CockroachDB, and Azure Cosmos DB, which is really Citus under the hood, so a scale-out solution. And they talked about releasing their Aurora platform, as well as offering Aurora Serverless, then Aurora Limitless, and finally their DSQL offering, each, I think, getting further and further away from Postgres compatibility, but essentially offering a global database, as it were. Now the interesting thing about DSQL is that there's not really locking going on; basically whatever process commits first wins, so you have to deal with rollbacks when using that type of database. So it really doesn't use pessimistic locking, it uses optimistic locking. Now another downside, particularly if you're using their multi-region implementation of DSQL, is that every write has a higher latency to it. But that was a pretty interesting presentation. The next one I went to was Building Data Replication Pipelines. So this was from Datadog, and they have basically a search box on almost every field of their application, as well as faceted search, so you can narrow down by facets as well. So they wanted to build a search solution using Change Data Capture, or CDC: basically using logical decoding of the WAL files, going to Debezium and then on to Kafka to generate the search indexes that drive their platform. I did leave early because I wanted to jump into the next presentation, even though that meant arriving in the middle of it, and the one I jumped to was Logical Replication Upgrades. This is something I do for clients, helping them do upgrades of their database systems, and so far the last ones I've done in recent years all use logical replication.
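For a rough feel of the logical decoding that CDC pipelines like this sit on, here is what reading decoded changes looks like with the built-in test_decoding plugin. Note that Debezium itself consumes the pgoutput plugin, the products table is a made-up example, and the server needs wal_level = logical:

```sql
-- Create a logical replication slot using the test_decoding plugin.
SELECT pg_create_logical_replication_slot('cdc_demo', 'test_decoding');

-- Make a change, then consume the decoded WAL stream.
INSERT INTO products (name) VALUES ('widget');
SELECT data FROM pg_logical_slot_get_changes('cdc_demo', NULL, NULL);

-- Drop the slot so it does not retain WAL indefinitely.
SELECT pg_drop_replication_slot('cdc_demo');
```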
So I just wanted to check this one out to see how that was. And this talk definitely prioritized planning and testing in as close to a production environment as you can get. They mentioned trying to avoid using logical replication with all tables, because then you can't really change it or add or drop tables; instead, add each table one at a time, and that gives you the flexibility to add or drop tables, particularly if you run into conflicts with a particular table. And they went over the failover process as well. So when they got to the demonstration part, I decided to leave because there was a pgvector talk starting, and I felt I had a pretty good idea of logical replication upgrades. So I went to the pgvector one, and that was, I think, the most heavily attended talk on the second day. So apparently there are a lot of people who still want to get up to speed on vectors, AI, etc. But they talked about embeddings, what they are, and the best way to store them with regard to TOAST, ideally using PLAIN storage if they're less than two kilobytes. But they suggested not compressing them, because that may lead to poor performance when you search through them. They talked about the different index types, IVFFlat and HNSW indexes, and then went through a basic RAG app, building a semantic search solution. After that I went to a presentation on a one billion row challenge: basically taking a billion rows of data and seeing how fast you can load it in and then run an aggregate query on that data. Now with that type of query, column stores tend to excel in that area. So they looked at using Postgres as well as DuckDB, and then two extensions, pg_duckdb as well as pg_mooncake. So Postgres was definitely the slowest, and pure DuckDB was the fastest. But what's even more interesting about DuckDB is that you can get it to the point where you load all the data into memory in DuckDB and then run queries against pure memory. And there the aggregates were amazingly fast.
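A minimal pgvector sketch along the lines of the talk; the table and dimensions are hypothetical, and the STORAGE PLAIN setting reflects their advice to keep small embeddings uncompressed:

```sql
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE items (
  id        bigserial PRIMARY KEY,
  embedding vector(3)   -- tiny dimension just for illustration
);

-- Keep embeddings uncompressed and inline; per the talk,
-- compression can hurt search performance.
ALTER TABLE items ALTER COLUMN embedding SET STORAGE PLAIN;

-- HNSW index for approximate nearest-neighbor search by cosine distance.
CREATE INDEX ON items USING hnsw (embedding vector_cosine_ops);

-- Ten nearest neighbors to a query vector.
SELECT id FROM items ORDER BY embedding <=> '[0.1, 0.2, 0.3]' LIMIT 10;
```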
Of course the problem is all of your memory is now taken up by the data. The performance of pg_duckdb and pg_mooncake was pretty good, but I kind of wonder how these are going to work in real life situations, because essentially it's a backend process in Postgres, a session, that's running the DuckDB engine. So how large can that be? How much memory is it going to use? Can you do multiple queries at the same time, and how much memory will that use? Because DuckDB is basically a single process, and I'm assuming it just does one query at a time. But could you have 20 or 30 Postgres backends, each with a DuckDB process? And how much memory is that potentially going to use for really complex queries? The talk really didn't get into that, but it's just something I thought about as I was watching it. But this was also a pretty well attended talk, like the pgvector one, so clearly using analytics with some of these different tools is appealing as well. Then I went to Sensors to Insights: Postgres in IoT. So basically, if you have massive amounts of time series data that you need to load in, what are some solutions to do that? They talked about optimizing your CPU, your network and your disk for this, ideally having a separate disk subsystem or storage system for your WAL files to minimize disk contention issues. They talked about the importance of having a fill factor of 100%, because you're not really updating the data, it's just append-only, and using BRIN indexes for your timestamp fields, because all your data should be sequential, so a block range index should be very efficient. They talked about how partitions can be beneficial, particularly when it's time to get rid of the data, and partitioning also allows you to potentially have faster storage for more recent data and slower storage for older data.
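The append-only advice here boils down to a few settings. A sketch with a hypothetical sensor table (the names and values are illustrative, not from the talk):

```sql
-- fillfactor 100 packs pages completely, which is fine because
-- append-only rows are never updated in place.
CREATE TABLE sensor_readings (
  recorded_at timestamptz NOT NULL,
  sensor_id   int NOT NULL,
  reading     double precision
) WITH (fillfactor = 100);

-- BRIN index: tiny, and effective when recorded_at correlates with
-- physical row order, as it does for append-only inserts.
CREATE INDEX ON sensor_readings USING brin (recorded_at);

-- Batch writes: one multi-row INSERT (or COPY) beats row-at-a-time.
INSERT INTO sensor_readings VALUES
  (now(), 1, 20.1), (now(), 2, 19.8), (now(), 3, 21.3);
```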
He talked about OS tuning parameters and Postgres parameters, as well as the importance of column order when you're setting up your tables. So for a typical OLTP system I don't worry about column order and getting the padding perfect, because you're adding new columns all the time and trying to maintain that order is hard. But if you have time series data, and those tables collecting sensor readings aren't going to be changing that much, you can spend the time to actually order the columns correctly so the data is as compact as possible and doesn't have extra padding in it. That leads to less storage space as well as better performance when you do your queries. And of course he also mentioned trying to batch your inserts in some way: single inserts are the slowest, a multi-row insert is faster, or you can even use something like COPY. And then maybe you even want to do some sort of pre-processing of the data that captures the incoming events, batches them up, and then sends them through to Postgres to batch [00:12:51] Speaker B: the inserts more efficiently. [00:12:53] Speaker A: The next presentation I went to was Logical Replication Protocol Versions. So we actually covered the blog post that a lot of this presentation was based on: how version 1 of the logical replication protocol came out with version 10 of Postgres, while version 2 came out with version 14. And what that introduced is streaming of open transactions, so it can start sending the data before the transaction is complete. And that streaming allows logical replication to operate more efficiently. Then in Postgres 15 they did version 3, which supports prepared transactions, so things like two-phase commits. And in Postgres 16 they went to version 4 of the logical replication protocol, which allows parallel apply of large transactions. So if you have multiple large transactions, you can do a parallel apply of those. But he said a lot of the tooling still just relies on version 1.
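On the subscriber side, the newer protocol behavior is opt-in per subscription. A sketch with hypothetical names; streaming = on needs Postgres 14+ on both ends, and parallel apply needs 16+:

```sql
-- Stream large in-progress transactions instead of spooling
-- them until commit.
CREATE SUBSCRIPTION stream_sub
  CONNECTION 'host=primary dbname=app user=replicator'
  PUBLICATION app_pub
  WITH (streaming = on);

-- On Postgres 16+, apply large streamed transactions in parallel.
ALTER SUBSCRIPTION stream_sub SET (streaming = parallel);
```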
And then the last presentation I went to was about Redgate Monitor and pgNow. So this was one of the sponsor sessions. Redgate Monitor is a paid tool for cross-platform database monitoring. And then they talked about the free tool they released, which is basically a desktop application called pgNow. And you can connect it up to your database to look at the cumulative statistics for Postgres, and this also includes pg_stat_statements and pg_stat_activity. The cool thing about it is you can actually click a button to take a snapshot, wait a minute or two, and then click another snapshot, and now you can see what activity has happened between those snapshots. Because the cumulative statistics are just that, cumulative: you can't see what's happened recently, whereas by taking these snapshots you can. But that was the second day. [00:14:38] Speaker B: So on the final day I went to Mastering Postgres Partitioning: Supercharged Performance, Simplified Maintenance. Some of the reasons mentioned for why you would want to do it is that it can give a performance boost in some cases; he mentioned that ingestion in particular can be faster if you do it into partitioned tables. And I myself have found cold cache performance can be better with partitioning, but with a hot cache there really didn't seem to be much performance difference between partitioning and not partitioning. But I think the number one reason to do it is for easier maintenance if you delete data. [00:15:13] Speaker A: So if you're going to be deleting [00:15:14] Speaker B: data from a table, rather than having the overhead of deletes, which require a bunch of locks to make those deletions, and then vacuum has to clean it up, I think it's better to just go ahead and use partitioned tables. Then you can just detach and drop the partitions when you're done with them. He also mentioned the importance of making your schema changes on the parent so they flow down to the child tables.
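That detach-and-drop retention pattern is short in SQL. Assuming a hypothetical range-partitioned events table with a partition holding January 2024 data:

```sql
-- DETACH ... CONCURRENTLY (Postgres 14+) avoids blocking queries,
-- though it cannot run inside a transaction block.
ALTER TABLE events DETACH PARTITION events_2024_01 CONCURRENTLY;
DROP TABLE events_2024_01;
```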
And also, when you are choosing your partition key, you want to make sure that it is the first column in your primary key, with a second column if you need to make it more unique. So, for example, maybe it's a timestamp followed by some other unique identifier to make sure the primary key is unique. But your partition key is the first column in the primary key so it can do partition exclusion, because that's almost the whole point of partitioning: to be able to do a query and only hit one or two partitions out of all your partitions. He went over the different partition types, such as range partitioning; list partitioning, where you have a defined set of values you want to partition by, maybe a region; and hash partitioning, which I personally use more for, say, integers or bigints that you want to partition by. He talked about default partitions: basically, if nothing fits into your existing partitions, it goes into the default partition, if one exists. So you do want to make sure that no data accumulates in there and keep it pretty clean as you're using partitions. He showed an example of range partitioning, and then hash partitioning, and then even a two-tier or hierarchical partitioning where you partition by an integer and then by a timestamp. And that's great, because if you are expiring data after a period of time, you can then drop those by-date partitions. He mentioned the fact that you can detach and attach partitions, and if you're going to be populating data in them, you do want to make sure you apply a check constraint matching the partition bounds on that table before you attach it, because that makes the attachment much faster. And he mentioned that at the end of the day, the partitions are just tables, so you can directly query them, but the point is to basically query the parent so you don't have to target each partition you want to query. You can even create indexes that only exist on the child tables and not on the parent table.
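A sketch of this layout with a hypothetical events table (names and date ranges are illustrative), including the no-blocking index recipe for partitioned tables:

```sql
-- Partition key first in the primary key enables partition pruning;
-- id is added to make the key unique.
CREATE TABLE events (
  created_at timestamptz NOT NULL,
  id         bigint NOT NULL,
  payload    jsonb,
  PRIMARY KEY (created_at, id)
) PARTITION BY RANGE (created_at);

CREATE TABLE events_2025_03 PARTITION OF events
  FOR VALUES FROM ('2025-03-01') TO ('2025-04-01');

-- Catch-all for out-of-range rows; keep it empty in practice.
CREATE TABLE events_default PARTITION OF events DEFAULT;

-- Indexing without blocking writes: create an index shell on the
-- parent only, build each child index concurrently, then attach;
-- the parent index becomes valid once every child is attached.
CREATE INDEX events_payload_idx ON ONLY events USING gin (payload);
CREATE INDEX CONCURRENTLY events_2025_03_payload_idx
  ON events_2025_03 USING gin (payload);
ALTER INDEX events_payload_idx
  ATTACH PARTITION events_2025_03_payload_idx;
```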
And ideally your hot partitions, your most recent data, should be in the shared buffers, so maybe that helps you gauge how many partitions you want to create. But if you're creating many more than 1,000, particularly on earlier versions of Postgres, you may want to be cautious about doing that. And he mentioned there is no global index, so you can't do, say, a globally unique index on a partitioned table. But again, you can apply indexes on each partition, so you could have unique indexes on those. And if you're applying indexes to a partitioned table, remember how to do it concurrently: you concurrently add the index to the child tables, and then you can create the index on the parent table and it will complete quickly without having to build anything, because you can't actually do a CREATE INDEX CONCURRENTLY on the parent. Now the disadvantage of this is that once you've created an index on the parent and it's on the parent and the children, there's no way to remove that index concurrently. So the only way to remove it is to remove it from the parent, which of course locks the whole table. So hopefully Postgres will change that in the future. The next talk I went to was on exploring catalogs and extensions. So they went through all the different system catalogs that exist in Postgres, so basically the pg_* tables and views: it could be pg_database, pg_class, pg_tables or pg_indexes, and all of the different associated views where Postgres tracks information. Then they talked about the different statistics views. These are the pg_stat_* views, so pg_stat_database, or the equivalents for tables, indexes, etc. And then there are also the progress views, pg_stat_progress_*, for whatever you're tracking the progress on; that could be analyze, copy, or base backup. And then this particular presentation also went over a bunch of extensions.
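A couple of quick examples of querying these statistics and progress views (the output depends entirely on what is running in your database):

```sql
-- Cumulative per-table stats: live/dead tuples and last autovacuum.
SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 5;

-- Progress of any currently running VACUUM, one row per worker.
SELECT p.pid, c.relname, p.phase, p.heap_blks_scanned, p.heap_blks_total
FROM pg_stat_progress_vacuum p
JOIN pg_class c ON c.oid = p.relid;
```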
These may be the ones that are available on RDS Postgres, because this was a presentation by AWS, but they listed categories and the different extensions that do different jobs within them. So they talked about performance extensions: pg_stat_statements again, probably the number one most important extension; pg_hint_plan, if you want to add hints to your queries; and plprofiler, a way to profile the performance of your PL/pgSQL functions and procedures. They had an admin section where they talked about pg_partman for doing partitioning, pg_cron for doing scheduling, pg_prewarm for warming up the cache, and then pg_repack for shrinking tables online instead of using something like VACUUM FULL, which would lock the whole table. Then they talked about data sources. These would be the different foreign data wrappers, such as ones for MySQL, Oracle, and TDS, which is basically Microsoft SQL Server, as well as log_fdw. Then they talked about replication, which included things like pg_active, which allows you to have essentially active-active replication; pglogical, which does logical replication, although I tend to use the built-in logical replication; wal2json, a way to convert the WAL stream into JSON; as well as test_decoding, if you want to test out your logical decoding. They mentioned AI and machine learning extensions such as pgvector and PostgresML. Then they talked about text transformation extensions such as pg_trgm; pg_bigm, which I actually hadn't heard of before; fuzzystrmatch; and unaccent. In terms of geospatial, they talked about PostGIS and pgRouting. In terms of monitoring and logging, they mentioned pg_proctab, which I mentioned previously, as well as pgAudit if you need to do extensive auditing. And then in terms of indexing, they did mention HypoPG, which does hypothetical index creation, so it helps you judge whether you think an index would help, as well as hll (HyperLogLog) for doing fast distinct counts.
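HypoPG is worth a quick illustration; the orders table here is a hypothetical example, not from the talk:

```sql
CREATE EXTENSION IF NOT EXISTS hypopg;

-- Register a hypothetical index; nothing is actually built.
SELECT * FROM hypopg_create_index(
  'CREATE INDEX ON orders (customer_id)');

-- Plain EXPLAIN (not EXPLAIN ANALYZE) will now consider it,
-- so you can see whether the planner would use the index.
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;

-- Discard all hypothetical indexes.
SELECT hypopg_reset();
```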
And then the last thing I went to was Understanding Database Statistics. So they talked about the statistics system and its importance to the planner, which needs access to those statistics to be able to plan the queries that you're running. And those statistics exist for every column in the database. There is a default statistics target for the whole database, but they actually recommend not changing that, and only adjusting the statistics at the column level if you need to. But you would only do this if you have so much data that the statistics are not accurate and you need a higher resolution to get accurate statistics for particular columns in your tables. Then they talked about extended statistics. This is something to help get more accurate row count information to the planner, so it helps define dependencies between columns. And in their examples they weren't using any indexes at all, so everything relied on querying against a table. So adding these extended statistics never resulted in a greater execution performance gain; the gains were at the planning level. So it helped the planner choose the plan it was going to use to execute on. And that was it for Postgres Conference 2025. Again, it was great meeting everyone and making all of those connections. Although it is funny, one person said, [00:22:58] Speaker A: oh, I thought you'd be a lot taller in person. [00:23:01] Speaker B: So I guess apparently I'm the Tom Cruise of Postgres YouTube. But I will say that next week, all the blog posts that I missed this week will be included, so there'll probably be a whole lot of blog posts next week. I hope you enjoyed this episode. Be sure to check out scalingpostgres.com, where you can find the link to look at all the different sessions at the conference. While you're there, you can also sign up to receive new episodes, and you can also find an audio version of the show as well as a full transcript.
Thanks and I'll see you next week.
