Native Column Store? | Scaling Postgres 378

Episode 378 | August 10, 2025 | 00:19:29

Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, we discuss Postgres getting a native column store via an index, faster btree_gist indexes, scaling listen/notify, and a logical replication slot deep dive.

To get the show notes as well as get notified of new episodes, visit: 
https://www.scalingpostgres.com/episodes/378-native-column-store/

Want to learn more about Postgres performance? Join my FREE training called Postgres Performance Demystified here: https://www.scalingpostgres.com/courses/postgres-performance-demystified/

Episode Transcript

[00:00:00] I haven't played too much with column storage. [00:00:04] These are storage formats that help you do analytics really quickly because the data is oriented in a column format as opposed to a row format. They are the formats typically used in things like Parquet files, and DuckDB is basically a column-oriented database system. Historically, if you wanted to use column storage in Postgres, you had to reach for an extension; Citus, years ago, created their cstore extension, I think it was. [00:00:33] Or you can use Timescale, which has column storage built in, along with ways of loading data into and out of it. But what if you could get column storage by simply creating an index? [00:00:45] We're going to talk about that in an article this week, but I hope you, your friends, family and coworkers continue to do well.

Our first piece of content is "A Journey Toward the Columnar Data Store." This is from postgresql.fastware.com, and it's a post about a presentation of the same name given at PGConf.dev, I believe. It talks about hybrid use cases, meaning you have an online transaction processing system that's processing transactions, but you also want to do some analytics with it. It would be great if you could have column storage of that analytic data in the same system, so you wouldn't have to do CDC processes or some sort of extract, transform and load into another database system. This is essentially the kind of hybrid system Timescale has been doing for so long, meaning they offer a row store that can be converted into a column store. The benefits of doing it all in one system, namely Postgres, are that ideally you don't slow down any updates that are happening, you can analyze the latest data, and you get the retrieval performance of a columnar store. The way they're going to achieve this is something they're calling a vertical clustered index (VCI), to make real-time data analysis possible, or at least more performant.

So the way they've set this up is with two storage areas. The first area is called write-optimized storage, and this is updated in sync with the updates or inserts happening to the main row store table; essentially it just stores the TID reference. Then a separate asynchronous process, a background worker, converts these essentially row-storage items into a column store format in the read-optimized storage. As a consequence of this process, if this index needs to be read, it can read the read-optimized storage very quickly, but it also has to merge in whatever exists in the write-optimized storage before it can give the answer. So that's one thing to take into account when you're using this index. They did some performance tests using TPC-H measurements, so this is an analytical workload, and comparing parallel scans to a VCI scan, the vertical clustered index scan, they were seeing performance around 4.4 times faster. [00:03:14] So there you can see the performance improvements of a column store. Then they also examined whether the vertical clustered index impacts transactional throughput, and they showed a chart here indicating it was relatively negligible. There is some performance impact, the "with VCI" results are the chart on the bottom here, so there was a slight reduction in TPS, but it wasn't significant. So doing these updates in real time wasn't an onerous burden on the database system.
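To make that a little more concrete, here is a rough sketch of what creating such a column store index might look like. The extension name, index method name and syntax below are assumptions for illustration only; the article doesn't show the actual interface:

  -- Hypothetical extension providing the vertical clustered index (name assumed)
  CREATE EXTENSION vci;

  -- The normal heap table keeps serving OLTP traffic
  CREATE TABLE orders (
      order_id    bigint PRIMARY KEY,
      customer_id bigint,
      amount      numeric,
      created_at  timestamptz
  );

  -- Hypothetical column store index over the columns used for analytics;
  -- writes land in write-optimized storage synchronously, and a background
  -- worker moves them into the read-optimized columnar format asynchronously
  CREATE INDEX orders_vci ON orders USING vci (customer_id, amount, created_at);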
[00:03:46] So it's great that it doesn't really impact update performance, even though you're keeping this index updated in real time, while it gives you much faster query performance. [00:03:55] They actually have the slides embedded in this blog post as well, and I found this query performance slide where they were showing VCI parallel 24. [00:04:06] So apparently you can process these VCI indexes for queries with a parallelism of up to 24 in this example, and it resulted in a 46 times faster query for one of the examples and 11 times faster in another. [00:04:21] Sometimes it didn't make that much of a difference, but still, if there are particular queries that can leverage the column store, you can definitely see some great results. [00:04:30] So I think it would be awesome if Postgres had this feature literally built in, where you could create essentially a column store index and get great analytical performance for certain queries. But if you want to learn more about that, definitely check out this blog post.

[00:04:46] Next piece of content: "btree_gist improvements in PostgreSQL 18." This is from cybertec-postgresql.com. btree_gist is a combination of B-tree and GiST, so it lets you index both a scalar and, usually, something like a range type, for example, or maybe geographical points. But he says it's mostly used for nearest neighbor search or exclusion constraints. One disadvantage of this particular index type is that it was not built to implement sort support. [00:05:18] What that basically means is there are two ways to build GiST indexes. One is buffered, which handles each value one at a time, making the index slower to build and not as efficiently packed; or you can build it sorted, which builds faster and stores the index more optimally. So they investigated and implemented sort support for these btree_gist indexes, and they have a patch here that's scheduled to go into Postgres 18. [00:05:47] First they took a look at some build benchmarks. They're looking at a UUID here with a range of integers, and built the index for 1 million, 5 million and 10 million records, and you can see the difference between PG 17 and PG 18. [00:06:04] It looks like about 4 to 5 times faster build times with the improvements to btree_gist, which is awesome. Here they're looking at an exclusion constraint example and again, maybe on average four to five times faster, which is great. But then they also wanted to check out query performance, so they ran a set of queries and looked at the transactions per second, and it was about threefold higher. So not just the build times are faster, but the queries are faster now, just from this index improvement. And maybe some of the reason for this improved performance is better shared buffer utilization. I don't know if this is due to size as well, but basically there were 10 times the number of shared buffer hits with PG 17 compared to PG 18. [00:06:53] So definitely, if you're using btree_gist, something to look forward to in Postgres 18.
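As a quick refresher on where btree_gist shows up, here's a generic example, not taken from the article, of the kind of exclusion constraint that relies on it, combining scalar equality with a range overlap:

  -- btree_gist lets GiST handle the scalar equality part of the constraint
  CREATE EXTENSION IF NOT EXISTS btree_gist;

  -- Prevent the same room from being booked for overlapping time ranges
  CREATE TABLE room_bookings (
      room_id int,
      during  tstzrange,
      EXCLUDE USING gist (room_id WITH =, during WITH &&)
  );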
Next piece of content: "Scaling Postgres LISTEN/NOTIFY." This is from pgdog.dev, and again, LISTEN/NOTIFY is one of those features I haven't really chosen to use in Postgres. [00:07:09] Keeping a connection open for a really long time, listening for messages, is just not something I've chosen to do; I use other types of queuing or messaging systems. But support for this is something that's being built into PgDog. Again, this is the scaling pooler being developed by Lev Kokotov, and he gives a basic visual representation of how pub/sub works. Basically, you have a publisher that sends a NOTIFY command to a particular channel with an optional payload. [00:07:42] So say it sends it to a channel for background jobs or a channel for user signups. Then another Postgres client executes a LISTEN command, listening on a particular channel, and it can listen to more than one channel for these messages coming down. So you can see the top subscriber listening for background jobs, the one in the middle listening for both background jobs and user signups, and the one at the bottom just listening for user signups. Basically, once a subscriber receives a message, it can take some sort of action. [00:08:10] Now, the changes they've made to PgDog basically support LISTEN/NOTIFY within it, and it utilizes the Tokio runtime in Rust to help manage these features. So it keeps the connections open from the client's perspective, and then it spawns background tasks as necessary to process those messages on the Postgres backends. Basically, he says you can have thousands or even millions of clients connected to PgDog and it will forward those messages efficiently to Postgres without putting all of that connection load on Postgres. So it's doing a lot of the heavy lifting of maintaining all of these connections. Now, he does say both LISTEN and NOTIFY commands are acknowledged by PgDog immediately, so they don't wait for an answer from Postgres; they're trying to remove that latency penalty. But as a consequence, a client is only guaranteed to eventually be subscribed to a topic after sending a LISTEN command, so there's the potential for a lost message there. In addition, they're implementing at-most-once messaging, so it's just going to attempt delivery one time, and as a consequence you could also lose messages that way. So he says if message loss is a problem, you definitely want to implement some sort of database-backed mechanism to make sure those messages are guaranteed not to be lost. But if you have messages where it's okay to lose one or two, you could easily use this system. And he says it also supports sharding, where PgDog hashes the channel name to route commands for a channel to a matching shard. [00:09:52] So I thought this was an interesting use case that may make LISTEN/NOTIFY more usable in certain scenarios.
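For reference, the underlying Postgres commands PgDog is proxying here are the standard ones; a minimal sketch, with channel names and payloads made up:

  -- Subscriber session: start listening on one or more channels
  LISTEN background_jobs;
  LISTEN user_signups;

  -- Publisher session: send a message with an optional payload
  NOTIFY background_jobs, '{"job_id": 42, "action": "resize_image"}';

  -- The same thing from SQL or application code via the pg_notify() function
  SELECT pg_notify('user_signups', 'user 101 signed up');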
[00:09:59] Next piece of content: "High availability and Postgres: full sync replication." This is from multigres.com, and Multigres is the other sharding pooler, similar to PgDog, although they have de-emphasized a lot of the sharding talk recently and are mostly focused on high availability, at least that's what I recall from their Postgres FM interview. As a reminder, this is Sugu, who worked on Vitess, which is a high availability solution, and I think includes sharding as well, for MySQL. Well, he is building Multigres for Postgres, and he's talking about his plans and goals here. He basically wants to have a set of durability policies for handling HA, like: do you just want one cross-availability-zone replica? Do you want cross-region replicas? Or do you want a majority quorum using consensus rules like Raft? Basically, these are different policies they plan to implement. And in terms of Postgres and replication, you can either go with asynchronous replication or synchronous replication. Synchronous replication means it waits until that information is committed on however many replicas you define, meaning they receive that transaction and commit it, usually at least to the WAL, or even to the database files themselves, before the transaction is confirmed back to the user. And he says that Multigres is focused on synchronous replication, not asynchronous replication. [00:11:31] Of course, the downside with synchronous replication is that it definitely introduces more latency, because you not only have to get all that data committed on the primary, but then send it over a network transport, potentially across regions, and commit that transaction on a secondary database. And if you have more than one database you're looking to commit to, that just extends the latency from there. So I'm interested to see how that actually plays out. They also talk about different ways you can lose data even with synchronous replication, and I think for some of these reasons they're actually going to be implementing a two-phase sync replication solution. [00:12:13] Now, I don't know if this is going to be built into Postgres or be part of the Multigres solution, but they're looking to really elevate the high availability possibilities of Postgres with this. But if you want to learn more, definitely check out this blog post.

[00:12:26] Next piece of content: "Enable parameters will work differently in Postgres 18." This is from pgmustard.com. Postgres has a variety of enable_* parameters where you can enable or disable different types of operations the planner can do, like nested loops or sequential scans. [00:12:45] Now, I would never run Postgres with any of these turned off in normal operation, but they're good for diagnosing issues you're encountering. The way these settings have been working is that when you disable a setting, for example SET enable_seqscan TO off, it would actually attach an astronomical cost to that particular operation. In other words, give a sequential scan an enormous cost so that the planner won't choose it. But as you can see here, it still chose it, because it was the only option: there are no indexes, so it had to do a sequential scan, even though sequential scans were off. And they said, well, why are you changing this? Even though the number they're using is very big, there are still people with really complex analytical queries that were hitting it, and even though they had turned something off, like sequential scans, it was still being chosen when it actually shouldn't be. So they changed how it works. [00:13:39] Now what they're tracking is whether a node is disabled or not, and they choose the path, meaning the optimizer path, with the fewest disabled nodes. So it's like doing a first pass of "let's try to avoid anything that's disabled," and once we're there, then applying the lowest-cost algorithm. It tries to take those disabled nodes off the top, not consider them, and then focus on lowest cost. As a consequence, even though you're turning off sequential scans here, you'll see the costs have not been altered, they are the same, but it now just says Disabled: true. So it is using a sequential scan even though it's disabled, because there's no other alternative in this particular case. But if you want to learn more, definitely check out this blog post.
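Here's a small sketch of the kind of experiment being described; the table and query are made up, and the exact EXPLAIN wording may differ slightly:

  -- Tell the planner to avoid sequential scans for this session
  SET enable_seqscan TO off;

  -- With no usable index, a sequential scan is still the only possible plan
  EXPLAIN SELECT * FROM measurements WHERE sensor_id = 10;

  -- Before Postgres 18: the Seq Scan node shows a hugely inflated cost.
  -- In Postgres 18: the cost stays realistic and the node is simply
  -- marked as disabled (e.g. "Disabled: true") in the plan output.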
[00:14:26] Next piece of content: "Postgres replication slots: confirmed_flush_lsn vs. restart_lsn." This is from morling.dev, and he's talking about the two different LSNs you see in the pg_replication_slots view: a restart_lsn and a confirmed_flush_lsn. Well, what's the difference? The confirmed_flush_lsn is the latest LSN acknowledged by the slot's consumer. So there's a consumer that's consuming the slot, and this is the last LSN it pulled. He has an example here: he created a table, created a replication slot using the test_decoding plugin, inserted three records into the table, and then ran pg_logical_slot_get_changes for that slot. You can see the implicit BEGIN, then the three rows that were inserted, and the implicit COMMIT. And when you actually look at pg_replication_slots at this point, you see the last LSN listed here, ending in F780, is the confirmed_flush_lsn. [00:15:32] So this is the last LSN consumed from the slot.

Okay, so what about the restart_lsn? Basically, this indicates how much WAL needs to be held onto because you have in-flight transactions. He has a great example here. Imagine you have transaction A in blue and transaction B in orange. Transaction A starts first, then transaction B starts, but transaction B actually finishes early; it does its commit at this point. That commit is available on the slot and can be pulled by the consumer, and once it's pulled, you have the confirmed_flush_lsn listed as this commit here. [00:16:20] But the restart_lsn is still back here, because you have this transaction A that's still in flight; essentially it has not been rolled back or committed. You can see it's committed at the end here, but not yet at this point in the timeline. That's why the restart_lsn lists this entry here, because it's before transaction A started. [00:16:41] So it's conceivable that if there are no other transactions going on at the point where transaction A completes, then both the restart_lsn and the confirmed_flush_lsn will match, because there are no transactions in flight and everything has been flushed to the replication slot consumer.

Now, he says one thing to keep in mind: if you are consuming logical replication slots, you cannot rely on the LSNs of received events to be strictly increasing. A transaction could commit early, as in that example, because transactions are exposed in commit order. [00:17:18] So we saw B before we saw A; events with a lower LSN can be published after events with a higher LSN. This also highlights the problem with larger, long-running transactions in Postgres, because they prevent Postgres from advancing that restart_lsn. He also talks a bit about mid-transaction recovery, meaning what happens if you set the slot's LSN to the middle of a transaction. So he inserts some more data into his example table, looks at the actual records here, and chooses one in the middle of a transaction, ending in F90. He advances the slot to that location and then says, okay, let's get the slot changes. What you see is that even though that LSN is in the middle of the transaction, it still pulls the entire history of that transaction when you do pg_logical_slot_get_changes. I thought this was a great blog post explaining how logical replication, replication slots, and logical decoding work, so I definitely encourage you to check it out.
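If you want to poke at this yourself, here's a rough sketch along the lines of the example he walks through; the slot name and table are made up:

  -- Create a logical slot using the built-in test_decoding output plugin
  SELECT pg_create_logical_replication_slot('demo_slot', 'test_decoding');

  -- ... insert a few rows into a table, then consume the changes ...
  SELECT * FROM pg_logical_slot_get_changes('demo_slot', NULL, NULL);

  -- Compare the two LSNs discussed above
  SELECT slot_name, restart_lsn, confirmed_flush_lsn
    FROM pg_replication_slots
   WHERE slot_name = 'demo_slot';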
And the last piece of content is "Graviton 2 versus Graviton 4." This is from ardentperf.com. He was setting up the CloudNativePG playground on some instances and noticed that on a Graviton 2 instance the CPU utilization was around 40%. [00:18:41] Now, I'm assuming that's steady state, but you can see the CPU utilization indicated here at around 43% for the Graviton 2, whereas on the Graviton 4 it was at only 8%, and I think he said roughly around 10%, so about four times less utilization between these two processor generations. So if you're running Postgres on Graviton 2, maybe you want to run an experiment and try it on Graviton 4 to see if you get better resource utilization on your instances.

[00:19:11] I hope you enjoyed this episode. Be sure to check out scalingpostgres.com, where you can find links to all the content discussed, as well as sign up to receive weekly notifications of each episode. There you can also find an audio version of the show, as well as a full transcript. Thanks, and I'll see you next week.

Other Episodes

Episode 371

June 22, 2025 00:21:08

Are You Hurting Your Performance? | Scaling Postgres 371

In this episode of Scaling Postgres, we discuss how you use the database can impact its performance, performance of pg_search, fast uploads and some...

Episode 374

July 13, 2025 00:18:34

Migrating Millions Of Databases | Scaling Postgres 374

In this episode of Scaling Postgres, we discuss how Jira migrated millions of databases, when sigterm does nothing, should you ditch vector search and...

Episode 302

February 11, 2024 00:13:23

UUID vs Bigint Battle!!! | Scaling Postgres 302

In this episode of Scaling Postgres, we discuss when and how you should use UUIDs or not, how to optimize space with column order...
