Episode Transcript
[00:00:00] In this episode of Scaling Postgres, we talk about scaling lessons, synchronous commit, connection management, and cleaning up blobs. I'm Creston Jamison and this is Scaling Postgres episode 161.
[00:00:22] All right, I hope you, your friends, family and coworkers continue to do well. Our first piece of content is Lessons Learned From 5 Years of Scaling PostgreSQL. This is from Onesignal.com, and they're talking about their experience scaling up and managing a 75 terabyte database stored across nearly 40 servers. So that is definitely quite the database. And they cover different sections: bloat, database upgrades, XID wraparound, replica promotion, partitioning, and sharding. Now, they first go into a higher level overview of the data, and basically they have two dominant tables they talk about. One is their subscribers table, which looks to have tens of billions of rows, and the other is a notifications table. And they have to do a fair amount of updates from what they're saying here. So the notifications table has a heavy update load, and the subscribers table is write heavy in terms of inserts and updates. So of course, when you have a lot of updates or a lot of deletes, you're going to experience bloat. So they talk about table bloat as being an issue, and that what you need to do is, of course, set appropriate values for vacuum to make sure those tables get taken care of. Now, they don't go into detail about their particular autovacuum setup to be able to make sure that their tables experience a minimal amount of bloat. They do talk about tuning autovacuum, but they don't give the exact values that they use, though they do reference a 2ndQuadrant article about it. Then they talk a little bit about schema optimizations they've done. One is partitioning some of these large tables so that vacuum can run on each partition, as opposed to taking forever to run on one big, huge table. The other is actually moving some columns to a separate table, even though it maybe logically makes sense to have them on the original table. If you have information that's not updated as frequently, having it in a second table means that that data doesn't need to be vacuumed and then frozen as often, so that can lead to some benefits.

In terms of reducing bloat, they talk about using VACUUM FULL, but again, that locks the whole table. They also talk about using pg_repack, although with indexes, I would probably just use the REINDEX CONCURRENTLY capability of Postgres. But pg_repack is a good tool to use for tables. And they did have some specific issues with one table where they found pgcompacttable, which is another tool, does reduce bloat in a non-blocking manner. So definitely some tools to handle bloat. Now, in terms of database upgrades, they've chosen to go the logical replication route with pglogical. So they logically replicate the data over to new versions, and that's how they've handled that, as they discuss here. And minor upgrades are pretty easy to do with Postgres in terms of just updating the binaries. They talk about the importance of tracking transaction ID wraparound, or XID wraparound, so it's super important to monitor it, but well-vacuumed tables should keep it under control. There are also a few things you can tune, such as autovacuum_freeze_max_age. Next they talk about replica promotion, but really I consider this their high availability setup and how they handle it. They use HAProxy, and it looks like they're kind of in an intermediate state of developing this.
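To give a rough sense of what the per-table autovacuum tuning discussed above can look like, here is a minimal sketch. The article doesn't publish OneSignal's actual values, so the table name and numbers here are purely illustrative:

    -- Make autovacuum trigger more aggressively on a hot, heavily updated table
    ALTER TABLE notifications SET (
        autovacuum_vacuum_scale_factor  = 0.01,  -- vacuum after ~1% of rows change instead of the 20% default
        autovacuum_analyze_scale_factor = 0.02   -- refresh statistics more often as well
    );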
Then they talk about their partitioning and how they are partitioning right now with 256 partitions, and they're considering increasing that to 4096 partitions in the future. And then they talk about sharding, which, as they say here, is a natural extension of partitioning. So if you want to get a little bit of experience from someone who's been running a pretty large Postgres installation for five years, definitely check out this blog post.
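The post doesn't show their schema, but as a hypothetical sketch, splitting a table into 256 pieces with declarative hash partitioning looks roughly like this; the table and column names are made up for illustration:

    -- Hash-partitioned parent table
    CREATE TABLE subscribers (
        id     bigint NOT NULL,
        app_id uuid   NOT NULL,
        data   jsonb
    ) PARTITION BY HASH (id);

    -- One child table per remainder; in practice all 256 would be generated by a script
    CREATE TABLE subscribers_p0 PARTITION OF subscribers
        FOR VALUES WITH (MODULUS 256, REMAINDER 0);
    CREATE TABLE subscribers_p1 PARTITION OF subscribers
        FOR VALUES WITH (MODULUS 256, REMAINDER 1);
    -- ...and so on up to REMAINDER 255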
[00:04:03] The next piece of content is The Power of Synchronous Commit. This is from Momjian.us, and he's talking about the different behaviors the synchronous_commit setting controls. He mentions that he updated some documentation in October, which went into the November 12th release, because it needed to communicate a bit better what that parameter value does. And he goes into detail about it because it handles how things get committed at the local level, meaning on a single server, as well as when you have replicas involved. For example, he mentions that when synchronous_commit is off, the database doesn't wait for the records associated with a commit to be written to the write-ahead log. It still guarantees data consistency; it just means that for something that's been committed and acknowledged back to the client as committed, if the server goes down, there's no guarantee that that commit will make it to the database files, but the database will remain consistent. And he says this is in contrast to fsync off, where there's no guarantee of data consistency. He mentions the local option for that setting waits for the commit to be durably recorded, meaning the WAL has been written to disk. The other settings handle how replicas handle the writing. So remote_write waits for one or more standby servers to have written out their transaction information. The on option, which is the default, is similar to remote_write but is durable against standby operating system crashes, not just Postgres crashes. And remote_apply waits for the transaction to be replayed and visible to future read-only transactions. So if you want to learn a little bit more about the synchronous_commit setting, you can definitely check out this blog post.
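As a small illustration of how these settings might be used in practice, here's a hedged sketch; the table names are hypothetical, and the values you choose depend entirely on your durability requirements:

    -- synchronous_commit can be changed per session or even per transaction,
    -- so durability can be relaxed only for data you can afford to lose
    SET synchronous_commit = off;           -- don't wait for the local WAL flush
    INSERT INTO event_log (payload) VALUES ('non-critical event');
    RESET synchronous_commit;

    -- For critical writes with synchronous replicas, wait until standbys have replayed the commit
    SET synchronous_commit = remote_apply;
    UPDATE accounts SET balance = balance - 100 WHERE id = 1;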
[00:05:45] The next piece of content is Your Guide to Connection Management in Postgres. This is from Crunchydata.com. He's talking about different methods of pooling connections to make your Postgres server more efficient, because connections that are left open can be reused, which results in lower latency to get queries back from the database, since you don't have to build up a new connection, particularly an SSL or TLS connection. Now, the first thing he mentions is application-side pooling. This is your application framework handling some of that pooling itself and keeping a number of connections open. This may be based upon the number of workers you have, and maybe they use a common pool of connections to do their work against Postgres. Having those processes keep the connections open for a time and reuse them will make querying the database through the application framework more efficient. Once you get to a certain scale, you may need to implement server-side connection pooling. This is using a tool such as PgBouncer to pool the connections and use a common set of server connections for all the application connections coming in. And then he mentions the scenario where you route connections across databases, and some application frameworks support sending particular database queries to certain databases. He mentions, for example, that Rails has some libraries that help with routing via a specific model. So if you want to learn a little bit more about things you could be doing with connection pooling to make your database more efficient, definitely check out this blog post.
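For a sense of what server-side pooling looks like, here is a minimal, hypothetical pgbouncer.ini sketch; the host, database name, and pool sizes are illustrative and would need tuning for your own workload:

    [databases]
    ; hypothetical target database
    appdb = host=10.0.0.5 port=5432 dbname=appdb

    [pgbouncer]
    listen_addr = 0.0.0.0
    listen_port = 6432
    auth_type = md5
    auth_file = /etc/pgbouncer/userlist.txt
    ; transaction pooling hands server connections back to the pool at transaction end
    pool_mode = transaction
    ; server connections per database/user pair
    default_pool_size = 20
    ; total client connections PgBouncer will accept
    max_client_conn = 1000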
[00:07:14] The next piece of content is Cleaning Up a Large Number of Blobs in PostgreSQL. This is from Cybertec-postgresql.com, and they're talking about using blobs, or binary large objects, and actually storing them in the database. One scenario, if you're working with binary objects, is to store the file outside the database and just store a reference to it in the database. The other way is to actually store the binary information of that file in the database, so you no longer essentially need that file. Now, one way you can do that is to use the lo_import large object function and give it a file name, and it will actually store that file's binary data in the database and return an object ID. Now, the typical way you would probably implement this is to create a separate table, and then when you do an insert into that table, use the lo_import function and it will return the object ID. Then you can store the reference to the large object in a known table that you have. Now, for an experiment, he loaded about a million of these objects into Postgres. Then he took a look at how these large objects are stored, and basically they're stored as byte arrays. So anything that's a blob is essentially an array of bytes. For example, he set bytea_output to escape, and then when he queries pg_largeobject for a particular object ID, you can actually see the data that's stored in that byte array. Now he mentions removing blobs, and that's easy to do using lo_unlink and referencing that object ID; it unlinks it and removes it from the database. However, I'm assuming a client ran into an issue, because how do you remove millions of blobs at a time? When you actually try to do a large number of these, you can run out of shared memory. The reason is that it requires a lock to do this, and it's not a row lock, but one of the locks tracked by max_locks_per_transaction. So you can't remove, say, a million of these binary large objects in one transaction. The way I would deal with that is to just delete the blobs in chunks. The other possibility is changing the Postgres configuration, but it's definitely something to keep in mind if you need to delete these en masse. And the last thing he covers is how to deal with orphaned blobs, because again, the suggestion is to do an insert where you save the reference to the binary large object. But what if you have orphaned objects? Well, he talks about how you can handle them here. So if you want to learn more about that, definitely check out this blog post.
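As a minimal sketch of the pattern described here, assuming a hypothetical reference table and file path (and noting that lo_import with a server-side path needs appropriate privileges):

    -- Keep a reference table whose rows point at large objects
    CREATE TABLE file_ref (id bigserial PRIMARY KEY, blob_oid oid);
    INSERT INTO file_ref (blob_oid) VALUES (lo_import('/tmp/example.bin'));

    -- Removing a single blob is just lo_unlink on its object ID
    SELECT lo_unlink(blob_oid) FROM file_ref WHERE id = 1;

    -- Unlinking millions in one transaction can exhaust shared memory
    -- (bounded by max_locks_per_transaction), so do it in batches instead
    SELECT lo_unlink(blob_oid) FROM file_ref ORDER BY id LIMIT 10000;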
[00:09:46] The next piece of content is Managing Transaction ID Exhaustion (Wraparound) in PostgreSQL. This is from Keithf4.com, Keith's Ramblings. And much like the previous post about five years of scaling lessons mentioned, monitoring this is highly important to do, particularly with a highly active database. He presents a query, apparently the one used in Crunchy Data's pgMonitor, a CTE query that tracks the transaction ID and how soon you're going to hit that 2 billion limit; it reports a percent towards wraparound and a percent towards emergency autovacuum. So how soon is it going to hit the mark where it has to do an aggressive vacuum to take care of freezing things? He also mentions autovacuum_freeze_max_age, for which the default value is 200 million. If you have a highly active database, you may want to increase this a bit, but definitely to no more than a billion, from the knowledge that I'm familiar with. Once you have your monitoring set up, the next consideration is how to fix it if you actually start approaching this problem. Now, one way to do it for the whole database is to just vacuum the whole DB specifying freeze, with a couple of jobs running at a time, and then also do an analyze as well. That will freeze your rows and free up transaction IDs. But there is also a per-table fix. You can use a query to determine which databases have an age that's going to hit the limit sooner than others, and then you can check each table within that particular database to see which table is closest to hitting that wraparound point. Then you can separately vacuum freeze that table. He also has a technique here where you can do multiple tables at a time if you need to. So if you're looking for some techniques to monitor your transaction IDs for wraparound as well as how to resolve it, you can definitely check out this blog post.
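Here's a simple hedged check of wraparound progress per database; it's a minimal sketch, not the pgMonitor query from the post:

    -- How far each database is toward the ~2 billion transaction ID limit
    SELECT datname,
           age(datfrozenxid) AS xid_age,
           round(age(datfrozenxid) * 100.0 / 2000000000, 2) AS pct_towards_wraparound,
           round(age(datfrozenxid) * 100.0 / current_setting('autovacuum_freeze_max_age')::numeric, 2)
               AS pct_towards_emergency_autovacuum
    FROM pg_database
    ORDER BY age(datfrozenxid) DESC;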
[00:11:44] The next piece of content is Postgres is Out of Disk and How to Recover: The Dos and Don'ts. This is from Crunchydata.com, and the first thing they talk about with regard to this is to make sure you have a backup. Then they also cover the scenarios under which you would get a full disk. One is that the archive command is failing and the WAL is filling up the disk. The second is that the replication slots are essentially orphaned.
[00:12:09] There's a disconnected standby, and the result is that on the primary, the WAL is filling up. The next scenario is that a lot of database changes generated a ton of WAL that consumed all the available disk space, so maybe a mass of deletes combined with inserts, or more likely a mass of updates. And the last scenario is that you just literally ran out of disk space from the data itself and you don't have any monitoring or alerting to let you know. So the first thing they talk about is what not to do. Number one is never remove the WAL. Definitely never remove that. Number two, don't immediately overwrite the existing data directory with a restore from a backup. Well, I wouldn't do that either. Next, don't just resize in place. I don't know if I 100% agree with this, because there are some ways that this may be possible. What they say you should do is take a file system backup right now; I don't necessarily agree with that. They say create a new instance with sufficient space. I'm like, okay, that's possible. Or fix the underlying issues. Generally, if you're running on a cloud provider, what I would do is see if you can just increase the disk. AWS, for example, has a way to expand volumes, so I would go ahead and do that. Or is there some other file or some other reserved space on that volume that you can delete, at least to get the database back online? Because once you get the database back online, you can then potentially delete orphaned replication slots, which may be a cause, or you can adjust the traffic so that you're not updating so much and the WAL can catch up, or you can fix the problems with the archive command. So there are a lot of different ways you could potentially resolve this, and this post does cover the main ways that running out of disk space happens with Postgres. So if you're interested in learning more, definitely check out this blog post.
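If orphaned replication slots turn out to be the cause, here's a small hedged sketch for finding and removing them; the slot name is hypothetical:

    -- Check for inactive slots and how much WAL each one is holding back on the primary
    SELECT slot_name,
           active,
           pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
    FROM pg_replication_slots
    ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC;

    -- Once you're sure a slot is truly orphaned, dropping it lets the WAL be recycled
    SELECT pg_drop_replication_slot('orphaned_standby_slot');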
[00:14:04] The next post is Finally, a System-Level Read All Data Role for PostgreSQL. This is from Cybertec-postgresql.com. They're talking about a new patch that's been added for Postgres 14, which should be coming out this fall, so that you now have a pg_read_all_data role and a pg_write_all_data role. He really loves this feature because, as he says, what tends to happen is that, yes, you want to be able to spend time and craft your roles appropriately, but too many people just grant superuser inappropriately, and when you're doing that, you're actually opening up your server to more risk. For example, you can run operating system commands on the database server with superuser permissions, so you want to minimize how many people have that access, and these roles are a great way to avoid that. So if you have a data analyst and all they need to do is query the data, you can just give them the pg_read_all_data role. Or if you have an application user that needs to be able to read and write data, you could give them both of these roles. Now, he does talk about some of the workarounds if you don't have these roles, and really the new roles are a more structured way to handle it; he covers creating roles and assigning the appropriate permissions to do essentially what these new roles do. So if you want to learn more about setting this up, definitely check out this post from Cybertec-postgresql.com.
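As a brief hedged sketch of how these new roles would be used on Postgres 14 or later (the role names and passwords are hypothetical):

    -- A read-only analyst who can query everything but isn't a superuser
    CREATE ROLE analyst LOGIN PASSWORD 'change_me';
    GRANT pg_read_all_data TO analyst;

    -- An application role that needs to both read and write all data
    CREATE ROLE app_rw LOGIN PASSWORD 'change_me';
    GRANT pg_read_all_data, pg_write_all_data TO app_rw;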
[00:15:28] The next piece of content is Citus Talk at CMU: Distributed PostgreSQL as an Extension. This is from Citusdata.com. They're talking about a presentation that was done at CMU relatively recently about PostgreSQL and how Citus scales out Postgres by being an extension. It covers a number of topics, pointed out at the different watch times of the YouTube video, such as when to use Citus to distribute Postgres, using the PostgreSQL extension APIs, transparent sharding, the distributed query engine, and distributed transactions. If you want to learn more about Citus, you can definitely check out this blog post and talk.

The next piece of content is Benchmarking PostgreSQL: Setting Up a RAID Array From Scratch. This is from Enterprisedb.com, and the main subject of this article is setting up a RAID array, because they had an AWS cloud instance and were moving to an on-premises server to get more consistency out of their benchmarking. For that, they're using the software RAID tool mdadm. But I've even used this on Amazon, because when your database gets up to a certain size, there are limits to how big an EBS volume can be, and if you have a database larger than that, then you need to basically stripe across multiple EBS volumes. So even if you're in AWS and your database is large enough, you may be using this anyway. He basically goes through the process of setting up these mdadm RAID arrays for Postgres volumes. So if you're interested in that, you can check out this blog post.

The next piece of content is How Does Postgres Handle External Parameters? This is an in-depth look at the parameterization that Postgres does with queries. So if you want to get under the covers of Postgres and understand how that works, definitely check out this post from Highgo.ca.

The next piece of content is Regression Analysis in PostgreSQL with TensorFlow: Part 2, Data Pre-processing. So if you want to learn more about using TensorFlow for machine learning with Postgres, you can definitely check out this post from Enterprisedb.com.
[00:17:35] And the last piece of content is that the PostgreSQL Person of the Week is Devrim Gündüz. So if you're interested in learning more about Devrim and his contributions to Postgres, definitely check out this blog post. That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to Scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.