Data Directory, Vacuum Features, Binary Performance, Network Latency | Scaling Postgres 114

Episode 114 | May 18, 2020 | 00:20:28

Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, we discuss the Postgres data directory, new vacuum features, binary storage performance and the impact of network latency.

To get the show notes as well as get notified of new episodes, visit: 

https://www.scalingpostgres.com/episodes/114-data-directory-vacuum-features-binary-performance-network-latency/


Episode Transcript

[00:00:00] In this episode of Scaling Postgres, we talk about the data directory, vacuum features, binary performance and network latency. I'm Creston Jamison, and this is Scaling Postgres episode 114.

[00:00:21] Alright, I hope you, your family and coworkers continue to do well throughout these times. Our first piece of content is that new PostgreSQL releases 12.3, 11.8, 10.13, 9.6.18 and 9.5.22 are out. The only security issue relates to the Windows installer, which was running executables from uncontrolled directories. So if you run on Windows, maybe you want to upgrade faster. Otherwise there are over 75 bug fixes and improvements, so I would probably look through this list to see if there's any reason you'd want to upgrade relatively quickly. This is from the postgresql.org website. There's also a post on Devrim Gündüz's PostgreSQL blog saying that for yum users, some -devel RPMs require a new repository. He says if you're using Red Hat Enterprise Linux or CentOS 7, then the -devel subpackages in 11 and 12 require an extra repo enabled for LLVM and Clang, so for CentOS or Red Hat Enterprise Linux you just need to run this command. Something to keep in mind if you use those versions.

[00:01:36] The next piece of content is "Don't manually modify the PostgreSQL data directory". This is from the Robert Haas blog on blogspot.com, and this is a big no-no. Unless you have actually done PostgreSQL development, I would probably never modify anything in the data directory. What's interesting is that when he gives this recommendation, a lot of people say, really? What if I do X, Y and Z? Don't do it, just don't do it. He says in 100% of cases his answer is that it's not safe. But going back to "hey, what if I do this", he goes through in this post what happens if you do it. For example, if you remove postmaster.pid, there's a chance you'll end up with multiple postmasters running against that same PostgreSQL cluster, which could lead to who knows what kind of effects. And he says if you suspect it's present without Postgres running, assume that you're wrong, because in his 20 or so years he's never really seen it happen that way. So definitely something not to do. Removing WAL files is another big no-no. In older versions of PostgreSQL the WAL files were stored in the pg_xlog directory, so people assumed, hey, these are logs, I can delete them. No, they're part of the WAL, and the directory has actually been renamed to pg_wal in recent versions. You don't want to delete these because they support crash recovery and maintain the consistency of your database, and there are all sorts of ill effects he goes through that can happen here. But he says instead of removing files manually from pg_wal, consider why the system is keeping so many of them around. Maybe you have a failing archive_command. Maybe you're just not keeping up with the rate of WAL generation; I've had slow disks cause this type of thing. Do you have an existing or orphaned replication slot that is preventing WAL segments from being removed? Maybe you have long-running transactions, or maybe your max_wal_size or wal_keep_segments settings are really high, preventing them from being removed. So again, these are things you can fix without having to change anything in the data directory.
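To illustrate the kinds of checks he suggests instead of deleting WAL files, here is a minimal diagnostic sketch; these queries are my own illustration rather than something taken from the post:

    -- Are any replication slots (active or orphaned) holding WAL back?
    SELECT slot_name, active, restart_lsn
    FROM pg_replication_slots;

    -- Are there long-running transactions?
    SELECT pid, now() - xact_start AS xact_duration, state, query
    FROM pg_stat_activity
    WHERE xact_start IS NOT NULL
    ORDER BY xact_duration DESC;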
Next he covers removing files from the pg_xact directory, and then the pg_multixact directory, and he talks about trying to do single-page restores, which carries a large degree of risk; he communicates all the types of things that can go wrong. So if you're interested, I definitely suggest checking that out. Then there's a section on whether manual modification of the data directory is ever justified. If you are in a corrupt state there are certain cases, again very few, where this is perhaps warranted, but I would leave that to an expert, someone who has done PostgreSQL development and knows exactly what they're doing. He makes a comparison to surgery: essentially you want to find a surgeon to do this. You don't want to just willy-nilly go in and try changing things, because you'll probably make things worse. So this was a great post. If you've ever considered modifying the data directory, basically just don't do it. It's definitely a post I suggest you check out.

[00:04:52] The next post is "Improved Vacuum and Autovacuum in PostgreSQL 13", from Amit Kapila's blog. Improvement number one is that vacuum will be allowed to process indexes in parallel. This is huge, and where it works in parallel is the indexes, which in my experience take the longest part of a vacuum. There are various configuration settings, such as max_parallel_maintenance_workers and min_parallel_index_scan_size, that let you configure how indexes are vacuumed in parallel. It looks like this will be the default, which is good for autovacuum as well, but you can also specify the number of workers when you do a manual vacuum. So this is a great improvement coming in 13. Improvement number two allows inserts to trigger autovacuum activity. By default, when you only have inserts, there's nothing really to vacuum up in terms of deleted or updated rows in a table, and the first vacuum that table gets will be an anti-wraparound vacuum, which will probably take a really long time, potentially use more resources, and can't be cancelled. So there are some disadvantages there. But with some configuration settings they've introduced, a vacuum will be triggered once so many inserts have been done to a table, so it can do some general cleanup such as freezing rows, making sure heap pages are marked all-visible, and allowing index-only scans to skip heap fetches. So being able to do this vacuum is important for those reasons. Improvement number three allows vacuum to display additional information about the heap and index in case of an error, so it can tell you where particular errors are if vacuum runs into a problem. Knock on wood, I haven't experienced this, but the reporting would be good to have in case it happens. Improvement number four: autovacuum will now log WAL usage statistics along with other information. This is great for understanding how many full-page images are written and things of that nature. Improvement five makes vacuum buffer counts 64 bits wide to avoid overflow of buffer usage stats, so the stats will be accurate as opposed to invalid, which is a good benefit. And six adds a wait event, VacuumDelay, to report on cost-based vacuum delay, which he says will help us monitor autovacuum throttling. So there are a lot of great vacuum features coming in 13, especially the parallel vacuuming.
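As a rough sketch of the first two improvements (the table name and thresholds here are hypothetical examples, not values from the post):

    -- PostgreSQL 13: manual vacuum using up to 4 parallel workers for index vacuuming
    VACUUM (PARALLEL 4, VERBOSE) my_table;

    -- PostgreSQL 13: make inserts trigger autovacuum sooner for this table
    ALTER TABLE my_table SET (
        autovacuum_vacuum_insert_threshold = 100000,
        autovacuum_vacuum_insert_scale_factor = 0.1
    );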
So if you want to learn more about it, definitely check out this blog post.

[00:07:24] The next piece of content is "Binary data performance in PostgreSQL". This post covers performance when using binary data with PostgreSQL, and they cover three ways you can store it. One is storing the data in files and storing a reference to each file in the database. This has advantages and disadvantages. The disadvantage is that consistency is not guaranteed: the moment you write the file and the moment you insert the record in the database can get misaligned. The benefits are that the database stays small and performance can be very fast, because reading directly from the file system is faster. The next option is storing the data in large objects, the older way of storing binary data in PostgreSQL. It's the slowest, but it has the advantage that you can store arbitrarily large data, and the API supports streaming, so you can read and write large data in chunks. The third option is storing the data in a bytea column, the newer way of storing binary data. The main disadvantages are that there's an absolute limit of 1GB and all the data has to be handled in memory; there's no streaming support. With bytea, PostgreSQL basically uses TOAST to store the data separately from the table. Now, because PostgreSQL normally compresses large values, you can avoid that by setting the storage for your bytea column to EXTERNAL. So if you're already compressing the data yourself, use EXTERNAL to avoid trying to compress it again. Next they talk about a Java implementation they wrote to test the speed of these different solutions, and the table at the bottom shows that, for two different data sizes, using the file system directly is the fastest, but again the disadvantage is the potential inconsistency; large objects are pretty darn slow, but they support streaming; and bytea is faster than large objects, but nowhere near the speed of the file system. In the summary they cover the advantages and disadvantages of each approach. So if you're wanting to store binary data in PostgreSQL, definitely check out this post so you can make the best choice for your use case. And I should mention, this is from cybertec-postgresql.com.
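As an aside on the bytea option described above, here is a minimal sketch of skipping TOAST compression for data that is already compressed; the table and column names are hypothetical:

    -- A bytea column for binary payloads
    CREATE TABLE documents (
        id bigserial PRIMARY KEY,
        payload bytea
    );

    -- Store the column out-of-line via TOAST, but without compression
    ALTER TABLE documents ALTER COLUMN payload SET STORAGE EXTERNAL;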
[00:09:53] The next post, also from cybertec-postgresql.com, is "PostgreSQL: Network latency does make a big difference". Here they wanted to test performance with increased latency, so they used a tool called tcconfig to introduce latency on Linux and then used pgbench to run some transaction tests. Running at full speed with no latency introduced, you get 8,813 transactions per second. Introducing ten milliseconds of delay and running the same benchmark, it drops to 461 transactions per second, a huge drop-off. Essentially these are clients waiting for their results to come back before they send the next transaction, and ten milliseconds of latency can cause that. If you bump it up to 50 milliseconds, your TPS is now at 88, so basically it dropped over 100 times. So network latency is hugely important, and they mention it matters specifically for online transaction processing: because you have a lot of small, short transactions, any latency you introduce between the client and the server is going to significantly decrease the number of transactions you can execute per second. So it's a definitely interesting blog post showing you how latency can impact your queries.

[00:11:14] The next post is "Postgres DISTINCT ON", from johnnunemaker.com. This is an interesting post because he's a developer and he walks through a performance problem he was having and how he wanted to fix an N+1 query to get better performance. It covers a lot of Ruby as well as the SQL he used to solve his problem. Basically, the solution he chose as the most efficient one uses DISTINCT ON to pull out the first record he needed, and then he goes through the Ruby implementation of how he introduced this query into his code to get the best performance. So if you're interested in a developer's perspective on seeing a problem and coming up with a solution using DISTINCT ON, definitely check out this blog post.

[00:12:01] The next post is "Overcoming Imposter Syndrome: Working with PostgreSQL's SSL modes", from richyen.com. He's talking about all the different SSL modes that are available in PostgreSQL. What determines whether SSL is used from the server's perspective is the pg_hba.conf file: whatever you put in the first column of each line determines whether SSL is used. local uses a Unix domain socket, so it's not going to use SSL at all. With host, a connection may or may not use SSL. hostssl only accepts SSL, and hostnossl means don't use SSL with us. A convenient way to reject all non-SSL traffic is to put a hostnossl line at the top of your pg_hba.conf that rejects all databases, users and IP addresses, basically rejecting every non-SSL connection attempt, and then add hostssl lines for those you do want to allow to connect. That's how you enforce it on the server side. There's also an sslmode setting on the client that dictates how the client connects: it can prefer SSL, require it, or disable it, and there are also certificate verification modes. He goes through the different ones, having set up the server as described, rejecting all non-SSL connections and only allowing SSL with a password. In the first attempt, with the client's sslmode set to disable, the connection is rejected: it won't work because the client says "I don't want to use SSL" while the server says "I will only accept SSL", so the connection fails. Connections get through when the client's sslmode is allow, prefer or require. However, you'll notice that verify-ca fails, as does verify-full, and this is where you're asking the client to verify the certificate used by the server. This only works if you have manually set up and installed certificates on the server and the client trusts the certificate authority that signed them. verify-ca basically means verifying the CA that signed the certificate of the server you're connecting to; verify-full means you're also verifying that the common name or subject alternative name in the certificate matches the database server you're trying to connect to. So this is the strongest way of verifying that you're not connecting to a server that isn't what you expect.
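As a minimal sketch of the server-side enforcement described above (the addresses and authentication method are just placeholders), the top of pg_hba.conf could look something like this:

    # Reject any TCP connection that is not using SSL
    hostnossl  all  all  0.0.0.0/0  reject
    # Allow SSL connections that authenticate with a password
    hostssl    all  all  0.0.0.0/0  md5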
So if you're wanting to learn more about these settings, definitely check out this post.

[00:14:47] The next piece of content is "Benefits of an external key management system over the internal one and how it could help with securing PostgreSQL". This is talking about the benefits of an external key management system. Now, why are we talking about this? Because HighGo, the organization behind this blog post, is working on a transparent data encryption feature, where data gets encrypted at rest by the PostgreSQL server. They mention that the encryption itself is relatively simple, but what is not simple is key management. So they're working on this transparent data encryption feature, potentially for version 14, and they're talking about having some internal key management features for it, but this post goes through the benefits of an external key management system. So if you're wanting to learn more about the benefits of that, you may want to check out this post from highgo.ca.

[00:15:43] The next post is "Keep your libraries updated and nobody gets hurt", from 2ndquadrant.com. They're talking about the importance of keeping all of your libraries up to date when running PostgreSQL, because PostgreSQL relies on a number of libraries, and if those aren't kept up to date you can run into problems. They point out that not upgrading may give you a false sense of stability (things are fine and you may have a fear of upgrading), but eventually you're probably going to run into issues, and it's usually related to an old software version that hasn't been updated. They go into some actual scenarios. In one, someone was using pg_dump but discovered a pgaudit extension that hadn't been updated in more than a year, and it was causing an error when trying to restore the database, which is a big issue. In the next scenario they had a huge performance difference when running a Postgres query and tracked it down to an old glibc version; once that was updated, performance went back to normal. So it's basically communicating the importance of applying updates to your system, not just your PostgreSQL versions, but also the other libraries that PostgreSQL uses.

[00:17:00] The next post is "Multi-column partitioning", and this is a way of partitioning tables using multiple columns. I actually didn't know this was possible, where you can create a table and partition it on several columns at once, each looking at different ranges, for example a range from 1 to 20, from 110 to 200, and from 50 to 200. The post goes through all the possibilities, things to watch out for, and how queries work when using multiple columns for your partitioning. Now, I haven't encountered a use case for this, but it's definitely an interesting blog post you may want to check out in case you have one.
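Here is a minimal sketch of what multi-column range partitioning can look like (the table, columns and bounds are hypothetical); note that partition bounds are compared as a whole, row-wise, rather than column by column:

    -- Partition on two columns at once
    CREATE TABLE measurements (a int, b int, payload text)
        PARTITION BY RANGE (a, b);

    CREATE TABLE measurements_p1 PARTITION OF measurements
        FOR VALUES FROM (1, 110) TO (20, 200);
    CREATE TABLE measurements_p2 PARTITION OF measurements
        FOR VALUES FROM (20, 200) TO (50, 300);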
[00:17:41] The next post is "pgBadger: X-ray vision for your queries", from enterprisedb.com. They're basically talking about using pgBadger to analyze your logs, and they go through some of the things you may want to additionally configure in postgresql.conf to make sure you're collecting the most information, then walk through how you can use pgBadger to analyze your logs. So if you're interested in doing that, check out this blog post.

[00:18:10] The next post is "Key Parameters and Configuration for Streaming Replication in Postgres 12". In this example they're looking at setting up a primary server with two replicas, which you can choose to make synchronous or not. They're using replication slots, and they describe the settings that need to be made on the primary and the replicas. So if you're interested in setting this up for your system, you can check out this blog post, also from enterprisedb.com.

[00:18:37] The next post is "Fuzzy Searching with PostgreSQL", from the dev.to website. It's basically talking about trigrams: using the pg_trgm extension to do fuzzy or similarity searches in your database. It covers just a very basic implementation and how you could use it; it doesn't cover things such as adding indexes, but if you want a good basic description of using the pg_trgm extension, definitely check out this blog post.

[00:19:09] The next piece of content is "Advanced SQL and database books and resources", from neilwithdata.com. We had previously talked about other books that were presented as educational resources, and here he goes through several different online courses and books to boost your knowledge of SQL. Again, this is not necessarily PostgreSQL-specific, but a number of his resources do cover PostgreSQL. The next post is "Build PostgreSQL and Extensions on Windows", and it tells you how you can build Postgres along with extensions on Windows. So if you have a need to do that, check out this post from highgo.ca. And the last post is that the PostgreSQL Person of the Week is Fabrízio de Royes Mello. Forgive me if I pronounced your name incorrectly. So if you want to learn more about Fabrízio and his contributions to PostgreSQL, definitely check out this blog post.

[00:20:04] That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.
