Cleaning Up, Function Performance, 11 Million IOPS, Change Data Capture | Scaling Postgres 151

Episode 151 February 07, 2021 00:18:10
Cleaning Up, Function Performance, 11 Million IOPS, Change Data Capture | Scaling Postgres 151
Scaling Postgres
Cleaning Up, Function Performance, 11 Million IOPS, Change Data Capture | Scaling Postgres 151

Feb 07 2021 | 00:18:10

/

Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, we discuss cleaning up your database, function performance, 11 million IOPS and change data capture.

To get the show notes as well as get notified of new episodes, visit: 

https://www.scalingpostgres.com/episodes/151-cleaning-up-function-performance-11-million-iops-change-data-capture/

View Full Transcript

Episode Transcript

[00:00:00] In this episode of Scaling Postgres, we talk about cleaning up function performance 11 million I ops and change data capture I'm Creston Jameson and this is Scaling Postgres episode 151. [00:00:22] Alright, I hope you, your friends, family and coworkers continue to do well. Our first piece of content is the unexpected find that freed 20GB of unused index space. And this is from Hakibinita.com and he's talking about they were running out of disk space, so they normally just add some more. But they started looking at where they could potentially clean up certain areas of the database to free up space. And they show this graph here where essentially this is free space over time. So it's a little unusual that the higher it goes, the more free space it has. But basically they did a few operations that brought it up to here, had a little bit more growth and then something additional. They found out that let them go up to this high, which is about 70GB of free disk space. So the first thing they did is they looked at unused indexes and they include the query here. So you could just take this query and run it to look to see if you have any indexes that aren't being scanned. No tuples read or no tuples fetched. And the thing to keep in mind is that this also uses postgres statistics, so you may want to reset them periodically to make sure that you're including the most up to date data that indexes are indeed being used. And it gives a command that you can use to do that here. The next thing they looked at was an index or table Bloat. So how you can address index Bloat is by doing a reindex. And ideally you want to do this reindex concurrently, so you want to make sure you have this capability if you're going to do a reindex, to reindex concurrently, and I believe it's postgres eleven, might be twelve that offers this. So make sure you're on a version that supports that. Failing that, you can also use Pgrepact to do it, but this is the preferred way. Of course, if it's included in there. And again, with recent versions of postgres, you are going to get some benefits in terms of compacting the indexes. And they mentioned in postgres 13, there's a lot of deduplication that can happen if you're indexing duplicate values. Now in the process of doing a reindex, the new indexes are marked with an underscore CC new as they're being created, and if there's any kind of failure that happens, you may need to go in and clean them up. They provide a query for doing that and they took to look at some of the deduplication capabilities in 13 and then they moved on to table Bloat. And how you address table Bloat. One way you can manually recreate the table and switch to using it. So there are techniques, but it's a very manual process to try and do that and you probably have to use triggers to keep things up to date. The other option is to do a vacuum full. The disadvantage is that locks the entire table. Other tool you can use is the PG repack extension to be able to repack tables. So that's another option you can use, but keep in mind it uses more than twice the amount of space of the table that you're working on, potentially including the indexes as well. So that's something to be aware of. Now in terms of the thing that they found to reduce their index size is that they were indexing a column that mostly contained nulls. So for example, a canceling user, there's a very small amount in their database where someone was canceling. So most of the values were null. Well, postgres indexes those, so how they were able to save a lot of space is to a partial index only indexing it where it's not null essentially. And in doing that they said 99% of the dead weight was shaved off the index. So that's a pretty big reduction. It went from 769 megabytes down to five megabytes. And you can also use partial indexes if you say, have a status column that you never query for particular statuses. Well you can just index only the values that you are querying to save space and they even give you a query to look for where you have a column with a lot of nulls that could be candidates for a partial index as they created here. So if you want to learn more about how to help clean up your database, you can check out this blog post. [00:04:12] The next piece of content is cleaning up your postgres database. This is from Crunchydata.com and he's talking about techniques you can use to monitor and then potentially clean up your database. The first thing he suggests looking at is your cache hit ratio, predominantly how often indexes are being hit and also how often tables are being used. And he gives the exact query to check your index hit rate as well as your table hit rate. The next thing he looked at is a PG stat user tables to determine what percentage of the time an index is being used for queries. And he presents it by table. So you could see a small table where the index is not used that much, maybe you don't need to worry about it as much, but a very large table where the index isn't being used a significant amount, that's probably something you want to check out and potentially add some indexes to. The other thing you could do is clean up unused indexes and he has a query here to look for those that aren't being used similarly to the previous post. Now the other thing he mentioned is keep certain things outside postgres. So he's proposing if you have things like messages, logs, events, things of that nature, don't put them in postgres because inevitably you're probably going to take over most of your storage through this. Now, I actually store things like that in postgres, but I tend to use partition tables and then delete the partition tables when they're no longer needed. I prefer doing this rather than run some separate data store, but this is a proposal that he has and the last thing he mentions is to dig into performance with PG stats statements to determine how well your statements are running and if anything needs to be optimized. So if you're interested in this content, you can check out this post from Crunchydata.com. [00:05:48] The next piece of content is Boost your User Defined Functions in PostgreSQL. This is from Angres.com and they're giving some advice with regard to using user defined functions within postgres in your application. Now, they're not necessarily saying you should or shouldn't put logic in the database, but if you do, you should follow these steps. First, they're saying use PL or PGSQL functions for simple SQL statements. So for example, you can put logic such as this section of a case statement and make it a function to call. And when you do it, you want to use SQL to do it because this is basically just SQL as opposed to using PLSQL if you don't need it because the optimizer is able to optimize SQL in queries versus using a psql. So it's pretty much doing the same thing, but the pure SQL is about four times faster compared to the Plfsql function. The second item is unnecessary usage of select into clause. So basically you should use the assignment operator in your functions as opposed to the select into clause because it gives you better performance. And again, it's about a four times better performance in this example they show here. The third is overusing the raise clause. So like any other language, having exceptions and using exception processing is a burden on that functions processing. And here you can see it's about four or five times slower to have that exception processing in there. Four is overusing the high level programming coding style for SQL activities. So basically it's best to express your queries in SQL versus trying to build your own. Like for example in this comparison, using a for loop to get what you could have done with a lateral clause, you can see you improve performance by about twice or two times by using that lateral statement as opposed to using a for loop in a PL SQL statement. The next set of recommendations are with regard to function properties. The first one is use parallel safe whenever possible. So if you can make your function parallel safe here, and mostly in regards to readonly queries that don't access sequences and have a number of restrictions, but if you can do that, it's going to give you some better performance. [00:08:06] Secondly, use Immutable when possible, so if your function is truly immutable. In other words, the same input will always result in the same output. Go ahead and mark that function as Immutable. And the last set of recommendations is to monitor your performance functions using the PGSTAT user functions so you can check on your function's performance. So if you're interested in learning more, you can check out this post. [00:08:29] The next piece of content is achieving 11 million I ops and 66gb/second I O on a single thread ripper workstation. Now, this is not database specific. He doesn't mention databases in terms of analysis at all. But what he is saying is that modern hardware has become super, super high performance. And you may want to consider a very recent, very fast set of hardware versus setting up a cluster of hundreds or thousands of lower powered machines because that may result in better performance. And I actually took a look at this from AWS's perspective, and right now the max I ops provide they offer is about 64,000 I ops. They have a preview where they're offering a 256,000 I ops and you can attach about 30 volumes to one of their metal instances. But you're still with that. Not going to get near this 11 million I ops from this single workstation. So it's an interesting read on what is possible. Now, this is a quite complex blog post that doesn't mention databases, but it's all about the hardware and capabilities and optimization to get around different bottlenecks. But in this era of distributed workloads and seeking to scale out versus scale up, well, here's a post that talks about how far you can scale up, essentially. So if you're interested in that, you can check out this post from Ten Lpoder.com. [00:10:00] The next piece of content is change data capture in postgres with Debesium. This is from Crutchydata.com and essentially this enables you to use your write ahead log to track changes. So you don't have to put any triggers on postgres, but it just reads the Write ahead log stream and sends those data changes off, say, to another database or another way to capture the data that is of interest. So they were saying this is a way you could potentially put logs and messages and things of that nature. If you don't want to put them in postgres, you can send them off to a separate data store. And this is using Debesium to do it. And they go through the process of setting this up. Now they're setting this up with their crunchy bridge, but you can use this information to be able to set up debesium for your postgres instance. So if you're interested, you can check out this blog post. [00:10:49] The next piece of content is PostgreSQL what is a checkpoint? This is from CyberTech Hyphen Postgresql.com, and this walks through the process of how inserts happen. Basically, an insert needs to happen. The first step is actually to write what's being inserted to the wall so it commits it to disk that this is going to happen. It then writes that information to the shared buffers so it's in memory on a periodic basis. A background writer writes it to the actual data files and that is the checkpointing process. It's basically make sure all the dirty buffers get written to the disks and then that checkpoint is written to the wall. And they mentioned two ways of doing the configuration. One is the checkpoint completion target, which by default is 0.5. That means with whatever duration the checkpoint is supposed to happen, it should be completed basically 50% into that time frame. But interesting, you mostly bring this up to around zero nine. And he mentions a comment here. Quote in postgres 14, this parameter will most likely not exist anymore. The hard coded value will be 0.9, which will make it easier for end users. So that leaves two other variables you can adjust. One is the checkpoint timeout. So by default it's five minutes. But normally you make that 20 minutes, maybe 30 minutes, hour at most probably. And then the max wall size determines how much wall you're going to keep. So basically these two parameters will determine how often checkpoints happen. So if they're happening faster than your timeout, you'll generally get a warning, or you should configure that warning if you're having too much wall being generated. So then maybe you'll want to increase your max wall size. But if you want to learn more about checkpoints, you can check out this post. [00:12:28] Next piece of content is actually a YouTube channel. About 20 or so videos have been posted to the EDB YouTube channel and therefore Postgres Build 2020. So there's all sorts of postgres related videos approximately 30 minutes in duration that have been posted here. So if you are interested in video content, definitely check out this YouTube channel. [00:12:50] The next piece of content is Load and query Pihole data from Postgres. This is from Rustprooflabs.com and Pihole is an application that lets you block ads at the network level and it stores information about its operation into a SQLite database. Well, he actually wanted to analyze it. So what he did is he exported the database and then used the SQLite foreign data wrapper to be able to query that data from postgres. He talks about his setup and getting it set up to be able to query the data. And the thing he was noticing he had a little bit of a poor performance. So then he goes into actually loading the data from specific tables he was interested in into postgres. So he used the foreign data wrapper to actually load the data into tables he created in Postgres. So if you're interested in that, you can check out this blog post. [00:13:41] The next piece of content is PostgreSQL data security authentication. This is from Procona.com and it's the second in a series of posts on authentication and they're talking about the Postgres internal authentication. In other words offering trust, reject, MD Five Scram and Cert authentication methods. Trust and reject. Basically, trust come in without any password at all or reject without asking for a password at all. MD five. Does an MD five hash for password authentication? Scram uses Scram authentication and then using certificates. So if you want to learn more about setting up authentication with Postgres, definitely check out this post. [00:14:19] The next piece of content is Postgres and Crypto supply Chain of Integrity. This is from Crunchyday.com and they're talking about how do you get PostgreSQL to use Phips 142 crypto and basically this is the Federal Information Processing standard 142. So if you're interested in that, you can check out this post from Crutchydata.com. [00:14:44] The next piece of content is waiting for PostgreSQL 14 search and Cycle Clauses so these are clauses that are part of the SQL standard that can be used for recursive queries. He's using an example of a directory structure and you can actually use the search keyword to do a search by depth. So basically it's going to show you the data laid out by depth. Or you could search by breadth first, so it gives you the breadth of the directories and then it goes further down. And then he also talks about cycle and that is where you're trying to determine all the different routes between different cities. Say if you're planning a flight, once you load up all the data in and all the possibilities, you can then use a recursive query to find all the different paths that you can go on. And what's great about Cycle is that it actually halts it once a potential duplicate trip is stopped. So if you're interested in learning more about that, you can check out this post from Depes.com. [00:15:43] The next piece of content is how PostgreSQL inserts a new record with the help of Table Access method API and Buffer Manager. This is from Higo, CA. If you want to learn about the internals of PostgreSQL and how it does inserts using the Table Access API, definitely check out this blog post. [00:16:02] The next piece of content using GitOps to self manage Postgres and Kubernetes. This is from Crunchydata.com. So if you want to use GitOps to help manage your PostgreSQL database system using Kubernetes, definitely check out this post. [00:16:17] Next piece of content is Various ways to retrieve PG pool Two statistics. So this is monitoring PG pool two. They present three ways. One is through show SQL commands you can execute from a database connection. The other is using PCP commands which my understanding communicates directly with PG pool Two to give you information back. Or you can use a PG pool adm extension to do so. So if you're interested in learning more about that, you can check out this post from Bping Blogspot.com. [00:16:48] The next piece of content is overlay ng and invalid geometry. This is from Linearthinking Blogspot.com, and he's talking about some improvements that were made to PostGIS three One and GEOS 3.9. In referencing a previous post that we mentioned on Scaling Postgres, that PostGIS was still reporting errors with invalid geometry. And he goes into a little bit of a clarification on why that is and the rationale behind it. So you can check out this post if you're interested in that. [00:17:19] Next piece of content is PG. Timetable asynchronous chain execution. So this is an update to the PG timetable extension for scheduling tasks in Postgres. If you're interested in that, you can check out this blog post from Cyberdeck Postgresql.com. [00:17:33] And the last piece of content is the PostgreSQL Person of the Week is Alexander Sosna. So if you're interested in learning more about Alexander and his contributions to Postgres, you can definitely check out this blog post that does it. For this episode of Scaling Postgres, you can get links to all the content mentioned in the show notes. Be sure to head over to where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube. Ridetunes thanks.

Other Episodes

Episode 215

May 15, 2022 00:17:13
Episode Cover

New Releases, AlloyDB, Time Zones, Connecting to Postgres | Scaling Postgres 215

In this episode of Scaling Postgres, we discuss parallel server-side backup compression, IO in Postgres, parquet files and the new pg_stat_monitor extension. To get...

Listen

Episode 276

July 30, 2023 00:15:57
Episode Cover

BRIN & Correlation, Poor Partitioning, 10 Beginner Tips, Table & Index Usage | Scaling Postgres 276

  In this episode of Scaling Postgres, we discuss the importance of correlation with BRIN indexes, how partitioning can kill performance, 10 tips for beginners...

Listen

Episode 127

August 17, 2020 00:14:37
Episode Cover

Recursive Queries, Lateral Joins, Pagination, JSONB Arrays | Scaling Postgres 127

In this episode of Scaling Postgres, we discuss recursive queries, using lateral joins, paging through a data sets efficiently and working with JSONB arrays....

Listen