Connection Queuing, Connection Pool Size, Simple Auditing, Fuzzy Matching | Scaling Postgres 154

Episode 154 | March 01, 2021 | 00:19:24
Scaling Postgres

Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, we discuss connection queuing, estimating connection pool size, how to do simple auditing, and how to run queries using fuzzy matching.

To get the show notes as well as get notified of new episodes, visit: 

https://www.scalingpostgres.com/episodes/154-connection-queuing-connection-pool-size-simple-auditing-fuzzy-matching/


Episode Transcript

[00:00:00] In this episode of Scaling Postgres, we talk about connection queuing, connection pool size, simple auditing, and fuzzy matching. I'm Creston Jamison, and this is Scaling Postgres episode 154.

[00:00:22] I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is "Connection queuing in pgBouncer: is it a magical remedy?" This is from Percona.com, and it's talking about PgBouncer, and specifically what it can do as a connection pooler: connection queuing. That means if you have a query that needs to run but all of the server connections are being used, it puts those client connections in a queue to be run when a server connection becomes available. Now, of course, most people use PgBouncer so that they can handle more connections without using a lot of memory and resources on the server, but this post talks about how the queuing component is equally beneficial.

If you are not using some sort of connection pooler, you can experience a number of problems. Problem number one: a spike in load can jam and halt the server. The reason is that there's a one-to-one mapping between active PostgreSQL sessions and running processes on the box, so every new connection to Postgres spawns a new process. In addition, with parallel execution you can get many more processes running than sessions exist, so you can suddenly have a ton of different processes on the box as you add more connections. They simulated an example with ten active connections on a two-virtual-CPU server, and the time to connect and run a command went up to 15 seconds, compared to almost nothing with just two active sessions. So that server is struggling to operate efficiently. And they say, quote, "we could see that as the number of active sessions approaches double the number of CPU cores, the performance penalty starts increasing heavily". So that's something to keep in mind as a problem you can experience if you have a lot more active connections than CPU cores.

The second problem is "too many clients already" errors. This is basically when you've run out of connections and try to make a new one, but it returns a fatal message: sorry, too many clients already. The third problem is a big max_connections value and the overhead that it causes. There have been a number of posts, which we've covered in Scaling Postgres and they link here, about how just having a high max_connections defined still consumes resources even when those connections aren't being used. So you don't want to set the max_connections value in your PostgreSQL configuration too high; you want it optimized at a particular level.

Now, a solution that helps with all of this is using a connection pooler like PgBouncer, and what it can also do is assist with connection queuing, which is what this post is about. So for the example where you try to connect and there are too many clients because you've run out of connections, PgBouncer allows you to accept more client connections than exist on the server side. For example, here he's showing PgBouncer's pool statistics, and you can see that only one client connection is active, because there is only one server connection, and four of the client connections are waiting. So this is the queuing he's talking about: they're waiting for their opportunity to use that one existing server connection. That's what PgBouncer can do for you.
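If you want to watch that queuing yourself, here's a minimal sketch of reading the same counters from PgBouncer's admin console. The host, port, and user are assumptions based on common defaults, so adjust them for your setup:

    # Sketch: inspect PgBouncer's pool counters via its admin console.
    # Assumes PgBouncer listens on localhost:6432 and that this user is
    # listed in admin_users or stats_users in pgbouncer.ini.
    import psycopg2

    # The special "pgbouncer" database exposes the admin console.
    conn = psycopg2.connect(dbname="pgbouncer", user="pgbouncer",
                            host="localhost", port=6432)
    conn.autocommit = True  # the admin console does not accept transactions

    with conn.cursor() as cur:
        cur.execute("SHOW POOLS")
        cols = [d[0] for d in cur.description]
        for row in cur.fetchall():
            pool = dict(zip(cols, row))
            # cl_active: clients currently paired with a server connection
            # cl_waiting: clients queued, waiting for a free server connection
            print(pool["database"], pool["user"],
                  "active:", pool["cl_active"],
                  "waiting:", pool["cl_waiting"])
    conn.close()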
If you bump up the number of server connections to two and run those same five clients, you can see that more of them are active and running and fewer are waiting. He then tried an example with 20 direct server connections, not using PgBouncer, and the load average went up to about 17; similarly, connecting to that box over SSH took 17 seconds, and I assume this is the same two-virtual-CPU server. But when they implemented PgBouncer and set the pool size to four on the server side, the load average dropped to 1.73, so about ten times less, and connecting via SSH took half a second. So it was dramatically better in terms of general performance. Now the question becomes: does this actually hurt your throughput, since these client connections have to wait to do work? He did a test of database throughput, and with the queuing it was actually 5% better, not worse. The waiting lets the server run in an optimized state, with PgBouncer handing off new connections when the server is ready, which resulted in better performance even though some waiting is going on. And when he switched to transaction-level pooling, performance got even better. So keep in mind that PgBouncer is not only for lowering your connection count to save memory in Postgres; it can actually give you more throughput by tuning the number of sessions allowed to connect to the database relative to how many CPU cores you have. If you're interested in that, you can check out this post.

[00:05:29] The next piece of content is "Estimating connection pool size with PostgreSQL database statistics". This is from Cybertec-postgresql.com, and he's talking about the new connection statistics added in version 14 of Postgres, mainly to the pg_stat_database view. They cover things like session time (the total time spent by sessions), the total active time of sessions, the time spent idle in transaction, the total cumulative number of sessions, the number abandoned, the number that ended fatally, and the number killed. So this is great information and more statistics to have. But where he finds it useful is in estimating how large you should make max_connections on the database server if you're doing transaction-level connection pooling. So, exactly the problem from the previous post about setting max_connections to an optimal value with your CPU cores in mind and not setting it too large. Here he has a formula where his recommendation is that the number of connections should be less than the maximum of the number of cores and your parallel I/O limit (how many queries you can run at the same time given the I/O storage solution you have), divided by the session busy ratio (how busy your sessions are) multiplied by the average parallelism you're using. Some of these numbers are a little loose; I can imagine it being difficult to really nail down the parallel I/O limit or the average parallelism. But he says you can calculate the session busy ratio using some of these new statistics with the query he shows right here.
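As a rough sketch of how that estimate could be wired together, here's how the calculation might look, following the formula as described above. The exact busy-ratio query is in the post; the ratio expression, core count, I/O limit, and parallelism below are placeholder assumptions:

    # Sketch: estimate a max_connections ceiling from the formula described above.
    # The busy-ratio expression and all constants are assumptions, not the
    # post's exact query or numbers.
    import psycopg2

    NUM_CORES = 8            # assumed hardware
    PARALLEL_IO_LIMIT = 16   # assumed: concurrent queries your storage can sustain
    AVG_PARALLELISM = 2      # assumed: average parallel workers per query

    conn = psycopg2.connect("dbname=app user=postgres")  # assumed DSN
    with conn.cursor() as cur:
        # active_time and session_time are cumulative milliseconds (Postgres 14+).
        cur.execute("""
            SELECT sum(active_time) / nullif(sum(session_time), 0)
              FROM pg_stat_database
             WHERE datname IS NOT NULL
        """)
        row = cur.fetchone()
        session_busy_ratio = float(row[0]) if row[0] else 1.0  # fall back if no data yet
    conn.close()

    limit = max(NUM_CORES, PARALLEL_IO_LIMIT) / (session_busy_ratio * AVG_PARALLELISM)
    print(f"busy ratio {session_busy_ratio:.2f} -> keep connections below ~{limit:.0f}")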
That query gives you the ratio to factor into the formula and get an estimate of the optimum setting for your max_connections. So if you're interested in this, definitely check out this post.

[00:07:21] The next piece of content is "Simply auditing your database changes". This is from mydbanotebook.org, and this post is talking about a technique for auditing statements where you don't have to use triggers, a special extension, or third-party tools; you're basically just using the built-in functionality of Postgres. I found this particularly interesting. Number one, you need to log the statements that you want to track. For example, they recommend setting log_destination to csvlog (and she says we'll see why later), turning on the logging_collector, and setting log_statement to the level you want to log, whether that's ddl, mod, or all. Then they ran pgbench to get some data going into the logs and used a foreign data wrapper: create the file_fdw extension, create the server for the foreign data wrapper, and then create a foreign table whose definition points at the CSV log file being produced. So essentially PostgreSQL logs the statements to a file, and the foreign data wrapper lets you read the log file it's generating from inside the database. Then, for convenience, they add a number of views that let you look at just the DDL or just the DML, basically going into the logs and pulling out particular data so it's easier to view. For example, when querying the DDL view, you could look in the bench database for queries having to do with pgbench_tellers, and you can see all the operations that happened: a drop table, create table, truncate table, alter table, so all the DDL statements. Similarly, with the DML view you can look at a particular database and table and see all the different statements that were run against it. So that's a pretty interesting technique: using the database to query and see what's happening in the logs of the very database you're logging. If you want to learn more about this, definitely check out this post.

[00:09:30] The next piece of content is "Fuzzy name matching in Postgres". This is from Crunchydata.com. To get started, they created a table with 50,000 names using the Fake Name Generator, and they give an example of some of the data that's produced; basically, they want to do fuzzy matching against these names. The first example they came up with, to do a prefix search, just uses the LIKE operator, and they got results in about eleven milliseconds. They tried adding an index, but it still came back in about eleven milliseconds, because LIKE by default is not going to use that index. But if you index the column using text_pattern_ops, then it will use the index, and I believe only for prefix matching. There's also a case issue: you can do case-insensitive searches, but if you do that and want an index, you're going to have to use a functional index, so you define an index with the lower function on it to be able to efficiently pull out those records. Now, getting on to the fuzzy matching, they used the fuzzystrmatch extension, which provides a Levenshtein distance function between strings.
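For reference, here's a minimal sketch of what a Levenshtein filter with that extension can look like. The people table, surname column, and connection string are hypothetical, not the post's actual schema:

    # Sketch: a Levenshtein-distance filter using the fuzzystrmatch extension.
    # Table, column, and DSN are made up for illustration.
    import psycopg2

    conn = psycopg2.connect("dbname=app user=postgres")  # assumed DSN
    with conn.cursor() as cur:
        cur.execute("CREATE EXTENSION IF NOT EXISTS fuzzystrmatch")
        # Match names within one edit (insertion, deletion, or substitution)
        # of the search term. Note this has to examine every row, so on its
        # own it is not index-friendly.
        cur.execute(
            "SELECT surname FROM people WHERE levenshtein(surname, %s) <= 1",
            ("Harrington",),
        )
        for (surname,) in cur.fetchall():
            print(surname)
    conn.commit()
    conn.close()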
So they did a query using this levenshtein function to pull out records where one character was off from the search term, but the problem was that it took over 100 milliseconds, so pretty darn slow. A solution to that is using the soundex function. This algorithm reduces a word to a phonetic code, so, for example, even though each of these Harrington variants is spelled differently, they all return the same phonetic code. And out of 50,000 records, doing a soundex check against Harrington only returned 46 records, so it greatly reduces the number of records you have to consider. To handle this, you create a functional index using the soundex function against the name, and when you run the same query with this new index in place, of course adding the soundex condition, it returns data in just one millisecond. So, very efficient fuzzy matching. If you need to support fuzzy match searching, you may want to check out this post from Crunchydata.com.

[00:11:47] The next piece of content is "PostgreSQL: how to write a trigger". This is a basic post from Cybertec-postgresql.com that shows you how to set up a trigger. They create a table for tracking temperatures that has a few different columns. The next thing you need to do is define a function and then create a trigger that uses that function. Before creating the function, they mention that triggers can fire either before or after: you can have as many BEFORE triggers as you want running before the actual insert or update, and as many AFTER triggers as you want running afterwards, and they execute in a defined order based on the trigger names, so you want to name them appropriately so your triggers happen in a particular order. As mentioned, the first step is to create the function. They create a function called f_temp, probably for "function temp", and basically, if the new value is less than zero, it sets that value to negative one. So that's the basic function: it changes the value of the row being inserted if it's less than zero.

[00:12:51] Then you create the trigger: you specify BEFORE whatever operation you want, whether before update, before delete, or before insert (here it's before insert), and for each row execute the procedure that was defined above, the f_temp procedure. With this trigger in place, he did some tests and the results came out as expected: when trying to insert negative 196, it returned negative one; otherwise it returned what was inserted. Now, they use the keyword NEW here, which refers to the new row being inserted, but there is also OLD, and which keyword is available depends on what operation is happening: NEW exists for inserts and updates, OLD is present for updates and also deletes, and neither is present for truncates. In addition to these keywords, there are some other variables you can use, such as TG_RELID, TG_TABLE_SCHEMA, TG_TABLE_NAME, TG_RELNAME, and TG_OP, which tells you whether the trigger fired because of an insert, delete, update, or truncate, as well as other options you can use in your triggers. So if you want to learn more about using triggers with Postgres, definitely check out this post.

The next piece of content covers logical replication: its advantages and a step-by-step setup. This is from Highgo.ca, and they're talking about the reasons why you would want to set up logical replication. They mention a few here.
One is sending incremental changes in a single database. Two is consolidating multiple databases into a single one. Third is replicating between different major versions of Postgres, like when you want to migrate data or do an upgrade. And then there's sharing a subset of the database between various databases. So, a lot of use cases for moving data around. Then they go through the process of a basic logical replication setup, where you configure one server to be the publisher and another to be the subscriber, create the publication on the publisher, and create the subscription on the subscriber in order to start syncing data over. So if you're interested in that, you can check out this post from Highgo.ca.

[00:15:03] The next piece of content is "Faster data migrations in Postgres". This is from Citusdata.com. He's talking about fast ways to migrate data from one database to another, primarily from Postgres to Postgres. He says the number one tool to do that is pg_dump and pg_restore, because they let you back things up logically, and you can do individual tables or individual schemas to transfer data from one database to another. One of the keys to doing it efficiently is using the --jobs option to specify the number of processes that can run in parallel for the export and then the import. But he says if you have very small tables that may not benefit from parallelism, you can pipe a pg_dump straight into a pg_restore, and that can transfer data pretty quickly. I actually hadn't seen this technique, so that was interesting if you want to check out that method. But if you have a lot of large tables, another solution they have is parallel exporting and loading. There's a Python script he mentions (and there may be a version in another language) called Parallel Loader. Basically, it can take one large table and use multiple threads to execute COPY commands in parallel, so it's using PostgreSQL's COPY to actually transfer the data. Using this parallel method, a migration went from more than one day with pg_dump and restore down to 7 hours and 45 minutes. So it's definitely a lot faster to do it in parallel, because otherwise you only have one process working on a table at a time. With a large table, being able to break it up and run the process in parallel gives you a lot of performance. So if you're interested in learning more, definitely check out this post.

[00:16:56] The next piece of content is a first review of partitioning OpenStreetMap data. This is from Rustproof Labs, and it goes over his review of whether he should partition the OpenStreetMap data that he works with and imports into Postgres. He was using a lot of schemas for this data management, but he's wondering whether he could partition it and make data management easier.

[00:17:19] Now, he covers some of the issues there, but the next post, round two of partitioning OpenStreetMap, describes how he ultimately did it, and here are some highlights from his results. The bulk imports that he does on a regular basis now generate 17% less WAL, which is good, and the bulk deletes generate almost 100% less WAL, because deleting the data no longer requires an actual DELETE: you just drop the partition.
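To make that "drop instead of delete" point concrete, here's a minimal sketch of retiring old data by detaching and dropping a partition. The table and partition names are made up for illustration, not taken from his posts:

    # Sketch: with declarative partitioning, bulk removal of old data becomes a
    # metadata operation instead of a large, WAL-heavy DELETE.
    # Table and partition names below are hypothetical.
    import psycopg2

    conn = psycopg2.connect("dbname=osm user=postgres")  # assumed DSN
    with conn.cursor() as cur:
        # Detach first if you want to archive or inspect the old data...
        cur.execute("ALTER TABLE road_lines DETACH PARTITION road_lines_2020")
        # ...then drop it outright once it is no longer needed.
        cur.execute("DROP TABLE road_lines_2020")
    conn.commit()
    conn.close()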
[00:17:47] They also mention that a simple aggregate query runs 75% faster, although not all queries are running faster; I believe he said that, quote, partitioned tables are generally as fast, if not faster. So it looks like the speed is as good or a little bit better overall, and not everything gets that blazing 75% improvement; I believe that was one query he mentioned. But he goes through a lot of his testing process and his thoughts about the best way to optimize this, and that is really how you have to handle partitioning, because it is a burden to manage: you want to make sure the benefits outweigh the costs. So this has been a great series of, I believe, four posts that look at the decision process of whether you should partition or not. If you're interested in that, check out this blog post.

[00:18:44] And the last piece of content: the PostgreSQL Person of the Week is Takayuki Tsunakawa. My apologies if that pronunciation was incorrect, but if you want to learn more about his contributions to Postgres, definitely check out this blog post.

[00:19:00] That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to Scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.

Other Episodes

Episode 289

November 05, 2023 00:15:19

Postgres Scheduling Opportunities | Scaling Postgres 289

In this episode of Scaling Postgres, we discuss new options for Postgres scheduling tools, proper ways to kill processes and explain usage. To get...

Listen

Episode 209

April 03, 2022 00:14:14

Tidy Vacuum, Dropping Roles, Merge Command, PgBouncer Tutorial | Scaling Postgres 209

In this episode of Scaling Postgres, we discuss how to optimize vacuum, how to drop roles, the new merge command in PG15 and a...

Listen

Episode 172

July 04, 2021 00:12:39

Per-Operation Explain, PG14 Contributions, Foreign Parallel Execution, Incremental View Maintenance | Scaling Postgres 172

In this episode of Scaling Postgres, we discuss tracking per-operation explain analyze, Postgres 14 contribution stats, parallel execution on foreign servers and incremental view...

Listen