Fast Random Rows, Fast Distinct Queries, Temp & Unlogged Tables, Terminating Connections | Scaling Postgres 164

Episode 164 | May 09, 2021 | 00:17:29

Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, we discuss selecting random rows quickly, faster distinct queries, how replication handles temp and unlogged tables and how to terminate connections.

To get the show notes as well as get notified of new episodes, visit: 

https://www.scalingpostgres.com/episodes/164-fast-random-rows-fast-distinct-queries-temp-unlogged-tables-terminating-connections/

Episode Transcript

[00:00:00] In this episode of Scaling Postgres, we talk about fast random rows, fast distinct queries, temp and unlogged tables, and terminating connections. I'm Creston Jamison, and this is Scaling Postgres episode 164.

[00:00:24] Alright, I hope you, your friends, family and coworkers continue to do well. Our first piece of content is "Getting random rows faster. Very much faster." This is from Magnus Hagander's blog, and he's talking about ways you can get fast random rows from a table. One simple way to do it is to ORDER BY the random() function and do a LIMIT of one, and that should get you what you need, a random row. However, this is probably the slowest way. He also mentioned that Jonathan Katz posted a different way to do it on Twitter that's faster but more complicated. But then he presents his own idea for being able to get fast random rows, and that's using the TABLESAMPLE capability of Postgres. Now, normally this returns a portion of a table, such as, he says, 10%. However, you can add an extension called tsm_system_rows, and that allows you to return a specific number of rows, and it uses the syntax that looks like this. And then he has some speed indications here. Using his example with 5 million rows, the ORDER BY random() approach returns in about 2 seconds and Jonathan Katz's method returns in about 1 second, but the TABLESAMPLE method using this extension returns in 0.5 milliseconds. So, super fast. And then looking at it with 10 million rows, essentially it doubles the time, or at least close to it, for both of the first two examples, but it's still about that same speed using TABLESAMPLE with this extension. Now, he asks: how does this magic work? Basically, it chooses a random disk block of the table. As a consequence, if you're pulling more than one row, it won't be that random. So for example, looking at the slow method here, when you're pulling with random() you do get three random rows, whereas if you use TABLESAMPLE and choose three rows, it chooses one random disk block but then reads what's available in that disk block. So you get a random start, but you get contiguous rows from that point. His point, though, is that the performance gain using this is so huge you could probably get around that and call it more than once, so that you can be sure to get a random row for however many you select. And he has a CTE here that he developed that kind of gets you close to doing that. I thought this was an interesting post, and it gives you a great way to pull out a random sample of rows from even huge tables.

The next piece of content is "How we made DISTINCT queries up to 8,000x faster on PostgreSQL". This is from Timescale.com, and they're talking about their TimescaleDB extension for Postgres and how they were looking to make improvements on DISTINCT queries. Now, their focus is on time series data, so they're probably looking at a number of devices that are returning readings over a period of time, and that's the example they're using here. They talk about why DISTINCT queries are slow: basically, when it scans an index, Postgres needs to go through every item in that index to look for the unique values. However, there is a concept in other databases called a skip scan. So basically it skips over redundant values. So if you have 10 million data points from a device and you're pulling data from that table, it can go to each device as opposed to having to look through those 10 million entries in an index.
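Going back to the first article for a moment, here is a minimal sketch of the TABLESAMPLE approach described there; the table name, column names and row count are assumptions for illustration, not taken from the post.

```sql
-- Install the sampling method that returns a fixed number of rows
CREATE EXTENSION IF NOT EXISTS tsm_system_rows;

-- Hypothetical table with a few million rows for illustration
CREATE TABLE users (id bigserial PRIMARY KEY, name text);
INSERT INTO users (name)
SELECT 'user_' || n FROM generate_series(1, 5000000) AS n;

-- Slow: sorts every row just to pick one
SELECT * FROM users ORDER BY random() LIMIT 1;

-- Fast: jumps to a random disk block and returns one row from it
SELECT * FROM users TABLESAMPLE SYSTEM_ROWS(1);
```

As noted above, asking SYSTEM_ROWS for several rows returns contiguous rows from a single chosen block, so for multiple truly random rows you would call it once per row, for example from a CTE or a LATERAL join.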
Now, they did say that it looks like this skip scan capability may be making it into PostgreSQL 15, which comes in the fall of 2022. But if this is something that is important for you now, it is available in the TimescaleDB extension, and that's pretty much what this post talks about. The other thing to keep in mind is that you can use a recursive CTE to get similar performance. It may not be as fast, of course, but essentially it is a CTE that looks for each distinct item in the index, and this is the CTE that will do it for you and give you something equivalent to a skip scan. So if you want to make your DISTINCT queries fast, you could implement this solution as opposed to using this extension or waiting for Postgres 15. So basically, as I mentioned, the skip scan skips over the repeating values, looking for the unique values in the index. Then they have some benchmarks here that they developed. And of course, it's usually going to be faster the fewer devices you have, or the lower the cardinality, because there are fewer entries in that index it needs to search through versus having a lot of values. So this is a scenario where you have devices that are collecting data, each with a certain number of rows, and they tested five different queries looking for specific things, which I'll mention in a second. But you can see the ratio of improvement they're seeing is from 26 times up to over 8,000 times. The scenarios they looked at are: what is the last reported time of each device in a paged list, what is the time and most recently reported set of values for each device in a paged list, what was the most recent point for all reporting devices in the last five minutes, which devices reported at some time today but not within the last hour, and what devices reported yesterday but not in the last 24 hours. So each of these examples gives you a performance rating based upon the number of devices. Now, the thing to keep in mind is that with this technique there are a few requirements. Number one, the query must use the DISTINCT keyword on a single column. So it has to be a single column, and that column must be first in the index, followed by any other ORDER BY columns. For example, looking at this query here, it's DISTINCT ON tags_id, and that index needs to have tags_id followed by time to give you the highest level of performance. The other thing to keep in mind is that the index needs to be a B-tree index, and it of course needs to match the ORDER BY in your query. So even though this is mentioning a feature in TimescaleDB, they do give you some insight into what's coming with Postgres 15 in this area, as well as a recursive CTE you could use in lieu of it to get probably similar performance. So if you're interested in that, definitely check out this blog post.

[00:06:51] The next piece of content is "To WAL or not to WAL? When unlogged becomes logged". This is from fluca1978.github.io, and he's talking about the difference between logged and unlogged tables and temporary tables, and asking the question: what gets passed over to a replica? So we have a primary database that is doing streaming replication to a replica. What shows up on there? In terms of his testing, he created a normal table, an unlogged table and a temporary table, and he inserted a number of rows into each. And you can see in terms of the insertion time that the normal table took about 4.7 seconds and the other two took about 1.7 to 1.8 seconds. Why?
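Before getting to the answer, here is a minimal sketch of the kind of comparison he ran; the table names and the row count are assumptions rather than the exact values from the post.

```sql
-- Regular table: every inserted row is written to the WAL
CREATE TABLE t_logged (id int, payload text);

-- Unlogged table: data is not written to the WAL
CREATE UNLOGGED TABLE t_unlogged (id int, payload text);

-- Temporary table: session-local and also not written to the WAL
CREATE TEMPORARY TABLE t_temp (id int, payload text);

-- Compare the insert timings (e.g. with \timing turned on in psql)
INSERT INTO t_logged   SELECT n, md5(n::text) FROM generate_series(1, 1000000) AS n;
INSERT INTO t_unlogged SELECT n, md5(n::text) FROM generate_series(1, 1000000) AS n;
INSERT INTO t_temp     SELECT n, md5(n::text) FROM generate_series(1, 1000000) AS n;
```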
Because these are not logged to the WAL. And the thing to keep in mind is that the WAL is the mechanism used to transfer information to replicas. It's how they keep in sync: essentially, it's streaming the WAL to the replica. Now, he set up replication and wanted to look at what tables exist on the replica. Looking on the replica, you'll see that there is no temporary table. Temporary tables only exist on the primary, and they only exist for the session that you're in. So if you close out that session, that temporary table is released. Nothing is included in the WAL, which gives you great performance, but nothing will be passed over to the replica, which makes sense. Similarly, for an unlogged table, nothing's written to the WAL, but it's interesting that it actually shows up on the replica. What shows up is the existence of the table, but it doesn't have any data. So you can see when he does a select count on the replica from the unlogged table, it says "ERROR: cannot access temporary or unlogged relations during recovery". So you're not going to be able to access that unlogged table, but you can see it. The data definitions of that table are passed over to the replica, but not the actual data itself. So the system catalogs are copied over, but not the data contents of unlogged tables. Then he goes through the process of actually switching a table from logged to unlogged and back again and seeing the performance. Basically it takes quite a long time to do, similar to a fresh insert, because it has to put all of this information in the WAL so the replica can capture it. So I thought this was a pretty interesting blog post. If you want to learn more, definitely check out this post.

The next piece of content is "Terminating database connections in PostgreSQL". This is from Cybertec-postgresql.com. He's talking about the two different ways you can cancel queries. The first one is pg_cancel_backend(), where you give it a PID, or pg_terminate_backend(), where you also give it a PID. The cancel function cancels whatever query is running but not the connection, while pg_terminate_backend() actually terminates the query and the whole connection. Now, the first thing to be aware of is: what do you want to kill? How do you get the PID of a running query? And the answer to that is using pg_stat_activity. He mentions a few things you're going to want to do here. You're going to want to narrow it, usually by the datname field, which is the database name, so you pick the right database. You want to look in the query column for the query that you want to cancel. You also need to be aware of the state, so you only kill something if it's actively running; usually if the state isn't active, the query column just shows the last query that connection ran. And the final thing to be aware of, because you can have parallel queries, is that you want to look for the leader PID. So you want to cancel that PID as opposed to one of the parallel backends. And then to do it, you just do SELECT pg_cancel_backend() and give it the PID to be canceled. Terminate works the same way: you can identify the PIDs in a similar way and just run pg_terminate_backend() to terminate the backend. He closes out the post talking about what if you want to kick out every single user who is not you. Well, you can do SELECT pg_terminate_backend() and pass it the PID, but here you're actually reading the PIDs from the pg_stat_activity view and filtering to where the PID is not your PID, and pg_backend_pid() gives you your own backend PID.
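Here is a minimal sketch of those functions and of that "kick out everyone but me" query as described in the transcript; the example PID is a placeholder, and the exact filter set is an assumption (the leader_pid column requires Postgres 13 or later).

```sql
-- Cancel only the running query; the connection stays open
SELECT pg_cancel_backend(12345);      -- 12345 is a placeholder PID

-- Terminate the query and the whole connection
SELECT pg_terminate_backend(12345);

-- Terminate every other client connection: skip your own backend,
-- skip processes with no database (background processes), and skip
-- parallel workers by only targeting leader/non-parallel backends
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE pid <> pg_backend_pid()
  AND datname IS NOT NULL
  AND leader_pid IS NULL;
```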
So it's basically: kill everything but yours; also ignore those connections that don't have a database name, because those are usually Postgres processes like the background writer he mentions and other required processes that aren't dedicated to a database; and don't worry about the parallel children, just use the leader PID. So this is an interesting blog post showing you how to cancel queries.

[00:11:18] The next piece of content is "Creating PostgreSQL test data with SQL, PL/pgSQL, and Python". This is from Tangramvision.com, and they're talking all about generating test data. They talk about doing it in a Docker container that they've provided so you can follow along. The post goes through first using raw SQL, with functions such as generate_series() and random() to generate random data, including using some of the built-in dictionaries of Postgres to choose random words for test data. And he has a full set of different SQL queries you can run to generate different types of data for the data set he's chosen. He also notes that you can create functions within the database so you can just call those functions as opposed to running these scripts every time. Then the post moves on to PL/pgSQL. This is of course similar to the SQL commands, but now you have the capability of using conditionals, so if/else, and you can also do looping within procedures, loop over rows from a query, or raise information. So he goes into producing a procedure for this purpose. Then he follows up with Python, and again this is Python within Postgres. So you can create the plpython3u extension, for example, to be able to write Python directly in Postgres to generate your test data. And finally they cover the Faker package, which is available in Python. There's also a Faker package in Ruby as well, so depending on your language you may be able to find something similar. It's a pretty comprehensive blog post on how you can generate test data using a variety of techniques in Postgres. So if you're interested in that, check out this blog post.

[00:13:06] The next piece of content is "Jsonb multi-column type casting". This is from Momjian.us, and he's talking about typecasting JSON you're returning from the database. A lot of times you would cast it value by value: this value is text, this value is an integer. He says there's also a function called jsonb_to_record(), where you can give it a whole row of data and specify how it should be converted on output. So, for example, make a text and b an integer, and you can see that he's outputting the types here. The other benefit is that it then exposes these essentially as columns you can query, so you can say where b is less than or equal to four, or b is greater than four. So perhaps this is a new way you may want to try for multi-column typecasting.

[00:13:55] The next piece of content is "Jsonpath: ** operator and lax mode aren't meant to be together". This is from Alexander Korotkov's blog. The ** operator, or the double asterisk operator as he mentions here, explores arbitrary depth, finding the values you're searching for essentially everywhere within that JSON document, while lax mode is kind of a relaxed way of working with the JSON, and basically you can get some unexpected results from combining them. So based upon how these work, he's suggesting using strict mode when using this operator, and if you do that, you get the value you expect.
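Going back to the jsonb_to_record() idea from a moment ago, here is a minimal sketch using the column names from that example (a as text, b as an integer); the input document itself is made up for illustration.

```sql
-- Casting each value individually
SELECT (doc ->> 'a')::text    AS a,
       (doc ->> 'b')::integer AS b
FROM (SELECT '{"a": "abc", "b": 4}'::jsonb AS doc) AS s;

-- Casting the whole row at once with jsonb_to_record(),
-- which also gives you typed columns you can filter on
SELECT a, b
FROM jsonb_to_record('{"a": "abc", "b": 4}'::jsonb) AS x(a text, b integer)
WHERE b <= 4;
```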
So the jsonpath one is a very simple post, but you can check it out if you want to learn more.

[00:14:40] The next piece of content is "Using Kubernetes? Chances are you need a database". This is from Crunchydata.com, and they're basically covering how you can use Kubernetes to set up a Postgres database system. The post is pretty short, but it has a lot of links to other content about getting Postgres up and running in a Kubernetes environment. So if you're interested in that, you can check out this blog post.

[00:15:04] The next piece of content is "The many spatial indexes of PostGIS". This is from Crunchydata.com, and he's talking about spatial indexes. So if you have a set of points and you need to ask for all the things inside this area, or all the things near this other thing, you can use bounding boxes to find those particular points. The way you generally do that is with a spatial index to help with performance, and he gives an example of creating a GiST index on a particular geometry. But there's more than just this index; there are other indexes you can use. For example, depending on the spatial geometry, whether it's 2D, 3D or 4D, and other than GiST, there's also the somewhat newer SP-GiST, as well as BRIN, the block range index. He goes over and discusses some of the different benefits and performance characteristics of each and looks at how they compare in terms of size. So if you're interested in this, you can definitely check out this blog post.

[00:16:03] The next piece of content is "OpenStreetMap to PostGIS is getting lighter". This is from Rustprooflabs.com, and by lighter they mean they've dropped the memory requirements for loading OpenStreetMap data. And to him, what's even more beneficial is that you can now easily calculate how much memory you will need for doing your analysis. So he tested this out, and it resulted in exactly what he expected: a dramatic savings in memory, comparing the new low-RAM middle to the legacy middle. So if you want to learn more about some of these space savings, definitely check out this blog post.

[00:16:39] The next piece of content is "Pushing maps with GeoServer and PostGIS". This is from Cybertec-postgresql.com. If you want to know how to get GeoServer up and running and presenting data in a map, definitely check out this blog post.

[00:16:53] And the last piece of content: the PostgreSQL Person of the Week is Dave Page. So if you want to learn more about Dave and his contributions to PostgreSQL, definitely check out this blog post.

[00:17:04] That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to Scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.

Other Episodes

Episode 120

June 28, 2020 00:12:12
Postgres 13 Beta 2, System Stats Extension, Tuned, Join Strategy | Scaling Postgres 120

In this episode of Scaling Postgres, we discuss Postgres 13 Beta 2, a system stats extension, tuning your operating systems for Postgres and different...

Episode 200

January 30, 2022 00:12:11
5 Minutes of Postgres, Unnest, SCRAM, Multi-master Review | Scaling Postgres 200

In this episode of Scaling Postgres, we discuss the new series 5 minutes of Postgres, using unnest, switching to SCRAM authentication and reviewing multi-master...

Episode 98

January 26, 2020 00:13:00
Non-Relational Data, Contributors, Security Features, High Availability | Scaling Postgres 98

In this episode of Scaling Postgres, we discuss storing non-relational data, recognizing contributors, Postgres security features and implementing high availability. To get the show...
