Combining Indexes, Regular Expressions, Parallelism, PostGIS | Scaling Postgres 75

Episode 75 August 05, 2019 00:15:40
Scaling Postgres

Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, we discuss combining indexes, regular expressions, parallelism in Postgres and PostGIS content.

To get the show notes as well as get notified of new episodes, visit: 

https://www.scalingpostgres.com/episodes/75-combining-indexes-regular-expressions-parallelism-postgis/

Episode Transcript

[00:00:00] In this episode of Scaling Postgres, we talk about combining indexes, regular expressions, parallelism and PostGIS. I'm Creston Jamison, and this is Scaling Postgres, episode 75. [00:00:20] Alright, I hope you're having a great week. Our first piece of content is "Combined indexes vs. separate indexes in PostgreSQL". This is from cybertec-postgresql.com, and they're basically talking about indexes on single columns versus indexes covering multiple columns. For this discussion, they create a table with three columns, a, b and c, insert a million rows with random values, and then create an index that covers all three columns, so a multicolumn index on (a, b, c). Now, the first thing they start with is something pretty basic: PostgreSQL will rearrange filters for you. So it doesn't matter how you order your WHERE clause; it'll still use the conditions and scan through the index in the proper order, regardless of how you ordered things relative to the index. It even handles transitive properties and breaks them out into what they should be. So even if you write b = a, a = c, c = 10, it'll know to treat this as a = 10, b = 10, c = 10. Also, with a multicolumn index, it can use the first part of the index to do searches. Since a was the first column in this multicolumn index, you can search on a = 10 and it will use the index to scan those rows. And the key thing is, you can use the index in the WHERE clause as well as the ORDER BY clause. So they did a WHERE clause on a and then ordered by b and c. This is a natural fit for the index: it filters by the a column and then orders by b and c, which are the next columns in the multicolumn index.
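The setup the post walks through can be sketched roughly like this (the table name, value ranges, and queries are illustrative, following the blog's a, b, c example):

```sql
-- Roughly the Cybertec setup: three columns, a million random rows,
-- and one composite index covering (a, b, c).
CREATE TABLE t_data (a int, b int, c int);

INSERT INTO t_data
SELECT random() * 1000, random() * 1000, random() * 1000
FROM generate_series(1, 1000000);

CREATE INDEX idx_abc ON t_data (a, b, c);
ANALYZE t_data;

-- The planner rearranges filters, so WHERE-clause order doesn't matter:
EXPLAIN SELECT * FROM t_data WHERE c = 10 AND b = 10 AND a = 10;

-- The leftmost column alone can still use the composite index:
EXPLAIN SELECT * FROM t_data WHERE a = 10;

-- The index can serve both the filter and the sort:
EXPLAIN SELECT * FROM t_data WHERE a = 10 ORDER BY b, c;
```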
But of course, they also discuss cases where a composite or multicolumn index won't work. If you're searching only on the second or third column, it's not going to use the index; you get, as shown here, a parallel sequential scan rather than an index scan. Now, they note that on occasion it may scan through the index, essentially doing a sequential scan of the index, under certain conditions, and they simulated this by setting the sequential page cost relatively high. When they did the scan that way, it does go through the index, but it's also a lot slower: searching on the second column took 63 milliseconds, whereas the next closest example took four milliseconds. So it's definitely a lot slower doing this kind of scan. In other words, you can't really search only on the last or middle columns; the query needs at least the first column, or the first and second columns, of the multicolumn index. Then they go into understanding bitmap index scans. This is where PostgreSQL can combine the results of indexes: in this example it does a bitmap scan where a = 4, another bitmap scan where a equals a different number, and then a BitmapOr to combine the results. So that's also something it can do. And in the last set of examples, they created single-column indexes on all three columns, so no longer the multicolumn index, and ran the same search. PostgreSQL can use those indexes and combine them, but if you'll notice, the planner believes it's more efficient not to use all three indexes at once: it uses what it estimates are the two most efficient indexes, combines them with a BitmapAnd, and then just applies a plain filter for the column whose index it did not use. So the optimizer used the indexes on b and c, but a generic filter on a. And lastly, they follow up with optimizing min and max in SQL queries.
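The bitmap-combination behavior described above can be sketched roughly as follows (reusing the hypothetical t_data table from earlier; the planner's actual choices depend on statistics and selectivity):

```sql
-- Separate single-column indexes instead of one composite index:
CREATE INDEX idx_a ON t_data (a);
CREATE INDEX idx_b ON t_data (b);
CREATE INDEX idx_c ON t_data (c);

-- Two index results can be OR-ed together with a BitmapOr:
EXPLAIN SELECT * FROM t_data WHERE a = 4 OR a = 40;

-- With an AND across columns, the planner may combine only the indexes
-- it estimates to be most efficient (BitmapAnd) and apply the remaining
-- condition as a plain filter:
EXPLAIN SELECT * FROM t_data WHERE a = 10 AND b = 10 AND c = 10;
```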
And this is an additional case: when you're looking for the min or the max of a value, it will also use the indexes. So this post gives a good overview of how indexes basically work, and maybe some of these ideas can help you decide whether you want single or multicolumn indexes as you're developing your tables. [00:04:12] The next post is "PostgreSQL: Regular expressions and pattern matching", and this is from 2ndquadrant.com. It says Postgres supports multiple different types of pattern matching: one is POSIX-style regular expressions, another is SIMILAR TO, which is an operator added in SQL:1999, and there's the SQL LIKE operator. Now, I usually tend to use the LIKE operator, but this particular post talks about POSIX-style regular expressions: you can use the tilde (~) to match a regular expression, case-sensitively; tilde-star (~*) matches a regular expression case-insensitively; exclamation-point tilde (!~) means does not match the regular expression, case-sensitive; and exclamation-point tilde-star (!~*) means does not match, case-insensitive. And it goes through some different examples just to show it. So this is a pretty easy post to look at if you want to start using regular expressions in your queries. [00:05:14] The next post is "Parallelism in PostgreSQL", and this is from percona.com. Actually, the next four posts this week are from percona.com, so they've definitely posted quite a few this week. First they talk about the components of parallelism, mostly the parallelism features that were introduced in 9.6. They cover the concept that PostgreSQL is process based: generally you have one process handling a particular query, because generally you have one process per connection. But with the parallelism features they introduced, you can get additional worker processes. So here you have a gather node that coordinates multiple workers.
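The four POSIX-style operators mentioned in the regex section work roughly like this (the example strings are made up):

```sql
SELECT 'PostgreSQL' ~  'Post.*';   -- true: matches, case-sensitive
SELECT 'PostgreSQL' ~* 'post.*';   -- true: matches, case-insensitive
SELECT 'PostgreSQL' !~ '^SQL';     -- true: does not match, case-sensitive
SELECT 'PostgreSQL' !~* '^sql';    -- true: does not match, case-insensitive
```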
So if PostgreSQL believes that a piece of work can be done in parallel, it will spawn multiple workers that send their tuples up to the gather node in order to parallelize that particular query. And they talk about the different capabilities this enables. The first is parallel sequential scans. If you're scanning a whole table looking for data, it makes total sense to be able to do that in parallel with as many cores as you have. In this first example, I believe they did it with just a single process, and it executed in, looks like, 4,343 seconds, whereas when they did it in parallel using ten workers, it completed in 1,628 seconds. They post a graph here, and what they did note is that there is a diminishing set of returns eventually. Once you go parallel, number one, the speedup is not linear, and number two, at some point the work of coordinating all of the different workers exceeds the benefit of adding another worker. But they do show how this definitely improves performance. Next, they talked about parallel aggregates. This is when you're trying to, say, do a count of the number of rows in a table, and how optimized that can get. Again, you have a gather node working with partial aggregates computed over different parts of the table, leading up to a finalize aggregate step. Now, here they didn't see quite as much of a difference: they got a performance boost of just over 16% doing it in parallel with ten workers, compared to using a single process. So that's not as dramatic, but it does give you a bit more performance. Then they talked about parallel B-tree index scans and how those work, as well as discussing a little bit about parallel bitmap scans and parallel joins. So if you want to learn a bit more about parallelism in PostgreSQL, this is definitely a blog post to check out.
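A minimal way to see this behavior for yourself, assuming a sufficiently large table like the hypothetical t_data above, is to toggle the relevant setting and compare plans:

```sql
-- Allow up to four parallel workers per Gather node (value is illustrative):
SET max_parallel_workers_per_gather = 4;

-- On a large enough table, the plan should show a Gather node with
-- Partial Aggregate workers feeding a Finalize Aggregate:
EXPLAIN SELECT count(*) FROM t_data;

-- Setting it to 0 disables parallel plans, for comparison:
SET max_parallel_workers_per_gather = 0;
EXPLAIN SELECT count(*) FROM t_data;
```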
The thing that I always like to keep in mind is: is your workload more transactional, with a lot of connections each doing small amounts of work? In that case, maybe parallelism isn't that much of a benefit, because you're already asking the server to handle many, many connections doing small queries. However, if you have more of an analytical workload, where you're doing a lot of sums and counts with fewer users, but each query wants to use more cores of the machine, then parallelism is definitely a benefit. So just something to keep in mind if you're looking at optimizing your parallelism settings. [00:08:41] The next post, also from percona.com, is "How to adjust Linux out-of-memory killer settings for PostgreSQL". They go into a lot of detail about the out-of-memory killer and why it exists: basically, you don't want a rogue process or program taking all the memory of a machine. Unfortunately, PostgreSQL is generally the primary thing running on its machine, and it tends to be a candidate for being killed. The recommendation is to set the vm.overcommit_memory kernel parameter to 2 to reduce the probability of this happening. That means that, quote, the kernel is not supposed to over-commit memory greater than the overcommit ratio, and this overcommit_ratio is another kernel setting where you specify the percentage of memory the kernel can over-commit. This is pretty much the safest option recommended for PostgreSQL. So if you want to learn a little bit about the out-of-memory killer in Linux and some of the settings suggested for PostgreSQL, definitely a blog post to check out. [00:09:46] The next post is "PostgreSQL simple C extension development for a novice user (and performance advantages)". Here they're talking about developing a C extension, and they go through the process of how you do it. So if you're interested in developing additional features for PostgreSQL, this is definitely a great post to check out.
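The kernel settings recommended in the out-of-memory item above look roughly like this as a config fragment (the overcommit ratio value is only a placeholder; tune it to your machine's memory):

```
# /etc/sysctl.conf -- forbid over-committing beyond the configured ratio
vm.overcommit_memory = 2
vm.overcommit_ratio = 80

# apply the changes with: sysctl -p
```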
They created a very simple extension called addme that just adds two arguments together, and they did it in C. Then, for comparison, they also wrote a PL/pgSQL function to see what the performance difference was. They ran the timings, and just using the straight SQL plus operator, because you're just adding two parameters together, they got a time of about 0.65; using the C function call they also got about 0.65. However, using the PL/pgSQL function call they got 1.4, so roughly twice as slow, and they graphed it out to show the reduced performance. So creating a C extension definitely gets you more performance if you have custom functions you want to develop. Definitely an interesting blog post to check out if you're looking for a way to squeeze more performance out of PostgreSQL. [00:11:02] The next post, if you're interested in doing more PL/pgSQL development, is "Using plpgsql_check to find compilation errors and profile functions". Basically, this is an extension that allows you to check for compilation errors in function code, find dependencies in functions, and also profile functions. Now, if you use the profiling features, you do need to add this extension to the shared_preload_libraries setting in the postgresql.conf file. They have an example from a previous post where creating a function generated no error, but running it through the plpgsql_check_function command found that a relation the function needed does not exist. So this could be a great benefit if you're doing a lot of PL/pgSQL development, and as they point out, it can also help you find dependencies as well as profile functions. So if you're doing a lot of PL/pgSQL development, definitely a blog post to check out.
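Usage of the plpgsql_check extension looks roughly like this (the function and relation names are made up; the profiler additionally requires the shared_preload_libraries entry mentioned above):

```sql
CREATE EXTENSION plpgsql_check;

-- A function that compiles fine but references a missing relation:
CREATE FUNCTION broken_fn() RETURNS int AS $$
BEGIN
  RETURN (SELECT count(*) FROM relation_that_does_not_exist);
END;
$$ LANGUAGE plpgsql;

-- plpgsql_check_function reports the missing relation
-- without actually executing the function:
SELECT * FROM plpgsql_check_function('broken_fn()');
```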
[00:12:09] The next post is "Postgres 12 highlight: Replication slot copy", and this is from paquier.xyz. They're talking about replication slots, which are used in streaming replication as well as logical replication and logical decoding, and a new feature where you can actually copy a slot: there's pg_copy_logical_replication_slot and pg_copy_physical_replication_slot. One use case: maybe you want to do a backup with a permanent slot and then make a copy of it, which gives you the same restart LSN, so that two replicas or standbys can each get their own slot to follow when doing a restore. [00:12:54] So that's pretty interesting. They also talked about scenarios where you're doing logical decoding or logical replication, and you can create a slot at the LSN of an existing slot to be able to test decoding, maybe to inspect the output you're getting from something like pg_recvlogical. So definitely an interesting feature being added to PostgreSQL 12, and if you want to learn more, definitely check out this blog post. [00:13:20] The next post is "Waiting for PostgreSQL 13 – Add support for --jobs in reindexdb". So they're already talking about version 13, and there's a new --jobs parameter that parallelizes certain work, in this case for reindexdb. They show that when they parallelize with eight jobs, it completes in about 6.8 seconds, whereas without parallelization it completes in 24 seconds. So definitely an improvement. I know 12 hasn't been released yet, but they're already working on version 13. [00:13:59] The last set of posts are all related to PostGIS. The first post, from cleverelephant.ca, is "PostGIS overlays". Basically, if you have a particular overlay you want to compute, this goes through how you can get the combined set shown here. Now again, I'm not an expert on PostGIS, but this post covers how to do that.
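The slot-copy functions from the Postgres 12 item above might be used roughly like this (the slot names and the test_decoding plugin choice are only illustrative):

```sql
-- Create a permanent physical slot, then clone it so two standbys can
-- each follow their own slot from the same restart LSN:
SELECT pg_create_physical_replication_slot('base_slot', true);
SELECT pg_copy_physical_replication_slot('base_slot', 'standby_a_slot');

-- Logical slots can be copied too, e.g. to test decoding from the
-- same position as an existing slot:
SELECT pg_create_logical_replication_slot('logical_base', 'test_decoding');
SELECT pg_copy_logical_replication_slot('logical_base', 'logical_copy');
```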
So if you're interested, definitely a post to check out. The next post is "Simple SQL GIS". This is a case where they had a particular area that included water, and they were able to use additional data about where the water exists, plus some SQL to compute a difference, to create this type of representation of the data. So if you have a use case for this, definitely a blog post to check out. [00:14:57] And the last post is "Waiting for PostGIS 3: ST_AsMVT performance", where they talk about greater performance for this PostGIS function. This is from crunchydata.com. So again, if you're interested in the upcoming PostGIS 3, definitely a blog post to check out. [00:15:17] That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.

Other Episodes

Episode 95

January 06, 2020 00:11:51
max_locks_per_transaction, Logical Replication, Backup & Recovery, Pgbouncer | Scaling Postgres 95

In this episode of Scaling Postgres, we discuss max_locks_per_transaction, logical replication, managing backup & recovery and pgbouncer connection pooling. To get the show notes...

Episode 113

May 11, 2020 00:13:06
arm64 with apt, Contributors, Backup Manifest, Now Functions | Scaling Postgres 113

In this episode of Scaling Postgres, we discuss arm64 package support for apt, annual Postgres contributors, backup manifests & verifications and different now functions....

Episode 69

June 24, 2019 00:15:13
Performance Training, JSON Path, Hypothetical Indexes, Interpolation | Scaling Postgres 69

In this episode of Scaling Postgres, we discuss performance training, SQL/JSON path support, hypothetical indexes and linear interpolation of data. To get the show...
