DBMS of the Year, Better Data Migrations, Idle Connection Impact, Security Implementation Guide | Scaling Postgres 147

Episode 147 January 10, 2021 00:20:28
DBMS of the Year, Better Data Migrations, Idle Connection Impact, Security Implementation Guide | Scaling Postgres 147
Scaling Postgres
DBMS of the Year, Better Data Migrations, Idle Connection Impact, Security Implementation Guide | Scaling Postgres 147

Jan 10 2021 | 00:20:28

/

Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, we discuss PostgreSQL as the DBMS of the year, running better data migrations, the impact of idle connections and a security implementation guild.

To get the show notes as well as get notified of new episodes, visit: 

https://www.scalingpostgres.com/episodes/147-dbms-of-the-year-better-data-migrations-idle-connection-impact-security-implementation-guide/

View Full Transcript

Episode Transcript

[00:00:00] In this episode of Scaling Postgres, we talk about DBMS of the Year better Data Migrations, Idle Connection, Impact and Security implementation Guide I'm Kristen Jameson, and this is scaling postgres episode 147. You all right? Hope you, your family, friends and coworkers continue to do well. Our first piece of content is PostgreSQL is the DBMS of the Year 2020. This is from Dbengins.com, and they have ranked it based upon its gain in popularity for the last year. And they are defining popularity according to their methodology, which is basically about how many people were communicating about it. So, for example, their list here, job offers, professional profile entries and citations on the web. So this is actually the third time that PostgreSQL has been nominated to be DBMS of the Year. So that's another great achievement for Postgres. So if you're interested in learning more, you can check out this site. [00:01:08] The next piece of content is stop worrying about PostgreSQL locks in your Rails migrations. Now, this is talking about Rails migrations and it does cover some Ruby and Rails code, but the information in this post is valid for anyone using Postgres. And if you use Django or Python or some other language, you can find important things to do and be wary of if you are working with a large database. So they have a database that handles, they say, up to 15,000 requests per second. And they have, I believe, over 100 full stack developers working on it. So it seems to be a pretty large application. And they have refined their procedures to do migrations, which are essentially DDL changes to their database safely. So the first thing that they talk about you want to be aware of is adding an index. So if you just add an index and they are using a Rails migration. So this is Ruby code, but it basically just adds an index. And of course, this is a locking operation until that index completes. What you want to do is create that index concurrently. Now, you can do that using the Ruby syntax by specifying the algorithm here concurrently when you add it. So you definitely want to do that. However, the other thing that they apparently have set for their application is a statement timeout for the user that's running these migrations. So to avoid that index operation being timed out, they actually first, as part of their migration, they actually store what the statement timeout is into a variable. Then they set the statement timeout to zero. So basically no statement timeout for this session that they're in. They add the index and then they reapply the old value to the statement timeout. Now, one thing that they also do have to do with this is that they have to disable the DDL transaction wrapper that Ruby provides. So it basically avoids that in order to run this concurrent index migration. So, a few steps, but the main one here that is different than just creating your indexes concurrently is altering the statement timeout so that the index will run. So that's something they have to do in their environment. The next area they talk about is adding a foreign key and with the Ruby DSL you can just do add foreign key table and then the reference. Now again, this would prevent some writing on both tables. So one way that they could do it is set a statement timeout to avoid that locking. So here they set the statement timeout to 5 seconds, try adding the foreign key and then if that's successful, then revert the statement timeout to what it was before. [00:03:45] However, the better way to do it that they will mention here is that you add this constraint without validation. So basically any rows moving forward are validated, but not preexisting rows. And then as a second step that is not going to block operations, you validate that constraint. So here's the process that they go to. They go ahead and save the old value of the statement timeout, they set the statement timeout to 5 seconds to keep it relatively short. They go ahead and add the foreign key and they set the validation to false. So that operation should run relatively quickly. Again, less than that, 5 seconds. Once it's successful, you can then set the statement timeout to 0 second or turn it off. Then you can do this command which validates the foreign key for this table so it goes through and checks all the values to ensure that it's valid. And then as a final step, you revert the statement timeout to what it was before. So it's a little complex, but you can see the logic in doing it. First you create the foreign key in an invalid state and then you validate it later. And the next area they're talking about is adding a column. So you can add a column like this and normally that happens very quickly, but where you can run into problems is with a lock queue because it still needs to get an access exclusive lock on that table. But if there's other locks going on, what can happen is this gets put in a lock queue and then things behind it have to wait. And they give a graphical representation of this, where basically, operations that require a lock have to wait until this add column query is completed before they can start processing. So everything starts getting backed up in the queue. So the way to avoid this is having a lock timeout. So basically you say if it gets timed out due to lock waiting, it'll go ahead and cancel that operation, it won't move forward with adding that column. Now of course, if that happens, you need a process to retry it to make sure it's successful. And they don't show that in this particular example here. They just set the lock timeout to a short value and then revert it. Now they also mentioned there's a way to add a reference, which is basically a combination of the things that were shown. So adding a new column, adding an index, adding the reference, so multiple ways of doing it that can cause locking. And basically this is a way to get around this issue. Instead of using this very simple DSL to do it, you need to do one step to add the column, one to add the index and then add your foreign key. Now, it would be great if Rails offered a way to do these very safe steps to do updates to the database. Unfortunately, they don't. Now, there may be other orms that offer a more sophisticated way to do it in this safe manner. And there have been a number of libraries created for Ruby for doing safer migrations. Well, they've actually written another library or a gem called Safepg Migrations that basically take that standard DSL, simple syntax and do all of these things behind the scenes. So this post is definitely beneficial for those who are using Rails or Ruby to create safer migrations. However, even if you're not following these steps and these processes and thinking through it can help lead you to creating better DDL statements for migrating your database as you're working with it. If you're interested in that, you can check out this post that's on Medium. [00:07:15] The next piece of content is resources consumed by idle PostgreSQL connections. This is from the AWS blog and they're talking about connections that really aren't doing anything and the resources that they use. Now, this is the first part of a post and the second one goes into some of the performance impact. But this one does a very test driven approach, an empirical approach for measuring the impact of these idle connections. So first they do a test of connections with no activity. So basically they open 100 connections and just leave those idle connections open for ten minutes and see what the impact is. And they also try doing a discard all statement to try and free up any resources to see if that frees up anything. And unfortunately, it really doesn't in any of the tests that he's seen here. So you can see the number of database connections going from zero to 100 as tested, or maybe around 102 as tested. You can see the freeable memory drops from about 5.27 to 5.12gb, so that's about 150 megabytes. So each connection is using about 1.5 megabytes. The next test, they actually did some work with temporary tables, so they created one and dropped one, and then essentially leaving those connections idle for ten minutes to see what the impact was. And here you can see the number of connections going up and down. And then this is the second phase with the discard all. And with this one you can see that the memory dropped from 5.26gb to 4.22gb. So it definitely uses more memory and those idle connections basically don't give anything back to the memory until those connections have been shut off. So those idle connections are still using memory resources. Next, he took a look at some different select queries. So trying doing select queries and then waiting for ten minutes for that idle state to see what happens. And again here you can see that the memory does drop and it is only released back once those connections are dropped for this ten minutes that they are idle, they're still using these resources. They are not released back to the operating system. For example, then you combine both a temporary table and the select queries and you see the same impact where both activities were done. It drops the memory down and they are only released once those connections are dropped. A discard statement still does not free them up until the point that those connections are stopped. And he says that you even get a CPU impact from these idle connections. So here's an example where he compared different test runs on CPU utilization compared to how many connections. So for example, you can see here it goes up, I don't know if it's maybe a percent for 100 connections, 500 connections, it gets up to maybe a 3% utilization with really nothing going on in terms of those connections, it gets up to about 5% utilization. At 1000 connections, it gets up to about six or 7% at 1500 connections. And then at 2000 connections, it gets around 8% utilization. So definitely even connections just sitting around really doing nothing still uses CPU resources of the system. Now, the next post talks about the performance impact of this. So he looks at measuring transactions. So he tested with 100 clients and then a test with 1000 idle connections. And just using a standard PG bench test, you could see that adding 1000 idle clients drops the transaction throughput from 1249 to 1140 TPS, which is about an 8.7% drop. He then said you'll probably get more of a performance impact looking at select only workloads. So we did a select only PG bench again, looked at 100 client run and then added 1000 idle connections. In addition to that 100 clients in here, he saw an 18.2% drop in TPS performance. So that starts becoming really significant. Then he looked at a custom PG bench workload where he did very large queries again did the 100 clients and then the same 100 clients again, but adding 1000 idle connections to the database at the same time. When testing that, he saw a 46% performance drop from TPS. The transaction per second went from 378 to 206. So that's a huge performance drop from adding 1000 idle connections. And then he gets down to the main point is that how you get around this is by using a connection pooler. So he tested using PG bouncer in this case to see what impact it had. So he configured a PG bouncer to allow 5000 client connections and then a maximum of 200 server connections. So 200 connections to the database, but up to 5000 application clients could connect. And then he did a PG bunch run of 100. So you can see that there's 100 clients and 100 servers to get a baseline. And then he opened up 1000 connections and left them idle. Now, just adding those thousand idle connections, it's not really using any server connections. And when he ran that 100 PG bench, again, he got the exact same performance. So using PG bouncer, adding 1000 idle connections really didn't have an impact on the performance. It's the exact same transactions per second. So, for example, you can see when he was doing that 100 PG bench run with 1000 idle connections, you could see the thousand idle connections here under the client active, and 100 of them are the ones running PG bench, and they're using up the 100 active server connections. And again, that performance was identical whether you had 1000 idle connections or not. So using a connection pooler definitely helps with your performance when you have a lot of idle connections. Then he tried an interesting thing is that the system he was using only had two virtual CPUs, and running 100 processes in parallel could result in a lot of context switching. So what he did is he tried dropping the number of connections down to 20. So the server side connections down to 20 and ran his benchmark again. And this time he actually got an increase in throughput to 426 transaction per second compared to 377. So this is an interesting case, is that you drop the number of server connections, but you increase the level of performance because you're getting less context switching with two virtual CPUs trying to handle 100 connections at once, when they can only handle 20 connections per one, they become much more efficient and can give you a better transaction throughput. So definitely something to keep in mind that he mentions here is that sometimes you think just adding connections will give you more performance. Not necessarily. And then they close out the post talking about a tool available in AWS, which is RDS Proxy, which is essentially a tool like PG Pool that you can work with. So definitely interesting insights and a lot of tests with regard to connections. So if you want to learn more about this, I highly encourage you to check out these two blog posts. [00:14:14] The next piece of content is announcing the Crunchy Data PostgreSQL Security technical implementation guide. Now, this was first released in 2017 in conjunction with working with the United States Defense Information Systems Agency, and they've come out with a new version as of, I think, November of 2020. So here's the link to the Security Technical Implementation Guide that you can use to help secure your PostgreSQL installations. So if you want to learn more about this and get a link to the guide, go ahead and check out this post from Crunchydata.com. [00:14:47] The next piece of content. Also security related is PostgreSQL database security. What you need to know now this is a very short post, but it says it's going to be the start of a larger series. So this is kind of the 50,000 foot view of it and they've basically broken down security into these six areas. And more specifically, what Postgres deals with is regard to the authentication, identifying who is allowed to access the database, authorization, what they have authorization to see, and then accounting for logging who's accessed what data, et cetera. And they go and break throughout the different features that are available within Postgres to handle that authentication, that authorization as well as accounting steps. So if you're interested in this post, go ahead and check it out from Percona.com, the next piece of content is how to limit rows to at most N per category fix. So this is with reference to a post that was posted on Depom, where he limited to the number of addresses that were allowed to a user. And someone had made note that with a race condition or multiple people adding addresses at once, you could actually overrun his previous implementation because his implementation just did a query to find out how many were there and if it already had three in it, then it would just raise this exception. But if you had the case where there's transactions going on or things are happening in a concurrent fashion, you could easily overrun this. It had five or six or whatever. So what you do to avoid that is you introduce a locking. Now the first implementation he did did a four update lock which essentially locks this individual row, but that locks it for everything going on. So you may not necessarily want to do that because again, you can, like we mentioned earlier, get stuck in a lock queue and have things locking behind it. So you may not necessarily want to do that. The implementation he ultimately decided on is using an advisory lock. So I would say this is only this specific purpose. So it's kind of considered an application level lock even though it is at the database level. It's used a lot in application development. So here you have a specific type of lock for this check address feature. So it does apply this lock with reference to that user ID that's being used. Then it does the count, and if the count is three or more, then don't do the insert, just raise this exception listed here. So if you're interested in that, you can check out this blog post. [00:17:22] The next piece of content is single user mode and Hyphen P flag. This is from fluca 1978 GitHub IO and they're talking about if you have a case where you have damaged system indexes in postgres, there is a way to rebuild them. But the problem is you can have trouble connecting to the database if there is this type of corruption. And the solution is that you start the cluster in single user only mode and he has a link of how to do that here. And then you can start a background process, ignoring the system indexes here, and then run the command reindex system, the database that is having issues. Then you can restart the cluster in order to get it back. So if you run into this issue, definitely a blog post to consider checking out. [00:18:11] The next piece of content is using R in postgres for logistic regression modeling. This is from Crunchydata.com. And this is another post in the series talking about doing data science with Wildfires in California. And they're at the point where you're actually getting to do the logistical regression. Now, for that purpose, they're actually going to be using an extension called PLR, where they made R into a procedural language within postgres. So basically you can install this extension and then it can read R. So they built the model in R and then they can create a function within postgres to run this model and process the data that already exists in postgres. So if you're interested in learning more about this, you can check out this post from Crunchydata.com. [00:19:00] The next piece of content is Improved OpenStreetMap data structure in PostGIS. This is from Rustproof Labs. They're talking about some of the enhancements that were recently made to the OpenStreetMap data where they're allowing a more flexible output, and some of the enhancements that have been made to the general data structure. So for example, some of the opinions or additions that were added to it is that every table has a primary key. More columns get, not null, prefer proper columns over JSON b and prefer JSON b over H store. In other words, as he says, also known as goodbye hstore units, cleaned and matching OpenStreetMap default units and conversion to other units should happen in generated columns, views and even materialized views. So if you're interested in learning more about this improved data structures, go ahead and check out this post. [00:19:52] And the last piece of content is the PostgreSQL person of the Week is Amit Langot. So if you're interested in learning more about Amit and his contributions to postgres, definitely check out this blog post that does it. For this episode of Scaling Postgres, you can get links to all the content mentioned in the Show Notes. Be sure to head over to Scalingpostgres.com where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or itunes. Thanks. [00:20:25] You close.

Other Episodes

Episode 238

October 24, 2022 00:13:05
Episode Cover

Slow Updates, Lateral Joins, Serial To Identity, PG 15 Changes | Scaling Postgres 238

In this episode of Scaling Postgres, we discuss why and what to do when updates are slow, use cases for lateral joins, moving from...

Listen

Episode 148

January 17, 2021 00:11:54
Episode Cover

Hash Indexes, Int Float Numeric Types, Postgres 14 Features, Data Science | Scaling Postgres 148

In this episode of Scaling Postgres, we discuss hash indexes, the int, float and numeric data types, features coming in Postgres 14 and data...

Listen

Episode 75

August 05, 2019 00:15:40
Episode Cover

Combining Indexes, Regular Expressions, Parallelism, PostGIS | Scaling Postgres 75

In this episode of Scaling Postgres, we discuss combining indexes, regular expressions, parallelism in Postgres and PostGIS content. To get the show notes as...

Listen