Security Concerns, Fast Counting, Indexing, Trends

[00:00:00] In this episode of Scaling Postgres we talk about security concerns, fast counting, indexing and trends. I'm Kristen Jameson and this is Scaling Postgres. Episode 58. [00:00:20] Alright, I hope everyone's having a great week. Our first piece of content is a bit of news from Postgresql.org and it says CVE 2019 91 93 not a security vulnerability. And apparently they say that there's been widespread mention of a security vulnerability in PostgreSQL and it was registered this CVE number. However, the PostgreSQL team believes that this is not a security vulnerability, it's just how it was designed. And specifically it's talking about the Copy command which allows you to import and export data from a PostgreSQL table, for example to a file, but to a program. It allows you to run a program and pipe the output from that into say, a PostgreSQL table. And apparently this feature is thought to be a security vulnerability, but you can only use it if you have superuser privileges or you have the Role Pgeecute server program because it allows you to run programs upon which the system that PostgreSQL is running, and specifically just the rights that a Postgres user has, assuming that PostgreSQL is installed under the postgres user. So he says, quote by design there exists no security boundary between a database super user and the operating system user the server runs under. And that's part of the reason why they don't run PostgreSQL server or set it up to run as a root user. And they also state, quote we encourage all users of PostgreSQL to follow the best practice of never granting super user access to remote or otherwise untrusted users. So of course you should keep this very much under locking key because of the power it has. Now there's a few more posts that went into this in more depth. The next one is when a vulnerability is not a vulnerability. And this is from Magnus Hagander's blog and basically he talks about the PostgreSQL documentation essentially documents how this is handled. So it's essentially how the system was designed and not a vulnerability. And I like what he says here is that while this is not a vulnerability in PostgreSQL, it is certainly a vulnerability present in a nontrivial number of installation of PostgreSQL deployments. So basically avoid granting super user permissions to a user and quote a common setup is to only allow the Postgres operating system user itself to act as a super user, which is what I tend to do. I also like what he says here. It seems PostgreSQL needs better documentation about this for new users, possibly along the Create role page, which does not explicitly mention this, but if super user is granted to a user, that remains the equivalent of gridding the OS user permissions. [00:03:07] So basically if you give someone super user permissions, they have some permissions over the Postgres operating system user to do some things. So I agree it would make sense to update documentation to make people aware of that. And of course his recommendation is if you have any application that uses super user. So for example, you have a web application that talks to a database and you're connecting as the super user, you should pretty much stop that right away and just grant the permissions necessary for that application user to do what it needs to do. Maybe it could just be the owner of the database for example. And of course he also mentions use Pghba.com to ensure that no super user can log into the system remotely. Now it also mentioned the origin of this appears to be from the blog post on Trustwave. So I will link this in the show notes that you can take a look at it as a reference. Another post related to security is Postgres and Super User Access because this again mentions the same CVE and they pretty much cover the same issues and particularly highlight if you're handing out super user access to your database or connecting to your application with a super user role, consider changing that immediately. So it's definitely best practice not to be using superuser privilege to carry out standard application operations. [00:04:29] The next post again probably prompted from this CVE announcement is secure PostgreSQL a reminder on various attack services and this is from CyberTech Postgresql.com and they talk about different attack points. The first one they mentioned, given what's in the news this week is unnecessary PostgreSQL super user access and they flag this as the number one point of danger and emphasize again like mentioned super users can execute random commands on the OS level under the PostgreSQL process owner privileges. So this is usually the postgres user of that installation. So they go through a scenario where you can potentially SSH into a machine. Another area mentioned is the archive command abuse because typically this is an operating system commands, a command that gets run every time you switch a wall archive. Typically you copy those out to another location. But here let's remove the postgres installation for the archive command and if you reload your configuration and do a log switch you've essentially removed your whole postgres installation. The next area they mentioned is a file foreign data wrapper, arbitrary program reads. So in this case being able to potentially run a program that again removes the postgres installation and then untrusted PL languages that again can try to do the same thing. And in terms of the mitigation the recommendations are the same applications should be designed so that super user access is not needed. The next attack area mentioned is brute force password guessing and the mitigation area for this is use long random passwords. Area two is install and configure the relatively unknown auth delay contrib extension that will halt the communication for a moment in case a login attempt fails making it slower to guess passwords. And third, actively monitor the server logs doing full power brute force generates many gigabytes of logs per hour with fail entries which should be picked up by a good monitoring setup. The next area is man in the middle Attacks. Basically someone getting between you and your database server when you're trying to connect and it says at a minimum force SSL connections preferentially use Scramshaw 256 which is a new authentication mechanism as opposed to MD Five and then also use certificates that use a verify CA and verify full SSL modes when connecting. The next attack area he mentions is Loose Backups or Logs. So basically where are they being stored and being sent, so make sure they're kept separate. Only DBAs have access to them and use things like at Rest Encryption. The next area mentioned is disk space attacks. So someone trying to say fill up the disk. And some of the mitigation strategies are don't use the public schema for important applications and create application specific schemas. And also you could set the temp file limit parameter to a reasonably low value. And the last area mentioned is distributed denial of service attacks for publicly exposed DBS. Now, I'm not sure why you would want to publicly expose your database and of course mitigation number one is try to avoid direct public access to your database, but failing that, look into anti DDoS services. So a pretty good write up of different attack areas and mitigation strategies to avoid them. [00:08:03] The next post again security related, is enhancing your PostgreSQL ten security with the CIS benchmark. So this is from Crunchydata.com and they have developed a center for Internet Security PostgreSQL benchmark that you can establish against your installation to see how well it achieves this security benchmark. And it was developed by testing PostgreSQL ten running on CentOS seven and it has recommendations with regards to installation and patches, directory file permissions, logging, monitoring and auditing, user access and authorization, connection and login PostgreSQL settings, replication and special configuration considerations. So if you want to benchmark your PostgreSQL installation with regard to security, definitely a blog post to check out. [00:08:53] The next blog post is actually not security related and it's Count Star or Count asterisks made fast. And this is from CyberTech Postgresql.com and they're talking about why is count so slow? Because some other database systems keep a count of the total rows in a table and when you count from a whole table in PostgreSQL it's relatively slow. And the primary reason is MVCC. It has to go in and count every row to determine which rows are visible to give you an accurate count of how much is in the table. And the first thing he mentioned is count the problem because you're returning essentially all rows. But no, it's really just counting rows and there's nothing to be gained by avoiding the asterisk. Would an index only scan help? And it could potentially if vacuum has been run sufficiently and the visibility map is updated. The other thought is using an aggregate table. So basically you keep an active count of the rows you're interested in in a separate table. And he has this scenario where he created a started transaction, created a table called my table underscore count with just a single column, created a function that will serve as a trigger. So if something's inserted, the value gets incremented, if something's deleted, it gets decremented. And in other cases where for example, it's truncated, it gets reset to zero and then he creates the trigger. After insert or delete, run this procedure and get an initial count from the table by running this statement here. Only disadvantage, it will lock the table while it's setting this up. He says this provides us with a really fast alternative to count, but at the price of slowing down all data modifications on the table. So now essentially, you're going to have to update the second table every time an insert or delete is done to it. And he makes note even though this counter table might receive a lot of updates, there's no danger of table Bloat because these will be hot updates or heap only Tuple updates. The next thought is, well, do you really need an accurate count? And another way you can count is just look in the PG class table at the Rail Tuples and you get kind of what the planner uses for estimates of what the count total rows are in the table. So if you don't need something perfectly accurate, this may be a reasonable way to get an approximate count. And then he has this interesting way where basically use an explain plan to pull out an estimate of the rows, which I'm not sure I would use because of the last statement here. Do not use this function to process untrusted SQL statements since it is by nature vulnerable to SQL injection. They also list a number of comments here discussing different ways. And one interesting one I thought is that if you don't have a lot of deletes going on and only inserts, you could potentially use the sequence of the primary key. So this sequence could give you a rough count of how many rows are in the table. One disadvantage of that, even if you don't have a lot of deletes going on, is that if you have failed inserts, that will still increment the sequence but not actually insert the row so you'll be overcounting them. But that's another interesting alternative. So if you're looking a way to maybe get counts of an entire table, these may be some techniques you want to look into. [00:12:17] The next post is postgres indexes for absolute beginners. This is from the PG Mustard blog and this is a basic index post for postgres. They say step one, understand what you want to achieve. Because again, indexes are not free. There is a cost for keeping them up to date and maybe sequential scans aren't so bad if you're going to be looking at a lot of a table and pulling back data from it, then maybe sequential scan is what you want. An index scan is advantageous when you're only looking for a very, very small portion of the data in that table. And they talk a little bit about the different index types, the primary of course being the Btree index and talk about how to create them and even covering multicolumn indexes. So if you want a basic introduction to indexes, definitely a blog post to check out. [00:13:09] The next post is 2019 PostgreSQL Trends report private versus public cloud migrations database combinations and top reasons used. So basically it looks like ScaleGrid here went to the PostgreSQL conference in March and surveyed PostgreSQL users. So these are all PostgreSQL users, they make some reference to other databases, but no, this is a PostgreSQL crowd that was asked and they say basically how many people are using public cloud versus a private cloud or a hybrid one. And I believe private just basically means on premises. [00:13:45] And what are you using for different cloud providers? Predominantly AWS, it looks like, although I wonder how much because I don't think I saw this in here, but I wonder how many are running their own postgres versus using something like RDS, a hosted version. What's your state of using or migrating or exploring using PostgreSQL most popular database combinations with PostgreSQL. So definitely just a survey of PostgreSQL users about how they're using it in their current state. So if you're interested in that type of information, this is definitely a blog post to check out. [00:14:21] The next post is PostgreSQL's exciting features you should know. And this is from a hackernoon.com and the first one they mentioned is inheritance. And then I was looking at the inheritance here and really what he's boiling it down to is basically partitions using inheritance. And he says the advantages are performance because when you need to do a query you're going to only choose like that partition table. The index sizes will be smaller, so less to scan through. [00:14:51] And in terms of maintenance, if you have a table that's not being used, you can easily vacuum full it, do a re index or even a cluster on the data. He talks a little bit about data types, talking about JSON and how to work with JSON, although interestingly, he doesn't mention JSON b at all. And really if you're going to be using and manipulating JSON, really that's the data type you want to use. [00:15:16] Just the pure JSON data type is mostly beneficial for just getting a full snapshot of a payload and storing it for reference. But if you're wanting to insert data, update data, retrieve specific parts of the data, then A JSON B is generally the better data type to use. He goes into the array data type and then hstore. Other interesting about hstore is that with JSONB I haven't really used hstore anymore. Then of course, he talks about data integrity, which is a hallmark of relational database. He goes over custom types and functions and error handling, combining queries and window functions as well as check constraints and sharding, although he says sharding, but this actually looks more like partitioning. So if you want to review of some of the features in PostgreSQL, here's a blog post to check out. [00:16:13] The next post is I am a developer and you can too. And this is from the second Quadrant.com Blog. And basically this is an introduction on how to get started with contributing to PostgreSQL and not necessarily writing code but reviewing patches. And it goes through the process of how you would go about doing that and contributing to PostgreSQL. So if you're interested in doing that, definitely a blog post to check out. [00:16:40] Now, related to that, the next piece of content is writing PostgreSQL Extensions is Fun C Language and this is from Percona.com. So maybe if you don't want to contribute to PostgreSQL right away, maybe you could try writing an extension first so they go over the process of what you would need to do to kind of start putting together a PostgreSQL extension using C. So if that's interesting to you, definitely a blog post to check out. [00:17:10] The next post is postgres Excel and Global MVCC. So this is talking about the Postgres Excel project, which is a fork of Postgres and how one of the things that it does is handles global transaction management. What that means is that it ensures that if you have database nodes that are separate geographically, like a multimaster setup, it ensures that changes are seen by both instances at the same time using a GTM or a Global Transaction Manager. So if you have the need for distributed databases or multimaster and you haven't heard or explored Postgres Excel, this could be a blog post to check out as well as the Postgres Excel site. [00:17:57] The last post is Replication between PostgreSQL versions using Logical Replication. This is from Percona.com and it is a post about logical replication and how to go ahead and set up your publisher and subscriber and to get started using logical replication. And even if you're using a version that doesn't have it versions prior to ten, how you can use PG logical to set up the same process. Now it doesn't go, from what I can tell, into the process of actually doing an upgrade, but at least getting logical replication initially set up. So definitely a blog post. If you're interested in setting this up, that does it. For this episode of Scaling Postgres, you can get links to all the content mentioned in the Show Notes. Be sure to head over to where you can sign up to receive weekly notifications of each episode, or you could subscribe via YouTube or itunes. Thanks.

Show Notes

Episode Transcript

Other Episodes

Episode 327

Postgres Graph Queries | Scaling Postgres 327

Episode 309

Will Postgres Pull A Redis? | Scaling Postgres 309

Episode 107

Showing Plans, Atomic, Configuration, Migrations | Scaling Postgres 107

Security Concerns, Fast Counting, Indexing, Trends | Scaling Postgres 58