Postgres Releases, Data Warehouses, Exclusion Operators, Read-Only Users

[00:00:00] In this episode of Scaling Postgres, we talk about postgres releases, data warehouses, exclusion operators and readonly users. I'm creston jamison. And this is scaling postgres episode 165. [00:00:23] All right, I hope you, your friend, family and coworkers continue to do well. Our first piece of content is PostgreSQL 13.312, point 711.1210, point 17 and 9.6.22 are released. This is from the Postgresql.org website and this release covers three security issues. One, a buffer overrun from integer overflow and array subscripting calculations. The second is a memory disclosure in insert on conflict due update. And this depends upon a purpose crafted table to exploit. And the third is a memory disclosure in partition table update. Returning that again, relies on a purpose crafted partition table. So there are three security issues that are addressed and then a number of bug improvements for these versions. So definitely when you get an opportunity, you should go ahead and upgrade to the latest version. [00:01:20] The next piece of content using PostgreSQL as a data warehouse, and this is from Narrator AI, and they're talking about using Postgres as a data warehouse. They have a number of recommendations that they cover in this post. The first thing they cover is the differences between data warehouses and relational databases, although I would call it difference between data warehouses and online transaction processing databases in that they mentioned here. Most queries are looking for one record or a few records, whereas data warehouses do a lot of analysis. So online analytical processing is another term that's frequently given for it. And for these analytical queries, the traits are you process many rows at a time, queries go from several seconds or less than a second to several minutes, and it may only be looking at a small number of columns. And for that they're advocating a column row store may be better, but you can still use Postgres's row store for doing it. Now they mentioned some third party data warehouse products, but there's also the C store extension for Postgres and also the Citis extension which allows you to do column storage within postgres. So you could check those out and then they go into if you're going to use Postgres, how would you configure it for data warehousing operations, the first thing they mention is to avoid common table expressions, at least with versions prior to twelve, because the versions after allow you to materialize that CTE or not. The next recommendation they have is to use indexes sparingly. Now this depends on how much data if you're going to be analyzing a whole table, well then of course you don't need indexes or even a significant portion of that table. But if you're going to be analyzing, say, by a date and a time, then indexing could prove beneficial. Or if your data is so huge, you could actually partition by month or by year, not use indexes. And that way you can only target certain partitions when doing the query. So you don't have to cover the whole data set and they seem to be mostly focusing their discussions on Btree indexes. But you could also potentially try using Brend, the block range index. That could give some very good performance. Potentially, if you're looking at a lot of records that you're querying, then they cover the partitioning that I mentioned before and how it can be beneficial for breaking up a large data set into smaller tables. They talk about the importance of minimizing disk I O and the importance of using SSDs or other types of fast disk access. Then they talk about the importance of vacuuming because it's especially important after you do a bulk insert to do a vacuum to make sure that everything is optimized. Then they talk about making sure that your parallel queries are tuned and that you want to typically make modifications to your max parallel workers and max parallel workers per gather to optimize those for the types of queries you're running. And then lastly, they also recommend increasing your statistic sampling size, which you could do on a per table basis or for the entire table to give you the best query performance. So if you're wanting to use PostgreSQL as a data warehouse, perhaps you would like to check out this article. [00:04:23] The next piece of content is actually a YouTube video and it's Exclusion Operators explaining PostgreSQL. And they're talking about the problem of preventing overlapped bookings. So if you have a room, you don't want to book that room more than once. So you want to avoid this situation where someone is going to double book a room in, say, a hotel. Now, to do this they're using range types, so they're using a start date and an end date within a single data type. And you can store this in postgres and then you can do things to determine, hey, does this date exist within this range of dates? And it returns true or false. Now, the trick that you can do with Postgres is use a special type of index and an exclude using clause when you create the table to create a constraint to avoid double booking. So the first thing you need to do is create this extension Btree Gist, which combines a B tree and a Gist index. Then you create your table that has a room with an integer as an Identifier, a range. So this is the range of dates that that room is wanting to be booked. And then you specify an exclusion constraint where you're exclude using Gist and room with an equals in a Myrange with double ampersand. And basically this operator avoids overlaps. So with adding this extension and this minimal set of code, you essentially avoid double booking rooms in this implementation and that's all the code you need to worry about. And when you do an insert and you attempt to do a double booking, it'll return an error so you can capture that in your application and then handle it gracefully. So this is a great video and I encourage you to watch the whole video and see how to implement this and even implement multiple exclusion constraints. [00:06:15] The next piece of content creating a read only postgres user, this is from Crunchydata.com and they're talking about you're wanting to create a user that has read access to all tables in a database or in this case this example in a schema. So you can use grant and revoke to grant different objects to different users or to revoke them from different users. And the alter default privileges basically says for any new objects created by a particular user. So in their example here, they created a separate schema to store their data. They created an app schema. They revoked all permissions on schema from public, which is a general security recommendation. They created a table and an index and then inserted stock data into it and created a function to be able to retrieve from some data from it. Now they created the user they're calling a scraper because they want to scrape the information. They created a role as a login. That's how you would create a user. Now, when they try to query it, you're going to get a permission deny because it doesn't have access to this app schema stock data table. But these are the grants that would give you access. So you grant usage on Schema app to the Scraper. You grant select on all tables in the Schema app to the Scraper, you grant usage on all the sequences in Schema app to Scraper and then grant execute on all functions in Schema app to Scraper that will allow you to query the table or to use the functions. And if you try to insert data or create a new object, it will fail and give you an error. But with just those grants you actually can't view new tables. So if you go in as a different user and create a new table, you then go in as the scraper and try to access it. You're going to get a permission denied. So that's where you need to run these two alter default privileges in the Schema app to grant select on tables and grant execute on functions that will then let you select and execute functions for any new objects. Now, there is a caveat that if a different user has access to the app schema and they create an object in it, if they have not set altered vault privileges, the scraper won't have access to it. So you'll have to structure your users and permissions to be able to handle that situation. Although they mentioned here that in postgres 14 there is a PG read all data default role that you can grant to users and that will allow you to select from any table object in the database. So if you want to learn more about setting up readonly users, definitely check out this blog post. [00:08:49] The Next Piece of Content postgres 14 highlight create table compression. This is from Pakir XYZ, and they're talking about a feature where you can now customize the compression algorithm used for toast compression. Now, this doesn't let you arbitrarily compress columns or compress tables in postgres, but when you have a table row that exceeds the space allowed for a given row, it gets stored in the toast table. Well, when you store something in toast, it gets compressed. By default, that compression is a Pglz compression, which from what they were saying here, is circa 1999. But now with this feature, you can opt to use LZ four compression, which should be much faster. The only caveat with that is that PostgreSQL must be built using Hyphen Hyphen with Hyphen LZ four. But with that, you can then customize the compression and give you a little bit of a speed boost. So if you're interested in learning more about this patch coming into PostgreSQL 14, definitely check out this blog post. [00:09:52] The Next Piece of Content clustering a Table this is from Momgn US, and he's talking about the command cluster, ordinarily the table rows in the heap where the actual data is stored is not ordered by any method. It may have an ordering at first insertion, but as updates and deletes happen, the rows can be placed anywhere. It's your indexes that are kept in an ordered fashion. The table is never really ordered. However, cluster allows you to actually order a table and you cluster it based upon an existing index. So that's a way that you can get the table order to mirror the ordering of an index. Now, this could be beneficial if you pull out large numbers of contiguous rows at a time using an index. So, for example, what he mentions here is that if you're just pulling out one row, it consults the index and knows exactly where to go to pull out the row from the heap to present it to you. But if you're pulling up a lot of different rows from an index, they may be scattered across the heap and requires accessing multiple locations in memory, if not disk, to retrieve that data. But if you know you are going to be frequently pulling a lot of rows, it could be beneficial to order the table in that fashion. But there's two big downsides to cluster. Number one, it does not maintain the ordering. So you'd have to keep clustering the data if updates and deletes keep happening. And secondly, it locks the table so you can't really access the table while a cluster operation is running. So this post talks a little bit about cluster and some of its use cases, but because of those restrictions, it's really not used frequently. Perhaps in data warehousing use cases, it might prove beneficial where you do a load of the data and then cluster it from a particular index. But if you wanted to learn more, definitely check out this blog post. [00:11:43] The next piece of content. Setting up PostgreSQL Streaming Replication this is from Cybertechn Postgresql.com and it's a recent post on how you can set up a streaming replication from a primary DB to a Replica or a standby DB and they're using version 13 and give you all the different commands to run and how you can configure both systems to set up streaming replication from one system to another. So if you're interested in learning more about that, definitely check out this blog post. [00:12:11] The next piece of content. TLS for Postgres on Kubernetes OpenSSL CVE 2021 Edition. So they previously had a post on setting up TLS for Postgres and Kubernetes. This is an updated post where they had to update a few techniques to handle a new CVE that was introduced. So it's basically a revised post using the operator that Crunchy Data offers to be able to set up TLS for your Kubernetes postgres installations. So check out this blog post to learn more about that. [00:12:42] The next piece of content regression analysis in PostgreSQL with TensorFlow. Part three data analysis. So if you're interested in doing data analysis with TensorFlow and postgres, definitely check out this post from Enterprisedb.com. [00:12:57] And the last piece of content is the PostgreSQL Person of the Week is Lawrence Albeit. So if you're interested in learning more about Lawrence and his contributions to Postgres, definitely check out this blog post that does it. For this episode of Scaling Postgres, you can get links to all the content mentioned in the show notes. Be sure to head over to Scalingposgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube. Right tunes thanks.

Show Notes

Episode Transcript

Other Episodes

Episode 9

High Volume Processing, pg_wal Issues, Restore with pg_receivewal | Scaling Postgres 9

Episode 374

Migrating Millions Of Databases | Scaling Postgres 374

Episode 346

Things I Don't Do In Postgres | Scaling Postgres 346

Postgres Releases, Data Warehouses, Exclusion Operators, Read-Only Users | Scaling Postgres 165