External Compression, Parallel Vacuum, JSON Aggregates, JSON & Arrays | Scaling Postgres 130

Episode 130 September 06, 2020 00:15:15

Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, we discuss external compression, vacuuming in parallel, working with JSON Aggregates and the JSON & Array datatypes.

To get the show notes as well as get notified of new episodes, visit: 

https://www.scalingpostgres.com/episodes/130-external-compression-parallel-vacuum-json-aggregates-json-arrays/


Episode Transcript

[00:00:00] In this episode of Scaling Postgres, we talk about external compression, parallel vacuum, JSON aggregates, and JSON and arrays. I'm Creston Jamison, and this is Scaling Postgres episode 130.

[00:00:22] All right, I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is "Need for external compression methods in PostgreSQL". This is from Amit Khandekar, writing on blogspot.com, and he's talking about a means of doing compression within Postgres using external methods. Postgres utilities do offer some compression options, but it's usually just gzip, and there are internal compression mechanisms such as the oversized attribute storage technique (TOAST). But this author argues there is a need to be able to use external compression mechanisms to compress a table, or to compress data generally, to get better performance. As he says here, the idea is that a smaller data size means fewer data pages to scan, which means less disk I/O and faster data access. Of course, the decompression methods need to be fast enough not to hamper the query performance gains you would get. Some of the compression methods that would be good to utilize in some way are zlib, lz4, zstd, snappy, gzip, et cetera. He also mentions that another benefit of adding this feature would be the ability to use hardware acceleration for compression and decompression, and he lists a few of those, several of them on the ARM64 platform.

Now, he does say that PostgreSQL uses its own built-in, LZ-based compression algorithm (pglz) for TOAST compression. But what if you could choose the compression algorithm, configure the compression level, and use hardware acceleration? He says the community is working on exactly that: there is a proposed feature in the making with an experimental patch available, so he took a look at it. He created two tables, a zlib table and a pglz table: the first compressed with zlib at level four, the other using the built-in pglz compression. What's interesting is that you get different behaviors and different levels of performance depending on what you're doing. For example, with a COPY, where you're inserting data, zlib was almost twice as fast as pglz. In terms of table size, the pglz table was smaller, roughly 1.2 GB versus 1.7 GB. But in terms of SELECT performance, pglz was twice as fast as zlib. So it seems you could choose your compression algorithm to get different kinds of performance. Maybe you want the fastest queries, even if the table is a little larger or takes longer to load, or maybe you're just interested in raw insert performance; you can choose the algorithm that is best at that. He also mentions that you could even define your own compression algorithm by defining a compression handler, and he gives an example of that. I found this blog post super interesting in terms of how we could potentially use compression and decompression of data in Postgres to get better performance, and how you would be able to choose the kind of performance you're looking for. Are you looking to save space? Are you looking for fast queries? Are you looking for fast inserts? I definitely encourage you to check out this blog post.
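To make that concrete, here is a rough sketch of what the per-column compression idea looks like, loosely based on the experimental patch described in the post. Since this is experimental, the exact syntax and options may differ, and the table names, column names, and file path here are just placeholders.

    -- Hypothetical per-column compression settings (experimental patch; syntax may change).
    CREATE TABLE zlib_tab (payload text COMPRESSION zlib WITH (level '4'));
    CREATE TABLE lz_tab   (payload text COMPRESSION pglz);

    -- Load the same data into both and compare behavior:
    -- COPY speed, on-disk size, and SELECT performance will differ by algorithm.
    COPY zlib_tab FROM '/tmp/sample.csv' (FORMAT csv);
    COPY lz_tab   FROM '/tmp/sample.csv' (FORMAT csv);

    SELECT pg_size_pretty(pg_total_relation_size('zlib_tab')) AS zlib_size,
           pg_size_pretty(pg_total_relation_size('lz_tab'))   AS pglz_size;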
[00:03:44] The next piece of content is "Parallelism comes to VACUUM". This is from 2ndQuadrant.com. First, they cover the different phases of vacuum as they exist today: the heap scan, the index vacuum phase, the heap vacuum phase, the index cleanup phase, and the heap truncation phase. Of all of these, vacuuming the indexes takes the longest, and that is exactly what they're looking to make happen in parallel in version 13 of Postgres. By default, when you run a manual vacuum on some table, the parallel option will be enabled, so it's on by default. However, they say here, quote, "parallel vacuum is always disabled in autovacuum". That's a little bit of a disappointment, because I would really love to have it run there, but I assume they may enable this in a future version of Postgres. Normally you see new features like this come out where they're not the default, so they won't potentially impact the vast majority of Postgres users, and then in a later version it's deemed safe enough to become the default; we'll just see what happens in future versions. If you do specify the PARALLEL clause, you can define how many workers to use, or you can disable parallel vacuum by setting PARALLEL to zero, so no workers. He also mentions that there are settings based on the indexes, and that you're not always going to get a parallel index vacuum even though you ask for one: if a B-tree index is small enough, it still may not be processed in parallel, because it would take more time to set up and coordinate the workers than to just do the work with one process. But otherwise you do get B-tree indexes being vacuumed in parallel. He also mentions that the other index types, such as hash, GIN, GiST, BRIN, et cetera, may or may not be processed in parallel depending on the phase. In terms of performance, he set this up with a six gigabyte table with eight indexes of about three gigabytes each and ran vacuum with different numbers of parallel workers. As you can see here, there's a dramatic performance gain going to eight workers, which is the number of indexes on the table. He mentions, quote, "the execution time of the index vacuum accounted for more than 95% of the total execution time. Therefore, parallelization of the index vacuum phase helped to reduce the vacuum execution time much." So definitely a great improvement coming in Postgres 13, and, like I mentioned, I look forward to future versions where they may enable autovacuum to do this in parallel as well.
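For reference, here's what the Postgres 13 syntax looks like; the table name is just a placeholder.

    -- Use up to 4 parallel workers for the index vacuum and index cleanup phases (Postgres 13+).
    VACUUM (PARALLEL 4, VERBOSE) my_large_table;

    -- Disable parallel vacuum for this run by requesting zero workers.
    VACUUM (PARALLEL 0) my_large_table;

    -- Only indexes larger than min_parallel_index_scan_size are vacuumed in parallel,
    -- and the worker count is also capped by max_parallel_maintenance_workers.
    SHOW min_parallel_index_scan_size;
    SHOW max_parallel_maintenance_workers;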
[00:06:26] The next piece of content is "Postgres JSON aggregate functions are pretty cool". This is from the blog at Knut EU. He basically has a set of tables where a query returns a username, an age, and a group name, and he wanted his API to return JSON where things are grouped together: different users under different groups, presented in a particular JSON format. First he shows the different tables and how the data is inserted to be able to get a query that looks like this. Then he converts the output to JSONB by combining this information, removing the ID column because it wasn't necessary in the output, which gives the results shown here. He then uses the JSON aggregate function, json_agg, to collect the rows for each group, grouped on the group name, so you can see two records inside each group. Lastly, he uses JSON object aggregation, json_object_agg, to combine everything into the final output he was looking for. So this is a pretty quick post, but it shows how you can get Postgres to output JSON in a format that could be served directly by an API. We've covered other articles that show this, and they usually report big performance gains from having Postgres do this work in queries, as opposed to using, say, your application framework to do the manipulation. Maybe you lose some flexibility, but you generally gain some performance. If you're interested in learning how to do this, definitely check out this blog post.
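As a minimal sketch of the technique, using the jsonb variants of these functions, here is roughly how those pieces fit together; the table and column names are invented for illustration, not taken from the post.

    -- Hypothetical schema: users belong to groups.
    -- jsonb_agg collects each group's users into a JSON array, and
    -- jsonb_object_agg folds the groups into one JSON object keyed by group name.
    SELECT jsonb_object_agg(group_name, members) AS api_payload
    FROM (
        SELECT g.name AS group_name,
               jsonb_agg(jsonb_build_object('username', u.username, 'age', u.age)) AS members
        FROM groups g
        JOIN users u ON u.group_id = g.id
        GROUP BY g.name
    ) AS grouped;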
[00:08:06] The next piece of content is the webinar "JSON & Array: Contemporary PostgreSQL Data Types" follow-up. This is a webinar given by 2ndQuadrant.com; you can click here to watch it, and there's a ten or eleven minute preview down here. It basically reviews the JSON data types as well as arrays, and how to work with them to avoid having to follow any kind of entity-attribute-value pattern, which can give really poor performance. So it takes what was in the previous post and goes into those functions and more manipulations you can do, in more detail and in a webinar format. If you're interested in learning more about JSON and manipulating it in Postgres, definitely check out this webinar.

[00:08:51] The next piece of content is "How to get the best out of PostgreSQL logs". This again is from 2ndQuadrant.com, and they're talking about setting up logging for Postgres. Some of their recommendations: don't make manual changes to postgresql.conf, but use some sort of configuration management tool like Ansible, Puppet, or Chef, and use the logging collector. Some people choose not to use it and just use the system logging of their operating system, but they advocate using the logging collector. They also cover how to set your log destination to a particular format, how to configure your log file names, how to configure your log_line_prefix, and the other configuration settings for setting up your logging. Then they cover some final things, like using tools such as pgBadger to analyze the logs, and other management applications to track log changes. If you're interested in learning more about that, definitely check out this blog post.

[00:09:53] The next piece of content is "PostgreSQL vs. PostgreSQL vs. Postgres vs. Oracle vs. all the rest". This is from cybertec-postgresql.com, and they're talking about searching within Postgres when you're doing similarity-style searches. It's not purely about similarity searches, but about the different ways to find information through text searching. The first thing they cover is the citext data type, which is available through an extension and gives you case-insensitive text: basically, you can do searches without having to worry about case, and it will find the appropriate records if you use this data type. Next they go into LIKE queries, and how you can search for portions of records using LIKE and ILIKE (ILIKE being the case-insensitive variant), and then similarity searches using pg_trgm trigrams. They give some examples of using that, and then move on to full text search, which does an exact search, but on the words that exist within the rows you're searching, and finally phrase search, where you can require certain words to follow others as a phrase. If you're interested in learning more about these types of searches, definitely check out this post from cybertec-postgresql.com.
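Here is a quick sketch of what those different search styles look like in SQL; the table, columns, and search terms are made up for illustration rather than taken from the post.

    -- citext and pg_trgm both ship as extensions.
    CREATE EXTENSION IF NOT EXISTS citext;
    CREATE EXTENSION IF NOT EXISTS pg_trgm;

    CREATE TABLE articles (title text, author citext, body text);

    -- citext: case-insensitive comparison without needing ILIKE.
    SELECT * FROM articles WHERE author = 'hans';

    -- LIKE / ILIKE pattern matching (ILIKE is case-insensitive).
    SELECT * FROM articles WHERE title ILIKE '%postgres%';

    -- pg_trgm similarity search, which tolerates misspellings.
    SELECT title, similarity(title, 'postgress') AS score
    FROM articles
    WHERE title % 'postgress'
    ORDER BY score DESC;

    -- Full text search, here with a phrase query.
    SELECT * FROM articles
    WHERE to_tsvector('english', body) @@ phraseto_tsquery('english', 'open source database');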
[00:11:09] The next piece of content is "Oracle to PostgreSQL: Cursors and ltrees". This follows on a post that we discussed in a previous episode of Scaling Postgres, where they talk about how, when you go from Oracle to Postgres, you don't have to use as many cursors. The last one talked about working with hierarchical data using recursive CTEs; this one covers ltree, a label tree extension for Postgres that helps you work with very complex hierarchical data. For example, they have a taxonomy list here that goes 21 levels deep with over 600,000 records, and they show how you can use ltree to work with this type of data. If you're interested in that, definitely check out this post from 2ndQuadrant.com.

The next piece of content is "Three Easy Things To Remember About Postgres Indexes". This is from Crunchydata.com. First, indexes can speed up other operations too, meaning not just searches in a WHERE clause: when you're doing joins or looking to group and sort data, having indexes in place can help those operations and speed them up. Second, indexes aren't always used; depending on your table and how you query the data, for example with the LIKE syntax they use here, there are cases where an index will not be used. And third, indexes come at a cost: every time you insert or update data, those indexes have to be updated, and there's a performance impact for that. It's a pretty simple post, but if you're interested in learning more, definitely check out this one from Crunchydata.com.

[00:12:47] The next piece of content is "An Overview of Trusted Extensions in PostgreSQL 13". Prior to Postgres 13, you needed to be a superuser in order to create extensions in databases. With 13, certain extensions are classified as trusted and others as untrusted. Something like hstore, which they mention here, is considered trusted in 13, so you just need certain permissions, such as the CREATE privilege on the database, to be able to add it to your database; you don't need Postgres superuser permissions. Other extensions, like the file_fdw foreign data wrapper, are considered untrusted, and therefore you must be a superuser to create them. So it's an interesting adjustment coming with 13, and they mention there are 24 trusted and 24 untrusted extensions. Basically, it gives more power to users who have access to the system, if they have certain privileges, to enable some of these extensions for the databases they work with. If you're interested in learning more, check out this post from severalnines.com.

[00:13:56] The next piece of content is "PostgreSQL Backups". This is from highgo.ca, and they cover all the different ways you can back up Postgres, from logical backups with pg_dump and pg_dumpall, to file-system-level backups using manual methods (starting the backup, backing up all the files, and then stopping the backup), to using pg_basebackup. So if you're interested in looking at different backup solutions for Postgres, check out this post.

[00:14:25] The next piece of content is "Combining pgBackRest and Streaming Replication: a Postgres 13 update". It shows you how to use pgBackRest with streaming replication and get it all set up for version 13. So if you're interested, check out this post from pgstef.github.io.

And the last piece of content is: the PostgreSQL Person of the Week is Marco Slot. If you're interested in learning about Marco and his contributions to Postgres, definitely check out this blog post.

That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.

Other Episodes

Episode 118

June 14, 2020 00:15:51

Safety Systems, Failover Slots, Transaction ID Complications, Repartitioning | Scaling Postgres 118

In this episode of Scaling Postgres, we discuss safety systems, the purpose and existence of failover slots, complications with transaction IDs and how to...


Episode 261

April 17, 2023 00:16:24

Easy Foreign Data Wrappers, JSONB Cheatsheet, Updating Cost Limit, Parallel Aggregate | Scaling Postgres 261

In this episode of Scaling Postgres, we discuss how to easily create a foreign data wrapper to consume an API, present a convenient JSONB...


Episode 109

April 13, 2020 00:12:35

Show Plans, WAL Monitoring, Using UUIDs, Default Logging | Scaling Postgres 109

In this episode of Scaling Postgres, we discuss how to show live plans, how Postgres 13 allows WAL monitoring, how to use UUIDs in...
