JSONB Performance, Bidirectional Replication, Most Recent Record, PG14 JSON | Scaling Postgres 201

Episode 201 February 07, 2022 00:15:16

Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, we discuss JSONB performance, bidirectional replication, getting the most recent record and using JSON in Postgres 14.

To get the show notes as well as get notified of new episodes, visit:

https://www.scalingpostgres.com/episodes/201-jsonb-performance-bidirectional-replication-most-recent-record-pg14-json/

 


Episode Transcript

[00:00:00] In this episode of Scaling Postgres, we talk about JSONB performance, bidirectional replication, most recent record, and PG14 JSON. I'm Creston Jamison, and this is Scaling Postgres episode 201.

All right, I hope you, your friends, family and coworkers continue to do well. Our first piece of content is "Postgres large JSON value query performance". This is from evanjones.ca, and he's talking about the performance of JSON, JSONB and hstore in Postgres, particularly as you add more data to them, especially once you exceed roughly two kilobytes per row. Now, this is because Postgres generally has an eight kilobyte page size and aims for a minimum of four rows per page, which gives you a limit of about two kilobytes per row. What he's reporting is that when accessing these columns, performance gets pretty darn slow, two to ten times slower. Now, this is due to both compression and the use of the TOAST table: whenever a row exceeds that two kilobyte limit, the excess spills over into TOAST. Just looking at the JSONB test that he did, the query time for inline uncompressed data was 746; compressed, it goes to 1,178; TOAST uncompressed is 3,393; and TOAST compressed is 7,624. So, as you can see, when data is stored in TOAST and compressed, queries are ten times slower than when the data is uncompressed in the actual row, not in TOAST. That's a pretty big performance difference. Now, they don't mention indexes or anything, so I'm assuming this is just raw reading of the table, and the reality is that accessing and pulling out the data takes ten times longer compared to keeping it small and uncompressed in the row. So that's definitely something to keep in mind if you want to store large JSON blobs in a JSONB field or any of these other types. He suggests you may want to consider splitting values that exceed 2 KB across multiple rows, as well as using, say, LZ4 compression to hopefully gain a little more speed. A super interesting post and something to keep in mind as you're deciding how you want to store your data.

The next piece of content is actually an enhancement of the previous post, because it was covered in Five Minutes of Postgres. This is episode 3, "Postgres performance cliffs with large JSONB values and TOAST". He primarily covers the previous JSONB post and adds a few more posts describing how TOAST works and some other things to keep in mind with regard to it, as well as how updates can be really slow, because, as he says here, quote, when you update a value that's in the TOAST, it always duplicates the whole value. So this applies to text fields or even a large JSONB field: if you only want to change one section of that JSONB, the entire JSONB needs to be duplicated, and that also has an impact on WAL files, which has an impact on vacuum. So it can really cause performance issues. Going back to what the original post mentioned, maybe breaking it up into multiple rows would make that easier to work with. You just need to be cautious about how often you're updating very large JSONB fields. Two great pieces of content to check out. Oh, and I should say this second article is from pganalyze.com.
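
To make the mitigations he mentions a bit more concrete, here is a minimal sketch of switching a column to LZ4 compression and checking how much of a table has spilled into TOAST. The table and column names are hypothetical, and SET COMPRESSION plus pg_column_compression() require Postgres 14 or newer with LZ4 support built in.

-- Hypothetical table "events" with a large jsonb "payload" column.
-- LZ4 only applies to values written after this change.
ALTER TABLE events ALTER COLUMN payload SET COMPRESSION lz4;

-- Compare total size (including TOAST) with the main heap alone
-- to see how much data has spilled over.
SELECT pg_size_pretty(pg_table_size('events'))    AS table_including_toast,
       pg_size_pretty(pg_relation_size('events')) AS main_heap_only;

-- Check which compression method a stored value actually used.
SELECT pg_column_compression(payload) FROM events LIMIT 1;
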
The next piece of content is "PG Friday: Nominally Bidirectional". This is from enterprisedb.com, and what they're talking about is a post that described using pglogical to achieve bidirectional replication. That post was on the Amazon blog, and I actually covered it in the previous episode of Scaling Postgres. His suggestion is: do anything but this. So whatever you take from that post, he says definitely don't do it, because you're going to get yourself into a world of hurt, and at the bottom here is the famous phrase "here be dragons". It's super dangerous to try to use pglogical to achieve some sort of multi-master setup or bidirectional replication, because there are so many things that can get out of sync and cause issues, forcing you to rebuild to get logical replication back up and running reliably, and he goes through all the different issues you'll encounter. One of the main ones is that replication slots, or rather their positions, are not passed to replicas. So if you have a failover event, that slot information is going to be lost and you need specific procedures to get logical replication back up and running, and when you're relying on it for bidirectional replication, that can cause a number of issues. They also talk about how last-update-wins conflict resolution can get you into trouble; they describe merge conflicts where both sides update a balance, which could definitely cause big problems with the data. Basically, he advocates that if you want multi-master or some sort of bidirectional replication solution, use a piece of software that was designed for it. Now, of course, EnterpriseDB offers their BDR product, which is their bidirectional replication product, but there are other solutions that are also viable. I've also seen posts about BDR that seem to actively discourage some uses of it, so basically it has its role, but it's not perfect for everything, because you really need to change your application to support this type of database replication. Definitely a great post to read and consider if you're thinking about setting up some sort of multi-master or bidirectional replication.

The next piece of content is "Select the most recent record (of many items) with PostgreSQL". This is from timescale.com. They're talking about a scenario where you have a lot of devices, say a fleet of trucks, with statistics being reported on a regular basis. So maybe you have a set of trucks reporting mileage, fuel, longitude and latitude, and you want to get each truck along with its most recent value, and they have an example of that query here. They say that by default the Timescale solution indexes by the timestamp; well, that's not going to work too well when you're also looking up a truck ID. The best solution is a compound index, where you index on both the truck ID and the timestamp, and that allows the query to return pretty quickly. Now, I should say that this post is about Timescale, but a lot of it also applies to plain Postgres, particularly if you're using, say, partitioned tables.
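
As a rough sketch of the compound-index approach just described: the table and column names (truck_readings, truck_id, ts) are hypothetical, not taken from the Timescale post.

-- Index on the device identifier plus the timestamp, newest first.
CREATE INDEX ON truck_readings (truck_id, ts DESC);

-- The most recent reading for a single truck can then be answered
-- almost entirely from the index.
SELECT *
FROM truck_readings
WHERE truck_id = 1234
ORDER BY ts DESC
LIMIT 1;
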
Another scenario that can cause real slowdowns is where you're looking for just the last value from each truck. There are a few techniques you can use to do that, but some are better than others. They talk about using a naive GROUP BY, which in most cases is probably not going to give you the best performance. They talk about using a lateral join to find this type of data, which is a viable option. Timescale has a skip scan capability, so that's another one, and if you aren't using skip scan, they say you can do a loose index scan, so that's yet another technique. And lastly, you can simply log the last value for each truck in a separate table, so you take the hit at insert time instead of query time. They have a table down here comparing these different techniques and the benefits and disadvantages of each one. So if you want to learn more about that, definitely check out this blog post.

[00:07:50] The next piece of content is "Working with JSON in Postgres 14". This is from aaronbos.dev, and he's talking about the new subscripting that's available in Postgres 14. Basically, before, we had to use operators, I'll call them the arrow and double-arrow operators, to pull out individual values from JSON or JSONB. Well, now you can use subscripting. He has some example data he inserted into a table and shows how easy it is to use: just use the subscript syntax to pull out the first name or the last name, or even use it in a WHERE clause. The key thing to remember is that when you compare against a subscripted value, you need to double-quote the string, because subscripting returns JSONB, whereas the double-arrow operator returns text. So you just need to remember that if you're going to use the new subscripting capability. He notes you can even access items in arrays by using a numeric index, and updating data is a lot easier than using the jsonb_set or jsonb_set_lax functions. You can change a value by simply setting the subscripted key to the value you want, you can update an array element by its index, you can add a key that doesn't exist yet, inserting a new key into the JSONB for all rows in this case, and you can even append to a JSON array by assigning to a new index. So this is a great post; it's very brief and it tells you exactly how to use these subscripting techniques. So if you want to learn more, check out this blog post.
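
A minimal sketch of the subscripting syntax he walks through, assuming a hypothetical "users" table with a jsonb "details" column (Postgres 14 or newer); these are not the examples from his post.

-- Pull individual values out with subscripts; the result is jsonb, not text.
SELECT details['first_name'] AS first_name
FROM users;

-- In a WHERE clause the right-hand side must be valid jsonb,
-- hence the inner double quotes around the string.
SELECT *
FROM users
WHERE details['last_name'] = '"Smith"';

-- Update or add a key without jsonb_set().
UPDATE users SET details['age'] = '30';

-- Update an array element by its index.
UPDATE users SET details['tags'][0] = '"admin"';
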
[00:09:33] The next piece of content is "How to gain insight into the pg_stat_replication_slots view by examining logical replication". This is from fastware.com. They're talking about the new view that's available in Postgres 14. Basically, it lets you track disk spilling, which happens when the data that has to be decoded for logical replication exceeds logical_decoding_work_mem, as well as streaming, where an in-progress transaction is streamed to the subscriber. It basically helps you tune that configuration parameter. The post goes into a lot of detail about logical replication, how it works, and how this view gives you insight into exactly what's going on. So if you want to learn more, definitely check out this blog post.

[00:10:18] The next piece of content is "Automatic partition creation in PostgreSQL". This is from cybertec-postgresql.com, and they're discussing ways to automatically create partitions in Postgres. Now, the way that I do it is to just set up a cron job with my language of choice, which could be Ruby or any other language. You could even do it as a bash script, or you could create a function in Postgres to create these tables ahead of time. So they talk about several of these approaches: using an operating system scheduler, using a scheduler that works within Postgres, and there's also the pg_partman extension that will do it for you. But they actually tried to do something in real time, using LISTEN and NOTIFY in conjunction with triggers. Now, they say they got something working, but it was a real hit to performance and they definitely wouldn't advocate doing it this way; it was an experiment to see whether it would actually work. If you want to learn more about that, definitely check out this blog post.
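
As a quick sketch of what "creating the tables ahead of time" looks like, here is a hypothetical range-partitioned table; whatever scheduler you pick (cron, an in-database job, pg_partman) would run the second statement before rows for the new period arrive.

-- Parent table partitioned by range on a timestamp column.
CREATE TABLE measurements (
    device_id   int         NOT NULL,
    recorded_at timestamptz NOT NULL,
    value       numeric
) PARTITION BY RANGE (recorded_at);

-- Create next month's partition in advance.
CREATE TABLE measurements_2022_03 PARTITION OF measurements
    FOR VALUES FROM ('2022-03-01') TO ('2022-04-01');
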
[00:11:17] The next piece of content is "Secure permissions for pgBackRest". This is from crunchydata.com. Basically, what they wanted was to give pgBackRest, which is a backup solution, permission only to read the Postgres data files, with no ability to modify the database at all, for security reasons, and this post describes how to get that set up and working. Now, you may not use pgBackRest, but a lot of the techniques, essentially setting up a user that has read-only access to the Postgres data files for running the backup, could be applicable to any other tool or method you use to back up the system. So if you want to learn more, definitely check out this blog post.

[00:12:02] The next piece of content is "PostgreSQL timestamps and time zones: What you need to know and what you don't". This is from the Yugabyte blog. This is a super long post, so apparently there's a lot you need to know about timestamps in PostgreSQL; it's definitely a tour de force covering this topic. So if you're interested in learning more about times, timestamps, time zones, dates and so on in Postgres, you can check out this blog post. [00:12:30] There's also a companion blog post addressing primarily intervals: "PostgreSQL timestamps and time zones: How to navigate the interval minefield". Intervals are distances between times, so they could be so many hours, so many minutes, so many days, et cetera. This one covers the interval perspective and it's equally long, but if you're interested in that type of content, you can definitely check out this blog post.

[00:12:56] The next piece of content is "Creating a Postgres foreign data wrapper". This is from dolthub.com, and they're talking about how you can create your own foreign data wrapper. As a reminder, a foreign data wrapper allows your database to talk to other data sources: you use SQL in Postgres to query your database, and it contacts the external data source and queries it on your behalf. That external source could be files, as they say here, or other databases such as MySQL, Oracle or Microsoft SQL Server. So they walk through building their own foreign data wrapper, and if you want to learn how to do that, you can definitely check out this blog post.

[00:13:41] The next piece of content is "How to use regular expression group quantifiers in PostgreSQL". This is from endpointdev.com. They had an address in HL7 v2 format in a Postgres column that they needed to parse, or at least extract the text from. They had a grep pattern that worked, but when they tried to use it in Postgres it wasn't giving them what they wanted, so they had to use a particular technique to extract the address properly. If you want to learn more about how they did that, you can check out this blog post.

[00:14:16] The next piece of content: the PostgreSQL person of the week is Álvaro Herrera. So if you want to learn more about Álvaro and his contributions to Postgres, definitely check out this blog post.

[00:14:28] And the last piece of content: we did have another episode of the Rubber Duck Dev Show this past Wednesday. This episode was on exploring the Jamstack, which is a development approach where you predominantly use pre-rendered pages and JavaScript to enhance the application or even just the website. So if you're interested in long-form, developer-based content, maybe you'd like to check out our show.

That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.
