Performance Training, JSON Path, Hypothetical Indexes, Interpolation | Scaling Postgres 69

Episode 69 | June 24, 2019 | 00:15:13

Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, we discuss performance training, SQL/JSON path support, hypothetical indexes and linear interpolation of data.

To get the show notes and to be notified of new episodes, visit:

https://www.scalingpostgres.com/episodes/69-performance-training-json-path-hypothetical-indexes-interpolation/


Episode Transcript

[00:00:00] In this episode of Scaling Postgres we talk about performance training, JSON path, hypothetical indexes and interpolation. I'm Creston Jamison, and this is Scaling Postgres episode 69.

[00:00:21] Alright, I hope you're having a great week. Our first piece of content is that new versions of PostgreSQL have been released. Primarily this is due to a security issue, an error in SCRAM authentication, that affects versions 10, 11 and 12. So the security issue only impacts 10, 11 and 12, but there are numerous bug fixes and improvements for the version 9 series as well. So definitely, especially if you're on version 10 or higher, go ahead and upgrade your version of PostgreSQL as quickly as you can. Going a little more in depth into this security release is the next piece of content, "Explaining CVE-2019-10164: PostgreSQL Security Best Practices". This covers the essentially out-of-cycle update release for the CVE-2019-10164 vulnerability. They go over what it is: an arbitrary code execution vulnerability. You need to be an authenticated user, but you can craft a password that can crash the PostgreSQL server or, in the worst case, run arbitrary code as the postgres operating system user. So if you have your system locked down, that mitigates some of this, but you'll definitely want to upgrade your version of PostgreSQL as soon as you can. It also talks about fixing a potential denial-of-service attack vector. Now again, they reiterate the good news: the attack can only be accomplished by a user who is already authenticated to your PostgreSQL server, so it's not open to someone without access. They then talk about how to protect yourself; basically, you do the upgrades to the new minor versions. They also made updates to libpq on the client side, since this issue can potentially cause clients to crash as well, and one of the recommendations is to set the client certificate verification mode to verify-full to avoid connecting to rogue PostgreSQL servers that could do something to your client. If you want more information about this particular vulnerability, this post is from the Crunchydata.com blog. There's also another post, pretty similar in nature, called "CVE-2019-10164: Who's affected and how to protect your systems", from the 2ndQuadrant blog, that describes some additional details.

[00:03:06] The next piece of content is actually two YouTube videos, part one and part two: "PostgreSQL In-Depth Training: Performance Part 1" and "PostgreSQL In-Depth Training: Performance Part 2". These were put on by Jeremy Schneider, a database engineer at AWS. The first video is about 44 minutes and the second one is about an hour. He had a t-shirt on that said "Oracle Nerd", so it looks like he has a lot of history with Oracle, and there were some points during his talk where he needed to look up information with regard to PostgreSQL. So it doesn't appear he's someone who has spent years and years in PostgreSQL, but coming from an Oracle background he does give you some different insight into performance than maybe you've seen in other presentations. He covers how important monitoring is, and explicit locking, because that's clearly an area where you can get yourself into trouble. He talks about prepared statements and how those can be utilized. And then he spends a fair amount of time talking about wait events, which I believe started to be added to pg_stat_activity in 9.6, with additional ways to get insight into what queries are waiting on added in 10 and 11. He advocates using tools to track what's going on with queries over time, even at a one-second or five-second interval, to be able to see which wait states individual queries are waiting on. Now, this does have an emphasis on RDS, Amazon's relational database service, and he does talk about Performance Insights, which is a tool on RDS, but it has relevance for PostgreSQL in general. What's valuable is his thinking on performance: when there's a production issue, how do you narrow down the exact problem using these wait event insights, then drill down with EXPLAIN ANALYZE once you've found the query, and diagnose what the problem is. So if you want a little bit of a different perspective on optimizing your PostgreSQL performance, these are two videos definitely worth checking out.
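If you want to experiment with the wait-event approach he describes, here is a minimal sketch of sampling wait events yourself (the wait_event columns have been in pg_stat_activity since 9.6; this query is a generic illustration, not something taken from the talk):

```sql
-- Sample what the currently active backends are waiting on.
-- A NULL wait_event means the backend is busy on CPU rather than waiting.
SELECT pid,
       state,
       wait_event_type,  -- broad class: Lock, LWLock, IO, Client, ...
       wait_event,       -- the specific wait within that class
       query
FROM pg_stat_activity
WHERE state = 'active'
  AND pid <> pg_backend_pid();  -- exclude this monitoring session itself
```

Running this on an interval (say every one to five seconds) and storing the results gives you the kind of wait-state history over time that the talk advocates.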
[00:05:27] The next post is "Postgres 12 highlight: SQL/JSON path". This is from Michael Paquier's blog at paquier.xyz, and he talks about how Postgres has started to add compliance with the SQL:2016 standard's SQL/JSON feature set for JSON processing inside a relational database. So this is kind of a first cut at supporting that part of the SQL:2016 standard. He has an example where he creates a table with a jsonb column. Now, he does make reference to an additional data type that's been added, called jsonpath, although in his table he's using jsonb. So I don't know if this is eventually going to be a data type you would assign to a column, or whether you just keep using jsonb fields; I'm not too clear on that yet. In his example he uses a jsonb field and inserts a role-playing-game stat set, where a character has a name, an ID, an HP value, a level, his class (a warrior) and his equipment: rings, plus what is equipped on his right arm and his left arm. So it's basically a nested JSON structure, and then you use path expressions to pull data out. My interpretation of the dollar sign is that it essentially refers to the document in the column: $.level brings you the level value and returns 10; $.equipment.arm_left goes into the equipment object, chooses arm_left and returns the Shield of Faith; and you can return the full array of rings. And you run these queries using the function jsonb_path_query, for example to pull out the name. So these are some very interesting new features being added in version 12. If you're interested in that, definitely a blog post to check out.
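To give a feel for the syntax, here is a minimal sketch along the lines of the example described above (Postgres 12+; the exact JSON values are an illustrative reconstruction, not copied from the post):

```sql
-- A character sheet stored as jsonb.
CREATE TABLE characters (data jsonb);

INSERT INTO characters VALUES ('{
  "name": "Grimwald",
  "id": 1,
  "hp": 300,
  "level": 10,
  "class": "warrior",
  "equipment": {
    "rings": ["ring of despair", "ring of strength"],
    "arm_right": "Sword of Wisdom",
    "arm_left": "Shield of Faith"
  }
}');

-- "$" is the whole JSON document; dots walk into nested objects.
SELECT jsonb_path_query(data, '$.level') FROM characters;
-- returns 10
SELECT jsonb_path_query(data, '$.equipment.arm_left') FROM characters;
-- returns "Shield of Faith"
SELECT jsonb_path_query(data, '$.equipment.rings') FROM characters;
-- returns the full rings array
```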
[00:07:27] The next post is "Hypothetical Indexes in PostgreSQL", and this is from the Percona.com blog. Hypothetical indexes are essentially indexes that aren't really there, but you can pretend they are and see what the planner would do. There's a PostgreSQL extension for hypothetical indexes called HypoPG. First they give you instructions on how to install hypopg, and then you create the extension and get access to its functions, such as hypopg_create_index. This essentially creates an index just for your session: it's not an index that actually gets built, just a reference kept in your session's memory. When you're doing query planning with EXPLAIN, the planner takes the index you've created into account and uses the table statistics to assess how it would change the query plan. They have a section on testing hypothetical indexes: they create a table, insert some values and EXPLAIN a query selecting particular values from the table, and it does a sequential scan. But after they add the hypothetical index, the EXPLAIN plan uses it. Now, this was with a B-tree index. They also tried a bloom index, which apparently worked; BRIN kind of worked but had some unusual issues at times; and GIN was an index type they couldn't get working. But since most people use B-tree indexes, this could be pretty beneficial. So, definitely an interesting blog post to check out. If you have a large database in particular, this could let you test a couple of different index variations to see which the planner interprets as the most efficient path for executing your queries.
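The workflow looks roughly like this (a minimal sketch assuming the hypopg extension is available; the table and data are illustrative, not taken from the post):

```sql
CREATE EXTENSION hypopg;

CREATE TABLE orders (id int, customer_id int);
INSERT INTO orders
SELECT i, i % 1000 FROM generate_series(1, 100000) AS i;
ANALYZE orders;  -- make sure the planner has fresh statistics

-- Register a hypothetical btree index; nothing is built on disk.
SELECT * FROM hypopg_create_index('CREATE INDEX ON orders (customer_id)');

-- Plain EXPLAIN (not EXPLAIN ANALYZE) can now choose the hypothetical index.
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;

-- Inspect or discard this session's hypothetical indexes.
SELECT * FROM hypopg_list_indexes();
SELECT hypopg_reset();
```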
[00:09:26] The next post is "Linear Interpolation with PostgreSQL". They reference a previous post where someone wanted an IGNORE NULLS capability with window functions to fill in gaps in data. For example, maybe you have data for three days, then four days with no data, then data for the next two days; essentially you have gaps containing nulls, and the question was how to make those zero instead. This post takes it a step further. They have a climatological data set, so weather information, and they don't necessarily want to turn a null value into zero, because why are there missing values? As they say, "devices may break, networks might be down, disks may be full". So maybe the data shouldn't be zero; maybe it should be a value in between the known existing values. Essentially they use the technique of linear interpolation to estimate the missing values, and the author goes over a whole set of functions he developed in PL/pgSQL to do this. So if you're interested in this particular technique, definitely a blog post to check out.

[00:10:42] The next post is "Per-project Postgres", and this is from jamey.thesharps.us. He's talking about developing different projects, each needing its own database, and he has worked out a way to use a Postgres database that is, as he says, almost as convenient as SQLite. Basically he does a local install into a directory called "postgres" inside the current working directory. So instead of having one database cluster with all the different projects' databases residing inside it, he has essentially one Postgres cluster per project, within a local folder in that project. I've never seen this before, but it's definitely an interesting way to go, particularly if you have projects on different versions: if one is using 9.6 and one is using 11, I could see the benefits of doing this. He also mentions a few other tools that help him manage it: one is direnv, as well as a package manager called Nix. So if you have the need for per-project databases, maybe this is a blog post you'll want to check out.

[00:11:56] The next post is "Physical Replication Mechanisms in PostgreSQL". This is from the pgDash blog, and essentially they're talking about physical replication, otherwise known as streaming replication: streaming the WAL from a primary to a replica. You can also do log shipping, where you actually take the WAL files, copy them onto the replica and replay them there; that's an alternative. So the post covers WAL files, log shipping, recovery mode and replication slots, and basically tells you how to set up different standbys. It goes over how you can set up physical replication between a primary and one or more replicas. Now, related to this, I did do a video tutorial on this called "PostgreSQL Streaming Replication", so if you're interested in the video format, feel free to check out that piece of content as well.

[00:12:49] The next post is "Install PostgreSQL 9.6 with Transparent Data Encryption". This is from cybertec-postgresql.com, and they have actually developed a method of doing transparent data encryption. Basically, as the data is written it gets encrypted, without relying on the disks already being encrypted (which is one other way you can do it). With this approach, the database files get encrypted: tables, indexes, temp files and WAL. Things that are not encrypted are data in memory, the clog and the textual log. It's a patch they've developed for 9.6, and the post tells you how to install it along with some of the caveats. So if you're interested in doing transparent data encryption and you're using a version like 9.6, definitely a blog post to check out.

[00:13:43] The last post is "Announcing Timescale Cloud: the first fully-managed time-series database service that runs on AWS, GCP and Azure". As we've seen with some other open source tools (I'm thinking of Citus, which brought out Citus Cloud months ago), essentially they manage it for you. Here is a hosted TimescaleDB time-series database solution, and they talk about some of the benefits. One thing I thought was particularly interesting is their comparison with Amazon Timestream: based on their calculations, they say AWS Timestream is 10 to 100 times more expensive than Timescale Cloud. Of course, that seems like a very big discrepancy, but interestingly, they put these calculations in a public spreadsheet, which they link to. So if you're interested in a time-series database and you would like them to host it for you, this is a potential solution for that purpose. Definitely a blog post to check out.

That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to Scalingpostgres.com, where you can sign up to receive weekly notifications of each episode. Or you can subscribe via YouTube or iTunes. Thanks.
