JSONB vs. Window Functions, Security Checklist, Full-Text Search, NFL Time-Series | Scaling Postgres 176

Episode 176 | August 01, 2021 | 00:18:51
Scaling Postgres

Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, we discuss JSONB vs. window functions, a Postgres security checklist, implementing full-text search and using NFL timeseries data.

To get the show notes as well as get notified of new episodes, visit: 

 https://www.scalingpostgres.com/episodes/176-jsonb-vs-window-functions-security-checklist-full-text-search-nfl-timeseries/

Episode Transcript

[00:00:00] In this episode of Scaling Postgres, we talk about JSONB versus window functions, a security checklist, full-text search, and NFL time series. I'm Creston Jamison, and this is Scaling Postgres, episode 176.

[00:00:24] All right, I hope you, your friends, family and coworkers continue to do well. Our first piece of content is "Modern data analysis with PostgreSQL: JSONB throws window functions out the…". This is from Swarm64.com. They're talking about a scenario where you have, say, orders you're collecting and you want to track the delivery status of each order. So you have an orders table and then you have, say, an order shipment status table. When you insert information into the shipment status table, you include the order ID, which is the key on the orders table, the timestamp, as well as whatever the status is, so "loaded into van" in this example. So you would have many statuses within this table per order; it's a one-to-many relationship. Now then, the question is: how do you find out if a package was lost and its previous status was "loaded into van"? Show me those records. And this is the query that the author used to retrieve that information. It's pretty complex, it can't use indexes all that well, and in the scenario they mentioned there's a lot of data in it, so this query is just going to be pretty slow. But there's another way to do it using JSONB, and that is to create an array of the statuses in a JSONB column, where each object in the array contains the timestamp as well as the status. So when you have a new status, you merely append it to the existing array. And this is what you would do: you just update that individual order, appending the status to it, as opposed to inserting into the order shipment status table. Now, he did say that in order to optimize this, you probably want to set a fillfactor less than 100 for this table so that it won't have to create new pages all the time. But with this approach, your query to answer that question of "show me the records that have become lost, where the last status was loaded into the van" becomes much simpler. So they're using this technique here, essentially the minus one subscript to grab the last item in that JSONB array and then get its status. And similarly, you can use minus two to get the second-to-last element. So basically, look for a second-to-last element of "loaded into van" and a last element of "lost". And what's great about this is you can add another AND clause looking for the instances where the shipment status contains, using the containment operator, both "loaded into van" and "lost". Then you can apply a GIN index on that shipment status column to get a very fast index search. So essentially your query looks like this, and it uses an index very, very quickly. Now, this does run counter to the typical relational advice of having, say, a statuses table where you're listing each status, but for quick queries this is a great technique in my opinion. You have gone from doing inserts, which are typically non-locking, to something that's an update, so you do incur more locking, but hopefully each order's status wouldn't be updated that frequently, and overall the benefits might outweigh the disadvantages. So if you're interested in learning more about this, definitely check out this blog post.
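To make the pattern concrete, here is a minimal sketch of the JSONB approach described above. The table and column names (orders, shipment_status, order_id) are assumptions for illustration, not the article's exact schema.

```sql
-- Assumed schema: one orders table carrying the status history as a JSONB array.
CREATE TABLE orders (
    order_id        bigint PRIMARY KEY,
    shipment_status jsonb NOT NULL DEFAULT '[]'
) WITH (fillfactor = 90);   -- leave free space so updates can stay on the same page

-- Append a new status object instead of inserting into a separate status table.
UPDATE orders
SET    shipment_status = shipment_status
           || jsonb_build_object('ts', now(), 'status', 'loaded into van')
WHERE  order_id = 42;

-- GIN index so the containment check below can use an index.
CREATE INDEX ON orders USING gin (shipment_status);

-- Orders whose last status is "lost" and whose previous status was "loaded into van".
-- The @> predicate narrows the candidate rows via the GIN index; the -1 / -2
-- subscripts then verify the exact ordering of the last two elements.
SELECT order_id
FROM   orders
WHERE  shipment_status @> '[{"status": "loaded into van"}, {"status": "lost"}]'
AND    shipment_status -> -1 ->> 'status' = 'lost'
AND    shipment_status -> -2 ->> 'status' = 'loaded into van';
```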
[00:03:42] The next piece of content is "PostgreSQL security: things to avoid in real life". This is from Cybertec-postgresql.com. Now, overall, when I think about security, I think of a defense-in-depth scenario. So basically, try to avoid putting your Postgres server on the Internet; ideally, don't have a publicly accessible Internet address for it. Also implement network firewalls and all sorts of additional security. This checklist focuses on what you can do in PostgreSQL specifically, but a defense-in-depth approach is the best way to do security. Their first piece of advice is to avoid relaxed listen_addresses settings. Basically, keep listen_addresses at localhost if you don't need to connect from outside the Postgres instance, and if you do, narrow it down to whatever network interface you're using. The second recommendation is don't use trust in pg_hba.conf. Basically, never use trust, because that allows anyone to get in without a password. Beyond that, they recommend using SCRAM-SHA-256 for your passwords as opposed to MD5, and of course enforcing SSL encryption as well. They also mention avoiding pg_hba.conf entries that allow connections from anywhere, because ideally you want to narrow them down to only the hosts that need to contact the database, and also narrow down the databases and users that are allowed access and what they can access. The third recommendation, which was mentioned up above, is to use SCRAM-SHA-256 encryption for passwords, not MD5. The fourth is to handle public permissions on schemas and databases. Basically, the recommendation is to revoke everything on particular databases and schemas from PUBLIC, which prevents a user from changing that database and potentially creating rogue objects and filling it up. The next recommendation is to avoid using ALTER USER ... PASSWORD, because this can potentially be logged if you have your logging set up such that it would capture this type of statement. They recommended using other Postgres clients to do it instead. For example, psql has a \password command you can use to change the password, and pgAdmin has a change password dialog, so they recommended doing that as opposed to sending a plain-text statement to the database. I would expect at some point that Postgres may offer filtering for this type of command, but I'm not aware of anything like that in the works. The next recommendation is to make use of ALTER DEFAULT PRIVILEGES, so when you're setting privileges on objects, also set the default privileges to ensure that new objects pick up those changed permissions.

[00:06:29] Number seven: use SSL, of course, so your connections are secure. Eight: write SECURITY DEFINER functions safely. This is talking about how, when you create functions, they normally run as the logged-in user who calls them, but a SECURITY DEFINER function runs as its owner, who may have additional privileges, so you want to be very cautious with that. The next recommendation is to avoid SQL injection in database functions. Of course you need to do this in your application to avoid SQL injection attacks, but you also need to do it in your functions. So, for example, if you do something like this, where you just accept a table name from something the user sets, they could easily fill up your database by calling the function with the table name replaced by a generate_series call, or they could add a UNION to pull out additional data from tables they maybe don't have access to.
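Several of these recommendations boil down to a handful of statements. Below is a hedged sketch (the database, schema, and function names are made up for illustration) of the permission-related items, followed by the dynamic-SQL injection point the post warns about, contrasting string concatenation with format() and %I identifier quoting.

```sql
-- Prefer SCRAM-SHA-256 password hashing over MD5 (existing passwords must be re-set).
ALTER SYSTEM SET password_encryption = 'scram-sha-256';
SELECT pg_reload_conf();

-- Lock down PUBLIC on a database and schema (names are illustrative).
REVOKE ALL    ON DATABASE appdb  FROM PUBLIC;
REVOKE CREATE ON SCHEMA   public FROM PUBLIC;

-- Make sure objects created later get restrictive permissions too.
ALTER DEFAULT PRIVILEGES IN SCHEMA public REVOKE ALL ON TABLES FROM PUBLIC;

-- SQL injection in a function: the vulnerable version concatenates the
-- caller-supplied name straight into the query string...
CREATE OR REPLACE FUNCTION row_count(tabname text)
RETURNS bigint
LANGUAGE plpgsql AS $$
DECLARE
    n bigint;
BEGIN
    EXECUTE 'SELECT count(*) FROM ' || tabname INTO n;   -- injectable
    RETURN n;
END;
$$;

-- ...while the safer version uses format() with %I, so the argument is
-- quoted as a single identifier and cannot smuggle in extra SQL.
CREATE OR REPLACE FUNCTION row_count_safe(tabname text)
RETURNS bigint
LANGUAGE plpgsql AS $$
DECLARE
    n bigint;
BEGIN
    EXECUTE format('SELECT count(*) FROM %I', tabname) INTO n;
    RETURN n;
END;
$$;
```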
So it's definitely important to avoid that, and you can use the format function to construct SQL query strings safely if you need dynamic SQL. The next recommendation is to restrict superuser access as much as possible; again, that's standard for all computer security. The next is to update your PostgreSQL database on a regular basis: keep things patched and updated. The next recommendation is encrypting your entire server. Now, they're talking about an open-source, free tool that they developed, PostgreSQL TDE, but you could also do encryption at rest at the operating system level to make sure your disks are encrypted, because it looks like that's what TDE does: it encrypts between what's in memory and what gets placed on the disks, so it's basically encryption at rest. So if you're interested in learning about the details of this post, I definitely encourage you to check it out.

[00:08:18] The next piece of content is "Postgres full-text search: a search engine in a database". This is from Crunchydata.com, and this is a great post to just get started working with full-text search in what I would say is the recommended way to do it, to give you the most speed. So you can write statements where you're just doing a LIKE against some text string, but that's inefficient; indexes can't really be used with this type of search. Full-text search allows you to get around some of that. Their recommendation, when you want to search a particular field, is that ideally you should create a tsvector column in that table using the new generated always feature in Postgres 12, so it's kept up to date. What you store in that tsvector column is the result of the to_tsvector function, giving it whatever dictionary you want to use — it could be english, it could be another language, or simple, maybe in the case of names — as well as the column you want to index. If you're using a version prior to 12, you can use triggers to do something similar. Then you create a GIN index on that new tsvector column. And now, using a query such as this, using the match operator (@@) and to_tsquery, making sure that you use the same dictionary, and giving it a term you want to search for, you get a very, very fast search, because it is able to use this GIN index to quickly find the relevant information. In addition to searching a single word like this, you can do things such as using the ampersand to say this term AND this term AND this term. You can also use the followed-by operator so that it's only looking for a phrase, "rain of debris" in this example, but there's also phraseto_tsquery, which builds that for you just by using this different function instead of to_tsquery. And there's also websearch_to_tsquery to simulate what a web search looks like. Lastly, the post talks about weighting and ranking different results. When you build that column, you can assign a weight to each column that goes into it. So you can append multiple columns into this one tsvector field of the table and assign weights: this one, the event narrative, is an A, and this one, the episode narrative, is a weight of B, so certain results will be ranked higher than others based upon the weighting that you assign. And then, to get that to appear in your results, you do an ORDER BY and use the ts_rank function with your column and your search query, and you get ranked results.
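Here's a rough sketch of that setup, assuming a hypothetical storm_events table with an event_id key and the event_narrative and episode_narrative columns the post refers to; the generated column requires Postgres 12 or later.

```sql
-- Keep a weighted tsvector up to date automatically (Postgres 12+ generated column).
ALTER TABLE storm_events
    ADD COLUMN search tsvector
    GENERATED ALWAYS AS (
        setweight(to_tsvector('english', coalesce(event_narrative,   '')), 'A') ||
        setweight(to_tsvector('english', coalesce(episode_narrative, '')), 'B')
    ) STORED;

CREATE INDEX ON storm_events USING gin (search);

-- Multi-term search via the match operator and to_tsquery.
SELECT event_id
FROM   storm_events
WHERE  search @@ to_tsquery('english', 'hail & wind');

-- Phrase search: phraseto_tsquery builds the followed-by operators for you.
SELECT event_id
FROM   storm_events
WHERE  search @@ phraseto_tsquery('english', 'rain of debris');

-- Web-style input plus ranking; A-weighted (event_narrative) matches rank higher.
SELECT event_id,
       ts_rank(search, websearch_to_tsquery('english', 'rain of debris')) AS rank
FROM   storm_events
WHERE  search @@ websearch_to_tsquery('english', 'rain of debris')
ORDER  BY rank DESC
LIMIT  10;
```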
So this is a great post that introduces you to full-text search and how to very quickly get set up to do really fast text-based searches of your information.

[00:11:09] The next piece of content is "Hacking NFL data with PostgreSQL, TimescaleDB, and SQL". This is from Timescale.com, and they're talking about some public data the NFL has put out that tracks a ton of information for some games, because apparently NFL players have been equipped with RFID chips that report data at 100-millisecond intervals for each play, describing speed, distance, the players, et cetera. It's essentially a big bucket of time-series data. This post goes through actually getting access to this data, how to use it and explore it, as well as some queries to analyze it. So I thought this post was pretty interesting. Now, you don't need TimescaleDB to do this analysis — they talk about where they have some features that could be advantageous for giving you more performance — but you can just use normal PostgreSQL, and they walk through all the different queries they ran to analyze this data set. So if you're interested in that, you can check out this blog post.

[00:12:16] The next piece of content is "PostgreSQL logical replicas and snapshots: proceed carefully". This is from Ardentperf.com. They're talking about a post that was put up about setting up a logical replica with Postgres by first doing a snapshot in RDS to get the initial data load and then turning on logical replication to start streaming the data. And a lot of the recommendation you see in this post is: if you're going to go this route, be very careful. Now, I've heard of different use cases for snapshots to try and copy data quickly from a running Postgres instance, and I've always been very hesitant about it, and "proceed carefully" is basically what I would advise. If you're doing any kind of snapshotting — basically not using the tools that Postgres provides to do backups — I would spend a lot of time validating that you can restore that data reliably and that you're not going to have any issues, because you don't want to run into a problem where you can't bring your database back up. And this is similar, in that you're copying a database from one place to another because you're logically replicating it, and you don't want issues to crop up or transactions to be missed, like they're talking about at the snapshot point: if there are transactions in flight, what's going to be their status? Are they in the WAL or not? So there are a lot of questions about it, but if you're interested in learning more about this discussion, definitely check out this blog post.

[00:13:43] The next piece of content is "Increasing max_parallel_workers_per_gather in Postgres". This is from pgMustard.com, and they're talking about how, over the years, each Postgres version has added more and more parallel functionality, which allows you to utilize all the cores of a machine when running one query. Now, if you have a lot of connections to your database in a transaction-processing application, you're going to be using all of those cores anyway, because the different processes are going to be using different cores. But this addresses the case where you want one really long query to be able to use multiple cores of the machine to answer a question. And by default, they say, max_parallel_workers_per_gather — that's essentially per query — is limited to two.
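As a quick, hedged sketch of working with that setting (the parameter names are real; the values are only illustrative, not recommendations):

```sql
-- Current per-query limit; the default is 2.
SHOW max_parallel_workers_per_gather;

-- Raise it for the current session only, e.g. before a big analytical query.
SET max_parallel_workers_per_gather = 8;

-- Related settings discussed below. max_worker_processes needs a restart;
-- the others take effect as new defaults on reload.
ALTER SYSTEM SET max_parallel_workers = 16;    -- parallel-worker pool for the whole server
ALTER SYSTEM SET max_worker_processes = 16;    -- overall background-worker limit
ALTER SYSTEM SET work_mem = '64MB';            -- each parallel worker can use this much
SELECT pg_reload_conf();
```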
So if you have more cores in your machine, you may want to raise that to use more cores and answer questions faster, particularly if you're doing any kind of online analytical processing. They show the parameter as it exists and how you can increase it per session if you want to. And there are a few other settings you're probably going to want to adjust: work_mem, because each worker can utilize it separately, the total max_parallel_workers for the system, as well as max_worker_processes. They also mention some costing parameters you can adjust in the Postgres configuration that determine whether you're going to get a parallel query or not. So if you want to learn more about that, definitely check out this post.

[00:15:11] The next piece of content is "pgBackRest async behavior". This is from fluca1978.github.io, and he's talking about pgBackRest from an async perspective. Now, in terms of async, they're talking about the management of the WAL. When you're running the database in archive mode, you essentially need to archive all those WAL files somewhere, and pgBackRest allows you to push that WAL in an async manner. So if you're archiving to, say, S3 or a storage system that has some latency for transferring files, you're going to want to use an async method, whether you're transferring files individually or doing bulk loads to S3 on a periodic basis. So async becomes very important in that use case. The other case where it becomes important is when you're doing a restore, because you're going to need to read back those WAL files. And if they're stored on S3, doing a round trip for every WAL file needed will take a long time, and the database may not even be able to catch up to a primary if that's what it's trying to do. Therefore, the async mode becomes important there too, because you can pull down a batch of files in advance of when they're needed to be applied to the database during the restore operation. So if you want to learn how pgBackRest can help you with that, this post is a great example, walking you through two different instances of using these async features and their behavior.

[00:16:41] The next piece of content is "Deploying Pgpool-II Exporter with Docker". This is about using the Pgpool-II Exporter, which provides metrics to Prometheus. So if you're interested in setting this up for Pgpool-II, you can definitely check out this blog post from b-peng.blogspot.com.

[00:17:01] The next piece of content is "Improve PostgreSQL query performance insights with pg_stat_monitor". This is from Percona.com, and they're talking about a new extension that they developed called pg_stat_monitor. It basically takes some of the data that's output by existing extensions and features, such as pg_stat_statements, which gives you a cumulative record of the statements that have been run against the database; pg_stat_activity, which shows you in real time what statements are running against the database; as well as auto_explain, which, if a statement takes too long, will automatically run an explain on that statement and place it in the log for you. It basically combines these functions into one extension. So if you're interested in checking out this new extension, you can definitely check out this blog post.

[00:17:53] The next piece of content: the PostgreSQL Person of the Week is Roman Druziegen.
So if you're interested in learning more about Roman and his contributions to Postgres, definitely check out this blog post.

[00:18:06] And the last piece of content: we did another episode of the Rubber Duck Dev Show this week. This episode was on handling exceptions in your programs, so if you're interested in that, you can check out that episode. This week's episode will be on writing secure code and will go live on Wednesday at 8:00 p.m. Eastern Standard Time.

[00:18:27] That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to Scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.

Other Episodes

Episode 114

May 18, 2020 00:20:28

Data Directory, Vacuum Features, Binary Performance, Network Latency | Scaling Postgres 114

In this episode of Scaling Postgres, we discuss the Postgres data directory, new vacuum features, binary storage performance and the impact of network latency....

Episode 42

December 09, 2018 00:12:14

Multi-Terabyte Scaling, Kubernetes DBaaS, Encryption, High Availability | Scaling Postgres 42

In this episode of Scaling Postgres, we review articles covering multi-terabyte scaling, building a kubernetes DBaaS, encryption and building high availability. To get the...

Episode 89

November 11, 2019 00:15:42

Performance Impressions, Mystery Solving, GiST Performance, K-Anonymity | Scaling Postgres 89

In this episode of Scaling Postgres, we discuss Postgres 12 performance impressions, mystery solving, GiST and SP-GiST performance and k-anonymity. To get the show...
