Episode Transcript
[00:00:00] Hello. In this episode of Scaling Postgres, we talk about Auto Explain overhead, postgres data, lake, citus, shard, rebalance, and SSL authentication. I'm creston, Jameson. And this is scaling postgres episode 157 one.
[00:00:24] Alright, I hope you, your friends, family and coworkers continue to do well. Our first piece of content is can Auto Explain with timing have low overhead? This is from Pgmuster.com, and they're talking about an extension that is called Auto Explain. And what it does, when it detects a slow query, it automatically generates an explain plan in the log so that you can check it out and see the explain plan for that slow query. But as he mentions here, it says a quote many places warn about its overhead, but I found concrete information difficult to come by. So he's heard anecdotal numbers in the range from around 3% to about a factor of two. So a pretty wide range. But they wanted to look at this more in depth. One thing to keep in mind when doing an Explain analyze is that the system clock can have an overhead depending upon the system that you're using. And he has a link to an article from Angres that we actually covered in a previous episode of Scaling Postgres that talks about this. And there's actually an application included with postgres PG test timing. So you can check out how slow your system clock is for the system you're using it on. Now, for the particular one he used here, it looks like he's doing it on a Mac with about four gigs of Ram, two cores. So relatively small system. But running the PG test timing, it looked like 90% of the calls were less than a microsecond. So that's pretty good. So the first thing he wanted to do is just take a baseline. So he used PG benched to get a baseline of different latencies, and he came up with 1.5 milliseconds in terms of the average latency. Now then he configured Auto Explain, adding it to the shared preload libraries, and set it to take a full sample. So sample everything for all statements. So basically every statement will get an Explain analyze. Now, when he ran PG bench again, he saw a 26% higher increase in latency, so that's not insignificant. So of course, the thing you really want to do is only log the slowest. So here he's taking a full sample, but he's only looking at queries that run longer than ten milliseconds. When he runs PG bench again, he only sees a 0.8% difference in terms of being slower. So that's very negligible. But it hasn't really started doing the analyze yet. So in this next step of doing the analyze without the timing of each step, he turned a log analyze onto True. And with running PG bench again, it's only 0.3% slower than the baseline. So again, the analyze is not taking up a lot of time as long as you're only doing it for these low queries. Now he added the timing in to be able to track the time of each step. So we did log timing, true ran PG bench again and this time it's 2.2% slower. So definitely great information to have in that. If you're using Auto, explain that you definitely want to focus on only auto explaining the slowest queries that even if you have timing on it's not going to be a significant hit to your database performance. Now, we actually decided to add a lot more parameters to the logging in terms of triggers and settings and buffers, et cetera, as well as the JSON format. He ran it again and it was actually a little bit faster than the 2.2% of the previous one that just turned on the timing. He can't really explain this, but this may be having to do with variation, depends on how many times he's run these tests and what the standard deviation is. But overall it looks like as long as you're not trying to track every statement, the hit to the database is around the 2% mark, at least for the load he was putting on the database. In this example, your experience may be different based upon your load, based upon your system, but definitely interesting information. And if you want to find out more, I encourage you to check out this blog post.
[00:04:19] The next piece of Content build your own data lake for reporting purposes in a multi services environment. This is from Fretlink.com. So they're talking about a data lake that they set up essentially on postgres. Now, they start off talking about some of the reporting issues that they were encountering in terms of the demands for different types of data and trying to satisfy those needs from different customers, from internal users, et cetera, as well as issues with data quality, with terms of producing spreadsheets and passing them around. So it's a good read to kind of get an overview of what the environment was like. But the way that they approached resolving these situations starts off essentially in this graph here. So basically they broke up their monolithic application into three separate services a career service, a pricing service, and an expedition service. Now the main tables of the application, so tables are in green. You can see some of the table names here, here and here. They're all a part of the public schema. But what they did is they created a separate schema called reporting and then they created views within those reporting schemas. So basically, the developers who are managing the service defined an interface essentially in views on what data can be pulled out for reporting purposes and give the definition of the columns that are defined within that view. So they did it for each one of their services. The next step is to use streaming replication to stream that data into a separate reporting server. So my assumption that this is one cluster that just has multiple databases that reflect what is coming from each of the application services. So essentially it's a one to one copy. Then with that data, in this separate instance, they created a separate reporting database and that reporting database had a schema for each service. So the reporting database has a carriers schema that uses a foreign data wrapper to access its carrier service database and it essentially accesses the reporting views, then the pricing schema in the reporting database accesses the pricing service database and accesses the reporting views there to get the information needed and same thing for other services. So this essentially allows them to query this one reporting database and do queries across different databases, essentially by using these schemas. Now, one advantage of this is that pretty much all of these queries will be against essentially real time data. Now, they didn't mention that they're using materialized views or any kind of summary tables or things like that. Now that would cause the data to be not real time, but you could then optimize access to that data with queries. So this structure may work at a certain data level, but once you're hitting billions, tens of billions of rows, you may need to implement something slightly different, something less real time to handle performance for reporting purposes. Now, they go ahead and include the different scripts that they use to set this up so you can check that out. And then what they also did is took it another step to integrate other external data into the system as well. So in this example, they're showing how they have information located in Google Sheets or some other external service. Basically they set up a way to synchronize it with a separate database within the reporting server and then they do the same process where the reporting database has a dedicated schema for that data and reads it through a foreign data wrapper and the same thing for any other external data that they access. And then they follow up the post determining how they've automated all of this to be able to run and process data for them, following up with using Metabase to give access to this data in a visual and graphical format. So, this is a very interesting post of a process they went through to essentially put data in the hands of their end users and allow them to query it. So, if you are interested in that, you can check out this post from Fretlink.com.
[00:08:19] The next piece of content is scaling out postgres with the Citus open source shard rebalancer. This is from Citusdata.com and this is talking about again the open source release of Cytus Ten and how it now supports being able to rebalance shards that you have across the scale out of Citus. So they give an example here of why this is important. So maybe you start your Citus cluster with two nodes and you have three shards here, three shards here. Well, if you add a node C, you need some way to move those shards to node C because essentially node C is not being used. Now that's where the rebalancer comes in. It can take shards from other nodes and repopulate them to C and essentially rebalance across the cluster no matter how many nodes you have. Now they also mentioned there's a couple of different rebalancing strategies. So number one is by shard count, so that's essentially what this is. Node A had three, node B had three. So you essentially move one from A and one from B to node C. So now they each have an equal number of shards. The other way to do it is by disk size. Now this can become important if you're say sharding by say, customer set of data and you have large customers and small customers, well, you could rebalance by the data size. So that you're essentially balancing the data size between your nodes like here. So you move a large customer three to node A and the small customer two to node B. But they also have other examples of rebalancing strategies that you can use or develop your own. And so they have some information about that here. They also talk about a way to actually shrink your cluster. So in addition to scaling out, you can also scale down if you don't need all of those nodes that you've scaled up to. And there's a citus drain node command to do that. So essentially it rebalances the shards, collapsing them into fewer nodes. Now one downside to this they did mention is that during a rebalance process the data is readable but it's not writable. So if you need to do that, you need to take into account that okay, if you're going to be rebalancing a shard, you can't actually write to that data at that time. Now they did say on the Azure version of Citizens, there is a way around that, but not for the open source version at this time. But if you want to learn more about this, definitely check out this post from cited.com the next piece of content setting up SSL authentication for PostgreSQL. This is from CyberTech postgresql.com. They're talking about just setting up SSL on your postgres server. So in the Postgresql.com file there are a number of settings that you can set. And basically the main one is setting SSL on as long as you have a cert file and a key file. And of course there's a number of other parameters that you can adjust for SSL. Now you can do a reload of the configuration at this point, but it won't actually work yet. You also need to do a restart to get it up and working. Now the next thing you need to change on the server side is the Pghba comp file because you need to use a type of host SSL for it to enforce those SSL connections. So you need to move from just a host type to a host SSL type. Now, the next part of the post actually talks about getting that certificate file and certificate key. You can of course purchase those, but this runs through the process of doing your own self signed certificate and learning how to set that up. Now, the next thing they mentioned is that in terms of clients connecting to a postgres server with SSL enabled is that there's numerous different ways you can connect to it and the client can define the following SSL modes all the way from disable to verify full. But of course the server also has to support SSL for these to work. So you can at the point of connection, say SSL mode disabled to not get any SSL. Or usually this I would say is one of the more common option is require. So basically it's going to require SSL connections when it connects. Or maybe you might want to do prefer if your server still supports non SSL. But if you want the full protection, having a certificate on the client and doing a verify full does give you, as they say here, the strongest protection possible because you are validating the actual certificate that the server is using and the client trusts the signer of that certificate. So if you want to learn more about setting up SSL on postgres, definitely check out this blog post.
[00:12:49] The next piece of content enhancing PostgreSQL 13 Security with the CIS Benchmark this is from Crunchydata.com and this is a new version of the CIS PostgreSQL benchmark that's been published. And CIS stands for the center for Internet Security, which is a nonprofit organization. So crunchydata has helped to develop a Security Technical Implementation guide for postgres and it covers things like installation and patches, directory file permissions, logging, monitoring, auditing, user access and authorization, connection and login, PostgreSQL settings, replication and special configuration considerations. Now, this is one standard. Another standard from a post, also this week is from Enterprisedb.com and it's how the EDB postgres Stigs can be used to secure your PostgreSQL database. These are a set of standards. Or an Stig is a security technical implementation guide that is based upon the Defense Information Systems Agency. So this is the DoD, the government organization set of standards. So they have developed a Security Technical Implementation guide for postgres. And this post talks about all of that and using it to secure your postgres systems. So either of these two posts are great ones to look through to help you follow a standardized process to secure your postgres installations.
[00:14:15] The next piece of content is monitoring PostgreSQL with Nagios and checkmake. This is from Haigo CA, and it's talking about setting up nagios for doing monitoring along with checkmake, although I don't know if that's how that's pronounced, but they go through the process of all the commands you would need. To run to set this up on Redhead Enterprise Linux installing Postgres and then setting up nagios with the command files to start monitoring Postgres as well as adding checkmake to be able to give a more, they say, user friendly view of monitoring your Postgres installation as well. And apparently this uses check postgres under the covers which basically does queries using psql against the database to check number of different parameters. So it gives you more than just node information but also relevant database statistics as well. So if you're interested in that, check out this blog post.
[00:15:10] The next piece of content is musings of a PostgreSQL data pontiff. Episode One this is from Crutchydata.com and this gives a little bit of the background where they're going to be talking about a data science blog series. So this mostly gives some background on the author. The second post was also made and it's musings of a PostgreSQL data pontiff. Episode Two hot Magic analytics and graphical output with PLR here. Basically they're using Postgres to do data analysis within it and using functions within postgres to do that data analysis and then using it in conjunction with the procedural language R to generate different plots such as this. So if you are interested in learning more data science with Postgres, definitely check out these two postgres from Crunchydata.com.
[00:16:02] The next piece of content. Traveling salesman problem with PostGIS and PG routing. This is from Cybertechnposgresql.com. They're talking about essentially the traveling salesman problem where a salesman has to hit a number of points in their sales route and what is the most efficient route to do that. This post discusses using postgres and the PG routing extension to be able to accomplish that and define the shortest route between multiple set of points. So if you're interested in that type of geo analysis, definitely check out this post next piece of content. Introducing Pgquery 2.0 the easiest way to parse Postgres Queries this is from Pganalyze.com and they've released their new tool PG Query in order to parse postgres queries. And it actually uses the postgres parser to do it because it's developed as a C library that works with postgres to be able to parse the queries and they actually use it in their tool PG Analyze. So they've open sourced the query parsing part. They have a number of libraries in other languages that can use it, such as Ruby and Go and others have written others that actually use the CD library to do this query parsing. So if you're interested in that, definitely check out this blog post.
[00:17:20] The next piece of Content cloud native PostgreSQL for application Developers this is from Enterprisedb.com and by Cloud native postgres, this is a name they've given to their Kubernetes operator at EDB. And this post actually talks about using this Kubernetes operator for application developers to be able to manage your postgres instances if you're doing testing, I assume with multiple different ones. So if you're interested in that you can check out this blog post next piece of content announcing the Crunchy Postgres Operator 4.6 with rolling updates, Pod Tolerations, Node, Affinity, and more. This is from Crunchydata.com, so if you want to learn more about their new version of their postgres operator, definitely check out this blog post.
[00:18:03] And the last piece of content is the PostgreSQL Person of the Week is Anastasia Lubunyakova. So if you're interested in learning more about Anastasia and your contributions to Postgres, definitely check out this blog post that does it. For this episode of Scaling Postgres, you can get links to all the content mentioned in the show notes. Be sure to head over to Scalingposgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube. RyTunes thanks.