Impact of Virtualization, Compressing Data, Parallelism | Scaling Postgres 18

Episode 18 | June 25, 2018 | 00:17:31

Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, we review articles covering the impact of virtualization on databases, techniques to compress your data and parallelism.

To get the show notes as well as get notified of new episodes, visit: 

https://www.scalingpostgres.com/episodes/18-impact-virtualization-compressing-data-parallelism/


Episode Transcript

[00:00:00] In this episode of Scaling Postgres, we talk about the impact of virtualization, compressing data, parallelism and cross-cluster monitoring. I'm Creston Jamison and this is Scaling Postgres episode 18.

[00:00:21] All right, our first article is the State of Developer Ecosystem Survey in 2018 by JetBrains, and we're looking at the database section here. I just like to take a look at these when they come up. The first question is which databases do you regularly use, if any? And PostgreSQL came in a respectable second to MySQL. But what's interesting is the question, do you plan to adopt or migrate to other databases in the next twelve months? PostgreSQL comes in third, but it's definitely the highest rated relational database that people are moving to; the next highest ones are maybe SQLite or Oracle. So, definitely some interesting stats to be aware of.

[00:01:13] The next post is The Impact of Virtualization on Your Database, and this is from the ScyllaDB blog. This post compares a bare metal instance to a virtualized instance, prompted by a new instance type that is now available in AWS called i3.metal. Now, the i3 series is a storage-focused instance type. Normally for databases I see people using the memory-optimized instance types, but this is a storage instance type and it's the only one available as bare metal right now. So that's what they were comparing. And this isn't necessarily PostgreSQL specific, but it's just in general taking a look at some of the differences here, because a lot of people run PostgreSQL, maybe as an RDS instance, or they're using virtual machines to run it. So what kind of an impact can this have, looking at a metal instance compared to a virtualized one? These are pretty much identical configurations; they actually cost the same amount, about $5 per hour, but just going to the metal gets you some more cores and a little bit more memory, so there's already a little bit of a performance gain due to that. They did some different analyses here, and basically the metal one came out ahead, but I'll let you take a look at this post. Looking at the conclusion, they have this quote: "We showed in this article that despite having 12% more CPUs, i3.metal can sustain a 31% higher write throughput and up to eight times lower read latency than the virtualized i3, confirming our expectation that removing the hypervisor from the picture can also improve the efficiency of resources." So in the interest of scaling out your PostgreSQL database instance, if you have opportunities to use bare metal, that can give a performance boost to your application. So just something to keep in mind.

[00:03:24] The next post is Scaling Past the Single Machine, and this is from the Mode Se blog. Now, upon seeing this title I was thinking sharding, but they're actually doing something a little bit different. They make the assessment that they're just about to hit the stage of big data, and a lot of the data they are collecting is of a time series nature. So they're getting massive amounts of data in, and they wanted to find a good way to help manage the growth of that data while still granting access to it. Now, how they're doing this is interesting: using native partitioning to partition out the data, then storing some of those older partitions on a second database and accessing them through a postgres foreign data wrapper.
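As a rough illustration of that partition-and-archive approach, here is a minimal sketch of exposing older data that lives on a secondary server through postgres_fdw. The server, schema, table and credential names are all hypothetical, and this is not the exact setup from the post:

```sql
-- Minimal sketch (hypothetical names): make archived data on a secondary
-- server queryable from the primary via postgres_fdw.
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

CREATE SERVER archive_server
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'archive-db.internal', dbname 'metrics_archive');

CREATE USER MAPPING FOR CURRENT_USER
    SERVER archive_server
    OPTIONS (user 'archive_reader', password 'secret');

-- Pull in only the archived table(s) as foreign tables on the primary.
CREATE SCHEMA IF NOT EXISTS archive;
IMPORT FOREIGN SCHEMA public
    LIMIT TO (readings_2017)
    FROM SERVER archive_server
    INTO archive;

-- Historical data is now reachable from the primary without storing it there.
SELECT count(*) FROM archive.readings_2017;
```

Depending on your PostgreSQL version, a foreign table like this can even be attached as a partition of a locally partitioned table, so queries against the parent transparently reach the archived rows.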
So basically they're archiving data to a secondary database server and granting access to it through a foreign data wrapper. A pretty interesting technique to use. Now, the reason they were doing this is because they were literally using a file system that enables them to compress the data significantly. And they tried two different ways. They looked at the cstore_fdw foreign data wrapper, which was pretty good in terms of insert transactions per second and pretty efficient on disk use, but select performance from it was very poor. However, with the technique of using Btrfs with Zstd, which is basically a file system with compression applied, they were able to get savings on the disk being used, high insert transactions-per-second performance, and select performance approaching that of no compression. And they go over the steps they used to put this into place. For their overall solution they have a quote here: the end result is a reduction of disk usage to 12% of the original, so pretty big savings, while keeping query performance at around 90% of the original. This is an interesting technique I haven't seen before, but it seems like a huge win if you are concerned about how much data you're storing. So, definitely a blog post to check out and an interesting technique to take a look at.

[00:05:48] The next blog post is part of the Fun with SQL series that Citus Data has done, and it's on functions in Postgres. So if you're a developer, this is kind of a catch-all of some different functions you can use with PostgreSQL.

[00:06:03] The first one they talk about is arrays and how you can store arrays in a single column within PostgreSQL. Now, even though there is an array data type, you can also use functions that aggregate a typical column into an array. So, for example, this aggregates email addresses into an array, and the result will look like this. But you can use an additional function that they show here, array_to_string, which actually writes them out separated by a comma and a space. So there are a couple of functions you could potentially use if you want to do this without doing it on the application side.

[00:06:38] Another one they talk about is some time functions, where you can look at differences between times. So for example, now() gives you the current timestamp, and you can look at differences from that point in time using intervals; this one goes back seven days. You can also use date_trunc by day or by week. So these are other functions that help you manipulate time-based data. Then they go over JSON and the different JSON functions that are available for working with the JSON and JSONB data types, particularly talking about the different operators you can use with them. So, a pretty short post, but if some of these are new to you, definitely check out the blog post; a few quick examples follow below. And again, this is from the Citusdata.com blog.

[00:07:28] The next post is Using force_parallel_mode Correctly. This is a pretty good post and gives you some good insight into some of the parallelism features of PostgreSQL. But a good summary of this post is that if you're not debugging code that you're writing, you should probably never use force_parallel_mode, because it has a particular purpose when doing development in terms of testing out parallel queries.
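For reference, here is roughly what the array, time and JSON functions mentioned in the Citus Data post above look like. The users table and its email column are made-up examples, not taken from the post:

```sql
-- Aggregate a regular column into an array (illustrative table/column names).
SELECT array_agg(email) FROM users;

-- Join the array elements into one string, separated by a comma and a space.
SELECT array_to_string(array_agg(email), ', ') FROM users;

-- Time functions: the current timestamp, going back seven days,
-- and truncating to the start of the week.
SELECT now();
SELECT now() - interval '7 days';
SELECT date_trunc('week', now());

-- JSONB operators: -> returns JSON, ->> returns text.
SELECT '{"user": {"id": 42}}'::jsonb -> 'user' ->> 'id';
```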
Back to the force_parallel_mode post: he does mention a use case where it could be beneficial when you're creating a function, but most of the time you don't want to force parallel mode for your entire database cluster. And if you're looking to get more parallelized queries, there are two other parameters you should actually be adjusting: one is parallel_setup_cost and the other is parallel_tuple_cost. There are a couple of other ones you can adjust too, namely min_parallel_table_scan_size and min_parallel_index_scan_size. So, a good blog post that talks a little bit about how you can change some of these configuration parameters in order to get more of your queries using parallelism, if that is of interest to you or you feel it could give you some additional performance. Definitely a blog post to check out. And this is from the Robert Haas blog.

[00:08:49] The next post is PostgreSQL Clusters: Monitoring Performance. This is a blog post that talks about, if you have multiple Postgres clusters running, like multiple servers, maybe 5 or 10 of them, how you could query and look into what's going on in each of those servers. Now, this had a pretty interesting idea, where basically you use foreign data wrappers. They're focusing on the pg_stat_statements view, essentially, to look up this information. So they talk about how you would set up being able to query this view to get access to the information across multiple systems. They set up a user, they created the foreign data wrapper extension, they set up the foreign servers (in this case, two servers that they want to contact and get statistics from), they mapped users to the appropriate user on the systems they want to monitor, and they created a schema for the particular table or view they wanted to access, limiting it to pg_stat_statements. They then created a single view on, essentially, their monitoring workstation that aggregates the different pg_stat_statements results together, and the post even proposes a query that could allow you to take a look at how your queries are performing across your multiple PostgreSQL instances (a rough sketch of this setup follows below). So, a really interesting blog post. Again, looking at the future of PostgreSQL, when potentially sharding will be available and maybe people are installing PostgreSQL on Kubernetes, this may be a technique to use to monitor the performance of those without having to connect to each PostgreSQL instance individually.

[00:10:38] The next blog post is Elephant in the Room: Database Backup. Of course everybody knows that they need to back up their database, but then the question is, what are you doing to ensure that that backup is going to work if it needs to? This talks about the various issues that different organizations have had and broadcast about; they focus a fair amount on some of the issues that GitLab had.

[00:11:06] And here's a quote: "So, in other words, out of five backup/replication techniques deployed, none are working reliably or set up in the first place," which is a really bad situation to be in. He talks a little bit more about some of the issues that were encountered, and basically they didn't have good insight into whether their backup was working or not. So a part of it is just a rehash of the GitLab incident, but it's also a reminder that it's not just good enough to have some notifications set up.
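Here is a rough sketch of the cross-cluster pg_stat_statements idea described a moment ago. Host names, database names, users and passwords are placeholders, pg_stat_statements has to be installed on each monitored server, and the post's exact commands may differ:

```sql
-- Run on the central monitoring database.
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

-- One foreign server per cluster to be monitored (placeholder hosts).
CREATE SERVER pg1 FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'pg1.internal', dbname 'appdb');
CREATE SERVER pg2 FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'pg2.internal', dbname 'appdb');

CREATE USER MAPPING FOR CURRENT_USER SERVER pg1
    OPTIONS (user 'monitor', password 'secret');
CREATE USER MAPPING FOR CURRENT_USER SERVER pg2
    OPTIONS (user 'monitor', password 'secret');

-- Import only the pg_stat_statements view from each server into its own schema.
CREATE SCHEMA pg1_stats;
CREATE SCHEMA pg2_stats;
IMPORT FOREIGN SCHEMA public LIMIT TO (pg_stat_statements)
    FROM SERVER pg1 INTO pg1_stats;
IMPORT FOREIGN SCHEMA public LIMIT TO (pg_stat_statements)
    FROM SERVER pg2 INTO pg2_stats;

-- One view that unions the statement statistics from every cluster.
CREATE VIEW all_stat_statements AS
    SELECT 'pg1' AS cluster, * FROM pg1_stats.pg_stat_statements
    UNION ALL
    SELECT 'pg2' AS cluster, * FROM pg2_stats.pg_stat_statements;

-- Example: slowest statements across all monitored clusters.
-- (total_time is named total_exec_time on newer PostgreSQL versions.)
SELECT cluster, query, calls, total_time
FROM all_stat_statements
ORDER BY total_time DESC
LIMIT 10;
```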
Back on backups: literally, you want to have something contacting you on a regular basis that is resilient to something failing. Like maybe you want to set up some sort of job that ensures the size of the database backup is consistently where you expect it to be, and it's not suddenly zero or suddenly empty. As well as, on a periodic basis, be that weekly or monthly or some other schedule, actually restoring that backup and looking at the data to ensure that, yes, all the data is there and it is as up to date as you expect it to be. Doing those things is essential to make sure that you can restore from your backup in case catastrophe essentially happens. But definitely a blog post to check out if you're interested in that sort of thing. And this is from the hackernoon.com blog.

[00:12:33] The next post is Gracefully Scaling to 10,000 PostgreSQL Connections for $35 a Month, Part One. Now, I was interested in why he's saying $35 a month. I believe what he's referring to is this: it's basically a post on PgBouncer and how, by default, PostgreSQL allows 100 connections. You can start bumping that up to 200 connections, 300, 400, but once you get up to 400 or 500, the performance cost or the memory utilized for those connections starts being really onerous, and the solution is to use a connection pooler. Now, you could use Pgpool or PgBouncer, and I've had more experience with PgBouncer; it seems that's mostly been the go-to one that I'm aware of. So this basically goes over PgBouncer and how you could have it accept 10,000 connections but then combine those down into maybe 100 or so Postgres connections, or maybe ten; it depends on how fast your queries are. And for this purpose you're generally doing transaction pooling. So on the application side, maybe you have 10,000 threads in use and queries are being sent randomly to each of those. PgBouncer can then take those and funnel them into, say, ten to 100 Postgres connections so that you don't have to have as many connections on the PostgreSQL instance, and PgBouncer handles the creation and teardown of those connections. As for the $35 a month, it's not mentioned in the post, but I'm assuming they're projecting that as the cost of one PgBouncer server.

[00:14:17] Although in the post they talk about a scenario where they are using 40 PgBouncer instances, so that definitely wouldn't be $35 a month. But they're proposing that if you're using fast, efficient queries from your application (generally they can't be too long, because transaction pooling won't work that well otherwise), each PgBouncer can be configured to allow 10,000 incoming connections and only make ten connections upstream. But again, a lot of that depends on how fast your queries are. This is an interesting blog post that goes into a lot of detail and thought about how best to configure PgBouncer and how it can help you with scaling your database. So definitely a blog post you can check out. And this is from the Future Tech Industries blog on Medium.

[00:15:07] The last two posts are from the Severalnines.com blog, both related to performance. The first is A Performance Cheat Sheet for PostgreSQL, and it essentially goes over some of the things you can look at if you are looking to improve your performance.
One is looking through queries using EXPLAIN, and then, when you want more than just the projected cost and what the planner intends to do, you can use EXPLAIN ANALYZE to actually run the query and see what it really does. Then it also goes into some vacuum areas and some considerations with regard to it. It goes over different configuration parameters you can set, talks about some of the different logging you would probably want to enable, as well as hardware considerations. So if you are a little bit new to PostgreSQL performance, definitely a blog post to check out.

[00:16:04] The second post, again from the Severalnines.com blog, is Tuning Input/Output Operations for PostgreSQL, and this is all about I/O as it relates to performance. Again, this is more of a beginner to intermediate level blog post, but definitely something good to check over, just to compare: all right, what are you doing with your PostgreSQL instance, and are you following some of the guidance and general advice in here? They talk about indexes, partitioning, the checkpointing process, VACUUM and VACUUM ANALYZE, fill factor effects, different vacuuming tips, as well as I/O problems related to disk sorting, and even the database filesystem layout. In other words, how many disks are you using and how are they partitioned? And then considerations for running PostgreSQL in the cloud. So, definitely a blog post to check out, just to compare their suggestions to how you are currently running your PostgreSQL system; maybe there are some tips you can find to help optimize yours.

[00:17:08] That does it for this episode of Scaling Postgres. You can get links to all the content presented in the show notes. Be sure to head over to Scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you could subscribe via YouTube or iTunes. Thanks.

Other Episodes

Episode 130

September 06, 2020 00:15:15
Episode Cover

External Compression, Parallel Vacuum, JSON Aggregates, JSON & Arrays | Scaling Postgres 130

In this episode of Scaling Postgres, we discuss external compression, vacuuming in parallel, working with JSON Aggregates and the JSON & Array datatypes. To...

Listen

Episode 223

July 10, 2022 00:14:59
Episode Cover

Database Decomposition, Fast Text Search, Understanding pg_stat_activity, Window Functions | Scaling Postgres 223

In this episode of Scaling Postgres, we discuss the process of decomposing your database, how to search text in Postgres quickly, how to understand...

Listen

Episode 181

September 06, 2021 00:21:58
Episode Cover

Connection Scaling, Cached Sequences, Message Queuing, Harmful Subtransactions | Scaling Postgres 181

In this episode of Scaling Postgres, we discuss connection scaling, how to cache your sequences, a message queuing implementation and the harms of subtransactions....

Listen