Multi-Terabyte Scaling, Kubernetes DBaaS, Encryption, High Availability | Scaling Postgres 42

Episode 42 December 09, 2018 00:12:14
Scaling Postgres

Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, we review articles covering multi-terabyte scaling, building a Kubernetes DBaaS, encryption, and building high availability.

To get the show notes as well as get notified of new episodes, visit: 

https://www.scalingpostgres.com/episodes/42-multi-terabyte-scaling-kubernetes-dbaas-encryption-high-availability/


Episode Transcript

[00:00:00] In this episode of Scaling Postgres, we talk about multi-terabyte scaling, Kubernetes database as a service, encryption and high availability. I'm Creston Jamison, and this is Scaling Postgres episode 42.

[00:00:18] Alright, our first article is PostgreSQL affiliate projects for horizontal multi-terabyte scaling. This is from the cybertec-postgresql.com blog, and it's a follow-on to a post released a couple of weeks ago about terabyte scaling. It covers what you would want to do when scaling horizontally to the multi-terabyte level. He says, quote, "be warned, the road's getting bumpy, now we usually need to change the application and also the surrounding bits", which basically means doing some sharding. He mentions several extensions or derivatives of Postgres that you can use to do a multi-terabyte scale-out of PostgreSQL.

First he mentions sharding via PL/Proxy stored procedures, which he describes as some glue to get your stored procedure calls to reach the sharded database, and which has been battle-tested at Skype. So this is one solution, but it does require using stored procedures.

The next option is Postgres-XL, which he says, quote, "could perhaps be described as a PostgreSQL-based sharding framework". It lives somewhat under the umbrella of the PostgreSQL development group, but it lags a version or two behind current PostgreSQL. He says it lets you manage and run queries on tens of terabytes of data with relatively few restrictions, although he does mention some caveats, and the biggest cluster he has heard about holds 130 terabytes of data.

The next solution he mentions is sharding with Citus. Citus is an extension for PostgreSQL that we've mentioned many times on Scaling Postgres. It's an extension you can add to PostgreSQL to shard your database across multiple database servers, and they offer a hosted service as well as the extension itself, so that's a possibility (a minimal sketch of what this looks like follows at the end of this section).

And then lastly is Greenplum, which is an actual fork of PostgreSQL dedicated to data warehousing. So you probably wouldn't want to use it for online transaction processing applications; it's generally for massively parallel processing.

So those are some of the options available today if you are looking to scale out PostgreSQL. But keep in mind that over the next few versions, PostgreSQL is moving toward using the features of parallelism, partitioning and foreign data wrappers to build out the ability to horizontally scale your database system. So with each new feature added to the core, we get closer to the point where PostgreSQL will do something like this out of the box. But if you have a large data size you need to manage today and want to do a scale-out, here are some great options.
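As a concrete illustration of the Citus option mentioned above, here is a minimal sketch of how sharding a table typically looks with the Citus extension installed; the table and column names are hypothetical, not taken from the articles.

```sql
-- Enable the Citus extension (on the coordinator node).
CREATE EXTENSION citus;

-- A hypothetical table we want to shard across worker nodes.
CREATE TABLE page_views (
    tenant_id bigint      NOT NULL,
    page_id   bigint      NOT NULL,
    viewed_at timestamptz NOT NULL DEFAULT now()
);

-- Distribute the table by tenant_id; Citus splits it into shards
-- and spreads them across the registered worker nodes, so queries
-- filtered by tenant_id can route to a single shard.
SELECT create_distributed_table('page_views', 'tenant_id');
```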
[00:03:14] The next piece is actually a YouTube video, PostgreSQL and Kubernetes: Database as a Service without a Vendor Lock-in, posted on the Dalibo YouTube channel. He covers PostgreSQL in general, but then talks about how you could run it on Kubernetes and all the different component parts you would need to put into place to build your own database as a service, basically bringing up multiple database systems. Maybe you're going for a microservices architecture, or potentially you want to do your own scale-out. Well, Kubernetes can coordinate all the different database systems you need to set up, and you can do this using open source software without having to rely upon a vendor such as AWS or Google Cloud or whomever. So this is an interesting presentation about how you could actually put this together. And again, with the future of scaling being scale-out across multiple database instances, this is definitely a presentation that could prove useful in the future.

[00:04:19] The next post is Databases versus Encryption, and this is from the 2ndQuadrant.com blog. Now, I wouldn't necessarily say this is databases versus encryption; it's essentially about how you encrypt data in a database, and it walks through a lot of the different requirements for data protection. One of the big ones is full disk encryption. There are a lot of vendors that provide it and tools you can use to do it, and it's kind of a baseline you would want to have. But for really protecting your data from people who actually get access to the database, or who come in through your application, it's probably not going to be sufficient.

He mentions the pgcrypto extension, which addresses some of the issues of encrypting data in your database. But a disadvantage, he says, quote, is that "it means the database has to know the keys that will decrypt the data", and those are likely part of SQL queries, so the issue of leaking data into server logs and monitoring systems is still there. So he contends it's not a great solution (a small example of this trade-off appears at the end of this section).

What a lot of people do instead is application encryption. Within the application itself, data you want to secure is encrypted and then stored in the database, and it's the application that decrypts it. So the data sits in the database in an encrypted, scrambled form. He mentions that this just leaves the database acting as, quote, "dumb storage", which means it can't really compare values.

One thing this blog post proposes is doing encryption as part of a separate service, potentially on its own instance. One of the main benefits is when you want to compare data: maybe you want to search against encrypted values, so you rely upon this service to do the encryption and decryption for you so values can be compared. He also talks about an experimental extension, ccnumber, that looks at doing some of this, so that's definitely something to take a look at. I've also heard of other tools that enable you to do blind indexing, where even though you have an encrypted field, you can build what's called a blind index over some of that data without revealing its contents. The post definitely discusses the issues with encryption in databases, and if you need to improve your security by encrypting data in your database, it's definitely one to check out.
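To make the pgcrypto trade-off above concrete, here is a minimal sketch; notice how the passphrase appears directly in the SQL statements, which is exactly how keys can leak into server logs and monitoring systems. The table, column and passphrase are hypothetical.

```sql
-- Enable the pgcrypto extension.
CREATE EXTENSION pgcrypto;

-- A hypothetical table holding ciphertext in a bytea column.
CREATE TABLE customer_secrets (
    id  bigint PRIMARY KEY,
    ssn bytea  -- ciphertext produced by pgp_sym_encrypt
);

-- Encrypt on write: the passphrase travels with the query, so it
-- can show up in server logs and monitoring tools.
INSERT INTO customer_secrets (id, ssn)
VALUES (1, pgp_sym_encrypt('123-45-6789', 'not-so-secret-passphrase'));

-- Decrypt on read: again, the key is part of the SQL text.
SELECT pgp_sym_decrypt(ssn, 'not-so-secret-passphrase')
FROM customer_secrets
WHERE id = 1;
```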
[00:06:50] The next post is Waiting for PostgreSQL 12 – Add log_statement_sample_rate parameter. Basically, this is a feature that has been committed for PostgreSQL 12; we don't know for certain it will make it into the release, but hopefully it should. It addresses an issue with log_min_duration_statement, the setting that logs any statement running longer than a given duration so you can search through the logs and find slow statements. But what happens when you have very short-running statements, just an enormous number of them?

Well, one option is to use pg_stat_statements, which keeps aggregate statistics on queries, so you can deduce which of those many fast queries are, taken together, potentially making the system slow. The other option is this enhancement, where the logging itself takes a sample, so you can set log_min_duration_statement a bit lower, or even log all statements, but only record a sample of them. In the test in their example, they reduced the size of the log file more than tenfold by setting the sample rate to essentially 10%, changing log_statement_sample_rate from 1 to 0.1. So this could be a great feature if you want to reduce your log size but still get some visibility into the queries running on your system.

[00:08:13] The next post is Evaluating high availability solutions for TimescaleDB and PostgreSQL. Even though it's about TimescaleDB, an extension that lets PostgreSQL act as a time-series database, the real subject is high availability: how well can a cluster of PostgreSQL instances fail over? They describe a physical replication setup with one primary database and two replicas, and, using the criteria they lay out, they evaluated automated solutions that could fail over from the primary to one of the replicas. The one they chose was Patroni. They cover how you set it up: you need some sort of consensus store to track which node is the primary and which are replicas, and they're using etcd here. They also walk through how the failover process actually works when the primary fails and how the next node gets elected as the new primary. So if you're potentially looking for a high availability solution, you might want to check out this post and see if you agree that Patroni is the solution you'd like to use.

[00:09:29] The next post is How to upgrade PostgreSQL 10 to PostgreSQL 11 with zero downtime, and this is from the severalnines.com blog. Basically, this uses logical replication to do the upgrade from PostgreSQL 10 to 11. We've seen other posts in previous episodes that discussed this process, and this is another one describing how to set up logical replication and then switch over when you want to do the upgrade. You have an existing database server acting as the publisher for logical replication, a version 11 instance serving as the subscriber, and once all the data is synced, you switch the application over to use the new primary (a rough sketch of the commands involved follows below). So with the release of PostgreSQL 11, if you're looking to use logical replication to upgrade from 10 to 11, this is definitely a blog post to check out.
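For reference, here is a minimal sketch of the publication and subscription steps such an upgrade typically involves; the names, connection string and host are hypothetical, and a real migration would first copy the schema to the new server and handle sequences separately.

```sql
-- On the PostgreSQL 10 server (the publisher):
CREATE PUBLICATION upgrade_pub FOR ALL TABLES;

-- On the PostgreSQL 11 server (the subscriber), after restoring
-- the schema there (e.g. from pg_dump --schema-only):
CREATE SUBSCRIPTION upgrade_sub
    CONNECTION 'host=old-primary dbname=appdb user=replicator'
    PUBLICATION upgrade_pub;

-- Once the subscriber has caught up, repoint the application at
-- the new server and clean up:
-- DROP SUBSCRIPTION upgrade_sub;
```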
[00:10:28] The next post is At 22 years old, Postgres might just be the most advanced database yet, and this is from the arcentry.com blog. This post talks about the parts of the feature set that are important to this organization and how they use PostgreSQL, its extensions and its different features to get exactly what they need, from pub/sub messaging to triggers to foreign data wrappers to JSONB and other add-ons. So if you want a little insight into how someone is using the features of PostgreSQL, this is the blog post to check out.

The last post is Why the RDBMS is the future of distributed databases, featuring Postgres and Citus, and this is from the citusdata.com blog. Again, this is a discussion of the different features of PostgreSQL and what makes it unique compared to other database systems. The argument they make is that one of the most important considerations in choosing a database system is development cost, and that using a relational database system such as PostgreSQL, combined with the scale-out features of Citus, is probably the best choice today. So if this type of content is interesting to you, it's definitely a blog post to check out.

[00:11:51] That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to scalingpostgres.com, where you can sign up to receive weekly notifications of each episode. Or you could subscribe via YouTube or iTunes. Thanks.

Other Episodes

Episode 109

April 13, 2020 00:12:35

Show Plans, WAL Monitoring, Using UUIDs, Default Logging | Scaling Postgres 109

In this episode of Scaling Postgres, we discuss how to show live plans, how Postgres 13 allows WAL monitoring, how to use UUIDs in...


Episode 166

May 23, 2021 00:16:31

Postgres 14 Beta 1, PG 14 Improvements, Best Primary Keys, Composite Keys | Scaling Postgres 166

In this episode of Scaling Postgres, we discuss the release of Postgres 14 Beta 1, different PG 14 improvements, choosing the best type of...


Episode 203

February 20, 2022 00:12:11

Saving Space, Working with TOAST, Protecting Data, High CPU Utilization | Scaling Postgres 203

In this episode of Scaling Postgres, we discuss the new releases of Postgres, a hairy upgrade incident, why slowing down can make you go...
