Episode Transcript
[00:00:00] In this episode of Scaling Postgres, we talk about recursive queries, indexing, analytical workloads using GPUs, massive streams of data, and multiple forms of replication. I'm Creston Jameson, and this is Scaling Postgres episode two.
[00:00:25] Alright, the first article is Indexing with Postgres: When Less Is More. This is on Medium, part of PlanGrid Technology. This post goes into a query that was running slower than anticipated. They use Python and they use SQLAlchemy as their ORM, and they had a query that, when they ran EXPLAIN ANALYZE, was running in 1.8 seconds, which was slower than they felt it should be given the amount of data they were pulling. What's interesting about this article is that it goes into depth on how Postgres works to retrieve your result set: doing index scans to find the rows, potentially rechecking conditions, as well as doing bitmap heap scans to actually pull the data from disk. And he goes into how some things were happening that he didn't quite expect. Basically, he had to go over his query again and really think about how he was pulling the data and what he needed to do to make it more efficient, and he said he did some experimentation as well. By experimenting with the query and looking at the EXPLAIN plan and the performance, he was able to make the query much more efficient. Before, he was using two columns to do a join to the table he was interested in, but he realized he could filter it down to one column and make the query much more efficient. So he goes from, I believe it was, 1.8 seconds down to 23 milliseconds, so definitely a big win. One of the points he covers is that when you're having a performance problem, you definitely want to be testing your assumptions on a production-like database, because Postgres takes into account table size, the cardinality of the data, and the resources available; all of those come into play when it's determining the best route to process a query as efficiently as possible. And as he says at the end, less is more isn't an axiom you can live by, but measure twice, cut once is. So basically, when you have queries that are slow, you definitely need to test them against production data to ensure they're as optimized as they can possibly be.

The next article we're going to look at is from The Next Platform (nextplatform.com), and it's called Bringing GPUs to Bear on Bog Standard Relational Databases. This isn't explicitly about Postgres and doesn't explain how it works, but it brings something to the forefront that I think more people need to think about and be aware of. They're talking about the separation between transactional databases and analytical databases. Historically you have online transaction processing (OLTP) and online analytical processing (OLAP), and typically these were two different workloads. Your transaction processing database is very efficient at processing transactions as fast as possible: inserts, updates, selects, deletes. Your analytical database, on the other hand, is tailored for workloads where you're analyzing a lot of data, maybe doing a lot of joins, and it's okay for queries to run longer, but you usually have fewer users running those queries. That separation is still in effect today. They make a couple of quotes here: "every company, even Google and Facebook, has this split between transaction processing and warehousing," and "the split is not going away anytime soon." So it's basically a recognition that this split has to happen.
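Circling back to the indexing article for a moment, here is a minimal sketch of the kind of change it describes, using hypothetical table and column names since the article's actual schema isn't reproduced in this episode; the idea is simply to compare the EXPLAIN ANALYZE output before and after narrowing the join condition.

```sql
-- Hypothetical schema, for illustration only.
CREATE TABLE projects (
    id         bigint PRIMARY KEY,
    account_id bigint NOT NULL
);

CREATE TABLE documents (
    id         bigint PRIMARY KEY,
    project_id bigint NOT NULL REFERENCES projects (id),
    account_id bigint NOT NULL,
    title      text
);

CREATE INDEX documents_project_id_idx ON documents (project_id);

-- Original shape: joining on two columns, which can push the planner toward
-- extra filter and recheck work on top of the index and bitmap heap scans.
EXPLAIN ANALYZE
SELECT d.id, d.title
FROM documents d
JOIN projects p
  ON p.id = d.project_id
 AND p.account_id = d.account_id;

-- Simplified shape: if project_id alone identifies the parent row, the second
-- join condition is redundant and the plan can get much cheaper.
EXPLAIN ANALYZE
SELECT d.id, d.title
FROM documents d
JOIN projects p ON p.id = d.project_id;
```

The key point from the article still applies: run EXPLAIN ANALYZE against production-like data, since the planner's choices depend on table size, cardinality, and available resources.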
Related to that split, what I've seen with some clients is that they use one database for everything, and at times they get into problems doing a lot of analytical work, even customer-facing analytics. Sometimes it's best to separate that off into a separate database, or even build summaries or aggregate tables in the database to be able to handle those queries very quickly, as opposed to trying to read across millions of rows at a time. So it's just something to keep in mind. The other interesting thing about this article is that it goes into people looking at using GPU processing capacity to do massively parallel operations, like you would in an analytical database. And interestingly, many of the vendors they mention have PostgreSQL as the core engine running these massively parallel queries. So I thought this was a very good article to check out. Again, it's not necessarily educational in terms of teaching you something, but it gives a broad view of the state of massively parallel analytical databases as well as their future potential with GPUs.

Okay, the next article we're going to look at is from Medium again, from Alibaba Tech: How to Create a Stream System with Tens of Millions of Feeds. This is a blog post from Alibaba, which I believe is essentially China's Google, and they go into how they're using Postgres. There's a diagram where they stream all of the incoming data into PostgreSQL instances, which then pass it off to a storage service, and then they use a database system called HybridDB for PostgreSQL to actually do the analytical processing. So here again, this is an example of separating the transaction processing from the analytical processing. Now, if you're curious what HybridDB is, essentially it's their own product that they developed from PostgreSQL, and I believe it's mentioned a little bit in this blog post: they basically took the Greenplum database. I'm going to the Greenplum database site now, and this is an open source, massively parallel data platform. So it does this kind of analytical processing and uses a PostgreSQL core as the engine. Now, it's not the most up-to-date PostgreSQL; I believe they're targeting moving to version 9 as the core relatively soon, but it is meant for massively parallel processing of data. So basically they took that Greenplum database and modified it for their own use case to develop this HybridDB. The post goes into their data ingestion process, and under the heading about the advantages of the scheme they've come up with, the number one thing they mention is performance: PostgreSQL's powerful concurrent write performance combines with HybridDB's analytic performance. A single PostgreSQL instance can support millions of writes, and batches are loaded into their HybridDB analytic database, which enables real-time, millisecond-level queries through MPP's superior analytical performance. The other things they mention are the data handling and cleaning they've set up for their system, as well as some other benefits they go into. So I thought this was a great article to look at to get, again, a perspective on how people are using Postgres and what it is best for in terms of transactional processing.
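On the point about summary or aggregate tables, here's a minimal sketch of one way to do that in plain PostgreSQL, using a hypothetical orders table of my own invention rather than anything from the Alibaba post; a materialized view pre-aggregates the data so analytical reads don't have to scan millions of raw rows.

```sql
-- Hypothetical transactional table.
CREATE TABLE orders (
    id          bigserial   PRIMARY KEY,
    customer_id bigint      NOT NULL,
    total_cents bigint      NOT NULL,
    created_at  timestamptz NOT NULL DEFAULT now()
);

-- Pre-aggregated summary for analytical and reporting queries.
CREATE MATERIALIZED VIEW daily_revenue AS
SELECT date_trunc('day', created_at) AS day,
       count(*)                      AS order_count,
       sum(total_cents)              AS revenue_cents
FROM orders
GROUP BY 1;

-- A unique index lets the view be refreshed without blocking readers.
CREATE UNIQUE INDEX daily_revenue_day_idx ON daily_revenue (day);

-- Run periodically, e.g. from cron, to keep the summary current.
REFRESH MATERIALIZED VIEW CONCURRENTLY daily_revenue;
```

Reporting queries then hit daily_revenue instead of orders, which is the same basic idea as offloading analytics to a separate system, just at a smaller scale.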
So I definitely encourage you to check out that Alibaba article, and if you're interested in analytical processing and in open source, maybe check out the Greenplum database site, which is greenplum.org. Okay, the next article is a bit of a fun one. It's not necessarily related to scaling your database, but it's something of interest to developers using Postgres, and I thought it was a fun article to take a look at because I am actually a weekend dungeon master. The blog post is Exporting a Hierarchy in JSON with Recursive Queries, and it's from the site tapoueh.org. There will definitely be a link for this in the show notes that you can take a look at. What's interesting is that they're using Dungeons and Dragons characters and using WITH RECURSIVE in SQL to produce a JSON document. So not necessarily related to scaling Postgres, but definitely a fun and interesting article to take a look at.

The next piece of content we're going to look at is actually a YouTube video. If you are interested in multi-master replication, there was a recent YouTube video posted that I felt was very good; I watched through the whole thing. The title is Geographically Distributed Multi-Master Replication with PostgreSQL and BDR, and BDR is an acronym for Bi-Directional Replication. Natively, Postgres does not have a facility for doing multi-master replication, but there are solutions available from vendors that can do that for you. This talk goes into two different camps that these solutions fall into: one is a tightly coupled relationship between the masters and one is a loosely coupled relationship between the masters. What they say about tightly coupled is that most things are local; for example, your storage is shared. In terms of consistency and isolation, it tends to preserve the ACID model, so a lot of applications can work with it fairly well. The disadvantage is geographical distance: it doesn't really work well to have your masters geographically separated, because a lot of these solutions work synchronously, particularly if there's shared storage. Transactions happen in a synchronous fashion, so in terms of data conflicts and collisions, they can be prevented more easily, and application compatibility is pretty transparent. The other camp is loosely coupled. This is where your storage is pretty much independent, so you have entirely separate databases with their own storage systems, and replication happens between them. The storage is not shared, and it depends on eventual consistency: in location A some data gets updated, and it eventually ends up in location B. Geographically speaking, it's very good at handling that. It's not so great for consistency, of course, so you tend to have data conflicts, and usually these solutions rely on optimistic resolution of those conflicts. And the committing of transactions is typically asynchronous because there's such a distance between the nodes. Now, this talk focused more on BDR, which is a loosely coupled solution, and they like to think about it a little bit differently, using the acronym PACELC.
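As a quick aside on that recursive-queries article before the replication discussion continues: here's a minimal sketch of the WITH RECURSIVE plus JSON idea, using a simple hypothetical characters table rather than the article's actual Dungeons and Dragons schema.

```sql
-- Hypothetical self-referencing hierarchy.
CREATE TABLE characters (
    id        serial  PRIMARY KEY,
    parent_id integer REFERENCES characters (id),
    name      text    NOT NULL
);

-- Walk the hierarchy from the roots down, carrying the path along,
-- then aggregate the rows into a single JSON document.
WITH RECURSIVE tree AS (
    SELECT id, parent_id, name, ARRAY[name] AS path
    FROM characters
    WHERE parent_id IS NULL

    UNION ALL

    SELECT c.id, c.parent_id, c.name, t.path || c.name
    FROM characters c
    JOIN tree t ON t.id = c.parent_id
)
SELECT json_agg(json_build_object('id', id, 'name', name, 'path', path)) AS doc
FROM tree;
```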
[00:12:22] With PACELC, basically, when things are connected, meaning the whole system is working, you have multiple masters replicating data between one another, and you can choose between lower latency between them being updated and consistency. But when you have a breakdown in your replication, when it's partitioned, you can choose between availability and consistency. If you choose consistency, it probably means only one of those masters is operating, whereas if you choose availability, multiple masters are probably still working while the link is broken, but now you're potentially in an inconsistent state. Another thing that was really great about this presentation is that, if you're going for a loosely coupled solution, he emphasized the need to make application changes, and he goes over how to handle conflicts.
[00:13:11] How are you going to handle keys?
[00:13:15] Are you going to use something like UUIDs? Are you going to have a central key distribution service for the keys on your tables? How are you going to handle schema changes? He also really emphasized the importance of testing. So if you're considering multi-master replication and you're using Postgres, I definitely suggest you check out this video.

Okay, moving on to something else related to replication, the next article is from the Citus Data blog and it's Three Approaches to PostgreSQL Replication and Backup. It talks about three approaches, although reading it I felt it was more like approaches to replication and recovery. The first approach is streaming replication between two Postgres databases: if you want to bring up a new node, you essentially have to start replicating data from the primary using streaming replication, and it has to stream that entire database over the wire to set up the new replica. The second approach is volume-level replication, where you replicate at the storage layer. This is not something PostgreSQL supplies natively; it's a storage-level solution to replication. The third approach is basically taking backups from your primary, including the WAL files, then restoring that backup, applying WAL files until the replica is close to the primary, and then starting streaming. The advantage of this approach is that you don't have to put that load on the master while rebuilding the new replica. They don't explicitly show how to do this or have code examples, but if you're thinking about different techniques for bringing up replicas for your Postgres installation, I definitely suggest checking out this article to see their discussion of it.

The next article we're going to cover is Replication Topologies in PostgreSQL, and this is from the Ops blog. This is a more basic post that goes over PostgreSQL replication in general, and it presents how you do master-slave setups and how you do hot standbys. It talks about eventual consistency, because generally when you're setting up a hot standby it's asynchronous; it does not write to both in a synchronous way. But it also talks about some settings you can use to actually make them synchronous standbys. And it talks a little bit about the logical replication that Postgres provides, as well as cascading replication for your replicas. So if you're just getting started with replication, this would be a good post to look at. Similarly, another post that was updated is PostgreSQL Replication Slots, again from the Ops blog. Here they talk about using replication slots. Historically you haven't needed replication slots to do replication, but this talks about setting them up and getting them running for your replication setup.

Now, the last piece of content we're going to take a look at is another YouTube video. This is PostgreSQL Replication in 2018. So, again on this theme of replication, it goes through and talks about all the different replication options and things to look at. If you're looking at replication and thinking about changing the scheme you're using, this is definitely a video to watch. Now, I will warn you that the audio of this presentation is not great, so just be aware of that as you're reviewing it. But it's definitely a piece of content to take a look at if you're looking into replication.
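Back on the question of key handling for multi-master setups: one common option, as a minimal sketch, is to use UUID primary keys so that each node can generate identifiers independently without colliding. The gen_random_uuid() function is built into PostgreSQL 13 and later; on the versions current when this episode was recorded it comes from the pgcrypto extension.

```sql
-- Needed on older PostgreSQL versions; built in from version 13 onward.
CREATE EXTENSION IF NOT EXISTS pgcrypto;

-- Each master can insert rows without coordinating a sequence with the others.
CREATE TABLE accounts (
    id         uuid        PRIMARY KEY DEFAULT gen_random_uuid(),
    name       text        NOT NULL,
    created_at timestamptz NOT NULL DEFAULT now()
);
```

A central key distribution service, the other option mentioned above, trades that independence for smaller, ordered keys.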
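And since replication slots came up, here is what the basic setup looks like for a physical slot; this is a minimal sketch of standard PostgreSQL commands and settings, not something taken from the Ops blog post itself, and the connection values are placeholders.

```sql
-- On the primary: create a physical replication slot for the standby.
SELECT * FROM pg_create_physical_replication_slot('standby_1');

-- Check which slots exist and whether a standby is currently using them.
SELECT slot_name, slot_type, active, restart_lsn
FROM pg_replication_slots;

-- On the standby, point at the slot (recovery.conf before PostgreSQL 12,
-- postgresql.conf from 12 onward):
--   primary_conninfo  = 'host=primary.example.com user=replicator'
--   primary_slot_name = 'standby_1'
```

The slot makes the primary retain WAL until the standby has consumed it, so a standby that falls behind doesn't lose its place, but an abandoned slot can eventually fill up the primary's disk.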
Now that does it for episode two of Scaling Postgres. Be sure to check out the links in the show notes to each article or piece of content that I mentioned. Be sure to head on over to scalingpostgres.com so that you can get weekly notifications of each episode, or feel free to subscribe via YouTube or iTunes. Thanks.