Episode Transcript
[00:00:00] In this episode of Scaling Postgres, we discuss serializable, JSON and JSONB, fast data loading, and new features. I'm Creston Jameson, and this is Scaling Postgres episode eleven.
[00:00:21] All right, our first article is Serializable in PostgreSQL 11 and Beyond. So this was a really great post. It talks about serializable in terms of Postgres 11, but it also covers a number of other issues that can really expand your knowledge, particularly if you're a developer. They're talking about some new features that are planned to go into version 11, adding predicate lock support for hash indexes, GIN indexes, and GiST indexes. Basically, when you're using serializable transaction isolation, it'll be much more efficient. But a lot of this post goes into transaction isolation in general and includes some really great links, including a research paper that talks about the importance of writing your code in a concurrency-safe fashion to avoid attacks. It goes in depth on the implementation of serializable, how it interacts with indexes, and even talks about the future in terms of trying to enable this type of feature with regard to parallelism, what's available in terms of replication, and even SKIP LOCKED. I would say this post gave me the most educational material for this week, so it's definitely one I suggest checking out. And I should mention, this is from the write-skew.blogspot.com blog.
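As a quick illustration of the behavior being discussed (this sketch is not from the article itself, and the table and column names are hypothetical), serializable isolation is something you opt into per transaction, and your application should be prepared to retry on serialization failures:

    BEGIN;
    SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
    -- read, then write based on what was read; under SERIALIZABLE, Postgres will
    -- abort one of two overlapping transactions that would create an anomaly
    SELECT count(*) FROM doctors_on_call WHERE on_call = true;
    UPDATE doctors_on_call SET on_call = false WHERE doctor_id = 1;
    COMMIT;
    -- if this fails with SQLSTATE 40001 (serialization_failure),
    -- simply retry the whole transaction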
[00:01:57] The next article is PostgreSQL Data Types: JSON, and this is from the tapoueh.org blog, continuing his series on PostgreSQL data types. He's talking about JSON and JSONB and goes into depth about how the JSON data type is basically just text: it stores the data representation exactly as it is sent to PostgreSQL, including whitespace, indentation, et cetera. Whereas JSONB is an advanced binary storage format. He goes into use cases for each one. Basically, you would probably always want to use JSONB because you can make access to the data very fast using GIN indexes.
[00:02:48] And the only real case for using JSON rather than JSONB is if you want to preserve the entire payload of some API you're consuming exactly as it was received. But generally, for storing information, the typical best practice is to use JSONB. So it's definitely a blog post to check out if you want to learn more about that.
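To make that concrete, here is a minimal sketch of the difference (hypothetical table and query, not taken from the post): json preserves the payload as text, while jsonb can be indexed with GIN for fast containment queries.

    CREATE TABLE events (
        id      bigserial PRIMARY KEY,
        raw     json,   -- keeps the payload exactly as received
        payload jsonb   -- binary format, indexable and fast to query
    );

    -- a GIN index speeds up containment queries on the jsonb column
    CREATE INDEX events_payload_idx ON events USING gin (payload);

    SELECT id FROM events WHERE payload @> '{"type": "signup"}';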
[00:03:09] The next post is an in-depth video tutorial I did on the ScalingPostgres.com blog, and it's Fast PostgreSQL Data Loading Using Ruby. I am a Ruby developer and I had a need to generate a couple of terabytes of test data, so I went over a Ruby script I developed to load this fake test data using a Ruby library, the Faker gem, and I went over different types of implementations.
[00:03:39] So doing single-row inserts, doing multi-row inserts per INSERT statement, and looking at the performance to see how I went about maximizing how fast I could load data into PostgreSQL. And then I also compared that to generating a CSV file with Ruby and then using the PostgreSQL COPY command to import it, because generally COPY is the fastest way to import data into PostgreSQL. It's an interesting comparison looking at the different timings in this tutorial, so I welcome you to check it out.
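To give a rough idea of the approaches compared in the tutorial, here is a simplified SQL-only sketch; the table and values are hypothetical, and the actual tutorial drives all of this from Ruby with the Faker gem:

    -- single-row inserts: one statement (and one round trip) per row
    INSERT INTO customers (name, email) VALUES ('Ada', 'ada@example.com');

    -- multi-row insert: many rows per INSERT statement, far fewer round trips
    INSERT INTO customers (name, email) VALUES
        ('Ada',   'ada@example.com'),
        ('Grace', 'grace@example.com'),
        ('Linus', 'linus@example.com');

    -- COPY from a generated CSV file, generally the fastest bulk-load path
    COPY customers (name, email) FROM '/tmp/customers.csv' WITH (FORMAT csv);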
[00:04:17] The next article is actually a YouTube video that was presented at LinuxFest Northwest 2018, and it's called Ten New Things in Postgres Ten. This was done by Basil Bork, I believe, and he basically goes over his ten favorite features of PostgreSQL 10. He is a developer, so it is from a developer's perspective. His first one is the replacement for serial: Postgres 10 added identity columns to be compliant with the SQL standard, so it's a different way of generating auto-incrementing primary keys for your tables. The next topic was traceable commit: basically, if you request a commit but the database server crashes or has some issue, did it truly commit? You can track what the status of that commit you sent was. He talks about parallel queries. He talks about cross-column statistics, where you have relationships between columns and you can define statistics for those. So, for example, there are only certain cities in a given state, or only certain zip codes in a given city or state, and you can define those relationships for statistical purposes to get the best query plan. He talked about the ability to use new collations for sorting, and he also covered, very briefly, logical replication, simplified timestamps, and some other features. So if you're a developer, it's definitely an interesting YouTube video to view.
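As a quick sketch of two of those features, identity columns and cross-column statistics (the table and column names here are hypothetical, not from the talk):

    -- identity column, the SQL-standard replacement for serial
    CREATE TABLE orders (
        order_id  bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        placed_at timestamptz DEFAULT now()
    );

    -- extended statistics so the planner knows city and state are correlated
    CREATE TABLE addresses (city text, state text);
    CREATE STATISTICS addr_city_state (dependencies) ON city, state FROM addresses;
    ANALYZE addresses;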
[00:06:00] The next piece of content is actually a YouTube channel that I'm linking to, and this is the Greenplum Database YouTube channel. If you're unfamiliar with Greenplum, I have mentioned it on the show before. Basically, it's a massively parallel database that's based on PostgreSQL, and they recently had a conference, or were part of a conference, and in the last week they've posted a number of presentations talking about Greenplum. Some of the topics are what it means to be a massively parallel PostgreSQL DBA, Greenplum in action on AWS, Azure, and Google Cloud, and Greenplum containers: coordinating many PostgreSQL instances on Kubernetes. Basically, Greenplum allows you to use multiple PostgreSQL instances and query them in parallel. So as you're looking to scale your database, if you believe you need to start moving into sharding or running your database on multiple PostgreSQL servers, these are definitely some presentations to check out.
[00:07:09] The next blog post is Near-Zero Downtime Automated Upgrades of PostgreSQL Clusters in the Cloud, and this is part two, from the 2ndQuadrant PostgreSQL blog. Now, I mentioned part one in a previous episode of Scaling Postgres, and this goes into more depth. They're talking about a tool they have called pglupgrade, which I believe is short for pglogical upgrade, pglogical being their logical replication extension. This is different from version 10's logical replication, because pglogical is a tool they developed prior to logical replication being implemented in version 10, so it works in versions earlier than 10 as well. And in part one of this article they discussed how they used logical replication to do an upgrade from a version 9 PostgreSQL instance to version 10.
[00:08:11] Now, they don't include the tool itself, but in this post they describe the structure of it and how they use Ansible to do it in an automated fashion. So they include examples of what the inventory file would look like, the configuration file, and basically the steps that you would need. And there's a great image, which I'm showing right now, that goes through the exact sequence of steps.
[00:08:36] So first you set up a new primary database, then you set up standbys that replicate from it. Then you set up a publisher on the old primary database and a subscriber on the new primary database. This technique also uses PgBouncer as a connection pooler. So the next steps are to wait for replication lag to drop below a certain level, pause the connections, reconfigure PgBouncer to point to the new primary, wait for replication to fully catch up, and then resume the connections.
[00:09:11] At that point you are operating on the new primary database, and you can then remove the publisher from the old primary, stop that instance, then stop all the old standbys and set up any new standbys that you would need. So again, they're not providing the code for the tool, but they're giving the structure so that you can replicate it, or I'm sure you could contact them if you want the implementation done on your behalf. But it's definitely a very interesting blog post to check out if this is something you're considering doing in the future.
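To make the switchover a little more concrete, here is a rough sketch of the kinds of commands involved. This is not the pglupgrade tool itself, and it uses the in-core publication/subscription syntax rather than pglogical's functions; the names and connection string are hypothetical:

    -- on the old primary: publish all tables
    CREATE PUBLICATION upgrade_pub FOR ALL TABLES;

    -- on the new primary: subscribe, which copies existing data and then streams changes
    CREATE SUBSCRIPTION upgrade_sub
        CONNECTION 'host=old-primary dbname=appdb user=replicator'
        PUBLICATION upgrade_pub;

    -- on the PgBouncer admin console: pause connections during the switchover,
    -- repoint PgBouncer at the new primary, then resume
    PAUSE appdb;
    -- ... update pgbouncer.ini so appdb points at the new primary, then RELOAD ...
    RESUME appdb;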
[00:09:47] The next two posts are with regard to features coming in PostgreSQL 11. The first post is Partition Elimination in PostgreSQL 11, and again this is from the 2ndQuadrant blog. They talk about what they hope to get into version 11: faster partition pruning and partition pruning at execution time. Basically, when you're using Postgres 10's declarative partitioning, this prunes down to the exact partitions you need to access for the data that's being queried, inserted, or updated. Historically you could only do this at plan time, determining which partitions you need to access, and that can cause problems with prepared statements. With this particular patch they are able to improve the performance. So they have a benchmark and some graphs here that show the dramatic improvement in version 11 when using prepared statements, whereas in version 10 there was essentially no improvement over non-prepared statements. So definitely there are some use cases where this could be very beneficial, and if you're interested, I encourage you to check out this post.
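As a simple illustration of the scenario (hypothetical table and values, not the benchmark from the post), the interesting case is a prepared statement where the partition to scan only becomes known at execution time:

    CREATE TABLE measurements (
        logdate date NOT NULL,
        reading numeric
    ) PARTITION BY RANGE (logdate);

    CREATE TABLE measurements_2018_q1 PARTITION OF measurements
        FOR VALUES FROM ('2018-01-01') TO ('2018-04-01');
    CREATE TABLE measurements_2018_q2 PARTITION OF measurements
        FOR VALUES FROM ('2018-04-01') TO ('2018-07-01');

    -- run-time partition pruning lets this scan only the matching partition
    PREPARE recent_readings (date) AS
        SELECT * FROM measurements WHERE logdate = $1;
    EXECUTE recent_readings('2018-05-15');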
[00:11:10] The next post is Tech Preview: PostgreSQL 11 CREATE PROCEDURE. So for years in PostgreSQL you've been able to create functions and triggers, but we've never had the ability to create a procedure whereby you can, within that procedure, use COMMIT and ROLLBACK. That's what this feature enables for PostgreSQL 11. It's a very short post, but if you've been wanting this kind of functionality alongside your functions, it looks like it's slated to get into Postgres 11.
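Here is a minimal sketch of what that looks like, with a hypothetical table; the key point is the COMMIT inside the loop, which a plain function cannot do:

    CREATE PROCEDURE archive_in_batches()
    LANGUAGE plpgsql
    AS $$
    BEGIN
        FOR i IN 1..10 LOOP
            DELETE FROM events
             WHERE id IN (SELECT id FROM events ORDER BY id LIMIT 1000);
            COMMIT;  -- allowed in a procedure, not in a function
        END LOOP;
    END;
    $$;

    CALL archive_in_batches();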
[00:11:49] That does it for this episode of Scaling Postgres. You can get links to all the content presented in the show notes. Be sure to head over to ScalingPostgres.com, where you can receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.