Episode Transcript
[00:00:00] In this episode of Scaling Postgres, we talk about terabyte scaling, Postgres 11, managing connections and terabyte backup. I'm Creston Jameson and this is Scaling Postgres, episode 35.
[00:00:21] Alright, our first piece of content this week is the announcement that PostgreSQL 11 has been released. So as expected, it was released on October 18th, and this post talks about all the new features that we've gone over in previous episodes: increased robustness and performance for partitioning, so they've really fleshed out the declarative partitioning that existed, with the intent to make it even easier to set up and use and with more features in PostgreSQL 11; the introduction of stored procedures that allow you to do transactions, kind of like within a function; improvements to the types of queries that you can run in parallel; and just-in-time compilation for certain expressions, as long as you turn it on. They also list general user experience improvements, talking about being able to add a column with a default and not have it rewrite the whole table, which is a huge improvement, adding covering indexes where you can essentially include an additional payload in the index, some additional functionality related to window functions, and then being able to just type quit or exit to leave the psql command line. So if you want to know a little bit more about PostgreSQL 11, you can check out this announcement.
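Just to illustrate a couple of those features, here's a minimal, hypothetical sketch; the table and procedure names are made up and aren't from the announcement itself.

-- Hypothetical table for the examples below.
CREATE TABLE audit_log (
  id bigserial PRIMARY KEY,
  note text,
  logged_at timestamptz DEFAULT now()
);

-- New in PostgreSQL 11: procedures can commit inside their own body,
-- which functions cannot do.
CREATE PROCEDURE log_in_batches()
LANGUAGE plpgsql
AS $$
BEGIN
  INSERT INTO audit_log (note) VALUES ('first batch');
  COMMIT;
  INSERT INTO audit_log (note) VALUES ('second batch');
  COMMIT;
END;
$$;

CALL log_in_batches();

-- A covering index: the extra payload column is stored in the index
-- so queries on logged_at that only need note can use an index-only scan.
CREATE INDEX audit_log_logged_at_idx ON audit_log (logged_at) INCLUDE (note);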
[00:01:33] Now, related to that, what happens soon is that the final release of PostgreSQL 9.3 is projected for November 8th, and I believe at that point, or a period of time after that, it will be considered unsupported. So if you're on 9.3, you should probably consider going ahead and upgrading. Now, given that this is a new release, there was an interesting post on the Momjian.us blog talking about a three-year cycle. This is just a quick little post, but he's talking about how the types of problems the PostgreSQL team is addressing take a long time to solve, and because there's a release every year, these features sometimes take multiple years to complete. So that's why you tend to get essentially maybe a quarter or half of what you expect in terms of features. Parallelism kind of started in 9.6, more was added in 10, more was added in 11, and presumably more will be added in 12. The same thing with partitioning: partitioning has kind of been there for many releases, but declarative partitioning came in 10 and it's been made much better in 11, and I'm sure they're going to continue on that path in 12. But basically, when looking at the PostgreSQL project, he says to think of it as essentially being on a three-year cycle to get certain features to what they perceive as a complete state. So just an interesting insight into the development cycle of PostgreSQL.
[00:03:05] The next post is Adding new table columns with default values in PostgreSQL 11, and this is from the 2ndQuadrant.com blog. Now, we've mentioned this in multiple episodes and described it, but this short post was actually written by the author of the feature and he goes into a little bit of detail about how it works, so it's a great resource. What's interesting here is he says the default value doesn't have to be a static expression, it can be any non-volatile expression, but volatile expressions such as random() will still result in table rewrites. So you still have to be a little bit careful if you're using functions as the default. You might want to check out this post if you tend to do that, to see under what conditions you're going to get a full table rewrite when you're adding a column with a default and when you're not.
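To see the distinction, here's a small sketch against a hypothetical table, showing a default that avoids the rewrite and one that doesn't.

-- Hypothetical table for illustration.
CREATE TABLE accounts (id bigserial PRIMARY KEY, name text);

-- Non-volatile default: in PostgreSQL 11 this is recorded as metadata,
-- so existing rows are not rewritten.
ALTER TABLE accounts ADD COLUMN created_at timestamptz DEFAULT now();

-- Volatile default: random() can differ for every row, so PostgreSQL
-- still has to rewrite the whole table to fill the values in.
ALTER TABLE accounts ADD COLUMN shard_key double precision DEFAULT random();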
[00:03:56] The next post is Ideas for scaling PostgreSQL to multi-terabyte and beyond, and this is from the cybertec-postgresql.com blog. Basically he goes over things you should keep in mind when you want to scale your database up to terabyte scale. The first area he discusses is lightweight, special-purpose indexes, and he particularly focuses on partial indexes, which allow you to index only a subset of the data. Now, I myself have been able to get huge wins using partial indexes, and I would say this is probably my favorite performance feature, because if you have a status column and there's only a finite number of statuses, you can create partial indexes on those statuses and get much higher query performance in a lot of use cases. He also mentions the GIN index. GIN is typically used for columns you want to do a text search on, or maybe JSONB columns, but one thing he mentions is that, quote, it can be perfect for indexing columns where there are a lot of repeating values. So for example, if you have a column with a very finite set of values, like titles such as Mr., Mrs. and Miss, indexing it with GIN can actually produce a very small index, because, quote, GIN only stores every unique column value once, as opposed to what a B-tree would do. So this could be a particular use case if you want to save storage and potentially get better query performance. He talks about BRIN indexes, and again these can be a good use case for space savings, but only on very well ordered values like time series data. And he also mentions Bloom indexes, but they're relatively exotic and I actually haven't had experience using those yet. But basically indexes are his number one place to start, because they don't require application changes and you can get performance boosts simply by using the right index for the right use case.

The next area he goes into is table partitioning. This is for when you have a really large table and maybe you don't tend to access the historical data in it. That's a perfect use case for partitioning, perhaps by date and time, so you can set up multiple partitions by year, by month or by day. That enables you to keep only the recent hot data in the cache, while the data that is accessed infrequently stays out of the cache, giving you high performance for the more recent data. And again, depending on how you set it up, it doesn't require any application change, so that's another consideration when you want to achieve terabyte scale. The next area he mentions is tablespaces: basically moving your cold data that is accessed infrequently to slower, more inexpensive storage and keeping your recent or hot data on SSD drives. That's a way to help scale your database, or you can even spread the data across multiple tablespaces to get more performance. He also talks about using multi-process features maximally, so basically utilizing the parallel performance improvements that PostgreSQL has offered in 9.6, 10 and version 11.

Then he talks about query load balancing with replicas, so basically creating multiple replicas and sending your read-only queries to those replicas. He also talks about a mirrored mode, and what he means is that you can set up your synchronization in a synchronous fashion and only have commits complete on the master once the replicas have committed as well.
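To make a few of these ideas a little more concrete, here is a minimal, hypothetical sketch; the table names, partition boundaries, tablespace path and replica names are all assumptions for illustration, not something taken from the post.

-- Hypothetical table for the index examples.
CREATE TABLE orders (
  id bigserial,
  status text NOT NULL,
  created_at timestamptz NOT NULL DEFAULT now(),
  total numeric
);

-- Partial index: only index the rows you actually query, e.g. unshipped orders.
CREATE INDEX orders_pending_idx ON orders (created_at) WHERE status = 'pending';

-- GIN on a column with many repeating values; indexing a plain scalar column
-- with GIN needs the btree_gin extension. Each distinct value is stored once.
CREATE EXTENSION IF NOT EXISTS btree_gin;
CREATE INDEX orders_status_gin_idx ON orders USING gin (status);

-- BRIN works well on naturally ordered data such as an insert timestamp.
CREATE INDEX orders_created_brin_idx ON orders USING brin (created_at);

-- Declarative range partitioning by month (PostgreSQL 10+ syntax).
CREATE TABLE events (
  id bigserial,
  occurred_at timestamptz NOT NULL,
  payload jsonb
) PARTITION BY RANGE (occurred_at);

CREATE TABLE events_2018_10 PARTITION OF events
  FOR VALUES FROM ('2018-10-01') TO ('2018-11-01');

-- Move a cold partition to cheaper storage with a tablespace (path is illustrative).
CREATE TABLESPACE cold_storage LOCATION '/mnt/slow_disk/pgdata';
ALTER TABLE events_2018_10 SET TABLESPACE cold_storage;

-- "Mirrored" load balancing: make commits wait until the named replicas
-- have applied the changes (replica names are assumptions).
ALTER SYSTEM SET synchronous_standby_names = 'ANY 1 (replica1, replica2)';
ALTER SYSTEM SET synchronous_commit = 'remote_apply';
SELECT pg_reload_conf();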
With that kind of mirrored setup, you basically get one unified view across your master and however many replicas you have; you won't have a delay where some data has been updated on the master but not on the replicas yet. Now, you're going to take a performance hit doing that, so you'll have a little bit of a delay on writes, but at least you'll have a consistent view across all the replicas in the cluster if you want to set it up that way. And as he says, this works well only if the read queries are purely OLTP, online transaction processing, basically very fast. So he says these are the areas you first want to tackle. However, you can get into presumably more exotic approaches, and one he talks about is foreign tables. This is using foreign data wrappers, where you can communicate with other database systems such as PostgreSQL or even MySQL or Oracle, or set up compressed data in files on a different file system that keeps it compressed and then use a foreign data wrapper to access that data. So it's basically a way to scale out, where you're pulling data from a foreign data source using a foreign data wrapper, and then getting into essentially sharding. Now, PostgreSQL doesn't have sharding built into it, but you can start to use foreign data wrappers to kind of achieve that, and he talks about bringing in table partitioning and having child tables residing on remote nodes. So these are just some ideas in this post, and I highly suggest checking this one out if you're considering how to potentially scale your terabyte database.

The next post is How to manage connections efficiently in Postgres, or any database, and this is from brandur.org. Basically he's talking about how you set up a database for your application, things are running fine, and then suddenly you get a fatal error: remaining connection slots are reserved for non-replication superuser connections. So basically you've run out of connections in Postgres. So what do you do? Generally people just start increasing the number of connections, but essentially there's a background process created for each of these connections, so they have a fair amount of cost. And what he notes here is that there's still a limiting factor, because there are certain structures in shared memory that become bottlenecks. So apart from the memory, you're potentially going to run into this contention. He has a graph here that shows the performance of a simple task degrading as the number of active connections in the database increases. So presumably it's not necessarily a memory limit, just contention in the system as it grows from one to 1,000 connections, and you can see the delay in seconds starting to increase as you get up to the 1,000-connection level. And of course he says what to do at this point is basically bring in a connection pooler. He talks about how applications often provide connection pooling; he mentions Rails here, where with Active Record you can configure a connection pool and how large you want it to be. But normally when you hear about connection poolers it refers to things such as PgBouncer. So even with application-side connection poolers, you're probably going to want to get to PgBouncer once you're at the hundreds-of-connections stage. And he goes over the basics of PgBouncer and the types of pooling it can do.
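Before getting into PgBouncer itself, a quick, hedged aside: a couple of simple queries let you see how close you are to the limit he's describing. These use standard views and settings, nothing specific to his post.

-- The configured ceiling on backend connections.
SHOW max_connections;

-- How many backends exist right now.
SELECT count(*) FROM pg_stat_activity;

-- What those backends are doing (active, idle, idle in transaction, ...).
SELECT state, count(*)
FROM pg_stat_activity
GROUP BY state
ORDER BY count(*) DESC;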
PgBouncer can do session pooling, which is not necessarily useful for handling a large number of connections, but he also talks about transaction pooling and statement pooling, and how typically your PgBouncer sits in between the nodes accessing your database and the database itself. So again, this was a really good post that talks through connections and the ideas to keep in mind, not necessarily just for Postgres but for application developers generally, on how best to work with your connections.
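If you do reach for PgBouncer, a minimal pgbouncer.ini for transaction pooling might look something like the following sketch; the addresses, auth file and pool sizes are illustrative assumptions, not recommendations from the post.

; pgbouncer.ini -- illustrative values only
[databases]
myapp = host=127.0.0.1 port=5432 dbname=myapp

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction        ; session, transaction, or statement
max_client_conn = 1000         ; clients PgBouncer will accept
default_pool_size = 20         ; server connections per database/user pair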
[00:11:20] The next post is PgBouncer monitoring improvements in recent versions, and this is from the okmeter.io blog. Now, this is kind of a follow-on to a post we covered recently, USE, RED and real world PgBouncer monitoring, and this one talks about some things that are new in version 1.8. There are some new statistics you can get, like total transaction time, which is the total number of microseconds spent by PgBouncer when connected to PostgreSQL in a transaction, either idle in transaction or executing queries. You can use the query time metric alongside it to see how often something is essentially idle in transaction, and you can graph it. So essentially you can get an idling percentage using these new stats, showing how often transactions are open but no real database work is happening. These are connections that are just hanging out there, not being used, and perhaps in your application you could do things to make this more efficient. They also talk about total query count and total transaction count, so again, two more metrics you get access to, and he demonstrates how to use those to get average queries per transaction, and then you can even get the average query time and the average wait time. So if you run PgBouncer, this is definitely a great blog post to check out for some new or different ways to monitor its behavior.
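As a rough idea of where those metrics come from, you can connect to PgBouncer's admin console and look at the raw counters yourself; the port and user below are just common defaults, so adjust for your setup.

# Connect to the special "pgbouncer" admin database (defaults assumed).
psql -h 127.0.0.1 -p 6432 -U pgbouncer pgbouncer

-- Cumulative counters per database; in 1.8+ these include total_xact_count,
-- total_query_count, total_xact_time, total_query_time and total_wait_time.
SHOW STATS;

-- Per-pool view of client connections, server connections and clients waiting.
SHOW POOLS;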
[00:12:50] The next post is Managing PostgreSQL backup and replication for very large databases. This post basically goes over why you would need to do backups, and he lists some items here. He goes over what I tend to call logical backups, using pg_dump or pg_dumpall to dump a whole database cluster, which you can then restore with pg_restore. Now, from my perspective, a logical backup is a great thing to do when your database is relatively small. However, once you start scaling, you're going to want to move to physical backups that support point-in-time recovery. Generally this will be faster for larger databases and you can restore to any point in time, whereas with a logical backup you can only restore to the point in time at which the dump was started. He talks about setting up point-in-time recovery backups, where you use pg_basebackup to take a base backup of the data files, but then you also need to be copying over the WAL files as well. You can copy over the WAL files using an archive command, or you can use the streaming protocol with pg_receivewal, and I have a few tutorials on this if you want to check out scalingpostgres.com. He goes over some of the advantages of point-in-time recovery and some of the potential disadvantages. Now, he advocates using both strategies for backing up your database, but once you get up to really large sizes, the logical backup takes a ton of resources and is really hard to do, so generally you're going to be using the physical backup process. You can use just the tools provided by community PostgreSQL, which are pg_basebackup and pg_receivewal, or the archive command. However, there are other open source tools, like Barman or pgBackRest, that help coordinate this process, so those are two other choices you can explore, and they say they're using Barman at their organization. So if you're rethinking your backup and restore process, this is definitely a blog post to check out.
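For a sense of what the physical, point-in-time recovery approach looks like using only the community tools, here's a hedged sketch; the host names, paths and target time are assumptions for illustration.

# postgresql.conf on the primary: ship WAL somewhere safe (path is illustrative).
# archive_mode = on
# archive_command = 'cp %p /backup/wal_archive/%f'

# Take a base backup over the replication protocol, compressed tar format,
# streaming the WAL needed to make it consistent.
pg_basebackup -h primary.example.com -U replication -D /backup/base/2018-10-21 -Ft -z -X stream -P

# Alternatively, stream WAL continuously instead of using archive_command.
pg_receivewal -h primary.example.com -U replication -D /backup/wal_archive

# To restore to a point in time (PostgreSQL 11 and earlier use recovery.conf):
# restore_command = 'cp /backup/wal_archive/%f %p'
# recovery_target_time = '2018-10-21 12:00:00'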
[00:14:58] The next post is Commenting your Postgres database. This actually refers to a similar post that I reported on last week about the COMMENT command. First he talks about being sure to comment your queries, particularly if it's a very long query, using single-line comments with two hyphens. But you can also comment your schema using the COMMENT command. So you can say COMMENT ON TABLE and give the name of the table, or COMMENT ON COLUMN and give the name of the column, and give it a description, and when you describe that object in psql it will show the description. So again, like I mentioned last week, I'm not sure I would necessarily use the COMMENT command in my own databases, because some of that documentation lives elsewhere, but I can imagine that for a large database with a large number of users this feature can be beneficial. So if you're interested, it's definitely a blog post to check out.
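For reference, here's a tiny sketch of both kinds of commenting; the table and column names are made up.

-- A single-line comment inside a query.
SELECT id, email            -- only the columns the report needs
FROM users
WHERE deleted_at IS NULL;

-- Attaching descriptions to schema objects.
COMMENT ON TABLE users IS 'Application accounts; one row per sign-up.';
COMMENT ON COLUMN users.deleted_at IS 'Set when the account is soft-deleted; NULL means active.';

-- In psql, \d+ users will then show these descriptions.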
[00:15:54] The next piece of content is actually one that I've mentioned previously, but they just keep adding videos. The PostgresOpen SV 2018 YouTube channel has added many more videos to check out about Postgres, so it's definitely a great resource, and at some point I'll get an opportunity to watch them and see which ones I particularly like. Now, related to that, there's another YouTube channel that's just posted a whole slew of videos, and this is for PostgresConf South Africa, which happened this month. They've posted about 20 different videos on PostgreSQL, so definitely another resource to check out for recent PostgreSQL presentations.
[00:16:32] And the last piece of content is that, along with Postgres 11, Pgpool-II 4.0 has been released. So if you use Pgpool, or are considering using it, it's definitely a blog post to check out.
[00:16:44] That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.