Storing Bits, Autovacuum, Assertions, Logical Recovery

[00:00:00] In this episode of Scaling Postgres, we talk about storing Bits more auto vacuum assertions and logical recovery. I'm creston jamison. And this is scaling postgres episode 16. [00:00:21] All right, the first article this week is using Bits Efficient Storage in Postgres. And the subtitle is an example of how you can efficiently store millions of data points in postgres using binary. And this is from the Open Digirati blog. [00:00:38] Now, the problem that they had is they wanted to track streaks of doing a particular activity. In their case, it was daily reading of a bible. But they wanted to be able to track these streaks and they were debating the most efficient way to do it. So they tried one way and it would give them a table that would grow by 4GB a year per million users. But they wanted to see if they can do better. [00:01:04] So what they decided to do is to store one bit a day, because what they could do is do a 32 bit integer for each month. So one bit per day per person they're doing per person per year. So they're using an array of twelve integers. So basically one user per row with this data type enables you to store a whole year's worth of data. So basically allowed them to only grow by 78 megabytes per year per million users. [00:01:39] So this was a pretty innovative and interesting way to find a way to store data efficiently when they plan to have quite a bit of data collected. So it's not too long a blog post, but definitely one I suggest checking out just to see the techniques that they used. [00:01:58] The next post is auto. Vacuum slides from Pgcon 2018 Ottawa. This is from the Dataegrid.com blog and in it they have their slides from the Talk. And again, if you're having particular issues with your database that you suspect is vacuum, the solution generally is to vacuum more. But I find it amusing. Here the slides, guys, what's wrong with the database? And the queries are canceling. Where's the standby? It seems vacuums is killing everything. So what to do? Turn auto vacuum off, but then that causes tons of other problems. So definitely a presentation to go through that talks about how best to configure auto vacuum. It kind of explains the Y four, again in terms of NVCC. But the meat of the presentation is about around the middle where it talks about proper tuning of auto vacuum. So again, another great post to check out if you're in the process of tuning your auto vacuum to be more efficient. [00:03:01] The next post is actually a YouTube video. It's called RailsConf 2018 dropping into B Trees by David McDonald. This is on the Confreaks YouTube channel and in it he talks about binary trees and what they are in terms of databases. Now, he's a developer by trade, so he's describing what binary trees, b tree indexes are from a developer's perspective. And he talks about some of the questions that he had and basically he did some research to determine what the correct answers are. So for example, when and how do I use an index? Why is an index on an integer faster than on a string? Why do inserts bring a performance penalty? And how can you drop an index and not hurt the data? So a lot of this is described in this presentation from an architectural standpoint to be able to answer these questions. So if you're wanting to learn a bit more about Btree indexes, definitely a presentation to check out. [00:04:00] The next post is from a presentation. It's actually the slides and it's assertions and how to use them. And this is from David Feder and this is the GitHub IO presentation link. Now, in terms of what assertions are, there are data constraints expressed in SQL which can span multiple tables in ways we haven't been able to do before. So there's constraints that you can place in a single table for what data is allowed, but assertions operate across tables. [00:04:33] Now, PostgreSQL does not have this feature yet. However, in this presentation he goes over how you could potentially do it using a trigger with its own function for each table for each assertion. So he goes into the code here. So if you're looking to potentially set up an assertion like this, you can use this technique today. However, he also mentioned in this presentation that some assertions were started in 2013 in the Postgres dual code base. But more recently they've been doing some additional patches. And this one's from this April. So this is not planned for PostgreSQL Eleven, but perhaps a future version, maybe twelve or 13. This feature will get in there. So if you're potentially interested in what assertions are and potentially how to set one up with the tools available today, be sure to check out this presentation. [00:05:30] The next post is Recovery use cases for Logical replication in PostgreSQL Ten. [00:05:37] Now, this is a really long blog post and it goes into extreme technical detail. Basically what it is talking about is cases where you have logical replication that has some failure point, it needs to be restarted. How do you get it restarted again? Now, PostgreSQL Ten is great for getting logical replication set up and there's minimal things you need to do. But what happens if it fails? How do you assure the data gets back in sync? And whatnot or if you have a failover to a replica, how do you handle that? This particular individual, and this is from Avitotech has been using another logical replication solution. I believe it was Londist. But he was investigating PostgreSQL Ten and he is documenting kind of what potential changes would be needed for them to be able to rely on PostgreSQL Ten. [00:06:34] And right here, as you can see, a very long blog post near the end here, he talks about some of the things that kind of would be needed to do to be able to properly recover the state of logical replication. I know the big one here that says blocker for production usage, and this is something I myself have looked at in terms of logical replication, is that the logical slot on the standby, in other words, logical slots are not replicated to standby. So you're going to have a gap in the data after a replica is promoted. And there was some discussion around this particular issue, like there was a proposal for failure slots, but nothing's really become of that. But there's some people that are apparently interested in it. And this blog post mentions a number of big companies in Russia using PostgreSQL, where having these type of recovery features for logical replication would be a huge benefit. So if you use logical replication and want to learn more about how recovery can potentially be done in some of the areas to watch out for, definitely a blog post to check out. [00:07:43] The next blog post is actually a presentation. Slides will postgres. Live forever. So this was a quick and easy presentation, but it basically talks about how because Postgres is open source, it's going to have a really long lifetime. The code will always be there, it can always be revisited and upgraded and improved based upon what the community chooses to do, whereas a for profit company, as soon as that product is no longer generating revenue, essentially it will be abandoned, whereas the code is always available to be changed with an open source project. So basically this presentation is saying that Postgres will have a very, very long life due to its open source nature. So definitely a presentation to check out if you're interested in that. [00:08:29] The next post is I or the tiger. I believe that's I of the tiger benchmarking cassandra versus timescale DB for time series data subtitle is how a five node timescale DB cluster outperforms 30 Cassandra nodes with higher inserts up to 5800 times, faster queries at 10% of the cost, a more flexible data model, and of course, full SQL. So this is from the Timescale.com blog. Now, I have presented this before that timescale DB is actually an extension that you add to your PostgreSQL instance. So in this, they tested setting up a Cassandra, actually a five node Cassandra ten node and a 30 node Cassandra DB to compare insert performance versus a five node timescale DB. So they go through over the methodology how they tested and essentially they saw improved performance with Timescale and up to the point where they tried to match what timescale DB was doing compared to Cassandra, they went from five to ten to 30 nodes and still didn't achieve that performance. Now, of course, this is timescale DB that has done this analysis, so you have to take that into consideration. But definitely it appears like really large performance using timescale DB compared to Cassandra for a much lower cost because you're having less nodes to manage and pay for. So if you're interested in that, definitely a blog post to check out. [00:10:02] The next post is multidata center setups with PostgreSQL. And this is from the Several Nines blog. Now, this basically goes over when you want to have multiple data centers set up potentially around the world. What's the best way to synchronize the data between them and allow access to the data? So it goes over when you can think about doing synchronous replication, when you need to be doing asynchronous replication. It also discusses some different tools that are available at different providers to be able to achieve this. So relatively short overview post about that. So if you're considering a multidata center setup, definitely a blog post to check out. [00:10:44] And the last post is altering default privileges for fun and profit. What not to Expect When You're expecting. And this is from the Celia US blog, and in it he goes over Setting default privileges. So basically you can use grant statements to grant access to schemas or tables for particular users, but setting default privileges enables you to set for new objects created moving forward. It doesn't go back for existing objects you have to use grant, whereas if you want to do future objects, you use set default privileges for them. And he goes over how to use it and compares it a little bit to coming from Oracle on what the differences are. So if you have to manage complex privileges to your database, definitely a blog post to check out. [00:11:33] That does it. For this episode of Scaling Postgres, you can get links to all the content presented in the show Notes. Be sure to head over to Scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can sign up via YouTube or itunes. Thanks.

Show Notes

Episode Transcript

Other Episodes

Episode 306

Scalability Limits From SLRU & Lock Manager | Scaling Postgres 306

Episode 59

Popularity, Load Testing, Checksums, pg_hba | Scaling Postgres 59

Episode 269

Upgrade Struggles, Hyperloglog, PgBouncer Usage, Postgres Scaling | Scaling Postgres 269

Storing Bits, Autovacuum, Assertions, Logical Recovery | Scaling Postgres 16