Scaling Real-Time Analytics, Covering Indexes, 1,500 Upgrades | Scaling Postgres 8

Episode 8 | April 16, 2018 | 00:15:06

Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, we review articles covering real-time analytics at scale, covering indexes in Postgres 11, 1,500 Postgres upgrades and PostgreSQL vs. MySQL.

To get the show notes and be notified of new episodes, visit:

https://www.scalingpostgres.com/episodes/8-scaling-analytics-covering-indexes-1500-upgrades/

Episode Transcript

[00:00:00] In this episode of Scaling Postgres, we discuss real-time analytics at scale, covering indexes, 1,500 Postgres upgrades and Postgres versus MySQL. I'm Creston Jamison, and this is Scaling Postgres episode 8.

[00:00:25] Welcome to scalingpostgres.com. As a reminder, we prioritize developer-related content first and then follow up with more database administration topics. Our first article is "Building Real Time Analytics APIs at Scale", and this is from the algolia.com blog.

[00:00:47] They have a search service and were looking to change the data store they were using, particularly for their analytics. They had been using Elasticsearch and started looking for something that would meet their needs better. They investigated Redshift, BigQuery and ClickHouse, but eventually settled on Citus. Citus is an extension for PostgreSQL, and Citus Data also offers it as a hosted service.

[00:01:20] They also made note of several extensions that were useful to them from an analytical perspective, namely HyperLogLog and TopN. Now, what I find very interesting about this blog post is not necessarily their usage of Citus, but how they've structured the data to achieve a very high level of performance. So it's the technique I was interested in, not so much the sharding, but how they aggregate the data to deliver very fast results.

[00:01:55] That starts under the section "Creating near real-time analytics". It discusses how they ingest raw events into the system, periodically aggregate those events into dedicated roll-ups, and then have the actual queries hit the roll-up tables to get results in milliseconds. They're doing roll-ups at a five-minute resolution as well as, I believe, a daily resolution. They show some of the queries used to create these roll-ups and how they are refreshed roughly every five minutes, which is what gives them very fast query performance (there's a rough sketch of this pattern at the end of this segment). There's also a graphical representation of the data streaming into the raw tables, then the five-minute roll-up, then the daily roll-up. And again, they mention that the API targets the roll-up tables, never the raw tables.

[00:02:50] They have another example query showing how those roll-ups are done.

[00:02:55] They also give a big-picture overview of the analytics API, the SQL queries hitting the Citus database, and the ingestion pipeline for populating that Citus database, and they talk a little bit about how they designed the architecture. So this is a really great blog post demonstrating how to structure your data to get the best performance, and I highly suggest checking it out.

[00:03:23] The next article is "Postgres 11 highlight: Covering Indexes".

[00:03:29] This is a new feature that is hopefully coming in Postgres 11, and the post is by Michael Paquier. What this feature does is let you use index-only scans more frequently. In the example he uses, you have a table with columns a, b and c, and a unique index on a and b.

[00:03:57] When you run a select where only a and b appear in the WHERE clause or in the columns returned, you can get an index-only scan, which is very efficient: you don't have to go to the heap to pull data from the table, because everything you need is right in the index.
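Going back to the roll-up approach from the Algolia article, here is a rough, plain-Postgres sketch of the pattern. The table and column names are made up for illustration, it leaves out the Citus sharding entirely, and a real job would track exactly which raw rows it has already rolled up; the article shows their actual schema and queries.

-- Hypothetical raw events table that ingestion writes to
CREATE TABLE events_raw (
    app_id    bigint      NOT NULL,
    query     text        NOT NULL,
    logged_at timestamptz NOT NULL
);

-- Five-minute roll-up that the API queries instead of the raw table
CREATE TABLE events_5min (
    app_id       bigint      NOT NULL,
    period_start timestamptz NOT NULL,
    query_count  bigint      NOT NULL,
    PRIMARY KEY (app_id, period_start)
);

-- Run periodically (e.g. every five minutes) to fold new raw events into the roll-up
INSERT INTO events_5min (app_id, period_start, query_count)
SELECT app_id,
       date_trunc('hour', logged_at)
         + floor(extract(minute FROM logged_at) / 5) * interval '5 minutes',
       count(*)
FROM events_raw
WHERE logged_at >= now() - interval '5 minutes'   -- simplification; see note above
GROUP BY 1, 2
ON CONFLICT (app_id, period_start)
  DO UPDATE SET query_count = events_5min.query_count + EXCLUDED.query_count;

And for the covering-index discussion, a minimal sketch of that index-only-scan case; the columns a, b and c are from the article, the table name t is made up.

-- Unique index on the two columns the queries filter on
CREATE TABLE t (a int, b int, c int);
CREATE UNIQUE INDEX t_a_b_idx ON t (a, b);

-- Only references indexed columns, so this can use an index-only scan
EXPLAIN SELECT a, b FROM t WHERE a = 1 AND b = 2;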
[00:04:14] Today, with Postgres 10 and earlier, if you select all three columns, including column c, which is not in the index, you're going to get a regular index scan, which is still fast, but it has to go to the heap to pull the extra data. Postgres 11 adds an INCLUDE keyword: you create an index still keyed just on a and b, but you include the column c. Now, when you run a query that also pulls back c (it won't be usable in the WHERE clause), c is available from the index and you can get an index-only scan, which makes these queries much, much faster. So an included column is not a key column that would speed up ordering or WHERE clauses, but when you want to pull back that data, including it in the index lets you do an index-only scan and avoid hitting the heap. This could be a really large performance boost in certain cases, so it's definitely something to be aware of and to track, to see whether the feature makes it into PostgreSQL 11.

The next article is "idle_in_transaction_session_timeout: terminating idle transactions in PostgreSQL". In PostgreSQL, idle transactions can be very bad because they cause problems for vacuum, and they can cause locking when you're trying to alter a table. Basically, you want your transactions to finish as quickly as possible, and if something is idle in transaction, that's usually a bug or some kind of mistake.

[00:05:58] Not always, but particularly if it's held open for an extended period of time. This post goes over this particular feature: you can set a timeout so that if an idle-in-transaction session is detected, the connection is closed.

[00:06:12] It also discusses the problem of bloat and how to simulate it by starting a transaction, doing a select and then doing nothing, while in a second session you're updating, deleting and inserting. That's going to cause the table to bloat, because the rows marked for deletion by those operations can't be freed while the old transaction is open.

[00:06:35] The setting essentially puts a timeout on such sessions so that the connection is closed if it's left open, which helps protect your database from long transactions. He does note that you may not want to set this at the database level, but rather at the user or session level. So it's definitely a setting to be aware of that can help protect and scale your system.

[00:07:02] The next post is from the same blog again, the cybertec-postgresql.com blog: "Get Rid of Your Unused Indexes". This is a pretty short post. It talks about why you should get rid of them: indexes use up space, they can slow down data modifications such as inserts and updates, and they prevent HOT updates (heap-only tuple updates), which basically means certain updates are more efficient when the column being changed is not included in any index. He goes over, of course, all the benefits of indexes and how they are used, but if they're not used, it's best practice to remove them, and he includes a query to help identify the unused indexes in your system. So it's definitely a blog post to check out.
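Here are minimal sketches of the three things just discussed. None of this is taken verbatim from the posts; names and values are illustrative.

First, the Postgres 11 INCLUDE syntax, continuing the a/b/c example from above (in practice this index would replace the plain one on a and b):

-- c is carried in the index as a payload column, not a key column
CREATE UNIQUE INDEX t_a_b_covering_idx ON t (a, b) INCLUDE (c);

-- A query that also returns c can now still be an index-only scan
EXPLAIN SELECT a, b, c FROM t WHERE a = 1 AND b = 2;

Second, setting idle_in_transaction_session_timeout at the role or session level, as the post suggests; app_user is a hypothetical role:

-- Terminate any session of this role that sits idle inside an open transaction for over a minute
ALTER ROLE app_user SET idle_in_transaction_session_timeout = '60s';

-- Or just for the current session
SET idle_in_transaction_session_timeout = '60s';

Third, a common way to look for unused indexes from the statistics views; this is not the exact query from the Cybertec post:

-- Indexes that have never been scanned, largest first; keep ones that enforce
-- primary key or unique constraints even if they show zero scans
SELECT schemaname, relname, indexrelname,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;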
[00:07:54] The next post is called "PostgreSQL Bloatbusters", and this is from the dataegret.com blog. He goes over the importance of vacuum and how it works to remove obsolete rows after delete and update statements. But when you actually run vacuum or autovacuum, it doesn't necessarily free the space in the table: it makes the space available for reuse, but it doesn't compact the table. To do that, you would need to use the command VACUUM FULL, which does compact the table but locks it against selects and updates, essentially all activity, while it runs. He mentions two tools that can help you compact a table if this is something you need to do: one is pg_repack and the other is pgcompacttable. So if this is an issue you're dealing with and you don't want to run VACUUM FULL, check out the two utilities he mentions in this blog post.

[00:08:58] The next post is "PostgreSQL data types: date, timestamp, and time zones". There's actually a whole series of posts behind the first link here, "PostgreSQL data types", where he goes over all the different data types; I just picked the one I found most interesting, because doesn't everybody love time zones? I know there's a bit of a difference of opinion on whether time zones should be handled in the database or in the application layer, or at least where they get set and manipulated. For example, I tend to use Ruby on Rails or Phoenix as an application framework, and they tend not to store the time zone in the timestamp field, whereas a lot of database-focused people advocate storing the time zone in timestamp fields. I felt this blog post was a good viewpoint on how to use dates, times and timestamps in PostgreSQL, so you may want to check it out to see if you want to move more of these use cases into the database, and also check out the other data type posts he's done. Again, this is from the tapoueh.org blog.

[00:10:19] The next blog post is "The Ultimate Postgres vs. MySQL Blog Post", and this is from the di.nmfay.com blog. It is biased towards PostgreSQL, but I felt it was interesting to look through all the different advantages listed for Postgres versus MySQL. Years ago, when I was evaluating which database system I was going to use for my applications, I had predominantly been using Oracle prior to starting my company, and after evaluating both of these open source databases I chose PostgreSQL and haven't looked back. It's a very interesting overview of some of the differences, biased though it may be, and an interesting post to check out.

[00:11:14] The next article is titled very simply "Postgres 10 upgrade". However, it is a massive post about how they upgraded 1,500 Postgres clusters from version 9.3 to Postgres 10. They go into extensive depth on all the planning for all of these databases and on doing the upgrades in an automated fashion with a downtime window of just 15 minutes, 30 minutes at the outside maximum, and they walk through each of the steps and each of the commands to run. It is a really comprehensive post, and if you're embarking on something similar, or on your own upgrade, I definitely suggest checking it out to see the techniques they used and whether any of them could help in your upcoming upgrade planning. Definitely a great one to check out.
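A couple of quick illustrations of points from the articles above. First, the difference the Bloatbusters post draws between plain VACUUM and VACUUM FULL; the table name is made up:

VACUUM my_table;        -- marks dead rows reusable, but does not shrink the file on disk
VACUUM FULL my_table;   -- rewrites the table compactly, but blocks all reads and writes while it runs

And for the data types article, a small sketch of how timestamptz stores an absolute point in time and renders it in the session's time zone; the table and values are illustrative, not from the post:

CREATE TABLE meetings (id bigserial PRIMARY KEY, starts_at timestamptz NOT NULL);

-- Stored internally as UTC, regardless of the offset supplied on input
INSERT INTO meetings (starts_at) VALUES ('2018-04-16 10:00:00-04');

-- The TimeZone setting only changes how the value is displayed
SET TIME ZONE 'Europe/Paris';
SELECT starts_at FROM meetings;   -- 2018-04-16 16:00:00+02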
[00:12:15] The next blog post is "How to Perform Failover with the PostgreSQL Kubernetes Operator". This is from the crunchydata.com blog, and they're discussing a product they have that uses container technology to do failover across multiple PostgreSQL instances. Essentially, it states that it enables users to provision and manage thousands of PostgreSQL databases at scale. I haven't investigated this particular product, but in the interest of scaling Postgres, this is definitely another approach to take in terms of containerizing Postgres and managing many instances of it. So if this is something you're considering, definitely check out this blog post.

[00:13:09] And the last article is "PostgreSQL Replication Failback with pg_rewind". This is an in-depth video tutorial I set up that talks about doing replication failback using pg_rewind. I had done a previous tutorial on easy failback: if you promote a replica and the old primary receives no writes after that point of promotion, you can easily fail back to it just by getting it to follow the new primary database. However, if there have been writes to that old primary, it won't be able to follow the new primary. One solution is to delete that database cluster and do a full restore to resynchronize it with the new primary. Another option is to use pg_rewind, which essentially uses the WAL files to rewind the cluster back to the state it was in at the time of the promotion, so that the old primary can then follow the new primary database cluster. The thing to keep in mind, of course, is that with pg_rewind you may be rewinding away transactions that occurred on that old primary, so that's something to be aware of. I go into detail about how to set this up, how to do it, and the errors you can run into, so if you think this might be useful, I welcome you to check it out.

[00:14:40] That does it for this episode of Scaling Postgres. You can get all the links to the content presented in the show notes. Be sure to head over to scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.
