Real Time Analytics, Index Decrease Performance, work_mem | Scaling Postgres 17

Episode 17 June 18, 2018 00:19:40

Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, we review articles covering using Postgres for real-time analytics, how indexes can decrease performance and how to configure work_mem.

To get the show notes as well as get notified of new episodes, visit: 

https://www.scalingpostgres.com/episodes/17-real-time-analytics-index-decrease-performance-work_mem/

Episode Transcript

[00:00:00] In this episode of Scaling Postgres, we talk about real-time analytics, indexes decreasing performance, configuring work_mem, and window functions. I'm Creston Jamison, and this is Scaling Postgres episode 17.

[00:00:23] All right, our first article is Scalable incremental data aggregation on Postgres and Citus, and this is from the citusdata.com blog. Now, that title is quite lengthy; however, mostly what they're talking about is a new technique they've come up with for doing real-time analytics. So they have a scenario here where you're loading events into a raw data table in batches, and then periodically you want to aggregate these new events into a rollup table, creating a summary, for example, and select from that rollup table to display a real-time analytics dashboard. So this is a new technique that they came up with, and the "new approach" hyperlink here takes you to the actual code they present.

[00:01:16] And they tested it using their Citus data solution, which is basically an extension of Postgres that lets you set up nodes and direct queries to those specific nodes. So some of what they tested is not stock Postgres, but the techniques they discuss here can be used with standard Postgres. However, they saw much better performance using their Citus extension for Postgres. Still, the rationale and the techniques could be beneficial to you as a developer using Postgres. So the first thing they set up is a rollups table to manage what rollup tables exist. Basically, they want to retain the last sequence number that has already been aggregated for each rollup. So if you have a table that holds one-minute rollups of the data, or ten-minute rollups, or hourly rollups, you would use this table to store that rollup table's name and the last sequence number the aggregates were calculated through. Now, this particular technique works as of Postgres 10, because they are able to use the pg_sequence_last_value function.

[00:02:33] So here they show the code that they're using. They created a function called incremental_rollup_window that actually drives the rollup process. First it selects the rollup name they're interested in from the rollups table, and then they use a trick where they briefly take a lock on the raw table to ensure that all in-flight writes have been committed before they do the aggregation. Here is where they use pg_sequence_last_value, and they update the rollups table. The output of this function is the window of sequence IDs to aggregate, the window end being the last aggregated ID, and it also updates the rollups table, essentially the master list, with that last aggregated ID. Now, the raw data table they're using is called page_views, because they're tracking page visits like perhaps a CDN would be doing. And they found that a BRIN index gave the best performance for looking up a range of sequence IDs. They also have an additional step here that's possible with Citus, which you wouldn't do with standard Postgres, of course: they create a table distributed across different shards, because in this scenario they're running a Citus cluster with at least four instances.
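To give a sense of the shape of this, here is a minimal sketch of the bookkeeping table and window function. This is reconstructed from the discussion above rather than copied from their post, so treat the names and the exact details as illustrative:

    -- One row per rollup table, tracking the last event ID already aggregated.
    CREATE TABLE rollups (
        name text PRIMARY KEY,
        event_table_name text NOT NULL,
        event_id_sequence_name text NOT NULL,
        last_aggregated_id bigint NOT NULL DEFAULT 0
    );

    CREATE OR REPLACE FUNCTION incremental_rollup_window(
        rollup_name text,
        OUT window_start bigint,
        OUT window_end bigint
    ) RETURNS record LANGUAGE plpgsql AS $$
    DECLARE
        table_to_lock regclass;
    BEGIN
        -- Claim this rollup's row; the new window starts just past the
        -- last aggregated ID.
        SELECT event_table_name::regclass, last_aggregated_id + 1
        INTO table_to_lock, window_start
        FROM rollups
        WHERE name = rollup_name
        FOR UPDATE;

        -- The locking trick: briefly take an exclusive lock so all pending
        -- writes commit first, then raise inside a sub-block so the lock is
        -- released immediately and writes can resume.
        BEGIN
            EXECUTE format('LOCK %s IN EXCLUSIVE MODE', table_to_lock);
            RAISE EXCEPTION 'release lock';
        EXCEPTION WHEN OTHERS THEN
            NULL;
        END;

        -- Available as of Postgres 10: the last value the sequence handed out.
        window_end := pg_sequence_last_value(event_id_sequence_name::regclass);

        UPDATE rollups
        SET last_aggregated_id = window_end
        WHERE name = rollup_name;
    END;
    $$;

The point of the brief exclusive lock is that acquiring it forces the function to wait for any uncommitted inserts with lower sequence IDs, so the window it returns only covers fully committed rows.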
And then they created a table called page_views_1min that stores rollups of the counts of views for every minute. If you're getting thousands of views per minute, this allows the table to be queried much faster, because it will contain far fewer rows. They insert this one-minute rollup table into the master list of rollup tables, so they just have the one one-minute rollup table. Then they create a second function for actually doing the aggregation work. Relying upon the previous function we looked at, incremental_rollup_window, it grabs the next window they want to calculate, finding the start and the end for it, and inserts the new aggregates into their page_views_1min table, doing ON CONFLICT DO UPDATE. So basically, insert the value for a new minute, or update the existing value. And with this function in place, you can simply run select * from do_page_view_aggregation() to run the aggregation and update the one-minute view table. And they say here that by running the do_page_view_aggregation function frequently, you can keep the rollup table up to date within seconds of the raw events table.

They give an example of what the performance is like querying the raw data versus the aggregated data. They go from 869 milliseconds when querying the raw page_views table to five milliseconds for the same counts when querying the page_views_1min rollup table, so over 100 times faster when you're aggregating data in this fashion. And then, of course, they compare performance between Citus and raw Postgres, and they saw dramatically better performance with Citus, due to its ability to run INSERT ... SELECT commands across multiple server instances, because it can handle not only the insert load from the raw data but also the aggregation. Now, of course, they are invested in Citus, so some of these statistics may be a bit biased. They did use a larger single instance here, and its performance didn't keep anywhere in line with a cluster of four smaller Citus instances. So whenever I see this, I wonder if there are different techniques that could be used to maybe not achieve Citus's level of performance, but get close to it; for example, would partitioning the table data help in any way when using these techniques? But it is definitely an interesting blog post to check out for understanding the techniques and how they're applied. And even though you may not want to use this exact technique, maybe you can take some of the ideas they use to figure out ways to scale your database more efficiently. So definitely a blog post to check out.
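As a rough sketch, again with illustrative names rather than the post's exact code, the rollup table and the aggregation function built on top of the window function above might look something like this:

    -- Raw events; a BRIN index keeps range scans over sequence IDs cheap.
    CREATE TABLE page_views (
        view_id bigserial,
        page_id int NOT NULL,
        view_time timestamptz NOT NULL DEFAULT now()
    );
    CREATE INDEX page_views_view_id_idx ON page_views USING BRIN (view_id);

    -- One row per page per minute; the primary key enables ON CONFLICT.
    CREATE TABLE page_views_1min (
        page_id int NOT NULL,
        period_start timestamptz NOT NULL,
        view_count bigint NOT NULL,
        PRIMARY KEY (page_id, period_start)
    );

    -- Register this rollup in the master list from the earlier sketch.
    INSERT INTO rollups (name, event_table_name, event_id_sequence_name)
    VALUES ('page_views_1min', 'page_views', 'page_views_view_id_seq');

    CREATE OR REPLACE FUNCTION do_page_view_aggregation(
        OUT start_id bigint, OUT end_id bigint
    ) RETURNS record LANGUAGE plpgsql AS $$
    BEGIN
        -- Grab the next window of not-yet-aggregated sequence IDs.
        SELECT window_start, window_end INTO start_id, end_id
        FROM incremental_rollup_window('page_views_1min');

        -- Insert each minute's counts, or add to an existing row.
        INSERT INTO page_views_1min (page_id, period_start, view_count)
        SELECT page_id, date_trunc('minute', view_time), count(*)
        FROM page_views
        WHERE view_id BETWEEN start_id AND end_id
        GROUP BY page_id, date_trunc('minute', view_time)
        ON CONFLICT (page_id, period_start) DO UPDATE
        SET view_count = page_views_1min.view_count + excluded.view_count;
    END;
    $$;

    -- Run this frequently, e.g. from cron, to keep the rollup current:
    -- SELECT * FROM do_page_view_aggregation();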
[00:06:54] The next post is Adding an index can decrease SELECT performance, and this is from the Cybertec blog. They talk about how, when you add a new index, the cost you usually think about is that your inserts and updates are going to be a little slower because you have to maintain that index, while you generally expect selects to go faster. Now, of course, if the index covers the data you're selecting, it will be faster. But there are some cases where it can actually reduce the performance of some queries. He has an example query here on a three-column table, selecting on two of those columns, one of which has an index on it, ordering by the sort column, and then applying a limit. Things are fine with this query: it finishes in four milliseconds, given how much data he has in it, which looks like maybe a million rows. But when he adds an additional index on the sort column, presumably to make sorting faster, the planner chooses a different plan and tries using this new index to find just the 20 values that match the criteria. It ends up reducing performance significantly, from four milliseconds up to 28 milliseconds.

And I've seen this on occasion when you're ordering and there's a limit, because sometimes if the limit is small, the planner says, well, I can just start going through this index, or even do a sequential scan, and based upon the statistics I'll eventually run into the values of interest, whereas a lot of times just using a direct index is much faster. And you can look for this too: if you, say, eliminate a LIMIT clause from a query, sometimes it may run dramatically faster. Now, he mentions two ways to actually speed this type of query up. One is to drop the index that is misleading the planner. The second is to rewrite the query so that Postgres cannot use the offending index, and he uses some interesting tricks here: one is using an expression as the sort key, because the plain index can't be used when the sort key is an expression, and the other is using a subquery with an OFFSET of zero. I also found it interesting that there were some other solutions in the comments that may be more effective: one proposes a partial (conditional) index, and another a multicolumn index. These scenarios may also give the most optimal performance. But if you run into this situation, I would definitely consider removing the LIMIT clause, and if you need to, the ORDER BY clause, just to see what impact those have, and maybe try some of the different techniques the blog poster mentioned as well as those in the comments. And of course, using EXPLAIN you should be able to find the most optimal way to write the query and/or structure your indexes.
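To illustrate those two rewrite tricks, here is the general idea with a made-up table and column names, not the post's exact schema:

    -- The problem case: with an index on "sort", the planner may walk that
    -- index and filter rows until it finds 20 matches, which is slow when
    -- matches are rare.
    SELECT * FROM example
    WHERE category = 42
    ORDER BY sort
    LIMIT 20;

    -- Trick 1: use an expression as the sort key (assuming a numeric
    -- column); the plain index on "sort" no longer matches the ORDER BY,
    -- so the planner cannot use it.
    SELECT * FROM example
    WHERE category = 42
    ORDER BY sort + 0
    LIMIT 20;

    -- Trick 2: a subquery with OFFSET 0, which keeps the planner from
    -- flattening it and using the offending index for the ordering.
    SELECT * FROM (
        SELECT * FROM example
        WHERE category = 42
        OFFSET 0
    ) AS unordered
    ORDER BY sort
    LIMIT 20;

    -- Something along the lines of the alternatives from the comments:
    -- a multicolumn index matching both the filter and the sort order,
    -- or a partial index restricted to the filter condition.
    CREATE INDEX ON example (category, sort);
    CREATE INDEX ON example (sort) WHERE category = 42;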
[00:09:55] The next post is Materialized views with PostgreSQL for beginners. This is a little bit of an older post, but I saw it come through the PostgreSQL ecosystem this week, and it covers materialized views for beginners. It appears to be written by a developer, so this is from a developer's perspective. They had a situation where real-time traffic was going into an object data store and a daily batch of data was being sent to Postgres, but, as they say, this made for an unhappy user, because apparently they wanted more frequent data. So basically they wanted to build an aggregation table. At first they did not use views at all; they just created an aggregate table and refreshed it every two hours by deleting rows that had been updated and inserting new versions. They describe this as a bad idea, because the aggregation table experienced read access locks; basically, they were running into locking issues.

So then they looked into materialized views. Just to remind you, a standard view is a virtual table: you specify precisely how you want it to look, but it goes to the raw tables to actually generate the data for you. A materialized view is no longer virtual; it is like a real table, so you are storing the data that comprises the view, you can add indexes to it, and you can treat it like a normal table in that respect. And indexes on a materialized view can give you a pretty large performance boost. So they started using this technique, but then they ran into another problem: because a materialized view is not dynamic, you have to tell it to be refreshed on a periodic basis, and the refresh was locking out users during the rebuild. What they came upon is refreshing the materialized view concurrently. Much like building indexes, where a plain index build impacts access to the table, if you run the refresh concurrently it takes longer, but it doesn't impact general read, update, or insert access to the table. And the key thing she also mentions here is that to use REFRESH CONCURRENTLY, you must define at least one unique index on your materialized view. So this post gives a good overview and shows you some of the pitfalls they ran into when setting up their materialized views, and why they chose that particular solution. So definitely a blog post to check out.
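A minimal sketch of that pattern, with illustrative names since the post's own schema isn't shown here:

    -- A materialized view storing a daily aggregate as a real table.
    CREATE MATERIALIZED VIEW daily_page_views AS
    SELECT page_id,
           date_trunc('day', view_time) AS day,
           count(*) AS view_count
    FROM page_views
    GROUP BY page_id, date_trunc('day', view_time);

    -- Indexes on a materialized view work like indexes on a normal table,
    -- and REFRESH ... CONCURRENTLY requires at least one unique index.
    CREATE UNIQUE INDEX ON daily_page_views (page_id, day);

    -- A plain refresh locks out readers while it rebuilds:
    -- REFRESH MATERIALIZED VIEW daily_page_views;

    -- The concurrent form takes longer but doesn't block reads, so it can
    -- run on a schedule without interrupting users:
    REFRESH MATERIALIZED VIEW CONCURRENTLY daily_page_views;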
The next post is Configuring memory for Postgres, and this is again from the citusdata.com blog. This particular post is focused on work_mem. Now, work_mem is a per-connection setting: it's not a global setting for the whole database server, it's used per connection, and depending upon the operations, multiples of work_mem can be consumed within a given session. They start off by saying that by default work_mem is set to four megabytes in Postgres, and that each Postgres activity, such as joins and certain sorts, can consume four megabytes before it starts spilling to disk. You can find out if it's spilling to disk by searching for "temporary file" in your PostgreSQL logs, if you have log_temp_files enabled. They go on to say that perhaps a sane default is something like 64 megabytes for work_mem, if you're looking for a one-size-fits-all answer. That seems pretty high from the experience that I have; they mention something a little bit lower, with gradual increases, further down in the post, which seems like a more sensible approach. So definitely, if you're going to consult this post, be sure to read it all the way through and not just take the first suggestion here.

Then they go into the point I mentioned, that the setting essentially applies per connection, and walk through calculations based upon how much memory your server has versus how much is being used for different operations. Parallel queries in particular can have a big impact on how much work_mem each connection uses. And down here, in the section "More work_mem, more problems", is guidance I think is good: start small, with say 16 megabytes, and gradually increase work_mem when you start seeing temporary file indications in the logs. They go on to mention that if you start seeing out-of-memory errors in your logs, or the out-of-memory killer in Linux starts being kicked off, what you want to do is actually reduce your work_mem, because you're using too much of the memory available to Postgres.

And just to add a few guidelines on how I think about it: if you have a highly transactional database where the queries are relatively simple, but you've got a lot of them and you want to have a high number of connections, then you can keep your work_mem pretty low, because you're not going to need it for a lot of joins or sorting operations. Whereas if you have a more analytical load, with very complex sorts and joins, or a lot of them, pulling back a lot of data, and maybe you don't have quite as many connections, then you can bump your work_mem up a lot higher, because with fewer connections you can afford a larger work_mem to handle those types of loads without having to spill to disk. Generally, this is a really good post with really good guidance for work_mem, so definitely check it out if you do Postgres database tuning.
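As a quick illustration of that tuning loop, here is one way to check whether a given query's sort fits within work_mem. The table name is just an example carried over from the earlier sketches:

    -- Try a larger work_mem for this session only; the server-wide setting
    -- in postgresql.conf (e.g. work_mem = 16MB) stays untouched.
    SET work_mem = '16MB';

    -- With log_temp_files = 0 in postgresql.conf, every temporary file is
    -- logged, which is how you spot spilling across the whole server.

    EXPLAIN (ANALYZE, BUFFERS)
    SELECT page_id, count(*)
    FROM page_views
    GROUP BY page_id
    ORDER BY count(*) DESC;
    -- In the output, "Sort Method: external merge  Disk: ..." means the
    -- sort spilled to disk; "Sort Method: quicksort  Memory: ..." means
    -- it fit within work_mem.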
[00:15:44] The next post is Using window functions for time series IoT analytics in Postgres-BDR. Now, I found this post interesting, not necessarily for the Postgres-BDR aspects, but because it gives a very good introductory overview of using window functions. In this example, they're using IoT temperature sensor data that's being collected, and using window functions to show how you can query this data to rank the readings. As you see in the right column, it shows the rank based upon the reading: a very basic window function doing rank(), ordering by how low or high the temperature reading is. Then they talk about what impact dense_rank() has on the data, and they even use the lag() function to show the difference from the previous value. So I know this post was about BDR, but generally it's a great post just to give you a good introduction to window functions and how they work. If you're interested in that sort of thing, it's definitely a good blog post to check out. Oh, and I should say this is from the 2ndQuadrant PostgreSQL blog.

[00:16:56] The next post is How Redshift differs from PostgreSQL, and this is from the stitchdata.com blog. If you're not familiar, Redshift, Amazon Redshift, is basically a columnar analytical database provided by AWS, and as they say here, it's the most popular cloud data warehouse available. And if you didn't know, Redshift is actually a fork of PostgreSQL, so it has a lot of similarities in how you use it, but there are some key differences. This post is interesting because it goes over a little bit of the history in terms of the companies involved, which I found pretty interesting, and then it goes into some of the differences in the SQL you're going to run into, around creating tables, inserting data, and vacuums. It even talks about a lot of the unsupported features and data types, because Redshift was forked from Postgres quite a while ago, from my understanding. So there are a lot of features it doesn't have, but it has been built up to be optimal for data warehouse purposes. So definitely an interesting blog post to check out.

[00:18:10] The last post is Deploying PostgreSQL on a Docker container. Now, at this stage, when we don't have native sharding available in Postgres, I haven't really seen people putting Postgres in Docker containers a lot, perhaps because storage in a Docker container is essentially ephemeral, so you have to be very careful with how you're using it. I haven't really done it, and don't have that much experience with it, but if you're thinking about it or interested in it, here's an introductory blog post on how to deploy PostgreSQL in a Docker container, and this is from the Severalnines.com blog. But perhaps in the future, when PostgreSQL has more native sharding, which implies that you're going to be managing multiple database instances to satisfy query requests, running those in Docker containers may make the most sense. So definitely a blog post to check out.

[00:19:16] That does it for this episode of Scaling Postgres. You can get links to all the content presented in the show notes. Be sure to head over to Scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.
