Better Plans With Statistics | Scaling Postgres 293

Episode 293 | December 04, 2023 | 00:15:18

Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, we discuss how CREATE STATISTICS can help you get better plans, issues with subtransactions, all about extensions, and best practices when indexing timestamps.

To get the show notes as well as get notified of new episodes, visit: 

https://www.scalingpostgres.com/episodes/293-better-plans-with-statistics/

 

 


Episode Transcript

[00:00:00] Most of the time when you have a bad plan, it's usually due to the statistics. Maybe they're out of date, maybe you need them at more granularity. Well, another case is when there are dependencies between columns that Postgres doesn't realize, and this week Lukas covers how to handle that. But I hope you, your friends, family, and coworkers continue to do well. I would also like to thank and welcome everyone who decided to join the course while it was open for early adopters this past Black Friday through Cyber Monday. I'll provide more information in the coming weeks, but open enrollment for everyone else is starting in January, so I'll talk more about it then.

The first piece of content is "How to use CREATE STATISTICS to improve a plan rows estimate, and why it matters." This is from pganalyze.com and is part of Lukas's Five Minutes of Postgres, where he talks about a situation he actually saw in his own database at pganalyze. That issue was with statistics. The system automatically collects single-column statistics, but it doesn't do anything about dependencies that may exist between two different columns. You can add those using CREATE STATISTICS; they're called extended statistics. He had two columns that are dependent upon one another. Normally, I think the example use case is a city and a zip code, so there's an alignment between zip codes and cities, or between cities and states. But in his case, I think it was something like servers and the organization that owns those servers, and they were getting a bad plan: close to 500 rows actually existed when he ran EXPLAIN ANALYZE, but the plain EXPLAIN plan thought there was one row. The reason was that it was doing a BitmapAnd, combining two indexes together, so it was looking at these statistics independently, and clearly that was giving an inaccurate estimate. To handle it, he ran a CREATE STATISTICS command. He also needed to increase the amount of statistics that were being collected for this table. But after he did that, it was giving a much more accurate estimate of how many rows were actually there. You'll have to watch the video to get more of the details behind it, and I definitely encourage that, because he also explains how the statistics worked and how the planner was coming up with one row instead of 500. The reason this is important is that if these statistics are off for what you're querying, you could get a bad plan, which means there exists another plan that's more performant. So if you're seeing performance issues and you think it's related to the statistics being collected, it may be a case where you need to increase how many statistics you're collecting, or maybe you need to add some of these extended statistics to give the planner more information to pick a better plan. But check out this piece of content if you want to learn more.
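To give a rough idea of what that looks like, here is a minimal sketch of creating extended statistics on two dependent columns. The table and column names are made up for illustration, not taken from the pganalyze post:

    -- Hypothetical schema: every server belongs to exactly one organization,
    -- so the two columns are strongly correlated.
    CREATE STATISTICS servers_org_dep (dependencies)
      ON organization_id, server_id FROM servers;

    -- Optionally collect more detailed per-column statistics,
    -- similar to raising the statistics target mentioned above.
    ALTER TABLE servers ALTER COLUMN organization_id SET STATISTICS 1000;

    -- Extended statistics are only built on the next ANALYZE.
    ANALYZE servers;

After that, EXPLAIN should show row estimates that account for the dependency between the two columns.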
[00:02:54] Next piece of content: there was another episode of Postgres FM last week. This one was on subtransactions. These are transactions nested within an overall transaction in the database, and usually the way you create them is you begin a transaction block and then use savepoints within it, which basically save your work part of the way through. Now, the main advice was: don't use them, or try not to use them if you can avoid it, because particularly once your database gets active, you can run into problems with them. I think Nikolai mentioned one way he does use them, and that is when he's doing DDL changes that need to take a lock, he may create savepoints during that process and retry in a loop on a particularly busy table while it's trying to grab the lock; once it has the lock and is able to make the change, it can continue. That's the only example use case I think they mentioned, but otherwise potentially try to avoid them. They talked about four issues they've seen with subtransactions that are mentioned in some of the posts down here. One is that they burn through your transaction IDs: every transaction uses a transaction ID, but every subtransaction you create uses another one, and a single transaction can have multiple subtransactions, so you're just using up your transaction IDs faster. The second issue is too much nesting; I think there's a limit on how many subtransactions you can have within a transaction, so you just need to be careful of that. There can also be some issues with multixact IDs, particularly with SELECT FOR UPDATE, I think they were mentioning, so that's something to look out for. And then they also mentioned a potential replica issue that can happen as well. But it's a great piece of content, and I definitely encourage you to check it out. You can listen to the episode here or watch the YouTube video down here.
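For reference, a subtransaction is what you get whenever you use a savepoint inside a transaction block (many ORMs also create them for you with nested transactions). A minimal sketch, with made-up table and values:

    BEGIN;

    UPDATE accounts SET balance = balance - 100 WHERE id = 1;

    -- Each savepoint starts a subtransaction, which gets its own
    -- transaction ID once it modifies data.
    SAVEPOINT before_risky_step;

    UPDATE accounts SET balance = balance + 100 WHERE id = 2;

    -- If that step fails, roll back just the subtransaction
    -- instead of aborting the whole transaction.
    ROLLBACK TO SAVEPOINT before_risky_step;

    COMMIT;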
[00:04:42] The next piece of content is actually a presentation PDF: "Postgres extensions shape the future." This is on Rustprooflabs.com, and Ryan Lambert did it. It's a really long 70-slide presentation, and because it's from Rustproof Labs and Ryan, who's the author of Mastering PostGIS and OpenStreetMap, it definitely covers a lot of geographic information systems related content in Postgres. But it talks all about extensions: their history, particular ones to use, and how to create them in different ways, like using the pgrx framework. So there's definitely a lot of great information about extensions you may want to check out.

[00:05:22] Next piece of content: "Indexing timestamps." This is from momjian.us, and he has a scenario where you want to put an index on a timestamp in your table, but you want to be able to query it by date. Maybe your first idea is to cast the column to a date in the index. The problem is this won't work in this particular case because you're using timestamp with time zone (presumably it would work if you were using timestamps without time zones). A way to get it to work is to give it a specific time zone when you're casting it to a date; then it will allow you to create this functional index. But the most important thing he mentions is that you don't really even need to do that. You can just have a basic index on the timestamp and then query it using the date without the time component. As you can see in the explain plan, it basically defaults to all zeros for the time portion if you only include the day portion, and you just do greater than or equal to and less than. This is something I already do intuitively, but apparently people were trying the previous possibilities, and he says it even works to cast a value to a date and add intervals to it, and it still does the appropriate conversion to give you the answer you want. So this is essentially the most flexible option, and as he says, this technique gives you unlimited granularity, so you can define how granular you want to be with your time data, and it's flexible with time zones. So check out this piece of content if you want to learn more.
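As a minimal sketch of those options (the events table and created_at column here are placeholders, not taken from the post):

    -- Simple index on the timestamptz column; no expression needed.
    CREATE INDEX ON events (created_at);

    -- The expression-index variant requires pinning a time zone, because
    -- casting a timestamptz straight to date is not immutable:
    CREATE INDEX ON events (((created_at AT TIME ZONE 'UTC')::date));

    -- Querying one day with a half-open range uses the simple index
    -- and works at any granularity.
    SELECT *
    FROM events
    WHERE created_at >= '2023-12-04'
      AND created_at <  '2023-12-05';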
[00:06:50] Next piece of content: "What we learned from benchmarking Amazon Aurora PostgreSQL Serverless." This is from Timescale.com, and apparently they have had some people converting from Aurora to Timescale, so they decided to take a look at cost differences, performance differences, things like that. Now, they did use a time series workload to test this, so you'll have to take that into account when looking at it. But using this type of workload, which Timescale was designed for, they saw that it was 35% faster to ingest. Query times ranged from roughly equivalent (a little over 1x) to 16 times faster, but two queries were actually slower, so Aurora won those. It was also 95% more efficient at storing data, which is predominantly due to the column-store compression that Timescale offers. And they achieved pretty significant cost savings even with this better performance: 52% cheaper per hour for compute and 78% cheaper for the data store. Presumably a lot of this is due to the compression that Timescale offers, but if you use Aurora and you have a mostly time series related data set, maybe you want to check out using Timescale instead of Aurora for some cost savings.

Next piece of content: "Join the preview of Amazon Aurora Limitless Database." This is a new offering by AWS, and it looks to be their sharding solution. Basically, it's one instance that can grow limitlessly, they say, and it's in preview now, so not generally available. Presumably you do have to define sharded tables and shard keys, so this essentially seems to be what Citus is doing: it's a sharded database, a scale-out solution. And I will say, what was pretty funny was that I found this on Postgres Weekly, the weekly newsletter about Postgres, and under their heading they mentioned that not only can your database be limitless, your AWS bill can be too. So that's kind of funny.

Next piece of content: "Accelerate HNSW indexing and searching with pgvector on Amazon Aurora PostgreSQL-compatible edition and Amazon RDS for PostgreSQL." This is from Amazon.com, and they basically took the newest version of pgvector, 0.5.1, compared it a little bit to 0.5.0 as well, and used their new Graviton2 and Graviton3 instances to see what the performance differences are, of course using the new HNSW indexes. Scrolling down to the results, first they looked at ingestion, and as the number of clients climbed, the more recent version of pgvector looked to be over three times faster. So if you're using an older version, definitely move up to the latest version of pgvector if you want better insertion performance. They also looked at insertion performance with regard to their instance types, and they saw, I think, generally a 20% to 30% speedup, so the new instances do better. Apart from insertions, they also looked at query performance, not with pgvector 0.5.0 versus 0.5.1, but with regard to their Graviton instances, and it looks like they saw a 40% to 60% performance improvement with those. But check this post out if you want to learn more.

[00:10:05] Next piece of content: "Creating a fast time-series graph with Postgres materialized views." This is from Timescale.com. They created a scenario where you have an electricity provider that is tracking demand for their service, and they wanted some sort of dashboard to present the information. They had a particular query that did it, so basically they materialized it into a view. That is, they created a materialized view, which creates a real table behind it, and of course whenever you do this you're pre-generating the data, which makes it much faster and easier to query. The downside is you have to refresh that materialized view periodically: you have data constantly flowing in, but you need to take the time to actually refresh the view. Now, you can do it concurrently, but if you have years of data within this view, it still has to recalculate everything. Timescale offers the ability to basically aggregate continuously; it takes a little bit of a performance hit, but it means you don't have to rebuild all the data. But looking at what this does, I don't think there's anything necessarily stopping someone from rolling their own solution that does kind of these continuous aggregates in stock Postgres itself.
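As a rough sketch of the materialized view approach in stock Postgres, assuming a hypothetical demand_readings table (not the schema from the post):

    -- Hypothetical raw table: demand_readings (reading_time timestamptz,
    -- region text, demand_mw numeric).
    CREATE MATERIALIZED VIEW demand_hourly AS
    SELECT date_trunc('hour', reading_time) AS hour,
           region,
           avg(demand_mw) AS avg_demand_mw
    FROM demand_readings
    GROUP BY 1, 2;

    -- A unique index is required to refresh the view concurrently.
    CREATE UNIQUE INDEX ON demand_hourly (hour, region);

    -- Run this periodically (cron, pg_cron, etc.); CONCURRENTLY keeps the
    -- view readable during the refresh, but it still rebuilds everything.
    REFRESH MATERIALIZED VIEW CONCURRENTLY demand_hourly;

Rolling your own continuous aggregate would mean replacing the full refresh with a job that upserts only the most recent time buckets into a regular summary table, which is roughly what Timescale's continuous aggregates automate.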
[00:11:17] But if you want to learn more, check out this blog post.

Next piece of content: "Compression face-off: Postgres TOAST versus Timescale compression." This is from Timescale.com. Now, this probably isn't quite a fair matchup, because I think the primary responsibility of TOAST is just to move large data out of the main table area. So if you have a huge JSON document or a huge text document, you want to store all that data outside of the normal row storage, and I think compression wasn't necessarily top of mind when designing TOAST. It does compress, just not to the degree to which Timescale has focused on compression. Plus, only certain types of data make it into TOAST. So in the example they look at here, where presumably you have columns that aren't so toastable, there's essentially nothing in the TOAST, so essentially nothing is compressed, versus 73% compression using Timescale. [00:12:12] But once you store data that gets toasted... although, looking at this chart, I'm not quite sure why these are zeros for the TOAST size, because it said they added a JSONB data type, so I would expect the TOAST to be populated. I don't quite understand that. But it still says they get about double the compression. Some of these wins become clear when you look at exactly how they do it: they're basically using a kind of column store. As you can see here, they batch up to, say, 1,000 rows into a single row. They take those thousand rows and put all the timestamps in one column as an array, all the device IDs in another column as an array, then status codes, temperatures and so on, so similar data types end up together in an array in a single column. Then they use specialized compression techniques for each data type to compress it as much as they can. With this solution they get much higher compression rates, and that is what makes Timescale better for compression. But check out this blog post if you want to learn more.

[00:13:17] Next piece of content: it's time for PGSQL Phriday again. This is number 14, and this is the invitation from pgsqlphriday.com, hosted this time by Cybertec. It's focused on PostgreSQL events. Now, these aren't events in PostgreSQL, but events where you meet other people using PostgreSQL, and the suggested topics include reflecting on personal experiences, looking at particular highlights, comparing virtual versus in-person experiences, networking opportunities, et cetera. So feel free to participate in this blogging event this December.

[00:13:53] Next piece of content: "Fun with Postgres: looped functions and linear progressions." This is from Crunchydata.com. This is day 21 of the Advent of Code 2022 challenges, and I won't mention it further because it would spoil how it's being done. [00:14:09] Next piece of content, also from Crunchydata.com, is "Fun with Postgres: text file mazes, charts and routes." This is the day 22 challenge, so you can check this out if you're interested.

[00:14:22] And the last piece of content: "Database generated columns: Django and PostgreSQL." This is from paulox.net, and it focuses a lot on Django, but it does talk very briefly about the generated column capability within Postgres. I don't know if you necessarily need this feature in Django to use generated columns, because I know I use Rails, and as soon as the generated column feature was available I could start using it in my database; I didn't have to wait for a particular Rails feature to use it. But if you want to learn more, definitely feel free to check out this blog post.

[00:14:57] I hope you enjoyed this episode. Be sure to check out Scalingpostgres.com, where you can find links to all the content mentioned, an eventual transcript of the show, and the podcast version as well. Thanks, and I'll see you next week.
