Global Indexes, Caching Aggregates, Vacuum Processing, Effective Cache Size | Scaling Postgres 91

Episode 91 | November 25, 2019 | 00:13:22
Scaling Postgres


Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, we discuss global indexes, ways to cache aggregates, how vacuum processing works, and the purpose of effective_cache_size.

To get the show notes as well as get notified of new episodes, visit: 

https://www.scalingpostgres.com/episodes/91-global-indexes-caching-aggregates-vacuum-processing-effective-cache-size/


Episode Transcript

[00:00:00] In this episode of Scaling Postgres, we talk about global indexes, caching aggregates, vacuum processing, and effective cache size. I'm Creston Jamison, and this is Scaling Postgres, episode 91.

[00:00:21] Alright, I hope you're having a great week. Our first piece of content is "Proposal for Global Indexes in PostgreSQL". This is from the percona.com blog. Basically, PostgreSQL does not have global indexing for partitioned tables; the indexes actually exist on each partition, which this post calls a local index for explanation purposes. But according to this post, they've sent a proposal to the community for global indexes. As they describe it, this is a one-to-many index, in which one index maps to all the different partitions. Now, one area where I think this is incredibly important for data integrity is being able to have a unique constraint that crosses all partitions, because I've run into a situation where I wanted a unique constraint on a column that is not part of the partition key, and you can't really do it: you can set a unique constraint on an individual partition, but you can't have a global one. If there were a global index, presumably that would be supported. The other area they're considering is that when you're querying across partitions, you could potentially get better performance with a global index. They also talk a little bit about the proposed syntax and some potential changes that would be needed. Now, one disadvantage they mention is that you're probably going to take a hit on writes once this is implemented. But this is an interesting feature that seems to be in discussion, so we'll see what comes of it.
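To make that unique-constraint limitation concrete, here is a minimal sketch, assuming a hypothetical range-partitioned orders table (the names are mine, not the post's):

    -- A declaratively partitioned table (illustrative schema)
    CREATE TABLE orders (
        id           bigserial,
        external_ref text NOT NULL,
        created_at   timestamptz NOT NULL
    ) PARTITION BY RANGE (created_at);

    -- Fails: a unique constraint on a partitioned table must include
    -- every partition key column (created_at here), because each
    -- partition only has its own local index to enforce it.
    ALTER TABLE orders ADD CONSTRAINT orders_ref_uniq UNIQUE (external_ref);

    -- Works, but it only guarantees uniqueness of the pair, not of
    -- external_ref across all partitions, which is the gap a global
    -- index would close.
    ALTER TABLE orders
        ADD CONSTRAINT orders_ref_ts_uniq UNIQUE (external_ref, created_at);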
The next post is "Pulling the trigger: How do you update counter caches in your Rails app without Active Record callbacks?" Because it's talking about Active Record and Rails, this is about Ruby on Rails, but a lot of it pertains to the PostgreSQL database and triggers and locking and things of that nature. So they have a scenario where they have an orders table and a users table, and they want to collect aggregates on them: how many total orders per user, for example. They filled up a database with 10,000 users and 100,000 orders, and basically the performance is not what they would hope when you're just querying it. So their idea was to essentially cache, or create, a user stats table that stores the total orders amount and the total orders count by user. Now, with Ruby on Rails, one method you could use is Active Record callbacks or counter caches, but as they say here, that really won't work for the amount portion. So they wanted to use a trigger. Now, they're using a Ruby gem called HairTrigger to actually implement this (it just makes things a little bit easier to do), but the raw SQL is in the post. They insert into the user stats table, summing the order amounts and counting the orders, and if there is a conflict on the user ID (so that record is already there), it does an update and sets the new orders amount and orders count. But they ran into a problem with this, because they had some race conditions: they created a job that ran on four separate threads and tried to update, and they hit race conditions. Now, they mention they could change the isolation level, but what they actually wanted to do is use a transaction-level advisory lock. So these are application-level locks, and all they had to do was take an advisory lock using the user ID and add it to the trigger. That was able to do these inserts or updates and avoid the race condition they were encountering before. And they have another way they did it, which they're calling the lock-free alternative, using deltas. Essentially, they're doing an insert, or on conflict an update, where they just add the new order's amount and count to the existing totals. And they say as long as you don't use any subqueries, race conditions should not be possible. So these are two possible implementations to cache aggregates in your PostgreSQL database. Oh, and I should mention, this is from the evilmartians.com blog.
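As a rough sketch of that trigger-based upsert, assuming hypothetical orders and user_stats tables (this is my reconstruction, not the post's exact code):

    CREATE TABLE user_stats (
        user_id       bigint PRIMARY KEY,
        orders_amount numeric NOT NULL DEFAULT 0,
        orders_count  bigint  NOT NULL DEFAULT 0
    );

    -- Recompute the user's totals and upsert them after each new order
    CREATE OR REPLACE FUNCTION refresh_user_stats() RETURNS trigger AS $$
    BEGIN
        INSERT INTO user_stats (user_id, orders_amount, orders_count)
        SELECT NEW.user_id, COALESCE(SUM(amount), 0), COUNT(*)
        FROM orders
        WHERE user_id = NEW.user_id
        ON CONFLICT (user_id) DO UPDATE
            SET orders_amount = EXCLUDED.orders_amount,
                orders_count  = EXCLUDED.orders_count;
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER orders_refresh_user_stats
    AFTER INSERT ON orders
    FOR EACH ROW EXECUTE FUNCTION refresh_user_stats();

Because two concurrent transactions can each compute a sum before either commits, this version is exactly where the race condition shows up.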
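And a sketch of the two remedies discussed: the transaction-level advisory lock and the delta-based, lock-free variant (again, hypothetical names):

    -- Remedy 1: inside the trigger function, serialize writers per user
    -- by taking a transaction-scoped advisory lock before the upsert.
    PERFORM pg_advisory_xact_lock(NEW.user_id);

    -- Remedy 2: skip the subquery entirely and apply a delta; the
    -- additions happen atomically inside the upsert itself.
    INSERT INTO user_stats (user_id, orders_amount, orders_count)
    VALUES (NEW.user_id, NEW.amount, 1)
    ON CONFLICT (user_id) DO UPDATE
        SET orders_amount = user_stats.orders_amount + EXCLUDED.orders_amount,
            orders_count  = user_stats.orders_count  + EXCLUDED.orders_count;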
[00:04:40] The next post is "An Overview of VACUUM Processing in PostgreSQL", and this is from the severalnines.com blog. It gives a general overview of vacuum and why it exists. The first thing they mention is that PostgreSQL does not do in-place updates. On a delete, it actually marks the row as dead, and it must be vacuumed later. On an update, the current row is marked as dead and a new row is inserted in its place, essentially like a delete followed by an insert. Because of this, these dead rows must be vacuumed up once they're no longer visible to any session connected to the database. Along with that, they describe the visibility map, which tracks which pages contain only tuples visible to all transactions; they talk a little bit about the free space map; and they cover the importance of freezing old tuples so that transaction ID wraparound doesn't cause problems. Then they go over vacuum's responsibilities: scanning the pages of tables to find dead tuples, freezing old tuples, removing the index tuples that point to the dead tuples, removing the dead tuples from each page and compacting the remaining live tuples, updating the free space map and the visibility map, truncating the last page when possible, and updating the corresponding system tables. So a fair amount of work that vacuum has to do. Now, they also mention this does not, of course, actually free up space on disk. It makes space available for new inserts within the database itself, but you're not going to reclaim disk space. However, a VACUUM FULL, which rewrites the data, does free up disk space. The disadvantage, of course, is that it takes an exclusive lock on the relation, so it prevents any selects, updates, inserts, or deletes while it runs; generally you don't want to do this on a running production database. There are certain specialized tools or extensions you can use to do the equivalent of a VACUUM FULL. They then discuss the purpose of autovacuum and go into a little more detail on VACUUM versus VACUUM FULL. So if you're interested in learning a bit more about vacuum and its processing, definitely a blog post to check out.
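A quick, hedged way to watch that dead-tuple bookkeeping in action (the table name is hypothetical):

    -- How many dead tuples has a table accumulated, and when was it last vacuumed?
    SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
    FROM pg_stat_user_tables
    WHERE relname = 'orders';

    -- Plain vacuum: marks dead tuple space reusable; runs alongside normal traffic
    VACUUM (VERBOSE) orders;

    -- VACUUM FULL rewrites the table and returns space to the operating system,
    -- but it takes an ACCESS EXCLUSIVE lock, so avoid it on a busy production table.
    -- VACUUM FULL orders;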
The next post is "effective_cache_size: What it means in PostgreSQL", and this is from cybertec-postgresql.com. This is a setting in your postgresql.conf file that gives the database insight into how much memory on the system is potentially available for caching, including the file system cache, because the greater that amount, the more likely an index is already in memory, and the more likely PostgreSQL is to use an index scan. So they imagine a scenario where you have a database server with 100 GB of RAM, and say, okay, maybe 2 GB is for the operating system, 3 GB for PostgreSQL itself, shared_buffers set to, in their example, 24 GB, and then pretty much whatever is left, at least on a Linux system, would typically be used for the file system cache. They then show the actual comments from the C code explaining how, as effective_cache_size grows, the planner's estimated cost of an index scan drops, essentially increasing the probability that an index scan will be used. It's a way to communicate to PostgreSQL the rough size of the cache available to it. So if you're interested in learning more about this setting, definitely a blog post to check out.
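A hedged sketch of what that arithmetic might look like in postgresql.conf, using the example figures recalled in the episode:

    # Illustrative values for a server with 100 GB of RAM
    shared_buffers = 24GB

    # Roughly: 100 GB - 2 GB (OS) - 3 GB (PostgreSQL processes) - 24 GB
    # (shared_buffers) leaves about 71 GB plausibly acting as file system cache.
    # Note: effective_cache_size allocates nothing; it only informs the planner,
    # and some guidance adds shared_buffers back into the figure, since pages
    # cached there also count toward what is effectively cached.
    effective_cache_size = 71GB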
[00:08:18] The next piece of content is actually a YouTube channel. They have just released a number of videos, about 20 or so, from PostgresConf South Africa, which happened in October, and this is on the PostgresConf South Africa YouTube channel. So definitely some additional content to check out with regard to PostgreSQL.

[00:08:41] The next post is "It's all in database", and this is from Rafia Sabih's blog on blogspot.com. It talks about how to determine the size of your different tables or relations in PostgreSQL. She talks about the \dt+ command in psql to give you an indication of a table's size. You can also use the system functions pg_table_size and pg_relation_size, and she notes that pg_table_size won't give you the size of the indexes; for that you have to use pg_total_relation_size. She also mentions an extension called pgstattuple that gives you several additional metrics if that's of interest, particularly if you're trying to find bloat or things of that nature. So if you're interested in different methods of determining the size of your database, definitely a blog post to check out.
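For reference, the size checks mentioned above look roughly like this in psql (the table name is hypothetical):

    -- List a table along with its on-disk size
    \dt+ orders

    -- Table including TOAST and auxiliary maps, but excluding indexes
    SELECT pg_size_pretty(pg_table_size('orders'));

    -- Just the table's main data fork
    SELECT pg_size_pretty(pg_relation_size('orders'));

    -- Table plus all of its indexes
    SELECT pg_size_pretty(pg_total_relation_size('orders'));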
[00:09:33] The next post is "12 Common Mistakes and Missed Optimization Opportunities in SQL", and this is from hakibenita.com; the post describes exactly that. The twelve are: be careful when dividing integers; guard against division-by-zero errors (both of those are general programming concerns in any language); know the difference between UNION and UNION ALL (basically, UNION excludes duplicates); be careful when counting nullable columns; be aware of time zones; avoid transformations on indexed fields (if you apply a function such as lower() or upper() to a column, an index on that column may not be used unless it's an expression index); use BETWEEN only for inclusive ranges (to avoid this issue, I generally just use greater-than and less-than comparisons rather than the BETWEEN syntax); add faux predicates (basically, look for indexed columns that could help you get better performance); inline CTEs (which PostgreSQL 12 recently changed the default for); fetch only what you need (generally good advice); reference the column position in GROUP BY and ORDER BY (you can use integers to reference the column position instead of repeating the exact expression); and format your query (again, a personal preference). So a lot of these items are on the simpler or more basic side, but if you want to become more familiar with some of them, definitely a blog post to check out.
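To make the BETWEEN item above concrete, a small sketch with hypothetical dates:

    -- BETWEEN is inclusive on both ends, so rows stamped exactly
    -- 2019-12-01 00:00:00 get counted in "November", which is usually unintended.
    SELECT count(*) FROM orders
    WHERE created_at BETWEEN '2019-11-01' AND '2019-12-01';

    -- A half-open range with >= and < avoids the boundary problem.
    SELECT count(*) FROM orders
    WHERE created_at >= '2019-11-01' AND created_at < '2019-12-01';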
[00:11:06] The next post is "Similarity in Postgres and Rails using Trigrams". Now, a lot of this is actually based on Rails, but they discuss trigrams and the pg_trgm extension, give a rough Ruby implementation of trigrams, and then go into using them in Rails. They enable the pg_trgm extension and then essentially select the similarity of an entered word against a cities table, like "Toronto" against the city names. So if you want to learn more about the pg_trgm extension, definitely a blog post to check out.
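A minimal sketch of that kind of similarity lookup, assuming a hypothetical cities table:

    CREATE EXTENSION IF NOT EXISTS pg_trgm;

    CREATE TABLE cities (name text NOT NULL);
    INSERT INTO cities (name) VALUES ('Toronto'), ('Torino'), ('Tokyo');

    -- similarity() returns 0..1; more shared trigrams means a higher score
    SELECT name, similarity(name, 'Toronto') AS score
    FROM cities
    ORDER BY score DESC;

    -- The % operator filters using pg_trgm.similarity_threshold (default 0.3)
    SELECT name FROM cities WHERE name % 'Torontoo';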
[00:11:43] The next post is "Enhancing PostgreSQL 12 Security with the CIS Benchmark". This is from crunchydata.com. This is a Center for Internet Security benchmark that they've been publishing, and it has now been released for version 12. So if you want to find out more information with regard to that, definitely check it out.

[00:12:04] The next post is "PostgreSQL Development and Configuration with Puppet". So if you're looking for a configuration management tool to be able to deploy your PostgreSQL instances, this is definitely a blog post to check out. Personally, I use Ansible, but Puppet is also a viable alternative.

And the last post is about ogr_fdw spatial filtering; the headline is that ogr_fdw now pushes spatial filters down to remote data sources. This is definitely related to Postgres, in an area I'm not too knowledgeable about, but they're talking about foreign data wrappers, and OGR is a vector data model from the GDAL spatial library. So there is a way to push down queries to the data source you're connecting to, which presumably results in better performance. So if you're interested in that, definitely a blog post to check out.

[00:12:58] That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to scalingpostgres.com, where you can sign up to receive weekly notifications of each episode. Or you could subscribe via YouTube or iTunes. Thanks.