Parallel Query, Real Time Analytics, Killed Index Tuples

[00:00:00] In this episode of Scaling Postgres, we talk about parallel query, realtime analytics, killed index tuples, and downtime. I'm creston. Jameson. And this is scaling postgres episode 38. [00:00:22] All right, we actually don't have a lot of content that was released this week, but there is a lot of video content that I'm going to be presenting in this week's episode either way. The first article is update on Basecamp Three being stuck in read only as of November eigth at 12:41 P.m. Central Standard Time. So if you're not familiar with Basecamp, it is a software as a service that does essentially project management. So they had an outage that lasted a number of hours and the reason was they ran out of integers for presumably an index. Now I'm not sure if they're using Postgres. I thought maybe they were using MySQL, but still, I felt that this blog post was a good reminder to people to be sure to check the size of your integers, particularly if they're being used for primary keys, and you want to make sure you don't run out of those or really use essentially big integers for your primary keys. So basically they had some downtime and it goes through the process and what happened in terms of getting the system back up. So even though this may not be postgres specific, it's always a good reminder to check and make sure you don't overrun your integer sizes, particularly when it's the primary key for particular tables. So definitely a blog post to check out. Oh, and I should say this is from the Signal Versus Noise blog. [00:01:51] The next article is killed index tuples and this is from the Cybertechn Postgresql.com blog. So basically this is a concept and it's actually a feature that was introduced in 8.1, but this is a recent blog post that's describing it and it talks about the concept of different sessions being able to see what tuples are active in the database. And you basically use x, min and max in order to determine that. And once a tuple has been classified as dead, so for example, that tuple has been deleted and no other session can see it, it can get vacuumed up. And this section basically talks about the table heap in that respect. However, they're also talking about index and they say a visibility information is not stored in the index. So where they use x, min and max on the heap for each row or the data table for each row, that visibility information is not stored in the index. And as a consequence, an index scan has to inspect the heap tuple to determine if it can see an entry. And of course this access can result in random I O. And they mentioned that this architecture can be more expensive than other database systems, but they've done a few different things to minimize that. Like in 8.1, they introduced an bitmap index scan. So it's basically scanning a list of heap blocks to try to avoid some of that random I O. And they also introduced an index only scan as long as all the columns required are in the index, and the visibility map shows that all the tuples in the table are visible. Now specifically here they're talking about killed index tuples. Now they're saying there's an example where suddenly you can get very fast index responses, or at least variability in the response time of certain indexed queries, and they believe some of that is due to these killed index tuples. So in their example, they created a table called Whole. They inserted about a million rows into it, did an analyze, and then they created a hole in the table, essentially deleting all the IDs in this range, and then they did another analyze. Now they ran the query and they got a hit in the shared buffers. So it was essentially cached information and executed in 222 milliseconds. But when they ran it again, that same query again, they get a hit, the buffers that's less, but it completes much faster. So 14 milliseconds. And what they say is, quote, what happened is that the first execution had to visit all the table blocks essentially in the heap, and killed all the index tuples that pointed to the dead tuples in the heap. The second execution didn't have to do that, which is the reason why it was ten times faster. So this is definitely an interesting behavior and might explain why you're getting some variable queries if you have cases where a lot of tuples are marked as dead and then removed and you get some better index performance. So if this is interesting to you, definitely a blog post to check out. [00:05:02] The next article is Parallel query in PostgreSQL how not to use it or misuse it. And this is actually a presentation done by Rafia Sabi and I believe Amit Kapila, they're from Enterprise DB, and I have the link to the presentation here and it talks about parallel query and PostgreSQL, and it goes over how it works and recommendations for how to essentially configure it, whether you're on 9.6 or ten or eleven. So if you're interested in doing some parallel query work, definitely presentation to check out. [00:05:40] So the next post is going back to the YouTube channel. Postgres opened Silicon Valley in 2018, where they presented a bunch of different YouTube channels. I've watched about eight more or six to eight more presentations, and I would say these are kind of the top presentations that I would suggest you review if you haven't reviewed any already. So I would say my top one for developers and DBAs is the ANSI Schmancy presentation. This is by say rope sarkuni of Jackdb. Basically he's making the argument that a lot of application frameworks try to treat the database as an agnostic data store, whereas he's saying that you can really expand your capabilities and performance and do a lot of interesting things and make things a lot easier if you actually use the power that's within the database. And of course, they're talking about PostgreSQL. So I did mention this presentation, not this exact one, but a previous one he gave at another conference. But I believe there's some additional information in this one. And if you haven't gotten a chance to review it yet, I would definitely suggest watching this one because there's a lot of good stuff in it. [00:06:57] The next recommendation of the presentation again at the Postgres Open conference is building a real time analytics API at scale. And this is done by someone from Algolia, which is a search engine provider essentially. And he goes through and talks about how they've developed a real time analytics API basically using roll ups. Now they're using Cytus. So his example is using the Cytus, which is an extension that does sharding of your PostgreSQL instances. And they have presented a number of blog posts and presentations from the Citusdata.com blog where they describe how they build roll ups and essentially this is an implementation of that. And they talk about other extensions that they use to be able to do really fast approximate distinct counts using Hyperlog log and also doing top end calculations. So if you're wanting to do real time analytics, this is definitely a presentation to check out. [00:08:00] The next suggestion is a wall for DBAs, almost everything you want to know. Now this isn't necessarily for developers, but it kind of gets into the nitty gritty of wall and how to use it, how to manage it. So I would say this is a suggested presentation if you wanted to check it out. Another one of the ones that I would say is pretty good. So if you want to know a little bit more in depth about the wall, definitely a presentation to check out. [00:08:26] The next piece of content is actually another YouTube channel. And this is the PG Day Paris YouTube channel where they've posted about ten videos from PG Day Paris. Now these were posted months ago, but I actually just found them on the feed. So if you haven't checked these out, another source of content, the last piece of content is actually another YouTube channel. This is CEO saying there's a lot of video content this week is Pgcon and they have 30 to 40 different videos that have been posted about PostgreSQL. Now, some of the presentations don't look so great. Like, for example, I was looking at this one and it never shows the slides. You get the audio, but it's probably going to be really hard to watch this video to get a lot of content out of it. Others are showing the slides and you get the audio with it. So some of these presentations may be hit or miss, but again, a lot of video content from presentations this week that does it for this episode of Scaling Postgres, you could get links to all the content mentioned in the show. Notes be sure to head over to Scalingpostgres.com, where you can sign up to receive weekly notifications of each episode. Or you can subscribe via YouTube or itunes. Thanks.

Show Notes

Episode Transcript

Other Episodes

Episode 61

Constraint Triggers, Simple Shards, Data Streams, OIDs | Scaling Postgres 61

Episode 3

Modeling, Query Performance, Statistics, pgmetrics | Scaling Postgres 3

Episode 362

Optimizing For Time-Series Data | Scaling Postgres 362

Parallel Query, Real Time Analytics, Killed Index Tuples | Scaling Postgres 38