Index Include Clause, Exporting CSV, JSON, Zedstore | Scaling Postgres 62

Episode 62 | May 05, 2019 | 00:15:39

Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, we review articles covering the index INCLUDE clause, exporting data to CSV, using JSON in Postgres, and Zedstore.

To get the show notes as well as get notified of new episodes, visit: 

https://www.scalingpostgres.com/episodes/62-index-include-clause-exporting-csv-json-zedstore/

Episode Transcript

[00:00:00] In this episode of Scaling Postgres, we talk about the index INCLUDE clause, exporting CSV, JSON and Zedstore. I'm Creston Jamison, and this is Scaling Postgres, episode 62.

Alright, I hope everyone's having a great week. Our first piece of content is "A Close Look at the Index Include Clause", and this is from the use-the-index-luke.com blog. Now, he covers more than just PostgreSQL; he also talks about Microsoft SQL Server, IBM Db2 and, of course, Oracle, but the INCLUDE clause is something that was added recently to PostgreSQL, in version 11. To start off, he goes back and gives a recap of B-tree indexes: basically, you have the binary tree that you walk down to find a particular key, and then there's a doubly linked list at the leaf level. That is usually what you consider part of the index, and then there's the table data, or the heap as it's sometimes called, which the leaf entries point to. So the operation is: you search through the B-tree, then along the doubly linked list, and then you pull out the table data to present the rows to the client.

He also does a recap of an index-only scan. In this case he looks at an index on the sales table covering a subsidiary ID and a euro value, for a query that takes the sum of the euro value from the sales table by subsidiary ID. Because that value is part of the index, an index-only scan can do the calculation: you go through the B-tree and then the doubly linked list, and since the values you need are in the index, you do not need to then go to the table data. That's what is meant by an index-only scan.

What the INCLUDE clause does is let the tree itself hold only the key columns, while additional values are carried solely in the leaf nodes of the doubly linked list. So when you find the particular key or keys you're looking for, all the values are right there and you don't have to go to the table data, and yet the vast majority of the index doesn't need to store the extra value, because it's only included in the leaf nodes, as it were, of the B-tree.

Now, he says here, quote: "compared to the original index definition, the new definition with the include clause has some advantages". The tree might have fewer levels because there's less data to store, and the index should be slightly smaller, although I would have thought it would shrink by a larger percentage than the roughly 3% he cites; the leaf node level is what takes up the most space anyway. The other benefit is that it documents the column's purpose: it's not being searched on, it's simply included with the index.

One consideration he mentions is that if you needed to add a use case where you also wanted to sort by the ts column, which he's using as a timestamp, you would have to think about your column order to make searching efficient. With the INCLUDE version, adding the additional column does make the index bigger, but a new use case that needs to search on the subsidiary ID as well as the timestamp can be handled by modifying this one index, without going through the debate you'd have without INCLUDE about what the best column order is, or whether you need two indexes. One index with an included value can handle many more use cases.
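To make that concrete, here's a rough sketch of the two styles of index definition (my own illustration of the sales example; the article's exact table and column names may differ):

    -- Key-column version: eur_value is part of the searchable key,
    -- so it is stored at every level of the B-tree.
    CREATE INDEX sales_sub_eur ON sales (subsidiary_id, eur_value);

    -- INCLUDE version (PostgreSQL 11+): only subsidiary_id is a key;
    -- eur_value is carried in the leaf nodes alone.
    CREATE INDEX sales_sub_inc ON sales (subsidiary_id) INCLUDE (eur_value);

    -- Either index can satisfy this aggregate with an index-only scan,
    -- assuming the visibility map is largely up to date:
    SELECT subsidiary_id, sum(eur_value) FROM sales GROUP BY subsidiary_id;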
He also talks about something that you typically can't search for through a B-tree index, such as a LIKE filter with a leading wildcard; you could potentially use that type of filter on INCLUDE columns. However, he does make a note that this is not the case with PostgreSQL, so there is some limitation given how its implementation of MVCC works. And he also considers cases where unique indexes can work with the INCLUDE clause. Now, the reason PostgreSQL can't filter on the INCLUDE columns is row visibility: the xmin and xmax values used to identify which row versions are visible to a particular client exist in the table data, not within the index. The index does have a visibility map, but it's not very granular, so a lot of the time, checking the visibility map means you still need to go to the table data. Now, maybe this implementation can change with the new storage solutions being discussed for PostgreSQL, now that an API has been added to allow different table storage engines, but at this time the limitation still stands. So if you're wanting to go into more depth about how the new INCLUDE clause can be used with your indexes, definitely a blog post to check out.

[00:05:25] The next post is "Exporting CSV from PostgreSQL", and this is from Philipe's blog. The first part of this post talks about how you export data from a psql session: you can use the COPY command, and he notes that you have to use absolute paths with it. If you need to run the export from the client side, there's also the backslash copy (\copy) command, and you can specify the format, such as CSV with headers. So it goes over the basic COPY command and the features that are available for it. He also shows how you can invoke psql from the command line to run a copy statement and output the values. He was doing this so frequently that he made a utility called psql2csv. It's a thin wrapper around psql that basically saves a bit of typing when running these queries, and it has the advantage of handling variables; for example, he can send in variables to vary the output of a particular report.
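For reference, the basic commands he's describing look roughly like this (a sketch with made-up table names and paths, not taken from the post):

    -- Server-side: COPY runs as the server process, so the output
    -- path must be an absolute path on the database server.
    COPY (SELECT * FROM sales) TO '/tmp/sales.csv' WITH (FORMAT csv, HEADER);

    -- Client-side from psql: \copy writes the file on the machine
    -- where psql itself is running.
    \copy (SELECT * FROM sales) TO 'sales.csv' WITH (FORMAT csv, HEADER)

    -- The same thing as a shell one-liner:
    -- psql -c "\copy (SELECT * FROM sales) TO 'sales.csv' WITH (FORMAT csv, HEADER)"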
So if you have a need for something like this, definitely a blog post to check out.

[00:06:38] The next post is "Having Lunch with PostgreSQL, MongoDB and JSON", and this is from the OnGres.com blog. He says up front that this is basically Alvaro Hernandez's opinion on the topic of Postgres, JSON, developer productivity and MongoDB.

[00:07:00] So it's a counterpoint to a previous blog post that was done by Buzz Moshetti. Basically, he counters the different points that were made claiming MongoDB has the advantage, and his opinion is that Postgres is pretty good in these respects. Now, I don't want to get into the debate about it, but this blog post does go over some JSON use cases and some practices that you may want to put in place. So it's a good overview of different ways you can use JSON, with suggestions on how to work with it and structure it within your application. Putting the debate aside, it's a good learning post for JSON in PostgreSQL, so if you're interested in that, definitely a blog post to check out.

[00:07:48] The next post is "Waiting for PostgreSQL 12: Allow VACUUM to be run with index cleanup disabled", and this is from depesz.com, and it does exactly what it says: in PostgreSQL 12 you will potentially be able to skip the index phases when running a vacuum (the option takes the form VACUUM (INDEX_CLEANUP FALSE) table_name). Now, this could be a huge benefit for some databases I work with, because they have a lot of indexes on some large tables, and it's the index phase that takes so long to do. And particularly with the enhancement also coming in 12 where you can reindex concurrently, I could see certain use cases where you vacuum just the table heap and leave the indexes alone, perhaps just doing a reindex of those instead. That could have some advantages. So it's definitely an interesting feature, and this blog post goes through and looks at some of it. If you're interested in that, definitely a blog post to check out.

[00:08:47] The next post is the second "Waiting for PostgreSQL 12" one: "Add SETTINGS option to EXPLAIN, to print modified settings". Basically, this post describes the feature and shows what it looks like. The advantage is that if you're sharing an explain plan, it can output any non-default settings, so you can know, for example, that sequential scans had been turned off for that particular explain plan; you just run EXPLAIN (SETTINGS) followed by your query. So I could see this having some use cases, and potentially this is coming with PostgreSQL 12.

[00:09:18] The next post is "Zedstore: compressed in-core columnar storage", and this is from the Postgresql.org mailing list archives, specifically the pgsql-hackers list. Basically, with the advent of the new table access method API, which makes it possible to replace the storage engine of PostgreSQL, these individuals have started an implementation of something they call Zedstore, which is compressed, in-core (meaning part of core PostgreSQL, I believe) columnar storage. So it's storing data by columns as opposed to storing data by rows, and they show some performance numbers. For example, taking an average of a single column, because the data is stored by column, can go from about 4,600 milliseconds to 380 milliseconds, over a tenfold improvement. Those are the kinds of gains a columnar store can deliver. So this is definitely very interesting, given the new table access method API that's recently been committed to PostgreSQL, and potentially zheap coming down the line as well. If you're interested in this, it's definitely a post to check out.

[00:10:45] The next post is "One-to-one relationship in PostgreSQL for real". Basically, they're talking about defining a one-to-one relationship where you can't enter a record in one table unless a matching record already exists in the other. He said that years ago he did an implementation that didn't quite work, because it allowed a zero-to-one relationship rather than a true one-to-one. He then tried having the two tables reference each other with plain foreign keys, but that doesn't really work either; it gives an error, because neither row can be inserted first. What you can use instead is deferrable constraints: you define the two tables to reference each other with foreign keys and mark the constraints DEFERRABLE INITIALLY DEFERRED, and that kind of implementation will work.
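To picture the technique, here is a minimal sketch of the deferrable approach (the table and constraint layout is my own, not necessarily the post's):

    -- Each table's primary key is also a foreign key to the other,
    -- so neither row can exist without its counterpart.
    CREATE TABLE person (
        id int PRIMARY KEY
    );

    CREATE TABLE passport (
        person_id int PRIMARY KEY
    );

    ALTER TABLE person
        ADD FOREIGN KEY (id) REFERENCES passport (person_id)
        DEFERRABLE INITIALLY DEFERRED;

    ALTER TABLE passport
        ADD FOREIGN KEY (person_id) REFERENCES person (id)
        DEFERRABLE INITIALLY DEFERRED;

    -- The foreign key checks are deferred until COMMIT, so both
    -- rows can be inserted in one transaction without an error.
    BEGIN;
    INSERT INTO person VALUES (1);
    INSERT INTO passport VALUES (1);
    COMMIT;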
So if you're wanting to do this type of constraint enforcement, definitely a blog post to check out.

[00:11:33] The next post is "Masquerade: a Postgres proxy to mask data in real time", and this is from the Tonic.ai blog. What they're doing is shown right in the post: when you go through the proxy they have set up, it obfuscates the data. You can see the name here is Glenn Raymond, and in the masked output it's been scrambled, and it looks like even the birthdays have been. So in real time it alters the data as you're viewing it; there's a proxy that sits between the client you're using and the PostgreSQL database. The use cases they're thinking about are a dev or QA person who wants to test against production data that you don't necessarily want to reveal to them, where you could use this type of obfuscation, or maybe letting someone generate reports without seeing particularly sensitive data. They have this packaged as a Docker container to set up and test out, and there's also a second post that goes into some of the detail about how it functions. Now, of course, I was looking at this and thinking: how is it working? Is it inspecting the protocol across the wire? It kind of looked like it was. I'll have to look into it more, but this is potentially very interesting to me. Now, it doesn't support SSL yet, which I would want; if you have data you care enough to obscure like this, you're going to want to keep the connection secure anyway, at least between these two points. But this definitely looks like a project I'm going to keep track of, and if I hear more about it, I will share. So if you're interested, definitely a project to check out.

[00:13:17] The next post is "Indexes in PostgreSQL: GIN", and this is from Habr.com; I believe it's translated from a Russian-language site by Postgres Professional. This is a super in-depth description of the GIN index and how it works. So if you want to get into the technical details of how GIN indexes work and their particular use cases, definitely a blog post to check out.

[00:13:47] The next post is actually a YouTube video, and it comes from the channel PG Casts. These seem to have stopped a while ago, I believe, but this one came up in my feed recently. It's called "Altering Databases", and it goes over a few ALTER DATABASE commands for PostgreSQL. It's only about three minutes in length, so it's super short and on the simple side, but I'll keep track to see if they keep producing them and share them when they're available.

[00:14:18] The last post is "Chicago open rideshare data set: getting started", and the subtitle is "how to get started with mapping GIS data". This is from the cmc1213.github.io blog. It goes over setting up PostgreSQL and PostGIS, as well as a JavaScript library called Leaflet, to present the data on a map. Now, this is a fairly long post, but he goes into a lot of detail: he goes so far as to show you how to set up a DigitalOcean server, install PostgreSQL and PostGIS, extract the data and load it into PostgreSQL, and then set up the Leaflet JavaScript library to present it. Some of it is a little bit basic, but if you're wanting to get started with PostGIS, perhaps this is a pretty interesting way to do it.
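For a flavor of the kind of PostGIS query a project like this ends up running, here is a small hypothetical sketch (the coordinates, table and column names are mine, not from the post):

    -- Assumes the PostGIS extension is available on the server.
    CREATE EXTENSION IF NOT EXISTS postgis;

    -- Store each trip's pickup location as a WGS 84 point.
    CREATE TABLE trips (
        id     bigint PRIMARY KEY,
        pickup geometry(Point, 4326)
    );

    INSERT INTO trips VALUES
        (1, ST_SetSRID(ST_MakePoint(-87.6298, 41.8781), 4326));

    -- Count pickups within roughly 1 km of downtown Chicago;
    -- casting to geography makes ST_DWithin measure in meters.
    SELECT count(*)
      FROM trips
     WHERE ST_DWithin(
               pickup::geography,
               ST_SetSRID(ST_MakePoint(-87.6298, 41.8781), 4326)::geography,
               1000);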
[00:15:16] That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.

Other Episodes

Episode 257

March 19, 2023 00:19:16

Adopting PgCat, Time Bins, work_mem Settings, Bad Constraints | Scaling Postgres 257

In this episode of Scaling Postgres, we discuss Instacart adopting PgCat, binning or bucketing your data by time, the best settings for work_mem and...

Episode 53

March 04, 2019 00:19:11

Query Planner, BRIN Indexes, Approximate Algorithms, Delayed Replication | Scaling Postgres 53

In this episode of Scaling Postgres, we review articles covering the Postgres query planner, BRIN indexes, approximate algorithms, and the uses of delayed replication....

Episode 101

February 17, 2020 00:17:46

Postgres Releases, Useless Vacuum, Isolation Differences, WAL Compression | Scaling Postgres 101

In this episode of Scaling Postgres, we discuss new postgres releases, useless vacuuming, isolation differences between databases, and different ways to compress WAL files....
