Boundless Text, Revoked Permissions, Index Bloat, Hardware Performance | Scaling Postgres 182

Episode 182 September 12, 2021 00:13:50

Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, we discuss having boundless text fields, revoking public schema permissions, less index bloat in PG14 and comparing hardware performance.

To get the show notes as well as get notified of new episodes, visit: 

 https://www.scalingpostgres.com/episodes/182-boundless-text-revoked-permissions-index-bloat-hardware-performance/


Episode Transcript

[00:00:00] In this episode of Scaling Postgres, we talk about boundless text, revoked permissions, index bloat, and hardware performance. I'm Creston Jamison and this is Scaling Postgres episode 182.

[00:00:23] I hope you, your friends, family and coworkers continue to do well. Our first piece of content is "Postgres: Boundless Text and Back Again". This is from brandur.org, and he's talking about the situation with Postgres where you can use a varchar data type, a varchar with a particular limit, or a text data type, and there are no performance implications from any of those choices. Basically, any of those work equally well in terms of performance. In other database systems you have to pay penalties for one over the other, like maybe a variable-length character field is less efficient than a fixed-length character field, but with Postgres it doesn't make a difference. Now, because of that, a lot of people who use Postgres find freedom in that: you can just say everything is a text field and forget about it. But then you run into problems of end users really exploiting that boundless capability. And he mentions here a story from Stripe from 2018. Now, I don't believe this is with regard to Postgres specifically, but he's talking about a case where they had a boundless ability to insert data without limits into a Mongo database, it looks like, and it started having issues with, he says, crashing HTTP workers, tying up database resources, et cetera. So basically they had to try and find those limits, but those limits were exceedingly large, so they worked around the problem and had some exceptions for certain customers. Then he talks about a new position where he equally saw that the database had no constraints on a lot of different fields, text fields for example, and he didn't want to repeat the same issue. So what he's advocating for is that even though there's no performance difference, just for the sake of managing the database you want to set some type of limits. So one area he's considering is one length for short strings like names, addresses, email addresses, et cetera; a somewhat longer limit, say 2,000 characters, for things like descriptions; and then a size for really long text blocks, say 20,000 characters. Now, he's saying even though you have these limits in the database, of course you want to enforce these limits at your application layer as well, because a lot of times you can present a more friendly message than if you just ran into a database constraint. Now, I actually agree a lot with this, because I know from my experience using Ruby on Rails that at one of the versions they changed how they generated database migrations to remove the varchar limit. It used to have a default limit of 255 characters for any variable character field. Of course you could adjust that up or down in the migration, but by default that was the limit, and I actually appreciated that. But now there's essentially no limit, so you have to impose your own limit if you want one. So part of me did like the limits that were set by default historically. Now, he was thinking about some different ways you could maybe implement this as a standard, and he's considering using domains. So you could create, say, a text_standard domain that has a particular limit, or a text_long or a text_huge domain. But the problem is that it kind of hides what those values are if you're using these domains, so you're not quite sure. But anyway, I felt this was an interesting blog post and discussion, and if you want to learn more, you can check out this one.
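As a rough sketch of the two approaches being discussed, here is what a per-column varchar limit and reusable text domains could look like. The domain names follow the ones mentioned above, but the table, column names and the short-string limit of 200 are illustrative assumptions, not taken from the original post:

```sql
-- Option 1: a per-column limit using varchar(n)
CREATE TABLE users (
    id    bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    email varchar(200) NOT NULL
);

-- Option 2: reusable domains over text with CHECK constraints
-- (illustrative names and limits)
CREATE DOMAIN text_standard AS text CHECK (length(VALUE) <= 200);
CREATE DOMAIN text_long     AS text CHECK (length(VALUE) <= 2000);
CREATE DOMAIN text_huge     AS text CHECK (length(VALUE) <= 20000);

CREATE TABLE posts (
    id          bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    title       text_standard NOT NULL,
    description text_long,
    body        text_huge
);

-- Exceeding a limit fails with a CHECK constraint violation:
-- INSERT INTO posts (title) VALUES (repeat('x', 500));  -- ERROR
```

The trade-off he mentions shows up here: the column definition only says text_long, so you have to look up the domain to know the actual limit.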
[00:03:51] The next piece of content: "Waiting for PostgreSQL 15 – Revoke PUBLIC CREATE from public schema, now owned by pg_database_owner". This is from depesz.com, and he's talking about Postgres 15, not 14 (which is the version being released this fall), but the next version of Postgres. There is a patch, which is a pretty big change, in that you can no longer create objects in the public schema by default; you have to be the database owner to do it. This is best seen in the example that he gave. So he created a database called x, then created a user called test, and then created another database called test, where that user he just created, test, is the owner. Now, when that user test tries to create a table in the x database, where it is not the owner, it fails. So it is unable to create objects in, essentially, the public schema of that database. But because it is the owner of the test database, it can easily create a table there. So this is a pretty big change that may unfortunately break some applications, but overall I think it's a worthwhile change for the security improvement. So if you want to learn more about this, you can check out this blog post.
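A minimal sketch of the scenario described, run as a superuser; the exact error text may vary slightly. On PostgreSQL 14 and earlier you can opt in to similar behavior yourself with a REVOKE:

```sql
-- As a superuser: a database the new role does not own,
-- a new role, and a database owned by that role.
CREATE DATABASE x;
CREATE USER test WITH PASSWORD 'test';
CREATE DATABASE test OWNER test;

-- Connected to database x as role test (PostgreSQL 15 behavior):
CREATE TABLE t (id int);
-- ERROR:  permission denied for schema public

-- Connected to database test as role test (it owns the database):
CREATE TABLE t (id int);  -- succeeds

-- On PostgreSQL 14 and earlier, similar behavior per database:
REVOKE CREATE ON SCHEMA public FROM PUBLIC;
```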
[00:05:13] The next piece of content: "Index bloat reduced in PostgreSQL v14". This is from cybertec-postgresql.com, and he's talking about the case of index bloat. So basically, when you have data that's being updated and there are indexes affected by that data, those indexes need to be updated. And if a row gets removed, or an update leaves behind an older row version that an index entry points to, that index entry eventually needs to be removed, and therefore you can get some index bloat. And if you update the same rows multiple times, you can actually get page splits and a larger, bloated index as a result. Now, he mentions that there are some features Postgres already has that handle some of this. The first feature is HOT updates. He mentions HOT tuples, but I think he means HOT updates, because HOT stands for Heap Only Tuple. So basically, if you're updating data that does not affect an index, you can just update the tuple and not have to touch those indexes; the indexes don't need to be altered, which helps avoid index bloat. The other feature, as he says here, quote, "when an index scan encounters an entry that points to a dead tuple in the table, it will mark that entry as killed". Therefore it doesn't have to scan over that entry again, and it can delete some of these entries when the index page is full, to avoid a page split. But what version 14 adds is what they're calling bottom-up index tuple deletion, which, quote, "is able to delete index entries that point to dead tuples right before an index split is about to occur". So it basically helps you avoid page splits in the index, which will reduce bloat. Now, they have an example case here where he created a table with three columns. The first is an ID with a primary key. The second is another integer that they're not going to change, the unchanged column that won't be updated, and the third is an integer that they are going to be changing. So he ran a lot of iterations of updates on the changed column, as demonstrated below, and then looked at the size and the level of bloat in the indexes. First up is the unchanged index. You can see it's eight times larger in version 13 compared to version 14. So at the top is version 13, showing the size of the unchanged index, and the version 14 index is about an eighth of the size of the one above it. Plus, its bloat in version 13 is at 95%, where it's at 60% in version 14. So this unchanged index, with these additional features, is able to avoid these page splits from all the rows being updated, as long as it's not affected by that update. And it also looks to even help out the primary key here, which does get accessed periodically: in version 14 it is a little bit smaller and there is less bloat, about 13% compared to 35% on version 13. So it's definitely a great improvement for managing and controlling index bloat if you're going to have a lot of updates happening within a table. So if you want to learn more about it, you can check out this blog post.
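The exact script is in the blog post; here is a rough sketch of that kind of test, with illustrative table, column and row counts that are assumptions on my part. The idea is to update only one column repeatedly and then compare the indexes:

```sql
-- Table with a primary key, an unchanged column, and a changed column.
CREATE TABLE bloat_test (
    id        bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    unchanged integer,
    changed   integer
);
CREATE INDEX ON bloat_test (unchanged);
CREATE INDEX ON bloat_test (changed);

INSERT INTO bloat_test (unchanged, changed)
SELECT i, i FROM generate_series(1, 100000) AS i;

-- Repeatedly update only the "changed" column (run this many times).
UPDATE bloat_test SET changed = changed + 1;

-- Compare index sizes...
SELECT indexrelname, pg_size_pretty(pg_relation_size(indexrelid))
FROM pg_stat_user_indexes
WHERE relname = 'bloat_test';

-- ...and index density/bloat via the pgstattuple extension.
CREATE EXTENSION IF NOT EXISTS pgstattuple;
SELECT avg_leaf_density FROM pgstatindex('bloat_test_unchanged_idx');
```

Running the same script against a version 13 and a version 14 instance is what surfaces the difference described above.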
[00:08:29] The next piece of content is "Aurora PostgreSQL db.r6g compared to db.r5 with YBIO". This is from dev.to in the AWS Heroes section. Now, I don't cover a lot about Aurora in Scaling Postgres, but I thought this post was interesting because it gives another example of hardware differences between the Graviton processors that Amazon offers and the Intel processors that they offer. Because that's the question: would it make sense to move to the Graviton processors, what is their performance like, and how does it compare given the cost? Now, Aurora is based upon Postgres but it's not Postgres, of course, but I think this is an interesting post to look at from a hardware performance perspective. There was a tweet by someone saying they had some performance issues with the Graviton processors but not the Intel ones, so this post walks through a performance comparison between those two platforms as well. And basically his results: Graviton was a little bit slower, but on the whole pretty equivalent, and it offered a roughly 20% cheaper option. So if you're evaluating different hardware, maybe you want to check out this post to see what his results were. And of course, you would always want to do your own comparison as well; this just gives you another data point.

[00:09:54] The next post: "How we went all in on sqlc/pgx for Postgres + Go". This is from brandur.org, and this is talking about client utilities for accessing Postgres, specifically in Go. So in terms of the Go language, he mentions about seven different libraries that can make PostgreSQL connections, and each has its own way of handling it; essentially, think of them as ORMs for Go for accessing Postgres. Now, what he ultimately decided on was sqlc, where, basically, using comments you annotate the query you want to send to the database, and that comment becomes a function or method you can call to execute the query. So I thought this was an interesting post exploring different ways of accessing Postgres with Go. If you're interested in that, you can check out this blog post.

[00:10:53] The next piece of content: "Psycopg 3 initial review". This is from rustprooflabs.com, and this is talking about another client library, this time for Python. He's looking at migrating from psycopg2 to psycopg3, and basically his assessment is that the migration was easy and that the connection pool rocks. So they've made a number of improvements in version 3, a lot of them, it appears, around connection pooling. Overall, in all of his analysis, he saw great performance gains in general, particularly when the connection pooler was used. And there's a companion post called "Thinking psycopg3" from varrazzo.com, where he discusses its development and the changes he wanted to make in version 3, for example using PQexecParams instead of just PQexec with the parameters already merged in, the improvements to the connection pooler that were mentioned, and also doing a lot of work asynchronously as opposed to synchronously. Now, this is a somewhat older post, but if you want more background on client libraries working with Postgres, maybe you want to check it out.

[00:12:04] The next piece of content: "Using jq to get information out of pgBackRest". This is from fluca1978.github.io, and this is focused on using jq, which is a way to parse JSON from the command line. Apparently pgBackRest's output can be generated in a JSON format, so jq allows you to parse that and pull out information from a command line utility. Now, I've actually explored something similar, because AWS's client, if you want to do backups or snapshots or things of that nature, also requires a way to parse JSON unless you force the output to be text. So I thought jq was a great tool to be familiar with if you're working with Postgres and need to do some JSON parsing. If you want to learn more, you can check out this blog post.

[00:12:53] The next piece of content: the PostgreSQL Person of the Week is Akshay Joshi. So if you're interested in learning more about Akshay and his contributions to Postgres, you can definitely check out this blog post. And the last piece of content: we did another show of the Rubber Duck Dev Show this week. We covered programming with dates, times and zones. So if you want some more long-form, developer-based content, definitely check out this episode. Our upcoming episode, this Friday at 8:00 p.m. Eastern Standard Time, will cover REST versus GraphQL APIs.

[00:13:26] That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.

[00:13:44]
