Shaped Sample Data, Version Changes, Missed Bottlenecks, Indexes for Newbies | Scaling Postgres 199

Episode 199 | January 24, 2022 | 00:12:17

Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, we discuss how to generate shaped sample data, track changes between Postgres versions, identify missed bottlenecks and use different index types.

To get the show notes as well as get notified of new episodes, visit:

https://www.scalingpostgres.com/episodes/199-shaped-sample-data-version-changes-missed-bottlenecks-indexes-for-newbies/


Episode Transcript

[00:00:00] In this episode of Scaling Postgres, we talk about shaped sample data, version changes, missed bottlenecks and indexes for newbies. I'm Creston Jamison, and this is Scaling Postgres episode 199.

[00:00:23] All right, I hope you, your friends, family and coworkers continue to do well. Our first piece of content is "How to shape sample data with PostgreSQL generate_series and SQL". This is from timescale.com, and it's the third in a series they're doing about generate_series. It's about generating data that you can use for testing purposes without having to, say, have a copy of production data. This part focuses on moving beyond just the random() function, showing how to generate all sorts of different types of, I'll call them waveforms. So, for example, you can use techniques like sine waves and mathematical variations to get the type of waveform you are wanting to simulate. I found this to be a very interesting post, and if you're wanting to generate more realistic data, definitely check out this blog post and find some techniques for how to do that.

[00:01:24] The next piece of content is "Configuration changes across PG versions". This is from depesz.com, and he's talking about his why-upgrade site. It basically allows you to put in two different Postgres versions, and it shows you what has changed between them, like parameters that have been added or removed and default values that have changed. For example, when you put in 13.5 to 14.1, it shows you that two parameters were removed and are no longer there, there are 17 new parameters that you can configure, and three parameters have had their default values changed. To me, this is a super useful tool. If you're ever wondering what's changed between versions, try using this tool to get a better sense of it.
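To give a flavor of the technique, here is a minimal sketch of my own (it is not code from the Timescale post): layering a sine wave and some random noise over generate_series produces a metric that follows a daily cycle rather than pure randomness.

```sql
-- Hypothetical example: a week of five-minute "CPU" readings shaped by
-- a 24-hour sine wave, with random noise layered on top.
SELECT
    ts,
    50                                                       -- baseline
      + 25 * sin(2 * pi() * extract(epoch FROM ts) / 86400)  -- daily cycle
      + 5 * random()                                         -- noise
      AS cpu
FROM generate_series(
    '2022-01-01 00:00:00'::timestamptz,
    '2022-01-07 00:00:00'::timestamptz,
    '5 minutes'
) AS ts;
```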
The next piece of content is "5 Easy to Miss PostgreSQL Query Performance Bottlenecks". This is from pawelurbanek.com. The first area he mentions is searching by a function call: basically, if you have a query that applies a function to a column, it's not going to use a plain index on that column; you would actually need a functional index. He says he's not really a fan of doing that, but there are some other ways to handle it besides a functional index. You could change all the data, so the database always stores just lowercase values, or you could do the functional index. But his solution is the citext extension. Basically, that lets you create a column with a different data type, citext, a case-insensitive text. So you would change the email column to citext, and then when you do a search, it can use the index.

The next bottleneck he mentions is searching by a pattern, using LIKE or ILIKE to find partial matches within a field. In his example here, he wanted to see all emails ending with a particular domain. Now, that's not going to use an index, and he has a particular technique, using some extensions, to make it work. If you only wanted a left-anchored search, say all email addresses starting with 'a', using a pattern like 'a%', you could do what's mentioned in a comment down here: build the index with text_pattern_ops, and then LIKE or ILIKE can use that index. But that won't work for a right-anchored pattern, so it won't work in this case; the technique he uses here does, though. Now, he mentions adding the btree_gin extension, and I don't know why, because it's not needed here; that's for when you want to combine a search on, say, an ID and a JSON field in one index. I did some tests and was able to get this working without it, so I'm not quite sure why it's there. But you do need to add the pg_trgm extension and then create a GIN index using gin_trgm_ops. Once that is done, you can do a right-anchored or even, I believe, an unanchored search within the email column, and it will use the index.

The next bottleneck he mentions is ordering with NULLS LAST, because if your index doesn't include this ordering, the query is not going to be able to use it. You can tell this is doing a sequential scan once he adds NULLS LAST, which he does because you don't want to bring up users who have null email addresses in the LIMIT 10 he's using here. Now, you can add NULLS LAST to the index. He says that adding custom indexes on a per-query basis is bad practice; to me, it depends on performance. If this is a main, core query that's going to be used a lot, it would make sense to add it, because his alternative here is doing two queries, and those two queries could have worse performance than just adding another index. Plus, it requires two round trips to the database, unless of course you're doing a UNION, for example. So it depends on the use case, but I might want to add NULLS LAST to the index.

The fourth bottleneck is bloated null indexes. This is the case where a column that's being indexed has a lot of null values in it. Generally, you probably don't want to index those, and you'll probably want to use a partial index where the values are not null, which just keeps the meaningful values. You end up with a much smaller index that should be faster to search, use less space, and stay cached more often because the index is smaller.

The last bottleneck he mentions is update transaction scope. Now, I wasn't really sure what this title meant, but I think he's just talking about how, when you're doing a lot of updates or a lot of deletes, it's important to batch them, because you're going to be taking an awful lot of row locks, as he mentions here: row exclusive locks, which you can check. Whenever you're deleting or updating a lot of records, each of those rows has to be locked, so it's best to do that work in a batch process. But if you want to learn more about these bottlenecks, you can check out this blog post.
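To make those fixes concrete, here is a minimal sketch against a hypothetical users table; the table, column, and index names are my own for illustration, not taken from the article. Each numbered snippet is an independent fix.

```sql
-- 1) Case-insensitive equality: a citext column compares without regard
--    to case and can still use an ordinary B-tree index.
CREATE EXTENSION IF NOT EXISTS citext;
ALTER TABLE users ALTER COLUMN email TYPE citext;
CREATE INDEX users_email_idx ON users (email);

-- 2) Right-anchored or unanchored patterns: a trigram GIN index lets
--    LIKE/ILIKE use an index even without a fixed prefix.
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX users_email_trgm_idx ON users USING gin (email gin_trgm_ops);
SELECT * FROM users WHERE email LIKE '%@example.com';

-- 3) NULLS LAST ordering: build the index with the same ordering the
--    query asks for, so an index scan can satisfy the ORDER BY.
CREATE INDEX users_email_order_idx ON users (email DESC NULLS LAST);
SELECT id, email FROM users ORDER BY email DESC NULLS LAST LIMIT 10;

-- 4) Mostly-null column: a partial index stores only the non-null rows,
--    keeping the index small and more likely to stay cached.
CREATE INDEX users_deleted_at_idx ON users (deleted_at)
    WHERE deleted_at IS NOT NULL;
```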
[00:06:29] The next piece of content is "Postgres Indexes for Newbies". This is from crunchydata.com, and it's an introductory post about some of the main index types in Postgres. They talk about B-tree indexes, which are your general index type, and BRIN indexes, which are block range indexes. These are great for indexing over block ranges, which makes them good for date or timestamp ranges. They're not as fast at pulling out specific values, but for ranges they can be a really good index for that use case. They also talk about GiST indexes, which are typically used for spatial data as well as some text searching, although I tend to use GIN indexes for that. And GIN indexes are useful when you have multiple values in a single column, so that could be text search or JSON and JSONB fields. The post goes over these different index types and how to use them, so if you want to learn more, you can check out this blog post.

[00:07:24] The next piece of content is "Waiting for PostgreSQL 15 - Introduce log_destination=jsonlog". This is from depesz.com, and Postgres 15 has added JSON log output. I'm not a huge fan of having the log output be JSON, because a lot more has to be written for every record: essentially, the name of each field gets written in every single record. So I'm not a huge fan of that, and it'll probably take more space, but I can see how it becomes much easier to consume for monitoring applications. If you want to learn more about this feature coming in Postgres 15, definitely check out this blog post.

[00:08:03] The next piece of content is "Surviving Without A Superuser - Part Two". This is from rhaas.blogspot.com, and it's a follow-up to the post where he talks about the potential need for a type of user that sits below a superuser but still has control over a fair number of schemas and objects within a segment of the database. This could be useful for service providers, where the provider running the database on your behalf keeps superuser access, but you would be granted this secondary set of permissions to manage the different objects under your purview. He talks about some things that would need to be taken into account, like allowing these special types of users to set up logical replication, or event triggers, or even some ALTER SYSTEM commands; not all of them, but a subset. So this blog post describes some of what would potentially need to be done to make this happen.

[00:09:01] The next piece of content is "Why I enjoy PostgreSQL - Infrastructure Engineer's Perspective". This is from shayon.dev, and he's talking about the difference between MySQL and PostgreSQL, and how infrastructure engineers actually tend to prefer MySQL; I did not know that. But he mentions some things he likes about PostgreSQL. He talks about making schema changes, and how you can create or drop an index concurrently to avoid locking during that change. He also mentions you can apply foreign key constraints without significant locking by using NOT VALID: you add the constraint as NOT VALID, and then at some point later you validate the constraint, and that validation doesn't block (there's a sketch of this just below). He also mentions being able to add NOT NULL or other constraints, or even a default value; but since Postgres 11, I believe you can do this without locking, so that procedure would only be relevant for versions below 11, to my knowledge. The last area he covers is extensibility: there are definitely tons of extensions you can add to change the different features of Postgres. If you're interested in learning more, you can check out this blog post.
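Here is a minimal sketch of that NOT VALID pattern; the table and constraint names are hypothetical, not from the article.

```sql
-- Add the foreign key without validating existing rows, so the ALTER only
-- needs a brief lock rather than a long table scan held under lock.
ALTER TABLE orders
    ADD CONSTRAINT orders_user_id_fkey
    FOREIGN KEY (user_id) REFERENCES users (id) NOT VALID;

-- Later, validate the existing rows. This takes a much weaker lock and
-- does not block normal reads and writes while it scans the table.
ALTER TABLE orders VALIDATE CONSTRAINT orders_user_id_fkey;
```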
[00:10:20] The next piece of content is "PostgreSQL bi-directional replication using pglogical". This is from aws.amazon.com. They're talking about getting bi-directional replication going with pglogical, and they walk through the steps of how to set it up and get it running. Now, the only way it's set up to handle conflicts is basically last write wins. So even though this is doing bi-directional replication, if there is a conflict, the last one to update basically wins. But if you want to learn more about how to do this, you can check out this blog post.

[00:10:55] The next piece of content is "Data Normalization in PostgreSQL". This is from cybertec-postgresql.com, and they're talking about normalization in general: the first, second, and third normal forms, as well as the Boyce-Codd normal form. So if you're interested in learning more about normalization, in Postgres in particular, definitely check out this blog post.

[00:11:17] The next piece of content: the PostgreSQL Person of the Week is Kuntal Ghosh. If you're interested in learning more about Kuntal and his contributions to Postgres, definitely check out this blog post.

And the last piece of content: we had another episode of The Rubber Duck Dev Show this past Wednesday. This episode was on the dark side of open source. Now, this is mostly focused on some Ruby and JavaScript libraries, but given that Postgres is an open source project, you may find this content of interest. It's a longer-form, developer-based discussion, but if you're interested in that, we encourage you to check it out.

That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.

Other Episodes

Episode 257

March 19, 2023 00:19:16

Adopting PgCat, Time Bins, work_mem Settings, Bad Constraints | Scaling Postgres 257

In this episode of Scaling Postgres, we discuss Instacart adopting PgCat, binning or bucketing your data by time, the best settings for work_mem and...


Episode 234

September 25, 2022 00:16:20

Rust for Extensions, Timescale vs. Postgres, Uninterrupted Sharding, Data Flow | Scaling Postgres 234

In this episode of Scaling Postgres, we discuss using rust for Postgres extensions, performance comparisons of TimescaleDB vs. Postgres, uninterrupted writes when sharding in...


Episode 109

April 13, 2020 00:12:35

Show Plans, WAL Monitoring, Using UUIDs, Default Logging | Scaling Postgres 109

In this episode of Scaling Postgres, we discuss how to show live plans, how Postgres 13 allows WAL monitoring, how to use UUIDs in...
