Postgres 14 Released, Using JSON, Not Using Indexes, Sequence Gaps | Scaling Postgres 185

Episode 185 October 04, 2021 00:21:19
Postgres 14 Released, Using JSON, Not Using Indexes, Sequence Gaps | Scaling Postgres 185
Scaling Postgres
Postgres 14 Released, Using JSON, Not Using Indexes, Sequence Gaps | Scaling Postgres 185

Oct 04 2021 | 00:21:19

/

Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, we discuss the release of Postgres 14, how best to use JSON, why are your indexes not being used and sequence gaps.

To get the show notes as well as get notified of new episodes, visit: 

https://www.scalingpostgres.com/episodes/185-postgres-14-released-using-json-not-using-indexes-sequence-gaps/

View Full Transcript

Episode Transcript

[00:00:00] In this episode of Scaling Postgres, we talk about postgres 14 being released using JSON, not using indexes and sequence issues. I'm Kristen Jameson and this is Scaling Postgres, episode 185 one. [00:00:24] All right, I hope you, your friends, family and coworkers continue to do well. Our first piece of content is PostgreSQL 14 released. This is from Postgresql.org and they've released version 14 and they mentioned a number of the different features in this announcement here. Some of the main ones that they talk about is being able to support JSON subscripting, so much similar to most programming languages in order to extract certain values out of JSON field. [00:00:54] Support for multirange types within the range type, data type performance improvements with regard to connection with, they say some benchmark showing a two X speed up, or at least the transaction performance doesn't drop by half as you increase the number of connections, like up to 10,000, for example, it maintains its performance. Some improvements to be sure. To reduce Bloat on Indexes, they talk about the pipeline queries, which is basically able to send a query before receiving the results back so you can pipeline multiple queries and then receive the results in an asynchronous manner. [00:01:32] Improvements to logical replication in terms of not having to wait for large transactions before data is transferred, and a number of other improvements, including changes to the foreign data wrapper vacuum improvements that we'll talk a little bit about more later, as well as a number of other security and performance and convenience features. So if you want to find out more about that, you can check out this post. Also going to talk about this post, which is how we shipped PostgreSQL 14 on Azure within one day of its release. This is from Citusdata.com. That's pretty impressive because sometimes it takes months for some of these hosted postgres services to get out the next release once postgres is released, and for them to do it within one day is pretty impressive. Now, they discussed that part in the end of the post, but they also mentioned some of the features that they find favorable. Of course, Microsoft did contribute to some of these big performance features that were mentioned. So for example, one is the connection scaling. So they did most of the work, to my knowledge, on improving the connection scaling so that when you try to go out to say, a 10,000 connection count, the performance doesn't drop almost by half compared to being able to maintain it at a high connection level. So for example, it looks like the difference between 100 connections and 10,000 connections doesn't seem to be significant, at least in this read only PG bench workload test. There were also improvements done to vacuum speeds that they say is up to 25% performance improvement as well as crash recovery, being able to do that much faster, like up to 2.4 times faster. Now, it looks like this recovery in its sync method is kind of related to being able to get a much faster startup recovery, but there's some considerations involved. So I would definitely check this out. If you run a large instance and want your database to come back online if it happens to crash, hopefully that's rare or nonexistent for you, but that might be something to look into. And then like I mentioned, the latter part of this post actually talks about how they got their extensions ready so quickly to be compatible with 14, as well as get Citize data, their scale out extension ready as well. So definitely encourage checking out this blog post. [00:03:56] The Next Piece of Content PostgreSQL 14 PostgreSQL is still a tweenie. This is from Instacluster.com and this is just another viewpoint of the improvements that he thinks are significant. Again, he's mentioning the connection scaling improvements. Again, that's a big one that most people mention they talk about a number of processes have been moved to the background to be able to do work in parallel, such as refresh, materialized, view, return, query and queries to foreign data wrappers as well as some improvements to concurrently so that you can do more actions concurrently. Or at least some improvements to those actions that are done concurrently. And then he mentions a whole host of visibility improvements. So these are improvements to monitoring the activity that the database does. So for example, there's a PG Stat progress copy to be able to monitor copy operations. There's PG Stat wall to give you more information about the write ahead log. PG stat replication slots. In addition to mentioning that they exist, it now shows you the status of the replication. So there's a lot of additional statistics and observability improvements that have been added to different areas of postgres. So if you want to learn more about that, you can check out this blog post. [00:05:11] The next piece of content upgrading PostgreSQL 9.6 to PostgreSQL 13. This is from Migops.com. Now, with the release of postgres 14, that means that PostgreSQL 9.6 will become end of life essentially on November 11. That's the last release. So if you're still on that version or even version prior, you're going to want to upgrade. Now of course you can do a PG dump and a PG restore into the new database version, but if you have a large database, that will take an incredibly long time. So this post actually runs through the process of using PG upgrade and the hard links method, which can give you a very fast cutover time on the order of seconds or minutes. If you have an incredibly credibly large database and they go through a pretty good process, one that I adopt myself in, that you have a primary and then a streaming standby server. And in preparation for the replication, say you're going to be upgrading this one, you create additional standbys ready in case you need to fail back to the prior version because of some issue. But this post goes through all the different commands that you would do to do this type of upgrade. So if you're interested, check out this blog post next piece of content. JSON in PostgreSQL how to use it right. This is from Cybertechn postgresql.com. This is talking about the tendency at least today, or they say as of the store everything in a large JSON, which is not really how you want to use JSON in postgres. And he has an example here, what he calls a bad example of where, say, you have a people table that you have some people in it and you have a set of rooms and basically you want someone to reserve particular rooms. And to do that you've decided to just use a big large JSON B field to store the information about who's assigned to what room. And so there's a whole host of things wrong with this that he mentions in the post below. First mistake is model regular columns as JSON attributes. So the first thing he's talking about is the name room. Like you could easily make this its own column within the table and you would want to do that. So don't put all of that information in the JSON blob. Second mistake model tabular data as JSON arrays. So for example, this could easily be essentially its own table, a reservations table, and that's exactly what he modeled here, created a reservations table and linked the people ID to the room ID, and then has a range of when it would be reserved. So that's a much better way to model it. The third mistake is storing foreign keys in JSON. So that means you're going to have to do this awful type of query here to link rooms to reservations, whereas you could easily do it with the data model changes he's proposed. The fourth mistake is it makes it incredibly hard to modify the JSON data if you store it like that. And this example was just making a new room reservation, whereas with his model it is a simple insert to do a room reservation. The fifth mistake trying to enforce constraints on JSON, which is nearly impossible to do, or there is nothing standard to do it. Whereas postgres has constraints you can put on each data field to enforce constraints either by data types or custom design constraints. An example here of using a Btree just index is to be able to ensure that a room is not reserved more than once. So there's a particular technique to do that. It's called an exclusion constraint. So those options are available to you if you model it in the way that he suggests. The 6th mistake is complicated searches in JSON. So looking to see if a time is free can get quite complicated up here, whereas it's a much simpler query to do it below that. So those were just some bad examples of issues you can run into if you try to store too much in JSON. Now he talks about a good example of it is you have your standard table, you have all the standard columns, but there are some additional attributes that don't apply to every object. So this is something in a warehouse and everything pretty much has a price, a weight, whether it's available or not, who manufactured it, packaged size, so those are consistent for any item. Whereas maybe you have certain attributes like he talks about, maybe it's a type of power plug, something unique for that item that is important for say a customer to know. You could store it in the attributes field but that's a good example. But if you want to learn more, you can check out this blog post. [00:09:57] The next piece of content using recursive queries to get distinct elements from a table. So this is from an example where someone had an 800 million row table with a set of stations and each stations had a certain number of channels. [00:10:13] And this is the query that they did where they wanted to get all the channels for a station. So they would select the station and then do an array aggregation of the channel and group it by the stations so it will show all the channels for a station. And basically it took about five minutes to run this. So he tried to do a similar thing and said how can we improve performance now he added the index on station in channel and it ran in 136 seconds so that takes a bit of time. Now he said one path you could do is that you could pre cache that value and maybe you do it as a materialized view or you create a separate table that you then update with triggers. That's one way that you could do it and you could get a query back in less than a millisecond. So basically you're caching it essentially. But he came up with another solution which is basically a skip scan. So the first thing you do is you do a select it from the table and limit it to once so it's going to get the first one. And again, that happens at about a millisecond. And then you do greater than that value that you just received and you want to get the next value that exists for that station, the next channel, and you're going to skip everything in between it and then just limit one. And again that ran at about a millisecond so you want to continue doing that for all the channels. So he did it as a recursive query, so it recursively runs through and finds all the channels for his station and with that he got the query down to 157 milliseconds so that's with no caching at all, that's significant improvement over 136 seconds. So definitely a great technique to consider if you need to do something similar. [00:11:59] Next piece of content. Why isn't postgres using my index. This is from Pgmuster.com and he says the two main reasons that maybe this happens is that number one, postgres can't use the index, or two, it thinks that it could do the query without using the index faster. And it covers some of these and some of the reasons he mentions that you can't use it, that maybe a function is preventing its use. And the main reason that would be is like if you're using a lower defined email, unless that has been defined as a functional index and lowering the email field, it's not going to work. Another type is a data mismatch permitting its use. And he uses an example of where he's trying to search with a numeric and it just basically does a sequential scan instead or the index does not exist. Now I'll add a fourth option here and that is if you're say using like or I like, and you only have a B tree index on it, it's generally not going to work. And then he goes into some of the reasons that it can use the index but doesn't think it should. One is the table is small so it can just do a sequential scan. Significant ports of the rows are being returned, so if it's returning 80% of the rows, there's no reason to use an index. It's faster to just do a scan. And limit clauses can have an issue with that, so sometimes you need to be cautious of that and they actually have a link to a Twitter thread talking about the potential dangers of using limit. So something to be aware of. And then the other consideration is also cost estimation. So if your statistics are off, it's not going to be using an index. So this is a pretty comprehensive post about why postgres isn't using an index, so definitely encourage you checking that out. [00:13:43] Next piece of content zero downtime postgres schema migrations need this lock timeout and retries. This is from postgres AI and he's talking about the importance of when you're doing schema migrations is to use a lock timeout because otherwise you can get things waiting behind a lock queue and not being freed up in your database system basically going down. So anytime you want to do any sort of a DDL change, you're going to want to use a short lock timeout and depending upon your implementation, you're of course going to want to retry it if it does get locked. And he advocates different ways to do that and particularly using an exponential backoff and jitter a randomization. So then when it tries again it's going to do it at a random time. So definitely encourage doing that. He also talked about subtransactions, but I would specifically avoid those because of issues with them. But definitely good advice to follow about using lock timeout when you're doing DDL changes. [00:14:44] The next piece of content gaps and sequences in PostgreSQL. So they're talking about how the gaps can happen in sequences in your, say, your primary keys that are integers. One way it happens is that if you're doing an insert and you get a failure with it, it will just start over at the next sequence number. It won't go back and do that sequence number. So you can see here we have a gap because when it tried to insert this row, that particular row failed and it's not going to reuse that sequence number, it's going to just get the next one. The other possibility is that caching so you can cache sequences so that your session doesn't consult the sequence every time it has a set of values that it can just reuse. But if you do a disconnect and start a new session, it's going to start that sequence after the point of that value cached. So the disadvantage of caching these is that if you have sessions constantly connecting and disconnecting, you're going to run through your sequences quite quickly. [00:15:47] The next possibility is that a crash. So if something is crashed and not saved to the database, it's possible that the sequences were not logged to the wall and you can get a gap of say, up to 32 numbers. So that's another gap potential. And that also that sequences can jump backward in the event of a crash if nothing is logged to the wall. So for example, if you start a transaction begin and you do Next file, next file, next file, next file, and then it crashes when the database comes up and you do the next file, it'll start back over at one because nothing was committed to the wall, essentially. Now, normally that's a problem, but he says don't use sequence values from an uncommitted transaction outside of that transaction because again, that could cause problems. Now he talks about a way to build a gapless sequence, but that's basically using a separate table that's tracking the numbers. It's basically creating your own sequence and that'll just be pretty slow. So I definitely wouldn't recommend that. But it's a way that you could do it or this is the way to implement it if you're looking for a guaranteed gapless sequence, but you're going to get performance problems. But if you want to learn more, you can check out this blog post. [00:17:04] Next post is three ways to auto increment with postgres. This is from mydba notebook.org. And these three ways are three ways that you can get sequences. They're no different. They're essentially all using sequences but in different manners. So the number one way to get something to auto increment is just create a sequence so it's not really set to the primary key. You can just use a sequence for any purpose to give you something that auto increments. The next way is through a serial data type and that explicitly creates a column that has an associated sequence that is the primary key and is not null the third way to do it is an identity column, and this is an SQL compliant way versus a serial data type that Postgres has. This is a SQL standard way of generating auto incremented IDs. I believe it still uses essentially the sequence mechanism in the background, but it doesn't show that explicitly here. So you can see here it's using the next file of a particular sequence, whereas here it just says generated by default as identity. Now, what they did mention is that this also has the support to do a generated always as identity, and what that difference means is that this prevents someone from manually inserting an ID. So basically it will always be done through the Identity column. But if you want to learn more about these three different ways to generating auto incrementing numbers, definitely check out this blog post. [00:18:33] The next piece of Content devious SQL dynamic DDL in PostgreSQL this is from Crunchydata.com and they're talking about different ways to do dynamic DDL in postgres. Now the example that they're having here, say you have logical replication and you have two tables and the sequences get out of sync. So basically you're wanting to do a restart of the sequence with what the max ID is in the table. So they go through some different ways to do that. Because if you try to do an alter sequence with a restart and do it as a sub query, it won't work because it's expecting a single essentially scalar value. So they show these different ways to do essentially dynamic SQL to accomplish this. So the first way to do it is to use Psql's variable substitution. So this is using the psql client to do it and you use GSET to set a particular variable, and then you use that variable in the next command to alter the sequence. The next way is to do the exec command within postgres where you define your query, and then you do a G exec on it to actually alter the sequence. The other way to do it outside of psql is to do it as part of an anonymous do block using plpgsql. Or you can also create your own exec function to do it, and that's another way that you can accomplish the same thing. But they do make a note that if you're trying to do some dynamic SQL that has reindex concurrently, that can run into issues because the concurrently operation needs to be run outside of a transaction block because it does its own control of the transactions to be able to accomplish doing things concurrently. So you need to be aware of that if you're trying to use dynamic SQL with it. But you can check out this blog post next piece of content is the postgres goal. Person of the week is Jeff Davis. So if you want to learn more about Jeff and his contributions to postgres, you can check out this blog post and the last piece of content, we did have another episode of the Rubber Duck Dev Show. This one was on when to use UUIDs and when not to. So if you want to learn more about that, you can check out this episode. The upcoming episode is on whether or not to comment your code, and that show will be live on Wednesday at 08:00 p.m. Eastern Standard Time. [00:20:54] That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to Scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or itunes. Thanks you.

Other Episodes

Episode 324

July 14, 2024 00:16:29
Episode Cover

Four Million TPS | Scaling Postgres 324

In this episode of Scaling Postgres, we discuss experiments to achieve four million transaction per second, the importance of extended statistics, parallelism in Postgres...

Listen

Episode 120

June 28, 2020 00:12:12
Episode Cover

Postgres 13 Beta 2, System Stats Extension, Tuned, Join Strategy | Scaling Postgres 120

In this episode of Scaling Postgres, we discuss Postgres 13 Beta 2, a system stats extension, tuning your operating systems for Postgres and different...

Listen

Episode 220

June 19, 2022 00:18:57
Episode Cover

PG 14.4 Release, Subscription Skip, Using Schemas, Open Source Citus | Scaling Postgres 220

In this episode of Scaling Postgres, we discuss a PG 14.4 release, how to skip transactions with logical replication, how to use schemas and...

Listen