Episode Transcript
[00:00:00] In this episode of Scaling Postgres, we talk about Postgres 14 Beta 1, PG 14 improvements, the best primary keys, and composite keys. I'm Creston Jamison and this is Scaling Postgres episode 166.
[00:00:23] Alright, I hope you, your friends, family and coworkers continue to do well. Our first piece of content is PostgreSQL 14 Beta 1 Released. This is from Postgresql.org, and they're announcing that Beta 1 is now available for you to install and try out. It mentions a number of highlights covering different areas of performance. One that was particularly interesting to me is that having a large number of connections, whether those connections are in an active or idle state, now lets the database operate more efficiently, potentially without needing PgBouncer. There have been enhancements to partitioning and to the postgres foreign data wrapper, as well as reduced bloat on tables with frequently updated indexes. There are also improvements in data types and SQL. One of the interesting things here is the new subscript syntax that you can use when working with JSON. There are definitely a lot of other improvements in the area of administration too, particularly with regard to vacuum and how it operates, including the new idle session timeout capability, as well as replication, recovery and security improvements. So if you want to get an idea of what's coming in Postgres 14 in the fall, you can definitely check out this blog post.
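To give a rough idea of what that new subscript syntax looks like, here is a minimal sketch against a hypothetical events table (the table and column names are just placeholders, and the idle session timeout shown is the new Postgres 14 setting):

    -- Postgres 14 lets you use subscripts on jsonb instead of the -> operators
    CREATE TABLE events (id bigint PRIMARY KEY, payload jsonb);
    SELECT payload['user']['name'] FROM events WHERE id = 1;
    UPDATE events SET payload['status'] = '"processed"' WHERE id = 1;

    -- The new idle session timeout: close sessions that sit idle for more than 10 minutes
    SET idle_session_timeout = '10min';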
[00:01:40] The next piece of content, An early look at Postgres 14 performance and monitoring improvements. This is from Pganalyze.com, and they looked at a number of areas of the new Postgres 14 Beta 1. The first thing they looked at is the active and idle connection scaling work that was done in 14. The yellow line represents Postgres 13.3 and the black diamonds represent Postgres 14 Beta 1, and you can see Postgres 14 maintains its throughput across connections better than 13.3 did, but not to the degree that was presented in a previous post. In that previous post, where they talked about the patch, you could see a much greater improvement, with TPS staying consistent out to around 10,000 connections, whereas previously it fell off much sooner. But we don't really see that with the test they did here. So I'm not really sure what's going on, whether it's because it's still Beta 1 and there's maybe debugging enabled, or whether there's something different about the configuration. They don't really mention it in this post, but they do say that even with their testing there was a 20% improvement in TPS at 5,000 connections and a 50% improvement in TPS at 10,000 active connections. So an improvement they've demonstrated, but not as dramatic as the one people may have been hoping for; or again, maybe it's just the configuration or because it's still Beta 1. The next area they looked at is memory use with the new pg_backend_memory_contexts view. This is a new view where you can get information about your connection's memory usage, and there's also a function that lets you log the memory contexts of a particular connection, given its PID, to Postgres's logs, so you can get an idea of how memory is being used. Next they looked at monitoring, where there's a new pg_stat_wal view that gives you an indication of how many WAL records have been produced, how many full page images, and how many total bytes have been written. Now, my understanding is that this is cumulative, so basically you would have to reset the stats periodically, but it gives you a cumulative picture of what's going on. So you could find this beneficial, with resets, for tracking your WAL generation over time. And then they mention the new query ID. This is already present in pg_stat_statements, but Postgres 14 lets you turn on computation of the query ID globally. So here they're making a configuration setting change, and now it appears in pg_stat_activity. So you can track that these are essentially the same query: even though the query varies by its parameters, to the internal system it's going to produce the same query ID. So you can do something interesting and ask, out of the active queries in pg_stat_activity, which query ID is the most common? So you can see this query ID is using up most of the connections. So it definitely highlights a number of improvements coming in Postgres 14, and I encourage you to check out this blog post to learn more about these specific features.
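To give a rough sense of how these monitoring pieces fit together, here is a minimal sketch against Postgres 14; the PID is a placeholder, and compute_query_id would normally be set in postgresql.conf:

    -- Cumulative WAL statistics; reset them with pg_stat_reset_shared('wal')
    SELECT wal_records, wal_fpi, wal_bytes FROM pg_stat_wal;

    -- Log the memory contexts of a specific backend (12345 is a placeholder PID)
    SELECT pg_log_backend_memory_contexts(12345);

    -- With compute_query_id = on, see which query ID dominates the active connections
    SELECT query_id, count(*)
    FROM pg_stat_activity
    WHERE state = 'active'
    GROUP BY query_id
    ORDER BY count(*) DESC;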
[00:04:52] The next piece of content, UUID, serial or identity columns for PostgreSQL auto-generated primary keys? So basically, to me this is about which auto-generated primary keys are best to use. Now, the first thing they talk about is that you can go with what's known as a natural primary key. So if you have a users table and it has a unique username, you could use that as the primary key, or you could use the email address as the unique key for that table. The other possibility is to use a surrogate or artificial primary key, and a lot of application frameworks automatically generate these surrogate keys for you as the primary key. That's usually what I see; using a natural primary key is not as common. But this post asks: if you're using an artificial primary key, what's the best way to auto-generate it? And there are basically two techniques you can use. One is using a sequence of numbers, the other is using UUIDs. For generating a sequence of numbers, the best thing to use is a sequence in Postgres, which is a separate database object that generates a new number each time. The other alternative is UUIDs, and in Postgres 13 and higher you can use the gen_random_uuid() function when you're creating the column for the primary key. So for example, with a UUID column, you can set the default to be this function and designate it as the primary key. Otherwise, if you want to use a sequence, you can create an int or bigint column, default it to the next value of a sequence using the nextval() function, and designate it as the primary key. But you're going to have to have that sequence in place. Therefore, Postgres offers some convenience data types called serial and bigserial. What those do is, when you specify serial as the data type, it automatically creates the column as an integer, creates the sequence, and sets the column's default to the next value of that sequence. So it does that for you automagically. And of course bigserial uses a bigint instead of an int. Now, there is a newer way in Postgres to set up auto-incrementing integers with a sequence, and that's using identity columns. Basically you give the name of the column, the data type bigint, generated always as identity, with a primary key, and that essentially manages a sequence for you behind the scenes to give you an auto-incrementing column. And this is basically the way to go, because it's the SQL-standard way of doing it. Now, for completeness, they do mention you can use triggers, but that's definitely the slowest approach. And then they cover a few questions someone might have. The first is: should I use integer or bigint for my auto-generated primary key? And his argument is you should always use a bigint, because if you have a small table, the difference in size is not going to matter much since you're not storing that much data, but if you have a big table, you have a real risk of running out of integers. So he definitely suggests always using a bigint, and I definitely agree with that. Second question: should I use bigserial or an identity column for my auto-generated primary key? And his response is you should use an identity column unless you have to support old PostgreSQL versions. Now, I would add to that that if your framework automatically uses the old syntax, I tend to use what my application framework uses.
I don't want to swim against the current or make things more difficult for myself. So when the time is appropriate, definitely move to identity columns, but serial, bigserial and things of that nature are still going to be supported for quite a long time. Next question: should I use bigint or UUID for an auto-generated primary key? And his advice is to use a sequence unless you use database sharding or have some other reason to generate primary keys in a decentralized fashion, basically outside the single database. So I agree with this as well. I always use integers for my primary keys. The only time I don't is if some other service is generating a piece of data that I need to store. So if a JavaScript client is responsible for creating a unique identifier, then I would store it as a UUID in my system. Otherwise I avoid them, because UUIDs have a number of performance disadvantages. Now, they did do some benchmarking here and found UUIDs slower in some cases, but there can be some huge performance implications as your data size increases, particularly with WAL generation. There's actually a post from November 2018 from 2ndQuadrant where they show how the amount of WAL generated significantly increases, mostly due to full page image writes. And the reason this happens is because of the random writes. If you have a lot of data being inserted into a table with an integer primary key, it just keeps increasing, and new entries are usually added to the tail end of the indexes and the table as the data comes in. But if you have a UUID, those index insertions are random across the entirety of the index, so you're going to end up with a lot of bloat as well as a lot of full page image writes, because a lot of different pages are touched and need to be written to the WAL. So they were seeing significant hits to TPS and to WAL size with UUIDs compared to sequential IDs. Now, what this post mentions is that there is an extension that allows you to generate semi-random UUIDs, so they're partially sequential at the beginning and random at the end, and that made the performance of the system much better. So if you want to go with UUIDs and you're dealing with a lot of data, maybe you want to consider using an extension like that, which gives you better performance while still giving you partially sequential UUIDs. But definitely a great post talking about different options for auto-generated primary keys. So if you're interested, check out this post from Cybertec-postgresql.com.
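To make the options discussed here concrete, this is a minimal sketch of the three auto-generation approaches, using hypothetical table names:

    -- SQL-standard identity column, the recommended approach on modern Postgres
    CREATE TABLE users (
        id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        username text NOT NULL
    );

    -- The older serial/bigserial shorthand, which creates and wires up the sequence for you
    CREATE TABLE orders (
        id bigserial PRIMARY KEY,
        placed_at timestamptz NOT NULL DEFAULT now()
    );

    -- UUID primary key with gen_random_uuid(), built in since Postgres 13
    CREATE TABLE api_events (
        id uuid DEFAULT gen_random_uuid() PRIMARY KEY,
        payload jsonb
    );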
The next piece of content, Database design using composite keys. This is from Endpoint.com, and since we're talking about auto-generated primary keys, this is an argument basically for natural keys. In addition to using natural keys, he expands on it and says you could also use composite keys for your primary keys, so basically more than one column makes up the primary key. They give an example of a typical company where you have departments and employees, and say you need to add a new column to the department table that specifies teams; how would you do that if you were using a surrogate key for a primary key? He's arguing that in a lot of cases using natural keys is better, and when you do that, you can define a composite primary key that is composed of two different columns in that table. Now, some of the arguments I can agree with, but some I don't necessarily agree with. You definitely can go this way and there may be some benefits to it, but some of the benefits aren't as big as he suggests. For example, he talks about performance considerations, and that even though these natural keys can potentially use up more space, there are still performance advantages; but whenever you're storing more data, it is a hit on performance. Most of the reason I haven't used this, or only explore it on a case-by-case basis, is that my application frameworks automatically generate surrogate keys, and I've been able to operate my databases just fine using those surrogate keys rather than natural keys. Now, there may be certain cases where I might reach for them, but generally I tend to follow what my application framework does, again because I don't want to swim against the current too much. But if you want to learn more about using natural keys, particularly composite ones, with your database, you can check out this post from Endpoint.com.
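As a rough sketch of what a composite natural primary key could look like for a departments-and-teams setup (these table and column names are my own illustration, not necessarily the ones used in the post):

    -- Departments identified by a natural key, their name
    CREATE TABLE departments (
        department_name text PRIMARY KEY
    );

    -- Teams identified by the combination of department and team name
    CREATE TABLE teams (
        department_name text NOT NULL REFERENCES departments,
        team_name text NOT NULL,
        PRIMARY KEY (department_name, team_name)
    );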
[00:12:51] The next piece of content is actually a YouTube playlist called Postgres Pulse. Now, these videos were posted a while ago, but here is a convenient playlist. Each of them is about 30 minutes in length, and I think there are about 12 to 14 of them. So if you are looking for some Postgres video content, maybe you want to check this out and see if there's anything here that you've potentially missed.
[00:13:15] The next piece of content, PostgreSQL 14: substantial change to full text query parsing. Now, I think what they mean here is that there's a substantial change to how full text queries are parsed. Basically, there's been a bug in full text search where, when you try to search on something like pg_class, you weren't getting the expected results. For example, here's a websearch_to_tsquery call where you're searching for pg_class against text that contains it, and even though the text exists, the result is false, so it's not finding it. So there's basically a bug, and they had to resolve it in a certain way. As a result, you may get slightly different results from full text search in Postgres 14 based upon the resolution of this bug. So if you use a lot of full text search and you want to be aware of any potential changes to your results, you may want to check out this post to see if it would potentially impact you. If you want to learn more, you can check this out at Alexander Korotkov's blog on GitHub.io.
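As a rough illustration of the kind of query where this parsing change can matter (this is a sketch, not necessarily the exact example from the post), you can compare a phrase search over a term containing an underscore between versions:

    -- Because pg_class is split into multiple lexemes, a query like this
    -- can behave differently in Postgres 13 versus Postgres 14
    SELECT to_tsvector('pg_class pg') @@ websearch_to_tsquery('"pg_class pg"');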
[00:14:22] Next piece of content, Waiting for PostGIS 3.2: ST_InterpolateRaster. This is from Crunchydata.com, and they're talking about a use case where you have temperature readings, say, across the state of Washington, and you want to get temperature readings interpolated between the points where the measurements were made. If you want to do that using a raster, there's a new function that enables you to do exactly that and give you a gradient interpolating the temperature changes between the locations where the temperatures were actually measured. So if you want to produce something like this, maybe you want to check out this blog post.
[00:15:05] The next piece of content, Porting array length from Oracle PL/SQL to PostgreSQL PL/pgSQL. This is from Migops.com, and they're talking about a specific case where you're porting an array length check in an Oracle function to Postgres and you get an entirely different, unexpected answer. So if you are looking to migrate from Oracle to Postgres and want to watch out for some gotchas, maybe you want to check out this post from Migops.com.
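I can't say for certain that this is the exact gotcha the post covers, but a classic surprise in this area is how array_length() treats empty arrays, so here is a minimal sketch of that behavior:

    -- array_length() returns NULL, not 0, for an empty array,
    -- so a check like array_length(arr, 1) > 0 quietly evaluates to NULL
    SELECT array_length(ARRAY[]::int[], 1);               -- NULL
    -- cardinality() or a COALESCE guard gives the more Oracle-like answer of 0
    SELECT cardinality(ARRAY[]::int[]);                   -- 0
    SELECT coalesce(array_length(ARRAY[]::int[], 1), 0);  -- 0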
[00:15:35] Next piece of content, PGO 4.7, the Postgres Operator: PVC resizing, GCS backups, and more. This is from Crunchydata.com, and they're talking about a new release of their Postgres operator for working with Kubernetes. So if you are interested in that, you can check out all the new features that have been added in this release.
[00:15:55] And the last piece of content, the PostgreSQL Person of the Week is Magnus Hagander. So if you're interested in learning more about Magnus and his contributions to Postgres, definitely check out this blog post. That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to Scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.