Postgres 17 Beta 1 Released | Scaling Postgres 318

Episode 318 June 02, 2024 00:17:54

Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, we discuss all the new features in Postgres 17 Beta 1, some features that did not make it, database collations & sorting and causes of slow commits.

To get the show notes as well as get notified of new episodes, visit: 

https://www.scalingpostgres.com/episodes/318-postgres-17-beta-1-released/



Episode Transcript

[00:00:00] I'm not gonna lie, I got pretty excited seeing all the different features that are coming in Postgres 17 this fall, but unfortunately there are some things that aren't going to make it into 17, and we'll talk about that today. But I hope you, your friends, family and co-workers continue to do well. Our first piece of content is "PostgreSQL 17 Beta 1 Released", and in the opening paragraph they're definitely looking for people to download and test out the database. So if you are able to do that, that would be a great help to the community.

[00:00:35] Checking out the feature highlights, under query and operational performance improvements they mention the improvements to vacuum that resulted in a 20x memory reduction thanks to a new internal data structure, as well as the removal of the 1GB limit on memory, so vacuum can now use a lot more memory to do its work, which is great. They introduced a new streaming I/O interface that should improve sequential scans and running ANALYZE. There have been a number of changes that should help with common table expressions, or CTEs, otherwise known as WITH queries, as well as improved execution time when an IN clause is used with B-tree indexes, and parallel index builds are now possible with BRIN indexes. There have also been improvements to how the write-ahead log (WAL) locks are managed, showing up to a 2x performance improvement. Under partitioned and distributed workload enhancements, the Postgres foreign data wrapper can push more EXISTS and IN queries down to the remote server, and there are several logical replication improvements. One, you don't need to drop logical replication slots when doing a pg_upgrade, so you can avoid resyncing the data. But this is what I'm more excited about: failover control for logical replication, so essentially you have failover slots now, plus pg_createsubscriber, which lets you convert a physical replica into a logical one. In terms of developer experience, there are a number of new JSON features, such as JSON_TABLE, the MERGE command now supports the RETURNING clause, and COPY performance has been improved, with up to a 2x speedup when exporting large rows. And again, the thing I really like is the ON_ERROR option that allows COPY to proceed even if there's an error, and it gives you an indication of which rows need to be reloaded. I think this is huge too: Postgres 17 now has a built-in collation provider, so we don't have to rely on libc or even ICU anymore, making upgrading easier. The only disadvantage right now is that it just offers C.UTF-8, I believe, so if you want to do linguistic sorting or ordering, I think you'll still have to rely on another provider for that. But it definitely minimizes the number of indexes that would need to be reindexed if collation definitions change, like when you move from one operating system to another, or when you're using ICU and you upgrade those ICU collations. They also talk about a number of security features, and the main thing with backup is having incremental backup, which is awesome. And then there are improvements to monitoring, where EXPLAIN can now do SERIALIZE, so you can see the serialization component of preparing data, as well as improvements to vacuum progress reporting and other system views. I don't have time to cover absolutely everything, but this is an awesome feature set and I can't wait for Postgres 17 to get here. And for a lot of the query performance changes you don't really have to do anything; the planner has just become better at making things more efficient, which is awesome.
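To give a flavor of two of those features, here is a rough sketch based on the Postgres 17 beta documentation; the exact option spellings could still change before the final release, and the table, file path and database name are just made-up examples.

    -- COPY can now skip malformed rows instead of aborting the whole load
    -- (new ON_ERROR option in Postgres 17).
    COPY measurements FROM '/tmp/measurements.csv'
        WITH (FORMAT csv, HEADER true, ON_ERROR ignore);

    -- The new built-in collation provider with the C.UTF-8 locale,
    -- independent of libc and ICU versions.
    CREATE DATABASE app_db
        TEMPLATE template0
        LOCALE_PROVIDER builtin
        BUILTIN_LOCALE 'C.UTF-8';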
Now, related to this, we'll look at "What's not in Postgres 17 Beta 1 and how to test it" from pganalyze.com, and this is Lukas's Five Minutes of Postgres, where he does a review of what's in the beta. He definitely encourages people to please go ahead and test this out, but he also mentions some things that didn't make it in. One is temporal tables, or the temporal primary key, so that feature looks like it didn't make it. Next is cataloging of NOT NULL constraints. Next are removing useless self joins and transforming OR clauses to ANY expressions. Those last two would have been great additions, making the planner more intelligent about rewriting queries to be more efficient, but it's not a big loss because you can always rewrite them yourself. But the thing I am most saddened about is this one: custom reloptions for table access methods, and other changes that Alexander Korotkov worked on for table access methods. Now, Lukas doesn't explicitly mention it here, but I think this is related to OrioleDB, because all of the performance enhancements OrioleDB is working on require making some changes to the table access methods so that OrioleDB can operate as a basic extension that doesn't require patching Postgres to work. Unfortunately this just extends the amount of time before we have another storage solution available for Postgres other than the heap. So that's unfortunate, but if you want to learn more you can definitely check out Lukas's post here.

[00:05:15] Next piece of content: "Default sort order in Db2, SQL Server, Oracle and Postgres 17". This is from ardentperf.com, and it's another one of Jeremy's really comprehensive posts that I like a lot. He's talking about collations, and I'll just read this one section here. He says, quote, I now think that code point order, which is basically binary sorting, is the best default DB collation, mirroring what Db2 and Oracle do, and linguistic sorting can be used in SQL when it's actually needed for the application logic. And I've actually come around to this way of thinking too, particularly with the new built-in provider offering a C.UTF-8 collation in Postgres itself. So now, when you need linguistic sorting, you can just use another provider and whatever language you need to sort in. He goes over how Db2 handles things, how SQL Server handles things, how Oracle handles things, as well as how they potentially mishandle some stuff, so it's interesting how everybody kind of has their own issues, and then of course he covers the new Postgres 17 feature. Oh, the other thing I forgot to mention is that the advantage of a C or C.UTF-8 collation is better sort performance, so it really does make a lot of sense to only use linguistic sorting when you have to. That's another reason I'm definitely looking forward to what's coming in Postgres 17, and you can check this post out if you're interested. And if you are new to Scaling Postgres and performance in general, I do have a free mini course called the PostgreSQL Performance Starter Kit, with three lessons covering psql, understanding EXPLAIN, and pg_stat_statements. So if you're interested in that, you can sign up at the link in the description below.
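To illustrate the approach Jeremy is advocating, here's a minimal sketch; the customers table is made up, and the "en-x-icu" collation assumes a Postgres build with ICU support.

    -- Default ordering: fast, code-point (binary) sorting under a
    -- C / C.UTF-8 database collation.
    SELECT name FROM customers ORDER BY name;

    -- Ask for linguistic ordering only where the application needs it,
    -- by requesting an ICU collation per expression.
    SELECT name FROM customers ORDER BY name COLLATE "en-x-icu";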
[00:07:03] Next piece of content: "Why do I have a slow COMMIT in PostgreSQL?" This is from cybertec-postgresql.com, and they're talking about an issue where someone was looking at their database and seeing a lot of slow commit statements; in other words, the only thing in the log entry was COMMIT. So why would that be so slow that it shows up? He goes through a couple of reasons why that could be. Normally it's very fast, because when you do a commit all you have to do is set two bits for the transaction in the commit log. If you have enabled track_commit_timestamp, that does need to be written too, but then you just flush the write-ahead log to disk, unless you have synchronous_commit set to off. So because the work is primarily writing to disk, slow commits are usually a clue that there may be a disk problem. That could be a local disk problem, hopefully you're not on magnetic disks, but that could be a cause; or if you are on network-attached storage, there could be latency in the network or some other issue reaching the disk. Another cause can be deferred constraints or triggers. Deferred means it waits until a commit happens and then does all of the constraint checks and trigger firing, so that can cause a delay, and he shows an example of that here. Another cause can be cursors WITH HOLD. A cursor by itself shouldn't cause an issue, but when you specify WITH HOLD, which actually materializes the query for the cursor, that materialization step happens at commit time, so if it's a very long query, that's going to be a long commit in your logs. Another reason for a slow commit is having synchronous replication enabled, because that means when a client sends a request to the primary server, the WAL has to be written not only on the primary server but also on however many synchronous replicas you have set up, which incurs even more network round-trip time and can slow down a commit. And lastly, he says it can be caused by third-party extensions; for example, a foreign data wrapper that implements transaction processing at the remote data source may want to commit the remote transaction when PostgreSQL commits the local transaction, and if something like that happens, you're going to have more latency and slower commits. I thought this was a pretty interesting blog post, and you can check it out if you want to learn more.
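Here's a minimal sketch of two of those cases, using made-up tables, just to show where the work moves to COMMIT time (this is illustrative, not the example from the post).

    -- Deferred constraints: the foreign-key check runs at COMMIT, not at
    -- INSERT, so many deferred checks or triggers show up as a slow COMMIT.
    CREATE TABLE customers (id bigint PRIMARY KEY);
    CREATE TABLE orders (
        id          bigint PRIMARY KEY,
        customer_id bigint REFERENCES customers (id) DEFERRABLE INITIALLY DEFERRED
    );

    BEGIN;
    INSERT INTO customers VALUES (42);
    INSERT INTO orders VALUES (1, 42);   -- FK not checked yet
    COMMIT;                              -- deferred checks fire here

    -- Cursors WITH HOLD: the query is materialized at COMMIT, so a heavy
    -- query behind one also shows up as a slow COMMIT in the logs.
    BEGIN;
    DECLARE c CURSOR WITH HOLD FOR SELECT * FROM orders ORDER BY random();
    COMMIT;   -- the cursor's result set is materialized here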
[00:09:31] Next piece of content: "How we fused DuckDB into Postgres with Crunchy Bridge for Analytics". This is from crunchydata.com. Now, normally I don't cover a particular vendor's products on Scaling Postgres, but I found how this product was built and what it does really interesting, so I wanted to showcase it here. It's from Marco Slot, who used to work on Citus, and what Citus does is scale out Postgres: basically, with Citus you shard your data onto multiple different PostgreSQL servers, you contact a coordinator node, and it rewrites queries to pull the relevant data from each of the individual Postgres data shards. Well, they took that same concept, and now they're doing SQL pushdown to an OLAP engine, specifically DuckDB. So you send your query to Postgres, it gets rewritten and sent to the DuckDB processing engine, which relies on a local NVMe drive for caching, and it can read files, for example Parquet files on S3, to access the data. I thought this was super fascinating. They don't go into all the detail, but they give a basic overview of how it works, and if you're interested in that, I definitely encourage you to check out this blog post.

[00:10:56] Next piece of content: "Top five PostgreSQL surprises from Rails devs". This is from andyatkinson.com. He recently attended Sin City Ruby and got some feedback, I believe on his presentation, from some Rails programmers, and apparently these are things they didn't know or were surprised by. The first one he mentions is covering indexes. This is where you include a payload on an existing B-tree index, and the use case I think about is looking up a user by ID when all you need is their email or their name: you can include one or both of those columns as a payload on that ID index and get all the data you need from the index, essentially doing an index-only scan. He has a slightly different example here, where he's looking at trips and just pulling a rating as an additional payload. The next area he covered is viewing pages accessed; this is looking at blocks, or buffers, or pages in Postgres, basically the BUFFERS output of EXPLAIN, to get an indication of how much work is being done. The third thing that was surprising is ordering topics, and I don't know if they were confused about how to order in SQL; maybe they're ordering at the application level, which is interesting. The fourth surprise was enumerating columns versus selecting everything, and I guess they really don't know how the ORM is translated to SQL, because of course the advantage of enumerating columns is that you only pull what you need and have the database do less work. The fifth surprise is using Postgres for more types of work; here it looks like he was talking about how it's possible to use Postgres instead of another database solution, like maybe Redis or Elasticsearch, because there are utilities you can use with Postgres instead of bringing in those other database systems. Of course those have their advantages for certain use cases, but Postgres can do a lot for you. Check out this blog post if you want to learn more.

Next piece of content: there was another episode of Postgres FM last week. This one was on full text search, and Nikolay and Michael talked all about it. When I think about full text search, I think of it as basically whole-word search: it uses a typical natural-language dictionary to identify words, ignores stop words like "a" and things of that nature, and searches for whole words. If a word is misspelled, it's not going to find it. It does support prefix searching, but it doesn't really do similarity search. Related to similarity search, they did talk a little bit about trigrams and pg_trgm, the extension that breaks content up into three-character chunks so you can do a similarity search, which is separate from full text search. They also talked a little bit about semantic search, which is vector search using pgvector, and whether you should use that or full text search. They made an interesting point that semantic search with pgvector is an approximate nearest neighbor search, so it is not exact and you could be missing expected data. On the other hand, full text search, as long as you're accurate in your query, will always find all the references to what you're searching for. So they were thinking that maybe there's a use case for combining the two, so you can accurately find everything with full text search, while semantic search brings other additional relevant records into the search results. I thought that was pretty interesting. If you want to learn more, you can definitely listen to their episode here or watch the YouTube video down here.
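A couple of the things just mentioned, covering indexes, the BUFFERS output, full text search and pg_trgm, are easy to sketch in SQL. The tables and columns below are made up for illustration and aren't taken from the posts or the episode.

    -- Covering index: carry email in the index so this lookup can be an
    -- index-only scan instead of also visiting the table.
    CREATE INDEX users_id_covering ON users (id) INCLUDE (email);
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT email FROM users WHERE id = 123;

    -- Full text search: whole-word matching; misspellings won't match.
    SELECT title FROM articles
    WHERE to_tsvector('english', body) @@ to_tsquery('english', 'replication');

    -- Trigram similarity with pg_trgm: tolerant of typos, separate from FTS.
    CREATE EXTENSION IF NOT EXISTS pg_trgm;
    SELECT title FROM articles
    WHERE title % 'replicaton'
    ORDER BY similarity(title, 'replicaton') DESC;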
[00:14:37] Next piece of content: "Carbon footprint and query optimization". This is from hdombrovskaya.wordpress.com. It's a short post, but she makes an interesting point that if you care about the environment, optimizing your database is a way to act on that, because essentially your server is doing less work, requiring less electricity, producing less heat, etcetera. I usually make the argument that you save a lot of money when you do performance optimization, but this is another benefit of it, so you can check it out if you're interested.

Next piece of content: "Custom PostgreSQL extensions with Rust". This is from notsoboringsql.com, and in this post he builds an extension he calls pg_sysload. He's using the system load average on a Unix or Linux based system because he wants to place a governor on certain jobs running in Postgres. So he starts with the pgrx framework: he installs Rust, installs the system dependencies, and sets up cargo pgrx. And this is the code; he's basically parsing the /proc/loadavg file to get the data he needs. Once that is done, he builds the extension and creates it in the database, and now when you run the function it outputs the one, five and fifteen minute load averages. Then he actually does an implementation to throttle batch processing, by showing a way you can cause a particular session to sleep based on the load average. And this one down here looks like it adjusts the number of rows processed per batch based on the current resources. So check this out if you want to learn more.

Next piece of content: "Converting DMS to PostGIS point geometry". This is from crunchydata.com, and DMS data is basically degrees, minutes, and seconds data. So if you have a point somewhere on the globe, you would have a longitude and a latitude expressed in degrees, minutes, and seconds. Well, you can convert these coordinates into a point, which Postgres likes to work with, using the function here. What they did is create a function to convert from a DMS coordinate into a PostGIS point. Then you can add a new geometry column and go ahead and populate it with those points. (A rough sketch of this kind of conversion appears just after the transcript.)

And the last piece of content: "PostgreSQL data cleaning versus Python data cleaning". This is from timescale.com, and this is a super long post, it's listed as a 24-minute read, but it's similar to other posts that have talked about not necessarily cleaning up everything beforehand, but actually loading your data into Postgres and doing the cleanup and transformation in Postgres itself. So if you're interested in that, definitely check out this blog post.

[00:17:27] I hope you enjoyed this episode. Be sure to check out scalingpostgres.com, where you can find links to all the content mentioned in this episode, as well as sign up to receive weekly notifications of future episodes. There you can also find an audio version of the show, as well as a transcript. Thanks, and I'll see you next week.
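As promised above, here is a rough sketch of the general idea behind that DMS conversion. This is not the function from the Crunchy Data post; it assumes the PostGIS extension is installed, and the function name and sample coordinates are made up.

    -- Decimal degrees = degrees + minutes/60 + seconds/3600, negated for
    -- southern latitudes and western longitudes.
    CREATE OR REPLACE FUNCTION dms_to_decimal(
        deg numeric, mins numeric, secs numeric, hemi text
    ) RETURNS numeric LANGUAGE sql IMMUTABLE AS $$
        SELECT (deg + mins / 60 + secs / 3600)
               * CASE WHEN upper(hemi) IN ('S', 'W') THEN -1 ELSE 1 END;
    $$;

    -- Build a PostGIS point (longitude first, then latitude) in WGS 84.
    SELECT ST_SetSRID(
             ST_MakePoint(
               dms_to_decimal(73, 59, 9, 'W'),   -- longitude
               dms_to_decimal(40, 44, 55, 'N')   -- latitude
             ), 4326);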

Other Episodes

Episode 265

May 14, 2023 00:16:47
Episode Cover

pg_stat_statements, Transaction ID Wraparound, Consultant Knowledge, CitusCon | Scaling Postgres 265

  In this episode of Scaling Postgres, we discuss different ways to use pg_stat_statements, how to handle transaction ID wraparound, consultant knowledge and all the...

Listen

Episode 81

September 16, 2019 00:12:44
Episode Cover

Data Loading Speed, View Dependencies, Users & Roles, H/A Clusters | Scaling Postgres 81

In this episode of Scaling Postgres, we discuss data loading speeds, view dependencies, users & roles and high availability clusters. To get the show...

Listen

Episode 7

April 09, 2018 00:11:09
Episode Cover

Any Scale, Monitoring Vacuum, Copy, Annotated Config | Scaling Postgres 7

In this episode of Scaling Postgres, we review articles covering Postgres at any scale, monitoring vacuum, using copy, annotated config files, and how to...

Listen