The Future of MySQL is Postgres? | Scaling Postgres 313

Episode 313 April 28, 2024 00:15:33
The Future of MySQL is Postgres? | Scaling Postgres 313
Scaling Postgres

Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, we discuss whether the future of MySQL is Postgres, how to use recursive CTEs, work on OrioleDB, and PG17 performance improvements.

To get the show notes as well as get notified of new episodes, visit: 

https://www.scalingpostgres.com/episodes/313-the-future-of-mysql-is-postgres/

 


Episode Transcript

[00:00:00] Under the heading of Postgres taking over the database world, we have a company proposing that it replace MySQL, basically by becoming wire compatible with it and its different clients, including its internal capabilities. We'll explore all that and more this week, but I hope you, your friends, family, and coworkers continue to do well.

Our first piece of content is actually a presentation that was given at Postgres Conference 2024 called "The Future of MySQL is Postgres". This is from nextgres.com. I'll turn off my camera for a moment so you can see the big beefy Postgres taking care of the little dolphin, maybe prior to it getting taken out, I don't know. This presentation starts by giving a general lay of the land, namely that Postgres is one of the few databases that continues to increase in usage according to Stack Overflow, while things like MySQL and Microsoft SQL Server are declining. In spite of this, MySQL is still the number one open source database according to DB-Engines rankings. They talk some more about the comparison between the two database systems, and around slide 17 they give a list of suggestions about what you need to think about when migrating from MySQL to Postgres. A lot of it is related to data type differences between the two systems and how you need to handle them. Then they talk about migration approaches and different tools or techniques you can use to migrate from MySQL to Postgres.

But then they mention that even if you do this migration now, you have to make all these application changes to be able to work with a Postgres database. So they're looking into making Postgres able to talk and act like a MySQL database system, so basically you don't have to change a thing if you want to start running Postgres. This is similar to what AWS did with Babelfish, where using Babelfish you become wire compatible with Microsoft SQL Server. So basically they want to do this with a Postgres extension, and they want wire compatibility with the MySQL protocol. They also mention the FEBE protocol here, which I'm not quite familiar with. How they intend to do it is to first start with a patched Postgres, using the exact same hooks that exist for AWS Babelfish. Further, they want to get these hooks accepted into baseline Postgres, so that you don't even need a patched Postgres, you can just install an extension and it works. These are some of the benefits they hope to realize: you'd get the wire protocol of MySQL, and the SQL syntax of MySQL would need to work as well. They have a pretty aggressive timeline, where they ideally want to have general availability by the end of this year, and they're looking for help if anyone wants to join. This definitely seems like a big reach. I mean, I hope they're able to leverage what AWS has done with Babelfish to help them achieve their goals, but it's definitely interesting. Check out this presentation if you want to learn more or even participate in it.

Next piece of content: "Recursive CTEs: Transforming and Analyzing Data in PostgreSQL, Part 3". This is from redgate.com, and this is the third part of a series we've been covering about using Postgres as a transformation tool in extract, load and transform. In this series they're focused on recursive CTEs, which are basically a way of doing loops or iterations within a single SQL query. So it's not using PL/pgSQL or anything like that, it's just a recursive CTE. They're attempting to solve another Advent of Code puzzle. Basically it looks like this puzzle is walking through a file system, and they go through the process of building the recursive CTE to give the solution to it. So if you want to flex your SQL muscles, definitely give this post a look.
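To give a flavor of the technique, here is a minimal sketch of a recursive CTE that walks a directory tree. This is not the schema or query from the Redgate post; the directories table and its columns are hypothetical.

    -- Hypothetical table: each directory points at its parent
    CREATE TABLE directories (
        id        int PRIMARY KEY,
        parent_id int REFERENCES directories (id),
        name      text NOT NULL
    );

    -- Walk the tree from the root, building the full path as we go
    WITH RECURSIVE tree AS (
        SELECT id, name, name AS full_path
        FROM directories
        WHERE parent_id IS NULL              -- anchor: the root directory
        UNION ALL
        SELECT d.id, d.name, t.full_path || '/' || d.name
        FROM directories d
        JOIN tree t ON d.parent_id = t.id    -- recursive step: children of rows found so far
    )
    SELECT full_path FROM tree ORDER BY full_path;

The anchor part runs once and the recursive part keeps joining back to the rows produced so far, which is how a single SQL statement can iterate over a hierarchy without any procedural code.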
Next piece of content: "Oriole joins Supabase". This is from supabase.com, and I did mention this last week as part of the larger group of blog posts Supabase published when they went general availability. I've covered OrioleDB before and some of the things they're working on, but one thing I hadn't heard of, or at least had forgotten, is that in addition to building a fast storage engine for Postgres and helping develop pluggable storage in Postgres, they are also working to decouple storage and compute in Postgres. So I wasn't aware of this third item they're working on.

Basically, OrioleDB is another storage engine for Postgres, so you would create a table on this new storage system and it would not use the standard heap. They're trying to achieve reduced I/O, so they're implementing a row-level WAL and a non-persistent undo log. I'm wondering if this is all contained in memory, so nothing is written to disk. They also talk about implementing lockless access for in-memory data; utilizing page-level data compression that can get up to four or five times space savings; bloat reduction, because it uses an undo log as opposed to storing all the different row versions in the table itself; and lastly, storing table rows in the leaves of the primary key index, making primary key lookups very cheap, which can also save storage space.

But I wasn't aware of the decoupled storage and compute. As a result of developing this engine, they've realized a significant reduction in the amount of disk I/O necessary, such that you can still get good performance by putting your database files in S3, which seems a little crazy to me, but they have a YouTube video here showing the database running with an S3 bucket, essentially. Now, I still have a whole lot of questions. I saw that WAL files were being archived to S3, which begs the question: where are the WAL files? Are they in the pg_wal directory? Are they somewhere else on some other disk? It's definitely something I'm going to keep my eye on, because I've really been waiting for a storage system that uses an undo log. I thought it would initially be zheap, but that seems to have stalled, and OrioleDB seems to be the next candidate. I see a lot of use cases where, if you have an update-heavy workload, the current heap is just not the best place for it, because you're constantly creating new rows based on how Postgres MVCC works and how all those versions are stored in the heap. If you had an efficient undo log, you could potentially store update-heavy tables there to avoid that type of issue. So basically, I think it would be great to try out a different storage solution for one or two tables of a database, without having to commit everything to it at once, work out any issues that appear over time, and maybe it will become the new default storage solution. But check this out if you're interested.
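Because OrioleDB plugs in as a table access method, opting a single table into it looks roughly like the sketch below. This DDL is not from the Supabase post; it assumes a Postgres build where the orioledb extension is available, and the counters table is made up.

    -- Assumes a Postgres build with the OrioleDB extension installed
    CREATE EXTENSION IF NOT EXISTS orioledb;

    -- Only this table uses the new engine; everything else stays on the regular heap
    CREATE TABLE counters (
        id    bigint PRIMARY KEY,
        value bigint NOT NULL
    ) USING orioledb;

    -- Update-heavy access patterns like this are where an undo-log design should shine
    UPDATE counters SET value = value + 1 WHERE id = 1;

That per-table opt-in is exactly why trying it on one or two hot tables first, as mentioned above, is appealing.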
[00:07:25] Next piece of content: waiting for Postgres 17, faster B-tree index scans for IN lists and = ANY, which basically means arrays. This is from pganalyze.com, and this is Lukas's Five Minutes of Postgres. There was a patch submitted that optimizes IN queries, or ANY lookups against arrays, and as he says here, it helps by reducing duplicative leaf page accesses. So when you have a query like this, where you're saying get me some data where the id is in this list of ids, what it actually does, if you look at pg_stat_user_tables, is seven index scans, basically one for each value in the list, which seems a little crazy to me. I would have expected one index scan, but this is Postgres 16 today, and you'll also see that the shared buffer hits were 15 for this particular query. On a patched Postgres 17, only three shared buffers were hit compared to 15, the execution time is actually about three times faster, and the index scan count is only one, so it's able to do one scan of the index to compare multiple values. Now it won't always be one, presumably, if you have thousands of ids in your IN list, but it should result in some real-world benefits.

This also results in an improvement for multidimensional access, which is the case where you're comparing two or more IN lists. He has an example query down here where a status is in two certain states, a sender reference is in three different values, and the result is ordered by the sent-at column, and there are multiple different indexes available. You can actually click on the working example to see the indexes, but Postgres 16 chose the index where the sent-at column was first, to help with the ordering, and it just does a filter on the other columns. It doesn't use an index on them at all, even though one exists, and as a result the performance is not ideal, 270 milliseconds. But Postgres 17 is able to use those index conditions from the where clause, so it doesn't have to do a filter like it did in 16, and as a result it's about three times faster at 60 milliseconds, and of course the buffers used are much lower as well. So these are some great improvements. I hope this makes it into Postgres 17. Definitely looking forward to it.
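As a rough way to see what he's measuring, here is a generic sketch, using a hypothetical orders table and ids rather than the article's working example, that runs an IN-list query and then checks how many index scans it generated:

    -- Hypothetical table with an index on id
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT * FROM orders
    WHERE id IN (10, 20, 30, 40, 50, 60, 70);   -- planned as id = ANY('{10,20,...}')

    -- Compare how many index scans the statement added
    SELECT relname, idx_scan
    FROM pg_stat_user_tables
    WHERE relname = 'orders';

Per the numbers discussed above, on Postgres 16 you'd expect idx_scan to climb once per value in the list, while the patched 17 build bumps it just once.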
[00:10:01] Next piece of content: there was another episode of Postgres FM last week, this one on massive deletes. They're talking about when you want to delete a bunch of rows in your table, maybe it's 100,000, a million, 10 million, whatever it is, and the problems that massive deletes can cause. First of all, if you don't have a high enough max_wal_size, you could get checkpoints happening really frequently, which can bog your system down, as well as potentially result in more full-page image writes to the WAL, further decreasing your performance. Nikolai is saying basically what you want to do is batch them, but those batch sizes shouldn't be too small, because then you'll have a lot of overhead from the commits, and you don't want them to be too large either, causing some of these other problems. So basically, like Goldilocks, you want the batches at an appropriate size where they don't impact your system. What I like to do is start my batches off small, maybe 1,000, and then ramp them up to 5,000, 10,000, 20,000, and so on, as long as there is no production impact, or if there is, it's minimal and within acceptable limits. The other thing I do, which they mentioned as well, is interleave vacuums throughout the batches, so that your table keeps being well maintained. They also mentioned that, depending on how much data you're deleting, you may even want to reindex if you think the deletes are going to be significant enough. And of course, if this is something you're doing on a regular basis, partitioned tables are a great solution for deleting data periodically if you can use them, because dropping a partition is much more efficient than deleting all the individual rows within it. But you can listen to the episode here or watch the YouTube video down here.

[00:11:48] Next piece of content: "Auto-archiving and Data Retention Management in Postgres with pg_partman". This is from crunchydata.com, and apparently there's a new release of pg_partman that allows auto-archiving, as in moving partitions into dedicated schemas for archival purposes. Like I've mentioned previously, I usually don't use pg_partman; I have my own scripts that set up partitioning the way I like. But if you're looking for a tool to help you do that, pg_partman is pretty much the go-to. So check this out if you're interested.

[00:12:24] Next piece of content is the PostgreSQL Extension Network (PGXN) version 2 architecture. This is a post on the Postgres wiki that describes the proposed architecture for version 2 of the extension network. They show the current architecture and what they're proposing, describing all the different layers and services they want to set up. So if you're interested in that, you can check out this post.

Next piece of content: "Ten psql Commands That Will Make Your Life Easier". This is from timescale.com, so we'll just run through these really quickly. \d lets you describe all the relations in your database; \d followed by a table name shows the definition of a single table; \e lets you edit your query buffer, so you can do \e, make your edit, and then execute it; \ef followed by a function name lets you edit a given function; \x toggles expanded output; \timing turns timing on or off; \c connects to a different database on the same cluster; \copy does a copy; \i reads in a file and runs its SQL commands; and \? shows all the different options that are available.

[00:13:36] Next piece of content: "Announcing PGDSAT to Satisfy CIS Benchmarks for PostgreSQL". This is from hexacluster.ai. They released an extension that helps you test 80 different security controls that are part of the CIS benchmarks, and it produces HTML or text reports. Here are all the different checks they presumably run as part of it, and they show you how to build and install it. So if you're interested in that, you can definitely check out this blog post.

Next piece of content: "A Deeper Look Inside PostgreSQL's Visibility Check Mechanism". This is from highgo.ca. Basically, visibility is about the fact that each database session has its own view into the data, and what data it can and cannot see is based upon visibility calculations. He goes through all the different places that determine what can be seen and unseen. So if you want more details about how visibility calculations are done in Postgres, definitely check out this blog post. I was particularly impressed by his flowchart showing the decision process for determining whether a given row is visible or invisible to a given session.
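If you want to poke at visibility yourself, the hidden system columns those calculations are built on can be selected directly. This is a generic illustration rather than an example from the article, and some_table is just a placeholder name:

    -- Every heap row carries the transaction ids that created it (xmin)
    -- and, if it has been deleted or updated, the one that removed it (xmax)
    SELECT xmin, xmax, ctid, *
    FROM some_table
    LIMIT 5;

    -- The current transaction id, for comparison against xmin/xmax
    SELECT txid_current();

Comparing those values against your own transaction id is, at a very high level, what the flowchart he presents is formalizing.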
So check this out if you want to learn more. And the last piece of content: "PostgreSQL Internals Part 3: Understanding Processes in PostgreSQL". This is from stormatics.tech. They go over all the different processes in Postgres and what their jobs and responsibilities are. So check out this post if you want to learn more.

[00:15:08] I hope you enjoyed this episode. Be sure to check out scalingpostgres.com, where you can find links to all the content mentioned as well as sign up to receive notifications of each episode as it comes out. You can also find an audio version of the show as well as a transcript. Thanks, and I'll see you next week.

Other Episodes

Episode 214

May 08, 2022 00:19:29

Backup Compression, Postgres IO, Parquet Files, pg_stat_monitor | Scaling Postgres 214

In this episode of Scaling Postgres, we discuss parallel server-side backup compression, IO in Postgres, parquet files and the new pg_stat_monitor extension. To get...


Episode 32

October 01, 2018 00:11:58

Sysadmin Concerns, Power of Indexing, pgbouncer Monitoring | Scaling Postgres 32

In this episode of Scaling Postgres, we review articles covering sysadmin concerns, the power of indexing, pgbouncer monitoring and pg_prewarm. To get the show...


Episode 123

July 19, 2020 00:13:28

JSONB Types, Earth Distance, Dates, Times & Intervals, Authentication | Scaling Postgres 123

In this episode of Scaling Postgres, we discuss working with JSONB types, calculating earth distance, utilizing dates, times & intervals and Postgres authentication. To...
