Episode Transcript
[00:00:00] Today we talk about a day when Postgres development stopped, and a few other times, but I hope you, your friends, family and coworkers continue to do well. Our first piece of content is "How engaging was PGConf.dev?" Really? This is from peter.eisentraut.org, and he apparently posted this on Twitter: how engaging was PGConf.dev? No commits to PostgreSQL during the entire conference. So of course he wanted to see how long that dry spell was, as it were. So he loaded the commits into Postgres and analyzed them with his query here to see how long the different gaps were (there's a sketch of that kind of gap query just after this paragraph). And this particular gap was the fourth longest gap, oh my gosh, in almost 28 years or so. So I don't know if it's necessarily good or bad that the longest gap is not on a holiday, but when people are talking about Postgres at a conference. I guess you could say Postgres developers are definitely committed, pun intended, to their work. And there was another tweet by Thomas: we should stop doing conferences, it obviously slows the development down. But check out this blog post if you want to learn more about the stats that he collected. Next piece of content: "pg_lakehouse: a DuckDB alternative in Postgres". This is from paradedb.com. Now, about four months ago they released something called pg_analytics, an extension that embedded Apache DataFusion to do analytics in Postgres in a very performant way. What's interesting is that in their blog index I no longer see pg_analytics; now they're talking about this new extension, pg_lakehouse.
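Going back to the gap analysis for a second: a query like Peter's can be written with a single window function. Here's a minimal sketch, assuming the commit history was loaded into a hypothetical commits table with a committed_at timestamp column; this is not necessarily his exact query.

    -- For each commit, compute the time since the previous commit,
    -- then list the longest dry spells first.
    SELECT
        committed_at,
        committed_at - lag(committed_at) OVER (ORDER BY committed_at) AS gap
    FROM commits
    ORDER BY gap DESC NULLS LAST
    LIMIT 10;

The very first commit has no predecessor, so its gap is NULL; NULLS LAST keeps that row out of the top of the list.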
[00:01:50] And initially I thought this was embedding DuckDB, because I was thinking of the Crunchy Data blog post from last week where they're using DuckDB to do analytics. But they're saying it's a DuckDB alternative because internally they are using the Apache DataFusion engine, and how they're using it is through a foreign data wrapper API to connect to external data sources, with the queries pushed down to Apache DataFusion. So apparently that is the query engine for these analytical queries. And the extension uses Apache OpenDAL, which can access up to 40 different types of data stores. So I think pg_analytics was maybe only for Parquet files, but now with this integration with Apache OpenDAL you have all of these different object stores, file formats and table formats. It might be that only Parquet files in S3 work right now, but they're looking to add more of these data formats for pg_lakehouse to support. They do have an example of how to get it working and how to get started with some sample data, doing some queries on it (there's a rough sketch of the foreign table setup after this segment). So if you're looking for a data lake solution and want to keep using Postgres, maybe check out this new extension, pg_lakehouse. And if you're ready to learn more about Postgres performance, I welcome you to check out my mini-course, PostgreSQL Performance Starter Kit. It is a free course that has just three lessons in it. There's a link in the description below where you can go to sign up for the course.
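To make the pg_lakehouse workflow above concrete, here's a rough sketch of the foreign data wrapper pattern it builds on. The wrapper, handler, server and table names below are assumptions for illustration, not the extension's real API; check the pg_lakehouse documentation for its actual handler names and option keys.

    CREATE EXTENSION pg_lakehouse;

    -- Hypothetical wrapper and server pointing at an S3 bucket.
    CREATE FOREIGN DATA WRAPPER parquet_wrapper
        HANDLER parquet_fdw_handler
        VALIDATOR parquet_fdw_validator;

    CREATE SERVER s3_server
        FOREIGN DATA WRAPPER parquet_wrapper
        OPTIONS (region 'us-east-1');

    -- Empty column list, assuming the extension infers the schema from
    -- the file; otherwise you would declare the columns here.
    CREATE FOREIGN TABLE trips ()
        SERVER s3_server
        OPTIONS (path 's3://my-bucket/trips.parquet', extension 'parquet');

    -- Queried like any other table, with the work pushed down to DataFusion.
    SELECT count(*) FROM trips;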
[00:03:22] Next piece of content: "Introducing pgCompare: The Ultimate Multi-Database Data Comparison Tool". This is from crunchydata.com. Much like the diff command that lets you compare two different files on Linux, for example, and shows you what is different between them, this apparently does it for database systems. The key features are multi-database support, so it not only supports Postgres but also connects to Oracle, MySQL and Microsoft SQL Server. It generates a comparison report that highlights the differences between the data sets, so apparently it can identify missing records, mismatched values and summary stats. Now, all the results are stored in a separate Postgres database, so this does have a database it connects to in order to store all of the data differences it detects. And you can actually specify which columns you want to include or not, or I assume which tables, so there are options you can adjust. And it says it is designed to handle large data sets with minimal impact on both the source and the target system, although I wonder how performant it is. Because when I looked at this, I was thinking, ooh, could this be a tool I could use when doing a logical replication upgrade? Because right now there are only some base checks you can do: does it have the same number of tables and the same number of indexes, are the row counts of the tables identical, and are some well-known queries returning the same answer? That's not a comprehensive check that everything is identical, and this tool may offer that capability. Although I'm not quite sure how it handles data that has recently changed; does it ignore new data coming in during its comparison operations? I didn't see that referenced here. But this is an open source tool, they have the link to it here on GitHub, and it looks like it is a Java tool. So they tell you how to get started with it and how to run a comparison, and they imagine this could be used for data migration, data synchronization, regulatory compliance, or quality assurance. So it's definitely something I'll be checking out, and feel free to check it out as well if you're interested.
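For what it's worth, the base checks I mentioned are easy to script yourself. Here's a sketch of that kind of sanity check, run on both the source and the target and compared by hand; this is not what pgCompare does internally, and the orders table is a hypothetical example.

    -- Count user tables and indexes on each side.
    SELECT count(*) AS table_count
    FROM information_schema.tables
    WHERE table_schema NOT IN ('pg_catalog', 'information_schema')
      AND table_type = 'BASE TABLE';

    SELECT count(*) AS index_count
    FROM pg_indexes
    WHERE schemaname NOT IN ('pg_catalog', 'information_schema');

    -- Exact row counts for a few well-known tables
    -- (can be slow on very large tables).
    SELECT count(*) FROM orders;

The point of a tool like pgCompare is to go beyond these counts and compare the row data itself.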
[00:05:25] Next piece of content: there was another episode of Postgres FM last week. This one was on sponsoring the community, and Michael was joined by Claire Giordano to talk primarily about the Postgres open source community. Specifically, she talked about the number of things that are going on at Microsoft; she was part of Citus and joined Microsoft when it was acquired. They talked about the upcoming conference this week, POSETTE, as well as all the different ways that you can contribute to Postgres other than just code. So if you want to check out their episode, you can listen to it up here or watch the YouTube video down here. Next piece of content: "Ultimate Guide to POSETTE: An Event for Postgres, 2024 edition". This is from citusdata.com, and this is a free and virtual event from June 11th to the 13th, so that's this week. This post talks all about the upcoming event, so if you want to check it out, definitely read this blog post so you can watch some of these talks.
[00:06:24] Next piece of content: "Postgres planner quirks: join equivalence classes and IN / ANY filters". This is from pganalyze.com, and he's talking about an issue someone had when they were joining two tables together and, in the WHERE clause, also filtering by an array of values, essentially similar to an IN clause, but using ANY. Given the size of the tables, they were getting poor performance from the resulting nested loop join. And Lukas explains that Postgres has an optimization, based on equivalence classes, where if it's joining on an ID that is part of a plain equality check, it can apply that restriction to the other table in the join as well and do a more efficient join. But that optimization does not work for ANY or IN clauses. Unfortunately there's no such optimization even in Postgres 17, but the way you can get around it is to add a WHERE clause on the other table specifying the exact same ANY condition. When this was applied, the query was about a thousand times faster; it basically went from hundreds of milliseconds to less than a millisecond. But if you want to learn more details about that, definitely check out this piece of content.
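Here's a minimal sketch of that workaround, with hypothetical table names, based on the behavior described in the post.

    -- Original shape: the ANY filter is only attached to one side of the
    -- join, and the planner can't propagate it through the equivalence
    -- class, so the other table is scanned much more broadly.
    SELECT *
    FROM documents d
    JOIN document_versions v ON v.document_id = d.id
    WHERE d.id = ANY ('{1,2,3}'::bigint[]);

    -- Workaround: repeat the exact same ANY condition on the joined
    -- table's column so both sides get restricted.
    SELECT *
    FROM documents d
    JOIN document_versions v ON v.document_id = d.id
    WHERE d.id = ANY ('{1,2,3}'::bigint[])
      AND v.document_id = ANY ('{1,2,3}'::bigint[]);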
[00:07:41] Next piece of content: "PG Phriday: Taking Postgres for Granted". This is from bonesmoses.org, and he's talking about the relatively new feature of predefined roles in Postgres. Previously, when you wanted someone to be able to write anywhere in the database, or read anywhere in the database, there was a fair amount of granting you had to do to achieve that, and he has an example of it here. But in recent versions of Postgres there are predefined roles, so you can simply assign a user to pg_read_all_data, which is much simpler to implement, or pg_write_all_data, or pg_read_all_stats, or pg_monitor, etc. So there are a number of these predefined roles that give you quick and easy access to particular areas of Postgres. If you want to learn more about that, check out his blog post.
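As a quick illustration, compare the old grant-heavy approach with the predefined roles; the data-access roles are available in Postgres 14 and later, and the role names being granted to here (reporting, etl_user, monitoring_user) are hypothetical.

    -- The old way, per schema, plus default privileges for future tables:
    GRANT USAGE ON SCHEMA public TO reporting;
    GRANT SELECT ON ALL TABLES IN SCHEMA public TO reporting;
    ALTER DEFAULT PRIVILEGES IN SCHEMA public
        GRANT SELECT ON TABLES TO reporting;

    -- The predefined-role way:
    GRANT pg_read_all_data TO reporting;
    GRANT pg_write_all_data TO etl_user;
    GRANT pg_monitor TO monitoring_user;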
[00:08:29] Next piece of content: "Foreign keys in PostgreSQL: Circular dependencies". This is from cybertec-postgresql.com, and he's talking about the situation where you have, say, a department table and an employee table, so an employee is part of a department. But what if you have a leader or a manager of the department that you want to reference as an employee? Well, that is a circular reference, and if you actually set it up this way, you can't insert any data into it. The solution is to use an INITIALLY DEFERRED clause for one of the foreign key relationships. What that means is that one of these foreign keys won't be checked until commit time, so you're able to insert the data successfully, and then when you do the commit, it will validate that everything is correct. So that's how you can implement a circular dependency in Postgres.
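A minimal sketch of that setup; the table and column names are my own, not necessarily the article's.

    CREATE TABLE department (
        id         bigint PRIMARY KEY,
        name       text NOT NULL,
        manager_id bigint
    );

    CREATE TABLE employee (
        id            bigint PRIMARY KEY,
        name          text NOT NULL,
        department_id bigint NOT NULL REFERENCES department
    );

    -- Defer one side of the circle so it is only checked at COMMIT.
    ALTER TABLE department
        ADD CONSTRAINT department_manager_fk
        FOREIGN KEY (manager_id) REFERENCES employee
        DEFERRABLE INITIALLY DEFERRED;

    BEGIN;
    -- The manager (employee 10) doesn't exist yet; that's fine, because
    -- the deferred constraint isn't checked until commit.
    INSERT INTO department (id, name, manager_id) VALUES (1, 'Engineering', 10);
    INSERT INTO employee (id, name, department_id) VALUES (10, 'Ada', 1);
    COMMIT;  -- both foreign keys are satisfied at this point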
[00:09:23] Next piece of content: "Controlling resource consumption on a PostgreSQL server using Linux cgroup2". This is from percona.com. We've had cgroups in Linux for a while, but now apparently there's cgroup v2, which is even better. And he says if you want to try this, you do need to be on an updated version of the kernel and/or Linux distribution to get the cgroup v2 capabilities. And he shows how you can set up Postgres and assign it a specific set of the system's resources.
[00:09:54] So maybe you might want to implement this if you're trying to share multiple databases on the same box but avoid the noisy neighbor problem, where one of the databases hogs all the resources; this essentially lets you fence that off. Or, he says, maybe you even want to run an application server and a database server on the same box and allocate resources to each of those responsibilities; that's possible to do too. If you're interested in that, check out this blog post. And the last three pieces of content are all AI related. The first one is from timescale.com, and it's "PostgreSQL hybrid search using pgvector and Cohere". Here they're combining a keyword search in Postgres with a semantic search using pgvector and Cohere. The keyword search, as they say here, gives you exact matches, essentially precision, whereas the semantic search using pgvector and Cohere gives you context, so essentially you're searching for meaning. And it shows how you can combine these together in a hybrid search to hopefully give you better search results (there's a rough sketch of the idea at the end of this segment). So check this out if you're interested. Next piece of content: "RAG app with Postgres and pgvector". This is from enterprisedb.com, and this is another example of building a RAG app with Postgres and pgvector. They have all the code listed here; it's quite long, but you can check this out if you're interested. And the last AI post is "pgvector as embedding store in PrivateGPT". So this is a separate project, PrivateGPT, that essentially gives you your own LLM, and in this example they're using pgvector as the embedding store. So you can check out this blog post if you're interested in doing that.
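As promised, here's a rough sketch of the hybrid search idea, using a hypothetical items table and a toy three-dimensional embedding; real embeddings from a provider like Cohere have hundreds or thousands of dimensions, and this is not the Timescale post's exact code.

    CREATE EXTENSION IF NOT EXISTS vector;

    CREATE TABLE items (
        id        bigserial PRIMARY KEY,
        body      text,
        body_tsv  tsvector GENERATED ALWAYS AS (to_tsvector('english', body)) STORED,
        embedding vector(3)
    );

    -- Keyword match for precision, vector distance for semantic ranking.
    -- The vector literal stands in for the embedding of the search phrase,
    -- which you would fetch from the embedding API at query time.
    SELECT id, body
    FROM items
    WHERE body_tsv @@ plainto_tsquery('english', 'replication lag')
    ORDER BY embedding <=> '[0.1, 0.2, 0.3]'::vector
    LIMIT 10;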
[00:11:40] I hope you enjoyed this episode. Be sure to check out scalingpostgres.com, where you can find links to all the content discussed, as well as sign up to receive weekly notifications of each episode. There you can also find an audio version of the show, as well as a full transcript. Thanks, and I'll see you next week.