Episode Transcript
[00:00:00] If you haven't been keeping up with the tech news this week, there was a massive backdoor discovered in the xz package, which is included in different distros like Debian and Red Hat, although it hadn't reached the stable branches yet. Thanks to some amazing work by Andres Freund, a member of the Postgres community, it was identified while he was doing his Postgres benchmarking work. So that's one of the things we'll be discussing this week, but I hope you, your friends, family and coworkers continue to do well. Our first piece of content is actually a Mastodon post, because I don't think any summary blog post necessarily covers it better. This is from Andres, who to my understanding is currently working for Microsoft, and he says he accidentally found a security issue while benchmarking Postgres changes. So if you run Debian testing, unstable or some other more bleeding-edge distribution, he strongly recommends upgrading ASAP. He goes into slightly more depth here: he was doing micro-benchmarking and wanted to quiesce the system to reduce noise. He saw sshd processes were using a surprising amount of CPU. He started profiling, which showed lots of CPU time in liblzma, and as he continued his investigation, he ultimately discovered a backdoor in the upstream xz package, which provides compression for Linux distributions. And here's the link to the report on openwall.com, with the subject "backdoor in upstream xz/liblzma leading to ssh server compromise". So this isn't necessarily Postgres-specific, of course, but I found it super interesting that Andres's role at Microsoft has him working on Postgres, and because of that work he essentially saved Linux distributions from a massive security vulnerability.
[00:02:04] But I encourage you to check this out if you want to learn more. Next piece of content: "Postgres is eating the database world". This is from pigsty.io, and it's pretty much a take on the phrase "software is eating the world". Well, Postgres is eating the database world, and despite any flaws or issues it has, it continues to march forward and keeps getting more popular, primarily for two reasons: one, it's open source, and two, its extensibility, meaning its capability to be enhanced in all sorts of different ways through the extension ecosystem. Now, this blog post does talk a lot about Pigsty, which is basically an open alternative to RDS where, by default, tons of extensions are available. But I still thought this article was really interesting in how it presents things. They talk about OLAP performance, and how on the ClickHouse benchmark called ClickBench, Postgres comes in at about 47x, meaning roughly 47 times slower than what ClickHouse can achieve. But there are other solutions trying to improve this: Hydra got down to 42x, and the top-tier ones like Umbra, ClickHouse, Databend and SelectDB are in the three-to-four-x range. But ParadeDB's new extension, pg_analytics, gets what he's calling second-tier performance at about 10x, so that starts to come right into the ballpark of the first-tier offerings. And if you can get close to top-tier performance, why would you want to switch away from Postgres at that point? He also mentions DuckDB, which gets 3.2x. That's not Postgres, but there is a foreign data wrapper to interact with it from Postgres. So basically, Postgres is becoming the database platform of choice, and you just plug in whatever extensions you need to make it more performant. What I also found interesting about ParadeDB is that it puts Lucene-like search into an extension, making it accessible through Postgres. Essentially, this is their chart here, where Postgres sits in the middle and all these other extensions add different features and capabilities, but all of your data is still stored in Postgres, unless you're using, as it shows here, a foreign data wrapper to access other resources. And this aligns with the State of Postgres survey done by Timescale, which he shows an image of here, where the main reasons people chose Postgres are, number one, it's open source, and number two, reliability, which is fantastic. So that's definitely important. But the third one is the extensions. So as needs present themselves, the extension ecosystem is able to move into action and address them faster than trying to get something built into Postgres core, for example. Then he goes into talking about Pigsty and its offerings, comparing all the different extensions available in Pigsty to those on some of the other hosted platforms. I found the overall message of the post very interesting, and it kind of aligns with my perspective, so I definitely encourage you to check it out if you're interested. Kind of continuing this extensions discussion, the next post is "Why Postgres extensions should be packaged and distributed as OCI images". This is from ongres.com, and in the whole discussion about how to package and distribute extensions that the Postgres extension network and David Wheeler have been having, this is an endorsement of actually using the OCI image specification. And OCI, in case you're unfamiliar, is the Open Container Initiative.
So yes, it's related to containers, but he argues the spec could also be adopted for delivering extensions, since a lot of the foundations have already been laid for building, packaging and distributing them, as well as for their security. So I encourage you to check out his viewpoint if you want to learn more. Also related to that, there was a post about what's happening on the Postgres extension ecosystem version two project. This is on tembo.io, and it discusses the current status of what they're thinking about for the next version of the extension ecosystem. The next post is also related: "PGXN version two: Go or Rust?". This is from justatheory.com, and of course this might cause an interesting flame war, but it discusses whether the new tooling should be written in Go or Rust, and there's actually a vote at the bottom where people can submit their recommendations.
[00:06:46] And then lastly, there's also an Eventbrite link for the mini-summits that are happening to discuss the Postgres extension network. So it looks like you can sign up and attend if you're interested in learning more. Next piece of content: "Row-level security for tenants in Postgres". This is from crunchydata.com. They're talking about row-level security, where you can set things up such that some trait of your user or your session dictates which rows you can see in the database. As an example, he created an organization table and then an events table, enabled row-level security on the events table, and created a policy where the id must match the current user. So a logged-in database user whose name matches that id would only be allowed to see the records for it. And he says this can work great when you have direct access to the database, meaning maybe you're starting a psql session. But if you're going through some sort of application server, it's not common for the user an application logs in with to be the same user hitting the database, so you're going to need to do something different. Here he's saying you could use session variables: you create a policy that looks for a particular session variable and uses its value to match against the id. So when you make a connection to the database, you must set what this id is, and then you can query the table and the policy will only show you the rows you have access to. He shows how you would need to set this in the session before doing a request. Both approaches are sketched below. And of course, if you do this, you're generally going to want your tenant key, in this case it looks like it's org_id, present in all the tables for this to work correctly. Now, one thing they didn't mention, or I didn't see mentioned, is performance, because historically getting row-level security to be performant is quite a struggle. As you can imagine, it is applying some sort of restriction to every query so that it only shows you the rows relevant to you, and performance can get even worse if your policy happens to be more complex. So if you're considering doing this, definitely watch out for performance issues.
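Here is a minimal sketch of the two row-level security variants discussed above. The table layout, policy names, and the app.current_org_id setting name are illustrative assumptions, not the exact code from the post:

CREATE TABLE organization (
    id   bigint PRIMARY KEY,
    name text NOT NULL
);

CREATE TABLE events (
    id      bigserial PRIMARY KEY,
    org_id  bigint NOT NULL REFERENCES organization (id),
    payload text
);

ALTER TABLE events ENABLE ROW LEVEL SECURITY;

-- Variant 1: tie visibility to the logged-in database user.
-- Works when each tenant connects as its own database role.
CREATE POLICY events_by_user ON events
    USING (org_id::text = current_user);

-- Variant 2: tie visibility to a session variable the application
-- sets after checking out a connection. (In practice you'd pick one
-- variant; multiple permissive policies are OR'd together.)
CREATE POLICY events_by_session ON events
    USING (org_id = current_setting('app.current_org_id')::bigint);

-- The application then sets the variable before querying:
SET app.current_org_id = '42';
SELECT * FROM events;  -- the policy filters to org 42's rows

-- Note: table owners and superusers bypass RLS by default, so test
-- with the role your application actually uses.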
Next piece of content: there was another episode of Postgres FM last week. This one was on search, and a lot of it was about text-based searching. They covered full-text search, fuzzy search using trigrams, semantic search, which you can get using pgvector, and then faceted search. If you want to learn more, you can listen to the episode here or watch the YouTube video down here.
[00:09:34] Next piece of content: "PG Phriday: a dirty Postgres RAG". This is from bonesmoses.org. This is another example of building your own RAG solution for doing retrieval-augmented generation. He used his own blog posts as the source data, loaded them into Postgres and generated the vectors. He did not use OpenAI, but rather a small transformer model, and he then goes through the process of setting all of this up to answer queries. The post goes into a lot more depth, so if you're interested in AI, I definitely encourage you to check it out.
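For reference, here is a rough sketch of the Postgres side of a setup like this using pgvector. The table name, the 384-dimension embedding size (typical of MiniLM-style models), and the chunking scheme are assumptions; generating the embeddings happens in your application with whatever model you choose:

CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE blog_chunks (
    id        bigserial PRIMARY KEY,
    post_url  text NOT NULL,
    content   text NOT NULL,
    embedding vector(384)   -- dimension must match your model
);

-- optional: an HNSW index for approximate nearest-neighbor search
CREATE INDEX ON blog_chunks USING hnsw (embedding vector_l2_ops);

-- retrieval: embed the user's question with the same model, then
-- fetch the closest chunks by L2 distance to feed to the LLM
SELECT content
FROM blog_chunks
ORDER BY embedding <-> $1   -- $1 = the query embedding
LIMIT 5;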
[00:10:14] Next piece of content: "Syslog logging with journald in PostgreSQL". This is from cybertec-postgresql.com. Basically, syslog is the default logging facility in Linux, but Postgres generally uses its own logging solution, or at least the standard error logger. Well, you can log to syslog instead; just be aware that the message levels get translated, and he shows the mapping between the Postgres message levels and the syslog severities. He shows how to set it up: what you need to change to set syslog as the log destination instead of standard error, and you also need to set a syslog facility, generally local0 through local7. He also chose to adjust the log_line_prefix, because the time and the process id are already included by syslog. Once you set that up, you can start using journalctl to query the logs, as opposed to doing grep like I tend to do. But check out this post if you want to learn more.
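As a quick sketch of the settings involved, expressed here as ALTER SYSTEM statements (the facility and prefix values are just examples, not necessarily the ones from the post):

ALTER SYSTEM SET log_destination = 'syslog';
ALTER SYSTEM SET syslog_facility = 'local0';  -- local0 through local7
ALTER SYSTEM SET syslog_ident = 'postgres';
-- syslog already stamps the time and PID, so trim them from the prefix:
ALTER SYSTEM SET log_line_prefix = 'user=%u,db=%d ';
SELECT pg_reload_conf();

-- then query the logs with journalctl instead of grep, for example:
--   journalctl SYSLOG_IDENTIFIER=postgres --since today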
[00:11:16] Next piece of content: "Pitfalls of using SELECT *". This is from stormatics.tech, and this is a recommendation I give frequently: basically, try to use SELECT * less. Only choose the columns that you want, because it results in less network traffic, as he says here, as well as less memory use in your application, particularly if you're pulling big text fields or even jsonb fields you don't need. He also talks about sorting and hashing, because those can also be hindered by having to deal with more data than you actually need. And of course, if you're dealing with large values like text or JSON, they're going to be stored in TOAST, so they will take longer to access and may even have to be decompressed. Finally, the fewer columns you pull, the better the chance you can get an index-only scan, if an index already exists on the table that can supply the columns you need. But check out this post if you want to learn more.
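To make the index-only scan point concrete, here's a small illustration with an assumed table layout (not from the post itself):

CREATE TABLE orders (
    id          bigserial PRIMARY KEY,
    customer_id bigint NOT NULL,
    created_at  timestamptz NOT NULL DEFAULT now(),
    details     jsonb   -- large values here get TOASTed
);

CREATE INDEX orders_customer_created_idx
    ON orders (customer_id, created_at);

-- pulls every column, including the big TOASTed jsonb values:
SELECT * FROM orders WHERE customer_id = 42;

-- pulls only what's needed; this can be satisfied by an
-- index-only scan on the index above, never touching TOAST:
SELECT customer_id, created_at FROM orders WHERE customer_id = 42;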
[00:12:50] And the last piece of content was an April Fools' joke from supabase.com, where they announced a data preservation service, basically putting your data on paper. So if you want a little laugh, go ahead and check out that blog post. I hope you enjoyed this episode. Be sure to check out scalingpostgres.com, where you can find links to all the content mentioned, as well as the video and audio of the show. And while you're there, please go ahead and sign up for the mailing list, where you get notified of every show that comes out. Thanks, and I will see you next week.