Postgres 16 Released, The Postgres Meme, File vs base64 Strings, Intelligent Sharding | Scaling Postgres 282

Episode 282 | September 17, 2023 | 00:14:00

Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, we discuss the release of Postgres 16, a Postgres meme, storing files or base64 strings and sharding intelligently.

To get the show notes as well as get notified of new episodes, visit: 

https://www.scalingpostgres.com/episodes/282-postgres-16-released-postgres-meme-file-vs-base64-strings-intelligent-sharding/

Episode Transcript

[00:00:00] In this episode of Scaling Postgres, we talk about Postgres 16 released, the Postgres meme, file versus base64 strings, and intelligent sharding. I'm Creston Jamison and this is Scaling Postgres episode 282.

[00:00:24] All right, I hope you, your friends, family and coworkers continue to do well. As you can see, the first piece of content is that PostgreSQL 16 is released. But before we get into that, I wanted to mention some additional changes that have been happening to the show. First, all of the podcasts are now available on the website on each episode page, and they should be distributed to many more podcast providers as well, so if you've wanted to listen to a podcast version, it should be available in more places. Second, I've set up a form on my website to ask about particular struggles you're dealing with, because I wanted to gauge interest and see if this could become a small segment of the show discussing what others in the community are struggling with. I'm not going to put a link to it on the website for now, but I will put it in the email newsletter, so if you're subscribed to that, I'll mention it there along with a link to the form. And third, be prepared for next week: I may change the intro a bit and jump straight into the content instead of my typical opening. We'll see how that goes, and I definitely appreciate any feedback once you see it.

But as I said, the first piece of content is "PostgreSQL 16 Released!". This is great, and thanks to everyone in the Postgres community who made this happen. The announcement is on postgresql.org, and of course it lists all the improvements by area. I'll just highlight the ones I was most interested in. First, performance: I like any performance improvement, but the one I like most is the enhancement to COPY, where they quote tests showing up to a 300% performance improvement in some cases, which is awesome in my opinion. Second, logical replication: there are a lot of logical replication changes in this release, and I'll mention two. One is being able to do logical replication from a standby; before, only a primary could be a publisher, but now a standby can be one as well. The other, because so many people have asked about it, is bi-directional logical replication. We're not at multi-master yet, but this seems like a good step along the way. The next area is the developer experience, where I like the added syntax for the SQL/JSON standard. In terms of monitoring, it has to be the new pg_stat_io system view for I/O metrics. And for access control and security, I personally like being able to use regular expressions in places like pg_hba.conf now. Definitely check this blog post out, or download Postgres 16 if you want to learn more.
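To make a couple of those features concrete, here is a minimal sketch you could try against a Postgres 16 instance. I'm going from the release notes as I recall them, so treat the exact pg_stat_io column list and the SQL/JSON constructor syntax as assumptions to verify against the documentation:

    -- The new I/O statistics view, broken down by backend type, object and context
    SELECT backend_type, object, context, reads, writes, extends, hits
    FROM pg_stat_io
    WHERE backend_type = 'client backend';

    -- SQL/JSON standard syntax additions
    SELECT '{"a": 1}'::text IS JSON;                              -- IS JSON predicate
    SELECT json_object('release': 'PostgreSQL', 'version': 16);   -- constructor with key : value pairs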
The next piece of content is "PGSQL Phriday #012 Contribution Summary". This is from softwareandbooz.com, and it reviews the posts that were written for PGSQL Phriday in September. The first is one we covered last week, about the improvements around infinity. Two others we'll cover here. One is on bi-directional logical replication in PostgreSQL 16, which I just mentioned; this post goes into it in more detail, explains it, and actually shows what happens in 15. Basically, you end up with an infinite loop, because there's no way to identify the source of the data when you're logically replicating, whereas 16 adds that capability so you can get bi-directional logical replication working. Check this out if you want to learn more. The second post is about PostgreSQL 16's better numeric literals: being able to use underscores to make larger numbers easier to read. It looks like he goes into some of the implementation detail as well.

Also related to that, Postgres FM had another episode last week where they covered their favorite version 16 features. Nikolay mentioned that he primarily likes the fact that it's all open source, but he also called out some of the monitoring enhancements and a number of others, and Michael liked having the query ID available in auto_explain, along with some others too. So check out this episode; you can listen to it here or watch it on their YouTube channel.

Next piece of content: "Explaining The Postgres Meme". This is from avestura.dev, and I'm a little late to the party because the post is from January 9, but apparently there's a Postgres meme going around that maps Postgres features onto the levels of an iceberg. You know the saying, "I know this much, but that's just the tip of the iceberg"? Well, they use that to rank Postgres features by depth. Above the surface is what the vast majority of people know: SELECT, INSERT, UPDATE and DELETE, LIMIT and OFFSET, and the concept of NULL. A little lower you get normal forms, outer joins and ACID compliance; then deeper still, connection pools, recursive CTEs and cursors; the next layer covers triggers, MERGE and partial indexes; then SELECT FOR UPDATE and denormalization; and going further, things like "cost models don't reflect reality" or causal reverse. That's about the level where it started to lose me, so I don't know that I'm quite that grizzled a veteran. Pretty much everything above that I knew, maybe with one exception I wasn't quite sure about, but at that level and below I mostly hadn't heard of the items, apart from a few we have covered, like transaction ID exhaustion and, I believe, fsyncgate. What makes the post great is that he goes into detail on each item and explains what it is. I don't know if it's 100% correct, but I found it fascinating to test my knowledge and see how far down the iceberg I could go. If you want to learn more about PostgreSQL, definitely check out this blog post.
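A couple of the mid-iceberg items are easy to try yourself. This is just a small illustrative sketch with made-up table and column names, not anything taken from the meme post itself:

    -- Partial index: only index the rows a query actually filters on
    CREATE INDEX idx_orders_unshipped ON orders (created_at)
    WHERE shipped_at IS NULL;

    -- Recursive CTE: walk a parent/child hierarchy from a starting row
    WITH RECURSIVE subordinates AS (
        SELECT id, manager_id, name
        FROM employees
        WHERE id = 1
      UNION ALL
        SELECT e.id, e.manager_id, e.name
        FROM employees e
        JOIN subordinates s ON e.manager_id = s.id
    )
    SELECT * FROM subordinates;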
Next piece of content: "We used to store files as base64 strings in Postgres, but not anymore". This is from blog.mailpace.com. They're an email service, and email attachments, he says, quote, "are always encoded in base64 for sending", so their API accepts the data in that format, and they were simply storing it in a database table. Then they started running into issues with their data growing, quote, by several gigabytes per day. After debating what to do, they decided to keep the data in the table until the email is sent; converting it to a file and back to base64 again just to send it would add latency they didn't want. But after the email is sent, they save that base64-encoded data as a file, so if anyone needs to reference it for any purpose, they still have access to it. Now, I imagine this could cause bloat, with all this data going into a row and then being removed through an update, but he did say that after deploying this fix, their database stopped growing in size almost immediately. I thought this was interesting because it's yet another case where, when in doubt, it's usually best to store file-based data outside of the database rather than in it; I've rarely seen a case where storing it in the database makes sense, and even here they only store it temporarily before moving it to a file. What I might do in this case is put it in an entirely separate, partitioned table, so you can simply drop old partitions and minimize any bloat issues. But if you want to learn more, check out this blog post.

Next piece of content: "Data locality: Scaling PostgreSQL with Citus intelligently". This is from cybertec-postgresql.com, and they're trying out Citus, which is essentially an extension that scales out Postgres, sharding your database across multiple individual server nodes that act as one large database. They created a customer table and a sales table, with the sales table referencing the customer ID, and they wanted to distribute these tables. Citus has a command to create a distributed table: you give it the table name and, I'm assuming, the shard key as the second parameter. He sharded the customer table on its ID and the sales table on its ID, then inserted data into both tables. But the problem is that when you join the customer table to the sales table, the join now has to cross nodes; because the data is placed based on each table's own independent ID, there's no common identifier between them, and Citus actually gives you an error, quote, "the query contains a join that requires repartitioning". You can turn the enable_repartition_joins setting on, but you wouldn't want to do that in a production environment because it just makes things slow; you want to place your data so joins can happen on the same node. The way he resolved it here is he created the distributed sales table using the customer ID and told Citus to colocate it with the customer table. I actually haven't set up a Citus database yet, so I wonder how much of a requirement specifying the colocation is. Most multi-tenant databases have an account ID or tenant ID that the vast majority of tables carry; could you simply state that as the shard key, or do you actually need to use colocate-with on each distributed table? I'd have to take a look at the documentation. Anyway, once he properly colocates the data, he runs a query and of course it's much faster, 15 times faster than before. So definitely colocate your data when you want to shard.
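For anyone curious what that colocation looks like in practice, here is a rough sketch of the approach as I understand it from the Citus documentation. The table and column names are invented for illustration, and the exact function signature is worth confirming against the Citus docs:

    -- Shard both tables on the customer ID so related rows land on the same worker
    SELECT create_distributed_table('customer', 'id');
    SELECT create_distributed_table('sales', 'customer_id', colocate_with => 'customer');

    -- A join on the distribution column can now run locally on each worker
    SELECT c.id, count(*)
    FROM customer c
    JOIN sales s ON s.customer_id = c.id
    GROUP BY c.id;

    -- Repartition joins can be switched on, but are usually too slow for production
    -- SET citus.enable_repartition_joins TO on;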
Next piece of content, also related to Citus: "Citus: Sharding your first table". This is from cybertec-postgresql.com, and he walks through the process of getting started with Citus. He shows an example of using Docker to get started, or you can set it up manually on your own instances. He set up a coordinator node and five worker nodes, created a distributed table, tested some insert performance, then queried it and looked at explain plans. So if you're interested in getting started with Citus, you may want to check out these two blog posts.

Next piece of content: "Monitoring PostgreSQL replication". This is from cybertec-postgresql.com. Normally when I want to monitor replication, I use exactly the system view he mentions here, which is pg_stat_replication. It's important to query this view on the primary, or, if you're dealing with cascading replication, on whatever server you're replicating from, because it gives you the replication status for its replicas. So you run select * from pg_stat_replication to get the status, and the columns I tend to focus on are write_lag, flush_lag and replay_lag. These are based on the log sequence numbers, or LSNs: the difference between the sent LSN and the write LSN is the write lag, between the write LSN and the flush LSN is the flush lag, and between the flush LSN and the replay LSN is the replay lag. The sent LSN is the position sent from the primary to the standby, the write LSN is when the standby has asked the operating system to write that data, the flush LSN is when it's actually on disk on the standby, and the replay LSN is when it has been applied to the database files and can be queried. So even though it's on disk, it isn't necessarily in the database files and accessible through, say, psql. He explains that here, along with another system view you can use on the standby, pg_stat_wal_receiver. He notes that this one is far harder to read and that he prefers pg_stat_replication to make monitoring easier, and I agree with that 100%. Lastly, he follows up with replication slots: if you're using replication slots for your replication, you can also use the pg_replication_slots view to check their status. Check out this blog post if you want to learn more.
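As a quick reference, a monitoring query along these lines is what I'd run on the primary (or on the upstream server in a cascading setup). The lag columns and LSN functions are standard Postgres; the byte-difference calculation is just one way of expressing how far behind a replica is:

    -- Run on the primary / upstream server
    SELECT application_name,
           state,
           sent_lsn, write_lsn, flush_lsn, replay_lsn,
           write_lag, flush_lag, replay_lag,
           pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_bytes_behind
    FROM pg_stat_replication;

    -- If you use replication slots, also watch for inactive slots retaining WAL
    SELECT slot_name, slot_type, active, restart_lsn
    FROM pg_replication_slots;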
Next piece of content: "Who's in charge?". This is from momjian.us, and it's an interesting blog post about the organization of the Postgres project and how no one is really in charge. Or, more accurately, he says, quote, "So who is in charge? You are", along with everyone else in the community. Basically, the Postgres organization is a group of independent teams working together, and it's only when disagreements happen between these independent groups that the core team gets involved, which he says is relatively rare. I found it an interesting read and encourage you to check it out.

Now, the last two pieces of content are bonus content. I'm not going to cover them in depth, but if you use PostGIS you'll probably find them of interest. The first is the episode 7 highlights of the Path To Citus Con podcast with Paul Ramsey and Regina Obe. This is from citusdata.com, and the actual title of the episode is "Why people care about PostGIS and Postgres". The second, related post is by Paul Ramsey: "Random geometry generation with PostGIS", from crunchydata.com. So definitely check out these pieces of content if they interest you.

[00:13:35] That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.

Other Episodes

Episode 299

January 21, 2024 00:12:21

Build A GPT In SQL | Scaling Postgres 299

In this episode of Scaling Postgres, we discuss how you can build a GPT in 500 lines of SQL code, how to optimize extension...

Episode 179

August 23, 2021 00:15:40

Foreign Data Wrapper, Timescale Compression, Fuzzy Search, Query Execution | Scaling Postgres 179

In this episode of Scaling Postgres, we discuss how to use foreign data wrappers, compression available in timescaledb, working with fuzzy search and describing...

Episode 268

June 04, 2023 00:11:38

Faster Copy, psql Variables, backup_label, Bad Encoding | Scaling Postgres 268

In this episode of Scaling Postgres, we discuss faster COPY in Postgres 16, how to use psql variables, the importance of backup_label and issues...
