Postgres Releases, PostgreSQL Survey, Partitioning vs. Sharding, Bulk Loading | Scaling Postgres 277

Episode 277 August 13, 2023 00:19:04
Postgres Releases, PostgreSQL Survey, Partitioning vs. Sharding, Bulk Loading | Scaling Postgres 277
Scaling Postgres
Postgres Releases, PostgreSQL Survey, Partitioning vs. Sharding, Bulk Loading | Scaling Postgres 277

Aug 13 2023 | 00:19:04

/

Hosted By

Creston Jamison

Show Notes

 

In this episode of Scaling Postgres, we discuss new Postgres releases, taking the 2023 State of PostgreSQL survey, partitioning vs. sharding and the fastest way to do bulk loads.

To get the show notes as well as get notified of new episodes, visit: 

https://www.scalingpostgres.com/episodes/277-postgres-releases-postgresql-survey-partitioning-sharding-bulk-loading/

View Full Transcript

Episode Transcript

[00:00:00] In this episode of Scaling Postgres, we talk about Postgres releases, PostgreSQL survey partitioning versus sharding and bulk loading. I'm Kristen Jameson and this is Scaling Postgres episode 277 all right, I hope you, your friends, family and coworkers continue to do well. Well, you take a week off and there is a deluge of postgres in the last two weeks. So there's a lot of content to go over for these past two weeks. But the first piece of content postgresql 15.414 point 913.1212 point 1611.21 and Postgresql 16 beta three released. This is from postgresql.org and these particular releases address two security issues. The first one is an extension script substitutions within quoting allow SQL injection and the second issue was merge fails to enforce, update or select row security policies, so you may want to update for that purpose as well. As there have been 40 bug fixes for all of these releases, the most significant one to be aware of is if you're using Brin indexes. So they have a fix for handling nulls and brin indexes and it will require you to reindex your brin indexes. So keep that in mind if you're going to update to this next version. And then lastly, there is a notice that PostgreSQL Eleven is end of life on November 9, 2023. So check this post out if you want to learn more. Next piece of content the 2023 State of PostgreSQL survey is now open. This is from timescale.com and they did a survey last year where they show the results here, but there is a link to take the survey here. So please go ahead and take out the survey so we can see what the perspective of the community is like this year. Next piece of content PSQL Friday Eleven this is from pgsqlfriday.com and they announced the new blogging event which has already happened, which was about partitioning and sharding. So in this next post talking about partitioning versus sharding, PostgresQL is from Tomaskentote medium and this is essentially the challenge post about what it should cover. [00:02:18] The next post related to the partitioning and sharding event is actually the one I relate to the most, meaning I kind of agree with a lot of what was mentioned here. And this is partitioning. We need it from hdombrovaskaya WordPress.com and she says, quote like many others I never did sharding for real, that is in a production environment, and I too have not done that. I've only done partitioning and just to level set where things are. So partitioning is where you take a table and you break it up into multiple tables using typically declarative partitioning in postgres. So it has a built in capability to split out tables to multiple child tables. Sharding is usually where you take a table and you want to separate it or put certain tables on another database outside of your existing one. So you can basically think of Cydus where you're scaling out to multiple database shards. Essentially you're sharding your database and putting the data in more than one database server. So she doesn't really have experience with sharding, but in terms of partitioning, she says, quote whenever I suggest somebody partition their large tables, I must warn them that partitioning will not make queries run faster and that I will do my best to ensure the queries will not slow down. And this is my experience as well. So many times people say they want to partition for query performance, whereas I've found that very hit or miss. Now it's not that they never get fast, but it's very hit or miss. And sometimes you can make things worse. And this is even if you are still using a key in your queries that the shard is based upon. So I haven't really figured out why some are better, why others are not, or I haven't seen anyone report on it or understand why that's the case, but that's been my experience and we'll look at a post that shows this in detail, but she says the reason to do it is for maintenance and archiving, and I totally agree with that. It makes things easier to maintain when you have huge tables that you're working with and if you need to archive them. So you're not going to need that data. It's much better to just drop an old partition than to actually try to delete it, potentially causing table bloat and index bloat, et cetera. So if you're going to be archiving data, definitely go ahead and partition your tables. And she talks about the importance of you need to change all your queries on these partition tables to include the shard key so that you can properly exclude the unnecessary partition. So you're only targeting the partition that you need to get the data from. So like I said, I think this post most closely relates to my experience with partitioning in particular. So definitely feel free to check that out in terms of other contributions, software and booze.com did thoughts on PostgresQl partitioning and sharding and the thing that was interesting about this one, he mentioned some things that timescale does. So they partition a little bit differently for their time series data. They don't really use hash and list partitioning, so it's basically by date. And they don't call them partitions, they call them chunks, as he says here. Also talked about Citus, which is of course the solution for sharding out your database. Talks about some of the advantages of partitioning, and it's basically from a maintenance perspective in my opinion, but also some potential gotchas mentioned in some of these posts, and those include having to include your shard key in every query, essentially to get optimum performance. The next one is you can't use unique indexes, or there's no globally unique index for a set of partition tables. You can create individual unique indexes on the child tables, but it can't exist within the scope of the whole partition table from the parent down. And then thirdly no concurrent index addition or removal. So basically there is a process to get them added in a current fashion. Basically you add them to each child table concurrently, and then you add a final index to the parent table. But it's not as convenient as just being able to run a single command. There's work involved with it. So those are the downsides that I'm particularly aware of that some people mentioned next post understanding partitioning and sharding in postgres and Citus. This is from citadata.com and they link to a very good article I suggest you check out, and that is brander's advice on partitioning. They have a link here, and they even have a quote that he gave here on using partitioning. So encourage you to check that out as well as their perspective on sharding because they are the sharding database solution basically. So that's the reason to check out this post. The next one partitioning versus Sharding and Postgresql from engineering adjust and this is from Andrea Sherbaum and this gives a real world perspective on using partitioning and sharding. And they actually use both. They don't use a solution like Cytus, it seems like they do application sharding, so the application determines where certain data is stored and they use shards for that purpose, but they also partition larger tables within those shards. So you can check out this if you want to learn more about that. Next post is on sharding and partitioning from andyatkinson.com where he covers a lot of the same things, but he actually talks about application level sharding and particularly he has experience in Ruby on rails and he has some perspective on how rails can actually handle some of this application sharding or some of that features were added to rails for that purpose. Next post related to it is partitioning and sharding in postgres. This is from pgmuster.com and he has a very similar perspective where he says, quote query performance as one of the major benefits of partitioning. This feels like something I should be on board with, but honestly I think that faster maintenance tasks is by far the bigger benefit. And I'd even say similar to the first post that was looked at, some of those query optimization benefits can be highly elusive and the maintenance is a bigger reason to do it. The next post is partitioning as a query optimization strategy. So this isn't part of PG SQL Friday, but he actually is looking at this question about query optimization and how partitioning can help. And here you can see what I'm talking about where you look at query one which is discussed in query two, and you can see the yellow is the partition table using a partitionwise join. And this is a logarithmic scale of execution time, so it is faster and he says maybe about four times faster using partition tables compared to non partitioning which is in the blue. But when you look at query two, I mean maybe there's some difference, but it's very subtle. So it depends on the query in the tables that you're partitioning, the results you're going to get. So sometimes it works, sometimes it doesn't have much of a change at all. So it's just something to be aware of. Oh, and I should say this is from ashutoshpg blogspot.com next post is mastering Postgresql table partitioning. This is from Fragland Dev and this is an overall post talking about partitioning table. And this post I wouldn't say necessarily helps you master table partitioning, but it basically gives you an overview showing you the different types and how to set them up. So it's more of a quick how to post to get partitioning set up. So you can check out this blog post if you're interested in that next piece of content. Bulk loading performance in postgresql this is from cybertechn postgresql.com and this is a set of experiments. It's actually similar to a video I did, oh my gosh, a number of years ago looking at bulk loading with regard to Ruby. So essentially they did a lot of replication of what I but added some new things as well, but basically the same conclusion. And if you have experience in this, you won't find anything surprising. So we tested six techniques. The first technique was single inserts in auto commit mode. So this is just doing insert statement, insert statement and committing after each 1 second one is single inserts in one transaction. So that's presumably doing a begin transaction, then doing all your insert statements and then committing at the end. The third one is single inserts with a prepared statement in one transaction. So basically doing a prepare command and then doing the execute command for that statement and do that in one transaction. Second is multiline inserts in one transaction. So do an insert with multiple rows of data in it as shown here. The third is doing that multiline insert with a prepared statement, so doing it that way. And then the last one is using copy. And here's the performance results. Basically, copy destroys them all. The single inserts with a commit transaction for each one is the slowest, immensely slow, almost 9000 seconds, followed by single inserts in one transaction, followed by single inserts with prepared statements followed by the bulk inserts, meaning a multiline insert and then bulk inserts with prepared statements. Actually a little bit longer, but definitely copy is the way to go if you're wanting to bulk load data. And failing that is using a multi line insert. Now related to this is optimizing bulk loads in postgres and how copy helps with cache performance. This is from pganalyze.com, and this is the post that Lucas covers this week in his five minutes of postgres. But he also talks about cache performance and how copy actually uses a ring buffer, which is different than the shared buffers that inserts uses. So it basically helps preserve your shared buffers and doesn't monopolize them if you're doing a very large bulk load when you use copy. So basically you can preserve your shared buffers for queries or things like that if you do your bulk uploads using copy. So definitely check out his post if you want to learn more about that next piece of content. Why does my PG wall keep growing? This is from cyberky and postgresql.com. So PG wall is basically the write ahead log, and it's the directory that all the wall files are written to to maintain integrity in case the system needs to be restarted or used as a reference for sending wall to replicas. And the most important thing in this post is quote never manually delete any files in PG wall because you will bring your system down. But in terms of what causes the PG wall to grow, there's a few reasons that can happen. One is that there's a problem with your archiving, so you have an archive command set up and something's broken about it, and the files just keep growing in PG wall because it can't archive them to the other destination that you've selected. The other thing that can happen is that it's too slow. Maybe you're generating more wall files than your archive command can process. Maybe you're trying to compress them and that compression is slow. Or maybe you're trying to send them over the network to s three or something, and that transfer process is slow, so speeding that up will help keep your PG wall under control. Next, it could be a replication slot issue. So maybe you have a replication slot where the replica has shut down, or maybe it can't communicate over the network and those wall files keep building because that replication slot won't allow them to be deleted until they've been consumed. So definitely be aware of that situation. And then the last one is your wall keep size is too high, so you've actually told it to keep too many wall files. So that can cause this type of problem as well. But check out this blog post if you want to learn more. Next piece of content JSon versus JSoN B and postgresql this is from dev two in the Francesco Tizio section, and this basically covers the difference in postgres between JSON and JSoN B. JSon basically just stores a text representation and pretty much exact representation of the JSON. It doesn't alter it, whereas JSON B stores a binary version that it does some alterations to. Some of those include removing white space, merging duplicate keys as he says here, and then ordering that the keys could be altered. Another benefit of JSONB is that you can also index it and makes it super fast for being able to retrieve information from that JSON B column. And it's also much easier to edit data within that JSON B column. So basically use JSON. If you want to preserve the results of the JSON, you don't want it altered in any way, whereas if you want to use it for pretty much all other use cases, go with the JSON B. Next piece of content PostgresQl 16 playing with PGStat IO extends this is from dbiservices.com this is a great post talking about the new view PGStat IO coming in 16. He walks through each of the process about how to understand how it works, how you can look at changing block sizes based upon data inserted into tables, and how to query this to understand what's going on in the file system. So definitely encourage you to check out this blog post if you're interested. Next piece of content slow and steady database constraints with Andrew Atkinson this is from indyatkinson.com and he's talking about a podcast he was on where he discussed different types of constraints such as default constraints, not nulls, primary keys, foreign keys, check constraints, and exclusion constraints. So if you want to learn more about that, you can check out this blog post next piece of content the cache that wasn't optimizing postgres for select performance this is from enterprisedb.com. They're talking about a situation where you want to be able to cache something to give better select performance. So they had a particular set of tables here, a particular query where they were maxed out at 10,000 transactions per second and they wanted to make it faster. So basically they made an aggregate table updated by triggers and then essentially creating this cached table. Their performance was able to increase by 2.5. So if you want to learn more about how they did that, definitely check out this blog post. Next piece of content what do you consider postgres best practices? This is from hdombroposkaya WordPress.com. This actually posed questions, but the answers are actually in the comments. So I thought this was particularly interesting. So feel free to check this out if you want to learn more about that next piece of content when did postgres become cool? This is from crunchydata.com and this is actually more of a history post about postgres. So it goes through all the different history and the point at which he felt it became cool. That he reveals in the conclusion here is basically when Heroku made postgres pretty much its only database, as well as postgres getting JSON support about the same time he feels that's the point at which postgres became cool. Next piece of content comparing postgres performance across cloud providers this is from blog perfects engineering and he actually compared three different providers and did some query tests against them for different levels of transactions and TPS rates. And from the results he showed here, he showed AWS coming on top compared to digitalocean and Google Cloud. But then he says, but AWS was highly variable and sometimes the results look like this. So whenever I see this, I definitely take it with a grain of salt. I was actually a former scientist, so any type of variability like this, I would actually like to see standard deviations to determine where this should actually sit. So if you have this high variability in it, I would like to see those standard deviations from the attempts to get a better idea of what the results actually could be in real world situations. So again, you probably want to take this with a grain of salt, but this is some information you may want to check out. [00:16:29] Next piece of content introducing PGMQ simple message queues built on postgres so this is a new extension that has come out. So if you're looking for a message queue, maybe you would like to check this out. Next piece of content rusting a new postgres c extension that records subtransaction counters this is from Bdruvo GitHub IO, and he's released a new extension called PG Sub Exact counters. So this does a counter of how many subtransactions are happening within your system. So check this out if you want to learn more. Next piece of content Pgvector fewer dimensions are better this is from superbase.com and they're looking at different embedding models and looking at some benchmarks that have been set up by hugging phase, and they're comparing different models. And what they say is that each model's dimension size has little to no correlation with its performance. So even though you're using higher dimensions, it doesn't mean you're going to get faster results. And what they actually found is that smaller number of dimensions for some of these models actually resulted in greater performance. So if you're using AI, definitely check out this piece of content if you're interested. [00:17:35] Next piece of content XKCD bad map projection with Postgis this is from crunchydata.com and I'm not going to attempt to explain this, but if you're interested in graphical information systems and bad map projections, definitely check out this pretty long post. Next piece of content a possible way to implement a shift function in plpgsql this is from Fluco, 1978 GitHub IO so check this blog post if you're interested in that, as well as his part two here. Next piece of content handling outliers and Brent indexes with the new multimin Max operator class. This is from pganalyze.com, and we covered that on the previous episode of Scaling Postgres. So this is Lucas's perspective on that post. Next piece of content since I took a break, there have been two postgres FM episodes. The first one is on high availability, and they cover, quote what it means, tools and techniques for maximizing it while going through some of the more common causes of downtime. And their second post is data types. So if you're interested in either of these two episodes, you can listen to them or watch them on YouTube. [00:18:39] That does it for this episode of Scaling Postgres, you can get links to all the content mentioned in the show notes. Be sure to head over to scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or itunes. Thanks.

Other Episodes

Episode 131

September 13, 2020 00:15:58
Episode Cover

Slow Archival, Index Improvements, Node Metrics, Join or Subquery | Scaling Postgres 131

In this episode of Scaling Postgres, we discuss slow WAL archival, index improvements coming for Postgres 13, accessing node metrics via SQL and deciding...

Listen

Episode 150

January 31, 2021 00:16:57
Episode Cover

Scaling Advice, Vicious Cycle, Improve Observability, Addressing Corruption | Scaling Postgres 150

In this episode of Scaling Postgres, we discuss scaling advice, a vicious cycle of WAL writes, how to improve observability and addressing database corruption....

Listen

Episode 0

March 08, 2021 00:17:16
Episode Cover

Citus Open Source, Time Series Performance, Subscripting Updates, target_session_attrs | Scaling Postgres 155

In this episode of Scaling Postgres, we discuss Citus 10 open source, time series performance in native Postgres, using subscripting for updates and new...

Listen