Generate Test Data, Faster Archiving, Date Statistics, Useless Indexes | Scaling Postgres 247

Episode 247 January 11, 2023 00:14:25

Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, we discuss how to generate test data, how WAL archiving is faster in PG15, how to set date statistics on a timestamp and how to avoid useless indexes.

To get the show notes as well as get notified of new episodes, visit: 

https://www.scalingpostgres.com/episodes/247-generate-test-data-faster-archiving-date-statistics-useless-indexes/

 


Episode Transcript

[00:00:00] In this episode of Scaling Postgres, we talk about generating test data, faster archiving, date statistics and useless indexes. I'm Creston Jamison, and this is Scaling Postgres, episode 247.

[00:00:22] All right, I hope you, your friends, family and coworkers continue to do well. Welcome to 2023. This is the first episode of the year after taking a few weeks off, and a lot of content was generated over the holiday break, so we have a lot to get through and I'm probably going to go through it pretty fast.

Our first piece of content is "Generating lots of test data with Postgres, fast and faster". This is from kmoppel.github.io, and this blog post covers all sorts of different ways to generate test data. Starting with the basics using generate_series, he has a quick way to generate 30 million rows, roughly a 1 GB table, in a few seconds, and there is a lot more you can do with generate_series to incorporate randomization and things of that nature. Plain generate_series is what he calls level zero; level one is basically pgbench, and the remainder of the post covers the different options you can pass to pgbench to generate quick test data. Now, the benefit of generate_series is that you can generate your own test data, whereas pgbench generates a particular schema of test data for you. But this post is a great resource for the pgbench options to use to reach the database size you're looking for. He shows the parameters he tends to use when generating a database; at a scale of 5000 it produces about 73 GB. You can also reduce logging by creating the data unlogged, which bypasses the WAL and runs faster; he shows this gives about a 40% boost. You can load data only, ignoring indexes, for about a 3x boost. You can hit the I/O even harder and avoid CPU bottlenecks by adjusting the fillfactor; here, instead of a fillfactor of 100, where pages are packed full, he used a fillfactor of 10, which really pushes the I/O. Or you can run parallel streams to use more than one CPU, since everything so far has been single threaded; he describes how to run jobs in parallel, which uses more CPUs and pushes the I/O even harder. Then he shows the results of the different changes and the speed-up improvements, and gives some final bonus advice for when you want to push the amount of test data even further. So if you're looking for a way to generate test data, this is definitely a resource to check out.
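As a rough sketch of the generate_series approach (this example is mine, not from the post, and the table and column names are purely illustrative):

  CREATE UNLOGGED TABLE test_data AS  -- unlogged to skip the WAL, as discussed above
  SELECT
    id,
    md5(random()::text) AS payload,                        -- random-looking text
    (random() * 1000)::int AS amount,                      -- random integer from 0 to 1000
    now() - random() * interval '365 days' AS created_at   -- random timestamp within the past year
  FROM generate_series(1, 30000000) AS id;                 -- roughly 30 million rows

The same SELECT works against a regular logged table too if you need the data to survive a crash; unlogged just makes the generation faster.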
The next piece of content is "Speed up of the WAL archiving in PostgreSQL 15". This is about WAL archiving and how Postgres 15 added the archive module (archive_library) feature so you can bypass running a shell command to do the archiving. But this post points out that even archiving WAL with a plain archive_command has been improved. It used to do a lot of directory reads of the archive status directory to check which file needed to be archived next; that was made more efficient by scanning the directory once and holding the results in an array in memory, which drastically reduced the number of directory scans. So much so that some people are reporting a 20x performance improvement in the speed at which WAL can be archived. So if your database is having trouble keeping up with WAL archiving, Postgres 15 may offer some speed improvements in that area. Definitely check out this blog post if you want to learn more.

[00:03:49] Next piece of content: "Improving GROUP BY with CREATE STATISTICS". He's talking about a time-series table with a timestamp column and values in a numeric field. Many of those timestamps are correlated, they share the same date and only the time of day differs, but Postgres isn't aware of that correlation. So when you group by the timestamp truncated to the day, the planner has no statistics for that expression and its row estimates in the EXPLAIN plan are off. He shows that you can use the CREATE STATISTICS capability in Postgres to create statistics on that day truncation, and then the query planner can give you an accurate estimate of how many rows fall on a particular day. Now, in the example he presents, just selecting from the whole table, you don't see a performance difference in EXPLAIN ANALYZE. But with this additional knowledge, when you're filtering for a specific date range, the planner should be able to choose a better plan for pulling out exactly the data you need. If you want to learn how to do this, definitely check out this blog post.
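As a minimal sketch of that technique (my own illustrative table and column names, not the ones from the post), on PostgreSQL 14 or later you can create statistics on the expression itself:

  CREATE TABLE measurements (created_at timestamptz, reading numeric);

  CREATE STATISTICS measurements_day_stats
    ON (date_trunc('day', created_at))
    FROM measurements;

  ANALYZE measurements;  -- the new statistics are only gathered on the next ANALYZE

After that, row estimates for queries that group or filter on date_trunc('day', created_at) should line up much better with reality.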
[00:05:03] Next piece of content: "Don't do this: creating useless indexes". This is from mydbanotebook.org, and it's actually part of the PGSQL Phriday series, but I thought it warranted a particular mention because they talk about how some ORMs, Django was even mentioned, make it possible to create indexes on every column of a table, which to me sounds a little insane. They've even seen a database with twelve times more index data than actual table data stored on disk, which is pretty crazy. So definitely don't create indexes on every column you have, or even close to that. The usual recommendation is to index your primary keys, of course, and your foreign keys, and beyond that there may be additional columns you need to index for better performance. But keep in mind that every index you add takes up more disk space and can crowd more valuable indexes or data out of the cache, not to mention hurting write throughput, because every index on a table has to be kept up to date when data is inserted or updated. So you definitely need indexes, just don't have too many. If you want to learn more, check out this blog post.

Next piece of content: "JSON logs in PostgreSQL 15". This is from cybertec-postgresql.com, and they talk about how to set up the new JSON logging feature in Postgres 15. If you do this, be sure you have tooling that can actually consume JSON logs. In particular, they point out that not every field is populated for every entry: in their example, a plain log entry carries fewer fields than an error entry, which contains more pieces of information in the JSON. That allows for some efficiency, but keep in mind that you need a tool that can properly parse the JSON output if you want to move to this format. Check out this blog post if you want to learn more.
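As a minimal sketch of turning this on (a typical setup, not necessarily the exact steps from the post):

  ALTER SYSTEM SET logging_collector = on;       -- required for jsonlog; changing this needs a server restart
  ALTER SYSTEM SET log_destination = 'jsonlog';  -- log_destination itself can be picked up with a reload
  SELECT pg_reload_conf();

Log entries then land in .json files in the log directory, one JSON object per line, which is what your log tooling needs to be able to parse.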
Next piece of content: "PostgreSQL hidden gems". This is from peter.eisentraut.org, and he picks what he considers the most overlooked feature from each of the previous Postgres versions. For 9.4 it's replication slots; for 9.5, pg_rewind; for 9.6, non-exclusive backups, which has definitely affected some of my clients because exclusive backups were entirely removed in Postgres 15. For Postgres 10 it's the CREATE STATISTICS option; for Postgres 11, the system catalog file format, which is more relevant to developers; for Postgres 12, setting a minimum SSL protocol version, definitely important so you no longer accept the older versions; for Postgres 13, logical replication of partitioned tables, definitely a benefit; and for Postgres 14, ALTER TABLE ... DETACH PARTITION CONCURRENTLY, so you can get rid of a partition you're no longer using without blocking the whole table. So definitely some hidden gems; check out this blog post if you want to learn more.

Next piece of content: "Advent of Code 2022: Days six through ten with PostgreSQL". This is from softwareandbooz.com, and for days six through ten of Advent of Code he has released his solutions as YouTube videos. Related to this, Greg at Crunchy Data has also released his solutions, but as individual blog posts for days 6, 7, 8, 9 and 10. So definitely check out these pieces of content if you want to learn how to tackle Advent of Code using just PostgreSQL.

[00:08:39] Next piece of content: "PGSQL Phriday #4: PostgreSQL and software development". This is from hdombrovskaya.wordpress.com, and this is January's PGSQL Phriday, on software development with Postgres. They're not talking about application development so much as the database-related scripts and jobs that come with working with Postgres, asking questions like: do you have your own scripts, and where do you store them? Do you store your SQL code in GitHub or elsewhere? Do you use pgTAP, and things of that nature? In terms of responses, Andreas Scherbaum drew on his PostgreSQL Person of the Week interviews and how the different people he's interviewed have answered these questions. softwareandbooz.com gave a response about how he tends to manage SQL, particularly from a DevOps perspective. gorthx.wordpress.com responded with a way to identify sequences that are about to run out, along with a script for doing it. And Rustproof Labs responded by describing the different types of code they have, mission critical, non-trivial or trivial, and how they work with that code in their solutions. If you want to learn more about this, you can definitely check out these blog posts.

[00:10:00] Next piece of content: "UNION ALL, data types and performance". This is from cybertec-postgresql.com. They were using a union to merge tables together and got unexpectedly poor performance. They had a bird table, a bat table and a cat table; they created a mammal view including body temperature with a UNION ALL and got a pretty fast response, but when they created a view called flying_animal, including wingspan, using UNION ALL, they got a very slow response. The reason is that if you look at the tables, the wingspan data type for a bird is real while the wingspan data type for a bat is numeric, and this data type difference is what caused the poor performance. The blog post goes into how they discovered it by comparing the EXPLAIN plans. Basically, with mismatched data types the implicit casting gets in the way, but you can fix it with an explicit cast so both sides of the union agree. Now, if you want to learn more about this, this is the post that Lukas covered this week in Five Minutes of Postgres on pganalyze.com, so definitely check out his episode, where he also covers previous Cybertec discussions about using ORs and unions.
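As a rough sketch of the fix (illustrative table definitions that only loosely follow the post's example):

  CREATE TABLE bird (id bigint, wingspan real);
  CREATE TABLE bat  (id bigint, wingspan numeric);

  -- Casting both branches to the same type means the view exposes one consistent
  -- wingspan type instead of leaving the planner to reconcile real and numeric.
  CREATE VIEW flying_animal AS
  SELECT id, wingspan::numeric AS wingspan FROM bird
  UNION ALL
  SELECT id, wingspan FROM bat;

Whether you standardize on numeric or real is a design choice; the point is that both arms of the UNION ALL should agree on the type.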
[00:11:20] Next piece of content: "Transparent Data Encryption". This is from percona.com, and this blog post covers the different levels at which transparent data encryption can be done. Now, Postgres offers hardly any of these out of the box, really only column-based encryption. It does not offer built-in disk-level encryption, you rely on your operating system for that, and it doesn't offer cluster-level, database-level or table-level encryption either. Other systems such as Oracle, MySQL and Microsoft SQL Server do have solutions for those, but for PostgreSQL only the column level is available. They mention pgcrypto, which you can use for column-level encryption, and definitely check out this blog post if you want to learn more about that. But there is another option besides pgcrypto, which is pgsodium, another way to do transparent column encryption for Postgres. So if you're looking into pgcrypto, maybe check out pgsodium as well.

Next piece of content: "Diffing PostgreSQL schema changes". This is from percona.com, and the way they do it is basically to take logical dump manifests and diff them against each other so you can see the schema changes. If you're interested in that, you can check out this blog post.

Next piece of content: "PostgreSQL vacuuming to optimize database performance and reclaim space". This is from percona.com, and they talk about vacuum and the different settings you can use to optimize your database's performance and hopefully reclaim some space. You can check out this blog post to learn more.

Next piece of content: "Oracle vs. SQL Server vs. PostgreSQL: date data type". This is from migops.com, and for those of you wanting to migrate to Postgres, it covers the differences in the date data type between those two database systems and Postgres. Check out this blog post if you want to learn more.

Next piece of content: "PostgreSQL 16, Part 3, or CommitFest 2022-11". This is from postgrespro.com, and they cover a number of features that were worked on during the third commitfest for Postgres 16, so you can check out this blog post if you want to learn more.

And over the break there were actually three episodes of Postgres FM. The one on December 23 was on row estimates: how they can cause poor performance, how to spot problems, and options to fix them. The episode on December 30 was a review of Postgres over the past year, and the episode on January 6 was about transaction ID wraparound: what it is and some ideas for minimizing the risk of it happening. So definitely check out these episodes if you're interested in that content.

[00:13:51] And the last piece of content: the PostgreSQL Person of the Week is Lucas Etter. If you're interested in learning more about Lucas and his contributions to Postgres, definitely check out this blog post.

That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.

Other Episodes

Episode 114

May 18, 2020 00:20:28
Episode Cover

Data Directory, Vacuum Features, Binary Performance, Network Latency | Scaling Postgres 114

In this episode of Scaling Postgres, we discuss the Postgres data directory, new vacuum features, binary storage performance and the impact of network latency....

Listen

Episode 243

November 27, 2022 00:10:25
Episode Cover

IN vs ANY, Ghost Conditions, Percentage Calculations, Variadic Unnest | Scaling Postgres 243

In this episode of Scaling Postgres, we discuss how PG15 helps reduce replication lag, how to get the parameters used in prepared statements, PostGIS...

Listen

Episode 136

October 19, 2020 00:13:08
Episode Cover

Stat Tracking, Temporary Tables, pg_squeeze, pg_auto_failover | Scaling Postgres 136

In this episode of Scaling Postgres, we discuss stat tracking, temporary tables, a new utility called pg_squeeze and an update to pg_auto_failover. To get...

Listen