Parallel Indexing, SQL vs. ORM, Logical Replication Upgrades | Scaling Postgres 29

Episode 29 September 10, 2018 00:13:27

Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, we review articles covering parallel indexing, SQL vs. ORM, logical replication upgrades and development DBs.

To get the show notes as well as get notified of new episodes, visit: 

https://www.scalingpostgres.com/episodes/29-parallel-indexing-sql-vs-orm-logical-replication-upgrades/

Episode Transcript

[00:00:00] In this episode of Scaling Postgres, we talk about parallel indexing, SQL versus ORMs, logical replication upgrades and development databases. I'm Creston Jamison, and this is Scaling Postgres, episode 29.

[00:00:22] All right, our first article this week is "PostgreSQL: parallel CREATE INDEX for better performance". This is from the Cybertec-postgresql.com blog, and basically they're covering a new feature in PostgreSQL 11 that lets you create your indexes in parallel, using multiple cores of your machine. I really like the methodology this post lays out. First he creates a large table in PostgreSQL, and he actually uses the new stored procedure feature to insert a million rows at a time and commit after each million. Then he calls that procedure 500 times to create 500 million rows, and he does it with a single numeric column. He wanted to use numerics because they are more complex and take more time to index, so he felt it was a better test than doing it on an integer field. So basically he has 500 million rows of random numeric values, which comes to roughly 21 GB of data.

Now, in PostgreSQL 11 parallel index creation is on by default, and I believe the default he mentions here is max_parallel_maintenance_workers set to two. But for a baseline he set it to zero, so it essentially uses just one core, and creating the index on the system he's using took about 17 minutes. Next he tried two cores, bumping max_parallel_maintenance_workers up to two, and it took eleven minutes. So that's a pretty good drop, from 17 minutes down to eleven. Then he tried four, since I believe he says he's using a four-core machine, and it went from eleven minutes with two workers down to about nine minutes. At that point he thinks he's running into hardware limitations, either the disk or some memory constraints, because he hasn't tweaked those yet. So he moves on to adjusting memory, and then to using tablespaces for where the index goes and for the sorting space. He presents a graph at the end showing the improvement at each step: going from a single core to two cores to four cores, then configuring memory drops it down a bit more, then putting things on separate hardware drops it down a bit more. And just for reference, the final data point is where he was using integers, so index creation is really dependent on the data you're trying to index. But I thought this was a great post for giving you a sense of the kind of improvements you can potentially see with the new parallel index creation feature coming in PostgreSQL 11. So definitely a blog post I suggest you check out.
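Just to make the feature concrete, here's a minimal sketch of what a parallel index build looks like. The table and index names are made up for illustration and are not taken from the post:

-- Hypothetical table with a single numeric column, similar in spirit to the post's test setup.
CREATE TABLE t_numbers (value numeric);

-- Baseline: disable parallel index builds for this session.
SET max_parallel_maintenance_workers TO 0;
CREATE INDEX idx_numbers_value ON t_numbers (value);

-- Allow up to four parallel workers for maintenance operations such as CREATE INDEX.
SET max_parallel_maintenance_workers TO 4;
DROP INDEX idx_numbers_value;
CREATE INDEX idx_numbers_value ON t_numbers (value);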
[00:03:12] The next post is "I don't want to learn your garbage query language", and this is from Erik Bernhardsson. Now, this doesn't necessarily mention PostgreSQL, and it's a pretty short post that he classifies as a bit of a rant, but basically he's tired of all the different ORMs and domain-specific query languages that people create: why can't we just stick with SQL? He even calls out specific SaaS products that have their own query languages: Splunk has SPL, Mixpanel has JQL, Rollbar has RQL, New Relic has NRQL, AdWords has AWQL. So he's asking, why can't we just use SQL?

I thought this was an interesting post, and I definitely feel this at times. In my own development I do tend to use ORMs, but sometimes I just drop down to pure SQL when things start getting a little hairy and I want to be more specific about what I'm asking of the database. And whenever I'm having performance problems, I look at the raw SQL. That's something to keep in mind, because maybe you're having performance problems precisely because the ORM is doing something you don't expect. So if you run into performance problems, definitely look at the raw queries the ORM is sending to the database, because that may give you insight into how to change what you're asking the ORM to do, or you can drop down to raw SQL if necessary. The times I've felt I had to use an ORM or a DSL are when you have dynamic queries: maybe you want to do some reporting, and handing someone raw SQL is just too much, but you want to give them a form where they can compose their own queries. In those cases I've found I've had to resort to some kind of DSL, either building one yourself or finding one that provides that type of feature. But definitely an interesting viewpoint I thought I'd share.

[00:05:14] The next post is "Upgrading to PostgreSQL 11 with Logical Replication". Logical replication is a feature that came online with version 10, and they're talking about potentially using it to upgrade to PostgreSQL 11, which is coming out soon. First they cover the three main ways to upgrade a PostgreSQL installation. The first is just to do a pg_dump and then restore it. That's the most straightforward, and if your database is small it's definitely the way to go. The second is pg_upgrade. When your database gets a bit larger, this is generally the route you want, because a logical dump and restore can take a very long time, whereas pg_upgrade with the hard link option can be done with minimal downtime. And with version 10, logical replication became another possibility, which is what this post goes into. Now, I should add that he mentions, quote, "logical replication is the newest of the bunch here, so it will probably take some time to work out the kinks." So this is definitely a newer way to do it, there are certain things you need to be aware of and handle, and it's going to require a lot of testing. But basically he goes over a twelve-step process you can follow if you want to try doing your upgrade using logical replication. The other option, if you're looking for something like a zero-downtime upgrade, is potentially using pglogical, which I believe 2ndQuadrant maintains as well, and they've had previous posts that give some insight into how they do that. So with PostgreSQL 11 right around the corner, this was an interesting post to check out to see if you might want to do your upgrades using this method.
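As a rough sketch of the core idea only (the publication, subscription and connection details below are placeholders, and this leaves out most of the twelve steps, such as copying the schema and handling sequences):

-- On the old server (e.g. PostgreSQL 10), with wal_level = logical: publish all tables.
CREATE PUBLICATION upgrade_pub FOR ALL TABLES;

-- On the new PostgreSQL 11 server, after loading the schema (e.g. with pg_dump --schema-only):
CREATE SUBSCRIPTION upgrade_sub
    CONNECTION 'host=old-server dbname=mydb user=repl_user'
    PUBLICATION upgrade_pub;

-- Once the new server has caught up and the application has been switched over,
-- the subscription can be dropped:
-- DROP SUBSCRIPTION upgrade_sub;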
[00:06:56] The next post is "Twelve-factor: dev/prod parity for your database". This is from the Citusdata.com blog, and basically they're talking about how you generally want your development database to be as close as possible to your production database. They say minor version differences may be okay, but you definitely want to keep the versions as similar as possible and, if you can, keep the data as similar as possible. Now of course that's not easy if you have a terabyte or multi-terabyte database to keep in sync, and in production you generally have data you need to keep secure, which you don't necessarily want spread across developers' machines.

[00:07:44] So there are a couple of different ways you could approach this. You could potentially scrub all the personally identifiable information from the database, but again, that's not an easy process and you may miss something. One approach I've seen is setting up staging databases or even production replicas: you deploy application changes to those staging databases, test that everything still works, and then deploy to production. And for particular queries, you can test them against a copy of the live production database. It won't be exactly identical, but a recent copy of production can help you identify cases where a query chooses one plan in production but a different one in development. So that's something you can do to minimize performance differences between development and production. The post also covers things to keep in mind when you're adding a column as NOT NULL with a default, the safe process for doing that, as well as always creating your indexes concurrently. It then continues on to the idea of essentially having a replica of production, like I mentioned, by forking your database, and they talk about some tools Citus Data has for their solution. But you can do a restore to a separate PostgreSQL database to get something similar, or just maintain a separate server with a backup and apply the WAL files to it. So, definitely an interesting blog post with ideas for keeping your development environment similar to your production database, or, if that's not really possible, for minimizing potential issues caused by differences between development and production.

[00:09:24] The next post is "Why do we install as root?", and this is from Dave Page's Postgres blog at pgsnake.blogspot.com. He says a couple of common questions he hears from customers are: why do we install our software as root, and why do we run the services as postgres? The simple answer is security. Postgres gets installed as root, so many of the files are owned by root; if the postgres user gets compromised for whatever reason, the attacker can't modify the actual code or expand the attack to other areas of the server as easily. And why run the services as postgres? Because that user owns the data files, and again it minimizes the extent to which a compromise of that service impacts the server as a whole. He sums it up with a quote: "a basic principle when securing a software installation is: install with maximum privilege requirements and run with minimal." So if you've ever asked those questions, this is definitely a relatively short blog post to check out.

[00:10:31] The next post is "CHAR: What is it good for?", and this is from thebuild.com. Basically he talks about how PostgreSQL has the char type, but it's little used; most people use varchar or text, and part of the reason is that there's no benefit to using the char data type in PostgreSQL. In other database systems there's more of a reason to choose between them, since char is a true fixed-length type there and varchar actually saves space compared to char, but Postgres doesn't really work like that. He then walks through some surprising differences in how char behaves and basically suggests you only ever use varchar or text. And if you want more details about those differences, this is definitely a blog post to check out.
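As a quick illustration of the kind of surprise he means (my own example, not taken from the post): char(n) pads values with spaces, and those trailing spaces are ignored in some contexts but not others:

-- Trailing spaces are treated as insignificant when comparing two char values...
SELECT 'abc   '::char(10) = 'abc'::char(10);        -- true
-- ...but they are significant for varchar, so the equivalent comparison is false.
SELECT 'abc   '::varchar(10) = 'abc'::varchar(10);  -- false
-- The padding is silently dropped again when a char value is cast to text or varchar.
SELECT length('abc'::char(10)::text);               -- 3, not 10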
[00:11:15] The next post is "Tuning PostgreSQL Database Parameters to Optimize Performance", and this is from the Percona Database Performance Blog. Basically they go over a number of parameters to tweak when you're configuring your PostgreSQL instance. They talk about shared_buffers, which is generally set to around 25% of your RAM, and they give a couple of suggestions on how you might want to tweak that. They talk about wal_buffers, how it defaults to 16 megabytes, and how you may want to increase it if you have a lot of concurrent connections. They talk about effective_cache_size, which they suggest setting to around 50% of your memory. They talk about work_mem and how it's essentially a per-connection setting, although it can be used more than once depending on the queries you're running; for example, quote, "if you have many users trying to execute sort operations, then the system will allocate work_mem times the total sort operations for all users." So generally you want to keep a ceiling on this if you have a lot of connections, although you could ramp it up for a specific session, or if you don't have a lot of connections to your database. Then they talk about maintenance_work_mem, which is used for maintenance tasks such as VACUUM, restores, CREATE INDEX, adding foreign keys and ALTER TABLE, and how it can speed up your index builds; this is one of the settings that was changed in the first post we looked at. Then they cover synchronous_commit and how you could potentially turn that off for more speed at the cost of reliability, although I don't know if I would necessarily do that. And they finish up with checkpoint_timeout and checkpoint_completion_target and some general advice on those. So if you're looking to learn more about tuning your PostgreSQL installation, definitely a blog post to check out.

[00:13:04] That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to Scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.

Other Episodes

Episode 134

October 04, 2020 00:14:10

Community Acquisition, Space Saving Terabytes, WAL Archiving, Vacuum Analyze Tips | Scaling Postgres 134

In this episode of Scaling Postgres, we discuss a company acquisition within the Postgres community, how to save terabytes of space, setting up WAL...


Episode 214

May 08, 2022 00:19:29

Backup Compression, Postgres IO, Parquet Files, pg_stat_monitor | Scaling Postgres 214

In this episode of Scaling Postgres, we discuss parallel server-side backup compression, IO in Postgres, parquet files and the new pg_stat_monitor extension. To get...


Episode 222

July 04, 2022 00:17:47

Postgres 15 Beta 2, Concerning Locks, Vacuum Tuning, Transaction Anomalies | Scaling Postgres 222

In this episode of Scaling Postgres, we discuss psql shortcuts, how to debug deadlocks, how to find & stop queries and how to understand...
