Value of SQL, Window Functions, DB Migrations, Data Storage | Scaling Postgres 51

Episode 51 | February 18, 2019 | 00:15:58

Hosted by Creston Jamison

Show Notes

In this episode of Scaling Postgres, we review articles covering the value of SQL, window functions, scaling database migrations and efficient data storage.

To get the show notes as well as get notified of new episodes, visit: 

https://www.scalingpostgres.com/episodes/51-sql-value-window-functions-db-migrations-data-storage/


Episode Transcript

[00:00:00] In this episode of Scaling Postgres, we talk about the value of SQL, window functions, migrations and data storage. I'm Creston Jamison, and this is Scaling Postgres, episode 51.

[00:00:20] Alright, before we get started this week, I wanted to mention that on February 28th, which is a Thursday, I'm planning a live show or webinar of sorts about PostgreSQL configuration. If that's something of interest to you, please go ahead and respond in the YouTube comments or reply to the email. [00:00:42] Let me know if that would be something potentially of interest to you, and I'll have more information about it in the next episode of Scaling Postgres.

But to get started for this week, we first have a piece of content that I had the slide presentation for last week, about Breaking PostgreSQL at Scale. This is from Christophe Pettus, and they did post the YouTube video for it. I will have the link in the show notes, and I definitely suggest this is a piece of content to watch to match up with the slide presentation that was mentioned in last week's episode of Scaling Postgres.

The next piece of content is "SQL: One of the Most Valuable Skills", and this is from Craig Kerstiens. He's talking about how well SQL, or knowing it, has served him in his different responsibilities, from being a product manager to being an engineer. He focuses on, number one, it's valuable across different roles and responsibilities. So whether you're more on the business side or more on the engineering side, it's useful because data exists in both of those domains. Number two, learning it once doesn't really require relearning. He talks about how SQL hasn't changed that much over many years. They do add new features to it, such as CTEs (common table expressions) or window functions, but the fundamentals of getting to the data haven't really changed. Whereas if you look at a programming language like JavaScript, there seems to be a new framework out every two weeks or something like that. And then three, you seem like a superhero: quote, "You seem extra powerful when you know it because of the amount of people that aren't fluent." So basically, he says a lot of developers ask him for advice because he knows a fair amount of SQL. So if you want to make yourself a better developer, particularly when working with data, this is definitely a skill to learn. It's a good post and I definitely suggest you read it to get more of what he's talking about.

Now, related to that, the next post is "How to Calculate a Cumulative Percentage in SQL". This is from the jOOQ.org blog. He has a chart where he's calculating a cumulative percentage in SQL, along with the raw data, and he goes through how to do it. So this is a very simple way to start learning window functions, because that's how he's doing it. First he groups the data by date and then does the percentage calculation using window functions: summing the amount over an order by payment date, and then dividing by the sum of the amount over all the rows. He shows the resulting SQL query with the percentages cumulatively rising, and he notes that you can even nest aggregate functions in window functions, and shows another representation of doing that. So if you're wanting to learn more about window functions, definitely a blog post to check out.
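To make that concrete, here is a minimal sketch of the approach he describes, assuming a payment table with payment_date and amount columns (the post's actual schema and query may differ):

-- Group payments by day, then apply window functions over the grouped result:
-- a running total of the daily sums divided by the grand total.
SELECT
  payment_date::date AS day,
  SUM(amount)        AS daily_amount,
  ROUND(
    100.0 * SUM(SUM(amount)) OVER (ORDER BY payment_date::date)  -- running total (aggregate nested in a window function)
          / SUM(SUM(amount)) OVER (),                            -- grand total across all days
    2) AS cumulative_pct
FROM payment
GROUP BY payment_date::date
ORDER BY day;

The SUM(SUM(amount)) form is the nesting of an aggregate inside a window function mentioned above: the inner SUM is the per-day aggregate, and the outer SUM ... OVER turns those daily totals into a running total or an overall total.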
[00:03:45] The next post is "What's New in PostgreSQL 11?", and this is from modern-sql.com. Now, this is not like a lot of the other PostgreSQL 11 posts out there, because the focus is on the SQL. He talks about the SQL:2011 OVER clause, where PostgreSQL 11 has added support for the GROUPS frame unit and frame exclusion. So if you wanted to learn even more about window functions and SQL in general, this is definitely a blog post to read, because he goes into depth on what each of those means. In addition, he goes over the implementation of procedures (alongside functions) and the CALL command used to invoke those procedures, and he makes comparisons in his blog post about what PostgreSQL supports versus other database vendors. Then he talks about parameters in FETCH FIRST versus FETCH NEXT. This is essentially the LIMIT clause: LIMIT is PostgreSQL specific, but FETCH FIRST ... ROWS ONLY, for example, is the SQL-standard way of doing a limit. He talks about relative XPath expressions as well. So it's a different take on the PostgreSQL 11 feature set, mostly as it relates to SQL. If that's of interest to you, definitely a blog post to check out.

The next post is "Move fast and migrate things: how we automated migrations in Postgres", and this is from the Benchling engineering blog. They say they use an ORM, SQLAlchemy, and its companion migration tool, Alembic, and they describe how they use these tools to make changes to their database over time. A problem they ran into was some downtime. They were using a reference to determine what operations were safe to run on a large and active database, and they were adding a nullable institution column to the users table, which should be safe, but they ran into a lock queueing issue. For example, there was a long-running transaction that had grabbed a shared lock on users. Their migration tried to run, but it had to wait for that transaction to finish. Because the migration wants an exclusive lock, any other request that comes in for the users table gets queued behind it. So now all of these additional user request transactions are waiting for access to the users table, because the shared lock on the table is blocking the migration's exclusive lock, and everything else queues up behind the migration.

Now, how you handle this is with some sort of lock timeout, and they started using a lock timeout and a statement timeout so that the migration can fail gracefully. But of course, once they started doing that, it required manual intervention. They usually had to use three strategies to get around it: one, check if the migration was safe to rerun, and if so, retry it manually to see if they had just gotten unlucky; two, investigate what locks it was being blocked by and possibly shut down a cron system for some period, because the issue they had before was a cron job running a long-running query; and three, wait until a better time, when the database is less active, to run the migration. They tried some different things to make this easier, but they still had to manually rerun migrations a fair amount. So they actually built infrastructure, they say, to automatically retry safe migrations. With this system, they only retry migrations that have no intermediate commits, so the migration is just doing one thing, like maybe adding one column; they wait two minutes between each retry to give systems time to recover, or engineers time to respond if there's a problem; and they try at most ten times.
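Here is a minimal sketch of the timeout approach described above; the table, column and timeout values are illustrative, not Benchling's actual settings:

-- Fail fast instead of queueing behind a long-running transaction.
BEGIN;
SET LOCAL lock_timeout = '2s';        -- give up if the ALTER cannot get its lock quickly
SET LOCAL statement_timeout = '30s';  -- bound how long the whole statement may run
ALTER TABLE users ADD COLUMN institution text;  -- nullable column, no table rewrite needed
COMMIT;

If the lock timeout fires, the ALTER TABLE errors out and gives up its place in the lock queue, so other queries against the table are not stuck waiting behind it, and the migration can simply be retried later.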
And they say, quote, "In three months of running this in production, we have seen all of our migrations go through successfully without any manual work." Now, they also talk about the concept of pre- and post-deploy migrations, which I believe is a feature of the ORM tooling they're using. They had some problems with post-deploy migrations, so basically what they went with was just pre-deploy migrations: make the change to the database and then deploy the new code to the application. They basically moved to only that, and they said, quote, it "removed a lot of complexity and room for error."

But then there's another issue they tend to encounter, which is backward incompatibility. For example, they're trying to remove a column from a table, and their basic process is: remove all usages of the column in the code, and then remove the column with a migration. However, this can cause problems because, as they say here, quote, "every query to the users table failed until the new code was deployed a few minutes later." So even with a pre-deploy migration, between the time the migration ran and the time the new code was deployed and running, all queries failed. That's because ORMs, SQLAlchemy among them, and I know this happens with Rails as well, keep a reference to the columns that exist for certain selects and inserts. So if a column is suddenly missing from the table, selects and inserts or other database activity can fail. They say here that SQLAlchemy has two configuration options to truly remove usage of a column, a deferred setting and an evaluates none setting, but of course this requires remembering to do it. And Rails has, I believe, an ignored_columns capability you can add to a model, and you need to do that before you actually remove the column from the database so that the Rails software stops referencing that column. But they actually built in some compatibility checks to make this easier; they built a custom tool to handle it and simplify their migrations. So overall, I felt this was a good blog post that goes into the kinds of issues they were running into, some best practices you should consider adopting, and things to watch out for in your application as you're changing things fast. Definitely a blog post I suggest checking out.

[00:10:05] The next post is "Performance best practices for using Azure Database for PostgreSQL: Connection Pooling", and this is from the azure.microsoft.com blog. They do have references to some other, older posts here, but why I picked this is that I thought it was interesting, because anything about PgBouncer is, I believe, important to learn more about, since you're going to have to use it at some point when scaling. So they go over PgBouncer a little bit. But what I found interesting was the pgbench test they did. Because of the amount of work required to create and tear down connections in PostgreSQL, where each connection forks another process, just adding PgBouncer to their test scenario made a big difference. They said, quote, with PgBouncer "the throughput improved 4X ... while connection latency was reduced by 40%." [00:10:55] So in addition to reducing memory usage by being able to drop the number of connections in PostgreSQL, using PgBouncer can definitely give you some performance boosts. So PgBouncer, or Pgpool as another option, is definitely a tool to use for connection pooling for your database to improve its performance.

[00:11:17] The next post is "PostgreSQL Deep Dive: How Your Data Model Affects Storage", and this is from Crunchydata.com. Now, this is a pretty long post, but he goes over three or four different scenarios for different ways to store data in your database. For example, if you have a parent table and a detail table, maybe an orders table and an order detail table, he loads them up with data (I believe it was 100,000 orders and 10 million rows in the detail table), checks their sizes, and even runs some queries. But then he looks at it from a different perspective: what if you merge the parent and detail tables together using, say, text arrays? How does the size change, and how does the performance change? He goes through three or four different iterations to look at the different considerations in terms of size, and a little bit of the performance trade-offs, for each one. And one thing he says here, quote: "we have shown here that in certain, but I think reasonably common circumstances, you may be able to reduce 100 terabytes of data down to eight terabytes." So depending on how you decide to store the data, you could get a lot of space savings. If you have a huge data set and you're thinking about ways to restructure it, definitely check out this blog post.
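As a rough sketch of the kind of restructuring being compared, here are two layouts with invented table and column names (the post's actual schema and data are different):

-- Layout 1: normalized parent and detail tables.
CREATE TABLE orders (
    order_id   bigint PRIMARY KEY,
    ordered_at timestamptz NOT NULL
);

CREATE TABLE order_details (
    order_id bigint NOT NULL REFERENCES orders,
    line_no  int    NOT NULL,
    item     text   NOT NULL,
    PRIMARY KEY (order_id, line_no)
);

-- Layout 2: detail rows folded into the parent as an array, trading
-- per-row overhead and flexible indexing for space savings.
CREATE TABLE orders_denormalized (
    order_id   bigint PRIMARY KEY,
    ordered_at timestamptz NOT NULL,
    items      text[]
);

-- After loading the same data into both layouts, compare on-disk sizes.
SELECT pg_size_pretty(pg_total_relation_size('order_details'))       AS detail_table,
       pg_size_pretty(pg_total_relation_size('orders_denormalized')) AS array_table;

Much of the space difference comes from per-row overhead: every row in the detail table carries its own tuple header and repeats the order_id, while the array layout stores a single row per order.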
[00:12:41] The next piece of content is actually from FOSDEM. This is the conference where the first piece of content, Breaking PostgreSQL at Scale, was presented. This is the PostgreSQL track and the databases track, and there are relevant talks that I've discussed in the last episode of Scaling Postgres as well as in this one. These are all the different talks by the different speakers. What I also did is look through YouTube to see which videos have been posted. I'm not showing them here on the show, but I'll include links to each of the YouTube videos in the show notes, and for each one you can download the video recording as well. So there's definitely a lot of new content that you can feel free to check out.

[00:13:31] The next post is "What's up with SET TRANSACTION SNAPSHOT?". This goes over what the feature is: basically, if you're using repeatable read and you have multiple sessions that you want to see a consistent snapshot of the database, you can use SET TRANSACTION SNAPSHOT so that each client or session connecting to the database gets the same view of it. Now, the scenario he is talking about here is wanting to do a pg_dump in parallel with multiple sessions, but you could think of other scenarios where you're trying to read data, like running some process in parallel that analyzes the data, and you want a consistent snapshot. This is potentially a command and technique you can use to do that. So, definitely an interesting blog post to check out. This is from thebuild.com.
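For reference, a minimal sketch of how the pieces fit together (the snapshot identifier shown is just an example of the kind of value the first session returns):

-- Session 1: start a repeatable read transaction and export its snapshot.
-- This transaction must stay open while other sessions import the snapshot.
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT pg_export_snapshot();   -- e.g. returns '00000003-0000001B-1'

-- Session 2: adopt exactly the same view of the database.
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SET TRANSACTION SNAPSHOT '00000003-0000001B-1';
-- Queries here now see the same consistent snapshot as session 1.

This synchronized-snapshot mechanism is what parallel pg_dump relies on so that all of its worker connections dump a consistent state of the database.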
The last post is "An Overview of the Index Changes in PostgreSQL 11". This is from the severalnines.com blog, and they go over each of the different features as they relate to indexes. They talk about how PostgreSQL 11 introduced parallel B-tree index builds, so you can build your indexes faster; you just need to set max_parallel_workers and max_parallel_maintenance_workers, and you may also want to increase maintenance_work_mem to make parallel index builds work better, or more quickly. It added predicate locking for hash, GiST and GIN indexes, which, as they say, will make serializable transaction isolation more efficient when using those indexes. It allows entire hash index pages to be scanned, and you can specify a statistics value for a function index. With the new partitioning features, you can now have local partitioned indexes. And of course the big one for me is covering indexes, where you can use an INCLUDE clause to add additional columns onto the index as kind of a payload, so you can do more index-only scans. So if you're wanting to look into the index features of PostgreSQL 11, definitely a blog post to check out.

[00:15:34] That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to Scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.

Other Episodes

Episode 141

November 23, 2020 00:10:31

Unattended Upgrade, ARM Benchmarks, Exploration, PostGIS Performance | Scaling Postgres 141

In this episode of Scaling Postgres, we discuss an unattended upgrade, ARM Postgres benchmarks, how to explore new databases and PostGIS performance. To get...


Episode 165

May 16, 2021 00:13:33

Postgres Releases, Data Warehouses, Exclusion Operators, Read-Only Users | Scaling Postgres 165

In this episode of Scaling Postgres, we discuss the newest Postgres releases, implementing a data warehouse, using exclusion operators and setting up read-only users....


Episode 314

May 05, 2024 00:16:30

150 Times Faster pgvector? | Scaling Postgres 314

In this episode of Scaling Postgres, we discuss performance improvements for pgvector 0.7.0, a guide to vector embeddings, building a Retrieval Augmented Generation app...
