Reducing WAL, What is Null?, Understanding SQL, TPC-H Benchmarking | Scaling Postgres 104

Episode 104 March 09, 2020 00:15:21
Scaling Postgres

Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, we discuss how to reduce the size of your WAL files, what null means in PostgreSQL, how to better understand SQL, and lessons from TPC-H benchmarking.

To get the show notes as well as get notified of new episodes, visit: 

https://www.scalingpostgres.com/episodes/104-reducing-wal-what-is-null-understanding-sql-tpc-h-benchmarking/


Episode Transcript

[00:00:00] In this episode of Scaling Postgres, we talk about reducing WAL, what is null, understanding SQL, and TPC-H benchmarking. I'm Creston Jamison, and this is Scaling Postgres, episode 104.

[00:00:22] All right, I hope you're having a great week. Our first piece of content is "Reduce WAL by increasing checkpoint distance". This is from cybertec-postgresql.com. To start off, he creates a basic Postgres 12 cluster, sets max_wal_size relatively low, and turns off synchronous_commit so the tests run quickly. Next, he creates a pgbench test database with 10 million rows and runs a pgbench benchmark of ten times 1 million transactions, which processed about 13,700 transactions per second and generated about 77GB of WAL files. Basically, the low max_wal_size causes the database to checkpoint more often: it can't keep that much WAL around, so it needs to checkpoint and persist the changes that are occurring to the data files. And with that, it generated a lot of write-ahead log, 77GB. Next, he increased max_wal_size to 150GB, which is over 1,000 times as much, changed checkpoint_timeout to one day, and still left synchronous_commit off. Running the same benchmark now generated only 5.3GB of WAL, compared to 77GB. So why would the same set of transactions generate over ten times as much WAL? He explains it here. After a checkpoint, the first time a page or block is altered, Postgres needs to write the full contents of that page to the WAL to ensure a consistent starting point. Once that whole block has been committed to the WAL, subsequent changes won't require that to be done again until the next checkpoint, so frequent checkpoints mean full pages are written to the WAL files far more often.

An older post on this topic from 2016 is called "On the impact of full page writes", and this is from 2ndquadrant.com. Here they're talking about full-page writes, or the storing of full-page images in the WAL files. As an example, they look at the size differential between workloads keyed on a serial or bigserial column compared to a UUID. Because a serial is essentially sequential, whereas a UUID is randomized, UUID inserts modify many more distinct pages, so you get more full-page images stored in the WAL. Therefore, using UUIDs normally causes a great deal more WAL to be generated. This is a great post that goes more in depth on why these full-page writes are needed. So keep this in mind if you want to research how to reduce your WAL size as your database grows, and some potential solutions to save space and make it more efficient.
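As a rough sketch of what's involved, these are the kinds of settings and measurements the post discusses. The specific values mirror the post's experiment rather than being general recommendations, and the before-LSN below is a placeholder you'd substitute yourself:

ALTER SYSTEM SET max_wal_size = '150GB';     -- widen the checkpoint distance
ALTER SYSTEM SET checkpoint_timeout = '1d';  -- checkpoint at most once a day
SELECT pg_reload_conf();                     -- both settings apply on reload

-- Measure WAL generated by a workload: note the LSN before, run the workload
-- (e.g. pgbench), then compute the WAL bytes written since that point.
SELECT pg_current_wal_lsn();                 -- e.g. returns 0/16B3740
SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), '0/16B3740'::pg_lsn) AS wal_bytes;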
[00:03:25] The next post is "What is NULL?", and this is from percona.com. They're talking about how the concept of null has different definitions in different languages. For example, in C and C++, NULL is defined as zero in a standard header file. In Java, null means a variable reference does not have a value, and it can be tested against null with the equality operators. But in PostgreSQL, null basically means an unknown value. They have a quick comparison: 1 = 1 is true; does 'foo' equal 'foo' for text? Yes, that's true. Does null equal null? No. And the reason is that you can't use the equality operator here: asking whether an unknown value equals an unknown value is itself unknown. Similarly, you can't ask whether an unknown value is not equal to an unknown value; that's unknown too. The way you work with nulls in PostgreSQL is with IS and IS NOT. So NULL IS NULL gives you true, and NULL IS NOT NULL gives you false. Similarly, you can use the COALESCE function, which returns the first of its arguments that is not null, and you can test expressions that way, although normally you're going to use IS NULL. They also mention NULLIF, which I personally haven't used; it returns null if the two values passed in are equal, and otherwise returns the first value. The post then goes into a couple of quick examples of how to work with null. In their example data, where nulls are present, you can't ask whether the surname is blank, because it's not blank, it's actually null; you need to ask for rows where the surname column IS NULL. So this is a relatively brief post that talks about null as it relates to PostgreSQL, and some things to be aware of if you are relatively new to Postgres.
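As a quick sketch of the comparisons being described (expected results in the comments):

SELECT 1 = 1;                     -- true
SELECT 'foo' = 'foo';             -- true
SELECT NULL = NULL;               -- null: unknown = unknown is unknown, not true
SELECT NULL IS NULL;              -- true
SELECT NULL IS NOT NULL;          -- false
SELECT COALESCE(NULL, NULL, 42);  -- 42, the first non-null argument
SELECT NULLIF(1, 1);              -- null, because the two arguments are equal
SELECT NULLIF(1, 2);              -- 1, the first argument
SELECT * FROM people WHERE surname IS NULL;  -- hypothetical table and column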
[00:05:26] The next piece of content is "SQL DISTINCT is not a function", and this is from blog.jooq.org. Basically, he's seen instances of someone writing SQL using DISTINCT with parentheses around a column. But DISTINCT applies to the whole SELECT list; it's not a function that takes a column as a parameter. So those parentheses are unneeded, they don't do anything, because the DISTINCT applies to the whole thing. He goes through and explains why this is the case and how to work through it. The one exception, he says, is DISTINCT ON, which is a PostgreSQL extension to the SQL standard, and DISTINCT ON does require parentheses around the expressions passed to it.

Related to this, there is a further post he did the next day called "5 Ways to Better Understand SQL by Adding Optional Parentheses". The first part of it links to all sorts of posts with additional lessons on SQL, so if you want a refresher or you need more education with regard to that, definitely check out some of those. The rest of the post talks about the placement of parentheses and how it can help you understand what's going on in SQL. He first covers row value expressions: you can write your SQL plainly, but you can also put parentheses around the values without changing anything, and you can even use the ROW keyword with parentheses and still get the same result. You can even put parentheses around two different columns, say the first name followed by the last name, and query with that pair to get the same set of data. And you can use the IN predicate with a list of row values, or with a subquery, to say: find the first name and last name where the pair is in this set, and it will only return the records that match. Next, he moves on to joins, explains how they work and how you can use parentheses with them, and also goes over some precedence rules. Personally, I just like writing joins without any parentheses to understand what's going on, and he agrees that's nice, clean SQL, but he's using these parentheses as a way to help explain what's going on under the covers. Then, like the previous post, he talks about DISTINCT and how it applies to the whole select list as opposed to a single column. He goes over UNION, INTERSECT and EXCEPT and how parentheses can help you better understand those, and finally follows up with subqueries along with some conclusions. So if you want a refresher in SQL, or you want to increase your knowledge of it, it's definitely a blog post to check out.

[00:08:32] The next piece of content is "Learnings from benchmarking PostgreSQL on TPC-H". [00:08:38] This is from the "It's all in database" blog by Rafia Sabih on blogspot.com. Here they are benchmarking PostgreSQL using the TPC-H benchmark, which is a more analytical or OLAP-oriented benchmark, and doing a lot of testing with parallel queries. He covers the lessons learned: parallel scans needing to use indexes; tuning related parameters such as random_page_cost, generally dropping it down to 1 if you're using SSDs; and properly setting effective_cache_size to get better performance. He talks about parallel joins and the importance of work_mem. Now, I was shocked to see the suggestion of setting work_mem to 20% of RAM, which is huge. Clearly this is for an analytical database where you have fewer, larger queries running, because I've never seen that for an online transaction processing database. He also talks about parallel merge joins and shared parallel hash joins, as well as some other conclusions. So if you're interested in optimizing parallel processing for an analytical database, this may be a blog post you'll want to check out.

[00:09:57] The next post is "What to look for if your PostgreSQL replication is lagging". This talks about streaming replication from a primary database to a replica, or from a replica to a replica. First they go over all the different types of replication, but this post is primarily about streaming replication, and essentially about monitoring it. The number one way to do that is the pg_stat_replication view, and they cover what all the different columns are, what you can use on the standby node, what you can use on the primary node, and then how to identify the lag and its source. So if you're interested in monitoring your replication lag, or you have a problem you'd like to explore, definitely check out this blog post from severalnines.com.
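For reference, here is a minimal check along the lines the post describes, run on the primary against the pg_stat_replication view (the lag columns exist in PostgreSQL 10 and later):

SELECT client_addr,
       state,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes,
       write_lag,
       flush_lag,
       replay_lag
FROM pg_stat_replication;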
[00:10:44] The next post is "Creating a PostgreSQL procedural language, Part 4: Handling input parameters". This is part four of a series on creating a procedural language handler using Julia, and in this part they actually pass off and execute code in Julia. So if you want to continue on with this series, the next post is available from 2ndquadrant.com.

[00:11:08] The next post is "Developing PostgreSQL for Windows, Part 2". This is the second part of a series about developing for PostgreSQL on, say, a Linux environment while making sure your changes also work on Windows, and this post continues that walkthrough.

The next piece of content: the PostgreSQL person of the week is Joe Conway. It's an interview with Joe Conway covering his contributions and work in PostgreSQL. If you're interested, you can check out the postgresql.life website for the Joe Conway post.

[00:11:47] The next series of posts are primarily about tools and updates that are available. The first is Postgres.ai, which has a product called Database Lab that, as it says here, clones multi-terabyte databases in seconds. Essentially, it keeps the database and works with diffs as you use it, to give you your own separate slices of the database for running different experiments. Along with that, there is a SQL performance advisor called Joe, which lets you run those experiments through a Slack chat interface to check the performance of your queries using Database Lab. So it's an interesting set of tools for running EXPLAIN ANALYZE against production data and finding the best queries. If you've never heard of them before, maybe you'll want to check out the tools they have available.

[00:12:46] The next post is "Crunchy Spatial: PostGIS for the Web". This is a set of tools Crunchy Data is working on, which they're calling Crunchy Spatial. Basically, they're building web interfaces that leverage PostGIS and the data within PostgreSQL: a series of microservices acting as web interfaces into PostGIS and the PostgreSQL database. The services they want to offer are a vector tile server, a vector feature server, a data import service, a routing service, a geocoding service and a base map service. To start off, they have two tools available from this feature set: pg_tileserv and pg_featureserv. Related to this, they have another post, "Crunchy Spatial: Tile Serving", which goes more in depth on how pg_tileserv works. It's a microservice that communicates with the data layer and basically provides a web interface into the spatial data. They include a relatively short set of steps to spin up your own example. So if you're interested in using PostGIS for these purposes, definitely check out this set of blog posts.

[00:14:09] The next piece of content is "pgBackRest auto-select backup". This is a post about pgBackRest and a new release with the ability to automatically select the proper backup set when doing a restore, based on a target time. So if that's of interest to you, maybe you'll want to check out this blog post.

And the last post is "Citus 9.2 speeds up large-scale HTAP workloads on Postgres". HTAP stands for hybrid transactional/analytical processing, and the post covers all of the new features that have been introduced in Citus 9.2, as well as a number of performance improvements. So if you have any interest in Citus, it's definitely a blog post to check out.

[00:14:57] That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.

Other Episodes

Episode 297

January 07, 2024 00:20:38

Postgres In 2024 | Scaling Postgres 297

In this episode of Scaling Postgres, we discuss hopes for Postgres in 2024, whether you need foreign keys, incremental backups, and five ways of...


Episode 257

March 19, 2023 00:19:16

Adopting PgCat, Time Bins, work_mem Settings, Bad Constraints | Scaling Postgres 257

In this episode of Scaling Postgres, we discuss Instacart adopting PgCat, binning or bucketing your data by time, the best settings for work_mem and...


Episode 49

February 04, 2019 00:14:20

Over One Billion, No Downtime, SQL Life, Throughput | Scaling Postgres 49

In this episode of Scaling Postgres, we review articles covering over one billion records per month, schema changes with no downtime, the life of...
