Episode Transcript
[00:00:00] In this episode of Scaling Postgres, we talk about UUIDs, write spikes, DISTINCT and the SQL standard. I'm Creston Jameson, and this is Scaling Postgres episode 23.
[00:00:21] Alright, the first piece of content this week is "UUIDs and Compressibility", and this is from the richardstartin.uk blog. In it he talks about using UUIDs with your database. He does mention Postgres a bit, but he also talks about other NoSQL databases, the different ways they use UUIDs, and potentially how best to use them. So UUIDs are a good fit when you essentially have more than one system creating data. If you can rely on your primary database to generate IDs, typically you can just use a serial data type, which uses an incrementing integer. However, if you have IDs being generated by multiple systems, potentially geographically distributed, then UUIDs are a good fit for that situation. Unfortunately, UUIDs can be quite large, so he looked at different representations of UUIDs, what their sizes are, and how well they compress, using the typical binary format as his baseline with a ratio of one. His conclusion is that you should definitely try to use a binary format when you can, and he mentions that Postgres has the ability to do that. Other formats, once you start going into text representations, can be two to five times larger, and that's going to have an impact on your system's performance. He also looked at how much compression could help. With the binary format there's essentially no difference in the level of compression: it stays at roughly 15 megabytes for a million UUIDs. The text formats do shrink under compression, but nowhere near down to the binary size. So this is an interesting blog post to check out, and if you're using UUIDs in your system, it's a pretty short post to review. Overall, the main point is that if you're using UUIDs, try to use a binary format as opposed to a text format in your system.
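Just to make that recommendation concrete, here's a rough sketch of what binary UUID storage looks like in Postgres; the table and column names are mine, not from the post:

    -- Native uuid type: each value is stored as 16 bytes (hypothetical table)
    CREATE TABLE events (
        id   uuid PRIMARY KEY DEFAULT gen_random_uuid(),  -- needs the pgcrypto extension before Postgres 13
        body text
    );

    -- Storing the same identifier as text takes 36 characters plus overhead
    -- and compresses far less effectively:
    --   id text PRIMARY KEY DEFAULT gen_random_uuid()::text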
[00:02:30] The next post is "Troubleshooting Daily Write Spikes on Sir Matty's PostgreSQL RDS", from the Sir Matitech blog on Medium, and this is a story-based scenario. They had an issue where, early in the morning, they were seeing write spikes roughly seven times higher than normal, and they were trying to determine the cause. So they looked at traffic: was there any significant traffic coming in? The answer was no. They looked for any background workers that were potentially doing something and couldn't find anything. They looked in the query logs and saw nothing that jumped out at them. They looked at the database snapshot schedule, which I'm assuming is the backup, and again, nothing from that. One thing they tried was changing when certain jobs got run, to see if the spike moved to a different time in the evening.
[00:03:34] And finally they found something related to their ETL process, and ultimately they narrowed it down to a query with an ORDER BY: with all that data, the sort was doing an external merge that was using the disk. So this is something to keep in mind when you're looking at your database system's statistics and you see high write activity: it is not necessarily due to inserts, updates, and deletes; it can be due to selects. I've seen this frequently, where an ORDER BY clause doing the sorting and ordering can generate a lot of disk activity. If it's a large sort, it's going to spill to disk rather than being done in memory, and you're going to get a lot of write activity as that query runs. They determined that they ultimately did not need that ORDER BY, and of course the write spike subsided; it's no longer nearly as high. So this is a good blog post if you want the story behind the different things they checked, but ultimately it came down to a query sorting a large amount of data, going to disk, and causing write activity for what was actually a select. So just something to keep in mind when you're looking at potential write spikes in your system.
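If you want to check whether a SELECT is spilling its sort to disk, EXPLAIN ANALYZE will tell you; this is a generic illustration with a made-up table, not the query from their post:

    EXPLAIN (ANALYZE, BUFFERS)
    SELECT * FROM big_table ORDER BY created_at;

    -- Look for a line like:
    --   Sort Method: external merge  Disk: 102400kB
    -- A sort that fits in memory shows "quicksort" instead.
    -- Raising work_mem (applies per sort operation) can keep the sort in memory:
    SET work_mem = '256MB';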
[00:05:01] The next post is "High performance distributed DML in Citus", and this is from the citusdata.com blog. Now, that's the title of this blog post, but what I found most interesting is that it's really a follow-on from a post about another data aggregation technique that we covered in a previous episode of Scaling Postgres: "Scalable incremental data aggregation on Postgres and Citus". And again, it mentions Citus, but you don't have to use Citus to use some of these techniques.
[00:05:37] So this is kind of a follow-on post to that. In this example they are using movies: essentially you have a movies table with one row per movie, along with a review score for that movie, and they want to keep that score updated as an average of however many reviews come in. Now, you don't want to just do an UPDATE for each review, because, as they say here, an UPDATE takes a row-level lock, and if you have multiple reviews being submitted or calculated at the same time, you're going to run into blocking and lock contention. The way you can get around that is by doing inserts instead. So they created a second table called movie_reviews; all the movie reviews go in there, and then on a periodic basis, maybe once an hour or once every 15 minutes, you update the movies table, setting the review score and the review count based on the rows in the movie_reviews table. Because this runs periodically and all new reviews go into the movie_reviews table, you're not going to get a bunch of locks. So this is a good practice to follow if you have this kind of use case. Now, one downside of this approach, which they mention, is that every time you aggregate, you re-aggregate the whole movie_reviews table for each movie. So even if you only get one new review, you're still re-scanning, say, 1,000 existing reviews, and depending on your use case, if you have a lot of rows, that can become a bottleneck. So, as a follow-on to the previous post I mentioned, you can do an incremental update instead, and this is the function they created, again using the rollups table described in that previous blog post, to incrementally roll up those reviews, which should be a lot more efficient. Again, you don't need Citus to use this particular technique, so I thought it was a great post describing how you could potentially do this. At the end they do include a benchmark comparing RDS and Aurora, a large instance type for each, against four smaller Citus database clusters, and they saw really fast performance. So these are solid techniques you can use whether or not you're using Citus, and if you have a use case that could benefit from them, it's definitely a blog post to check out.
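Here's a minimal sketch of the insert-only pattern with a periodic aggregation pass; the table and column names are mine, and this is the simple version, not their incremental rollup function:

    -- Reviews are only ever inserted, so writers never block each other
    CREATE TABLE movie_reviews (
        movie_id bigint      NOT NULL,
        score    int         NOT NULL,
        created  timestamptz NOT NULL DEFAULT now()
    );

    -- Periodically (e.g. every 15 minutes) fold the reviews into the movies table
    UPDATE movies m
    SET review_score = agg.avg_score,
        review_count = agg.review_count
    FROM (
        SELECT movie_id, avg(score) AS avg_score, count(*) AS review_count
        FROM movie_reviews
        GROUP BY movie_id
    ) agg
    WHERE m.id = agg.movie_id;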
[00:08:11] The next post is "The Many Faces of DISTINCT in PostgreSQL", from the Status Code blog on Medium, and the subtitle is "Three Powerful Uses of DISTINCT in Postgres". So what is DISTINCT? The quote is: "SELECT DISTINCT eliminates duplicate rows from a result." So you can SELECT DISTINCT department FROM employees: there are multiple departments represented in the employees table, and this just shows each one once. Now, you could do the same thing with a GROUP BY, grouping by department and showing the department. But then he talks about DISTINCT ON. This is a special, non-standard clause for finding the first row in a group. In his example, he finds the person with the highest salary in each department by doing DISTINCT ON (department) against the employees table, with an ORDER BY on department and then salary descending, so it pulls the top-salary row for each department. I can think of some really good use cases for this. The third option is IS DISTINCT FROM, and basically this is a way of safely comparing nullable fields. So in this scenario, A IS DISTINCT FROM B: one and one are the same, so it's false; one and two are different, so it's true; NULL and one are different, so it's true, whereas with a plain equals the result would be NULL; and NULL compared with NULL gives false. So again, an interesting clause that you could potentially use in your SQL queries.
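Roughly what those two clauses look like, with table and column names along the lines of his examples:

    -- Highest-paid employee per department (the first row of each group
    -- according to the ORDER BY):
    SELECT DISTINCT ON (department) department, name, salary
    FROM employees
    ORDER BY department, salary DESC;

    -- NULL-safe comparison: treats NULL as an ordinary, comparable value
    SELECT 1    IS DISTINCT FROM 1;     -- false
    SELECT 1    IS DISTINCT FROM 2;     -- true
    SELECT NULL IS DISTINCT FROM 1;     -- true
    SELECT NULL IS DISTINCT FROM NULL;  -- false (plain = would return NULL here)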
[00:09:49] And lastly he talks about array_agg, which aggregates values into an array. We've seen this before in previous episodes, but he likes to use it for getting a quick sample of the data. Here you can see which employees are part of each department with this type of query, aggregating the employee names by department. So this was a relatively short post, but it gave a lot of short tips on some great clauses you can use in your SQL queries.
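A quick sketch of that sampling trick, again with assumed column names:

    -- One row per department, with an array of its employees as a quick sample
    SELECT department, array_agg(name) AS employees
    FROM employees
    GROUP BY department;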
[00:10:22] The next post is actually a PDF of a presentation called "How PostgreSQL's SQL Dialect Stays Ahead of Its Competitors", by Markus Winand of modern-sql.com. In it he compares the SQL standard to the features available in other databases, with a focus on how PostgreSQL has handled them. I should say this presentation is quite long, at 97 slides, so I'll just go over some of the highlights of what he discusses to see if you're interested in it. He covers grouping sets; he covers TABLESAMPLE, for getting samples of data from a particular table.
[00:11:06] He covers XMLTABLE, a way to work with XML and relational data together; the new identity column feature, which is similar to but more powerful than serial columns; new features added for OVER and PARTITION BY; and lastly, JSON functionality. So this was a great presentation to check out, just to look at SQL feature support across the different database platforms and where PostgreSQL sits among them.
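As one concrete example from that list, an identity column (a Postgres 10 feature) looks like this; the table here is just an illustration, not from the slides:

    CREATE TABLE orders (
        id     bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        placed timestamptz NOT NULL DEFAULT now()
    );

    -- Unlike serial, an ALWAYS identity column rejects accidental manual inserts:
    --   INSERT INTO orders (id) VALUES (42);  -- fails unless you use OVERRIDING SYSTEM VALUE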
[00:11:35] The next post is "Batch Updates and Concurrency", from the tapoueh.org blog, and it continues his PostgreSQL concurrency series. Here he's talking about batch updating: you get a dump of data from some source, in this case the Museum of Modern Art collection, and then you need to apply that batch on a periodic basis. So basically the scenario is: if a row exists, you want to update it; if not, insert it.
[00:12:08] Now, he shows the classical way you would typically do that, but you run into concurrency problems if a row already exists and you try to insert it again. One solution he suggests is locking the target table, but if you're on PostgreSQL 9.5 or later you can use the ON CONFLICT clause, doing ON CONFLICT DO NOTHING on the insert. So this is another post that rounds out his concurrency blog series.
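A minimal upsert along those lines, with a made-up table rather than his MoMA schema:

    -- Requires a unique constraint on the conflict target (artwork_id)
    INSERT INTO artworks (artwork_id, title, artist)
    VALUES (1234, 'Some Title', 'Some Artist')
    ON CONFLICT (artwork_id) DO NOTHING;

    -- Or update the existing row instead of skipping it:
    INSERT INTO artworks (artwork_id, title, artist)
    VALUES (1234, 'Some Title', 'Some Artist')
    ON CONFLICT (artwork_id)
    DO UPDATE SET title = EXCLUDED.title, artist = EXCLUDED.artist;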
[00:12:38] The last post is "Troubleshooting PostgreSQL Streaming Replication Synchronization". This describes the scenario where you have a primary database synchronizing to one or more replicas, and what happens if the replicas fall behind: maybe in the log you see an error to the effect of "requested WAL segment has already been removed". So how do you catch the replica back up? Basically, you need to get those WAL files from wherever you're storing or archiving them and make them accessible to the replica so it can catch up. He discusses different settings you can change on the database to keep those WAL segments or WAL files around longer, to minimize the chance of this happening. If you don't have the needed WAL files anymore, then you need to take another backup and rebuild the replica from scratch. And lastly, he covers using replication slots to prevent this problem from happening at all: the WAL files won't be deleted until all the replicas have consumed them. However, the key thing to keep in mind, which he mentions here, is that if you do that, you need to make sure those replicas can always reach the primary, because if not, you're going to have unbounded growth of WAL files on the primary database to the point where you could run out of disk space. So basically, you're trading the risk of a replica falling behind for the risk of filling up the disk or storage system where the WAL files live. This blog post gives a great review of those scenarios.
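For reference, the knobs he's talking about look roughly like this; the slot name is a placeholder, and the settings reflect the Postgres versions current at the time (wal_keep_segments was later replaced by wal_keep_size, and recovery.conf settings moved into postgresql.conf in version 12):

    -- On the primary: keep more WAL around
    -- postgresql.conf:
    --   wal_keep_segments = 256

    -- Or create a physical replication slot so WAL is retained until
    -- the replica using it has consumed it:
    SELECT pg_create_physical_replication_slot('replica_1');

    -- On the replica (recovery.conf on Postgres 11 and earlier):
    --   primary_slot_name = 'replica_1'

    -- Watch slot status on the primary; an unreachable replica means unbounded WAL growth:
    SELECT slot_name, active, restart_lsn FROM pg_replication_slots;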
[00:14:16] That does it for this episode of Scaling Postgres. You can get links to all the content presented in the show notes. Be sure to head over to scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.