Collation Issues, Fast Loading, Recovery, SCRAM Authentication

[00:00:00] In this episode of Scaling Postgres we talk about collation issues, fast loading recovery and scram authentication. I'm Kristen Jameson, and this is scaling postgres episode 72. [00:00:19] You all right? I hope you're having a great week. The first piece of content is ICU correlations against Glibc 2.28 data corruption. And this is from Cybertechn Postgresql.com and they're basically talking about changes to Glibc with version 2.28 could cause corruption of data, mostly for how things are ordered, and it could cause issues with indexes, for example, maybe unique indexes basically now storing more than one. In other words, it's no longer unique based upon correlation changes or you could get ordering differences once your OS upgrades to 2.28. So for example, I checked Ubuntu and the most recent LTS version was at 2.27, so not quite at 2.28. So basically they reiterate here. Why is it a problem? Basically the result of order by clauses and select statements in the order of keys and beatry indexes. So that could lead to incorrect query results as they say here, or to duplicate entries and unique indexes. Now you can do a reindex command. Thankfully version twelve coming up has reindex concurrently, but we don't have that in our versions today. But what this post talks about is potentially moving to ICU correlations as opposed to relying on the Glib C ones. Now, more of this was discussed in a previous episode of Scaling Postgres with this post here beware of your next Glibc upgrade talking about this issue and he demonstrates some of it with two different versions of Glibc. And if you're interested, I also encourage you to check out the comments because here they're talking about where they had the issues with the duplicate key violations on unique constraints with some of their indexes and kind of what they had to do to work through this. So basically this is a problem to be aware of. And this post talks about hey, maybe you potentially want to change to ICU correlations because it would be a little bit more independent than the Glib C changes. And they list several advantages here of moving to ICU such as they work the same on all operating systems, you can actually create your own correlations, you can use case and Accent and sensitive correlations, they're versioned. But the unfortunate fact of the matter is he says here quote you cannot use ICU collations as default correlation for a database or a cluster, so you have to explicitly specify them for all table columns. [00:02:50] This restriction may be lift at some point in the future. So basically this would be a pain to do, but if you wanted to go ahead and move it to ICU coalition he goes over and describes kind of how they're laid out, how to create them, how you can update the versions using ultracolation refresh version how to actually switch to ICU correlations. Again, you have to do it per table and of course you have to re index your indexes. But of course the issue that he also talks about is how to do this without downtime and it's basically using logical replication and the post finishes through describing how you could potentially do this. So definitely this is something that's coming down the pike for different OS versions on where you have PostgreSQL installed that you need to be aware of. And it looks like it's mostly just an ordering change, but it could cause greater problems if you have unique indexes on text columns that are impacted by these correlation changes coming in Glib 2.28. So it's just something to be aware of. And two posts for you to check out. [00:03:54] The next post is fastest way to load data into PostgreSQL using Python. And the subtitle here is from two minutes to less than half a second. And this is from Hakibanita.com. Now basically they have retrieved data and using Python went through a number of different steps on what's the fastest way to load hundreds of thousands of rows of data. Now this is a very long post, pretty python centric, not so much PostgreSQL, although the ideas are still there. Basically single inserts are usually slower, multi line inserts are faster, although interestingly, that wasn't that significant in his tests. He has here at the very bottom, he covered his result summary of the different iterations that he did and the execute many actually didn't improve it by that much. But then of course the fastest one is using Cop, the Copy command, and he had a way of parsing the data out and using the Copy command. And that's what got it down to this. Less than from 128 seconds doing one insert at a time to half a second. So definitely if you're a Python developer, this has some insight to be gained on how you can load data pretty quickly. But the general rule is single inserts one at a time are the slowest ones, doing a multiline insert gives you better performance and then the Copy command is probably the fastest one. And this is similar to what I found in a video tutorial I did talking about Ruby. So it's fast PostgreSQL data loading using Ruby, and again it showed the same results. Single inserts the time slowest, multi inserts, multiple rows per inserts faster, and then the Copy command is the fastest. So if you're a python or Ruby developer, you could potentially check out these two pieces of content. [00:05:51] The next post is big news and databases. Summer 2019. And this is from Windnd At, and this is his newsletter and kind of the subtitle for the page title here is the SQL Standard Extensions and Cloud War. So basically he talks about how the SQL standard continues to grow. And this is something this individual regularly monitors the changes to SQL and tracks them between the Oracle, Microsoft SQL Server, PostgreSQL DB Two, all the relational databases on how it's keeping up with the standards. And they're talking about introducing things like multidimensional arrays, things like actually a graph query language, GQL, and also some streaming data related features. And then with reference to the Cloud Wars he mentioned, basically there's this quote new add on features, options and management packs are only available in Oracle Cloud and on engineered systems. This means only on Oracle hardware, which is interesting, so you could no longer get it and install it on your own hardware. And he has the subtitle here, Cloud War Strange Allies working against your own hardware. So again, this is another great thing about PostgreSQL is that it's open source and you can take it and run it wherever you want, or run it on a cloud environment and pretty much get the same thing. And then he also goes into different discussions about some technology and science, some new releases for the different database systems that he tracks, as well as new content he's created, as well as some other areas. So definitely if you're interested in more news related items for this summer of 2019, definitely a blog post to check out. [00:07:31] The next post is postgresqlandrecovery.com and this is from Luca Ferrari at flukin 1978 GitHub IO. And he's referring to the fact that with version twelve of Postgres the recovery.com file is removed. And he says here that the server will literally not start with it. So they've moved all of the things that you configured in recovery.com into the main Postgresql.com file, or if you included files with it, there's also some new potential files that trigger files that get created, a standby signal and a recovery signal. Basically a standby is a hot standby in recovery. It recovers to the point it is ready to take on primary duties. And he makes a point that some of the reasons why this was done is he says, quote, this makes them moving them into Postgresql.com makes them a good candidate to be changed by an alter system statement. So basically when you upgrade to version twelve, you need to make sure that your processes take this into account or your backup recovery software that you're using takes this into account to handle the absence of this file as well as putting the configuration in the main postgresql.com config file. [00:08:47] The next post is how to upgrade your PostgreSQL passwords to Scram. And so with version PostgreSQL ten they've added Scram authentication as an option, so it's much more secure and standardized versions of doing password management for your users. They go over a brief overview of Scram and what it is. It's basically an acronym that's salted challenge response authentication mechanism. And they go through the steps of how to upgrade from determine if you can upgrade to Scram because there's some things that can't use Scrams. So for example, to my knowledge, PG Bouncer by default does not support Scram. I think there may be some GitHub patches that may allow it, but by default. I don't think it does at this time. The next step is validate your PG HPA comp settings, and they suggest having MD Five for your users that are connecting. Change a PostgreSQL's password authentication method. So basically set it to Scramshaw 256. Determine who needs to upgrade. So they give you a script that you can look through the users that need to upgrade to use Scram. Upgrade each user's password. So each user's password needs to be reset, and they give you two methods to discuss how to do this. One thing they mentioned here you want to avoid is don't use ultra role username password, new password, because the plain text password could end up being logged. So it kind of defeats some of the greater security improvements with Scram. And lastly, update your Pghba conf to use only Scramshaw 256 now because you change this in the Pghba comp file. For example, if you do have utilities that don't support Scram yet, you could have say, general users that are connecting interactively to use Scram, but maybe PG Bouncer you could leave on MD Five, for example. But if you're interested in moving to Scram, here's a process you can use. The next post is actually a YouTube channel, and this is the Itpug Italian PostgreSQL Users Group, and they recently had a PG day in Italy 2019, so they posted a number of videos. The first set is for the individual speakers, basically who they are and what they're talking about in their presentations, and they're relatively short one to three minutes. Then they have the greater presentations here that occurred at Pgday It. Now, I did notice some of them are in Italian or a language that was not English, but if you're interested in video content, there may be some presentations here you'd like to check out. [00:11:27] The next post is implementing autonomous transactions in Postgres, and this is from CyberTech Postgresql.com. And he says, what is an autonomous transaction? It's basically you want to send off or fire and forget some subtransactions that are not connected to the main transactions. So you're in a transaction, you're doing some sort of work, and you want to send some sort of notification. And he says here mostly logging, auditing, progress tracking into some table. So basically this information would persist even if the transactions rolled back. Now, he said there's a hacky way to do it where someone literally uses the Copy command to essentially write something to a file. But that has some downsides regarding the cryptic structure. The Copy program has a super user requirement, and you need physical access to the database server to read the log. Another alternative he thought up was using a DB link, so it's creating a separate link to the database and inserting into a particular table. And then another alternative is using PL Python and some Python driver again to make a connection and do an insert into the database. Now, interestingly, I wonder if you had a requirement for this, if there is another alternative potentially using an Asynchronous notification mechanism, listen and notify essentially I wonder if these types of notifications would work and still be essentially fire and forget whereas the transaction gets rolled back. So if you have a requirement for something like this, definitely a blog post to check out. [00:13:06] The next post is the database deficit. How? Full stack developers, blind spot harms agile teams. Now, this is not explicitly related to PostgreSQL, although he does in general talk about relational databases and it is an opinion piece. But basically he's talking about with full stack developers and particularly with the rise of Node JS. And being able to use one language on the client and the back end server side, essentially keeping track of all the different changes that are happening. Basically the database side gets the short end of the stick usually in terms of the skill set in a quote, full stack developer. So it's interesting piece talking about this database deficit and kind of what some examples look like not using a migration strategy for schema changes. And this is something I've quite thought often about as developers that basically don't spend a lot of time focusing on the database side, they focus on everything else. But it's an interesting opinion piece and I definitely encourage you to check it out if this type of content is interesting to you. [00:14:11] The next post is PostgreSQL wall retention and cleanup. PG. Archive cleanup. So this is a utility that you can use to be able to clean up wall archives. And he basically describes this built in command that's provided with PostgreSQL. You can use it as a dry run to see what potential files would be removed and then do it in a delete mode. But again, what's critical is do not run this in your main PostgreSQL data directory like the PG wall directory or Pgxlog directory. Do not do that, only do it to archive destinations because otherwise you'll corrupt your database. But of course in using this utility, again, the consideration is you need to make sure not to delete wall that you're needed. And they go through four different criteria here that are very important to follow in terms of being able to restore your database to a point in time. So if you're going to be using this utility for cleaning up your wall archives, definitely check out this blog post and follow their criteria to ensure that you're only deleting the wall files that you intend to. [00:15:20] The next post is generate primary keys almost automatically and this is from Luca Ferrari at fluca 1978 GitHub IO he's talking about. He was recently looking at a quite large database with a lot of tables and essentially almost all tables did not have a primary key. So basically he went through and developed a couple of different ways that you could query the database. Looking at the system tables and generate queries where you can automatically generate a primary key. And at first he just used and he used a couple of different methods. So he used this CTE method to be able to generate this, and then you could run this file against the database. Then he used a function that offers the ability to add different parameters to be able to control how the SQL generation works. And then he actually did everything within a procedure to execute the Alter table commands. So three different ways to do it. So if you ever need to run a lot of different commands using information that's located in the system tables, definitely a blog post to check out. [00:16:29] Last Post is serving dynamic vector tiles from Postgres. Now. Again, I don't know that much at all about PostGIS, but if you're interested in content where you need to serve these dynamic vector tiles, here's a new blog post that covers these techniques that does it. For this episode of Scaling Postgres, you can get links to all the content mentioned in the show notes. Be sure to head over to Scalingposgres.com, where you can sign up to receive weekly notifications of each episode, or you could subscribe via YouTube or itunes. Thanks.

Show Notes

Episode Transcript

Other Episodes

Episode 100

Full Text Search, Query Optimization, Exception Blocks, Procedural Language | Scaling Postgres 100

Episode 265

pg_stat_statements, Transaction ID Wraparound, Consultant Knowledge, CitusCon | Scaling Postgres 265

Episode 348

The Future Is Containers? | Scaling Postgres 348

Collation Issues, Fast Loading, Recovery, SCRAM Authentication | Scaling Postgres 72