Episode Transcript
[00:00:00] In this episode of Scaling Postgres, we talk about a PG minor botnet attack, correlation index, corruption, postgresql.com files and custom data types. My name is Kristen Jameson, and this is Scaling postgres episode 145.
[00:00:24] Our. All right, I hope you, your friends, family and coworkers continue to do well. Just to let you know, there will be no scaling Postgres episode next week due to the holidays, but we will get started again in the new year. Our first piece of content is PgMiner botnet Attacks weekly secured PostgreSQL databases. This is from ZD net and it's talking about a botnetnet attack being set up to attack postgres servers. And it says it only impacts postgres servers running on Linux thus far. And it says it follows a typical pattern in terms of searching out a particular network range and looking at the default port. So of course, number one, you don't want to have your PostgreSQL database exposed on the Internet. Number two, you don't really want to use the default port, you should choose a separate port for it and then it tries to use a brute force attack to find out what the postgres password is. So a third thing is don't allow logins via postgres from an IP address. Set up a separate account to do that purpose and just leave the postgres for say, local only connections. And then it uses copy from program to try and escalate its privileges. So something definitely to be aware of. And be sure to secure your postgres instance as I had mentioned here, so you'll be protected. And given that it's the holiday season, there might be more attacks going on because probably people are paying less attention to security during this time period. So definitely check out this topic or others about PG minor to learn more about it.
[00:01:58] The next piece of content is don't let correlation versions corrupt your PostgreSQL indexes. This is from Citusdata.com and they're talking about an issue where you have correlation set up for postgres. And usually the default correlation is related to the Libsy library that you have that defines how text is sorted within the database. And they give an example of using different collations here where they have the same words rs, banana, cat. When you collate them using in English, I guess, a New Zealand dialect, it sorts it essentially ABC. But AA actually has a different meaning in a Danish dictionary that means this character here. And when you do the order by Correlating, by Danish or DadK, it actually sorts the A word at the bottom. So correlations determine how data is sorted and this is particularly important for indexes. And they show an example here where you can list your databases and see what your correlations are for them. And that the default correlations typically set up by NTDB. Now under this heading of this article where they say, so what's the catch? It says from the Unicode Technical Standard, number ten over time correlation order will vary. There may be fixes needed as more information becomes available about languages. There may be new government or industry standards for the language that requires changes. So you can think of this, it's like time zones. Time zones constantly change based upon what's stated. Well different orders based upon culture and standards apparently change too. And they say and finally, new characters added to the Unicode standard will interleave with the previously defined ones. This means that correlations must be carefully versioned. Now, how this impacts Postgres is with their paragraph here and it says, I quote the most obvious problem this causes for databases like PostgreSQL is that Btree indexes can become corrupted if they are ordered by a collation that changes. Typical symptoms reported in the wild include orders that cannot be seen by one query, but can be seen by another query that is using a different plan. Or sometimes a unique index fails to prevent a duplicate value from being added to a table. Now I've seen exactly these problems happen after a Postgres upgrade. Now it actually may have been one of the Lib C versions that changed because the upgrade not only did it upgrade the database, but it was placed on a newer version of Ubuntu and perhaps the Libc versions changed and had this impact. Now, the resolution to fix these issues are just to do a re index. Once you do the re index, it'll resolve these problems. And the second thing to keep in mind, this only applies to text indexes. So if you have an index on integers or date time fields that's not going to be impacted by collations, it's only your text fields that will be impacted. And that's exactly what happened in the instance that I mentioned. Now, in terms of this happening, I've been mentioning Libc because this is an operating system provided facility for collations. There's also ICU which is the international components for Unicode. So this is you can think of a cross operating system way to define collations. So you could choose to use this and hopefully avoid some of these problems. But you're still going to have to have versioning in effect if you're going to go to the next release of ICU. But it does have a number of advantages that they list out here. But most people stick with the default correlations that are typically reliant on the Libc library of the operating system that Postgres runs on. Now they do say that Postgres releases ten to 13. If you're using these ICU correlations, there's a way to capture what they are and present a warning if things changed. But not really for the Lib C, which is the predominant correlation being used by Postgres databases. But what they've worked on here for release in version PostgreSQL 14 is a way to essentially record what correlations have been used for building these indexes. And then if the correlation changes for say, the operating system it can report back and say, hey, you probably need to re index these text indexes and even specify which ones. So this shows how it handles different versioning based upon the operating system in use. Now the warning he says, will look something similar to this. You'll get warning the index depends on a correlation which is default was version 34 but the current version is 36. So this gives you a warning that you've changed correlation versions and it gives the detail the index may be corrupted due to changes in sort order and the hint is reindex to avoid the risk of corruption. So basically when you change these versions, it looks like you do need to do a best practice of reindexing your text indexes. So this is a patch that should make its way into version PostgreSQL 14. Now, they also mentioned there are some other instances where text is in use that could be impacted or you could see some of this corruption. One, he says if range partitions are using text, if a collation changes, you might get things going into different partitions. This is particularly scary for me, so you might not want to use something like a range partition using a text field for that reason. The other one is that check constraints that perform string comparisons whereas once they may pass, now they may fail. So that's another thing to keep in mind. Now they don't have anything to address either of these two issues, they're just making you aware and then they go over and discuss some other things about indexes. But this was a great post explaining kind of the issue and the hope for at least a warning to be produced in the log if this has happened for your particular database. So I highly encourage you to check out this post from Citusdata.com.
[00:07:47] The next piece of content is repository, tuning guide and API for your postgresql.com. So this is referring to a site called PostgreSQL NF or PostgreSQL. Comf. Now of course, this is the main configuration file for postgres and up to this point this was just a source of documentation that highlights specific areas of the postgresql.com file and give you recommendations and ranges for each parameter that you can change. And they list some of the details with regard to it here. But what they've added today and they're saying in beta, is a way for you to manage your postgresql.com files via a service. So it provides a repository to store your postgresql.com files along with a tuning guide and an API for working with it. So this is an interesting tool to help with management of postgres. I myself use a configuration management tool called Ansible, so that's what I'm going to be sticking with. But if you don't have a similar tool, you may want to investigate this post and see what they're working on. So if you're interested you can check out this post from Angres.com.
[00:08:58] The next piece of content is creating custom postgres data types in Django. This is from Pgeanalyze.com and whereas they are talking about Django, which is a web framework for Python, the first part of the post is just addressing custom data types. The first one they talk about is Custom domains. So essentially this is taking an existing data type and then applying a check constraint to it so that you can easily use it in your table creation. So for example, they created a domain here called String No Spaces. So essentially it's a custom type string no Spaces. It's created as a VAR car, it's not null with a check constraint that no spaces should be included. Now, with all of that built into it, you can then specify when you create a table, I want a column called Username and the type is String No Spaces. And then they show you when you try to insert a string with spaces in it, it'll give you a constraint violation error. Now, what they also said is that you can build upon it, so you can essentially nest domains within domains. So here he created another domain called Email with Check and it uses as its base the string with no Spaces domain. So essentially you're going to get a not null string with no Spaces. And then he adds a check constraint of it must include an ampersand. Then he gives an example of the different errors you can get. If you try to insert one with a space. You'll get a check constraint violation for this check. And then if you enter one without an at sign, you'll get an Email with check. The next type of custom data types they cover is composite types. This is joining one or more types together to represent essentially one type. And they use the example of RGB values. So these can be represented as three integers, but you can create a separate type and they've called it RGB color value. Here that is a composite of three integers and they show example for its use case and then how you can even pull out an individual value from within that composite type. So these types make working with your database easier and then they cover how to use these if you're using the Django web framework. So the rest of the post covers how to use these types within Django. So if you're interested in that, you can check out this post from Pganalyze.com.
[00:11:09] The next piece of content is free. 1GB postgres Database on AWS Cloud Shell now this is quite interesting because Cloud Shell is just a web interface to give you a shell prompt to be able to interact with AWS services.
[00:11:25] But apparently it is its own environment and it has about 1GB of storage. So this post covers how you could actually install a postgres database within this environment. Now, it has a lot of interesting issues. Like number one, the instance is not permanently running only when you enter the cloud shell from your web browser. You can't connect at it from the outside and there's no replicas, no backups. But I guess I'd call it an interesting hack of running Postgres on this minimal environment. So if you're interested in that, you could check out this post from Ongrids.com.
[00:11:59] The next piece of content is PostGIS raster and Crunchybridge. So this is from Crunchydata.com, and they're talking about using the Raster data type in PostGIS. And basically it records Raster type data, which are images, elevation, grids, model outputs, and more. For working with geographical information systems. They're talking about how typically when working with this type of data, you either store it in the database or you store it outside of the database. So have a reference to some data somewhere, like usually the file system, so that PostGIS can access and work with that data. Now, I said initially in DB storage was used, but as information continues to grow and grow and grow, it's moved more towards storing the files outside of the DB. But what's interesting here is that this Raster tool has a way to access tiled GeoTIFFs of data. So you can actually store tiled images within an object storage system like S Three or Google CS or Azure Storage, and then pull individual tiles for analysis. So you don't have to store them all in the database, you're just storing a reference to it in the database. And they give you a demo of how you can do this using Crunchy Bridge. However, you could do the same technique using any version of PostgreSQL with Postgres. So if you're interested in checking it out, you can check out this post from Crunchydata.com.
[00:13:24] And the last piece of content is PostgreSQL exclusive cron jobs using PG timetable scheduler. This is from CyberTech postgresql.com. So if you're interested in using the scheduler called PG Timetable, you can check out this post that explains some new features that have been added to it.
[00:13:43] That does it. For this episode of Scaling Postgres, you can get links to all the content mentioned in the show Notes. Be sure to head over to Scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or itunes. Thanks.