Episode Transcript
[00:00:00] In this episode of Scaling Postgres, we talk about developers loving Postgres, a 50 terabyte database upgrade, vacuum FYIs, and protecting your database. From Scarlett Johansson. I'm creston. Jameson. And this is scaling postgres episode four.
[00:00:25] You all right? Our first article comes from Insights Stackoverflow.com for their 2018 survey. And with this, the most loved database PostgreSQL comes in a close second to Redis. So PostgreSQL one of the most definitely the most loved relational database. So not too bad. Interestingly looking at the dreaded is IBM's DB Two Oracle, so some non open source alternatives, but definitely interesting that PostgreSQL is getting this much love from developers. So if looking at surveys like this are of interest to you, go ahead and check out the 2018 survey on Stack Overflow.
[00:01:15] Related to that, the next article is migrating our production deployment from MariaDB Galera to PostgreSQL and Petrone. And this is from the behind pretext EU blog. So this is a very good article in terms of the length and the in depth they go into about their reasons for switching from MariaDB to PostgreSQL, and where they don't specifically address really big scaly concerns. They go into some of the advantages of their existing deployment on MariaDB and then some of the disadvantages as well as they go into kind of the rationale why they made the switch and some of the advantages they found from a scaling nature. What's very interesting in that one quote here, while it's been basically impossible to search all orders in the system for a name in less than 30 seconds, it now takes 800 milliseconds. We call that a success. So definitely it has helped performance by doing the switch to PostgreSQL in their estimation and they also go into in depth in the migration and the steps that they went through to do it. So, definitely a really great blog post to read to get a behind the scenes look of how someone did this. It's a really great post. I encourage you to check it out. The next post is called Updating a 50 terabyte PostgreSQL database and this is on Medium, a part of Adentech, and they are a payment processor and they're talking about how they have about 5000 PostgreSQL transactions per second coming in across multiple clusters. And apparently they're doing this in their own server that they run. So it's not hosted on a database platform, but they have their own server that's handling this traffic in this 150 terabyte database and it goes into their approaching when they did an upgrade when they were at the ten terabyte stage and discuss some of the downsides of the approach they use for that. And then they go into the new approach they use for upgrading to PostgreSQL 9.6.
[00:03:40] So it's definitely once you get up to those higher database sizes, this would be an interesting blog post to take a look at when you're considering to doing your upgrades. So I definitely encourage you to check it out.
[00:03:53] The next blog post is from the CyberTech Postgresql.com blog and it's three reasons why vacuum won't remove dead rows from a table. So it goes over to basically an introduction about why vacuum is useful and then a problem of how some rows can be left behind. And they highlight three primary reasons. One is long running transactions. I've mentioned this in previous episodes, but if you're primarily using PostgreSQL as a transactional database, you want to keep your transaction as short as possible. Because if you do have long running transactions, that can cause problems with concurrency or locking, as well as causing problems with vacuuming as well. So definitely you want to try to keep your transactions as short as possible for an OLTP database. The other reason they mentioned is abandoned replication slots. So if you're using replication and you're using PostgreSQL's built in replication slots, you want to be sure to get rid of those after a replica is no longer in use.
[00:05:04] Because what happens? Those replication slots cause the master server to retain that right ahead log. And if you have a replica that goes down for an extended period of time or loses connection, you're basically going to have unbounded write ahead log growth. So definitely something to keep in mind if you're using replication slots to clean them up when they're no longer in use.
[00:05:29] And lastly, he talks about orphan prepared transactions. So these are when you're doing a two phase commit for a particular transaction. Now, what's interesting is that when you look in the PostgreSQL documentation, they indicate prepared transaction is not intended for use in applications or interactive sessions. Its purpose is to allow an external transaction manager to perform atomic global transactions across multiple databases or transactional resources. So basically, this should be a rarely used feature of PostgreSQL, particularly if you're just developing applications. So hopefully you won't have any of these in your application.
[00:06:16] The next blog post is what PostgreSQL full text search has to do with vacuum. Again, this is also from the CyberTech Postgresql.com blog. And what's interesting about this post is talking about the importance of vacuum with Gin indexes. So Gin indexes are a special type of index that tend to be used when you're wanting to do a full text search of a particular field. And what's interesting as they explain how gen indexes work is that when a row is added, it's actually added to a gen pending list, which is basically a to do list that's processed by vacuum. So basically, if Vacuum does not get an opportunity to go over this index, it starts behaving less efficiently and he goes into an actual practical example of showing you the ramifications of not having that particular index vacuumed and the performance ramifications. So if you're using full text search in PostgreSQL with gen indexes, I would definitely take a look at this post to make sure that your vacuum is operating optimally the next article is Patching plain PostgreSQL for parallel PostGIS plans. And this is in the Carto.com blog.
[00:07:44] And in it they explain how they had a need for greater parallel processing for their PostGIS. Two, four queries.
[00:07:54] And they talk about some patches that they discovered that they could implement that actually sped up their queries to give them these benefits. So if you're using the PostGIS features of PostgreSQL, I definitely suggest taking a look at this blog post to see if these patches might be something that you want to implement.
[00:08:13] The next article is from Greenplum.org. Now, Green Plum is actually based upon PostgreSQL, and it is a massively parallel processing analytical database. And they've been doing recent work to catch up to the current version of PostgreSQL. And in this post they've announced that they've just completed merging PostgreSQL version nine into the Green Plum DB master. So Green Plum is an open source database for analytical processing. So if this is something of interest, you may want to keep track of the work that they're doing here. Okay, so this next blog post is a security related one and it's from the Impervia.com blog and it's a deep dive into database attacks. Why? Scarlett Johansson's picture got my Postgres database to start mining monero, and that is a cryptocurrency.
[00:09:14] So basically this goes through an attack on a PostgreSQL server and how it was able to take control essentially of the database, determine what GPU you have, if any. And part of what they used for the malware payload was an image of Scarlett Johansson. So from a security perspective, this is a very interesting blog post, but one thing that really caught my eye is near the end where they say under the section, how can an attacker find PostgreSQL databases? And they were basically able to find about 710,000 PostgreSQL databases open on the Internet that are potentially vulnerable. So definitely for security reasons, do not open up PostgreSQL to the Internet. Make sure it's behind a firewall or it's restricted to an internal network. So this type of attack just is impossible and they go into some items to limit this attack. So if you're responsible for administration of your PostgreSQL installation, definitely a good thing to look over this post.
[00:10:27] The next article is database normalization in primary keys. And again, this just follows more on developer education about best practices for data modeling. He goes over the various normal forms and has a discussion on primary keys and presents his perspective on a proper data model to adhere to first normal form. Basically, if you have a surrogate primary key, be sure to use a unique constraining index to assure that you definitely have unique rows from a business perspective. So if you're a developer, I think this would definitely be an article of interest for you.
[00:11:10] Oh, and this was from the Tapoeh.org blog. The next article is from the Citusdata.com blog and it's called Fun with SQL generate series in Postgres. Now this is a very short post, but it basically talks about a generate series that lets you generate a series of data and they even go into when you're doing reporting how if you have an absence of data for a particular time period, you can use generate series to help fill in data gaps. So, just a quick little blog post to take a look at.
[00:11:48] And the last article is actually an in depth tutorial I put together on PostgreSQL streaming replication. So if you're looking to start doing that, I welcome you to go ahead and check it out.
[00:12:02] That does it for episode four of Scaling Post, you can get the links to all the content presented in the show. Notes be sure to head over to Scalingpostgres.com to receive weekly notifications of these episodes. You can also subscribe via YouTube or itunes. Thanks.