Broken Indexes, Trademark Issues, Percentile vs. Average, Logical Improvements | Scaling Postgres 183

Episode 183 | September 20, 2021 | 00:14:50

Scaling Postgres

Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, we discuss one cause of broken indexes, Postgres trademark issues, percentiles vs. averages and logical replication improvements.

To get the show notes as well as get notified of new episodes, visit: 

 https://www.scalingpostgres.com/episodes/183-broken-indexes-trademark-issues-percentile-vs-average-logical-improvements/


Episode Transcript

[00:00:00] Speaker A: In this episode of Scaling Postgres, we talk about broken indexes, trademark issues, percentile versus average, and logical improvements. I'm Creston Jamison, and this is Scaling Postgres episode 183. Alright, I hope you, your friends, family and coworkers continue to do well. Our first piece of content is "Broken indexes after Linux upgrade". This is from elephanttamer.net, and he's talking about how putting Postgres on a new operating system version can sometimes break your indexes. The reason is that Postgres by default uses the operating system's glibc library for collations, and if a new glibc version changes those collations, that can break your text-based indexes. You'll run into issues such as unique indexes no longer keeping values unique, duplicate entries, or other inconsistencies. He specifically says he encountered this issue when going to Debian 10, Ubuntu 18.10, or the 20.04 LTS version. So if you ever upgrade and the collation changes, you're going to need to reindex any text-based columns you have to make sure that your indexes remain consistent and accurate. He mentions a number of symptoms you may see, such as crashes, restarts, certain errors, or certain jobs failing. Basically, it's just something to keep in mind. This can even happen if you bring up a replica on a different operating system version than the primary; you could encounter similar issues there. Now, the way you fix the collation problem is to do a reindex. This blog recommends using amcheck to look for invalid indexes, but I would plan ahead: if you're changing operating system versions, go ahead and do a reindex of the text fields in your database, and plan that time in to do it, just to be sure. I have actually kept my systems on Ubuntu 18.04 because I'm waiting for Postgres 14, which can do reindexes of partitioned tables. I'm waiting for that enhancement before I upgrade my operating system version, because I didn't want to run into this issue. Now, another way to sidestep this problem is to use ICU collations, so you're not relying on the glibc version that gets upgraded with the operating system. ICU collations come from a separate library that can be upgraded essentially independently of the operating system. They talk a little bit about that in the hyperlinked post here by Citus, which we covered back when it came out, but it is a good review of this issue and gives you the option of using an ICU collation. So you can go ahead and check out this post if you want to learn more.
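To make the reindexing advice concrete, here is a minimal sketch, assuming Postgres 12 or later with the amcheck extension available; the index, database, and collation names are hypothetical.

```sql
-- amcheck verifies that a B-tree index still matches the current sort order.
CREATE EXTENSION IF NOT EXISTS amcheck;

-- Check a (hypothetical) text-based index; an error here means it is broken.
SELECT bt_index_check('users_email_idx');

-- Rebuild a suspect index without blocking writes (Postgres 12+).
REINDEX INDEX CONCURRENTLY users_email_idx;

-- Or, after an OS upgrade, rebuild every index in the current database.
REINDEX DATABASE mydb;

-- Alternative prevention: an ICU collation (requires Postgres built with ICU)
-- decouples text ordering from the system glibc version.
CREATE COLLATION IF NOT EXISTS en_icu (provider = icu, locale = 'en-US');
```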
The next piece of content is "Trademark actions against the PostgreSQL community". This is from postgresql.org, and they're talking about trademark applications that have been filed by organizations separate from the PostgreSQL Core Team and the PostgreSQL Community Association of Canada. This gets into legal and trademark issues around the Postgres and PostgreSQL names. Now, I'm definitely not an expert in this area, but if you use Postgres, I think it is appropriate and important for you to read this and be familiar with what's taking place. There are also two other posts. The next one is "Postgres Core Team launches unprecedented attack against the Postgres community", from the PostgreSQL Fund website, which shows another side of the issue with regard to trademarks, PostgreSQL, and the overall community. And there's a third post, "Respecting the majority, questioning the status quo as a minority", again from the PostgreSQL Fund website, that goes into depth on this issue as well. Again, I'm definitely not an expert in this area, but I would encourage you to check out these posts so you're familiar with what's currently transpiring around a database you're probably using if you're watching this content. The next piece of content is "How percentile approximation works and why it's more useful than averages". This is from blog.timescale.com. Now, this is quite a significant blog post that's very, very long. The first half of it covers the differences between medians, averages, and percentiles, and how percentiles can be quite advantageous for tracking, say, performance numbers, because a lot of the time a median or an average won't give you the best insight into how well your systems are performing; percentiles are much better for that. And you can compute percentiles out of the box, because PostgreSQL has a built-in function for it, percentile_disc, where you specify the percentile you want information about. But of course this is from Timescale, and they're talking about how they've also developed a specialized function that gives you an approximate percentile: it only takes a sample of the data and still gives you pretty accurate percentiles, but does it much faster. They talk a little bit about those functions and their availability within the Timescale extension for PostgreSQL, and that's essentially what the second half of the post covers. So if you're interested in that, you can check out this blog post. [00:05:55] Speaker A: The next piece of content is "Logical replication improvements in PostgreSQL 14". This is from amitkapila16.blogspot.com. They're talking about all the different changes for logical replication in Postgres 14, including decoding of large in-progress transactions. That's where decoding happens while the transaction is still in progress, as opposed to waiting until it's committed, which allows replicas to stay closer to the current state of the primary. They also cover the performance of logical decoding and how that's improved, particularly if you have a lot of DDL statements hitting the system, as well as improving the time of the initial table sync and avoiding errors that would cause the whole sync to be retried. So there have been a number of improvements in this area: logical decoding of two-phase commits as a new feature; being able to monitor logical decoding through the pg_stat_replication_slots system view, which could definitely be beneficial; allowing publications to be easily added to and removed from a subscription (before, you had to re-specify every publication, whereas now it looks like you can just add one); along with a binary transfer mode and the ability to get messages via pgoutput. So if you're interested in any of these features coming in Postgres 14 with regard to logical replication, definitely check out this blog post.
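As a concrete illustration of the built-in percentile support mentioned above, here is a minimal sketch comparing an average against a median and a 99th percentile; the table and column names are hypothetical.

```sql
-- The average can hide tail latency that the 99th percentile exposes.
SELECT avg(duration_ms)                                          AS avg_ms,
       percentile_cont(0.5)  WITHIN GROUP (ORDER BY duration_ms) AS median_ms,
       percentile_disc(0.99) WITHIN GROUP (ORDER BY duration_ms) AS p99_ms
FROM response_times;
```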
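And for the logical replication improvements, here is a minimal sketch of what the Postgres 14 options look like; the subscription, publication, and connection details are hypothetical.

```sql
-- On the subscriber: stream large in-progress transactions and use the
-- binary transfer mode (both new options in Postgres 14).
CREATE SUBSCRIPTION app_sub
    CONNECTION 'host=primary dbname=app user=replicator'
    PUBLICATION app_pub
    WITH (streaming = on, binary = true);

-- Also new in 14: add a publication without re-listing the existing ones.
ALTER SUBSCRIPTION app_sub ADD PUBLICATION reports_pub;

-- On the publisher: monitor decoding per slot, including how much work
-- spilled to disk versus being streamed while still in progress.
SELECT slot_name, spill_txns, spill_bytes, stream_txns, stream_bytes
FROM pg_stat_replication_slots;
```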
The next piece of content is "PostgreSQL: Create indexes after bulk loading". This is from cybertec-postgresql.com, and this is a good piece of advice. Generally, you're going to want to create your indexes after you've bulk loaded a lot of data, because it's much faster to load the data first. He's got an example here where he loaded the same amount of data two different ways. The first way, he loaded all the data with no indexes and then applied the one index that existed, and it finished in less than half the time of keeping the index on the table and loading it that way. So it's far more efficient to load the data and then add the indexes as a second step. Now, sometimes you're going to need to load a lot of data into an active table with active indexes, and there's simply nothing you can do about that; this advice is more applicable when you're doing an initial load of a lot of data. You may also need to keep certain indexes on during the loading procedure if they're referenced as part of it, like if you're referencing data from another location to do a secondary update or something of that nature. But overall, you want to keep the indexes off during any big initial load and then apply them later. But if you want to learn more, you can check out this blog post. [00:08:44] Speaker A: The next piece of content is "A quick test for Postgres foreign data wrapper batch insertion". This is from highgo.ca, and they did a quick test to see how much faster insertions are with the Postgres foreign data wrapper in version 14. They set up a foreign server on the same machine using a different port and tried inserting different amounts of data (1,000 rows, a million rows, 100 million rows), examining the performance across different batch sizes: no batching, so one row at a time; ten rows at a time; and 100 rows at a time. What he found was that the performance, particularly with larger record sets, was about five to ten times faster using batches. For example, no batching with 100 million rows took 53 seconds, whereas it took 5 seconds with a batch size of 100, so ten times faster. And keep in mind this was on a local server; imagine the performance difference if you were interacting with a server across the network. I would imagine that increasing the batch size would give you even greater gains compared to doing one row at a time. So it's definitely a great improvement if you're using foreign data wrappers. The next piece of content is "PostgreSQL schedulers: comparison table". This is from cybertec-postgresql.com, and they're comparing all the different schedulers that, to their knowledge, exist for Postgres. They cover pg_timetable, which is a scheduler created by CyberTec, and they also discuss pg_cron, pgAgent, jpgAgent, and pgbucket. They have comparison charts of checks and X's, and then descriptions of each area below. So this is a great post if you're looking for a scheduler for Postgres, although generally I still just use cron to do all my scheduling, so I haven't used any of these particular tools. But if you want to learn more, you can check out this blog post.
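To make the load-first, index-later pattern from the bulk loading post concrete, here is a minimal sketch; the table, file path, and index are hypothetical.

```sql
-- Create the table with no indexes, load the data, then build the index once.
CREATE TABLE measurements (
    sensor_id int,
    reading   numeric,
    taken_at  timestamptz
);

COPY measurements FROM '/tmp/measurements.csv' WITH (FORMAT csv);

-- One bulk index build is far cheaper than maintaining the index
-- row by row throughout the load.
CREATE INDEX measurements_sensor_idx ON measurements (sensor_id, taken_at);
```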
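For the foreign data wrapper test, the batching behavior corresponds to the postgres_fdw batch_size option added in Postgres 14. Here is a minimal sketch with hypothetical connection details; a user mapping is also needed before you can actually insert.

```sql
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

-- Send 100 rows per insert round trip instead of one (Postgres 14+).
CREATE SERVER remote_pg
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'localhost', port '5433', dbname 'demo', batch_size '100');

-- batch_size can also be set (or overridden) per foreign table.
CREATE FOREIGN TABLE events_remote (id bigint, payload text)
    SERVER remote_pg
    OPTIONS (table_name 'events', batch_size '100');
```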
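As a small taste of one of the schedulers compared in the post, here is a hedged sketch of pg_cron, assuming the extension is installed and preloaded; the job itself is hypothetical.

```sql
-- pg_cron must be listed in shared_preload_libraries before this works.
CREATE EXTENSION IF NOT EXISTS pg_cron;

-- Schedule a nightly VACUUM ANALYZE at 3:00 AM using standard cron syntax.
SELECT cron.schedule('nightly-vacuum', '0 3 * * *', 'VACUUM ANALYZE');

-- Inspect scheduled jobs, or remove one by name.
SELECT jobid, jobname, schedule, command FROM cron.job;
SELECT cron.unschedule('nightly-vacuum');
```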
[00:10:50] Speaker A: The next piece of content is "Best PostgreSQL GUIs in 2021 (updated)". This is from retool.com, and Retool does make a PostgreSQL GUI itself, so that's something to keep in mind with this list. But beyond their own tool, they list ten other tools that allow you to use a GUI to interact with Postgres. The number one, and probably the oldest and most popular, is pgAdmin, but they list others like Navicat, which is a paid product, as well as DBeaver, HeidiSQL, DataGrip, OmniDB, Beekeeper Studio, TablePlus, QueryPie, and SQLGate. So those are all sorts of different GUI tools that you can use with Postgres. The next piece of content is "pglog, a flexible and simple log analyzer". This is from mydbanotebook.org, and this post talks about an extension that's explicitly used to analyze log entries from Postgres. The idea is that you take the logs from one database, load them into a separate PostgreSQL instance, and then use SQL to analyze the logs of the other system. There are particular logging settings you should configure on the server you're going to be monitoring, and you're going to want to collect CSV logs. You load those into tables in a separate instance, and then you can use the pglog extension to run reports looking at the data in there, such as an autovacuum report, a checkpoint report, temp file usage, et cetera. So if you do a lot of analysis or want to use SQL to analyze the log output of your database systems, you can check out this post and extension. The next piece of content is "Tuple duplicate support and incremental view maintenance". This is from Yugo Nagata's PostgreSQL blog on blogspot.com, and it's another post on incremental view maintenance, where the goal is to maintain a materialized view without having to refresh it: it stays fresh by reading the underlying tables and applying the updates in real time. Here they're talking about being able to support duplicates in a table. So if you want to learn more about their progress with this feature, you can definitely check out this blog post. The next piece of content is "What's new in the Citus 10.2 extension to Postgres?". This is from citusdata.com, and there have been a number of enhancements, including PostgreSQL 14 beta 3 support, basically making it Postgres 14 ready. They've added some partition management functions for time series work, as well as index support for columnar tables, which is the one I find pretty interesting. If you're interested in Citus, the extension that does scale-out Postgres, you can definitely check out this blog post. [00:13:51] Speaker A: The next piece of content: the PostgreSQL Person of the Week is Soumyadeep Chakraborty. So if you're interested in Soumyadeep and his contributions to Postgres, you can definitely check out this blog post. And the last piece of content: our latest episode of the Rubber Duck Dev Show covered REST versus GraphQL APIs. So if you're looking for more long-form developer content, you can check that out. Our next episode will be Wednesday at 08:00 p.m. Eastern Standard Time, where we will be discussing background job processing. That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.

Other Episodes

Episode 91

November 25, 2019 00:13:22

Global Indexes, Caching Aggregates, Vacuum Processing, Effective Cache Size | Scaling Postgres 91

In this episode of Scaling Postgres, we discuss global indexes, ways to cache aggregates, how vacuum processing works and the purpose of effective cache...


Episode 272

July 03, 2023 00:14:39

Postgres 16 Beta 2, The Rise of Vectors, FDW Performance, Unused Indexes | Scaling Postgres 272

  In this episode of Scaling Postgres, we discuss the release of Postgres 16 Beta 2, the rise of vectors and storing them, Foreign Data...


Episode 267

May 28, 2023 00:16:03

PostgreSQL 16 Beta 1, Rust Functions, Partitioning Memory Problems, Tags & Arrays | Scaling Postgres 267

  In this episode of Scaling Postgres, we discuss the release of PostgreSQL 16 Beta 1, creating Rust functions with PL/Rust, memory problems related to...

Listen