Weekly News, Full-Text Search Performance, pg_cron, Bulk Data Loading | Scaling Postgres 139

Episode 139 November 08, 2020 00:12:37

Hosted By

Creston Jameson

Show Notes

In this episode of Scaling Postgres, we discuss Postgres weekly news, full-text search performance, enhancements to pg_cron and the best way to bulk load data.

To get the show notes as well as get notified of new episodes, visit: 

https://www.scalingpostgres.com/episodes/139-weekly-news-full-text-search-performance-pg_cron-bulk-data-loading/


Episode Transcript

[00:00:00] In this episode of Scaling Postgres, we talk about weekly news, full-text search performance, pg_cron, and bulk data loading. I'm Creston Jameson, and this is Scaling Postgres, episode 139.

[00:00:21] All right, I hope you, your friends, family and coworkers continue to do well. Our first piece of content is New PostgreSQL Core Team Members. This is the notification posted at postgresql.org that Andres Freund and Jonathan Katz have joined the core team. Presumably this is to balance out the number of members that EDB, or EnterpriseDB, had on the core team: two new members were added so that no more than 50% of the membership comes from EDB. So congratulations to Andres and Jonathan.

[00:00:55] The next piece of content is also from postgresql.org: they have started a PostgreSQL Weekly News. It looks like some general news, a few bits of product news, and then a lot of patch information about all the different changes that have happened to Postgres. So if you want a lot of detail about what's going on with the Postgres code itself, this could be a good resource to keep track of. It says weekly news, so presumably it will be coming out every week.

[00:01:25] The next piece of content is Full-text search since PostgreSQL 8.3. This is from 2ndquadrant.com, and it is a follow-up to two previous posts on performance since Postgres 8.3 that looked at OLTP workloads and analytical or BI queries. This one focuses on full-text search. They talk a little about its history and the fact that there's not really a great benchmark for it, but he had a set of search queries run against the PostgreSQL website, so he thought that would serve as a real-world benchmark. What he tested was GIN indexes and GiST indexes. GIN indexes are generally accepted to be faster for searches, whereas GiST indexes are lossy and require a recheck during searches, so they are a little slower. He mentions the hardware used and then goes into the reports for each version of Postgres. For data loading, there isn't an appreciable difference between versions; early on the duration was a little longer, but nothing substantial. GIN index sizes became more favorable starting around 9.4, and queries got significantly faster around 9.4 as well, generally trending up slightly over subsequent versions. That wasn't the case for GiST indexes: their size stayed fairly consistent, but performance has slowly declined since about 8.3, and that was fairly universal across the different tests he ran. So this was an interesting set of benchmarking data you may want to check out, but the overall conclusion is that, in terms of performance, you should generally reach for GIN when doing full-text search.
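As a rough illustration of the GIN approach the benchmark favors, here is a minimal sketch; the table and column names are made up for illustration, this is not the schema used in the post, and the generated column requires Postgres 12 or later:

```sql
-- Hypothetical table of documents to search.
CREATE TABLE messages (
    id      bigserial PRIMARY KEY,
    subject text,
    body    text
);

-- Precompute a tsvector in a stored generated column (Postgres 12+).
ALTER TABLE messages
    ADD COLUMN search tsvector GENERATED ALWAYS AS (
        to_tsvector('english', coalesce(subject, '') || ' ' || coalesce(body, ''))
    ) STORED;

-- A GIN index on the tsvector column makes full-text matches fast.
CREATE INDEX messages_search_gin ON messages USING gin (search);

-- Query: match the indexed tsvector against a tsquery.
SELECT id, subject
FROM messages
WHERE search @@ websearch_to_tsquery('english', 'replication slot');
```

On older versions you could instead build the GIN index directly on the to_tsvector() expression and repeat that expression in the query.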
The next piece of content is Evolving pg_cron together: Postgres 13, audit log, background workers, and job names. This is from citusdata.com. Back in 2016 they released an extension called pg_cron, a way for you to set up scheduled tasks within the database, so you can call a procedure, vacuum a table, or alter an extension based on a crontab-like entry describing when these activities should occur. This has been an open source project, and they have been working with others on it: Citus Data is part of Microsoft, and they worked with Amazon's RDS team on some additions. Now, generally I just use cron at the OS level, since I don't use a hosted solution such as RDS or Azure Database for PostgreSQL. However, if you are using one of those, having an extension such as pg_cron is advantageous. They have made a number of additions to make this easier. First, it gives you an audit log of the jobs that have run, stored in database tables themselves, so you can see which jobs have run, their level of success, and their current status. They also added an option to use background workers so you don't have to configure connection information; this way you don't have to make changes to your pg_hba.conf file to use the extension. They also added the ability to give jobs names, as well as Postgres 13 support. So it's a very interesting way to run scheduled database jobs when your database is hosted at a cloud provider such as Microsoft or Amazon. If you're interested in this extension, definitely check out this blog post; a small scheduling sketch appears after the next item.

[00:04:59] The next piece of content is Webinar: Best Practices for Bulk Data Loading in PostgreSQL - Follow Up. This is from 2ndquadrant.com, and it's a webinar they put on; you can click through to gain access to it. It covers the fastest ways to bulk load data into Postgres. Basically: make sure you don't have triggers, and don't have foreign keys or indexes on the table; add them after you've loaded the data. COPY is a much better way to load data than INSERT statements, because the general rule is that a single INSERT per commit is one of the slowest approaches, inserting multiple rows per commit is faster, and COPY is the fastest, with possibilities for using parallelism. They also talked about a technique where you load the data into an unlogged table, run ANALYZE on it, and then do an INSERT ... SELECT to place it in its final destination. So if you're interested in speeding up your bulk data loads, definitely check out this webinar.
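Here is that pg_cron scheduling sketch, assuming the extension is installed and loaded via shared_preload_libraries; the job name, schedule, and table are made up, and the named jobs and job-run audit table are the newer features the post describes, so exact column names may vary by pg_cron version:

```sql
-- Enable the extension in the database where jobs will be managed.
CREATE EXTENSION IF NOT EXISTS pg_cron;

-- Schedule a named job: vacuum a table every night at 3:00 AM.
SELECT cron.schedule('nightly-vacuum', '0 3 * * *', 'VACUUM my_table');

-- The audit log of job runs is queryable like any other table.
SELECT jobid, status, return_message, start_time
FROM cron.job_run_details
ORDER BY start_time DESC;

-- Remove the job by name when it is no longer needed.
SELECT cron.unschedule('nightly-vacuum');
```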
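And a rough sketch of the unlogged staging-table bulk-load technique just mentioned, with made-up table names and file path:

```sql
-- Stage the data in an unlogged table with no indexes, triggers, or foreign keys.
CREATE UNLOGGED TABLE events_staging (LIKE events INCLUDING DEFAULTS);

-- COPY is much faster than row-by-row INSERT statements for bulk loads.
COPY events_staging FROM '/tmp/events.csv' WITH (FORMAT csv, HEADER true);

-- Gather statistics on the staged data.
ANALYZE events_staging;

-- Move the rows into their final destination, then drop the staging table.
INSERT INTO events SELECT * FROM events_staging;
DROP TABLE events_staging;
```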
[00:06:02] The next piece of content is Having a second look at Postgres v13 improvements. This is from cybertec-postgresql.com, so it's a further look at Postgres 13 improvements. We've covered some of these previously, but they cover generating a backup manifest file for base backups, along with the pg_verifybackup tool to verify that those files exist and that you have the WAL files needed to do a restore. They cover allowing pg_rewind to use the target cluster's restore_command to retrieve needed WAL, so you can point at a WAL location and it can find any WAL it needs for the rewind process, and an option for pg_rewind to configure standbys, which basically means writing the recovery configuration or creating the signal files necessary for a standby. Also covered: allowing reindexdb to operate in parallel, so parallel reindexing; using the directory of the pg_upgrade program as the default new bindir setting when running pg_upgrade, which can make your pg_upgrade command a bit shorter; and allowing dropdb to disconnect sessions using the target database so the drop can succeed, which is again a command to be aware of, since it lets you drop a database even if it has active connections. There are some changes to pgbench to make it easier to work with, and max_slot_wal_keep_size, so if a replica is falling far behind you can basically drop it out of being a replica rather than retaining WAL indefinitely. They also covered improved retrieval of the leading bytes of TOASTed values for better performance, incremental sorting, which we've discussed previously in terms of speeding up multi-column sorts, and control over how much memory is used by logical decoding before it is spilled to disk. Lastly, they mention that the backend type was added to the CSV log output, which is something to be aware of; for example, they mention the tool pgBadger was caught off guard by this, so if you're using the CSV log output, watch out for that. If you're interested in reviewing these changes, check out this blog post.

[00:08:06] The next piece of content is actually a YouTube channel, the Percona YouTube channel. They held Percona Live Online, and a few PostgreSQL-related presentations have been posted, including A Look at the Elephant's Trunk - PostgreSQL 13 and The Democratization of Databases. There may be more Postgres-related videos posted in the following days.

[00:08:32] The next piece of content is Provisioning a PostgreSQL cluster with Terraform and Ansible. This talks about using Terraform and Ansible to set up Postgres infrastructure, and it has links to the deployment scripts for Terraform and the playbooks for Ansible. If you're interested in using these tools, you can check out this post. Related to it is a YouTube video that discusses the same thing: Automating a PostgreSQL High Availability Architecture with Ansible. Both of these are from edb.com, and they cover using Terraform and Ansible to deploy a high-availability PostgreSQL cluster. They said that when they surveyed what DevOps tools their clients were using, over 50% of the responses were Terraform and Ansible, so that's probably why they chose these tools. If you're interested in doing this, definitely check out these two pieces of content.

[00:09:28] The next piece of content is Random numbers. This is from 2ndquadrant.com. They're talking about a way to generate random data to populate a database for testing purposes, say billions of rows, and they want it to be reproducible: random, but repeatable by using some sort of seed. He uses a pseudo-random number generator to do that, tests a number of different generators, and shows some of their performance. If you're interested in doing the same thing, you may want to check out this post.
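As a tiny sketch of the seeded, reproducible-random-data idea, here is what that can look like using the built-in setseed() and random() functions rather than the specific generators benchmarked in the post; the table and columns are made up:

```sql
-- Seed the session's random number generator so the run is repeatable.
SELECT setseed(0.42);

-- Generate deterministic "random" test rows; rerunning after the same
-- setseed() call produces the same data.
CREATE TABLE random_test AS
SELECT g AS id,
       random() AS score,
       md5(random()::text) AS label
FROM generate_series(1, 100000) AS g;
```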
[00:10:03] The next piece of content is Election Night Prediction Modeling using PL/R in Postgres. There was a tweet that talked about election night forecasting using R, and this blog post from crunchydata.com took that R code and put it into Postgres to do the analysis directly, because PL/R lets you run R inside Postgres as part of a function. So they set that up here. There's a lot of code involved, but it walks you through this election night prediction. If you're interested in that, check out this post.

[00:10:38] The next piece of content is Monitoring a PostgreSQL cluster via Pgpool-II with Prometheus. They're talking about a Pgpool-II exporter that sends monitoring data to Prometheus so you can analyze it alongside your Postgres instance. If you're interested in setting this up for Pgpool-II, check out this blog post from b-peng.blogspot.com.

[00:11:00] The next piece of content is How to Analyze a PostgreSQL Crash Dump File. They show how you can use GDB to debug a crash dump from Postgres. If you're interested in that, definitely check out this blog post from highgo.ca.

[00:11:16] The next piece of content is Find Your Local SRID in PostGIS. This is from RustProof Labs, so if you're interested in doing these types of geographic analyses, definitely check out this blog post.

[00:11:29] The next piece of content is Announcing Swarm64 DA 5.0. This is an extension for Postgres, and they've made some major changes recently: for data warehousing and analytical workloads, they are switching to a compressed columnstore index to give you high-performance analytical queries. It also includes Postgres 12 support. If you're interested in that, you may want to check out this post from swarm64.com.

[00:12:01] And the last piece of content is that the PostgreSQL Person of the Week is Elein Mustain. If you're interested in learning more about Elein and her contributions to PostgreSQL, definitely check out this blog post.

That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes.

Other Episodes

Episode 118

June 14, 2020 00:15:51

Safety Systems, Failover Slots, Transaction ID Complications, Repartitioning | Scaling Postgres 118

In this episode of Scaling Postgres, we discuss safety systems, the purpose and existence of failover slots, complications with transaction IDs and how to...


Episode 130

September 06, 2020 00:15:15

External Compression, Parallel Vacuum, JSON Aggregates, JSON & Arrays | Scaling Postgres 130

In this episode of Scaling Postgres, we discuss external compression, vacuuming in parallel, working with JSON Aggregates and the JSON & Array datatypes. To...


Episode 41

December 03, 2018 00:14:26

Sequential UUIDs, Running Totals, Read Committed, TOAST | Scaling Postgres 41

In this episode of Scaling Postgres, we review articles covering sequential UUIDs, running totals with window functions, read committed vs repeatable read and TOAST....
