Watch, Fast Text Search, Column Defaults, Import CSVs

[00:00:00] In this episode of Scaling Postgres, we talk about watching Star Wars fast text search columns with defaults and importing large CSVs. I'm Kristen Jameson, and this is Scaling postgres episode 28. [00:00:22] Hi. We have another week with a lot of great content, so let's get started. Now, the first up isn't necessarily related to Scaling Postgres, but it's super short and interesting. [00:00:35] So this is from actually a YouTube video. It's only eight minutes long and it's from where it's called a double exclamation Point Con 2018 using Postgres to watch Star Wars by Will Line Weber or Lean Weber. So basically there is an ASCII animation that someone has developed over the years for Star Wars. So it kind of looks like this, and it does text from the film and it shows little ASCII animation of what's going on. And he actually using Ruby, I believe, took the raw data from the website that hosts this ASCII animation, loads it up into postgres, and then uses the psql Watch command, which enables you to trigger a query multiple times with a delay between it to be able to watch Star Wars as slow or fast as you want to, based upon the parameters. And you can even update the database to determine at what point you start the movie. Or you could fast forward or rewind to a particular point. So again, not related to Scaling Postgres, but I thought this was one of the more interesting pieces of content this week. I just wanted to show. [00:01:49] So kind of related to the previous YouTube video. This is a post called psql Tips and Tricks. And this is from the PG IO blog and they just have little gifs that go through some of the top tips when you're using psql. So the first one they talk about is timing, which can track the timing of the queries that you run. [00:02:12] If you don't want null to be invisible, you can choose to set it to be a particular value. Again, going back to Watch, where you watch a repeatedly executed query, you can actually alter the styles of the output from psql to make tables. They have an example where you save results of an inquiry to a CSV, to a local file with a copy command, which is similar to the Copy SQL command. And then they're showing h for a built in syntax reference as well as using a command editor. So definitely some tips and tricks to check out. [00:02:50] The next post is Fast full text search in PostgreSQL. And this is from the Austinjwalters.com blog. So here he's talking about a fast text search and he wanted to search through a large set of data and he tested a few different ways. So of course the slow way is using the full text search using TS vectors or two TS vectors without an index. So that's clearly going to be slow. Something that's a little faster is just doing a light. So like searching for light Google with percent on the other side index won't really have a great improvement on this, however, combining the TS vectors with a gen index that gives you the fastest text search results generally. Now a little twist he put to it, he just didn't put the gen index on the column he was interested in but he actually created a new column so that he could actually set weights of how much to prioritize different pieces of information. So he set the weight of the story title to A and then set the weight of the comment text to B. So it's going to preferentially weight the story title above what's in the comment text. Now in doing that you're actually going to have to create a function and then a trigger to be able to keep that up to date. So there is a cost with doing this but it gives you the ability to have those in the database. But again most of the speed is definitely due to the gen index being used. But if you're wanting to learn more about full text search, definitely a blog post to check out the next article is a missing link in postgres Eleven fast column creation with defaults and this is from the brander.org blog. So basically he's talking about a feature in postgres Eleven whereby you can quickly add a default to a column in a table. Now he goes over what has been the problem in previous instances of postgres. So if you wanted to add a column to a table, so for example you alter table users add a column credits with a big int. This happens very fast. [00:05:08] The problem comes in if you want to make something not null well then you need to set a default and then how does the database handle it? It needs to do a full table rewrite and it basically uses an access exclusive lock while it's doing this. [00:05:24] So basically the process you have to go through is you have to add the column as was done up here, you have to then backfill it and then at that point you can set not null and default and things of that nature but it's a bit of a process to do. But with postgres Eleven they've implemented some features that enable adding a column with a default that won't take this access exclusive lock and it doesn't have to touch every single row in the table. And in the underhood section he describes quote the change adds two new fields to PG attribute a system table that tracks information on every column in the database. So attribute has missing set to true when there are missing default values, an attribute missing value contains the missing value. So basically this works great for being able to add a default with new columns you're adding. And as this post goes through it's really great for helping maintain the consistency of your data because he indicates. What people tend to do is they just add the column and don't worry about setting a not null or with the default because of the effort required for large databases to essentially rewrite that table. [00:06:34] But with this feature more developers are going to be inclined to add that column and when needed add a Not Null and potentially a default. So definitely a blog post to check out if you're interested in learning more. [00:06:47] The next post is using Copy in Postgres for importing large CSVs. This is from the Trenao.com blog and basically he had to import a million line 750 megabyte CSV file into Postgres for a Rails app and basically he started using a Rake task to parse the CSV import each row via active record, but at that rate it would take 16 hours to complete. So basically he went through that process and basically came up with a better approach, which is definitely the case using Postgres Copy. So he goes through the steps in terms of creating an imported records table copy from the file into that table, which at this point took 39 seconds to do as opposed to hours. And then he actually had a need to merge table. So from this temporary table he inserted into the destination table, but he also used an on conflict do an update so that basically if a row did not exist it gets inserted, otherwise it gets updated with the updated information. And then lastly, which is optional, is in terms of doing a vacuum analyze or vacuum full analyze if you want. Now if you're interested in learning more about this, I actually did a tutorial called Fast PostgreSQL Data Loading using Ruby and I went through different techniques you could use and showed the timing in this video and I looked at loading one row per insert statement. But then what was interesting is that you can really crank up the number of rows per insert statement. Now I had to do a lot more work to get that working, but it started approaching the speed of copy. I mean basically, probably Copy is going to blow the doors off anything. However, you can get super close loading multiple rows per insert statement versus just one set of data at a time or one row at a time in an insert statement. So if you're interested you can definitely check out this piece of content as well. [00:08:51] The next post is Postgres data types you should consider using, and this is from the Citrusdata.com Blog and of course their number one here quote JSON b tops the list of postgres data types that are apart from the standard one you should be using. But basically this is great for unstructured data that you're using in your application or interfacing with. An important thing to know about it is Gen indexes really speed up access and use of this data. So be sure to use gen indexes with it. [00:09:24] Next quote is range types are a calendar app's best friend. So basically these are great to ensure that you don't have double booking of something. Another type they mention is quote defining your own acceptance values with enums. So basically at the database level you define what values are accepted. So it's a way to enforce a constraint on say, a state field and then they go into cases like when you have an IP address a timestamp UUID, there's all sorts of built in types you can use in postgres. So this was a relatively short post that you should definitely check out. [00:10:00] The next post is again a super short one but has really great information and this is from thebill.com and it's don't lock tables, just don't. So basically when you need to serialize access to one or more resource you may be inclined to lock a table and basically really they're saying don't do that because number one quote it blocks auto vacuum which can cause bloat and even transaction ID wraparound in extreme cases. So these types of locks can also block queries and potentially cause deadlock situations. And lastly, it's easy to cause deadlocks with bad locking order. And then they say if the goal is to serialize access, consider using advisory locks instead. They have all the benefits of a lock on a table while not actually blocking access to auto vacuum or access on secondaries. So definitely some advice to follow. The next post is actually a video from the second quadrant.com blog and it's called Data Integration with PostgreSQL. So it's about a 40 or 50 minutes video and basically using foreign data wrappers and logical decoding, it shows how to interface PostgreSQL with MySQL, to read and insert data or manipulate data to do it from a flat file, from a Python script, using a Rest interface, a different postgres node. And they even demonstrated some things with Kafka. So if you're looking at different ways to interface postgres with other types of data, streams of data or database systems, this is definitely a YouTube video to check out and this was actually from a webinar called Data Integration with PostgreSQL by Sean Thomas. [00:11:47] The next post is actually another YouTube video called PostgreSQL Indexing. How, why and when? So this was presented at PyCon, Australia by Curtis Maloney and it just basically goes over the basics of indexing but some good advice that's mentioned throughout, it's about 31 minutes long, there were some good refreshers at the end for me, but a lot of it I have experience with before. But if you're a developer that's relatively new to postgres, this is getting indexing under your belt is essential. The next post is what hot standby feedback in PostgreSQL really does. So basically this is talking about the scenario where you have a primary database that's doing streaming replication to a replica database. Essentially it's a hot standby and there are instances where as the primary goes through updating deleting data, it can actually remove something that can impact an active query on the replica database. And what tends to happen is that that query gets canceled and you'll see something to the effect of error canceling statement due to conflict with recovery. Now, this blog post goes over kind of why that happens and how hot standby feedback can help prevent these replication conflicts. But like anything there's downsides to it and they mention here basically vacuum is going to be delaying its cleanup operation so it won't be able to clean up those dead rows as efficiently, which could lead to bloat. And quote if the slave never terminates a query, it can lead to table blade on the master which can be dangerous in the long run. So this is one parameter that you can tweak, but just be aware of the downsides before you explore doing it. The next post is Tune Linux kernel parameters for PostgreSQL optimization. [00:13:41] So this goes over a number of Linux parameters that you can tweak for PostgreSQL. Now, a lot of these are dependent upon the version of OS you're using, what distro and even the version of PostgreSQL. So for example, Sh MMX and Sh Mall are only relevant for versions prior to 9.3 and a lot of these settings only improve performance once your database gets to a certain size. So if you're not yet at like the terabyte stage, you probably won't get much of an improvement tweaking some of these parameters. But again, if you're going to definitely do it in a replica of your production system, testing out and seeing what parameter because it's also very hardware specific some of these parameters. So definitely test before you roll things out into production. The next post is Real World SSD wearout. [00:14:39] This is from the Okmeter IO blog and they're talking about how SSDs can tend to wear out and how they are using a smart monitoring agent that collects disk drive attributes and basically contract is the Media wearout indicator. So at what point is the SSD media going to wear out? So this is just something to keep in mind if you actually have a physical server and it's not like a database hosted solution or they are maintaining the hardware for you, but if you literally have a system, this is something you need to be aware of that this eventually happens. And with reference to PostgreSQL, which they talk about here, is that it can definitely happen and definitely with regard to writes, it can happen more than you expect. For example, if you have a lot of SQL queries or those SQL queries and those SQL queries are generating temp files, or they're having to do sort operations and doing sorts on disk as opposed to memory, that can lead to a lot of disk writes and usage maybe that you weren't expecting. So it's definitely something to be aware of if you have physical servers. Last post is Beware of your next Glib C upgrade and they're talking about they've done a big update to the correlation information for Ganu Libc 2.2.8. Now, this probably isn't going to impact many people until the future, but it's just something to be aware of that this can have impact on how sorting works based upon how you have set up your database. Now, they talk about in the bottom here quote as of this writing, only bleeding edge distros like Arch Linux have already shipped Glib C 2.8 and when it's scheduled for Fedora, Debian and even Ubuntu. So just something to keep in mind and definitely a blog post to check out to see if you might be impacted by this in the future. [00:16:41] That does it. For this episode of Scaling Postgres, you can get links to all the content presented in the show notes. Be sure to head over to Scalingpostgres.com where you can sign up to receive weekly notifications of each episode, or you could subscribe via YouTube or itunes. Thanks.

Show Notes

Episode Transcript

Other Episodes

Episode 96

Account Configuration, Pricing, System Identifier, HyperLogLog | Scaling Postgres 96

Episode 234

Rust for Extensions, Timescale vs. Postgres, Uninterrupted Sharding, Data Flow | Scaling Postgres 234

Episode 31

CTE Warning, PG 11 Features, Death by DB, Correlation | Scaling Postgres 31

Watch, Fast Text Search, Column Defaults, Import CSVs | Scaling Postgres 28