Foreign Data Wrapper, Timescale Compression, Fuzzy Search, Query Execution | Scaling Postgres 179

Episode 179 | August 23, 2021 | 00:15:40

Scaling Postgres

Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, we discuss how to use foreign data wrappers, compression available in timescaledb, working with fuzzy search and describing Postgres query execution.

To get the show notes as well as get notified of new episodes, visit: 

 https://www.scalingpostgres.com/episodes/179-foreign-data-wrapper-timescale-compression-fuzzy-search-query-execution/

Episode Transcript

[00:00:00] In this episode of Scaling Postgres, we talk about foreign data wrappers, Timescale compression, fuzzy search and query execution. I'm Creston Jamison, and this is Scaling Postgres episode 179.

[00:00:23] All right, I hope you, your friends, family and coworkers continue to do well. Our first piece of content is "Understanding Foreign Data Wrappers in Postgres and postgres_fdw". This is from Crunchydata.com, and it's talking about using foreign data wrappers. A foreign data wrapper basically enables you to speak to another database. In the case of the postgres_fdw foreign data wrapper, it allows you to speak to another Postgres database and create a virtual table locally that mirrors one on the foreign database, so you can query it locally and it goes and pulls the data for you from that other database. Now, there exist many other foreign data wrappers. There are foreign data wrappers for, I think, MySQL, SQL Server and Oracle, as well as for plain files, so you can use a file as a foreign data source. This blog post walks you through setting up the Postgres one. First you create the postgres_fdw extension. Then you define a foreign server, and they show the syntax you would use. Then you create a user mapping to set up the authentication so that you can connect to that foreign server. And then you create your foreign tables on your database, referencing the tables on the other server. You can create a foreign table individually, or you can import a foreign schema, which is what they did here. Now you can query those tables as if they existed in your database, and you can even join them to tables that do exist locally. The post also conveniently shows a number of psql commands to inspect the foreign servers, user mappings, foreign tables and so on. In terms of optimization, they recommend using use_remote_estimate, which basically allows the foreign server to calculate its own estimates and pass them over to the local server, but you can also leave that off. They also say to be sure to analyze your foreign tables relatively frequently. And they mention that if you want faster access, you could create a materialized view of the foreign table so that the data really does exist locally and Postgres doesn't have to go over the network to bring it back, which may give you some performance improvements at the expense of having to refresh that materialized view. They follow up by saying how this could be an alternative to ETL: if you don't want to transfer all the data to your server and just want to access a few pieces of data from that foreign server, it's convenient to set it up this way. If you want to learn more about foreign data wrappers, check out this blog post.
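
To make those steps concrete, here is a minimal sketch of the postgres_fdw setup described above; the server address, credentials, schema and table names are hypothetical, not taken from the article.

```sql
-- Minimal postgres_fdw sketch; connection details and object names are hypothetical.
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

-- Define the foreign (remote) Postgres server.
CREATE SERVER remote_db
  FOREIGN DATA WRAPPER postgres_fdw
  OPTIONS (host 'remote.example.com', port '5432', dbname 'salesdb',
           use_remote_estimate 'true');  -- let the remote server supply planner estimates

-- Map the local role to credentials on the foreign server.
CREATE USER MAPPING FOR CURRENT_USER
  SERVER remote_db
  OPTIONS (user 'app_reader', password 'secret');

-- Import all tables from a remote schema as local foreign tables.
CREATE SCHEMA IF NOT EXISTS remote_public;
IMPORT FOREIGN SCHEMA public
  FROM SERVER remote_db
  INTO remote_public;

-- Query (and even join against) the foreign tables as if they were local.
SELECT count(*) FROM remote_public.orders;
```

For reference, psql's \des, \deu and \det meta-commands list the foreign servers, user mappings and foreign tables that the post inspects.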

[00:03:03] The next piece of content is "Timescale compression and OpenStreetMap tags". This is from Rustprooflabs.com, and they're talking about some previous posts they had done where they have a bunch of OpenStreetMap data, and they partitioned it using the built-in declarative partitioning in Postgres. What they wanted to explore is how the partitions in Timescale would be different, and the two benefits they were looking for from Timescale are, first, no need to manually create the partitions. Now, personally, I don't have a big problem with this; it's pretty easy to create the partitions, but I guess it is convenient not to have to create them if you don't want to. The second is compression, because by default Postgres will only compress a value when a row exceeds a threshold of around 2 KB (TOAST compression), whereas Timescale enables you to compress the entire row essentially no matter its size. So this post talks about the differences he sees when he migrates from Postgres native declarative partitioning to Timescale's partitioning scheme. He goes over setting it up: basically adding timescaledb to shared_preload_libraries and creating the timescaledb extension in the database. At that point he created some tables, turned them into hypertables using the create_hypertable function, and then inserted all of his data. He kept the old source data so he could do comparisons for querying, row counts and things like that. Then he went ahead and added compression as well. His initial compression, looking at JSONB fields, resulted in about a 93% compression ratio, which is pretty huge: the size went from 607 megabytes down to 45 megabytes, so that's quite significant. He did say this was a bit of an outlier, and some other columns didn't compress as well, at about 40 to 50% compression, but that's still pretty significant. He also follows up looking at query performance. Looking at the table in the post, the original dedicated table shows the planning and execution time for the query, and the hypertable took a little bit longer, I suspect because the original is a dedicated table by date, whereas you also had to narrow by date when querying the hypertable, so it took more planning time to identify which partition to execute the query against. The execution time was just as short, though. But when you look at a compressed hypertable, the execution time increased significantly, as you can tell from the table. Now, it was still about one millisecond, but the magnitude is quite a lot longer. So the compression does save you a lot of space at the cost of query time. This was a very interesting blog post using TimescaleDB for a particular use case, and if you're interested in that, you can check it out.
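
As a rough sketch of the hypertable and compression setup being described, here's what those steps can look like; the table, columns and interval are hypothetical, not from the article.

```sql
-- Hypothetical hypertable with TimescaleDB compression enabled.
-- Assumes timescaledb is already in shared_preload_libraries.
CREATE EXTENSION IF NOT EXISTS timescaledb;

CREATE TABLE osm_tags (
    osm_date timestamptz NOT NULL,
    osm_id   bigint      NOT NULL,
    tags     jsonb
);

-- Turn the plain table into a hypertable, chunked by osm_date.
SELECT create_hypertable('osm_tags', 'osm_date');

-- Enable native compression, segmenting compressed data by osm_id.
ALTER TABLE osm_tags SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'osm_id'
);

-- Compress existing chunks older than 7 days
-- (add_compression_policy() can automate this on a schedule).
SELECT compress_chunk(c)
FROM show_chunks('osm_tags', older_than => INTERVAL '7 days') AS c;
```

Segmenting by a column you commonly filter on tends to help query performance on compressed chunks, since it limits how much data has to be decompressed for a given query.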

[00:06:19] The next piece of content is "A Handbook to Implement Fuzzy Search in PostgreSQL". This is from Arctype.com, and he's talking about various different ways of searching text in Postgres. The first way they cover is doing a direct equals comparison to find an exact string, and for that a B-tree index works great. It becomes a little more difficult when you want to look for a particular prefix or suffix in the text string, but you can still use a B-tree index; you're just going to want to use the text_pattern_ops operator class when you create it, and that will enable you to do a prefix or suffix search with pretty good performance. They did say you could also do an internal search and it is faster, but not by much at all. And if you have a lot of text in the field, the B-tree won't even work. For example, when they created an index on a larger column containing a lot of data, they got an error saying the index row size exceeds the maximum for this index. Now, they didn't talk about it in the post, but the solution for that is to use a GIN index, because that will let you create it. And generally, if you want full text search, you would use tsquery searches against tsvectors, which they also didn't mention. They then went into how to do a fuzzy search, and there are basically three ways to do that in Postgres right now. The first is to break things up into trigrams, and for that there's the pg_trgm extension: you create the extension and then you can do queries against it or do similarity searches to find text that matches what you're querying. The second is Levenshtein matching, which is part of the fuzzystrmatch extension, and it basically looks for something that's close to what you're querying, so many characters off. For example, they did a Levenshtein comparison with a text string allowing a distance of less than four, and that gives a particular output. The third way, also in the fuzzystrmatch extension, is a phonetic match, as in "it sounds like that", and here they use the difference function to do that type of fuzzy search. And whereas they didn't mention GIN up above, they do mention it here, and there's even a gin_trgm_ops operator class that allows trigram searches within a GIN index to get better performance. So if you want to learn a little more about text searching and fuzzy string matching in Postgres, I definitely suggest checking out this blog post.
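
To illustrate the three approaches just described, here is a small sketch; the table, data and thresholds are made up for the example, not taken from the handbook.

```sql
-- Hypothetical example of trigram, Levenshtein and phonetic fuzzy matching.
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;

CREATE TABLE authors (name text);
INSERT INTO authors VALUES ('John Steinbeck'), ('J.R.R. Tolkien');

-- 1. Trigram similarity, with a GIN trigram index for performance.
CREATE INDEX authors_name_trgm ON authors USING gin (name gin_trgm_ops);
SELECT name, similarity(name, 'Jon Stienbeck') AS score
FROM authors
WHERE name % 'Jon Stienbeck';   -- % is pg_trgm's similarity operator

-- 2. Levenshtein distance: names fewer than 4 edits away from the search term.
SELECT name FROM authors WHERE levenshtein(name, 'John Stienbeck') < 4;

-- 3. Phonetic (Soundex) matching: difference() returns 0-4, where 4 means "sounds alike".
SELECT name FROM authors WHERE difference(name, 'Jon Stynebeck') = 4;
```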

[00:08:58] The next piece of content is "Postgres query execution: simple query", and this is from Dev.to in the Yugabyte section. This is actually a series of three blog posts covering Postgres query execution. The first post covers a simple query, the second covers a simple query using prepared statements, so it prepares a statement and then executes it multiple times with different parameters, and the third covers PL/pgSQL query execution. So I encourage you to check out all of these posts. This first post gives some interesting insight into what's happening in the background, showing the different steps a simple query goes through, and then it talks a little bit about performance. Usually what you would use to track the timing of different queries is pg_stat_statements, but he says that doesn't give the whole picture, in that it only gives you the duration from when the ExecutorStart hook begins to when the ExecutorFinish hook begins, and there are some things that aren't included. Namely, if you're on a version before 13, it only gives you the execution time of queries; apparently in 13 you can change it to give you execution plus planning time, but it still doesn't include everything. Whereas log_min_duration_statement, which logs to the Postgres logs, does give you a more comprehensive view of how long a query takes to run. We saw this in a previous post covered on Scaling Postgres where they were seeing timing differences between these two, and one of the main things log_min_duration_statement includes is how long it took to deliver all the data to the client. So there are areas where it can be more accurate than pg_stat_statements. But this was a great series of three blog posts explaining query execution in Postgres, so if you're interested in that, you can check them out.
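
As a small sketch of the two timing sources being compared above; the threshold and query are illustrative, not from the posts.

```sql
-- Sketch of the two timing sources discussed above; values are illustrative.
-- pg_stat_statements must be preloaded (requires a restart):
--   shared_preload_libraries = 'pg_stat_statements'
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- On Postgres 13+, also record planning time, not just execution time.
ALTER SYSTEM SET pg_stat_statements.track_planning = on;

-- Log any statement that runs longer than 500 ms; per the discussion above,
-- this duration also covers delivering the results to the client.
ALTER SYSTEM SET log_min_duration_statement = '500ms';
SELECT pg_reload_conf();

-- Then compare the logged durations with what pg_stat_statements reports
-- (column names shown are for Postgres 13+).
SELECT query, calls, total_exec_time, total_plan_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
```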

The next piece of content is "How batch insertion was done in postgres_fdw". This is from Highgo.ca, and it's talking about the Postgres foreign data wrapper. It goes into more detail on the patch that was recently done to batch multiple inserts. Before, it was doing one insert at a time over the network to the foreign server, whereas now they've added a feature where you can batch those up to get much better performance, and this post goes through the patch that did that. So if you're interested, you can check out this blog post.

The next piece of content covers pgbench's client and jobs options. This is from Dev.to in the Yugabyte section, and they're talking about using pgbench and clarifying what client and jobs mean. At the beginning they say client is about the number of servers and jobs is about the number of clients, which, having read through the blog post, I actually kind of disagree with. I believe what the pgbench documentation says, in that clients are the number of concurrent database clients, and that makes sense. What I think he's trying to convey is that a database client equals one process on the server, so basically the number of database clients you spin up is the number of server processes you're going to get. Jobs, according to the documentation, equals the number of threads, and that makes sense to me as well. The thing to keep in mind is that if you have one system running pgbench against the database and you're asking it to simulate a lot of clients, you may need to give it more than one thread, or you may run into a bottleneck on the client side. But apart from that, this blog post does go into a lot of the details and explains some of the differences between client and jobs for pgbench.

[00:12:41] The next piece of content is "Waiting for PostGIS 3.2: Secure Cloud Raster Access". This is from Crunchydata.com, and they're talking about using PostGIS with raster data. There are two ways you can do it: you can have it in the database, or you can have it outside of the database and use PostGIS to access that external raster data. A lot of times people go for the external solution as opposed to storing it in the database, because storing it makes the database really huge, and they list some other disadvantages here in addition to size. They say the raster format access library, GDAL, was by default built for local file systems, but they've recently added changes to support cloud access, and with the most recent changes they allow secure access to cloud resources, which PostGIS 3.2 will enable as well. So you could have your own private buckets in S3 or Google Cloud Storage and access those from PostGIS. If you want to learn more about that, you can check out this blog post.

[00:13:54] The next piece of content is "Validate PostgreSQL extension upgrade scripts using pg_validate_extupgrade". This is from Migops.com, and they're talking about a new extension called pg_validate_extupgrade. Basically, if you develop your own extensions for Postgres, this enables you to better manage your upgrade scripts. So if you want to learn more about this new extension, you can check out this blog post.

[00:14:23] The next piece of content is "pg_timetable version 4 is out and available immediately". This is from Cybertec-postgresql.com, so if you use pg_timetable to help run time-dependent jobs in Postgres, you can check out this blog post.

[00:14:38] The next piece of content: the PostgreSQL Person of the Week is Andrey Borodin. If you're interested in learning more about Andrey and his contributions to Postgres, you can check out this blog post.

[00:14:49] And the last piece of content: we did another episode of the Rubber Duck Dev Show this past Wednesday. This episode was on writing performant code, so if you're interested in more software development content, particularly around writing more performant code, you can check out this episode. Our upcoming episode is going to be about the performance of your server infrastructure, and that will be happening Wednesday at 8:00 p.m. Eastern Standard Time.

[00:15:16] That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to Scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.

Other Episodes

Episode 287

October 22, 2023 00:14:25

15% - 250% Faster Query Throughput | Scaling Postgres 287

In this episode of Scaling Postgres, we discuss how the new version of pgbouncer could get you 15% to 250% faster query throughput, the...


Episode 80

September 09, 2019 00:13:47

Just Upgrade, Vacuum Updates, Connection Strings, Postgres Showcase | Scaling Postgres 80

In this episode of Scaling Postgres, we discuss why you should just upgrade, vacuum updates in Postgres 12, psql connection strings and a Postgres...


Episode 177

August 09, 2021 00:16:28

Planner Workarounds, Grouping Sets, Lateral Join, Not-So-Easy | Scaling Postgres 177

In this episode of Scaling Postgres, we discuss ways to work around the planner, working with grouping sets, using a lateral join and not-so-easy...
