Microsoft SQL Server Compatible, Time Series Performance, Feature Casualties, BLOB Cleanup | Scaling Postgres 143

Episode 143 December 07, 2020 00:18:42
Microsoft SQL Server Compatible, Time Series Performance, Feature Casualties, BLOB Cleanup | Scaling Postgres 143
Scaling Postgres
Microsoft SQL Server Compatible, Time Series Performance, Feature Casualties, BLOB Cleanup | Scaling Postgres 143

Dec 07 2020 | 00:18:42

/

Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, we discuss Microsoft SQL server compatibility, the performance of time series DBs, DB feature casualties and how to clean up BLOBs.

To get the show notes as well as get notified of new episodes, visit: 

https://www.scalingpostgres.com/episodes/143-microsoft-sql-server-compatible-time-series-performance-feature-casualties-blob-cleanup/

View Full Transcript

Episode Transcript

[00:00:00] In this episode of Scaling Postgres, we talk about Microsoft SQL Server compatibility, time series performance, feature casualties, and Blob cleanup. I'm Kristen Jameson, and this is scaling postgres episode 143. [00:00:23] All right, I hope you, your friends, family and coworkers continue to do well. Our first piece of content is AWS announces BabbleFish open source postgres with SQL Server compatibility. So this is a post from Angres.com, and they're talking about an announcement that happened at AWS Reinvent, where they announced BabbleFish for Aurora PostgreSQL. And basically what it does is it provides line compatibility with Microsoft SQL Server. So it interprets the different SQL features that have been added to Microsoft SQL Server as well as TSQL, which is what their store procedures are written in. And they have a little diagram they posted here. That's from the Babelfish GitHub IO repository, where you can have an application with a PostgreSQL driver that goes through plpgsql to PostgreSQL to be able to speak to it. But you can also do the same thing with a Microsoft SQL Server client driver that speaks TSQL over TDs. That's the line protocol for Microsoft SQL Server to BabbleFish to then be able to speak to Postgres. So if this works as advertised, essentially you could use PostgreSQL as a drop in replacement for Microsoft's SQL Server without potentially having to change anything with your application, and then over time, you could transition to Postgres. Now again, I believe this is mentioned for Aurora PostgreSQL, which is PostgreSQL compatible, but it says it's going to be released as open source under the Apache Two license on GitHub on 2021, and that it will be compatible with community based Postgres. So this is a great benefit for users who actually want to start using Postgres, but they're currently using Microsoft SQL Server. Now, this was also published on the AWS blog and it's called Want More PostgreSQL. You just might like BabbleFish, but it pretty much goes into the same details about why they're doing it and choosing to open source it. So if you're interested in this, definitely check out these two blog posts. [00:02:27] The next piece of content is timescale DB versus Amazon timestream 6000 times higher inserts, five to 175 times faster queries and 150 to 220 times cheaper. So that is quite a disparity between timescale DB and Amazon timestream. Now, this is from Timescale.com, which of course produced timescale DB, so you have to take their results with a grain of salt and examine them for yourselves. But if it's anything close to what they're purporting, for example, look at this 6000 times higher inserts. If you're considering a time series database, you would definitely want to check out these types of performance differences yourself before you choose a particular product to use. Then they show the different query speeds that they were getting out of it. So not as dramatic as the inserts, of course, except for this one high CPU query. I believe this is where they were breaking it out to multiple devices, doing the queries, and then they follow up with the cost differential between the two. So again, this is really, really significant. But again, if you were looking for a time series database, I would definitely closely examine this and run your own trials of each to see which performs better and cost, et cetera. But this is a very long blog post talking about the differences, the different offerings, and potentially trying to rationalize why some of these differences exist. So if you're interested in time series databases, and particularly looking for a performant one, maybe you want to check out this post from Timescale.com. [00:04:02] The next piece of content is feature Casualties of Large Databases. So this is a post that is not necessarily PostgreSQL specific, but it's talking about how databases get created for applications they're worked on for a while and then inevitably they start getting larger and larger and then having issues. And what this is talking about is what areas tend to get ignored usually early on that then kind of bite you later. The number one that they're talking about is transactions. So a lot of times people don't consider transactions, especially when they're smaller, because there's a lot less activity and you don't need a lot of locking to avoid race conditions or have inconsistent data. Whereas once you get larger, those transactions become more important. Talking about referential integrity, so identifying your foreign keys to make sure that you don't have orphaned records in particular tables. Again, when you're smaller, not as important, but very, very important. Once you get larger, particularly if you have a lot of individuals interacting with the database, you definitely want that data security, that integrity within the database as opposed to just in the application. They quote as nullable as far as the eye can see. So basically every column that's created is usually nullable. And that is, as he says, the default in DDL is that it will create any columns for a new table as supporting nulls. And a lot of times you don't want that because it makes having to code your application a little bit more complex because you have to handle null cases and then maybe blank cases. In the instance where you have a text column suboptimal indexing, there's an interesting trend where you don't need that as many indexes when you're small. But once you start getting larger and larger is when indexing becomes more important. And depending upon the different database you use, you may have a hesitancy to add more indexes to it because of the potential locking that can occur and the size difference it takes for those indexes. So sometimes there's a hesitancy to add indexes. I haven't found this as much with postgres because you can just create indexes concurrently and generally they get created without issue. And if there is a problem, it just is considered invalid. And then you can drop it concurrently if there happens to be any issue but with other database systems, because that's what this blog post is for. Maybe you have more issues. You have to deal with dangerous queries and restricted APIs. So this individual has indicated that some of the organizations they've worked for, the DBAs or the people in charge of the data purposefully restricted very simplified interactions with the database. Like just do one single record inserts or one single record updates and don't delete a lot of records at once. Now there's some benefit to that, but you're going to hamstring the developers and cause more work for them if you don't allow them to update more than one record or delete more than one record. There's some other controls you can put in place with postgres to say, limit statement timeouts or things of that nature to prevent one query bogging down the system. And then he follows up the post with some ideas for scalability and how you can make working with large databases easier. So I'll let you examine this and check out this post if you want from Brander.org. [00:07:15] The next piece of content is Blob cleanup in PostgreSQL. This is from CyberTech Postgresql.com and they're talking about Blobs, which are essentially binary large objects. Now, first he talks about generally with dealing with binary data as long as it's not too large. You probably want to just handle binary data within a field of a database table because that's much easier to work with. And that is the byte array or the byte A data type. So here you can store an image as a byte array and usually by default it's stored as hex. So you can go ahead and store that data in postgres and then be able to retrieve it. But you have to keep in mind the maximum size of that is 1GB per field. So if you have files larger than that, you actually want to store in the database and not just a reference to the file. So for example, you could store a file name that you could reference in your application code to know where that file exists. But if you're literally wanting to store it in the database and it exceeds this 1GB per field, then you want to look into the lead Blob interfaces. So this is an example of importing a large file. You're doing an Lo, which means large object underscore import and then the name of the file you want to import and what it returns is the object ID of that. Now if you notice that's there and then suddenly it's essentially gone. It's not stored in any table, it's not saved anywhere, it just returns that OID and then you need to remember it somehow. So what he says is that what most people do is create a table, say table file here that has an incremental ID, a name for that file as well as the object ID. So when you do an insert into that table for the OID field. You actually put that import statement of the file name into there so that it actually stores that object ID. So now you have all the references to all those blobs that exist within the database. Now that's okay, but if you do something like this and you delete that record, now you've lost what that Object ID is. You have no idea anymore. Now, it still exists. So you can query the PG Large Object table to get that large object ID, but you have no other reference to it within the database, and it's essentially just stuck there. And he says there's a couple of ways that you can handle it. You can do an large object unlink with it, and that removes it from the database. Now, he also mentions that there is a utility exists called Vacuum Large Object that can clean up dead large objects. So basically, there's no reference to this large object in any table. So you can use this to clean up dead objects that exist in your database. Now, they also mentioned in addition to the large object import and unlink, there's a variety of other functions you can use to work with large objects. So if you're interested in working with binary large objects in postgres, definitely check out this post from CyberTech. Postgresql.com next piece of content is an infographic. Five things to include on your PostgreSQL security checklist. This is from Enterprisedb.com, and this was a great list that breaks out things to check for security across securing access to it, say physical access, Firewall encryption access, authentication access, how to secure authentication, and the different areas you need to check within postgres. Securing roles within the database, defining access controls to different data in the database as well as securing encryption. So if you want to check out this security checklist for Postgres, check out this blog post. [00:10:49] The next piece of content is running PG Bouncer on AWS Elastic Container Service. This is from Revenuecat.com, and basically they're talking about setting up PG Bouncer, and there's not a lot of documentation on PG Bouncer. So I always like showcasing those posts that talk about setting up and using it. They don't get a lot into the specifics. And they did run this on a container stack. So they were using a docker container that had PG Bouncer included. They also, for analysis were using Prometheus. So they were using the PG Bouncer exporter that exports data to Prometheus for monitoring of that system. And it goes through the process of them setting it up, making changes to their file limits, which frequently you have to do. Going into some specifics about working with AWS in terms of whether they were setting up a network load balancer or service discovery to use it. And then they talk about the monitoring with Prometheus, and finally they follow up with the different PG Bouncer settings. The most important one is selecting a pooling mode, do you want to do session or transaction based as well as defining your connection limits? So how many clients that can connect to it is the maximum client connection, and then the pool size defines how many server connections you want to maintain. There's a default pool size that you set as well as a minimum size it shouldn't drop below, so it just always has some connections available. So if you're interested in setting up PG Bouncer and some of the specific environment considerations for AWS as well as their lasting container service, definitely check out this blog post. [00:12:21] The next piece of content is how PG Backrest is addressing slow PostgreSQL wall archiving using Asynchronous feature. This is from Procona.com. They're talking about PG backrest. They're referencing a previous article they talked about. Slow wall archiving is that you want to be very cautious of the archive command you use in postgres. Because if that is slow for any reason, you could get a wall archiving being backed up and essentially your data volume continuing to grow because it can't archive wall fast enough. Now, if it's a simple copy command, that'll rarely happen unless you have problems with your disk systems and it's a slow disk causing it. But if you make it more complicated than just a simple copy command, for example, if you copy it to an external server, or in one of the worst case scenarios they're mentioning here, you're doing an S Three copy to S Three. It just takes a long time to transfer those types of files. You can get a backup. And using this query, when they were running a PG bench workload, they were seeing a delay of 278 wall segments waiting to be archived. One way to potentially speed that up is to use compression. Using compression, they were able to get it down 166, but that still existed. But another feature that PG Back rusts offers is Asynchronous wall archiving. So that means it doesn't have to wait until that file has completed the transfer to S Three. It just sends an immediate notification to postgres, okay, it's done. And then it spawns multiple processes to then send that work to those processes to do that file transfer. So there is a risk with something breaking down in that process, and it's acknowledging postgres while it's still falling behind with the transfer. So you do have to make sure your configuration is correct and you're not seeing some of these delays, but it does enable you to respond to postgres faster and then archive those files to the ultimate destination in a more asynchronous manner. They talk about the configurations you want to do, including the full configuration for PG Backrest they did here, and they show a little bit how it works, where you have a process that does the immediate notification to postgres, as well as spawning multiple processes to actually do the transfer to S Three. So if you're interested in setting up PG backcrest with this asynchronous wall archiving? Definitely check out this blog post. [00:14:48] The next piece of content is replacing lines of code with two little reg x's in postgres. This is from Crunchydata.com and they're talking about they have a set of data and essentially they need to do some data cleanup. So the format that they want the data to be in is Cam and then three numerical digits. But these were human entered, so they're going to have some error. Some people, if it was supposed to be eight, they just put eight. Or instead of being Cam 59, they just put Cam 59. This stands for cameras. So it's number of cameras or they also used uppercase. So basically you can use a regex to correct this data rather than writing some code as they said here in Python or some other language to fix it. So here's the exact query he used to correct this data. So he uses a regex replace for each column that he wants to update and he basically looks for the last two digits and sets it up to put them at the end of this. And he explains exactly how this regular expression works to correct the data. Now, he also does that for the instance of the capital letters. So yes, you can do this with a regex and there should also be a way to do it in the same update statement to correct the case as well. Maybe using a nested one, or I would probably just use a lower function to do it, for example, maybe lowering the input before doing the regex search. But this is a very brief post about how to use regular expressions to correct data that you have received. [00:16:25] The next piece of content is deploy PG pool two on Kubernetes to achieve query, load balancing and monitoring. And this is from Bping blogspot.com. So the focus of this article is on PG pool two and setting it up to do query and load balance monitoring. And for monitoring they are using Prometheus in the Pgpool exporter to do that. So they're talking about setting up the process for working with Kubernetes and getting PG pool to work with it. So if you're interested, you can check out this blog post. [00:16:56] The next piece of content is Evolution of Tree Data Structures for Indexing. More exciting than it sounds. So this is a very in depth examination of Btree indexes, the different types and different considerations, how they work, and some more, so modern research into them. So if you're interested in that type of content, you can check out this blog post. And this is from earthelion info. [00:17:20] The next piece of content is PostGIS trajectory. Space plus time. This is from Rustprooflabs.com. They're talking about acquiring data and then not only doing a geographical examination, but also across the dimension of time. So you can see how this time has changed. And on the bottom here, tracking routes of different vehicles here and here. They even did a visualization of it. So if you're interested in this type of content, definitely check out this blog post. [00:17:50] Next piece of content is 2020 Pgasia Conference ended successfully at an unprecedented scale. This is from Haigo, CA. So if you want to learn more about the postgres conf in China and PGConf Asia, definitely check out this blog post. [00:18:06] And the last piece of content is the PostgreSQL Person of the Week is Stefan Schleichnich. So if you're interested in learning more about Stefan's contributions to postgres, definitely check out this blog post. [00:18:19] That does it. For this episode of Scaling PostgreSQL, you can get links to all the content mentioned in the show notes. Be sure to head over to Scalingposgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or itunes. Thanks.

Other Episodes

Episode 23

July 30, 2018 00:14:40
Episode Cover

UUIDs, Write Spikes, Distinct, SQL Standard | Scaling Postgres 23

In this episode of Scaling Postgres, we review articles covering UUIDs, the source of write spikes, uses of distinct and SQL standards support. To...

Listen

Episode 66

June 03, 2019 00:17:09
Episode Cover

Query Optimization, Normalization, Visualizing Vacuum, Sharding | Scaling Postgres 66

In this episode of Scaling Postgres, we discuss different query optimizations, normalizing to save space, visualizing vacuum and shardings future. To get the show...

Listen

Episode 181

September 06, 2021 00:21:58
Episode Cover

Connection Scaling, Cached Sequences, Message Queuing, Harmful Subtransactions | Scaling Postgres 181

In this episode of Scaling Postgres, we discuss connection scaling, how to cache your sequences, a message queuing implementation and the harms of subtransactions....

Listen