Scaling Connections, TimescaleDB, Time-series Gaps, Red & Golden Signals | Scaling Postgres 138

Episode 138 | November 01, 2020 | 00:16:56
Scaling Postgres

Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, we discuss scaling connections, the release of TimescaleDB 2.0, how to find time-series gaps, and monitoring RED & Golden Signals.

To get the show notes as well as get notified of new episodes, visit: 

https://www.scalingpostgres.com/episodes/138-scaling-connections-timescaledb-time-series-gaps-red-golden-signals/

Episode Transcript

[00:00:00] In this episode of Scaling Postgres, we talk about scaling connections, TimescaleDB, time-series gaps, and RED and Golden Signals. I'm Creston Jamison, and this is Scaling Postgres episode 138.

[00:00:22] All right, I hope you, your friends, family and coworkers continue to do well. Our first piece of content is "Improving Postgres connection scalability: snapshots". This is from Citusdata.com, and it's talking about scaling Postgres to handle more connections. Normally you have to start using PgBouncer once you get up above, say, 200, 300, 400, 500 connections. But this work is looking to scale beyond that without necessarily having to rely on a separate pooler like PgBouncer or pgpool, because you get a number of benefits, such as being able to maintain sessions and use prepared statements, that you can't get when you do transaction pooling in something like PgBouncer. We had covered this in a previous episode of Scaling Postgres, but this post actually describes the changes that have been made. These improvements have been contributed to Postgres 14, which is due to be released in the third quarter of 2021, so next year, next fall. Now, he kicks off mentioning the performance improvements first, but the area he decided to work on was snapshots. So, related to MVCC being able to maintain multiple versions of the database for different users to view, based upon what kind of transactions they are running, there's the necessity to track snapshots of what transactions can see, and basically he went in and optimized that. I'm basically just focused on the graphs here, where he goes from 1 to 10 to 100 to 1,000 to 10,000 connections to the database. Of course, he tested more points than that, but that's what the graph indicates here. With the enhancement he made, you don't really see anything until maybe the 90-connection mark, and up until that point they are identical. The blue indicates pre-patch, and the orange or yellow indicates post-patch, and this is TPS (transactions per second) for a pgbench workload. Like I said, up to 90 connections there was no real difference, but at that point you start seeing a little disparity until around 5,600, and then the difference starts growing significantly. So much so that the patched version is between two and three times more efficient once you get out to 10,000 or more connections. So really great performance improvements with high numbers of connections, and you can also see that the performance doesn't drop off as much as it does pre-patch. Then he looked at the state of unused connections. Basically, along the X axis he's tracking the number of idle connections up to around 10,000 connections, and you can see that with one active connection and 10,000 inactive connections, essentially your TPS throughput gets cut in half. But after you apply this patch, it remains consistent up to that 10,000-connection mark. And then even with 48 active connections, the performance fares even better: you see basically no difference regardless of the number of inactive connections. So another huge performance improvement if you're wanting to use a lot of connections, and this is great news for Postgres 14. Now, the rest of this post covers the changes that were made to the snapshotting process, and he mentions all the different commits and all the related information. I'm not going to cover it all here, but he talks all about the different bottlenecks he encountered and what areas he covered. So if you're really interested in the details, I highly encourage you to check out this blog post. And he even has the raw data of everything that was measured at the bottom here. So this is a great, great post, and I definitely suggest checking out this post from Citusdata.com.
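As an aside related to that idle-connection discussion, and not something from the article itself, a quick way to see how many of your own client connections are actually doing work versus sitting idle is to group pg_stat_activity by state and compare the total against max_connections:

    -- Count client connections by state (active, idle, idle in transaction, ...).
    SELECT state, count(*)
    FROM pg_stat_activity
    WHERE backend_type = 'client backend'
    GROUP BY state
    ORDER BY count(*) DESC;

    -- The configured connection ceiling, for comparison.
    SHOW max_connections;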
[00:03:57] The next post is "TimescaleDB 2.0: a multi-node, petabyte-scale, completely free relational database for time-series". This is from Timescale.com. Now, normally I don't cover product announcements, but this was a pretty significant announcement I wanted to draw your attention to. With this release, the fact that they're releasing it for free and have cloud versions available makes for a very interesting usage model, and all the different features they describe, such as being able to scale out their distributed hypertables to multiple nodes, plus the fact that it's built for time series, make it a very interesting product at this point, I feel. Now, this post goes into a lot of detail about what TimescaleDB is. It's basically an extension of Postgres. They debunk certain myths, so it does have a lot of marketing information, but overall I find what the product offers very interesting. For example, going from a single node to two, four, or eight data nodes, you get almost linear scalability. They also get huge performance benefits by doing essentially column-based storage of data within indexes, along with compression that can get huge: they say up to 94% compression for certain data they're working with. So it's just a very interesting blog post, and if you're looking for a scale-out solution for Postgres, to my knowledge there is Citus, but now also TimescaleDB. So if you're interested in a time-series database or scale-out solutions, apart from what you can do with the community version of Postgres, maybe you want to check this out.

[00:05:34] The next post is "Detecting gaps in time-series data in PostgreSQL". This is from EndPoint.com. They had a data collection scenario with some gaps in the collected data, and they wanted to detect where those gaps were, so this is basically how they did it with SQL. The main function they leaned on was generate_series. To find missing rows that were date-based, they used generate_series at one-day intervals to produce the full set of expected dates. Then they did a left join against the table of interest and checked for nulls. They show this query right here, and basically wherever the join comes back null, that's the day and the particular batch that is missing, so they can then go and look for the data. They also had a datetime field that they needed to check, where they expected something to be recorded about every ten minutes. For this purpose they again used generate_series, but at a ten-minute interval. They did the same kind of join, but this time as a lateral join, so for each generated row they ran a specific query against the table of interest; that's what LATERAL gives you. Using that, they were able to find the gaps. So this is a pretty interesting blog post using a technique to find gaps in time-series data, and if you're interested, you can check out this post.
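To make the technique concrete, here is a minimal sketch of the generate_series approach described above. The readings table, its column names, the batch id, and the date ranges are all hypothetical, and these are not the article's exact queries:

    -- Daily gaps: generate every expected day, left join to the table of
    -- interest, and keep the days where nothing joined.
    SELECT d.day::date AS missing_day
    FROM generate_series('2020-01-01'::date, '2020-12-31'::date,
                         interval '1 day') AS d(day)
    LEFT JOIN readings r
           ON r.reading_date = d.day::date
          AND r.batch_id = 42
    WHERE r.reading_date IS NULL;

    -- Ten-minute gaps: generate the expected slots and use a LATERAL probe
    -- per slot to check whether any reading landed inside it.
    SELECT s.slot AS missing_slot
    FROM generate_series('2020-01-01 00:00'::timestamptz,
                         '2020-01-02 00:00'::timestamptz,
                         interval '10 minutes') AS s(slot)
    LEFT JOIN LATERAL (
        SELECT 1 AS found
        FROM readings r
        WHERE r.recorded_at >= s.slot
          AND r.recorded_at <  s.slot + interval '10 minutes'
        LIMIT 1
    ) hit ON true
    WHERE hit.found IS NULL;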
[00:06:55] The next piece of content is "PostgreSQL RED & Golden Signals: getting started". They're talking about a monitoring approach using what are known as RED or Golden Signals, and they have links to each of these methodologies, RED being an acronym for Rate, Errors, and Duration. For example, when doing monitoring you want to monitor the rate, say the number of requests per second your services are serving; you want to track the errors, the number of failed requests per second; and you want to track the duration, the time spent servicing different requests. So they looked at this through PostgreSQL and how you can get this information from the system views. The first is requests, and what they advocate using is the pg_stat_statements view, just doing a sum of all the calls from pg_stat_statements. Now, keep in mind pg_stat_statements is a cumulative record of all the queries being run, so you basically need to take the difference between the previous sample and the current one to get a sense of how many calls were done within a particular time period. Next they look at errors. There's not a great way to do errors, but essentially they chose to use rollbacks, so they looked in pg_stat_database at the transaction rollbacks. I can imagine scenarios where rollbacks aren't always errors, for example maybe you roll back on purpose when processing something, but this is what was available within the system view. Then they looked at duration, and that's basically looking at pg_stat_statements again, at a sum of the total time it took to process the different queries. Now, with the Golden Signals they also cover saturation, which is basically where a request has to get queued, and you can imagine a scenario where that means, say, something is behind a lock. So if you wanted, you could just look at how many locks are occurring; that's one way. But what he likes looking at here is pg_stat_activity, tracking how many queries are in an idle state, idle in transaction, active, or waiting. So if you're looking for a different way to monitor your Postgres installation using RED or Golden Signals, maybe check out this post from Dataegret.com.
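As a rough illustration of pulling these signals from the system views, the sketches below are in the spirit of the post, not its exact queries; since the counters are cumulative, you would sample them periodically and diff consecutive samples to turn them into rates:

    -- Rate and duration: cumulative call counts and execution time.
    -- (The column is total_exec_time on Postgres 13+, total_time on older versions.)
    SELECT sum(calls)           AS total_calls,
           sum(total_exec_time) AS total_time_ms
    FROM pg_stat_statements;

    -- Errors, approximated by transaction rollbacks (also cumulative).
    SELECT datname, xact_rollback
    FROM pg_stat_database
    WHERE datname = current_database();

    -- Saturation: client backends currently waiting on something.
    SELECT count(*) AS waiting_backends
    FROM pg_stat_activity
    WHERE backend_type = 'client backend'
      AND wait_event_type IS NOT NULL;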
[00:09:00] The next piece of content is "PostgreSQL clustering: vip-manager". This is from Cybertec-postgresql.com, and vip-manager is a virtual IP manager. They're talking about using it with, say, the Patroni stack. In that stack you have an application that speaks to whatever device is going to be routing the connection to the primary database. In this case that could be vip-manager: the application points to the virtual IP that's been assigned, and that connects to whichever node is currently the primary. What vip-manager does, at least when working with Patroni, is consult the consensus store, which could be, say, etcd, to determine which node is the primary, and it routes that virtual IP address to the correct physical address of the current primary. That's pretty much all the virtual IP manager does. Then the post goes into how you would do the configuration. It basically needs to know where the consensus store is so it can read it, find the entry for the primary database, and know what virtual IP to use to do the redirection. So it's a very brief post about how you can set up the virtual IP manager, specifically when using Patroni.

[00:10:11] The next piece of content is "Migrating interactive analytics apps from Redshift to Postgres, ft. Hyperscale (Citus)". This is from Citusdata.com. Now, a lot of this post is marketing information highlighting Citus as an alternative solution to Redshift, which of course it is. But the thing I found interesting was that with Redshift, because it's column-based, you can't really add indexes the way you can with Postgres's standard row store, and by adding indexes for this specific use case they were able to get a 2x performance gain versus Redshift. Next, because Citus runs a more recent version of Postgres compared to Redshift, you're able to use JSONB to store semi-structured data. Previously the customer had to store it in large text fields in Redshift, whereas here they could use JSONB, which enabled them to store the data in a much more compressed format and use GIN indexes to query it efficiently. They mention that the JSONB data type in Postgres gave six to seven times compression compared to how they had to store the data in Redshift. So this was a pretty interesting post on some of the advantages of using Postgres. First they tried doing it on a single node; that didn't quite suit them, so they went to a dual-node setup, which you can of course do when you're using Citus. So if you're interested in that, you can check out this post from Citusdata.com.

[00:11:43] The next piece of content is "Joins using LIKE, or why PostgreSQL FTS is a powerful alternative", and by FTS they mean full text search. They had a customer who was getting slow performance when using the LIKE operator with joins. They tried at first to use a btree_gin index, which lets you combine B-tree-indexable columns with GIN indexing, but they still didn't get great performance; as a matter of fact, the query was returning in 536 seconds, so pretty slow. But they found they were able to use full text search instead of the LIKE queries. So for full text search they added a generated column using tsvector to hold what they needed to search on, added a GIN index, and once they rewrote the query to use it, they got down to 1.5 seconds to run the query. So here's a case where searching with a simple LIKE or ILIKE ended up being much less efficient than using full text search with a GIN index. If you're interested in learning more, you can check out this post from Postgrespro.co.il.

[00:12:59] The next piece of content is "PostgreSQL: foreign keys and insertion order in SQL". They're talking about the case where you have foreign keys set up. In this example they have a currency, a location, a product, a product description, and product stock that you're going to be inserting values into. When you have foreign keys set up between each of them, there are essentially dependencies: for example, a currency must exist before you insert a new product, because the product references it. And if you have a large hierarchy of dependencies set up, it can be difficult to determine what you need to insert first if you're new to a particular database. Here they're presenting a recursive query that reads through the system tables and determines what needs to be inserted first. When running it against the sample schema they had here, it shows which level-one data needs to be inserted first: so you first need currencies and locations, then you can insert your products, then your product descriptions and your stock. So it's a pretty interesting query for finding out this dependency information for your foreign keys.
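The post has its own recursive query; as a hedged sketch of the same idea (not their exact SQL, and ignoring self-referencing and circular foreign keys), you can walk pg_constraint to assign each table an insertion level:

    WITH RECURSIVE fk_levels AS (
        -- Level 1: tables in the schema with no outgoing foreign keys.
        SELECT c.oid AS rel, 1 AS level
        FROM pg_class c
        JOIN pg_namespace n ON n.oid = c.relnamespace
        WHERE c.relkind = 'r'
          AND n.nspname = 'public'
          AND NOT EXISTS (SELECT 1 FROM pg_constraint fk
                          WHERE fk.contype = 'f'
                            AND fk.conrelid = c.oid
                            AND fk.confrelid <> c.oid)
        UNION ALL
        -- A table referencing a level-N table must be inserted after it.
        SELECT fk.conrelid, l.level + 1
        FROM pg_constraint fk
        JOIN fk_levels l ON l.rel = fk.confrelid
        WHERE fk.contype = 'f'
          AND fk.conrelid <> fk.confrelid   -- skip self-references
    )
    SELECT max(level) AS insert_level, rel::regclass AS table_name
    FROM fk_levels
    GROUP BY rel
    ORDER BY insert_level, table_name;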
So if you're interested in learning more about that, you can check it out at Cybertec-postgresql.com.

[00:14:10] The next piece of content is actually a YouTube channel that I found in my feed called Postgres TV. They have some older content that's been posted, but more recently, for the last four weeks or so, they've been publishing episodes on about a weekly basis. So if you're interested in more Postgres content, you can check out these live shows posted to Postgres TV.

[00:14:35] The next piece of content is "Getting started with Postgres 13 on Ubuntu 20.04". This goes through the entire process of setting up Postgres, configuring it for external access, and doing some basic backups. It covers the installation process, getting the Postgres repository set up to be able to install the most recent version, which is 13, setting up the database cluster, how you would go about starting and stopping it, where you would find the log files, how you configure access across the network, creating users and setting up a database, as well as some basic backup and restore commands. So if you're interested in learning more about that, you can check out this post on PG IO.

[00:15:22] The next piece of content is "Using PostgreSQL and SQL to randomly sample data". This is from Crunchydata.com, and it's the next post in the series on doing scientific analysis with data from the California wildfires. In this post they need to take a sample of non-fire data, and they wanted it to be the same amount of data as the fire data in order to start developing their models, because they wanted to do a logistic regression. So this goes through the process of taking a sample of the data using the TABLESAMPLE keyword, so that they can set up data sets for both training their model and verifying it. So if you're interested in that, you can check out this post from Crunchydata.com.

[00:16:08] The next piece of content is a post about a new book that's been released called Learn PostgreSQL. So if you're interested in more learning material, such as a book, they have the outline here and a link to the book.

[00:16:20] And the last piece of content is that the PostgreSQL Person of the Week is Stefan Fercot. So if you're interested in learning more about Stefan and his contributions to PostgreSQL, definitely check out this blog post.

That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to Scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.

Other Episodes

Episode 57

March 31, 2019 00:14:24

Checking Vacuum, Fast Distinct Count, Group By Speed Up, Health Check | Scaling Postgres 57

In this episode of Scaling Postgres, we review articles covering checking on vacuum, how to get a fast distinct count and speed up group...

Listen

Episode 64

May 20, 2019 00:13:12

Leveraging Indexes, Slugs, Addresses, Security Definer | Scaling Postgres 64

In this episode of Scaling Postgres, we discuss leveraging indexes, a slug function, addresses and security definer in functions. To get the show notes...

Listen

Episode 70

July 01, 2019 00:12:57

Default Security, Max Connections, Progress Reporting, Migration | Scaling Postgres 70

In this episode of Scaling Postgres, we discuss the default security, max_connections and pooling, progress reporting and migrating to Postgres. To get the show...

Listen