Database Decomposition, Fast Text Search, Understanding pg_stat_activity, Window Functions | Scaling Postgres 223

Scaling Postgres

Episode 223 | July 10, 2022 | 00:14:59

Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, we discuss the process of decomposing your database, how to search text in Postgres quickly, how to understand the output of pg_stat_activity and how to use window functions to solve a use case.

To get the show notes as well as get notified of new episodes, visit:

https://www.scalingpostgres.com/episodes/223-database-decomposition-fast-text-search-understanding-pg_stat_activity-window-functions/ 


Episode Transcript

[00:00:00] Speaker A: In this episode of Scaling Postgres, we talk about database decomposition, fast text search, understanding pg_stat_activity, and window functions. I'm Creston Jamison, and this is Scaling Postgres, episode 223. All right, I hope you, your friends, family and coworkers continue to do well.

Our first piece of content is "We are splitting our database into Main and CI". This is from GitLab.com, and this is an announcement on their blog where they're splitting their database. A lot of times, when you reach the limits of how far you can scale up your database, you look for alternative solutions. One way to handle this is to start developing microservices. Generally, each microservice has its own independent database, and the services communicate with each other, typically over HTTP. The other route you could take is a scale-out solution. This would be something similar to the Citus extension, where you essentially have multiple primary databases that are sharded out. So essentially, you're sharding your database. You could also do sharding from the application perspective, where the application itself knows where to find the data it needs. With that sharding, maybe you have a certain set of customers on shard one and a certain set of customers on shard two, but the databases are typically the same.

What they're doing in this case is something they're calling functional decomposition. They're taking their monolithic database and splitting off functionality into its own database. For example, they have a diagram here where they have their application: all writes go through PgBouncer to the main database, and then they have read-only replicas set up that they get their reads from. Now they want to decompose it and break off their continuous integration (CI) services to their own database cluster. So they're going to be removing approximately 50 tables from the main database and putting them into this dedicated database. But as you can tell, the structure will be the same: writes will still go through PgBouncer to this database cluster, which has multiple read-only replicas that the application can pull information from. Now, they are using Ruby on Rails, which has some additional features that have been added in the last version or two, I believe, that let it more easily speak to multiple databases, so that is probably why they chose this particular path. This is interesting because I have other clients who have chosen a similar route rather than sharding their database. Because they're dedicating databases to particular workloads, they anticipate that they'll be able to increase overall performance of the system and increase stability, so maybe CI services won't impact other parts of the system, and they're also trying to be able to ship just as fast. This does introduce some changes that need to be taken into account compared to operating a single monolithic database, because now you're dealing essentially with two databases. And it looks like they have some tools here that they developed: one called Loose Foreign Keys, and one for data migrations with multiple databases, so that when you do a migration, say adding an index or a column, you can target which database that command needs to be sent to.

Now, I'm posting some other links as well. This other one is "Decompose GitLab's database to improve scalability". This is actually an issue they created within GitLab that covers a lot of the work that was done for this, and it was created almost a year ago, so it's taken them about a year to get to this point. There is another post that talks about multiple databases, which is essentially advice given to their engineers about how to work with this split-database scenario. They talk about configuration, and they talk about one thing that's a priority, which is removing joins between tables in the CI database and tables in the main database. Their code could have joins across this functional boundary, but now that the tables are in separate databases, those joins won't work anymore. One solution they discuss is: could you just remove the code that was doing these joins? Do you really need to join to it? Another one: Rails has a capability where you can pull back IDs to the application, which then sends another query with those IDs to the database, and presumably you could do that in their scenario to avoid a join. Another is to denormalize some foreign keys into the table, so the data you're looking for is actually stored in that table. This would result in some duplication, but it would prevent crossing over to the other database. They also talk about denormalizing it into an extra table, so maybe you just maintain a separate copy of that table. But this is a fascinating look at one path to take when hitting a scaling ceiling. So yes, you could do microservices, or maybe you could use Citus as a scale-out extension, but this is another solution: decomposition, separating functional units out of your database into their own databases. If you want to learn more, definitely check out these three pieces of content.

[00:05:19] Speaker B: The next piece of content is the Rubber Duck Dev Show.

[00:05:21] Speaker A: Episode 49 is "Fast Text Search in Postgres", and this is a show that Chris and I did last Wednesday on fast text search. Basically, when you have a query where you're using LIKE or ILIKE or some regular expression to find an entry in a lot of text, what's the most efficient way to do that? Because a LIKE or an ILIKE is generally just going to scan the table; it's not going to use any indexes. Specifically, I show how you can use fast text search features to find those text instances really quickly. For example, one solution was 1,000 times faster, and another example was 10,000 times faster. Basically, the more data you have, the better an index will perform in terms of finding some text. So if you want to learn more about that, you can definitely check out this piece of content.
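To make the indexing point concrete, here is a minimal sketch of one approach along these lines, a trigram index via the pg_trgm extension, so that ILIKE-style searches can use an index instead of scanning the table. The table and column names are hypothetical, not taken from the episode:

-- Hypothetical table holding a large amount of text
CREATE TABLE documents (
    id   bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    body text NOT NULL
);

-- Enable trigram support and index the text column
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX documents_body_trgm_idx ON documents USING gin (body gin_trgm_ops);

-- This ILIKE search can now use the GIN index rather than a sequential scan
SELECT id
FROM documents
WHERE body ILIKE '%scaling postgres%';

Full text search (a tsvector expression with its own GIN index) is another common option when you want word-based matching rather than substring matching.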
[00:06:11] Speaker B: The next piece of content is "Understanding pg_stat_activity".

[00:06:13] Speaker A: This is from Depesz.com, and this post basically goes over every single column in pg_stat_activity and explains what it means and what it does. With regard to the rows, he further explains them by showing that there are different types of backends. When someone is making a query, generally that is a client backend: some session is connected to the database and is pulling data. But you're also going to find other Postgres processes within pg_stat_activity as well, such as the checkpointer, the autovacuum launcher, the archiver, the WAL writer, and things of that nature, and he explains all of these different processes too. Now, I'm not going to go through this whole thing, but he does a very good job of explaining each of these columns, what they mean, and how they can be useful. So if you want to learn more, definitely check out this blog post.
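For a quick feel for the view itself, here is a minimal example query of my own (not taken from the post) that shows what each session or background process is doing and how long its current query has been running:

-- What each connected session or background process is doing right now
SELECT pid,
       backend_type,
       state,
       now() - query_start AS query_runtime,
       left(query, 60)     AS current_query
FROM pg_stat_activity
ORDER BY query_runtime DESC NULLS LAST;

Background processes such as the checkpointer have no query, so their query_runtime is NULL and they sort to the end.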
[00:07:07] Speaker B: The next piece of content is "Handling Bonus Programs in SQL".

[00:07:09] Speaker A: This is from Cybertec-postgresql.com, and he's talking about a scenario where you have a bonus card that earns points, potentially by making purchases. He has a table here with a card number, a date, and the number of points earned. As you can see, there are two card numbers here, with dates and points being assigned on those particular dates. The question is: how many points does a particular person have? And as a wrinkle, he says those points expire after six months. So how can you get the total sum at a particular point in time? Basically a running total, and he's using window functions to do this. First, he does a window function using a range from six months preceding up to the current row. In this first iteration he actually does an array aggregation, so you can see which points would be included in the sum he's eventually going to do, and you can see that after six months those points essentially expire and are no longer included, and then it starts incrementing again from that point. So basically it's a six-month sliding window over which it presents the total. The next phase of the query actually does the sum, so you can see what the results are. And then he does it for every card number using PARTITION BY to partition by card number. This is a great example of using window functions to answer a real-world question. If you're interested in learning more, you can check out this blog post.
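For reference, here is a rough sketch of a query along the lines he describes; the table and column names are my own assumptions, not the ones from the blog post, and the RANGE frame with an interval offset requires Postgres 11 or later:

-- Hypothetical bonus card table
CREATE TABLE bonus_points (
    card_number text    NOT NULL,
    earned_on   date    NOT NULL,
    points      integer NOT NULL
);

-- Running total per card, counting only points earned in the preceding six months
SELECT card_number,
       earned_on,
       points,
       sum(points) OVER (
           PARTITION BY card_number
           ORDER BY earned_on
           RANGE BETWEEN INTERVAL '6 months' PRECEDING AND CURRENT ROW
       ) AS points_last_six_months
FROM bonus_points
ORDER BY card_number, earned_on;

The sliding six-month expiry comes from the RANGE frame: rows older than six months simply fall out of the window, so they stop contributing to the sum.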
[00:08:35] Speaker B: The next piece of content is "Choosing a PostgreSQL Number Format".

[00:08:38] Speaker A: This is from Crunchydata.com, and he's talking about how, when you're choosing a number format, ideally you want to store the data using the smallest amount of space that makes sense to reduce costs, you want to represent your data with the smallest amount of error, which of course is critically important because you don't want to give the wrong answer for a number, and you want to manipulate your data using the correct logic. The main way to store data is of course using integers, and that's basically the go-to format, with smallint, integer, and bigint at two, four, and eight bytes. They also offer floating-point numbers, which I've only really used for latitude and longitude when the accuracy wasn't that important. Floating point is of course inexact, but it allows very, very fast computations. If you want more accuracy, but it's going to be slower, then you use numerics. And if you're going to be storing currency, generally this is the type to use; definitely don't use a float, although I have heard of people using integers, where they just multiply the units to get rid of any decimals. He has a little bit of discussion on the difference between numerics and floats, particularly with regard to rounding, but you can feel free to check out this blog post.

[00:09:54] Speaker B: The next piece of content is "Data To Go: Postgres Logical Replication".

[00:09:57] Speaker A: This is from Crunchydata.com, and this is a post about logical replication, comparing it to streaming replication. Basically, they present different use cases. Streaming replication, as they say here, is used for high availability. It copies all the data from the primary database, replicating the whole schema and everything in place. It doesn't do any partial copy of information, and you have to be on the exact same version, so it has to be essentially binary compatible. Whereas with logical replication, you can copy all the data, but a lot of the use cases call for just copying a subset of the data on a regular basis, and you can replicate between two different operating systems or two different Postgres versions. They walk through the process of actually setting up logical replication, creating a publication and creating a subscription for it. So you can check out this blog post if you want to learn more.

[00:10:51] Speaker B: Next piece of content.

[00:10:52] Speaker A: "PostgreSQL Replication and Conflicts". This is from Percona.com, and they're talking about Postgres replication and its history, but then they go into replication conflicts. As you use a replica database, there's the potential for running into conflicts. You can actually check these conflicts by looking at the pg_stat_database_conflicts view, because that gives you an indication of how often they're occurring, and generally they'll appear in the log as well, of course. He goes over the different conflicts you can encounter, such as tablespace conflicts, locks, snapshots, buffer pins, deadlocks, and other types. So if you want to learn more about that, you can check out this blog post.

[00:11:34] Speaker B: The next piece of content is "State of PostgreSQL 2022: First Findings".

[00:11:37] Speaker A: This is from Timescale.com, and they're providing an introduction to some of the results from the State of Postgres survey. They cover the demographics of who was included, their rough geographical location, the number of years using Postgres, and the current positions of those who filled out the survey. They also present a few of the community questions as well as popular tools. Apparently the three most popular tools for queries and administration tasks are psql, pgAdmin, and then DBeaver. But you can check this out for all the details.

[00:12:15] Speaker B: The next piece of content is "Five Minutes of Postgres Episode 25: Postgres lock monitoring, lightweight locks and the log_lock_waits setting".

[00:12:17] Speaker A: This is from Pganalyze.com, and they covered a post that was discussing when you should be concerned about locking, which we covered last week on Scaling Postgres. In this particular piece of content, he goes more in depth on those locks and also suggests enabling log_lock_waits, because that will log locks that wait longer than your deadlock_timeout, which is generally one second. So definitely feel free to check out this piece of content as well for more insight into that.

[00:12:51] Speaker B: Next piece of content.

[00:12:55] Speaker A: "How to Log Selects from Specific Table". This is from Depesz.com, and this blog post covers exactly what it says. His recommendation is to use the pgAudit extension, and he shows you how to set it up. Basically, on Ubuntu you can just install it, add it to shared_preload_libraries (you do have to restart your database because you altered shared_preload_libraries), create the extension, then define a role and grant it auditing on the table of interest, and then it pretty much just works and logs the information to the Postgres log. Now, we did have an alternate way to do that, kind of a hacky way, creating a function and then a view that calls the function, but he says you really should just use the pgAudit extension. Again, if you want to learn more details, definitely check out this blog post.
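As a rough sketch of the kind of setup described, assuming shared_preload_libraries has already been set and the server restarted, and using a hypothetical role and table name:

-- In postgresql.conf (requires a restart):
--   shared_preload_libraries = 'pgaudit'

CREATE EXTENSION pgaudit;

-- Create a dedicated audit role and point pgAudit's object auditing at it
CREATE ROLE auditor NOLOGIN;
ALTER SYSTEM SET pgaudit.role = 'auditor';
SELECT pg_reload_conf();

-- Statement types granted to the audit role on an object are what get logged,
-- so granting SELECT on this (hypothetical) table logs selects against it
GRANT SELECT ON interesting_table TO auditor;

With that in place, SELECTs against the table show up as AUDIT entries in the regular Postgres log.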
[00:13:44] Speaker B: The next piece of content is "Queries in PostgreSQL: 5. Nested Loop".

[00:13:46] Speaker A: This is from Postgrespro.com, and they're talking about joins in general, but also nested loop joins and how they work, at a very, very detailed level. So if you're interested in that, definitely check out this blog post.

[00:14:02] Speaker B: The next piece of content is "What is SQL used for? To build environments where devs can experiment".

[00:14:04] Speaker A: This is from Timescale.com, and this is another interview with someone from the Postgres community, Nikolay Samokhvalov. So if you're interested in learning more about Nikolay and his contributions to Postgres, definitely check out this blog post.

[00:14:22] Speaker B: The last piece of content.

[00:14:23] Speaker A: The PostgreSQL Person of the Week is Julian Teshwa. If you're interested in learning more about Julian and his contributions to Postgres, definitely check out this blog post.

That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to Scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.

Other Episodes

Episode 281

September 10, 2023 00:17:12

HNSW Indexes, Vacuuming Bloat, Watch Me Now, Connections | Scaling Postgres 281

  In this episode of Scaling Postgres, we discuss the benefits and disadvantages of HNSW indexes for working with vector data, configuring vacuum to reduce...


Episode 224

July 18, 2022 00:13:57

Column Performance, BRIN Win, Unique and Null, Parallel Distinct | Scaling Postgres 224

In this episode of Scaling Postgres, we discuss how too many columns in a table can affect performance, at what point BRIN indexes win...


Episode 38

November 12, 2018 00:09:53

Parallel Query, Real Time Analytics, Killed Index Tuples | Scaling Postgres 38

In this episode of Scaling Postgres, we review articles covering parallel queries, real time analytics, killed index tuples and unexpected downtime. To get the...
