SQL Data Analysis, pgBouncer Setup, Understanding Deadlocks, Tuning & Optimization | Scaling Postgres 163

Episode 163 | May 02, 2021 | 00:15:42

Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, we discuss doing data analysis with SQL, sophisticated pgBouncer setups, understanding how to avoid deadlocks, and an introduction to tuning and optimization.

To get the show notes as well as get notified of new episodes, visit: 

https://www.scalingpostgres.com/episodes/163-sql-data-analysis-pgbouncer-setup-understanding-deadlocks-tuning-optimization/


Episode Transcript

[00:00:00] In this episode of Scaling Postgres, we talk about SQL data analysis, a pgBouncer setup, understanding deadlocks, and an introduction to tuning and optimization. I'm Creston Jamison and this is Scaling Postgres, episode 163.

[00:00:25] All right, I hope you, your friends, family and coworkers continue to do well. Our first piece of content is "Practical SQL for Data Analysis". This is from hakibenita.com and it has the subtitle "What you can do without Pandas". This is a tour de force blog post; it's almost like a little ebook. It's so long that I can't cover all of its content here, and it even has its own table of contents. Basically it's talking about Pandas, which, to my understanding, is a Python library used in data science for data analysis. The post shows how you can use SQL right in the database, particularly Postgres, to answer a lot of the questions you would normally answer with Pandas.

Now, this post is so long that I'm primarily just going to walk through the table of contents and speak briefly to it. But even if you don't read the full post, I'm going to be saving it as a reference for looking up different techniques for doing data analysis. In the first part, on SQL versus Pandas performance, he talks about how analysis with Pandas requires pulling all the data in from some data source and processing it there, whereas it's usually more efficient to do some of that data extraction within SQL. For example, only pull the columns you need, only pull the rows you need, and do grouping in the database itself rather than within Pandas or within an application, so this is applicable to applications as well. He also talks about memory performance, because rather than pulling down a whole copy of the data, it's again much more efficient to pull only what you need.

He then goes into different tools that Postgres and other database systems offer, such as common table expressions for iteratively building a query, generating data, selecting random data, and sampling with TABLESAMPLE. He covers rollups, cubes and grouping sets, window functions, and conditional expressions within a SELECT. He also goes into linear regression, interpolation and binning, and how to do that type of analysis using just SQL. So this is a very comprehensive post that I highly suggest you check out.
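To give a flavor of the kind of in-database analysis the post covers, here is a minimal sketch of a rollup and a window function. The sales table and its columns are made up for illustration and are not from the post:

    -- Hypothetical table used only for illustration
    CREATE TABLE sales (
        sold_at  date,
        region   text,
        product  text,
        amount   numeric
    );

    -- Subtotals per region and product plus a grand total, computed in the database
    SELECT region, product, sum(amount) AS total
    FROM sales
    GROUP BY ROLLUP (region, product);

    -- A window function: a running total of sales within each region
    SELECT sold_at, region, amount,
           sum(amount) OVER (PARTITION BY region ORDER BY sold_at) AS running_total
    FROM sales;

The point is simply that the grouping and windowing happen inside Postgres, so only the summarized rows ever leave the database.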
The next post is "PgBouncer Setup: Extended Architecture, Use Cases and Leveraging SO_REUSEPORT". This is from enterprisedb.com, and similar to the first post, it's another tour de force covering a lot of different items with regard to pgBouncer. They reference a post from late last year, which we also covered in a previous episode of Scaling Postgres, about spinning up multiple instances of pgBouncer using the same port. The reason this is of interest is that pgBouncer is single-threaded. If you run it on a dedicated pgBouncer machine, that machine nowadays probably has two, four or eight CPU cores, so if all you're running is a single copy of pgBouncer, you're not utilizing all of the machine's resources, and you're going to want to run multiple instances of it.

And basically, if you're able to reuse the same port, you essentially get load balancing between the instances, because connections come into that port and can be handled by any of the processes servicing it. The previous post talked about this, but this one goes into more depth on different ways to set it up, including installing from source, because they mention some edge cases the previous post didn't cover. And even if you can't build from source, they present another technique that achieves the same thing: reuse the same pgBouncer port and set up multiple pgBouncers on an instance so they can leverage more than one CPU core. They go through the whole process of setting that up, and that's just the first half of the post.

The second half covers innovative ways you can use SO_REUSEPORT. One is that it helps you scale pgBouncer, since you'll have multiple pgBouncers running. It also gives you load balancing, because connections are distributed across the running pgBouncer services. You can also do read/write or read-only routing by setting up particular databases in pgBouncer, and they give an example configuration that does just that. Then they talk about running pgBouncer in what they call a hostile environment, meaning you could potentially get a lot of connections hitting it, and ways to head off denial-of-service attacks. They finish up with an architecture for setting up pgBouncer for high resiliency. So this is another really comprehensive post about pgBouncer that I highly suggest you check out.
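As a quick way to sanity-check a setup like this, each pgBouncer process exposes an admin console (the special pgbouncer pseudo-database) that accepts SQL-style SHOW commands. This is only a sketch; the port and user shown are assumptions, not values from the post:

    -- Connect to one instance's admin console, e.g.: psql -p 6432 -U pgbouncer pgbouncer
    SHOW POOLS;    -- per-database/user pools with active and waiting client counts
    SHOW STATS;    -- request and traffic counters per database
    SHOW CLIENTS;  -- the client connections this particular process is handling

Running these against each pgBouncer instance sharing the port lets you confirm that incoming connections really are being spread across the processes.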
The next piece of content is "PostgreSQL: Understanding Deadlocks". This is from cybertec-postgresql.com, and the first thing they cover is how a deadlock happens. Basically, you have two transactions running at the same time and two rows in a table. One transaction starts an update, so it locks the first row. Another transaction starts an update and locks the second row. Then the first transaction tries to update the second row and has to wait for that lock to be released, and it continues to wait. Now the second transaction tries to update row one, so each is waiting on a lock the other holds. That's where the deadlock_timeout setting comes in: by default it's one second, and if the situation hasn't resolved within that second, one of the transactions is chosen to win. That one succeeds with its update, while the other gets a deadlock error and is aborted, so one transaction commits and the other does not.

Then they talk about how to fix and avoid deadlocks, and I agree with him here that there is no magic configuration parameter. The most important thing is ordering your transactions. [00:07:20] For example, imagine you have 100 transactions happening concurrently, which in large installations can definitely happen. You don't want the case where each transaction has a set of rows to update, but they touch them in different orders: one updates row one and then row two, while another updates row two and then row one. What you want is for the order to be consistent across your transactions, say by updating rows in order of their ID, so you never get the case where they are updated out of order. Even if transactions are updating the same rows, as long as they do it in the same order, you will never get a deadlock. That's the most important thing to know about deadlocks: the order matters. You may get delays while one transaction waits for another to complete, but as long as things happen in the same order, you won't get a deadlock. If you want to learn more about that, you can definitely check out this post.
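As a minimal sketch of that ordering idea (the accounts table and ids here are made up, not from the post), every transaction that touches the same rows should lock them in the same order, for example ascending by id:

    BEGIN;
    -- Lock all the rows this transaction will touch, always in ascending id order
    SELECT id FROM accounts WHERE id IN (1, 2) ORDER BY id FOR UPDATE;

    UPDATE accounts SET balance = balance - 100 WHERE id = 1;
    UPDATE accounts SET balance = balance + 100 WHERE id = 2;
    COMMIT;

A concurrent transaction doing the same thing simply waits at the SELECT ... FOR UPDATE until the first one commits; it can never hold row 2 while waiting for row 1, so no lock cycle can form.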
[00:08:25] The next piece of content is "An Introduction to PostgreSQL Performance Tuning and Optimization". This is from enterprisedb.com, and it's another tour de force blog post that again has its own table of contents because of all the content it covers. They focus on what to do with the bare metal configuration, how to tune the operating system, all the different configuration parameters in Postgres itself, and then how to analyze different queries and workloads. So again, a great reference post to keep around when you have questions about configuration and tuning.

The next piece of content is "Implementing Incremental View Maintenance for PostgreSQL, Part 1". This is from yugo-nagata-pgsql.blogspot.com, and he's talking about a feature they've started working on, so nothing is final yet: IVM, or Incremental View Maintenance, specifically with regard to materialized views. A view is just a virtual table, whereas a materialized view is essentially its own physical table holding the rows it references. But that makes it a cached view, and periodically you need to refresh it from the base tables with REFRESH MATERIALIZED VIEW. If there is a lot of data behind the view, refreshing it can take a very long time. What they're looking into is a way of doing incremental view maintenance, where the materialized view stays updated based on changes to the base tables, without you having to manually set up triggers or anything like that. Apparently it may be using triggers under the covers that we can't see, but basically, when you update a base table, the materialized view gets updated if it's defined as incremental. It may be a while before this lands in Postgres, but it's definitely a great feature they're working on, and if you want to learn more about it, you can check out this post.

[00:10:37] The next piece of content is "Choice of Table Column Types and Order When Migrating to PostgreSQL". This is from crunchydata.com, and they're talking about the data types you should choose when migrating to Postgres. When migrating numeric types, they have recommendations on what to convert them into. Generally, if it's just an integer, use int or bigint. If it's a decimal, use the numeric or decimal type, though floats and reals are also available if you don't need exact precision. They discuss that through the post, and then they go into alignment: if you want to conserve space, the best way to lay out a table is to put your largest fixed-size types first, followed by your smaller fixed-size types, so bigints first and then ints. Because of alignment padding, you can actually end up with wasted space if you order things differently. So first your large fixed-size types, then your smaller fixed-size types, and then your variable-length types, such as text strings and things of that nature. If you want to learn more about how that works, you can definitely check out this post.

[00:12:01] The next piece of content is "CHAR(1) to Boolean transformation while migrating to PostgreSQL", and they're talking about converting from Oracle to Postgres. Oracle doesn't really have a boolean data type, so the post covers how to handle that conversion when moving to Postgres. They also talk about ways to set up check constraints to filter out rows coming in from Oracle. So this post goes into how you can transform a CHAR(1) into a boolean; if you're interested in converting data from Oracle to Postgres, definitely check out this technique.

[00:12:41] The next piece of content is "Dramatical effect of LSE instructions for PostgreSQL on Graviton2 instances", and this is from akorotkov.github.io. He's talking about the Graviton2 processors available on EC2 instances. Just so you know, they're using the development branch of PostgreSQL 14, so this is not a released version but something in active development. He looks at the different instructions enabled in each build and then at transaction performance across the number of clients. As you can see, when LSE instructions are enabled for the Graviton2 processors in the Postgres 14 development branch, you get dramatically more consistent transactions per second at high client counts, and not so much with the base development branch. So definitely interesting results for the Graviton2 processors, and as we move forward, more people may start using these types of Arm processors as opposed to just Intel or AMD. If you're interested in that, definitely check out this blog post.

The next piece of content is "Getting Started with PostgreSQL Set Operators". This is from arctype.com, and they're talking about different ways of combining two queries. You can do a UNION, or a UNION ALL, which doesn't remove duplicates; you can do an INTERSECT of two result sets; and you can use the EXCEPT keyword to subtract one result set from another, as well as order the resulting set. So if you want a simple post on how to do set operations with Postgres, definitely check out this blog post.
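As a minimal illustration of those operators (the two tables here are hypothetical, not from the post):

    -- UNION removes duplicate rows across the two queries; UNION ALL keeps them
    SELECT customer_id FROM orders_2020
    UNION
    SELECT customer_id FROM orders_2021;

    -- INTERSECT: customers who appear in both result sets
    SELECT customer_id FROM orders_2020
    INTERSECT
    SELECT customer_id FROM orders_2021;

    -- EXCEPT: customers from 2020 who are absent in 2021, with the final result ordered
    SELECT customer_id FROM orders_2020
    EXCEPT
    SELECT customer_id FROM orders_2021
    ORDER BY customer_id;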
[00:14:31] The next piece of content is "Pgpool-II Logging and Debugging". This is from b-peng.blogspot.com, and they're talking about doing logging and debugging in Pgpool-II. So if you're interested in configuring that, you can check out this blog post.

[00:14:45] The next piece of content is "TLS: Demystifying communication encryption in PostgreSQL". This is from cybertec-postgresql.com. Now, this post mostly covers TLS itself; there's really not too much mention of Postgres, but if you want a background on TLS and how its encryption works, this is definitely a post to check out.

And the last piece of content: the PostgreSQL Person of the Week is Corey Huinker. If you're interested in learning more about Corey and his contributions to Postgres, definitely check out this blog post.

That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to ScalingPostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.

Other Episodes

Episode 136

October 19, 2020 00:13:08

Stat Tracking, Temporary Tables, pg_squeeze, pg_auto_failover | Scaling Postgres 136

In this episode of Scaling Postgres, we discuss stat tracking, temporary tables, a new utility called pg_squeeze and an update to pg_auto_failover. To get...


Episode 105

March 16, 2020 00:19:22

World's Best, Subtransactions, Sharding, Schema Changes | Scaling Postgres 105

In this episode of Scaling Postgres, we discuss how Postgres is the world's best database, performance ramifications of subtransactions, the current state of sharding...


Episode 63

May 12, 2019 00:12:07

Slow Queries, pg_auto_failover, Roles, Hyperscale | Scaling Postgres 63

In this episode of Scaling Postgres, we review articles covering how to identify slow queries, the pg_auto_failover utility, working with roles and Hyperscale. To...

Listen