Use Bigint, High Availability, Autocommit, Synchronized Sequential Scans | Scaling Postgres 169

Episode 169 June 14, 2021 00:18:46

Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, we discuss only using bigint, high availability, Postgres autocommit and synchronized sequential scans.

To get the show notes as well as get notified of new episodes, visit: 

 https://www.scalingpostgres.com/episodes/169-use-bigint-high-availability-autocommit-synchronized-sequential-scans/


Episode Transcript

[00:00:00] In this episode of Scaling Postgres, we talk about use bigint, high availability, autocommit, and synchronized sequential scans. I'm Creston Jamison, and this is Scaling Postgres episode 169. [00:00:23] All right, I hope you, your friends, family and coworkers continue to do well. Our first piece of content is "Use BIGINT in Postgres". This is from Rustprooflabs.com, and basically this is a blog post describing how he typically used ints, but then around two to three years ago converted to only using bigints. And this blog post actually goes through a bunch of tests to see what the actual performance difference is between ints and bigints. And basically the conclusion I came to looking through this is that it's maybe about a 6-7% difference based upon some of the things he tested. Now of course, the big question is why default to using bigints, given that a bigint is about twice the size. So depending on your data, you could be storing more, and then the performance will be different because you'll have a bigger data type to work with and more data in there. But he says the main reason to go with bigint is that you avoid having to convert an int to a bigint, which is a pretty big inconvenience to have to deal with, particularly if you're talking about a primary key. So with primary keys today, if you have any thought that the database is going to get any kind of usage, I would definitely go bigint. And even if it wasn't, I would still probably consider bigints, because with fast SSDs, as he's saying here, sufficient RAM and CPU, the performance difference should be relatively negligible. Now, it looks like he was testing this on an eight-core CPU with 16 GB of RAM on Ubuntu 20.04 and Postgres 13, and used pgbench to test some different scenarios. Now, some of the selects were a little bit slower comparing int to bigint, and then he tests some indexes, and frequently he says the test results are negligible, but some of the differences are 6-7%, and I probably wouldn't notice that kind of difference either. So I think this has big benefits if you're using it for primary keys, because there's not a risk of overrunning a bigint compared to an int. But he says there may be some reasons you don't want to use bigint. Like if you know there are only certain values being stored within a field, say you'll only ever have ten different integers from a foreign key to something, then it doesn't make sense to make it a bigint. And he also suggests maybe you have low-power devices or an embedded use case, or maybe your network bandwidth is limited. But overall, I definitely agree with the recommendation to choose bigints over ints. And if you want to learn more about his testing procedures, go ahead and check out this blog post.
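As a quick illustration of that recommendation, here is a minimal sketch of a table keyed on bigint from the start, so it never needs an int-to-bigint migration later. The table and column names are hypothetical and only for the example.

    -- Hypothetical table using a bigint identity primary key.
    CREATE TABLE orders (
        id          bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        customer_id bigint NOT NULL,
        created_at  timestamptz NOT NULL DEFAULT now()
    );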
[00:02:58] The next piece of content is "PostgreSQL HA with Patroni: Your Turn to Test Failure Scenarios". This is from Percona.com, and they did a presentation at Percona Live Online looking at testing failover scenarios using HA with Patroni and Postgres. This blog post goes through the entire setup of doing it. So they set up a three-node etcd cluster, configured a software watchdog, installed PostgreSQL, got Patroni up and running to manage the Postgres instances, and then set up HAProxy to be able to direct connections either to a write port, which would be the primary, or a replica port, which would be the reads. They then tested workloads using a tool they developed in Python, and ran a bunch of experiments covering different failure scenarios, such as losing network connections from either the replica or the primary, unplugging the network cable from both, simulating a power outage, simulating a segfault by killing the postmaster process in Postgres, killing Patroni, creating CPU saturation, as well as just doing a manual switchover, to see how the cluster recovered and roughly how long that took. So if you're interested in setting up your own high availability scenario, maybe you want to check out this post from Percona.com as well as the accompanying talk. [00:04:31] The next piece of content is "Disabling autocommit in PostgreSQL can damage your health". This is from Cybertec-postgresql.com, and they're talking about autocommit. That is the behavior whereby Postgres wraps every statement you send to the database in a transaction. Say you're at a psql prompt and you send a statement: it's automatically wrapped in a transaction. Now, you can manually define the transaction boundaries by using BEGIN and then concluding with a ROLLBACK or COMMIT to actually commit that transaction or roll it back. But if you're just doing a quick query or a quick insert statement, it is automatically wrapped in a transaction that auto-commits it. That's a little different from Oracle, where you actually have to explicitly give the COMMIT or ROLLBACK statements. Now, there's nothing in the database that allows you to turn off this autocommit at the server level. Apparently such a setting was added back in 2002, but they quickly removed the feature in 2003 due to the issues it caused. But it is still the case that you can do it on the client. So with the client that you're using, you can actually turn that autocommit off, and he mentions various places you can do it. In psql, you can run \set AUTOCOMMIT off. In the JDBC driver, there's an autocommit setting you can change, or in psycopg or pgAdmin 4 or other tools you can turn off this autocommit. But you want to be very careful with doing this, particularly with how PostgreSQL handles its MVCC, its multiversion concurrency control implementation. The number one problem is that you're causing sessions to be idle in transaction, and long database sessions that sit in the idle-in-transaction state can cause a number of problems with Postgres. The first problem is locking in the database, particularly for DDL commands. So if you just do a single select, you're actually going to be holding an ACCESS SHARE lock on the table, and that could cause problems with other DDL statements that want to happen. The other issue is not being able to vacuum. He says here, quote, autovacuum cannot delete any row versions that are younger than the start of your transaction. So you could be creating a big bloat problem for yourself if you have a lot of these long-running idle-in-transaction sessions preventing autovacuum from completing. So you definitely want to try to avoid this if you can. Now there are some protections you can add on the server side to, as he says here, defend against autocommit off. The first one is to set idle_in_transaction_session_timeout to a value greater than zero, so that transactions that sit idle for too long are automatically cancelled.
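Here is a rough sketch of that first server-side defense, with the psql client setting mentioned above shown as a comment since it is a client command rather than SQL. The five-minute value is only an example.

    -- In psql, autocommit can be turned off per session (generally not recommended):
    --   \set AUTOCOMMIT off

    -- Server-side defense: cancel sessions left idle inside an open transaction.
    ALTER SYSTEM SET idle_in_transaction_session_timeout = '5min';
    SELECT pg_reload_conf();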
The other parameter you can change is old_snapshot_threshold, setting it to something greater than -1, and that will, as he says here, allow vacuum to clean up dead tuples that have been dead longer than that time. But of course the issue with both of these is that they do send an error to the client if either one of these scenarios is encountered. So this was a great post talking about autocommit, and if you're interested in learning more, definitely check out this blog post. [00:07:39] The next piece of content is "Data warehousing: making use of synchronized sequential scans". This is from Cybertec-postgresql.com, and they're talking about a feature that was a little new to me, where Postgres has the ability to synchronize concurrent sequential scans that are happening against a table. He first talks about a database environment with, say, one to ten users, where your database system has a particular throughput it can handle. Well, if one user is doing a sequential scan on a table, that may take 16.6 minutes to run. If you add a second user, it's going to double that time. If you have four users, it's going to double it again. And basically your performance keeps dropping the more users you add to the system. Now this happens assuming that each query to the system is entirely independent and there are no dependencies between them; basically each of these ten users has to do a full sequential scan of the table. But there's a feature in Postgres called synchronized sequential scans. Basically, let's say user one has started a sequential scan on a ten-terabyte table, and at the point where it's about halfway done, a second user comes in asking for a sequential scan of the same table. Well, it doesn't have to start at the beginning; it actually joins up and starts using the same I/O to collect its information. It continues on until the end, and then starts over at the beginning to finish up its own scan and return data to the user. The same thing with user three: if the first scan is around 75% complete when user three starts a sequential scan, it can piggyback on the existing I/O to start filling out what it needs to send to user three. And once that scan is done, both user two and user three can start another sequential scan to fill in the information they still need. There's actually a parameter for this called synchronize_seqscans, and by default it is on, so you can turn it off if you need to. It's definitely an interesting feature, and if you want to learn more about it, check out this post from Cybertec-postgresql.com. The next piece of content is "Better Range Types in Postgres 14: Turning 100 Lines of SQL Into 3". This is from Crunchydata.com, and they're talking about a new feature that allows multirange types. So more than just having a range type, you can have multiple ranges within a single value. For example, the data once stored looks something like this, where you have a range from June 8th to June 10th, and then June 13th to June 16th, and you can store that within the same data type. And then you can do queries against it, like does it contain, using the containment operator, certain dates that you're looking for, to see if they exist or not, and he shows how you can do that here. So the same types that exist for normal ranges you can also use as multiranges: ints and bigints, numerics, as well as timestamps, timestamps with time zones, and dates.
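As a small sketch of what that looks like in practice (assuming Postgres 14, with a made-up table and the June dates mentioned above), you could store and query a multirange like this:

    -- Hypothetical table storing several date ranges in one multirange value.
    CREATE TABLE availability (
        employee_id bigint,
        available   datemultirange
    );

    INSERT INTO availability VALUES
        (1, datemultirange(daterange('2021-06-08', '2021-06-10'),
                           daterange('2021-06-13', '2021-06-16')));

    -- Containment check: is this employee available on June 9th?
    SELECT employee_id
    FROM availability
    WHERE available @> '2021-06-09'::date;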
And then they do a real implementation, where they needed to check a set of non-overlapping ranges to see whether a given range exists within them. Something that took a great many lines of SQL before can now be done in basically three, which is shown here, and this is, quote, finding what dates I'm available for employment within a given month. So of course this makes it easier to understand and maintain, and it actually uses less storage. If you're interested in learning more about this feature, definitely check out this blog post. [00:11:16] The next piece of content is actually a YouTube video, "Fetching large amounts of data". This is from the Cybertec YouTube channel, and what they're actually talking about is cursors. So let's say you have a ten-terabyte table and you select all the data from that table without a WHERE clause and pull the data down; you could potentially crash your client. Now, normally you would never do this, you would want to use a WHERE clause for better performance. But another technique you can use is cursors. A cursor allows you to fetch one or more rows from the database starting from a particular point. You can think of it like a cursor on a screen: you can move it to a particular location in the data set based upon a query and then fetch one row, ten rows, a thousand rows, and work with that locally on your client, and that's much less data to pull down. So if you want to learn how to work with cursors in Postgres, this is a great video to check out.
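A minimal sketch of that cursor pattern, using a hypothetical table name and batch size, might look like this:

    -- Fetch a large result set in batches instead of pulling it all down at once.
    BEGIN;

    DECLARE big_scan CURSOR FOR
        SELECT * FROM measurements ORDER BY created_at;

    FETCH 1000 FROM big_scan;  -- first batch of 1,000 rows
    FETCH 1000 FROM big_scan;  -- next batch; repeat until no rows come back

    CLOSE big_scan;
    COMMIT;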
The next piece of content is "PostgreSQL as a microservice". This is from tapoueh.org, and the author was on a podcast recently where they were talking about the concept of Postgres as a microservice. Well, I tend to just think of Postgres as a service, and they were thinking about it as a storage service. But he says, you know, it's really much more than that. It's not just a very basic service to store data and then retrieve it. It also does that with high concurrency, meaning many, many different users can be accessing it, pulling data down, inserting it, deleting it, updating it, and it provides, depending on your configuration, a consistent view of the data in the database thanks to its implementation. It's really a great concurrency microservice, in his opinion. He goes into a little bit about the isolation and locking available in Postgres that you would not want to build yourself, so if you're interested in that, you can check out this blog post. [00:13:14] The next piece of content is "Ignore nulls in Postgres". This is from patternmatchers.wordpress.com, and he's talking about the problem that Postgres does not support the IGNORE NULLS clause, which Oracle does, for example. So he imagines the scenario where you have a table and you want to generate a column that shows the last value that's not null. For example, a value is null, the next one is null, then okay, it's a 2, so now the last non-null value is 2. The next rows stay at 2 because those values are nulls. And then when it changes again, okay, now it's a 6, now it's a 7. So: what is the last value that's not null? To do this in Oracle it's relatively simple, you just use a window function where you're ignoring nulls, but with Postgres it's a little bit more difficult. There is an implementation you can do using a subquery essentially, but there's actually a better way to do it, in that you can create your own aggregates. Aggregates are functions like sum or average; I typically just use the built-in ones, but you can create your own. So to handle this scenario, he actually created his own coalesce function first to be able to do his implementation, and then he created his custom aggregate, called find_last_ignore_nulls, which you pass an element. So now with this implementation in Postgres, you can just select all from the table and then call find_last_ignore_nulls as a window function, passing in the column you are basing it on, and you get the same results. I found this very interesting, and he looks at the performance considerations, and it appears to be about twice as fast as the subquery implementation, so definitely interesting. If you want to learn more, definitely check out this blog post. [00:15:04] The next piece of content is "Postgres 14 highlight: Memory dumps". This is from Paquier.xyz, and he's talking about new features for looking at what's going on with memory in Postgres. Now, it's actually not a dump of something that has crashed, but something that lets you look in real time into a session to see what is going on. [00:15:24] So there's a new view called pg_backend_memory_contexts: within a given session you can look at the memory contexts by using a query such as this one, and it outputs the information for your individual session. There's also a feature where you can request the memory contexts for a specific process ID, and it will log them to the Postgres logs. It doesn't show them on the screen, but it does log them to the Postgres logs, and this is a superuser-only function. Now, I'm not quite sure why it only logs to the logs and not to the screen, but it's another feature that's been added to Postgres 14. The next piece of content is "PostgreSQL on Linux: Counting Committed Memory". This is from Crunchydata.com, and it talks about Postgres's use of memory, particularly comparing CommitLimit to Committed_AS, and shows how those vary as the database is started and stopped, giving insight into how you can look into this memory accounting. So if you're interested in that, you can check out this blog post. [00:16:30] The next piece of content is "Grafana dashboards for pgSCV", and this is from Lesovsky's site on Medium. pgSCV is a new metrics exporter for Prometheus that works with Postgres, PgBouncer and others. This post is about the Grafana dashboards that go with it, but I found it particularly interesting for which metrics he wanted to monitor with Postgres and which ones he felt were important, such as the RED metrics (requests, errors, durations), as well as activity, logs, statements, locks and wait events, replication, WAL, vacuum and maintenance, background writes, disk space usage, tables, WAL archiving, and system resource usage, plus areas for PgBouncer. So if you want to learn about what metrics he feels are important, it's definitely a great blog post for that. [00:17:23] The next piece of content is "Time zone abbreviations". This is from Momjian.us. It's a very short post that shows two views I didn't know existed. One is pg_timezone_names, which prints out the different time zone names within Postgres, and the other is pg_timezone_abbrevs, which gives you the abbreviations of the different time zones in Postgres. So if you want to check those out and learn more, you can check out this blog post.
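For reference, a couple of simple queries against those two views might look like this; the LIMIT is only there to keep the output short.

    -- Full time zone names Postgres knows about.
    SELECT name, abbrev, utc_offset, is_dst
    FROM pg_timezone_names
    ORDER BY name
    LIMIT 10;

    -- Time zone abbreviations.
    SELECT abbrev, utc_offset, is_dst
    FROM pg_timezone_abbrevs
    ORDER BY abbrev
    LIMIT 10;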
[00:17:49] The next piece of content is actually a YouTube channel. This is the Percona YouTube channel, and over the last two weeks they've posted a great many videos from Percona Live, a number of which, of course, have PostgreSQL content. So if you want some more video content, definitely check out this YouTube channel. [00:18:10] And the last piece of content: the PostgreSQL Person of the Week is Tom Kincaid. So if you want to learn more about Tom and his contributions to Postgres, definitely check out this blog post. That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to Scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.

Other Episodes

Episode 41

December 03, 2018 00:14:26

Sequential UUIDs, Running Totals, Read Committed, TOAST | Scaling Postgres 41

In this episode of Scaling Postgres, we review articles covering sequential UUIDs, running totals with window functions, read committed vs repeatable read and TOAST....


Episode 283

September 24, 2023 00:16:00

222 Times Faster Analytical Queries With Columnar Storage | Scaling Postgres 283

In this episode of Scaling Postgres, we discuss how to get 222 times faster analytical queries with columnar storage, a Postgres 16 feature review,...


Episode 272

July 03, 2023 00:14:39

Postgres 16 Beta 2, The Rise of Vectors, FDW Performance, Unused Indexes | Scaling Postgres 272

  In this episode of Scaling Postgres, we discuss the release of Postgres 16 Beta 2, the rise of vectors and storing them, Foreign Data...
