The Dangers of Temporary Tables | Scaling Postgres 349

Episode 349 January 19, 2025 00:15:24
Scaling Postgres



Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, we discuss the dangers of temporary tables, better transparent data encryption, Pig the extension wizard, and table-level lock considerations.

To get the show notes as well as get notified of new episodes, visit: 
https://www.scalingpostgres.com/episodes/349-the-dangers-of-temporary-tables/

Want to learn more about Postgres performance? Join my FREE training called Postgres Performance Demystified here: https://www.scalingpostgres.com/courses/postgres-performance-demystified/


Episode Transcript

[00:00:01] Well, the last 72 hours have been a bit of a blur for me, because I either have the flu or that five-letter word starting with "cov" that nobody wants to talk about. I've definitely had a super high fever, but I wanted to go ahead and get an episode out this week. But unlike me, I hope you, your friends, family and coworkers continue to do well.

Our first piece of content is "What Hurts in PostgreSQL, Part 1: Temporary Tables". This is from percona.com, and I don't think it's necessarily true that temporary tables hurt as long as you know how to use them, but some of the use cases he's seen for them definitely cause pain, particularly in the system catalogs. The first thing he mentions is problem one: a bloated catalog and poor performance. He was using the tool pg_gather to identify problems in Postgres, and it noted that there were 30 regular schemas and 186 temporary schemas. So apparently a lot of temporary tables were being created, and the catalog was maybe 25 times larger than he normally sees because of all of these temporary schemas from temporary tables, along with a super high OID in pg_class. For example, this first example shows an OID of over 4 billion in pg_class, which gets a new entry for every relation you create in the database. So that can definitely cause some performance problems in your system. Problem two, related to that, is high autovacuum activity for the catalog tables. So you have to autovacuum them a lot because of all this temporary table usage. Problem three: you get high DDL activity, because the thing about Postgres is that a temporary table only exists within the session that created it. In other systems, I don't know if it's Microsoft SQL Server or Oracle or maybe both, you can create temporary tables once in the schema and then each session can use those, and I think each session only sees its own data. I think that's how they're designed. But Postgres is different: every temporary table you create only exists within that session. So if you have to use a temporary table with every connection, you're going to be creating tons of different temporary tables, and that's probably not something you want to do application-wise.

Now, one thing they didn't mention here is unlogged tables, because unlogged tables do exist across sessions. So if you're using a temporary table in another database solution and want something similar, although it's still not identical, but you want that table to persist across sessions, maybe you would like to use an unlogged table instead. The fourth problem with temporary tables is ineffective connection pooling. Again, because they only exist within a given session, it's hard to share sessions; the very fact that you're using temporary tables means you're tied to sessions and cannot use transaction pooling effectively. Problem five: you're flooding the PostgreSQL logs and audit logs with all of these DDL statements, these CREATE TABLE statements. Problem six: helpless autovacuum, because autovacuum doesn't work on temporary tables, so it's easy for them to get bloated. I mean, hopefully they're not lasting that long, but that can still be a problem. And again, he says he normally sees this happen with people coming from other database systems to Postgres, where they're used to the different types of temporary tables that are actually more in line with the SQL standard. So maybe Postgres at some point will have an enhancement that matches what the SQL standard supports with regard to them. But definitely, if you're creating a lot of temporary tables as part of your application design, you may want to rethink that with Postgres. If you want to learn more, definitely check out this blog post.
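To make the difference concrete, here's a minimal sketch; the table and column names are just hypothetical examples, not from the article.

-- A temporary table is visible only to the current session and is dropped
-- automatically when the session ends. Every one of these adds (and later
-- removes) rows in the system catalogs.
CREATE TEMPORARY TABLE session_scratch (
    id      bigint,
    payload text
);

-- An unlogged table is shared across sessions and persists, but it is not
-- WAL-logged, so its contents are truncated after a crash and it is not
-- replicated to physical replicas.
CREATE UNLOGGED TABLE shared_scratch (
    id      bigint,
    payload text
);

So an unlogged table isn't a drop-in replacement for per-session temporary tables, but if what you really want is a scratch table that survives across sessions, it avoids constantly recreating objects in the catalog.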
[00:03:53] Next piece of content: "Your data is not safe until it's TDE safe. Here's how". This is from percona.com, and they're talking about a new, or rather a revised, extension called pg_tde.

[00:04:07] Now, this existed previously, but it had some limitations: performance was poor, with overhead multiple times higher than not using it, and indexes were not included, so you could still see the data in indexes on disk. They're calling that earlier capability tde_heap_basic. But with the enhancements they've made to pg_tde, you now get index encryption support, native replication capability, and minimal performance impact in the early testing they've done, so maybe an overhead of 10% or so.

[00:04:45] Now, they do discuss a little bit about how it works: you do the encryption per table, and each table gets its own key. That key is stored in a keyring, and the keyring is encrypted by a principal key. The principal key, or master key, is basically what gets rotated, and it's what gives you access to all the encrypted tables. It's designed to be used with a key management system; that's where you would put the principal key, and right now they support HashiCorp Vault and KMIP-compatible KMSs. Now, it says pg_tde is open source and ready to test, but they do say it's included in Percona's Postgres distribution, so I'm assuming you can download it separately, but I'm actually not quite sure about that. But once you get the extension installed, you just create a table using tde_heap and that encrypts the table, and of course the indexes as well. So if you want to learn more, definitely check out this blog post.

Next piece of content: "Announcing Pig, the Postgres extension wizard". This is from postgresql.org, and this is a Go-based package manager that supposedly works with 340 extensions already. [00:06:02] It says Pig piggybacks on your system's native package manager (apt, yum, dnf). So presumably, by using Pig you can just use apt, for example, if you're on a Debian system, to install the extensions you're interested in. So that sounds interesting. You can click here if you want to learn more about this.

Next piece of content: "Anatomy of table-level locks in PostgreSQL". This is from xata.io, and this post talks about locking in general and the reasons for locking, and it goes into the background of MVCC and how it works. In terms of table-level locks, these basically come into play whenever you want to do DDL changes, to change the schema in some way, and it shows all the different types of locks that are available and how they conflict with one another, which of course is also in the Postgres documentation. But the key thing to keep in mind when you're doing DDL changes is that you want to protect yourself. And the number one way to protect yourself from causing unintentional locking across the system and generating what's known as a lock queue is to use a lock timeout. So basically, you set lock_timeout in your session to some number of seconds, or a fraction of a second, and if that DDL operation has to wait for a lock longer than that timeout, it goes ahead and cancels itself. So it basically protects the system.

Now, what a lock queue is: say you have some DDL that should run very quickly, like renaming a column of a table. The problem is that if it has trouble acquiring the lock to make that very quick change, it still has to get in line, and then suddenly every other query against that table gets in line behind it too, blocking other statements, say SELECTs, from happening. That's the lock queue. But if you want to learn more about that and the importance of lock_timeout, definitely check out this blog post. Oh, also, to find out what's locking, there's the pg_locks system view, but I frequently forget that there's this convenient function, pg_blocking_pids(), that shows you which other backends are blocking a particular command you're interested in. So you can arguably see what's blocking something more easily than with the pg_locks view.
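As a rough sketch of that workflow (the table, column, and timeout values here are just hypothetical examples), you can set lock_timeout before the DDL and use pg_blocking_pids() against pg_stat_activity to see what's in the way:

-- Give up on the DDL if we can't get the lock within 2 seconds, instead of
-- sitting in the lock queue and blocking everyone behind us.
SET lock_timeout = '2s';
ALTER TABLE orders RENAME COLUMN customer_ref TO customer_id;

-- See which sessions are blocked and which backend PIDs are blocking them.
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       state,
       left(query, 60) AS query_snippet
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;

If the ALTER TABLE hits the timeout, it fails cleanly and can be retried later, rather than stalling every other query on the table.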
[00:08:24] Next piece of content: there was another episode of Postgres FM last week. This one was on pg_squeeze. So Michael and Nikolay actually had, I think, Antonin on from Cybertec, who actually wrote pg_squeeze. And pg_squeeze is an alternative to pg_repack, so it allows you to shrink your tables while they stay online.

[00:08:47] Now, I haven't used pg_squeeze yet. I did try using pg_repack once and had issues with it actually completing the operation. That was with a very large table, multiple terabytes, so there might have been some other issue related to it. But the reason pg_squeeze was developed is that they wanted to be able to kick jobs like this off from a background process, and the problem with pg_repack is that it uses a client and server architecture, so it would have required too much work to re-engineer pg_repack. So they went ahead and wrote it fresh. In addition, it's constructed differently: pg_repack predominantly uses triggers to do most of its magic, whereas pg_squeeze uses logical decoding. During the episode they did mention that a commit has been made to Postgres, and we'll see if it gets into the next version, but it would basically allow you to shrink a table online, which would be great. Basically, right now the way you can shrink a table is VACUUM FULL, but that takes a full lock on the table, so you don't want to do that. Apparently this commit introduces a CONCURRENTLY option, so you could run VACUUM FULL CONCURRENTLY, and that would use the pg_squeeze technique of doing an online table copy. There would still be a brief lock when switching over from the old table to the new one, but that would be far superior to the options available now with just VACUUM FULL. If you want to learn more about that, you can listen to the episode here or watch the YouTube video down here.

[00:10:24] Next piece of content: "Waiting for PostgreSQL 18: support LIKE with nondeterministic collations". [00:10:31] So if you had a nondeterministic collation, like the case-insensitive collation he creates here by specifying deterministic = false, and you created a table using that collation for a particular column, then if you tried to use the LIKE syntax it would just give you an error saying nondeterministic collations are not supported for LIKE. But in Postgres 18 it now works. So hopefully that's a benefit that will land in 18, and I should say this is from depesz.com.
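To sketch what that looks like (the collation, table, and column names here are just illustrative, following the case-insensitive collation example from the Postgres docs, and assuming your build has ICU support):

-- A case-insensitive, nondeterministic ICU collation.
CREATE COLLATION case_insensitive (
    provider = icu,
    locale = 'und-u-ks-level2',
    deterministic = false
);

CREATE TABLE users (
    email text COLLATE case_insensitive
);

-- On Postgres 17 and earlier this fails with:
--   ERROR:  nondeterministic collations are not supported for LIKE
-- With the Postgres 18 change it should work and match case-insensitively.
SELECT * FROM users WHERE email LIKE 'bob@%';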
Next piece of content: "Fix a top cause of slow queries in PostgreSQL: no slow query log needed". This is from render.com, and what this is basically talking about is indexes on your foreign keys. So of course you want to have primary keys on all your tables, and when you define a primary key on a table it automatically gets an index. But foreign keys do not get an index by default, and that can cause problems when you're querying the table, because a lot of times foreign keys are used in queries, or if you're using cascading deletes, Postgres will use that foreign key index to delete rows more efficiently. And they show very clearly how that works here. So if you want to learn more about the importance of indexes on your foreign keys, you can definitely check out this blog post.

[00:11:48] Next piece of content: "Postgres Tuning and Performance for Analytics Data". This is from crunchydata.com, and they're going through the scenario where you have an application database that is predominantly doing OLTP work, but more and more frequently you start getting asked questions that require running analytical queries. How would you handle that in the database? They discuss some different ways to deal with it. One way is to have dedicated analytics users in the database, and they would get a larger work_mem than OLTP users would, because generally they're doing more joins and more sorting operations and will need that additional memory to avoid spilling sorts to disk. They might also get a larger statement timeout; maybe they can run a query for 5 or 10 seconds, whereas for the application you'll want it much shorter than for an analytical statement. And maybe you want to give analytics users greater parallelism, because they can take advantage of it, and with only a few such queries running you can probably afford to do more of that work in parallel.

[00:12:50] The other thing they mentioned is that you can add dedicated indexes for some of the analytics you need to do, or even just pre-calculate the data. This could be as simple as adding some generated columns that pre-calculate the data you want to store, or even creating materialized views of whole sets of data that get refreshed on a periodic basis. And if that is insufficient, you can start looking into a separate analytics database. You could use your read replicas, which have an identical schema to your primary; there are some advantages and disadvantages to that, and your queries can also time out when you're running against a read replica. Another option is setting up logical replication, logically replicating certain tables to another database and using that as a reporting database. And then they also mention columnar analytics databases, and they go into their own data warehouse platform. But there are other columnar analytics options too; there are now some open source choices for using, say, DuckDB to do analysis even within Postgres with the new pg_duckdb extension, or there's TimescaleDB, which also has columnar capabilities, as well as the solution mentioned by Crunchy Data here. But you can check out this blog post if you want to learn more.
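As a loose sketch of the per-role idea (the role name, settings, and tables here are hypothetical examples, not recommendations from the article):

-- A dedicated analytics role with more memory for sorts and hashes, a longer
-- statement timeout, and more parallelism than the OLTP application users get.
CREATE ROLE analytics_user LOGIN;
ALTER ROLE analytics_user SET work_mem = '256MB';
ALTER ROLE analytics_user SET statement_timeout = '10s';
ALTER ROLE analytics_user SET max_parallel_workers_per_gather = 4;

-- Pre-calculate a heavy aggregate into a materialized view and refresh it
-- periodically instead of recomputing it for every report.
CREATE MATERIALIZED VIEW daily_order_totals AS
SELECT order_date, sum(total_amount) AS total
FROM orders
GROUP BY order_date;

REFRESH MATERIALIZED VIEW daily_order_totals;

The role-level settings apply automatically to every session that logs in as that role, so the application users keep their own, stricter defaults.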
Next piece of content: "Postgres Per-Connection Statistics". This is from ardentperf.com, and he mentions a number of observability-related things he would love to have in Postgres. So he really appreciated the recent enhancement that adds per-backend statistics for I/O, and hopefully more per-backend WAL statistics are coming as well. But another thing I found super interesting in this blog is the comment he made here:

[00:14:39] "One long-running topic is better instrumentation for detecting plan changes. The thing we need is some kind of plan hash, and Lukas Fittl is keeping the discussion going with a fresh patch and proposal on the list last week." So I don't know if anything's happening there yet, but it would be super interesting to get plan change detection in Postgres. But if you want to learn more, definitely check out this blog post.

[00:15:05] I hope you enjoyed this episode. Be sure to check out scalingpostgres.com, where you can find links to all the content mentioned, as well as sign up to receive weekly notifications of each episode. There you can also find an audio version of the show as well as a full transcript. Thanks, and I'll see you next week.

Other Episodes

Episode 116

June 01, 2020 00:14:01

Observer Effect, Partition Management, Tuple Freezing, Hung Transactions | Scaling Postgres 116

In this episode of Scaling Postgres, we discuss the observer effect with explain analyze, partition management, tuple freezing and sources of hung transactions. To...


Episode 65

May 26, 2019 00:14:08

Multi Column, Multi DB, Dos & Don'ts, RUM Indexes | Scaling Postgres 65

In this episode of Scaling Postgres, we discuss multi-column indexes, using multiple databases, Postgres dos & don'ts as well as using RUM indexes. To...


Episode 250

January 30, 2023 00:14:03

Performance Issue, Survive Without Superuser, Reserved Connections, Partition Management | Scaling Postgres 250

In this episode of Scaling Postgres, we discuss resolving a performance issue, how PG16 allows you to survive without a superuser, reserving connections and...
