Scaling Postgres

Citus Open Source, Time Series Performance, Subscripting Updates, target_session_attrs | Scaling Postgres 155

March 08, 2021 | 00:17:16
Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, we discuss Citus 10 open source, time series performance in native Postgres, using subscripting for updates, and the new target_session_attrs settings.

To get the show notes as well as get notified of new episodes, visit: 

https://www.scalingpostgres.com/episodes/155-citus-open-source-time-series-performance-subscripting-updates-target_session_attrs/


Episode Transcript

[00:00:00] In this episode of Scaling Postgres, we talk about Citus open source, time series performance, subscripting updates, and target_session_attrs. I'm Creston Jamison, and this is Scaling Postgres episode 155.

[00:00:23] All right, I hope you, your friends, family, and coworkers continue to do well. Our first piece of content is "Citus 10: columnar for Postgres, rebalancer, single node, and more". This is from Citusdata.com, and they have released version 10 of Citus open source, which has a lot of great features. The first feature is columnar storage: it sets up a column store, as opposed to the row store that is the default, which enables very high compression of your data and speeds up analytical queries, like when you want to average or sum a single column; that's what column-based storage speeds up for you. You can now also shard on a single node, so if you want to prepare early for scaling out your database, you can start on a single node and then easily move to multiple nodes later. The third feature, now available in open source, is the shard rebalancer, so you can rebalance your shards as you add or remove nodes. The fourth is joins and foreign keys between local PostgreSQL tables and Citus tables, and the fifth is functions to change the way tables are distributed, and more.

Looking at the first feature, columnar storage: once you have the Citus extension installed, you create a table using columnar storage, and it uses zstd compression to achieve a 60% to 90% reduction in data size, which is huge. But it does have a number of significant limitations. First, it doesn't support indexes, so it relies on the small table size to scan the table quickly when doing calculations. Second, updates and deletes are not supported, so it's an append-only table.

[00:02:09] And third, it's best to avoid single-row inserts, since the compression works well in batches; you want to load data into the table in batches. One suggested approach is to use the standard row storage for more recent data and then archive older data in columnar storage. The next feature they cover is being able to start out with Citus on a single node so that you can scale when you're ready. Again, like I mentioned, you basically add Citus to your preload libraries, create the extension, create a table, and then create a distributed table based on a particular shard key. Next is the shard rebalancer, which lets you rebalance your load across multiple shards and is now available in the open source version. Joins and foreign keys now work between Citus tables and Postgres tables, and they give an example where you apply Citus to only particular tables. For example, your main PostgreSQL database has publishers, advertisers, and campaigns as stock Postgres tables; a very large table like clicks gets distributed with Citus; and the ads table, which ties these together, becomes a reference table. So you can get into Citus gradually. They've also added enhancements for altering how tables are sharded, and a lot of other new features. This was particularly interesting because all of this has now been added to the open source version of Citus.
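To make the columnar and single-node sharding features concrete, here is a minimal sketch, assuming the Citus 10 extension is installed and preloaded; the events tables and the shard key are hypothetical, not taken from the post:

    -- Columnar storage: heavily compressed, append-only, no index support
    CREATE EXTENSION IF NOT EXISTS citus;

    CREATE TABLE events_archive (
        event_time timestamptz NOT NULL,
        device_id  bigint NOT NULL,
        payload    jsonb
    ) USING columnar;

    -- A regular row-store table, sharded on a single node and ready to scale out
    CREATE TABLE events (
        event_time timestamptz NOT NULL,
        device_id  bigint NOT NULL,
        payload    jsonb
    );
    SELECT create_distributed_table('events', 'device_id');

Per the post's advice, you would load the columnar table in batches, for example with an INSERT INTO events_archive SELECT ... from the row-store table, rather than with single-row inserts.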
So if you want to try this out for free, you can just download the open source version and give it a try.

[00:03:52] The next piece of content is "Designing high-performance time series data tables on Amazon RDS for PostgreSQL". This is from aws.amazon.com, and even though the post talks about Amazon RDS, you can do all of this with stock Postgres. What they are proposing is basically using stock PostgreSQL for time series data. They have an example where they are collecting almost append-only data and need to load it into Postgres, so they have a readings table with a few different indexes on it. This was probably based on a customer, because the post walks through making particular transitions, changing column types to make things more efficient. For example, the first thing they cover is data type considerations: determining the smallest data type to use for each column. With regard to the ID, is it best as an integer, a smallint, or a bigint? They have a discussion about that. They also discuss whether double precision makes the most sense, and they actually decided to go with the smaller real data type. Making the data as small as possible by choosing the most compact data types reduced the data size by about 5%.

Then they looked at the indexes. They had a standard B-tree index on the time column, but looking at the use case, particularly with time series data, they decided a BRIN makes more sense. A BRIN is a block range index: it doesn't index each specific value like a B-tree does, but references ranges, and they chose a particular number of pages per range. That significantly dropped the index size, and the table and indexes went from 126 GB down to 101 GB, about a 20% space savings, by moving from B-tree to BRIN. Now, you have to be aware that you may lose some query performance; it depends on your use case. If you're looking up individual values, you want to stick with a B-tree, but a BRIN works best for ranges of values.

Then they looked at partitioning, because they were doing a metric ingestion test with a parallel load using multiple threads into the database, I believe about half a billion records. Performance started off pretty well, then started dropping, and then became very irregular. What they said is that around the 300-second mark the checkpoints started kicking off, which generated WAL and full-page writes and caused insert performance to struggle; after that, the data set outgrew memory and the database had to start going to disk. They mitigated this problem with partitioned tables: they created around 360 partitions for this data, showed how they did it, and it resulted in a dramatic increase in performance. Ingestion stayed at a consistent 1.6 million metrics per second over the duration of the test.

Now, what's interesting is that they don't really specify why partitioning helped, because you're inserting the same amount of data, and presumably everything eventually has to get to disk. Perhaps it has to do with the parallel load doing a lot of random I/O that touched many more pages with single large indexes, whereas it was much more efficient when broken up. Or, like a previous episode that talked about the vicious circle of WAL writes, maybe autovacuum was a component: having to vacuum the whole table may have bogged things down, whereas around 360 smaller partitions are much easier to vacuum. But they didn't really specify why this worked. I would be interested to know how this looks with a different number of partitions: with more partitions, would you see any difference? With fewer, would performance start dropping, and at what point? And how is that related to loading the data in parallel? They didn't specify how many threads they used, but does that impact this performance as well? So this is a significant result, but I would be interested to know exactly why it helped the performance so much. If you want to learn more about how you can use Postgres for time series data without an extension such as TimescaleDB, you can check out this blog post.
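As a rough sketch of the two changes they describe, a BRIN index plus range partitioning might look like this; the schema, the daily partition granularity, and the pages_per_range value are illustrative assumptions, not the post's exact numbers:

    -- Hypothetical time series table, range partitioned on the timestamp
    CREATE TABLE readings (
        id         bigserial,
        created_at timestamptz NOT NULL,
        sensor_id  integer NOT NULL,
        reading    real
    ) PARTITION BY RANGE (created_at);

    CREATE TABLE readings_2021_03_08 PARTITION OF readings
        FOR VALUES FROM ('2021-03-08') TO ('2021-03-09');

    -- A BRIN index stores one summary per block range instead of one entry
    -- per row, which suits naturally ordered, append-mostly time series data
    CREATE INDEX ON readings USING brin (created_at) WITH (pages_per_range = 32);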
[00:08:34] The next piece of content is "How many engineers does it take to make subscripting work?". This is from erthalion.info, and he's describing a patch, which will probably be in Postgres 14, that lets you use subscripting syntax to set keys in JSONB as well as query them. For me this is huge. I love this developer convenience, and I don't know if it's just me, but every time I need to update a JSONB column, I have to check the docs on how to do it. This would be so much easier to work with compared to now, so kudos to everybody who was involved. This post is actually very long: this feature is the TL;DR, but the post also describes the whole process of getting it into Postgres. There's a lot of backstory, different developer comments, and different things to be aware of when working on Postgres patches, so if you're interested in the backstory, you can definitely check out this post.

[00:09:35] The next piece of content is "New target_session_attrs settings for high availability and scaling in PostgreSQL v14". This is from Cybertec-postgresql.com, and they're talking about target_session_attrs. Basically, when you connect with psql, or anything else using libpq, you can specify more than one server: it connects to the first one, and if that's not available, it connects to another. You can also specify target_session_attrs to require that the target session is read-write, or accept any, which more or less lets you target just the primary server. In Postgres 14 they've added several more attributes you can use: now you have primary, standby, read-only, read-write, and prefer-standby. Primary is very similar to read-write, the only difference being whether the default_transaction_read_only parameter is checked; the same distinction separates read-only and standby. And prefer-standby tries to connect to a standby, but if none is available, it just connects to whatever else is available. He also mentions that the libpq library is of course used for Postgres connections in a lot of different languages: for example, they mention Python, Go, and Rust, and I know Ruby uses it as well. So you can use this syntax today, and it will be even better in version 14. If you're interested in that, you can check out this blog post.
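For illustration, the new subscripting syntax looks like this; the products table and the color key are hypothetical:

    -- Postgres 14: read and write JSONB with subscripts
    SELECT details['color'] FROM products;
    UPDATE products SET details['color'] = '"blue"';

And a sketch of multi-host libpq connection strings using these settings, with placeholder host names:

    # Available today: connect only to a server that accepts writes
    psql "host=pg1,pg2,pg3 dbname=app target_session_attrs=read-write"

    # New in Postgres 14: prefer a standby, fall back to whatever is available
    psql "host=pg1,pg2,pg3 dbname=app target_session_attrs=prefer-standby"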
The next piece of content is "Starting with PG: how can I set configuration parameters?". This is from Depesz.com. He had previously written about locating the config files; this post covers all the different places you can change them. There is, of course, the base postgresql.conf file. However, at the bottom of it you can include additional files: include a file (which raises an error if it's missing), or include a file only if it exists so no error is raised. It also, usually by default, includes a directory called conf.d. That's personally where I put all of my configuration; I usually don't change the postgresql.conf file itself because that makes upgrades and the like easier. The other place you can find configuration is the postgresql.auto.conf file in the data directory. You should not edit this file manually; it should only be altered with the ALTER SYSTEM command. So from a Postgres prompt you can say ALTER SYSTEM SET some parameter TO a particular value, and it will change it in this file. Now, that won't immediately make it live, because you may need to do a reload or a restart. Then he goes into how to determine which parameters take effect on a reload and which require a restart: you can query the pg_settings view, and it will tell you for each parameter whether it just needs a reload, such as archive_command, or requires a restart, such as the actual archive_mode. The next thing he covers is that all the items in the postgresql.conf file, and even the postgresql.auto.conf file, are global; however, you can set certain parameters for particular users, per database, or per user per database, and he shows an example of how that hierarchy is resolved. The last thing he covers is that you can also set parameters per session, so for one connection to Postgres you can change certain parameters just for that session, and lastly he covers different ways to reset those values to their defaults. If you're interested in learning more about configuration and the different places you can set it, definitely check out this blog post.
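A brief sketch of those mechanisms, with hypothetical parameter values and role and database names:

    -- Written to postgresql.auto.conf; archive_command takes effect on reload
    ALTER SYSTEM SET archive_command = '/usr/local/bin/archive %p';
    SELECT pg_reload_conf();

    -- context is 'sighup' when a reload suffices, 'postmaster' when a restart is needed
    SELECT name, context FROM pg_settings
     WHERE name IN ('archive_command', 'archive_mode');

    -- Per-role and per-database overrides
    ALTER ROLE reporting SET work_mem = '256MB';
    ALTER DATABASE app SET statement_timeout = '30s';

    -- Per-session override, then back to the default
    SET work_mem = '64MB';
    RESET work_mem;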
The next piece of content is "Starting with PG: where are the logs?". This is also from Depesz.com, and he's talking about where you can find the logs on your Postgres system. He considers the different places things can be logged: on Windows you can log to the event log, on Linux you can log to syslog, you can log to standard error, with or without the logging collector, as well as to the CSV log. So if you want to learn more about logging in Postgres, definitely check out this blog post.

[00:14:05] The next piece of content is actually a YouTube channel, EDB's YouTube channel, and in the last week they've posted a lot of different webinars on Postgres. So if you're interested in video content, definitely check out this channel.

[00:14:19] The next piece of content is "Security and containers in Cloud Native PostgreSQL". This is from Enterprisedb.com, and they're talking about their Cloud Native PostgreSQL offering: basically, they bundle Red Hat Linux and Postgres in a container, all configured for you, to use in things like Kubernetes. They go through the security process they follow to deliver secured images this way, and how they've configured the container itself following the principle of least privilege during execution. They also cover why immutability is a benefit in this scenario: you generally would not install patches on the server; you would just get a new container, install that, and do a failover to a standby server to upgrade it, and that's the process you would go through for container upgrades. So if you're interested in that, you can check out this blog post.

[00:15:14] The next piece of content is "PostgreSQL: getting started on Ubuntu". This is a very simple post about the best way to install Postgres on Ubuntu, especially if you're looking for a specific version and don't want the Postgres version bundled with your particular version of Ubuntu. If you're interested in that, you can check out this post from Cybertec-postgresql.com.

The next piece of content is "How PostgreSQL handles subtransaction visibility and streaming replication". If you're interested in how Postgres works under the covers to accomplish this, definitely check out this post from Highgo.ca.

[00:15:49] Also from Highgo.ca is "How to create a system information function in PostgreSQL". Again, if you want to get more detailed Postgres information, you can check out this blog post.

[00:16:01] The next piece of content is "Pgpool-II's clustering modes". This is from B-peng.blogspot.com, and she discusses all of the different clustering modes available in Pgpool-II, how they work, and how to enable them. So if you're interested, you can check out this post.

[00:16:18] The next piece of content is "Regression analysis in PostgreSQL with TensorFlow: Part 1, getting started". This is from Enterprisedb.com, and it's the first post in a series that gets TensorFlow set up with Postgres and Python 3 so that you can start using TensorFlow for machine learning. If you're interested in that, you can check out this blog post.

[00:16:41] And the last piece of content is that the PostgreSQL Person of the Week is David Wheeler. If you're interested in learning more about David and his contributions to Postgres, definitely check out this blog post.

That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to Scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.

Other Episodes

Episode 202

February 14, 2022 00:12:49

New Postgres Releases, A Hairy Incident, Slow Down To Go Faster, Loadable Module Archiving | Scaling Postgres 202

In this episode of Scaling Postgres, we discuss the new releases of Postgres, a hairy upgrade incident, why slowing down can make you go...


Episode 115

May 25, 2020 00:12:50

PGCon Online, Performance Tips, Tracking Counts, Essential Monitoring | Scaling Postgres 115

In this episode of Scaling Postgres, we discuss PGCon going online, application performance tips, ways to track counts and essential areas to monitor. To...


Episode 8

April 16, 2018 00:15:06

Scaling Real-Time Analytics, Covering Indexes, 1,500 Upgrades | Scaling Postgres 8

In this episode of Scaling Postgres, we review articles covering real-time analytics at scale, covering indexes in Postgres 11, 1,500 Postgres upgrades and PostgreSQL...
