Fantastic Data Types, Domain Data Types, Insert Performance, Row Differences | Scaling Postgres 121

Episode 121 July 06, 2020 00:16:33

Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, we discuss fantastic data types, custom domain data types, improving insert performance and how to query row differences.

To get the show notes as well as get notified of new episodes, visit: 

https://www.scalingpostgres.com/episodes/121-fantastic-data-types-domain-data-types-insert-performance-row-differences/


Episode Transcript

[00:00:00] In this episode of Scaling Postgres, we talk about fantastic data types, domain data types, insert performance, and row differences. I'm Creston Jamison, and this is Scaling Postgres, episode 121.

[00:00:22] All right, I hope you, your family, and coworkers continue to do well in these times. Our first piece of content is "Webinar: Fantastic Data Types and Where to Use Them - Follow Up." This is from 2ndQuadrant.com, and it's a webinar that was put on by 2ndQuadrant; a recording is available, and you can click on the link and register to view it. What was really great about this presentation: I had heard of using enums, but I had also heard that there were numerous disadvantages. This presentation really put enums in perspective, showing how advantageous they can be and ways to avoid some of the disadvantages; really, having to delete values if you ever need to is the biggest disadvantage. Otherwise, there are a lot of advantages to using them compared to, say, a separate lookup table or storing raw text in a text field that has to be managed elsewhere. So for this part alone, I definitely suggest you check out this webinar to see their discussion on enums (a small sketch of the enum approach appears at the end of this segment). Their coverage of arrays and JSON isn't as in depth, but for arrays they say to use them for a consistent data type. So if you need to store a set of integers or a set of text, say tags like user tags or something of that nature, an array is a good use case for that. But don't store an array of multiple different data types that have been converted to text; that really doesn't make too much sense. Reach for JSON, or ideally JSONB, when you have a variety of data you happen to need to store, but again, just be cautious of overusing that particular data type. They also cover timestamps and their usage as well. So overall, a really good presentation I suggest you check out.

The next piece of content is "PostgreSQL: Useful New Data Types," and it's actually talking about domains. A domain is a concept where you essentially create a new data type that has a specific check constraint. So in this example, he wants to create a color code data type. It's made as a text field, but it has a check constraint added to it to validate the color codes, and then this domain just becomes, essentially, a data type. He creates a table here with a column c that has the type of color code; when you insert a valid color code, it accepts it, and when a value violates that constraint, it presents an error message. Now of course, you don't have to create domains. You can put check constraints on existing tables and leave the type as text, but this just generally makes the database easier to work with and to reason about when looking at different tables and how the data is laid out. He also looked at some other use cases: maybe you want a text string that is only alphanumeric; well, a domain is one way to do it. You can also create a domain for URLs, and he presents a check constraint to verify that a particular column only accepts URLs. So if you're interested in setting up these types of domains in your database, maybe you want to check out this post from cybertec-postgresql.com.
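To make the enum discussion above a bit more concrete, here is a minimal sketch of the enum approach; the type name and values are hypothetical and not taken from the webinar.

```sql
-- A hypothetical enum used in place of a lookup table or free-form text.
CREATE TYPE order_status AS ENUM ('pending', 'shipped', 'delivered');

CREATE TABLE orders (
    id     bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    status order_status NOT NULL DEFAULT 'pending'
);

-- Adding a new value later is straightforward; removing one is the
-- awkward part mentioned above.
ALTER TYPE order_status ADD VALUE 'cancelled';
```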
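And here is a minimal sketch of the domain idea from the CyberTec post, assuming a hex color code check; the exact constraint and table used in the original post may differ.

```sql
-- A domain is a base type plus a check constraint, usable like any data type.
CREATE DOMAIN color_code AS text
    CHECK (VALUE ~ '^#[0-9a-fA-F]{6}$');

CREATE TABLE t_demo (c color_code);

INSERT INTO t_demo VALUES ('#00ff00');  -- accepted
INSERT INTO t_demo VALUES ('green');    -- rejected by the domain's check constraint
```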
[00:03:30] The next piece of content is "13 Tips to Improve PostgreSQL Insert Performance." This is from the Timescale.com blog. Now, Timescale is an extension of Postgres that handles time series data very well, so some of the recommendations are related to Timescale, but the vast majority apply to vanilla PostgreSQL as well. So I'm just going to run through some of these. The first one is use indexes in moderation: the more indexes you have, the slower inserts become, because all of those indexes have to be updated. Number two, reconsider foreign key constraints, because generally, when you have a foreign key constraint, the other table must be checked to validate that the value exists before the insert can happen; so reconsider using those for faster insert performance. Three, avoid unnecessary unique keys, because again, that's a case where it has to check whether the data already exists in the table before it does the insert. That's another validation step which, if you don't need it, you could remove to increase speed. Four, use separate disks for WAL and data; depending on your setup, separating the WAL from the data can give you better performance. Five, use performant disks, so don't use spinning hard disks; use SSD or NVMe drives or a high-speed SAN.

Now, starting from suggestion six on, they're talking about things related to TimescaleDB. However, I find that nearly all of these are also applicable to Postgres as well. Number six is use parallel writes. So if you're doing an INSERT or a COPY, use parallel processes to insert that data, because these inserts or copies are done as a single transaction in a single-threaded fashion; if you parallelize it, you can potentially get better write performance. It looks like TimescaleDB actually has a parallel copy command, but you could do this yourself using your own language or techniques with Postgres. Seven, insert rows in batches: it's always more advantageous to do a multi-row insert versus doing many single-row inserts (there's a small sketch of this at the end of this segment). Number eight, properly configure shared_buffers; always important. Nine, run their Docker images on Linux hosts. This is mostly TimescaleDB related; it's basically about the overhead of virtual machines, so you need to take the same considerations into account in how you set up Postgres. Ten, write data in loose time order. Basically, they're saying you want to try to keep as much in memory as possible. TimescaleDB does things related to partitioning; they don't necessarily call it partitioning, but it's time series data that's chunked up by a timestamp or something similar. And they're saying to insert your data in time order. So, for example, if you're collecting metrics from servers, don't insert all the data from server A, then all from B, and then all from C; do it by time instead, so that indexes and things of that nature for the more recent chunks are kept in memory as much as possible. And that's applicable to Postgres as well.

[00:06:32] Eleven talks about avoiding chunks that are too large, and twelve about avoiding too many or too-small chunks, and this kind of goes to partition sizes: you don't want too little data in a partition if you're using Postgres, and you don't want partitions that are too large; you want them to be just right, essentially. And then thirteen is watch row width. Again, any time you want to maximize insert performance, fewer columns generally do better, and you want to try to avoid a whole lot of data being inserted at one time. So a lot of great recommendations in terms of maximizing your insert performance; go ahead and check out this post from Timescale if you want more details.
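As a minimal illustration of tip seven, here is the difference between many single-row inserts and one multi-row insert; the metrics table and values are hypothetical, not from the Timescale post.

```sql
-- Many statements, one round trip each (slower):
INSERT INTO metrics (device_id, reading) VALUES (1, 20.1);
INSERT INTO metrics (device_id, reading) VALUES (1, 20.3);
INSERT INTO metrics (device_id, reading) VALUES (1, 20.7);

-- One multi-row statement (generally much faster):
INSERT INTO metrics (device_id, reading)
VALUES (1, 20.1), (1, 20.3), (1, 20.7);
```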
[00:07:12] The next post is "Calculating Differences Between Rows in SQL." This is from cybertec-postgresql.com, and it's a post about window functions. So how do you compare one row to the next row when you're doing a query? The first example they review here is lag. Say you want to consult the previous year of a set of data you're looking at: you can use the lag window function and look back one row, but of course, when you're doing this, you want to order it, so you do an OVER and ORDER BY the year. That way you can look at the information for the previous year and compare it to the current year. And then, what if you want to look at the difference between this year and the last year? You can take the number from this year and subtract that window function value to get the difference. Other than lag, he also talks about first_value. So if you want to compare every year against the first year for which you have data, you can use the first_value function; here he's showing how you can compare oil production to the first year that data is available for, so you can compare each row to that first year. He also shows how you can partition the data in your queries, say by country in this example, with window functions that are broken out by country, so you can look at Canada's output relative to Mexico's output. So if you're looking to learn more about window functions and how to use them, definitely check out this blog post from cybertec-postgresql.com (a small sketch of these functions appears at the end of this segment).

The next post is "Indexing JSONB columns in PostgreSQL." This is from Sevloid Net. Now, the title is interesting, because he talks about an issue where he was indexing a JSONB field while trying to track multiple different IDs for a given document. Apparently different sources have different IDs for it, so he wanted to track them all in one document as opposed to using a separate table. But he ran into a number of issues, particularly when he needed to query based on multiple IDs. Now, I should say this is kind of how the post starts, but the post is a story that weaves in a lot of different information, so it's not strictly about indexing JSONB columns. After he investigated more, his ultimate conclusion was that he had a domain modeling failure, and ideally he should set up a separate table and not use JSONB columns. So it's very interesting: the title of this post is about JSONB indexes, whereas ultimately the solution was not to use JSONB columns at all. But the story and the process he went through are interesting if you want to learn more about that; given that he ran into a situation where multiple IDs could reference a single document, a separate table seemed to be the more efficient solution. He also covers some additional pitfalls he ran into, and he references other people running into problems with the new JIT compiler that is enabled by default in Postgres 12. So again, some good information with regard to that. This post is more of a story about a path he went down and the things he discovered with Postgres, so if you're interested in learning more about that, check out this blog post.
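Going back to the window function post above, here is a minimal sketch of lag, first_value, and PARTITION BY, assuming a hypothetical production table with country, year, and output columns (the post's actual table and data will differ).

```sql
-- Year-over-year difference with lag(), and difference from the first
-- recorded year with first_value(), calculated per country.
SELECT country,
       year,
       output,
       output - lag(output) OVER w         AS diff_from_previous_year,
       output - first_value(output) OVER w AS diff_from_first_year
FROM   production
WINDOW w AS (PARTITION BY country ORDER BY year);
```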
[00:10:24] The next post is "Should You Ever Use Synchronous Replication in PostgreSQL?" Now, this is curious, because I actually think a better title for this post would be how best to use synchronous replication in PostgreSQL. It starts off, again from a story perspective, with a customer that was having issues: they ran into some data loss due to a disk issue, and because of how synchronous_commit was set up, their standby did not have all the data that it was supposed to have. They go over the different settings that are possible for synchronous_commit (off, local, remote_write, on, and remote_apply) and describe how you can set this up to get better performance or better protection of your data. So essentially it's a continuum. If you want the best performance, you could turn synchronous_commit off, but you're potentially risking some data loss by doing that, because, as mentioned here, quote, "commits are sent to the application as soon as the transaction is handled by the backend process, but before the transaction is flushed to the WAL." So essentially it's somewhere in memory, but not committed to disk on the primary database server; this doesn't even have to do with the replicas. Whereas the most safety you can get, but the least performance because it has to wait, is remote_apply, because commits are sent by the primary to the application only after the standbys defined in synchronous_standby_names have confirmed that the transactions have been applied, beyond the WAL, to the database itself. So this post essentially talks about the trade-off: if you want more speed but are less concerned about safety, you could go the synchronous_commit off route, but if preserving data is most important and performance not so much, then you could go with synchronous_commit set to remote_apply (there's a configuration sketch at the end of this segment). But generally, if you want to learn more about synchronous replication and how best to configure some of these settings, definitely check out this post from EnterpriseDB.com.

[00:12:25] The next post is "Oracle to PostgreSQL: Binary Objects." Now, I read this title and was expecting a comparison between them, but actually most of this post talks about PostgreSQL in general, and it's a really good weighing of the pros and cons of where you should store binary objects. If you store binary objects in the database, he mentions three advantages: they get backed up with appropriate transaction integrity, they get transported via physical streaming replication, and they don't require a separate connection stream to the caller. But then they list eleven different issues that are disadvantages of storing them in the database, primarily related to their size and performance; if they're small, it's less of an issue. I won't run through all of these right now, but it's definitely a great list of potential disadvantages. And then, of course, they give you some recommendations on how best to store binary objects given these pros and cons. So if you are storing binary objects in your PostgreSQL database and you're not sure if you're doing it right, or if you want to use a different method, definitely check out this blog post, because it's not just about Oracle; in general, it covers best practices for storing binary objects and working with your database. This is from 2ndQuadrant.com.
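Going back to the synchronous replication post above, here is a minimal configuration sketch; the standby name standby1 is hypothetical, and which trade-off you pick depends entirely on your durability and performance requirements.

```sql
-- Name the standby(s) that must confirm commits.
ALTER SYSTEM SET synchronous_standby_names = 'FIRST 1 (standby1)';

-- Strictest setting: wait until the standby has applied the transaction.
ALTER SYSTEM SET synchronous_commit = 'remote_apply';
SELECT pg_reload_conf();

-- The setting can also be relaxed per session where some risk of loss
-- is acceptable, for example during a bulk load:
SET synchronous_commit = 'off';
```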
[00:13:45] The next piece of content is actually a YouTube video, "A Look at the Elephant's Trunk - PostgreSQL 13," from the Warsaw PostgreSQL Users Group YouTube channel. It's a presentation about an hour and seventeen minutes in length going over the new features in PostgreSQL 13. Magnus Hagander has done this presentation a number of times for different versions of PostgreSQL, and this is his most recent one, for version 13. So if you want to learn more about Postgres 13 and all the different features coming in it, definitely check out this video.

[00:14:18] The next post is "Webinar: PostgreSQL Partitioning Roadmap - Follow Up." This is from 2ndQuadrant.com, and they're going over the roadmap for partitioning updates and improvements to PostgreSQL. They covered some of the historical changes in PostgreSQL 11 and 12, as well as what's coming in 13 and potentially some thoughts about 14. So again, they continue to push as much as they can to fill out performance, make management easier, and look to the future for sharding; that still seems a fairly long way away, but it's something being considered. They also talk about their bi-directional replication product for a small portion of the webinar. So if you're interested, you can just click the link here to get registered to view it.

[00:15:07] The next piece of content is "How Secure Is Your Postgres? Part One: Access." This is from EnterpriseDB.com, and they're talking about different levels of access to a PostgreSQL instance in terms of setting up security: physical access, how you can connect to it, whether you're using a Unix domain socket locally or a TCP/IP connection, and then, once you allow TCP/IP connections, what kind of firewalls are in place to prevent access from other areas, as well as ensuring that the data is encrypted in transit by using transport encryption such as TLS, and the different ways of configuring these SSL parameters. So if you're interested in learning more about secure access to PostgreSQL, definitely check out this blog post.

[00:15:56] And the last piece of content: the PostgreSQL Person of the Week is Andrew Dunstan. So if you're interested in learning more about Andrew and his contributions to PostgreSQL, definitely check out this blog post.

[00:16:09] That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.
