Episode Transcript
[00:00:00] In this episode of Scaling Postgres we talk about Postgres goodies, materialized views versus rollup tables, column-oriented storage, and parallel scans. I'm Creston Jamison and this is Scaling Postgres episode 37.
[00:00:22] Alright, our first piece of content is a webinar, "PostgreSQL is NOT your traditional SQL database." This is from the 2ndQuadrant.com blog. So this is a webinar that they put on, and you can get access to the recording here; you just need to register, but then you get immediate access. Now this was a really great presentation, and the slides are available as well, so I'll include that link too. The presentation is by Gülçin Yıldırım Jelínek, with my apologies for any mispronunciation. The agenda goes over the design choices of PostgreSQL, basically talking about its object-oriented features and some special features that have been added to Postgres that make it a little bit different. And this was a good review for me because it brought some things to my attention that perhaps I should be using more, like the enum feature, and also some particular configurations for GIN indexes that could give you better performance for some query types and make the index smaller. So basically she goes over arrays and how they can be used. Enums are typically used for status fields where the status changes; you can use an enum and that's a more efficient way to store the data. She also talked about the JSON and JSONB types: JSON essentially stores a verbatim copy of the JSON you've received, whereas JSONB is better for actually storing and querying the JSON, and you can query it very fast using a GIN index. The presentation goes into full text search, the different ways you can use and rank it, how to do similarity searches as opposed to exact searches, as well as accent and language support. So again, this is a really great presentation and even I learned a few things, so I highly encourage you to check it out, access the slides and download the webinar recording.
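To give a rough idea of the kinds of features covered, here's a minimal sketch of my own (not taken from the webinar) showing an enum type plus a JSONB column with a GIN index, using the jsonb_path_ops operator class that can make the index smaller and faster for containment queries:

    -- Hypothetical schema for illustration only.
    CREATE TYPE order_status AS ENUM ('pending', 'shipped', 'delivered');

    CREATE TABLE orders (
        id      bigserial PRIMARY KEY,
        status  order_status NOT NULL DEFAULT 'pending',  -- compact, validated status field
        details jsonb NOT NULL                            -- flexible document data
    );

    -- jsonb_path_ops builds a smaller GIN index specialized for @> containment queries.
    CREATE INDEX orders_details_idx ON orders USING gin (details jsonb_path_ops);

    -- A containment lookup that can use the index above.
    SELECT * FROM orders WHERE details @> '{"customer": "acme"}';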
[00:02:24] The next article is Materialized views vs. rollup tables in Postgres, and this is from the Citusdata.com blog. Basically this is for when you want to do some type of near-real-time analytics, and it looks at different ways you can maintain rolled-up aggregates. So first they go over what a view is. A view is essentially a virtual table: it doesn't really exist as a table, but it's a virtual structure that lets you pull data from one or more tables, and every query against it still has to access all of those underlying tables. When you materialize your view, it goes from being a virtual table to a physical one that you can reference. And then there is a command to refresh it, you can see REFRESH MATERIALIZED VIEW right here, and you can refresh it concurrently so it can be done without having to lock the table. And they say this use case, using materialized views, is good if you're going to roll up some totals in the evening when the system is not busy, but it may not be great for something that needs to be more real time.
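As a rough sketch of what they're describing (the table and column names here are my own, not from the post), a materialized view plus a concurrent refresh looks something like this:

    -- Hypothetical raw analytics table.
    CREATE TABLE page_views (page_id bigint, viewed_at timestamptz);

    -- The materialized view stores the rolled-up daily counts physically.
    CREATE MATERIALIZED VIEW daily_page_views AS
    SELECT date_trunc('day', viewed_at)::date AS day,
           page_id,
           count(*) AS view_count
    FROM page_views
    GROUP BY 1, 2;

    -- REFRESH ... CONCURRENTLY requires a unique index on the materialized view.
    CREATE UNIQUE INDEX ON daily_page_views (day, page_id);

    -- Re-runs the full underlying query, but without blocking readers of the view.
    REFRESH MATERIALIZED VIEW CONCURRENTLY daily_page_views;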
[00:03:40] And here they're talking about incremental rollups for when you need to be more scalable. So basically this is a summary table: they're rolling up analytics counts by inserting into the rollup table from a select on their page views table, their analytics counts table, grouping it by day and page. Now, because they put a unique constraint on this rollup table, you're not going to be able to insert more than one row per day and page here, so what they use is ON CONFLICT DO UPDATE. And in the process of doing it, they're actually using some functions, and they reference another post that we've covered on a previous episode of Scaling Postgres that I highly suggest you check out. But basically, using the functions they present there, you can insert a new day or just update the counts, and presumably you can set this up on a schedule, maybe every hour or every 15 minutes, and it will update those totals for you. So it will be more real time: you're updating just the values that have changed versus refreshing an entire materialized view. So if you're looking to refresh summary tables, here are two techniques you can potentially use. Definitely a blog post to check out.
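The general shape of the incremental approach looks roughly like this (a simplified sketch with assumed table names, not the exact functions from their post, which also track which raw rows have already been aggregated so nothing gets double-counted):

    -- Assumed rollup table; the unique constraint is what makes the upsert work.
    CREATE TABLE rollup_page_views (
        day        date   NOT NULL,
        page_id    bigint NOT NULL,
        view_count bigint NOT NULL,
        UNIQUE (day, page_id)
    );

    -- Aggregate the newest raw rows (from the page_views table sketched earlier)
    -- and merge them into the rollup, bumping counts for rows that already exist.
    INSERT INTO rollup_page_views (day, page_id, view_count)
    SELECT date_trunc('day', viewed_at)::date, page_id, count(*)
    FROM page_views
    WHERE viewed_at >= now() - interval '15 minutes'   -- only rows not yet rolled up
    GROUP BY 1, 2
    ON CONFLICT (day, page_id)
    DO UPDATE SET view_count = rollup_page_views.view_count + EXCLUDED.view_count;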
[00:05:04] The next post is A poor man's column oriented database in PostgreSQL, and this is from brianlikespostgres.com. Now, I'm going to quote the first three sentences of the blog post: "Let's get this out of the way. If you need a real column oriented database, use one. Don't even think about using the insanity below in production." So really this is just an interesting thought experiment, but I thought it was worthwhile because it helps you understand the rationale for why OLAP, or online analytical processing, environments use column-oriented databases. And he says here that two properties of a column-oriented database make it beneficial in OLAP environments. First, queries often involve only a subset of a table's columns: maybe you're aggregating some value in one column, and it only needs to read that column off of disk, which saves on I/O. In addition, storing each column separately means the data can be compressed better, since it's all of the same type. Now, what he does here is create a one-column table with just a unique identifier, and then another table for each column that contains the unique identifier plus that one column's data. So essentially there is one table dedicated to each column's information, and to use it you have to join multiple tables together to pull data back out. He goes over his whole process of setting this up, the different functions and trigger functions he used. And then he has the results for a particular query where he's just summing one value from a traditional table, so a table with all the columns in it, not column oriented, and it completes in 55 seconds. Then in his column-oriented setup he does the same operation, summing that one column from his column store, and it finishes in 17 seconds. So it actually worked, and he says, quote, "PostgreSQL is smart enough to query only the underlying tables it needs and completes in less than a third of the time of the traditional version." So it has an advantage when it's only looking at one column and aggregating all of the values there. But what happens when you want to retrieve just one row with all the columns, like you would typically do in an online transaction processing application? Selecting one row from the traditional table took 0.16 milliseconds, so far less than a millisecond, whereas selecting one row from his poor man's column store view took 69 milliseconds, so significantly longer to ask for all the columns of a particular row. And then he looks at insert, update, and delete performance, and in terms of his conclusion, quote, "the traditional table beats the pants off of the view on single row operations." And this makes sense: you're storing data essentially by rows, so if you're asking for one row, or inserting or deleting one, it's very easy to do, whereas the column-oriented setup has to touch multiple column tables to change the data or to pull a whole row back. So I thought this post was very interesting because it shows the rationale for column-oriented stores and why they tend to be used for analytical processing: it's slower to put the data in and slower to ask for a single row of data.
So it's terrible for transactional processing workloads, but it works really well when you're analyzing a few columns, like doing a summary, an aggregate of some sort, or maybe an average, some function across a single column, and it enables more space efficiency. So definitely an interesting perspective and experiment that he puts together in his post here.
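The basic layout he's describing looks roughly like this (a simplified sketch of the idea; his post uses generated functions and triggers to keep everything in sync):

    -- One narrow table per column, all sharing the row identifier.
    CREATE TABLE fact_id     (id bigint PRIMARY KEY);
    CREATE TABLE fact_amount (id bigint PRIMARY KEY REFERENCES fact_id, amount numeric);
    CREATE TABLE fact_region (id bigint PRIMARY KEY REFERENCES fact_id, region text);

    -- A view joins the per-column tables back into a "wide" row.
    CREATE VIEW fact AS
    SELECT i.id, a.amount, r.region
    FROM fact_id i
    LEFT JOIN fact_amount a USING (id)
    LEFT JOIN fact_region r USING (id);

    -- The planner can remove left joins to per-column tables whose columns
    -- aren't referenced, so this only touches fact_id and fact_amount.
    SELECT sum(amount) FROM fact;

    -- Fetching one complete row has to join every per-column table.
    SELECT * FROM fact WHERE id = 42;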
[00:09:14] The next post is Using parallel sequential scan in PostgreSQL, and this is from Rafia Sabih's "It's all in database" blog on blogspot.com. This was interesting in terms of explaining why a parallel sequential scan can potentially cost more than a regular scan, and in what cases. It's basically the coordination of the different workers that are involved when doing a parallel sequential scan. So, for example, she says, let's say there's a table with 100 tuples, 100 rows, and we have two workers and one leader that needs to coordinate the workers and gather their results. Let's say an arbitrary cost of scanning one tuple is 10, the cost of communicating one tuple from a worker to the leader, that is, passing its result back once the work is done, is 20, and the cost of dividing the tuples among the workers is 30. And she assumes the leader gives 50 tuples to each worker. So the cost of a plain, non-parallel sequential scan of the table would be the cost of scanning one tuple times the number of rows, so a cost of 1,000. But for that same table, if you need to scan and return the whole thing in parallel, you have to add all of these different costs up and it comes out to 2,500, so more than twice the cost. All the coordination of the different workers to get the same result makes it more than twice as expensive in terms of cost values. But she says, let's change it up a little bit here.
[00:10:52] Quote: let's say we want to list only the tuples where column a, one of the columns in her table, is greater than 80, so basically pull out 20 tuples from the table that has 100 tuples. The sequential scan cost will be the same: you still have to scan through all 100 tuples at a cost of 10 each, so it's still 1,000. However, in parallel it works out differently: the fewer rows each worker actually needs to pull back and send to the leader, the more efficient it gets. So here the cost drops down to 730, so the cost of doing the parallel scan is actually less than a standard sequential scan. And she says, quote, "a parallel sequential scan is likely to improve the performance of queries that require scanning a large amount of data, but only a few of them satisfy the selection criteria." So again, the more selective the data you're looking for in a large table, the more a parallel sequential scan helps. And she goes into the different configuration variables you can set for your PostgreSQL instance. So if you're interested in parallel sequential scan, how it works, and the rationale behind it, this is definitely a great blog post to check out.
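Just to make that arithmetic concrete, here's the filtered case worked out with her illustrative numbers (these are not real planner costs, just the example values from the post), along with one of the real settings that controls parallelism:

    -- Illustrative cost arithmetic only; not actual planner numbers.
    SELECT
        100 * 10                   AS seq_scan_cost,            -- scan all 100 tuples at 10 each = 1000
        30 + (50 * 10) + (10 * 20) AS parallel_cost_per_worker; -- divide work (30) + scan 50 tuples (500)
                                                                -- + send ~10 matching tuples to the leader (200) = 730

    -- One of the configuration variables that governs how many workers can be used.
    SHOW max_parallel_workers_per_gather;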
[00:12:15] The next post is It's just an expression, from the haxoclock.blogspot.com blog. This blog post is essentially talking about text search, and specifically case-insensitive text search. So one way to do it is to search on a column using ILIKE, but the performance of that gets bad pretty quickly. You could also use a case-insensitive regex, but again the performance of that can be poor at times. But he says there is a great extension you can use called citext, which is a case-insensitive text type: it lets you use indexes and you still get case-insensitive matching. But, he says, let's say you don't have that extension installed or you didn't create the table with that column type in the first place, and you need to search on, for example, an email address, the classic case where you have a lowercase email address and you're trying to find a match for it in the table, but the data in the table may not be lowercase. So you can say WHERE lower(email) equals your value, but the problem is it's not going to use an index: you can see it's doing a sequential scan on the email addresses even if you have an index on that column. What you can do instead is create an expression index, essentially an index on lower(email), and of course now this query will use that index. So those are a couple of different solutions for doing case-insensitive searches in your database.
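Here's a quick sketch of the two approaches he describes (the table and column names are assumed for illustration):

    -- Assumed table with a plain text email column.
    CREATE TABLE users (id bigserial PRIMARY KEY, email text);

    -- Option 1: the citext extension gives case-insensitive comparisons
    -- while still being able to use an ordinary index.
    CREATE EXTENSION IF NOT EXISTS citext;
    CREATE TABLE users_ci (email citext UNIQUE);

    -- Option 2: an expression index on lower(email) for the existing text column.
    CREATE INDEX users_email_lower_idx ON users (lower(email));

    -- This query can now use the expression index instead of a sequential scan.
    SELECT * FROM users WHERE lower(email) = lower('Someone@Example.com');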
[00:13:44] The next post is Peeking at query performance of the upcoming version 11 of PostgreSQL. Now, this was released last week, but Postgres 11 is already out, so the blog post is a little bit dated; also keep in mind they're using release candidate 1. They use some test queries to see how version 11's performance has changed, and they describe the hardware and server config used for the tests, showing an example of a sum query, a join query, a cube query and an index query. A few are only a little bit better with the version 11 release candidate, but some of them show dramatic improvements, like the cube and the index queries. So if you're looking for performance differences in version 11, this is potentially a blog post to check out.
[00:14:39] The next post is The biggest mistake Postgres ever made. This is from Craig Kerstiens' blog, and it goes into a little bit of the history of Postgres and where it came from. Basically it gets its name from Ingres: it's "post-Ingres," it came after the Ingres database, and that's how Postgres came into being. But as he says, quote, in the early days of Postgres there was no SQL. Not NoSQL, but literally no SQL; it didn't support it. Then in 1995 they added SQL support, and with it they decided to change the name to PostgreSQL. That single decision is essentially the biggest mistake Postgres ever made, because now the official way to say it, and even I've been corrected on it, is PostgreSQL. Basically, in retrospect, it seems like we should have just named it Postgres. But anyway, this is an interesting post talking a little bit about the history and the biggest mistake Postgres ever made.
[00:15:37] The next post is Introducing PostgreSQL Anonymizer. This is a very early extension that someone has put together, but it's something that may be interesting. I'm not sure about the implementation, but what it does is let you define masks for certain roles to obfuscate data. In other words, you keep the same data in the database, but when it's retrieved, a function replaces what particular users will see. So for example, if someone with DBA rights queries the database, they get the exact data; in this case they would see the name and the phone number. But for other users that have a masked role, you can define functions so the data returned is scrambled and they can't see what the actual values are. So again, this is super, super early, but if you're potentially interested in this for achieving GDPR compliance, or you just need to anonymize data, perhaps this is a project you want to check out.
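I won't guess at the extension's exact syntax here, but conceptually the masking idea is similar to something you could hand-roll yourself, for example exposing a masked view to restricted roles (a do-it-yourself sketch of the concept, not how the extension itself is configured):

    -- Hand-rolled illustration of masking; this is not the extension's API.
    CREATE TABLE people (id bigserial PRIMARY KEY, name text, phone text);
    CREATE ROLE masked_user LOGIN;

    CREATE VIEW people_masked AS
    SELECT id,
           md5(name)  AS name,   -- real name never exposed to masked users
           'XXX-XXXX' AS phone   -- constant placeholder instead of the real number
    FROM people;

    -- DBAs keep access to the real table; the masked role only sees the view.
    REVOKE ALL ON people FROM masked_user;
    GRANT SELECT ON people_masked TO masked_user;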
[00:16:47] The last post is TimescaleDB 1.0 is production ready. So essentially TimescaleDB 1.0 has been released, and this is a general announcement discussing it and what changes have been added since release candidate 1. So if you're interested in a time-series database that is essentially an extension of PostgreSQL, definitely a blog post to check out.
[00:17:11] That does it for this episode of Scaling Postgres. You can get links to all the content mentioned in the show notes. Be sure to head over to Scalingpostgres.com where you can sign up to receive weekly notifications of each episode, or you could subscribe via YouTube or iTunes. Thanks.