Episode Transcript
[00:00:00] In this episode of Scaling Postgres, we talk about the future, slow counts, avoiding OR, and replication. I'm Creston Jameson, and this is Scaling Postgres episode twelve.
[00:00:20] Alright, the first article we have is It's the Future (for databases), and this is from the Citusdata.com blog. This is presumably a fictional story of a data architect in San Francisco who wants to talk with someone who is presumably up to date with the latest technologies. Basically, the data architect just wants to use Postgres, but the other person keeps trying to convince them to use other, newer, more modern data architectures. It's a very humorous post to go through, so trust me, you'll want to read it. Go ahead and check it out.
[00:01:10] The next post is actually a YouTube video that was posted from PyCon Canada, and it's Postgres, MVCC, and You, or Why Counting Everything is Slow, by David. Now, this was presented in November of 2017, but it was just posted recently, and it goes over why count is slow, because this is one of the many questions that gets asked when things start to slow down: why is counting all the rows in a table slow? It's basically because of MVCC, essentially row visibility. Now, the presentation takes a while to get up to speed, so if you have some knowledge of Postgres, I would definitely suggest jumping to around the twelve minute, thirty second mark, because that's where he gets into slow counting. He does a little bit of live coding where he shows row visibility: if you're performing a transaction, within the transaction you see one version of the data, whereas everyone else sees the old version, and once you commit, everyone can see the changed version. He talks about some of the ramifications of using MVCC and why you can get slow counts, and then why you have autovacuum and why it's important, why you have transaction IDs and why they can potentially run out. So it starts off explaining why count is slow, but it also gets into some of the deeper architectural details of PostgreSQL. Now, he is a developer, so this is from a developer's perspective, but it gives a really good overview of how Postgres is designed the way it is. If you're already familiar with Postgres and its internals, this might be a little too basic for you, but if you're a developer wanting to get a sense of why Postgres is designed the way it is, this is a great overview. So I definitely suggest you check it out.
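As a quick aside, if you want to see that row visibility behavior for yourself, here's a minimal sketch you can run in two psql sessions. This is my own illustration, not from the talk, and the table name is made up:

    -- Session 1: create a table, then start a transaction that updates it
    CREATE TABLE items (id serial PRIMARY KEY, name text);
    INSERT INTO items (name) VALUES ('original');

    BEGIN;
    UPDATE items SET name = 'changed' WHERE id = 1;
    SELECT name FROM items;  -- Session 1 sees 'changed'

    -- Session 2 (run while Session 1's transaction is still open):
    SELECT name FROM items;  -- Session 2 still sees 'original'

    -- Back in Session 1:
    COMMIT;
    -- Now Session 2 sees 'changed' as well.

This is also why count(*) can't just read a stored total: different sessions can see different sets of row versions, so Postgres has to check visibility row by row.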
[00:03:21] The next article is Avoiding "OR" for better query performance, and this is from the Cybertec-postgresql.com blog. It basically shows you places in your queries where using OR can cause performance problems.
[00:03:38] So he goes over an example, setting up two tables with some primary keys, some foreign keys, and even some indexes. He basically says the good ORs are the ones that are not in the WHERE clause; essentially they're in a CASE expression or in the SELECT list. Then he shows examples of bad ORs. Now, sometimes you just can't help it, but in this example there's a better alternative than selecting that way: you can select based on the ID, because the text field 'value_42' tells you the ID is 42, so you can just select by the ID. Basically, minimizing your use of OR in the WHERE clause will speed up query performance. And probably the best advice is that IN is better than OR, and he shows it explicitly here: when you're doing WHERE this ID OR that ID, it does a bitmap index scan for each of the conditions and then BitmapOrs them, whereas you can do a single index scan when you're using IN. Then he goes into another example where the LIKE operator cannot use a plain index; for example, he gets a sequential scan when using LIKE with two different text patterns. However, you can use the pg_trgm module and a GIN index to be able to use an index for those particular types of queries. And lastly, he follows up with the ugly OR, where you're ORing across two different tables, and he shows one alternative to that: using UNION, which generates a more efficient plan for pulling your data. So it's definitely a blog post to review to help make your queries faster.
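To make those rewrites concrete, here's a minimal sketch of my own; the table and column names are made up, not the ones from the blog post:

    -- OR on the same indexed column: two bitmap index scans plus a BitmapOr
    SELECT * FROM orders WHERE id = 42 OR id = 4711;

    -- Equivalent IN: a single index scan covering both values
    SELECT * FROM orders WHERE id IN (42, 4711);

    -- LIKE patterns become indexable with the pg_trgm extension and a GIN index
    CREATE EXTENSION IF NOT EXISTS pg_trgm;
    CREATE INDEX ON customers USING gin (name gin_trgm_ops);
    SELECT * FROM customers WHERE name LIKE '%smith%' OR name LIKE '%jones%';

    -- The "ugly" OR across two tables:
    --   WHERE o.status = 'open' OR c.region = 'EU'
    -- can often be rewritten as a UNION so each branch uses its own index:
    SELECT o.* FROM orders o JOIN customers c ON c.id = o.customer_id
    WHERE o.status = 'open'
    UNION
    SELECT o.* FROM orders o JOIN customers c ON c.id = o.customer_id
    WHERE c.region = 'EU';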
[00:05:28] The next blog post is PostgreSQL data types: point. This is a continuation of the PostgreSQL data types series that this blog has been producing; this is the Tapoueh.org blog. Now, point, of course, is used for geospatial data, so it's essentially a latitude and longitude that you can store in a single data type called point. This is quite a long blog post and there's a lot that goes into it. I would say it's definitely a niche data type used in specific cases, and I haven't really seen it used much; mostly I've seen people store latitude and longitude in separate columns rather than in one data type field. But if you are working with geospatial related data, it's definitely a data type to check out to see if it could be beneficial for you.
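For a sense of what that looks like, here's a small sketch of my own (the table name and coordinates are examples, and note that point stores an (x, y) pair, so the convention here is (longitude, latitude)):

    CREATE TABLE places (
        name     text PRIMARY KEY,
        location point
    );

    INSERT INTO places VALUES ('Paris', point(2.3522, 48.8566));

    -- The <-> operator gives the distance between two points
    -- (straight-line Euclidean distance, not a great-circle distance)
    SELECT name, location <-> point(0, 51.5) AS distance
    FROM places
    ORDER BY distance;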
[00:06:26] The next blog post is An overview of the serial pseudo data type for PostgreSQL, and this is from the Severalnines.com blog. So serial is considered a pseudo data type because of what happens when you use it while creating a table: specifying serial actually creates an integer column, often used as a primary key, that will auto increment using a sequence. This goes over some of the concepts of unique keys and declaring a serial data type. Here they do it in the CREATE TABLE statement and show what it's actually creating: there is no serial data type in the resulting table, it becomes an integer field, but it does create a related sequence, defines ownership, and ties the sequence to the integer column that's been created. It also shows a little bit about how it works when you're inserting data, talks about missing serial values, and even covers the sequence manipulation functions. So if you're curious about how serial works when setting up your tables, it's definitely a blog post to check out.
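For reference, here's roughly what a serial column expands to; this is a sketch with a made-up table name:

    -- Writing this:
    CREATE TABLE users (id serial PRIMARY KEY, email text);

    -- ...is roughly equivalent to:
    CREATE SEQUENCE users_id_seq;
    CREATE TABLE users (
        id    integer PRIMARY KEY DEFAULT nextval('users_id_seq'),
        email text
    );
    -- Ownership means the sequence is dropped automatically with the column
    ALTER SEQUENCE users_id_seq OWNED BY users.id;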
[00:07:34] The next blog post is actually some slides from a presentation, PostgreSQL Replication, and this is by Christophe Pettus of PostgreSQL Experts. Now, this is a very comprehensive presentation, and it goes over all the different types of PostgreSQL replication, from log shipping WAL files, to streaming replication, to logical replication. I was familiar with a great deal of it, but there was definitely a great refresher for me around slide 38, where he talks about the logical replication options and continues on to some things you need to be aware of. For example, sequence values are not replicated. He's discussing the core logical replication in PostgreSQL ten, but he also discusses pglogical, which is a separate tool available for versions before ten. He also talks about how TRUNCATE does not get replicated in core, whereas pglogical replicates TRUNCATE but not cascaded TRUNCATE. And be aware, you can only replicate a real table to a real table: no materialized views, views, foreign tables, or partition root tables. So it will replicate the data in the partition tables, but not the root table itself. And it cannot replicate temporary or unlogged tables. Also, COPY operations are broken into individual inserts, and individual statements are unrolled, so a single UPDATE changing 10,000 rows will be applied as 10,000 updates. He talks about other issues too, such as that if a primary with logical subscribers fails over to a secondary, the current logical replication state is not passed over to the secondary, which can cause synchronization problems, though he did note that PostgreSQL eleven should address this. So it's definitely a great overview of all sorts of replication technologies available in PostgreSQL, both physical streaming replication and logical replication, and definitely a presentation I would suggest you review.
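If you haven't tried the core logical replication in PostgreSQL ten, the basic setup looks something like this. This is a minimal sketch of my own, not from the slides; the publication, subscription, table names, and connection string are all made up:

    -- On the publisher (requires wal_level = logical in postgresql.conf):
    CREATE PUBLICATION my_pub FOR TABLE orders, customers;

    -- On the subscriber (the tables must already exist with matching schemas):
    CREATE SUBSCRIPTION my_sub
        CONNECTION 'host=primary.example.com dbname=app user=replicator'
        PUBLICATION my_pub;

    -- Remember: sequence values are NOT replicated, so after switching over
    -- to a subscriber you'd need to resynchronize sequences yourself, e.g.:
    SELECT setval('orders_id_seq', (SELECT max(id) FROM orders));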
[00:09:44] The last blog post is PG Phriday: BDR Around the Globe. So this is an interesting post. It's about BDR, which stands for bi-directional replication, and it's a tool from 2ndQuadrant; this is from the 2ndQuadrant.com blog. It was interesting because it discusses the use case for why you would particularly want to use this tool. It does logical, multi-master replication with basic conflict resolution, in terms of the last update wins, and it handles distributed locking, global sequences, and high latency replay. They go over scenarios where you have, say, four different data centers around the world and how this tool can help with that. So from the perspective of scaling, if you are looking to scale geographically in this fashion and essentially have one common data set, this is definitely a tool to examine to see if it would be beneficial for you.
[00:10:48] That does it for this episode of Scaling Postgres. You can get links to all the content presented in the show notes. Be sure to head over to Scalingpostgres.com, where you can sign up to receive weekly notifications of each episode, or you can subscribe via YouTube or iTunes. Thanks.