Build A GPT In SQL | Scaling Postgres 299

Episode 299 | January 21, 2024 | 00:12:21

Hosted By

Creston Jamison

Show Notes

In this episode of Scaling Postgres, we discuss how you can build a GPT in 500 lines of SQL code, how to optimize extension loading, the best way to set passwords, and how to change generated columns.

To get the show notes as well as get notified of new episodes, visit: 

https://www.scalingpostgres.com/episodes/299-build-a-gpt-in-sql/

 


Episode Transcript

[00:00:00] You know, frequently I cover blog posts on Scaling Postgres that use SQL for one programming task or another. Think of the Advent of Code series, where someone from Crunchy Data was doing all of the tasks in SQL. But this week we're going to be talking about a GPT that was written in 500 lines of SQL code. Before we get into that, I hope you, your friends, family and coworkers continue to do well.

Our first piece of content is "Happy New Year: GPT in 500 lines of SQL". This is from explainextended.com, and it's funny: he said he asked ChatGPT, "hey, can you implement a large language model in SQL?" and ChatGPT said no, you can't do that, SQL is not suitable for implementing large language models, et cetera, et cetera. To which he wrote that it just proves that if you want something done right, you have to do it yourself. Or, to put it in today's parlance, "here, hold my beer." So even though this is not a performance-based post, with AI coming to the fore I really felt this was the strongest post this week and wanted to highlight it on Scaling Postgres, because there's a lot of information here about AI and large language models that I haven't seen described in this particular way. He starts from the bare bones and builds up to what a GPT is, basically a generative pre-trained transformer, and how you build one, because he's going to build it in SQL. For example, there's code with functions laying out the particular steps to build this, and he says it's essentially GPT-2. But just to forewarn you, your brain may start to explode as you get deeper into it, once we get into the different functions, equations, matrices and math. I still think this is a very good read to try to put more of the pieces together about how AI works. The actual resulting source code is at the bottom, where you can expand the source and see all 500 lines of SQL that get it working. And as an endorsement that this is a great post, four different commenters gave him kudos for writing it. So if you're interested at all in learning more about AI, or even some SQL, I definitely encourage checking out this blog post.

[00:02:32] Next piece of content: "Optimizing PostgreSQL functions with preloaded libraries". This is from cybertec-postgresql.com, and he's talking about a situation where he's using some PostGIS code: on the first invocation it takes about ten milliseconds, but a subsequent run takes 0.6 milliseconds, so less than a millisecond. The reason is that when a database connection is first established, the PostGIS library is not loaded yet; it has to be loaded the first time it's used inside a connection. Of course this has implications when you do a restart, because the libraries aren't loaded and all the connections have to be re-established and load the different extensions again. You can also see these performance issues if you're not using a pooler like PgBouncer, because every new connection's first query is going to be slow. The way around it is to load the libraries early, and there's a setting for that. He says it's a fairly old feature in Postgres, session_preload_libraries, which lets you define the libraries you want preloaded before a session starts. With that in place it's still a little slow at 2.8 milliseconds, but definitely faster than without it. So this is something to consider if you're seeing performance problems potentially related to this.
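As a rough idea of what that setting looks like in practice, here is a generic sketch rather than code from the post; the role name and the 'postgis-3' library name are assumptions that depend on your PostGIS version and packaging:

-- Hypothetical application role; skip this if the role already exists.
CREATE ROLE app_user LOGIN;

-- Preload the PostGIS shared library for every new session of that role
-- ('postgis-3' is the typical library name for PostGIS 3.x; adjust as needed).
ALTER ROLE app_user SET session_preload_libraries = 'postgis-3';

-- Or cluster-wide: new sessions pick it up after a configuration reload.
ALTER SYSTEM SET session_preload_libraries = 'postgis-3';
SELECT pg_reload_conf();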
Next piece of content: "Postgres password encryption without leaking credentials". This is from launchbylunch.com. He's talking about the different ways you can set a password for a user in Postgres. If you just create a user, or alter a user, and set the password to a string, you're basically sending the password in the clear over the network if you're connecting to the database across the network. So he looks at some solutions to that. First off is Postgres's password encryption. Up until version 10 only MD5 was available, and by default when you set a password it will be encrypted with whatever the default encryption method is. From Postgres 10 on you can use SCRAM-SHA-256, and basically everyone should be using SCRAM-SHA-256 rather than MD5 now. So if you create a user and set the password to a literal string, the database will take it and encrypt it with the default method, MD5 or SCRAM-SHA-256. But you can also send passwords that are already encrypted: you can send an MD5 hash as the literal string and Postgres will store that in its password catalog, or you can send a SCRAM-SHA-256 verifier as the literal string and it will store that and use it for the password comparisons. So basically you need to encrypt the password before creating the user or altering the user to change the password. You could use external tools to do this, but it's also built into the client libraries: libpq has methods for it, as does the Java driver, and he also mentions a Node library. By pre-hashing the password you avoid sending it in cleartext over the network, or even potentially having it logged somewhere. So this is the ideal way to set passwords in Postgres. Check out this blog post if you want to learn more.

[00:05:51] Next piece of content: "Waiting for PostgreSQL 17: ALTER TABLE command to change generation expression". This is from depesz.com, and he's talking about generated columns and the fact that in version 17 you can now update them: you can alter the table to change what expression a generated column uses. In this case he created the column by extracting the year from a timestamp, then altered the table and set the expression to a new one, basically making a two-digit year, and it worked just fine. The disadvantage is that it has to do a whole table rewrite, which is definitely not ideal, but at least you can do it now as of 17. The thing he really didn't like is that when you alter a table you now have SET EXPRESSION AS expression, but you also have SET DEFAULT expression, so the different terms in the help output start to get confusing. Still, a good addition to have in the upcoming Postgres 17.
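As a rough sketch of what that looks like (table and column names here are made up, not taken from the post, and the new ALTER syntax requires PostgreSQL 17 or later):

-- Hypothetical table with a stored generated column (works on PostgreSQL 12+).
CREATE TABLE events (
    ts timestamp NOT NULL,
    yr int GENERATED ALWAYS AS (extract(year FROM ts)::int) STORED
);

-- PostgreSQL 17+: change the generation expression in place;
-- this forces a full table rewrite so existing rows get recomputed.
ALTER TABLE events
    ALTER COLUMN yr SET EXPRESSION AS (extract(year FROM ts)::int % 100);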
[00:06:52] Next piece of content: "Postgres postmaster file explained". This is from crunchydata.com. The postmaster.pid file gets created in the data directory whenever Postgres is started, and it contains some important information that he lists out here. The first line is the PID, the process ID of the parent postgres process, so it's just an integer value that corresponds to a process running on the system. The second line is the data directory for Postgres. The third is the epoch value of the Postgres start time. The fourth line is the port Postgres is listening on. The fifth line is the Unix socket directory. The sixth line is the TCP/IP interfaces being listened on; an asterisk, for example, just means all of them. Line seven is the shared memory key and ID, and lastly there's the current status, which is "ready". The rest of the blog post goes into detail about some of those pieces of information, so if you want to learn more, definitely check out this blog post.

[00:07:54] There was another episode of Postgres FM last week. This one was on pgBadger, and in this episode Michael was joined by Alicia, a program manager for Azure Database for PostgreSQL at Microsoft, and they talked all about pgBadger, which is, I guess, primarily a log analyzer, although it does a lot more than that. It's primarily a tool to help you analyze your log files, so you point it at files and it helps you analyze them. I hate to say it, but I haven't really used pgBadger yet because I tend to analyze the logs on the systems themselves, and rarely do people actually send me logs. But if you're running your own systems, maybe you'd like to check out this tool and listen to the episode here, or check out the YouTube video down here.

[00:08:41] Next piece of content: "Understanding PostgreSQL aggregation and hyperfunctions design". This is from timescale.com. The first part of this post is definitely Postgres-related; the second part is more relevant to Timescale, but I really enjoyed how it explained things. He talks about aggregates versus functions: functions transform an input to an output and generally operate one row at a time, while an aggregate does an operation across many rows. He has a lot of great visual aids explaining how aggregation works, including some animations, as you can see, explaining how an average works: basically, add up the values of the column you want to average, count the rows, and divide the sum by the count to get the average. He also talks about how this works in parallel: basically, two different processes each go through the averaging calculation, and then a final function combines their results. From the Timescale perspective, he talks about them using two-step aggregation in their hyperfunctions, so when you do an average they basically have a transition function and then the outer call does the final calculation. I thought this blog post was pretty interesting for explaining how Postgres operationally does what it does, and it covers Timescale at the end as well.

[00:10:08] Next piece of content: "Introducing pg_query for Postgres 16: parsing SQL, JSON, Windows support, PL/pgSQL parse mode, and more". This is from pganalyze.com. They released another version of their open source library called pg_query, which basically takes SQL queries and turns them into a syntax tree, and they actually use some of the Postgres source code as part of this library to be able to extract that information.

[00:10:38] And I didn't know it, but this has apparently been around for ten years, so that's quite a long time. The post covers the library and all the enhancements added in the most recent version, so check it out if you want to learn more.

[00:10:51] Next piece of content: "How to speed up COUNT(DISTINCT)". This is from eversql.com. He talks about many different ways to speed up count in general, but also count distinct. This is quite a long blog post, but they point out that if you can handle some inaccuracy, meaning you're okay with an estimate, you can use things like HyperLogLog. If you want something more accurate, using a covering index for the particular column you're running distinct on can give you better performance, along with some additional techniques. So check out this blog post if you want to learn more.
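To illustrate the covering-index idea with made-up table and column names (a generic sketch, not the exact queries from the post): an index on the column, combined with a DISTINCT subquery rewrite, often lets Postgres answer the count from an index-only scan.

-- Hypothetical table: many orders per customer.
CREATE TABLE orders (id bigint PRIMARY KEY, customer_id int NOT NULL);
CREATE INDEX orders_customer_id_idx ON orders (customer_id);

-- Straightforward form: Postgres typically reads and sorts/hashes every row.
SELECT count(DISTINCT customer_id) FROM orders;

-- Equivalent rewrite that can use an index-only scan over the index above,
-- often noticeably faster on large, well-vacuumed tables.
SELECT count(*) FROM (SELECT DISTINCT customer_id FROM orders) AS d;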
The last piece of content is "Introducing a new community wiki page: operations cheat sheet". This is from a Postgres blog hosted on blogspot.com, and he includes a link to the Postgres wiki, where there's now an operations cheat sheet. It gives a whole lot of information on how to manage Postgres from an operational perspective, so definitely check out this resource if you want to learn more.

[00:11:50] I hope you enjoyed this episode. Be sure to check out scalingpostgres.com, where you can find links for all the content mentioned in the show, as well as the podcast version and a full transcript. While you're there, I welcome you to sign up for the email list so you get notified whenever new shows come out. Apart from that, thank you, and I will see you next week with our 300th episode.

Other Episodes

Episode 218

June 05, 2022 00:17:03

PG14 Index Bug, View Permissions, Logical Replication Conflicts, AlloyDB Under the Hood | Scaling Postgres 218

In this episode of Scaling Postgres, we discuss a severe bug with index creation in Postgres 14, view permissions options, dealing with logical replication...

Episode 210

April 10, 2022 00:09:52

Date Time Bins, Search Every Field, JSON_TABLE, Picking Primary | Scaling Postgres 210

In this episode of Scaling Postgres, we discuss how to bin date times, search every field of a table, use JSON_TABLE and methods for...

Episode 33

October 08, 2018 00:09:12

Per-Table Autovacuum, FDW Synchronization, Distinct On | Scaling Postgres 33

In this episode of Scaling Postgres, we review articles covering per-table autovacuum, FDW for data synchronization, distinct on and Postgres Open. To get the...
