[Host] Okay, quiet down everyone, the next talk is about to begin. Quiet, quiet down. Okay, good.

So it is a great pleasure for me to introduce our next speaker. Ruben came to this devroom last year and did an absolutely amazing presentation on Vulkan and the use of Vulkan in llama.cpp, and ever since I've been bugging him to keep presenting on that lovely subject. He agreed, so take it away, Ruben.

[Ruben] Thank you. All right, so my name is Ruben. I'm now a very fresh machine learning engineer at Reddit, but my work on Vulkan has mostly happened in my free time. I want to briefly introduce the Vulkan API: why should you even care about it, why is it relevant? Then briefly llama.cpp as well, and then what we have done since last year: there was a ton of work on the llama.cpp Vulkan backend, and a lot has changed since then. Then some benchmarks that show how it actually compares to the usual suspects for running large language models on GPUs, the difficulties I'm struggling with (and that you will also struggle with if you try to use Vulkan for something like this), and a conclusion: is it worth using, is it worth putting in the time to use Vulkan here?

So, what is Vulkan? Isn't that a gaming API? Yes, it's a graphics API. It's the successor to OpenGL, and the idea was to get rid of some of the inefficiencies in OpenGL by making it a lot more abstract. What they ended up with is basically a generic interface to GPUs. You can use the same API code and the same shader code, quote-unquote "kernel" code, on all kinds of GPUs, not just the usual NVIDIA graphics cards. My interest here is mostly this: I don't have a huge $200,000 NVIDIA server somewhere, I don't have data centers. I just have some PC somewhere with an old graphics card. How do I make that run a large language model, so that I can actually use it for things I don't want to share with a cloud? Vulkan can do that. You don't have to use the graphics part of it at all; you can just use its compute shaders as a replacement for kernels, which is basically the same thing, and that way you can use it for machine learning. What I did was add this to llama.cpp, over two years ago now, and it has grown a lot since. llama.cpp itself you should probably be familiar with; there have already been talks about it.
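To make the "generic interface to GPUs" point concrete, here is a minimal, self-contained sketch (my own illustration, not llama.cpp code) that just asks the Vulkan loader which devices are present. The same binary will list an NVIDIA, AMD, Intel, or integrated GPU without any vendor-specific code, which is the property the backend relies on.

```cpp
// Minimal sketch: the same Vulkan code enumerates whatever GPUs are present,
// with no vendor-specific paths. Assumes the Vulkan loader and headers are installed.
#include <vulkan/vulkan.h>
#include <cstdio>
#include <vector>

int main() {
    VkApplicationInfo app = {};
    app.sType = VK_STRUCTURE_TYPE_APPLICATION_INFO;
    app.pApplicationName = "device-list";
    app.apiVersion = VK_API_VERSION_1_2;

    VkInstanceCreateInfo info = {};
    info.sType = VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO;
    info.pApplicationInfo = &app;

    VkInstance instance;
    if (vkCreateInstance(&info, nullptr, &instance) != VK_SUCCESS) {
        std::fprintf(stderr, "no Vulkan driver available\n");
        return 1;
    }

    uint32_t count = 0;
    vkEnumeratePhysicalDevices(instance, &count, nullptr);
    std::vector<VkPhysicalDevice> devices(count);
    vkEnumeratePhysicalDevices(instance, &count, devices.data());

    for (VkPhysicalDevice dev : devices) {
        VkPhysicalDeviceProperties props;
        vkGetPhysicalDeviceProperties(dev, &props);
        // deviceName is a fixed char array; apiVersion encodes the supported Vulkan version.
        std::printf("%s (Vulkan %u.%u)\n", props.deviceName,
                    VK_VERSION_MAJOR(props.apiVersion),
                    VK_VERSION_MINOR(props.apiVersion));
    }

    vkDestroyInstance(instance, nullptr);
    return 0;
}
```

Everything else in the backend (buffers, pipelines, dispatches) is built on that same vendor-neutral API.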
So I'm not going to go too deep into llama.cpp itself, but basically the idea is that whatever hardware you have lying around somewhere, you should be able to run an LLM on it. What kind of LLM you can run depends on how much memory you have and how patient you are while waiting for your responses. llama.cpp is based on a static graph structure, which is not that different from other approaches we've seen. But the cool thing, and this has also grown since last year, is that all of the backend-specific stuff is abstracted away into something you can execute on various backends. The compute graph that contains all of the different operations gets sent to a backend, and it can even be split up and sent to multiple backends, so there's a lot of interesting stuff you can do here.

There are a lot of backends at this point. The most relevant ones are of course CPU, CUDA, Metal, and Vulkan. The ROCm one, for AMD, basically sits on top of the CUDA backend and reuses most of its code. Then there's an OpenCL-based one that is, I think, aimed at mobile phones, there's CANN for, I think, Huawei accelerators, there's WebGPU, which I'm not sure how usable it is yet but is also interesting, and some BLAS ones, which just try to offload large matrix multiplications to libraries that are better optimized for the CPU.

So what has actually happened since last year? One of the most important things we've done is flash attention. If you've ever looked into attention and the way it's used in large language models, you've probably come across flash attention; the paper was hugely influential, and there are multiple versions of it now. The usual way to add it in PyTorch-based projects is to pull in the code from specific GitHub repos. We have a custom shader for it. I didn't write that one; it came from someone at NVIDIA. Last year, one version of it already existed, and it was NVIDIA-specific: the cooperative matrix 2 variant. Cooperative matrices are basically Vulkan's abstraction for tensor cores, or any kind of matrix acceleration hardware. Since last year we've also worked on the cooperative matrix 1 path, which is the Khronos version and not specific to NVIDIA; that is the one that runs, for example, on modern AMD hardware. And of course there's also a scalar version.
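Which of those three flash-attention paths a device can take comes down to which extensions its driver exposes. As a hedged sketch (my own illustration, not the actual selection logic in the backend), the choice can be made by checking the device's extension list:

```cpp
// Hedged sketch, not the actual llama.cpp selection logic: pick a flash-attention
// path based on which cooperative-matrix extensions the device exposes.
#include <vulkan/vulkan.h>
#include <cstring>
#include <vector>

enum class FlashAttnPath { CoopMat2, CoopMat1, Scalar };

FlashAttnPath pick_flash_attn_path(VkPhysicalDevice dev) {
    uint32_t count = 0;
    vkEnumerateDeviceExtensionProperties(dev, nullptr, &count, nullptr);
    std::vector<VkExtensionProperties> exts(count);
    vkEnumerateDeviceExtensionProperties(dev, nullptr, &count, exts.data());

    auto has = [&](const char * name) {
        for (const auto & e : exts) {
            if (std::strcmp(e.extensionName, name) == 0) return true;
        }
        return false;
    };

    // NVIDIA-only variant mentioned above (tensor-core path).
    if (has("VK_NV_cooperative_matrix2")) return FlashAttnPath::CoopMat2;
    // Khronos cross-vendor variant, e.g. modern AMD hardware.
    if (has("VK_KHR_cooperative_matrix")) return FlashAttnPath::CoopMat1;
    // Fallback: scalar shader, no matrix acceleration required.
    return FlashAttnPath::Scalar;
}
```

A real implementation would also query which matrix shapes and data types the device supports, not just the extension name.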
Even if the GPU doesn't have any kind of hardware acceleration for matrix multiplications, we can still run flash attention. It gives you a huge increase in performance once the context has grown very large, and that has become incredibly important with modern large language models, because the context these models can support is extremely big. With flash attention you fuse a lot of operations into one, so you don't need huge intermediate buffers, and you can run it as one kernel call instead of a whole bunch of operations. With something like 128K context you will see a huge difference from using this. So implementing it was very important, and making it available to more hardware was also a huge step; it is one of the things that made a big difference for performance in the Vulkan backend. There's still a lot to do there. Just over the last few weeks I've spent time optimizing the version running on modern AMD hardware, and I got a ton of performance out of that; I had some crazy reports of people getting something like four times faster inference from it. The same is probably true elsewhere, so there's still a lot of optimization work that can be done. So yeah, if anyone else wants to take a look at it, I would be happy not to have to do all of it myself.

Another thing I worked on, maybe half a year ago or so, is using the integer dot product (DP4A-style) int8 acceleration. That's a hardware feature where you take a dot product of four packed int8 values: you pack them into one int32, multiply each pair, add the result to another integer, and all of that happens in a single clock cycle. It's available on some GPUs that don't have the hardware, like tensor cores, to accelerate full matrix multiplications. And this is very interesting for us, because we mostly focus on quantized models, and the quantization schemes we're using allow you to do a lot of the work as int8 multiplications and additions, so you can actually use this and get a big performance increase.
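To make that operation concrete, here is a plain C++ illustration of what the packed int8 dot product computes. On the GPU this is a single instruction (exposed in Vulkan through the VK_KHR_shader_integer_dot_product extension); the loop below only exists to show the semantics.

```cpp
// Scalar illustration of the packed int8 dot product: four signed 8-bit lanes
// packed into each 32-bit word, multiplied pairwise, summed, and added to a
// 32-bit accumulator. In hardware this is one instruction per clock cycle.
#include <cstdint>
#include <cstdio>

int32_t dot4_i8_acc(uint32_t a_packed, uint32_t b_packed, int32_t acc) {
    for (int lane = 0; lane < 4; ++lane) {
        int8_t a = static_cast<int8_t>((a_packed >> (8 * lane)) & 0xFF);
        int8_t b = static_cast<int8_t>((b_packed >> (8 * lane)) & 0xFF);
        acc += static_cast<int32_t>(a) * static_cast<int32_t>(b);
    }
    return acc;
}

int main() {
    // Example: a = {1, -2, 3, 4}, b = {5, 6, -7, 8} -> 1*5 - 2*6 - 3*7 + 4*8 = 4
    uint32_t a = 0x0403FE01;  // bytes packed least-significant lane first
    uint32_t b = 0x08F90605;
    std::printf("%d\n", dot4_i8_acc(a, b, 0));
    return 0;
}
```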
The hardware this most affects is, for example, NVIDIA Pascal, which was the last generation without tensor cores but which does have this dot product support. On the AMD side, Vega 20 is very interesting. That's one of those secret tips if you want cheap GPU acceleration for large language models: the accelerators you can import from Chinese data centers, the MI50, are Vega 20, so that's one of the cards that profits a lot from this. And on Intel GPUs this is also currently the most relevant acceleration feature. The nice part is that I only had to add this to the code once, and it runs on all of these GPUs; it also helped in some other cases where it was usable for other reasons.

One more thing that's very interesting, which another contributor has primarily been working on, is operator fusion. In large language models (this example is exaggerated, I just made up the numbers) you often have the pattern that you have one big operation and then a few small follow-up transformations on the result of that big operation. If you do that in the regular way, you get a dispatch for each of them: you have to load the data from memory, do the calculation, store it, and then load it again to do the next transformation on the same data. If you pull all of that into the big operation, you can save a lot of time by not storing the intermediate results and by not dispatching a bunch of extra kernels. So that's one optimization that's quite useful, but it's also very specific; we don't have a generic way of doing this. If you add a new model architecture and it works differently, it won't immediately benefit, because the operations don't fit what's already implemented, so the existing fusions won't apply to the new model. Someone has to go and look at the new model, figure out where there's potential for fusion, and then actually implement it. There are some cool ideas about how that could be done in a more dynamic way, and that's one of the areas that would be interesting to look at, but someone has to find the time, of course.
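Here is a tiny CPU-side sketch of the fusion idea, with made-up elementwise operations standing in for the real shaders (an illustration, not the actual llama.cpp code). The unfused version writes and re-reads an intermediate buffer and would need several dispatches on a GPU; the fused version folds the follow-ups into the big operation's epilogue.

```cpp
// Unfused vs fused: same math, but the fused version makes one pass over the data
// and never materializes the intermediate result.
#include <cmath>
#include <cstdio>
#include <vector>

// Unfused: a "big" op followed by two separate elementwise passes.
void unfused(const std::vector<float>& x, const std::vector<float>& bias, std::vector<float>& out) {
    std::vector<float> tmp(x.size());                                   // intermediate buffer
    for (size_t i = 0; i < x.size(); ++i) tmp[i] = x[i] * 2.0f;         // "big" op (stand-in)
    for (size_t i = 0; i < x.size(); ++i) tmp[i] += bias[i];            // follow-up 1: bias add
    for (size_t i = 0; i < x.size(); ++i) out[i] = std::tanh(tmp[i]);   // follow-up 2: activation
}

// Fused: the follow-ups become an epilogue of the big op.
void fused(const std::vector<float>& x, const std::vector<float>& bias, std::vector<float>& out) {
    for (size_t i = 0; i < x.size(); ++i) {
        out[i] = std::tanh(x[i] * 2.0f + bias[i]);
    }
}

int main() {
    std::vector<float> x = {0.1f, 0.2f, 0.3f}, bias = {0.0f, 0.1f, 0.2f}, a(3), b(3);
    unfused(x, bias, a);
    fused(x, bias, b);
    std::printf("%f %f\n", a[1], b[1]);  // identical results, fewer passes over memory
    return 0;
}
```

On a GPU the saving is larger than it looks here, because each extra pass also costs a kernel dispatch and a full round trip through memory.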
There's much more that has happened. We also got BF16 support, which wasn't originally supported in Vulkan but came in through extensions. There was a lot of work on reducing CPU overhead. In the beginning, even last year still, we had a kind of dry run: you had to go through the whole model to figure out how much memory is actually needed for the temporary compute buffers, allocate that, and then go through the whole graph again to actually run the compute. We found a way to reduce that by doing these steps on demand: you basically just wait until an allocation is actually needed, resize the buffers then, and continue.

There was also some crazy work on fences. A fence is basically something you wait on, so you wait for an operation on the GPU to finish. Someone figured out that if you just wait for the whole graph to be computed, the CPU sleeps for so long that waking it up again takes quite a bit of extra time. That was solved by adding a fence somewhere early in the graph and then busy-idling near the end, so that the CPU is not deep asleep at the point where the result arrives.

There were also some stable diffusion operators added, which isn't relevant for large language models, but it's very cool to be able to run stable diffusion on Vulkan too. And a huge amount of other stuff happened that I can't cover here.
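If I understood that fence change correctly, the trick is to spend the long part of the wait asleep but already be awake and spinning by the time the last operations finish. A hedged sketch of the idea (not the actual llama.cpp code; the two fences are assumed to be signaled early in and at the end of the submitted graph):

```cpp
// Hedged sketch of the wake-up trick described above, not the backend's real code.
#include <vulkan/vulkan.h>
#include <cstdint>
#include <thread>

void wait_for_graph(VkDevice device, VkFence early_fence, VkFence last_fence) {
    // Deep sleep is fine here: plenty of GPU work remains after this fence signals.
    vkWaitForFences(device, 1, &early_fence, VK_TRUE, UINT64_MAX);

    // Near the end, poll instead of sleeping, so we don't pay the CPU wake-up
    // latency right when the final result becomes available.
    while (vkGetFenceStatus(device, last_fence) == VK_NOT_READY) {
        std::this_thread::yield();
    }
}
```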
So, I want to show some benchmarks. On NVIDIA, what I've done is just run the llama-bench tool that's in the repo, in this case on my 3090, once with the CUDA backend and then the same thing with Vulkan. On the y-axis is how fast the result was; on the x-axis is how much context was in the KV cache, which is exactly where flash attention becomes extremely important. Here you can see that the Vulkan backend is slower than the CUDA backend, but not by much, and at larger context it stays approximately in that range.

So the performance is actually competitive. You might ask: why would I use Vulkan if I can also just use CUDA? But there are some cases where it makes things a lot easier. For example, if you already have a game or something else that you want to integrate AI into, you could add CUDA to that, but it would be a huge hassle. Or you just use Vulkan, which you're already using for graphics, and you get pretty competitive performance out of that as well. This was prompt processing, so prefill, which is where the tensor cores can be used; and the same for token generation. There are some differences there: on GPT-OSS there's still optimization left to be done and we're lagging behind CUDA, but on DeepSeek2, which is actually the architecture of the recent 4.7 Flash model, we are for some reason currently faster in token generation on Vulkan than on CUDA, which is quite interesting.

More interesting for me personally is the Ryzen AI Max, the Strix Halo GPU. You can get that with 128 gigabytes of available VRAM, which makes it very interesting for mixture-of-experts models. Here you can see that on the older Llama 8B models it's actually slightly slower in prompt processing, but on both the huge GPT-OSS 120B and the new 4.7 it outperforms the ROCm backend in prompt processing. Token generation, same thing; even in the GPT-OSS case there's actually a big difference, and the Vulkan backend is currently quite a bit faster there. On the 4.7 it's slightly behind at longer context, but faster at low context, so there's work left to be done there.

There are also cases where there's still a lot of work to be done. There's the Vega 20 card I mentioned. There you can see that the scalar flash attention implementation, which is the only path that card can run, is not really optimized for it yet. So while we can be faster at zero context, for example in the GPT-OSS case, it drops off much faster, so at larger context you get much less performance.
For token generation it's similar: it drops off faster, so there's optimization to be done. If someone wants to look into it, I'm happy to help; otherwise I'm going to have to do it myself, I guess.

Another example here is Intel. I wanted to show this because it highlights the driver issues I'm still running into. You can see that the results don't make much sense; there's something that actually got faster at larger context, which isn't right. Basically, this is an example of a driver issue. The Intel Linux Vulkan driver is just not optimized for this kind of workload yet, and I'm having a lot of issues optimizing for it, which leads to results like this. I've had issues with all drivers at this point; I think I've found bugs in all of them. So that's one of the things I'm dealing with here.

The other thing is: how do I actually optimize a compute shader? One of the issues I have is that while NVIDIA does provide a way to get some insight there, with Nsight Graphics, for AMD I don't have anything like that, and for Intel I don't either. So it's a lot of guesswork to optimize a shader here. You can apply the same techniques as for CUDA, but AMD doesn't behave the same way as NVIDIA, and Intel is different again, so I had to do a lot of guessing and a lot of trial and error to figure out what is actually fast on which hardware.

So, in conclusion: Vulkan is very interesting. You can get a lot of performance out of it, as you've seen; you can actually beat some of the proprietary APIs if you put in enough work. But the development side is harder than with something like CUDA, because you have to do a lot more work on the host side, a lot of boilerplate, to get anything at all working. The tooling is limited in comparison, as I said, and I'm hoping that can be improved in the future. On the other hand, the hardware compatibility is much, much broader than with any of the other usual APIs, so that's the big advantage. Binary size is something that's often forgotten: you get much, much smaller binaries. If you download PyTorch for CUDA, you get multiple gigabytes of device code; with Vulkan, in theory, you'd ship something very small, because the code gets compiled to device-specific code on demand at runtime.
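That on-demand compilation is the normal Vulkan model: you ship portable SPIR-V and the driver lowers it to the device's own instruction set when the shader module and pipeline are created. A minimal sketch of that step (illustration only, not the backend's actual code):

```cpp
// Hedged sketch of why Vulkan binaries stay small: generic SPIR-V ships with the
// application and is turned into device-specific code at runtime by the driver,
// instead of shipping prebuilt kernels for every GPU architecture.
#include <vulkan/vulkan.h>
#include <cstdint>
#include <vector>

VkShaderModule create_module(VkDevice device, const std::vector<uint32_t>& spirv) {
    VkShaderModuleCreateInfo info = {};
    info.sType    = VK_STRUCTURE_TYPE_SHADER_MODULE_CREATE_INFO;
    info.codeSize = spirv.size() * sizeof(uint32_t);  // size in bytes
    info.pCode    = spirv.data();                     // portable SPIR-V words

    VkShaderModule module = VK_NULL_HANDLE;
    // The driver lowers this (and later the pipeline built from it) to the GPU's ISA.
    vkCreateShaderModule(device, &info, nullptr, &module);
    return module;
}
```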
And the performance of Vulkan can actually be very good. There's always something you cannot do with Vulkan, so the ceiling is slightly lower, but as you've seen, you can get pretty close.

So yeah, I hope I've sparked some interest in using Vulkan, in helping out on the backend, or maybe in integrating it somewhere else. I hope that in the future we can use it more often and get to a point where we are not as limited to a single vendor, or a single way of writing kernels, and where we don't have to write completely new kernels just to use a different GPU or different hardware. So yeah, thank you.

[Host] Thank you.