We can see that this workload had been running for nearly two weeks when this was captured, and I checked today and it is still running, so by now it has been more than three weeks. Its GPU utilization has been consistently high.

On the contrary, I have some examples of interactive workloads here. These are two instances of Ollama, and we can see that the notebooks have quite varying utilization. For example, this notebook has had zero average GPU utilization over the past days, which is quite a waste. And the same applies to the Ollama instances, because they only reply to user questions at certain times.

And this is a closer comparison of two notebooks belonging to two different users. In the first user's notebook, we can see that GPU and CPU utilization happen at the same time, and there are quite large gaps in between; this one, for example, is a whole day. So the user has created the notebook but is not using it all the time, which is wasteful. For the second user, however, we can see that even though there are times when no resources are used (it is maybe not very well visible here), there are also periods where the GPU is idle but the CPU is in use, which means the user is still interacting with the notebook.

This means that we cannot decide, based on GPU utilization alone, to kill a notebook or otherwise remove the workload. First, it is interactive, and the user might interact with it later; and second, GPU utilization by itself is not a reliable marker of whether the notebook is being used. However, we might want to tell the first user that they should be using batch jobs or something similar; I don't know why they are using a notebook for this.

So the obvious observation is that we experience low resource utilization. I think many customers and administrators know that resource utilization is low, but with GPUs in particular it is painful, because they are expensive. There are existing solutions such as time slicing or multi-instance GPUs, which allow us to use a GPU more, but they do not make it possible to preempt a workload that is using the GPU while retaining its progress. That is why we suggest using GPU checkpoint/restore, which does address this problem.

Thank you, Victor. So, transparent checkpointing, as Victor mentioned, allows us to essentially preempt running applications and resume their execution later, either on the same node or on a different one.
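As an aside on the monitoring just discussed, here is a minimal sketch of how GPU and CPU utilization could be sampled together on a notebook host, since GPU utilization alone is not a reliable idleness signal. It assumes the nvidia-ml-py (pynvml) and psutil packages; the thresholds and the classification are illustrative and are not the setup used in the talk.

# Sample GPU and CPU utilization together; an idle GPU with an active CPU
# still counts as interactive use, as described above.
import psutil
import pynvml

IDLE_GPU_THRESHOLD = 5  # percent, illustrative
IDLE_CPU_THRESHOLD = 5  # percent, illustrative

def sample(device_index: int = 0) -> dict:
    """Take one combined GPU/CPU utilization sample."""
    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        gpu_util = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu
    finally:
        pynvml.nvmlShutdown()
    cpu_util = psutil.cpu_percent(interval=1.0)
    return {"gpu": gpu_util, "cpu": cpu_util}

def classify(samples: list) -> str:
    """Rough classification of a window of samples."""
    if any(s["gpu"] > IDLE_GPU_THRESHOLD for s in samples):
        return "gpu-active"
    if any(s["cpu"] > IDLE_CPU_THRESHOLD for s in samples):
        return "interactive"  # user is doing something even though the GPU is idle
    return "idle"             # candidate for preemption or checkpointing

if __name__ == "__main__":
    window = [sample() for _ in range(5)]
    print(classify(window))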
Transparent checkpointing allows us to improve utilization by essentially prioritizing different workloads and preserving their state when, for example, they have different priorities.

The challenge in implementing GPU checkpointing, however, is that we have to save the internal state of the GPU and restore it back. This state is often proprietary, and we have to keep the performance overhead low and make sure the application remains in a consistent state.

Several methods have been proposed in the literature for checkpointing and restoring GPU state. The most commonly used one is known as device proxy interception. Essentially, a shared library is preloaded into the application and replaces every device API call with a client-server mechanism that keeps a record of every memory transfer and every API call made to the GPU. This allows us to separate the CPU state from the GPU state, and during restore we can replay the API calls and restore the application.

There are several challenges with this approach. First, the device proxy mechanism has to be able to load the GPU kernels from the binary, which means it has to reimplement the same mechanisms used, for example, by the CUDA runtime. It adds interception overhead to every API call and keeps a record of every host-to-device memory transfer. It also requires a vendor-specific implementation, so it doesn't work for all GPUs; you essentially have to implement the GPU support yourself. And one of the main limitations is that it requires dynamic linking: if you want to use PyTorch with this mechanism, for example, you have to recompile PyTorch with dynamic linking. We did a simple experiment with an open source tool called Cricket that implements this checkpointing mechanism, and we observed that starting a PyTorch application is much slower than standard execution, and that every API call used during neural network training carries a lot of overhead.

So we are exploring a different mechanism that relies on the GPU driver to provide checkpoint/restore functionality, integrated into CRIU. CRIU stands for Checkpoint/Restore In Userspace. It is an open source tool that allows us to save and restore the state of Linux processes, and it has been integrated into several container runtimes such as Docker, Podman, and more recently Kubernetes. It provides GPU checkpointing support through a plugin mechanism for both AMD and NVIDIA GPUs.
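To make the CRIU side concrete, here is a minimal sketch of driving CRIU from Python via its command-line interface to checkpoint and later restore a running process. It assumes CRIU is installed and run with sufficient privileges (typically root) and, for GPU workloads, that the corresponding GPU plugin is available; the PID and image directory below are placeholders.

# Checkpoint a process tree with CRIU and restore it later.
import subprocess
from pathlib import Path

def checkpoint(pid: int, images_dir: Path) -> None:
    """Dump the process tree rooted at `pid` into `images_dir`."""
    images_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["criu", "dump",
         "--tree", str(pid),
         "--images-dir", str(images_dir),
         "--shell-job"],        # the example process was started from a shell
        check=True,
    )

def restore(images_dir: Path) -> None:
    """Restore the previously dumped process tree from `images_dir`."""
    subprocess.run(
        ["criu", "restore",
         "--images-dir", str(images_dir),
         "--shell-job",
         "--restore-detached"],  # do not keep CRIU attached to the restored process
        check=True,
    )

if __name__ == "__main__":
    images = Path("/tmp/criu-images")          # placeholder path
    checkpoint(pid=12345, images_dir=images)   # placeholder PID
    restore(images_dir=images)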
The way it works is that the CUDA plugin allows CRIU to checkpoint NVIDIA applications. It uses cuda-checkpoint, a utility that was recently released by NVIDIA, which communicates with the driver, stops the application from running on the GPU, and moves the state of the application to host memory. This allows us to create a unified checkpoint that contains both the CPU and the GPU state.

The advantages of this approach are that it doesn't require preloading of shared libraries, which is very helpful for containers, because injecting a library into a container and setting up the environment variables is challenging. It also doesn't introduce the overhead of intercepting API calls and recording memory transfers. It works with both static and dynamic linking. And because it is implemented in the GPU driver itself, it works across GPU models.

The way this mechanism is intended... yes? Yes, this plugin only works for NVIDIA; we have a separate plugin for AMD GPUs, which uses an implementation in the Linux kernel. I have a separate slide on that, but due to time constraints I will focus on the NVIDIA implementation.

The integration with Kubernetes is based on the work of Adrian Reber, who recently implemented support for container checkpointing in Kubernetes. And since these plugins are integrated, we don't have to make additional changes, for example to preload libraries or similar. The checkpoint itself works on individual containers, so it allows us to save the state of individual containers, pause them, and restore them later on.

And I have a short demo. In this demo, we have a Kubernetes instance running Open WebUI with Ollama and a Jupyter notebook. In the Jupyter notebook, we have a neural network training job with high GPU utilization that will be running for a long time: a low-priority task that occupies about 8 gigabytes of GPU memory and 100% of GPU utilization. And we have two users using something similar to ChatGPT, where they can perform model inference. But because the GPU is already 100% utilized, these requests are very slow and it takes a long time to get a response back. This is a common problem: these kinds of applications are user-facing and require low response times, and optimizing how quickly these models respond is one of the main challenges today.
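For reference, here is a minimal sketch of how a single container can be checkpointed through the kubelet checkpoint API behind the Kubernetes integration mentioned above. This is an illustration, not necessarily what the demo script does; it assumes the ContainerCheckpoint feature and a CRIU-enabled container runtime, and the node address, namespace, pod and container names, and certificate paths are placeholders.

# Ask the kubelet on a node to checkpoint one container of one pod.
import requests

NODE = "https://node-1.example:10250"                            # kubelet endpoint (placeholder)
NAMESPACE, POD, CONTAINER = "default", "jupyter-0", "notebook"   # placeholders
CLIENT_CERT = ("/path/to/client.crt", "/path/to/client.key")     # placeholder credentials

def checkpoint_container() -> dict:
    """POST to the kubelet checkpoint endpoint; the response lists the checkpoint archive."""
    url = f"{NODE}/checkpoint/{NAMESPACE}/{POD}/{CONTAINER}"
    # verify=False only for this sketch; verify the kubelet's CA in real use.
    resp = requests.post(url, cert=CLIENT_CERT, verify=False, timeout=120)
    resp.raise_for_status()
    return resp.json()  # e.g. {"items": ["/var/lib/kubelet/checkpoints/checkpoint-....tar"]}

if __name__ == "__main__":
    print(checkpoint_container())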
So we have a checkpoint mechanism that allows us to essentially stop the training task in the Jupyter notebook and move its state into host memory; this is what the checkpoint script does. And it allows the inference jobs to complete much more quickly, because we can preempt the training task and then resume it later on with the restore script.

In our evaluation, we focused on the main factors that add latency and on how this mechanism scales with multiple GPU devices. We observe that, because large language models usually have a lot of GPU state, moving the state from the GPU to host memory is fairly fast; this is what we can see for the A100 in the left graph. Saving the whole state to disk, however, is usually very slow, and restoring it back also takes a long time.

In this graph, the blue bars show the total restore time, that is, the time it takes to load the GPU state into host memory and then move it from host memory back into the GPU. The GPU restore time is shown in orange, and the unlock time, a synchronization mechanism that allows us to restore the application in a consistent state, is negligible.

The checkpoint size depends on the model parameters and the quantization of the model. We can observe that most of the checkpoint size, around 90%, is GPU state, and it can be up to 97% for large models.

I just want to take a moment to acknowledge the other people who contributed to this work: Andrei from Google, Steven from NVIDIA, Felix and his team from AMD, Adrian, who contributed the Kubernetes implementation, and Lukas, who provided access to the two Kubernetes clusters. And thank you so much for coming so early today.

In summary, we presented a fully transparent GPU checkpointing mechanism that works with both AMD and NVIDIA GPUs and has been integrated with Kubernetes. And we are happy to take any questions.

Can you take the mic so we get the questions on the recording as well? Any questions for Radostin and Victor? Yeah, and please don't forget to repeat the question.

So the first part of the question is: to what extent does this checkpointing mechanism support checkpointing of networking protocols or networking interactions? And the second part is: to what extent do you have to restart on the same GPU, or can you use, let's say, another NVIDIA type of GPU and restart there? Yeah, thanks so much for the question.
So, for networking: UDP is a stateless protocol, so we can simply checkpoint and restore UDP sockets. For TCP sockets, the kernel has a functionality called TCP repair that allows us to create a TCP socket, set it to the correct state, and then resume the TCP connection. We also have a network locking mechanism that drops incoming packets so that, while the checkpointed application is not running and the socket is closed, the kernel doesn't send responses back and the connection stays open. So there are a few mechanisms that have been introduced to handle network connections.

As for different types of GPUs: we talked with Steven about whether it might be possible to migrate, for example, from an H100 to an A100 GPU, but this would be very difficult, because the way it is implemented is very different for every GPU. It is, however, possible to migrate between different GPUs as long as they are the same model and you have the same number of GPUs.

More questions? Hey, thank you, great talk. I have a question: do you have any insight on when to checkpoint? What is the trigger point at which you would checkpoint a job, and then restore it, obviously, when there is interaction coming in; but how do you know when to checkpoint? Sometimes you might checkpoint just before the user interacts with the job again, right?

Yeah, so there are many research papers on this topic. Some of them focus on providing fault tolerance, essentially checkpointing periodically and then using the latest checkpoint to recover in case of failure. Microsoft recently published a paper called Just-In-Time Checkpointing, which aims to optimize this mechanism. And there are also serverless use cases: for example, it is much cheaper to run these kinds of inference workloads on serverless platforms, and there checkpoints are usually used to decrease the start-up times of inference jobs. Did that answer your question?

More questions? So another question, a little bit more on the amount of data: do you do anything in terms of data redundancy reduction? If you compute on multiple GPUs, the workloads often contain the same data; is there some way to basically drop the parts that are identical across the GPUs, or do you just take the data from each GPU as it is and continue from there?

Yes. So we evaluated scalability with data-parallel workloads.
That is when we have a model using multiple GPUs and splitting, for example, the data set across them. This scales linearly: we have to save the state of every GPU and restore it back. The mechanism we are currently looking at is compression. Compression is very effective, we have seen some promising results, and it is usually the most efficient way of reducing the size of the checkpoint. There is also a research paper that will appear later this year discussing this.

Any more questions? So, this was for NVIDIA and AMD GPUs. In our environment, we are interested in doing the same thing, but in a totally different environment, which is embedded software. And we don't have those GPUs; the GPUs there are usually Mali or Broadcom VideoCore. So if you want to extend this, to have this mechanism work on all GPUs, what would you advise? How would you approach that? In your presentation I had the impression that in the beginning, with AMD, it was an API approach, and then there is the approach you take with NVIDIA, which is more in the driver, preparing the data structures or something. What would you advise?

Yes. So the API mechanism is the current state of the art; this is, for example, what Microsoft is using internally. They have also announced that this is the mechanism they use, for example, to run ChatGPT and to provide fault tolerance for their training workloads. We are presenting a different approach: it does not require this API interception mechanism, and it is based on what NVIDIA has introduced recently into its GPU drivers and what AMD contributed to the CRIU project.

In terms of different GPUs, what I can say is that I tested the GPU checkpointing mechanism with NVIDIA on about ten different GPUs and confirmed that it works. But yes, you just have to run it and see whether it works for the GPUs that you have.

But for us it is not NVIDIA, it is another GPU. So we would somehow have to get the vendor to do a PoC and try to replicate what NVIDIA is doing? Okay. Yeah.

For other GPUs you would need to find support from the vendor. If you want additional GPUs supported, you need that support, because we have seen upstream that, in theory, the GPU part is not something you can do purely as open source.
Or rather, you can do an open source implementation, but you need inside knowledge of the GPU; without that, I would say, it is impossible without the vendor. Yeah.

Next question? I'm curious about the Kubernetes... sorry, the interactive Jupyter users. If you have been using this mechanism, how do you let them know that they are being preempted and when it is going to be available again? Obviously they might be a bit surprised.

Yeah. We are not using GPU checkpointing itself right now, because it is very new and we haven't tested it for a production use case, I would say. But what we already do is, let's say, naive killing of these workloads, and we let people know in advance with regular automatic emails. We monitor the usage in Grafana, and we have set up some policies based on our own judgment; for example, if you are idle for two days, then we kill your workload. And because we have an email address for every user, we always notify them: if you want to download your data, save something now, because we are going to do this. It has been very helpful. Originally, about a year ago, we had this policy set at seven days, and we very quickly decreased it to one day, because it was not manageable that way. And we would now rather prefer to start using checkpointing, because we know that for some people the progress is important, but if they are away on holiday or something they don't check their emails; we get automatic replies back saying they are not reading email. So it's unfortunate for them, and checkpointing would be better.
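For illustration, here is a minimal sketch of the kind of idle-notebook policy described in this answer: query the average GPU utilization over the last two days and email the owner before any action is taken. It assumes utilization is exported to Prometheus (for example via dcgm-exporter); the Prometheus URL, metric and label names, threshold, SMTP host, and addresses are placeholder assumptions, not the actual setup used by the speakers.

# Find workloads whose two-day average GPU utilization is near zero and warn their owners.
import smtplib
from email.message import EmailMessage

import requests

PROMETHEUS_URL = "http://prometheus.example:9090"       # placeholder
QUERY = 'avg_over_time(DCGM_FI_DEV_GPU_UTIL[2d])'       # two-day window, as in the policy
IDLE_THRESHOLD = 1.0                                    # percent, illustrative
SMTP_HOST = "smtp.example.org"                          # placeholder

def find_idle_series() -> list:
    """Return Prometheus series whose two-day average GPU utilization is below the threshold."""
    resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": QUERY}, timeout=30)
    resp.raise_for_status()
    series = resp.json()["data"]["result"]
    return [s for s in series if float(s["value"][1]) < IDLE_THRESHOLD]

def notify(owner_email: str, workload: str) -> None:
    """Send the kind of advance-warning email mentioned in the answer."""
    msg = EmailMessage()
    msg["Subject"] = f"Your workload {workload} looks idle"
    msg["From"] = "gpu-admin@example.org"                # placeholder
    msg["To"] = owner_email
    msg.set_content(
        "Your notebook has shown roughly 0% GPU utilization for two days.\n"
        "Please save your data; it may be checkpointed or stopped soon."
    )
    with smtplib.SMTP(SMTP_HOST) as smtp:
        smtp.send_message(msg)

if __name__ == "__main__":
    for s in find_idle_series():
        # Label names depend on the exporter and relabeling setup; these are assumptions.
        workload = s["metric"].get("pod", "unknown")
        owner = s["metric"].get("owner_email", "")
        if owner:
            notify(owner, workload)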