We can see that this workload had been running for nearly two weeks when this was captured, and I checked today and it is still running, so by now it has been more than three weeks. Its GPU utilization has been consistently high.

On the contrary, I have some examples of interactive workloads here. These are two instances of Ollama, and we can see that the notebooks have quite varying utilization. For example, this notebook has had zero average GPU utilization over the past days, which is quite a waste. And the same applies to the Ollama instances, because they only reply to user questions at certain times.

And this is a closer comparison of two notebooks belonging to two different users. In the first user's notebook, we can see that GPU and CPU utilization happen at the same time, and there are quite large gaps in between; this one, for example, is a whole day. So the user has created the notebook but is not using it all the time, which is wasteful. For the second user, however, we can see that even though there are times when no resources are used (it is maybe not very well visible here), there are also periods where the GPU is idle but the CPU is in use, which means the user is still interacting with the notebook.

This means that we cannot decide, based on GPU utilization alone, to kill a notebook or otherwise remove the workload. First, it is interactive, and the user might interact with it later; and second, GPU utilization by itself is not a reliable marker of whether the notebook is being used. However, we might want to tell the first user that they should be using batch jobs or something similar; I don't know why they are using a notebook for this.

So the obvious observation is that we experience low resource utilization. I think many customers and administrators know that resource utilization is low, but with GPUs in particular it is painful, because they are expensive. There are existing solutions such as time slicing or multi-instance GPUs, which allow us to use a GPU more, but they do not make it possible to preempt a workload that is using the GPU while retaining its progress. That is why we suggest using GPU checkpoint/restore, which does address this problem.

Thank you, Victor. So, transparent checkpointing, as Victor mentioned, allows us to essentially preempt running applications and resume their execution later, either on the same node or on a different one.
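As an aside on the monitoring just discussed, here is a minimal sketch of how GPU and CPU utilization could be sampled together on a notebook host, since GPU utilization alone is not a reliable idleness signal. It assumes the nvidia-ml-py (pynvml) and psutil packages; the thresholds and the classification are illustrative and are not the setup used in the talk.

# Sample GPU and CPU utilization together; an idle GPU with an active CPU
# still counts as interactive use, as described above.
import psutil
import pynvml

IDLE_GPU_THRESHOLD = 5  # percent, illustrative
IDLE_CPU_THRESHOLD = 5  # percent, illustrative

def sample(device_index: int = 0) -> dict:
    """Take one combined GPU/CPU utilization sample."""
    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        gpu_util = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu
    finally:
        pynvml.nvmlShutdown()
    cpu_util = psutil.cpu_percent(interval=1.0)
    return {"gpu": gpu_util, "cpu": cpu_util}

def classify(samples: list) -> str:
    """Rough classification of a window of samples."""
    if any(s["gpu"] > IDLE_GPU_THRESHOLD for s in samples):
        return "gpu-active"
    if any(s["cpu"] > IDLE_CPU_THRESHOLD for s in samples):
        return "interactive"  # user is doing something even though the GPU is idle
    return "idle"             # candidate for preemption or checkpointing

if __name__ == "__main__":
    window = [sample() for _ in range(5)]
    print(classify(window))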
Transparent checkpointing allows us to improve utilization by essentially prioritizing different workloads and preserving their state when, for example, they have different priorities.

The challenge in implementing GPU checkpointing, however, is that we have to save the internal state of the GPU and restore it back. This state is often proprietary, and we have to keep the performance overhead low and make sure the application remains in a consistent state.

Several methods have been proposed in the literature for checkpointing and restoring GPU state. The most commonly used one is known as device proxy interception. Essentially, a shared library is preloaded into the application and replaces every device API call with a client-server mechanism that keeps a record of every memory transfer and every API call made to the GPU. This allows us to separate the CPU state from the GPU state, and during restore we can replay the API calls and restore the application.

There are several challenges with this approach. First, the device proxy mechanism has to be able to load the GPU kernels from the binary, which means it has to reimplement the same mechanisms used, for example, by the CUDA runtime. It adds interception overhead to every API call and keeps a record of every host-to-device memory transfer. It also requires a vendor-specific implementation, so it doesn't work for all GPUs; you essentially have to implement the GPU support yourself. And one of the main limitations is that it requires dynamic linking: if you want to use PyTorch with this mechanism, for example, you have to recompile PyTorch with dynamic linking. We did a simple experiment with an open source tool called Cricket that implements this checkpointing mechanism, and we observed that starting a PyTorch application is much slower than standard execution, and that every API call used during neural network training carries a lot of overhead.

So we are exploring a different mechanism that relies on the GPU driver to provide checkpoint/restore functionality, integrated into CRIU. CRIU stands for Checkpoint/Restore In Userspace. It is an open source tool that allows us to save and restore the state of Linux processes, and it has been integrated into several container runtimes such as Docker, Podman, and more recently Kubernetes. It provides GPU checkpointing support through a plugin mechanism for both AMD and NVIDIA GPUs.
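To make the CRIU side concrete, here is a minimal sketch of driving CRIU from Python via its command-line interface to checkpoint and later restore a running process. It assumes CRIU is installed and run with sufficient privileges (typically root) and, for GPU workloads, that the corresponding GPU plugin is available; the PID and image directory below are placeholders.

# Checkpoint a process tree with CRIU and restore it later.
import subprocess
from pathlib import Path

def checkpoint(pid: int, images_dir: Path) -> None:
    """Dump the process tree rooted at `pid` into `images_dir`."""
    images_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["criu", "dump",
         "--tree", str(pid),
         "--images-dir", str(images_dir),
         "--shell-job"],        # the example process was started from a shell
        check=True,
    )

def restore(images_dir: Path) -> None:
    """Restore the previously dumped process tree from `images_dir`."""
    subprocess.run(
        ["criu", "restore",
         "--images-dir", str(images_dir),
         "--shell-job",
         "--restore-detached"],  # do not keep CRIU attached to the restored process
        check=True,
    )

if __name__ == "__main__":
    images = Path("/tmp/criu-images")          # placeholder path
    checkpoint(pid=12345, images_dir=images)   # placeholder PID
    restore(images_dir=images)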
The way it works is that the CUDA plugin allows CRIU to checkpoint NVIDIA applications. It uses cuda-checkpoint, a utility that was recently released by NVIDIA, which communicates with the driver, stops the application from running on the GPU, and moves the state of the application to host memory. This allows us to create a unified checkpoint that contains both the CPU and the GPU state.

The advantages of this approach are that it doesn't require preloading of shared libraries, which is very helpful for containers, because injecting a library into a container and setting up the environment variables is challenging. It also doesn't introduce the overhead of intercepting API calls and recording memory transfers. It works with both static and dynamic linking. And because it is implemented in the GPU driver itself, it works across GPU models.

The way this mechanism is intended... yes? Yes, this plugin only works for NVIDIA; we have a separate plugin for AMD GPUs, which uses an implementation in the Linux kernel. I have a separate slide on that, but due to time constraints I will focus on the NVIDIA implementation.

The integration with Kubernetes is based on the work of Adrian Reber, who recently implemented support for container checkpointing in Kubernetes. And since these plugins are integrated, we don't have to make additional changes, for example to preload libraries or similar. The checkpoint itself works on individual containers, so it allows us to save the state of individual containers, pause them, and restore them later on.

And I have a short demo. In this demo, we have a Kubernetes instance running Open WebUI with Ollama and a Jupyter notebook. In the Jupyter notebook, we have a neural network training job with high GPU utilization that will be running for a long time: a low-priority task that occupies about 8 gigabytes of GPU memory and 100% of GPU utilization. And we have two users using something similar to ChatGPT, where they can perform model inference. But because the GPU is already 100% utilized, these requests are very slow and it takes a long time to get a response back. This is a common problem: these kinds of applications are user-facing and require low response times, and optimizing how quickly these models respond is one of the main challenges today.
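For reference, here is a minimal sketch of how a single container can be checkpointed through the kubelet checkpoint API behind the Kubernetes integration mentioned above. This is an illustration, not necessarily what the demo script does; it assumes the ContainerCheckpoint feature and a CRIU-enabled container runtime, and the node address, namespace, pod and container names, and certificate paths are placeholders.

# Ask the kubelet on a node to checkpoint one container of one pod.
import requests

NODE = "https://node-1.example:10250"                            # kubelet endpoint (placeholder)
NAMESPACE, POD, CONTAINER = "default", "jupyter-0", "notebook"   # placeholders
CLIENT_CERT = ("/path/to/client.crt", "/path/to/client.key")     # placeholder credentials

def checkpoint_container() -> dict:
    """POST to the kubelet checkpoint endpoint; the response lists the checkpoint archive."""
    url = f"{NODE}/checkpoint/{NAMESPACE}/{POD}/{CONTAINER}"
    # verify=False only for this sketch; verify the kubelet's CA in real use.
    resp = requests.post(url, cert=CLIENT_CERT, verify=False, timeout=120)
    resp.raise_for_status()
    return resp.json()  # e.g. {"items": ["/var/lib/kubelet/checkpoints/checkpoint-....tar"]}

if __name__ == "__main__":
    print(checkpoint_container())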
So we have a checkpoint mechanism that allows us to essentially stop the training task in the Jupyter notebook and move its state into host memory; this is what the checkpoint script does. And it allows the inference jobs to complete much more quickly, because we can preempt the training task and then resume it later on with the restore script.

In our evaluation, we focused on the main factors that add latency and on how this mechanism scales with multiple GPU devices. We observe that, because large language models usually have a lot of GPU state, moving the state from the GPU to host memory is fairly fast; this is what we can see for the A100 in the left graph. Saving the whole state to disk, however, is usually very slow, and restoring it back also takes a long time.

In this graph, the blue bars show the total restore time, that is, the time it takes to load the GPU state into host memory and then move it from host memory back into the GPU. The GPU restore time is shown in orange, and the unlock time, a synchronization mechanism that allows us to restore the application in a consistent state, is negligible.

The checkpoint size depends on the model parameters and the quantization of the model. We can observe that most of the checkpoint size, around 90%, is GPU state, and it can be up to 97% for large models.

I just want to take a moment to acknowledge the other people who contributed to this work: Andrei from Google, Steven from NVIDIA, Felix and his team from AMD, Adrian, who contributed the Kubernetes implementation, and Lukas, who provided access to the two Kubernetes clusters. And thank you so much for coming so early today.

In summary, we presented a fully transparent GPU checkpointing mechanism that works with both AMD and NVIDIA GPUs and has been integrated with Kubernetes. And we are happy to take any questions.

Can you take the mic so we get the questions on the recording as well? Any questions for Radostin and Victor? Yeah, and please don't forget to repeat the question.

So the first part of the question is: to what extent does this checkpointing mechanism support checkpointing of networking protocols or networking interactions? And the second part is: to what extent do you have to restart on the same GPU, or can you use, let's say, another NVIDIA type of GPU and restart there? Yeah, thanks so much for the question.
So, for networking: UDP is a stateless protocol, so we can simply checkpoint and restore UDP sockets. For TCP sockets, the kernel has a functionality called TCP repair that allows us to create a TCP socket, set it to the correct state, and then resume the TCP connection. We also have a network locking mechanism that drops incoming packets so that, while the checkpointed application is not running and the socket is closed, the kernel doesn't send responses back and the connection stays open. So there are a few mechanisms that have been introduced to handle network connections.

As for different types of GPUs: we talked with Steven about whether it might be possible to migrate, for example, from an H100 to an A100 GPU, but this would be very difficult, because the way it is implemented is very different for every GPU. It is, however, possible to migrate between different GPUs as long as they are the same model and you have the same number of GPUs.

More questions? Hey, thank you, great talk. I have a question: do you have any insight on when to checkpoint? What is the trigger point at which you would checkpoint a job, and then restore it, obviously, when there is interaction coming in; but how do you know when to checkpoint? Sometimes you might checkpoint just before the user interacts with the job again, right?

Yeah, so there are many research papers on this topic. Some of them focus on providing fault tolerance, essentially checkpointing periodically and then using the latest checkpoint to recover in case of failure. Microsoft recently published a paper called Just-In-Time Checkpointing, which aims to optimize this mechanism. And there are also serverless use cases: for example, it is much cheaper to run these kinds of inference workloads on serverless platforms, and there checkpoints are usually used to decrease the start-up times of inference jobs. Did that answer your question?

More questions? So another question, a little bit more on the amount of data: do you do anything in terms of data redundancy reduction? If you compute on multiple GPUs, the workloads often contain the same data; is there some way to basically drop the parts that are identical across the GPUs, or do you just take the data from each GPU as it is and continue from there?

Yes. So we evaluated scalability with data-parallel workloads.
That is when we have a model using multiple GPUs and splitting, for example, the data set across them. This scales linearly: we have to save the state of every GPU and restore it back. The mechanism we are currently looking at is compression. Compression is very effective, we have seen some promising results, and it is usually the most efficient way of reducing the size of the checkpoint. There is also a research paper that will appear later this year discussing this.

Any more questions? So, this was for NVIDIA and AMD GPUs. In our environment, we are interested in doing the same thing, but in a totally different environment, which is embedded software. And we don't have those GPUs; the GPUs there are usually Mali or Broadcom VideoCore. So if you want to extend this, to have this mechanism work on all GPUs, what would you advise? How would you approach that? In your presentation I had the impression that in the beginning, with AMD, it was an API approach, and then there is the approach you take with NVIDIA, which is more in the driver, preparing the data structures or something. What would you advise?

Yes. So the API mechanism is the current state of the art; this is, for example, what Microsoft is using internally. They have also announced that this is the mechanism they use, for example, to run ChatGPT and to provide fault tolerance for their training workloads. We are presenting a different approach: it does not require this API interception mechanism, and it is based on what NVIDIA has introduced recently into its GPU drivers and what AMD contributed to the CRIU project.

In terms of different GPUs, what I can say is that I tested the GPU checkpointing mechanism with NVIDIA on about ten different GPUs and confirmed that it works. But yes, you just have to run it and see whether it works for the GPUs that you have.

But for us it is not NVIDIA, it is another GPU. So we would somehow have to get the vendor to do a PoC and try to replicate what NVIDIA is doing? Okay. Yeah.

For other GPUs you would need to find support from the vendor. If you want additional GPUs supported, you need that support, because we have seen upstream that, in theory, the GPU part is not something you can do purely as open source.
Or rather, you can do an open source implementation, but you need inside knowledge of the GPU; without that, I would say, it is impossible without the vendor. Yeah.

Next question? I'm curious about the Kubernetes... sorry, the interactive Jupyter users. If you have been using this mechanism, how do you let them know that they are being preempted and when it is going to be available again? Obviously they might be a bit surprised.

Yeah. We are not using GPU checkpointing itself right now, because it is very new and we haven't tested it for a production use case, I would say. But what we already do is, let's say, naive killing of these workloads, and we let people know in advance with regular automatic emails. We monitor the usage in Grafana, and we have set up some policies based on our own judgment; for example, if you are idle for two days, then we kill your workload. And because we have an email address for every user, we always notify them: if you want to download your data, save something now, because we are going to do this. It has been very helpful. Originally, about a year ago, we had this policy set at seven days, and we very quickly decreased it to one day, because it was not manageable that way. And we would now rather prefer to start using checkpointing, because we know that for some people the progress is important, but if they are away on holiday or something they don't check their emails; we get automatic replies back saying they are not reading email. So it's unfortunate for them, and checkpointing would be better.
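For illustration, here is a minimal sketch of the kind of idle-notebook policy described in this answer: query the average GPU utilization over the last two days and email the owner before any action is taken. It assumes utilization is exported to Prometheus (for example via dcgm-exporter); the Prometheus URL, metric and label names, threshold, SMTP host, and addresses are placeholder assumptions, not the actual setup used by the speakers.

# Find workloads whose two-day average GPU utilization is near zero and warn their owners.
import smtplib
from email.message import EmailMessage

import requests

PROMETHEUS_URL = "http://prometheus.example:9090"       # placeholder
QUERY = 'avg_over_time(DCGM_FI_DEV_GPU_UTIL[2d])'       # two-day window, as in the policy
IDLE_THRESHOLD = 1.0                                    # percent, illustrative
SMTP_HOST = "smtp.example.org"                          # placeholder

def find_idle_series() -> list:
    """Return Prometheus series whose two-day average GPU utilization is below the threshold."""
    resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": QUERY}, timeout=30)
    resp.raise_for_status()
    series = resp.json()["data"]["result"]
    return [s for s in series if float(s["value"][1]) < IDLE_THRESHOLD]

def notify(owner_email: str, workload: str) -> None:
    """Send the kind of advance-warning email mentioned in the answer."""
    msg = EmailMessage()
    msg["Subject"] = f"Your workload {workload} looks idle"
    msg["From"] = "gpu-admin@example.org"                # placeholder
    msg["To"] = owner_email
    msg.set_content(
        "Your notebook has shown roughly 0% GPU utilization for two days.\n"
        "Please save your data; it may be checkpointed or stopped soon."
    )
    with smtplib.SMTP(SMTP_HOST) as smtp:
        smtp.send_message(msg)

if __name__ == "__main__":
    for s in find_idle_series():
        # Label names depend on the exporter and relabeling setup; these are assumptions.
        workload = s["metric"].get("pod", "unknown")
        owner = s["metric"].get("owner_email", "")
        if owner:
            notify(owner, workload)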