WEBVTT

00:00.000 --> 00:19.720
Hi folks. To clear up some confusion there may have been: zero overhead is better than zero copy, so this is zero copy plus plus.

00:19.720 --> 00:44.920
I'm giving this talk under SDS because the first use cases I have for it are the Ceph file system, and really any software-defined file system as well, right? But it's a generic infrastructure for zero overhead; that's my goal.

00:45.880 --> 01:07.520
We'll look at the motivation, why copying is bad; I hope everyone knows why, but, you know, I'll say a few words. Then the proposed solution, and we'll go through a demo that I have. It's on GitHub; I'm not sure it compiles, so good luck, it will improve in due course.

01:07.560 --> 01:14.560
And I'll be talking about the usage, right, or how we're planning to use it.

01:14.560 --> 01:27.360
So, data movement. We're going to focus on a very specific location: the boundary between the kernel and user space, right?

01:27.360 --> 01:46.000
When you receive data or send data, it crosses the kernel/user-space boundary, and by moving it, your cache gets polluted, your cycles get wasted, and it's a burden on your performance.

01:46.000 --> 01:49.680
So copying is bad; let's not do it.

01:49.680 --> 02:03.480
Usually, as I mentioned, you'll have the network interface card. It actually does move data, and it does use some of the bus, but we're talking about the CPU, right?

02:03.480 --> 02:14.920
So you'll have the kernel buffer, and you'll have one copy into user space. And for the focus of the talk today, we're still talking about proxy systems.

02:14.920 --> 02:33.560
So something like Ganesha, or, you know, something simpler, like a CDN, which is a better example: you have data that you want to cache on your side and then forward, right?

02:33.560 --> 02:53.880
So you're not actually interested in the data itself; rather, you're a proxy system. You have data coming in, and you have to send it out. Once, or twice, or however many times you like, but none of those times are you actually interested in reading the information itself.

02:53.880 --> 03:00.680
So if you're not interested, let's see what we can get if we remove this capability. And we can get a lot, right?

03:00.680 --> 03:13.480
So usually you'll have this buffer, you copy into user space, you cache it if it's a CDN or some kind of cache, and then you copy it back each and every time you have to send it, right?

03:13.480 --> 03:21.880
So how many CPU cycles? A lot. It depends a lot on the speed of your network card.

03:21.880 --> 03:38.840
And the working-set size of your memory really matters. So it's a number that makes sense, but it really depends on a lot of different factors, on how much memory you're using.
03:38.840 --> 03:52.280
When you're not copying data, you're using much less memory, or fewer bytes, so this will improve your performance. By a lot, again, depending on your exact use case.

03:52.280 --> 04:15.240
So what are the overhead elements, and why am I calling it zero overhead rather than, you know, zero copy? You do have MSG_ZEROCOPY and other zero-copy mechanisms inside the kernel today already. They all have their own problems, right?

04:15.240 --> 04:25.720
They're inefficient in some ways; they try to remap your page tables on the fly, and that hurts performance, right?

04:25.720 --> 04:50.280
So zero overhead means your information moves from the NIC and back to the NIC, basically, without any overhead. There is just the control plane; your data plane doesn't move any bytes, and it doesn't manipulate any metadata that you would need in order to avoid moving bytes. It's just there, right?

04:50.280 --> 05:03.960
So now that you're aware of the magical capabilities of our solution, let's see what it actually does. Okay, I'm actually repeating myself here.

05:03.960 --> 05:20.840
So the idea is that because you're not moving data, you actually keep the data inside the kernel, right? It stays inside the kernel; it stays inside the kernel buffers.

05:20.840 --> 05:36.440
You, as the proxy system, get only a handle, plus an offset and size, which you use when it's sent, right? So you only have the buffer and socket IDs, offset and length, as I mentioned.

05:36.440 --> 05:49.880
There are no pointers to the data, only handles that you can use to address the data and say: here, I'm sending this. It's a handle, not a pointer, right, because you can't actually access it.

05:49.880 --> 05:56.200
And all the actual I/O happens exclusively inside the kernel, right?

05:56.200 --> 06:05.800
From user space, in this example, we're using io_uring to communicate with our kernel driver that does it.

06:05.800 --> 06:13.480
I'll give an example of a specific use case a bit later, right?

06:15.080 --> 06:27.320
So what does the user actually do? We allocate buffer handles. This is a solution to a problem that we have today.

06:27.320 --> 06:38.920
Mainly, we make sure that there is a slot in our kernel driver's space for the allocated buffers.

06:38.920 --> 06:46.680
It's just to make sure that we can manage our memory in a good way, we get back pressure from the kernel sockets, and things don't explode.

06:47.400 --> 06:51.880
It's not a necessity; it's the solution we have today, right?

06:53.720 --> 07:04.760
Either way, you still get a handle, a way of addressing your bytes that the kernel understands, right?

07:06.200 --> 07:13.640
But today we pre-allocate the buffers, and the buffers are actually descriptors, right? So we'll pre-allocate descriptors.
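To make the handle idea concrete, here is a minimal C sketch, assuming hypothetical names; zo_buf_handle and its fields are my illustration of the descriptor described above, not the project's actual API.

#include <stdint.h>

/* A send/receive descriptor as user space sees it: opaque IDs plus a
 * window into the kernel-resident buffer. There is deliberately no
 * pointer here, so the payload can never be dereferenced from user
 * space. */
struct zo_buf_handle {
    uint32_t buf_id;    /* slot index in the kernel driver's buffer pool */
    uint32_t sock_id;   /* kernel socket the payload belongs to */
    uint32_t offset;    /* start of the region to send, within the buffer */
    uint32_t length;    /* number of bytes to send from that offset */
};

"Sending" then degenerates to queueing this descriptor to the kernel driver; the data plane itself never crosses the kernel/user-space boundary.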
07:14.280 --> 07:20.280
And we have them ready, and we know that the TCP sockets can handle them, and we can manage them.

07:20.840 --> 07:22.760
I'll give an example; it's all a bit abstract.

07:24.040 --> 07:29.480
We create and manage sockets, but again, the sockets we are using are kernel sockets.

07:30.520 --> 07:45.080
So we have a kernel driver that creates kernel sockets, and what you have in user space is an abstraction. We have this library that provides you a socket-like API.

07:45.640 --> 07:59.000
But the sockets themselves that you're using, TCP, UDP, whatever you like, sit in the kernel, right? So you get receive completion notifications, and you can request to peek inside the data that you're receiving, right?

08:02.680 --> 08:11.320
Because when you receive data, you may want to look at the headers, for example, right? If it's some kind of TLV information: this is what you got,

08:11.960 --> 08:17.560
this is the size of it, and, you know, the offset, whatever you want to do, right?

08:17.560 --> 08:23.240
You don't actually need to see the whole data; you might need to see parts of it, like, small parts of it.

08:24.040 --> 08:28.680
But you can't access the data in any other way, right?

08:30.440 --> 08:46.280
This is the kernel architecture, basically, as I've described. You have this dev-sefk character device, kind of like FUSE, and you have this buffer pool inside the kernel.

08:46.280 --> 09:02.680
You have the socket manager, and you have kind of a zero-copy I/O engine. It's an abstraction for now, since there are still changes to be made inside the kernel sockets.

09:02.680 --> 09:07.880
We'll get to this a bit later; there's a bit more I'd like to say about it, but that's the architecture.

09:07.880 --> 09:16.760
You have the kernel sockets, you have the buffers, and you have the application; one of the potential use cases is FUSE.

09:17.480 --> 09:31.880
Again, we'll discuss it a bit later. We have the handles, which you can create, and we have the actual binaries, the library for interacting with our kernel device.

09:33.640 --> 09:53.160
So, data flow. We have kernel_recvmsg, and the problem we have in today's implementation is that this kernel API copies the bytes inside the kernel as well, right? It has the same copy semantics.

09:53.800 --> 10:03.720
But that's not a given, right? There are alternatives, and you can get an iovec of the actual buffers that were received from the NIC itself, right?

10:03.720 --> 10:15.400
So it's an implementation limitation today, because my focus was to get the infrastructure going, and then to actually make sure that the bits and bytes do not get copied, right?

10:16.040 --> 10:29.080
So you have kernel_recvmsg, and it goes into our data buffer flow; that's the pre-allocation that we already have today, right?
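A minimal sketch of what such a socket-like, handle-based library surface could look like, again with hypothetical zo_* names (the real API lives in the speaker's GitHub repo and may differ):

#include <stddef.h>
#include <stdint.h>

struct zo_ctx;                        /* opaque: wraps the io_uring + device fd */

struct zo_ctx *zo_ctx_create(void);
void           zo_ctx_destroy(struct zo_ctx *ctx);

int  zo_buf_alloc(struct zo_ctx *ctx, uint32_t *buf_id); /* reserve a pool slot */
int  zo_buf_free(struct zo_ctx *ctx, uint32_t buf_id);

int  zo_socket(struct zo_ctx *ctx, int domain, int type); /* returns sock_id */
int  zo_listen(struct zo_ctx *ctx, int sock_id, uint16_t port);

/* A receive completion carries a handle and a length, never a pointer. */
int  zo_recv_completion(struct zo_ctx *ctx, uint32_t *buf_id, uint32_t *len);

/* Peek copies only a small prefix (say, a TLV header) into user space. */
int  zo_peek(struct zo_ctx *ctx, uint32_t buf_id, void *hdr, size_t hdr_len);

/* Send queues metadata only: handle, offset, length, type. */
int  zo_send(struct zo_ctx *ctx, int sock_id, uint32_t buf_id,
             uint32_t offset, uint32_t length, int type);
int  zo_flush(struct zo_ctx *ctx);    /* submit everything queued on the ring */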
10:29.080 --> 10:33.960
So, in the correct implementation, we have zero copies, right? And you have this buffer ID, which today

10:34.040 --> 10:50.920
points to an actual buffer with an actual size, but there it would point to an iovec of received pages, right? Again, they have their sizes, but it's not a single buffer, it's an iovec,

10:51.720 --> 11:05.480
as we received it from the NIC. That's the forward path: you receive the buffer, you have this ID, and then you send it, right? But again, today you're going to send an actual

11:05.480 --> 11:14.440
buffer; you hand it to the device, and it's sent via the kernel socket. Like I said,

11:15.480 --> 11:34.280
for true zero copy this needs to be handle-based all the way, okay. Why does user space initiate sends and receives? A couple of excuses: mainly, I didn't get to fixing it properly. But you still need back pressure on your

11:34.280 --> 11:50.760
TCP sockets, right. When you have two sockets and you're combining them... one of the projects I was involved in a couple of years ago was just connecting two TCP sockets, and you get an improvement in TCP performance

11:50.840 --> 11:59.400
just from the magic of it; just look up KDCP, it's on YouTube. Pre-allocating, as I said, means that

12:01.400 --> 12:13.880
there are several kinds of bugs you just can't get into, and no hot buffer allocations, right. So you pre-allocate the memory, you don't have to allocate anything on the fly,

12:13.880 --> 12:24.840
and the handles simply go, like, one, two, three, four, five, instead of dealing with kASLR and saying, hey, this is a kernel address, now we need to subtract the base, and the opposite on the way back.

12:24.840 --> 12:40.440
It's not that complicated, but for the initial implementation, the sum of these problems resulted in our module pre-allocating the memory.
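For readers unfamiliar with the kernel API being referenced: kernel_recvmsg() does indeed copy into a kernel buffer you hand it, which is why today's path still has one copy inside the kernel. A rough sketch of that receive path, with illustrative structure names only, not the module's actual code:

#include <linux/net.h>
#include <linux/socket.h>
#include <linux/uio.h>

struct pool_buf {
    void   *data;      /* pre-allocated at module init: no hot-path alloc */
    size_t  capacity;
};

static int pool_recv(struct socket *sock, struct pool_buf *slot)
{
    struct msghdr msg = {};
    struct kvec vec = {
        .iov_base = slot->data,
        .iov_len  = slot->capacity,
    };

    /* Copies once inside the kernel for now; the true zero-copy variant
     * would instead keep an iovec of the pages the NIC delivered. */
    return kernel_recvmsg(sock, &msg, &vec, 1, vec.iov_len, 0);
}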
12:41.480 --> 12:53.880
So, the components today: we have the buffer allocation, we have the socket management, we have the io_uring interface between your application and the kernel sockets.

12:55.080 --> 13:09.640
True zero copy is in the works. You have this user-space library where you can create a context, destroy a context, allocate buffers, free them, open a socket, listen, you can be, you know,

13:09.720 --> 13:27.240
sending, receiving; basically, a socket API again. And here is a chunk of actual code; you can get it on GitHub. You receive a buffer, and then,

13:28.280 --> 13:49.560
well, it's not a very pretty chunk of code, but here: you receive bytes, and now you send them, with the context that you created, and this is the out socket, and this is the buffer, right? It's the handle, an offset and a length (if you want to send it all), and some type, right?

13:50.360 --> 14:06.760
And here you just flush it; it's an io_uring application. And you just sent the metadata, not the bytes, right? That's all we did. Now, performance expectations:

14:06.760 --> 14:24.840
we're talking, probably from previous experience, about 40 gigabit, 100 gigabit networks. If you're working with 10 gigabit, you're probably okay with copying as well, but again, it very much depends on your exact use case.

14:25.720 --> 14:52.760
So with the traditional approach, you have two copies, it's right there: you copy a lot of bytes, and you have some latency, again, because you're disrupting the cache and wasting cycles on copying. You're not efficient, right?

14:52.760 --> 15:05.560
With MSG_ZEROCOPY, you don't have actual copying, but you are right there inside very critical paths of your process, manipulating page tables and cache levels, and it's inefficient.

15:06.440 --> 15:14.280
Zero overhead means you have none of that, right? You're just receiving handles and sending them on their way.

15:15.160 --> 15:30.760
So, why say FUSE, why are we having this discussion in this track? Because one of the use cases that we want to pursue is the FUSE client, right?

15:31.800 --> 15:48.520
So FUSE, I don't know how many of you are familiar with it, is basically a user-space interface for an abstract virtual file system, right? So you have the FUSE kernel driver,

15:49.400 --> 16:03.720
and all the callbacks for regular file system operations are delegated back to the user-space caller, right? So we have the user-space client, and we want to

16:04.680 --> 16:20.040
receive the bytes from the network, keep them inside the kernel, and then push them back into the FUSE client and back to the user. Today, the FUSE client will receive the bytes,

16:20.040 --> 16:34.680
copy them into user space, then copy them back into FUSE, and then maybe copy them back, I'm not sure about it, to the user again, right? So that's three copies.
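Putting the pieces together, here is a hedged reconstruction of the receive-and-forward loop the slide describes, reusing the placeholder zo_* declarations sketched earlier rather than the repo's exact identifiers:

/* Proxy loop: take a receive completion, forward the same kernel buffer
 * out another kernel socket, flush the ring. Only metadata (handle,
 * offset, length, type) ever crosses the kernel/user-space boundary. */
static void proxy_loop(struct zo_ctx *ctx, int out_sock)
{
    uint32_t buf_id, len;
    enum { ZO_MSG_DATA = 0 };           /* illustrative message type */

    while (zo_recv_completion(ctx, &buf_id, &len) == 0) {
        zo_send(ctx, out_sock, buf_id, 0 /* offset */, len, ZO_MSG_DATA);
        zo_flush(ctx);                  /* submit queued SQEs in one go */
        zo_buf_free(ctx, buf_id);       /* return the slot to the pool */
    }
}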
16:34.680 --> 16:49.320
With our solution, the initial copy is missing, obviously. For the path between the network and FUSE, there are a couple of ways to go about it; I'm torn between splice and other mechanisms, and it can be optimized.

16:50.280 --> 17:03.400
But once it's done, basically, we'll have data arriving, the logic of the FUSE client remaining in user space, and the actual bytes moving, via one of the available mechanisms,

17:04.200 --> 17:12.440
into the FUSE kernel driver, and then to the user, right? So it covers both send and receive,

17:12.520 --> 17:29.720
just, you know, the other way around. So basically, aside from FUSE providing the bytes to the user, there are no copies. Okay, I'll skip this: so, you have splice, or maybe eBPF, I don't know, I'm still thinking about it.

17:29.720 --> 17:43.640
Implementation status: as I mentioned, most of the things are done, and I need true zero copy; then we'll talk about FUSE integration, some performance benchmarks, proper demo support. Right now it's, you know, just vibing.

17:53.400 --> 18:00.600
What else do we need? So, library integration; a standalone demo is available right now.

18:01.320 --> 18:05.800
Again, you'll need to see whether what's up there actually compiles.

18:08.920 --> 18:22.520
Then we will need to integrate with FUSE, and use it for, you know, actual large operations, right. What else can we do? We have additional things in

18:22.520 --> 18:34.840
Ceph, or, for Ceph specifically, Ganesha, right. Ganesha is a user-space process that, on one hand, is a Ceph client, and on the other, an NFS exporter, right.

18:35.480 --> 18:40.120
But it doesn't actually need to touch any of the bytes that it's, you know, servicing.

18:41.400 --> 19:06.760
So having Ganesha use this kind of solution is ideal. It would just collect identifiers from the kernel, keep the buffers wherever they may be, depending on the caching algorithm or caching policy that Ganesha may have at hand at the moment, and then service them as many times as it needs to, right. So Ganesha is the ideal client, but it's kind of, you know, down the road, because there are many complex things there.

19:07.320 --> 19:21.400
There are also non-file-system use cases: CDNs, right. When you have a file stream, like a movie stream, right,

19:21.400 --> 19:35.080
you just receive it, and then whenever a client that's close to you needs to see whatever show they want to see, you know, it's serviced from the kernel.

19:37.160 --> 20:02.520
Or something like memcached, read-through, again, for the same use case: you receive whatever you want to service, but you keep it inside kernel memory, you keep the handle, and all the logic that you had remains unmodified. And that's kind of the general proxy pattern.

20:04.120 --> 20:18.520
So let's talk about the key takeaways. You have zero overhead; it's better than zero copy, right. None of the usual issues that we have today with zero-copy solutions are here.
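As a point of reference for the "available mechanisms" just mentioned, here is a generic, self-contained splice(2) sketch. It is not the project's FUSE path, only the kind of primitive it could build on, since splice moves bytes between file descriptors while keeping them in kernel buffers:

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Move up to len bytes from in_fd to out_fd through a pipe; the payload
 * stays in kernel buffers the whole way. A production version would
 * also drain the pipe if the second splice falls short. */
static ssize_t forward(int in_fd, int out_fd, size_t len)
{
    int p[2];
    ssize_t in, out, total = 0;

    if (pipe(p) < 0)
        return -1;

    while (total < (ssize_t)len) {
        in = splice(in_fd, NULL, p[1], NULL, len - total,
                    SPLICE_F_MOVE | SPLICE_F_MORE);
        if (in <= 0)
            break;
        out = splice(p[0], NULL, out_fd, NULL, in,
                     SPLICE_F_MOVE | SPLICE_F_MORE);
        if (out <= 0)
            break;
        total += out;
    }
    close(p[0]);
    close(p[1]);
    return total;
}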
20:18.520 --> 20:26.360
So it's zero, right. So: zero overhead, and a handle-based architecture. The reason why we

20:26.360 --> 20:41.960
chose this is that it's very unintrusive, right. It's a kernel module that can be built out of tree, right. It can be just our own implementation, so I don't really need to worry

20:42.040 --> 20:58.120
about upstreaming things, though, you know, it's in the plan. Proxy workloads benefit, like everything that I mentioned. And any feedback, any little bit of feedback, because, again, this is a work in progress, is welcome.

20:58.120 --> 21:15.320
About the 50 to 70 percent prediction: if you're going up to 100-gigabit NICs, yeah, you'll be spending about 50 to 70 percent of your CPU

21:15.320 --> 21:36.200
just moving bytes. If you go beyond, you know, 100-gigabit links, it only gets worse. And again, 100-gigabit links have been here for, you know, about 10 years now. So, questions. Thank you.

21:40.200 --> 21:58.040
Audience: [partly inaudible remark] ...and the second question I have: how is this Ceph-specific?

21:58.040 --> 22:16.120
Speaker: That's true. So, how is this Ceph-specific? Okay, good question; it's a double question, really. It's not, it's just an example; you can use anything else as well. Why does it have "Seph" in the name? Because I'm bad with names, and we can change it. There's nothing Ceph-specific in there, yeah.

22:16.120 --> 22:24.040
Audience: What's the difference between this and registered buffers in io_uring?

22:27.400 --> 22:44.600
[partly inaudible exchange about io_uring registered buffers and queueing multiple operations on the same buffer]

22:44.600 --> 22:51.600
Speaker: Okay, I'm not sure I can repeat that; let's take it offline later.

23:14.600 --> 23:39.200
Speaker: So the question is whether we might still want checksums, or in this case full data checksums.

23:39.240 --> 23:50.280
Well, the checksum validation is usually done by the NIC today. So what you get is what you get; you don't have additional validations on the CPU, right.

23:50.280 --> 24:03.060
We do have the peek logic, because you receive a stream of bytes, that's TCP, so you still need to understand how to divide it logically. So you do have peek, and

24:03.060 --> 24:10.380
peek means, yeah, we copy some of the bytes; it depends on you, you know, how many bytes you want.

24:10.380 --> 24:23.420
So for safety, for, you know, data integrity, making sure that no bits were flipped, a lot of that is done at the hardware level, right, both on the receive and the send side.

24:24.820 --> 24:29.820
Audience: [inaudible]
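To illustrate the peek logic described in that answer: only a fixed-size header crosses the boundary, just enough to slice the TCP byte stream, while the payload stays in the kernel. This reuses the hypothetical zo_peek() sketched earlier; the TLV layout is likewise illustrative.

#include <stdint.h>

struct tlv_hdr {
    uint16_t type;
    uint32_t length;    /* payload bytes that follow this header */
} __attribute__((packed));

/* Copy sizeof(struct tlv_hdr) bytes so that megabytes don't have to be. */
static int classify(struct zo_ctx *ctx, uint32_t buf_id, struct tlv_hdr *hdr)
{
    return zo_peek(ctx, buf_id, hdr, sizeof(*hdr));
}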
24:30.820 --> 24:32.820
Speaker: Okay.

24:45.820 --> 25:00.960
So the question is about RDMA. RDMA is nice. The problem with RDMA is twofold. One is that you need dedicated hardware, and you usually don't have it, right? Even for RoCE

25:01.300 --> 25:06.820
variants, you know, RDMA over Ethernet, most people won't have it, right?

25:07.460 --> 25:13.720
And the other is that it's kind of a competing solution, in a way; RDMA doesn't need this. And

25:15.420 --> 25:27.340
the only benefit over RDMA would be if your application actually does need to read the data, because RDMA writes into your user-space memory, and then you can access it.

25:32.780 --> 25:44.960
It's not a common API, and I don't want to, I don't know, implement IB verbs inside of this. This is distinct; it services a very specific

25:46.140 --> 26:01.580
kernel-proxy zero-copy solution, right? If you're not acting as a proxy, if you're not proxying some stream from one side to another, you can do, you know, something else.

26:03.660 --> 26:05.660
Okay, if we have time.

26:07.660 --> 26:15.580
Audience: [partly inaudible] If I have a user-space process talking to another user-space process, I want to ship data between them without copies...

26:18.380 --> 26:28.620
Speaker: No... but again, I'm not sure. So you're talking over Unix sockets, and you want to do zero copy with FUSE from user space, like...

26:30.860 --> 26:39.460
So again, FUSE, you know, is a user-space file system, not a network... not a network file system.

26:43.460 --> 26:47.460
Yeah, that's not... [inaudible].

26:49.140 --> 26:52.140
All right, we have one more question.

26:56.140 --> 27:17.860
Audience: Does the user space register the buffers? Speaker: Yes. So I am using io_uring just as the interface to communicate with the kernel, and then,

27:18.140 --> 27:25.300
for the user, there are multiple ways to go about it; io_uring just seemed the most suitable for it.

27:25.820 --> 27:29.340
Audience: Is that for the interface, for the communication, or for the data transfer?

27:35.180 --> 27:37.180
Speaker: Thank you, folks.