Hello everyone, my name is Wilson and I work for FuturFusion, doing mostly container stuff and Linux kernel work around that, together with Stéphane Graber.

Hello, so, Stéphane Graber, I'm the CTO of FuturFusion. I do some other stuff, but I'm indeed also the project leader for LinuxContainers and one of the maintainers of Incus, alongside Christian, Max, and a bunch of other folks. And we're going to be talking about resource management in containers.

So first of all, let's start by quickly looking at the ways to limit resources on Linux. We have plenty of ways. First of all, people can configure the Linux kernel through kernel parameters, for example maxcpus= to limit the number of CPUs that can be used. Then sysctls can be used to configure things, there are syscalls like sched_setaffinity, and obviously ulimits and cgroups; for containers, cgroups is probably the main one. At the same time, the kernel doesn't provide any clear way to tell user space things like, for example, how many CPUs I can use right now. User-space application developers need to understand a lot about how things work internally to find an answer to questions like that.

And in system containers, we take care of that. With LXC, we try to make any application run inside the container and feel like it has access to the full host, right? And we obviously need to emulate or virtualize a lot of things: we need to make sure that when the user issues a command like uptime or free or htop, the right numbers appear. And it can be a really, really complex problem, sometimes even an unsolvable one, because we don't have enough kernel APIs; in that case we need to develop those APIs ourselves. Here is the list of things, not a full one of course, that we need to virtualize in LXC to make system containers work properly.

And how does it work? Really quickly: an application runs inside the container. Of course, "container" is a philosophical term, because nobody really knows what a container is, right? But in the case of LXC it's well defined: we use all the namespaces, we have cgroups. Now let's say a process inside the container wants to read some file, say /proc/uptime. What we do right now is over-mount /proc/uptime with a file from a specially crafted synthetic filesystem called LXCFS, which is a FUSE-based filesystem. It's managed by the FUSE daemon, a tool that runs in user space, gets requests from the kernel, performs some actions on each request depending on which file it is, and provides the answer to the request.
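To make the over-mounting concrete: a minimal sketch, assuming LXCFS is already running with its usual /var/lib/lxcfs mount point, of the bind mount that puts the FUSE-backed file on top of the kernel one (LXC performs an equivalent mount for each virtualized file when setting up a container):

```c
/* Minimal sketch of the over-mounting described above, assuming LXCFS is
 * running and exposing its FUSE-backed files under /var/lib/lxcfs.
 * After this bind mount, any read of /proc/uptime in this mount namespace
 * is served by the user-space FUSE daemon instead of the kernel. */
#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
    if (mount("/var/lib/lxcfs/proc/uptime", "/proc/uptime",
              NULL, MS_BIND, NULL) < 0) {
        perror("mount");
        return 1;
    }
    return 0;
}
```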
So basically, the key point here is that we over-mount every single file or directory that we need to virtualize.

And internally, on the LXCFS FUSE daemon side, we usually need to follow this pattern. First, we need to determine which container the process belongs to, and that can be challenging because, again, there is no such thing as a container from the kernel's standpoint, right? So for us it usually means that we need to determine the init process of the container. We assume that PID namespaces are used, of course, and we use some trickery to figure out who the init process of that container is, meaning PID 1 of its PID namespace. We do this using a unix socket and the socket control message called SCM_CREDENTIALS. Basically, we create a unix socket pair, we fork a process, we enter the PID namespace of the process that issued the FUSE request, and then we send a specially crafted SCM message with PID number 1. And here we use what the kernel does for us: the kernel does proper PID translation. If you have a unix socket pair where one side lives in one PID namespace and the other side lives in another, and you send a PID through a socket control message, we know the kernel will translate that PID for us. So we use this to figure out which PID a process has on the host when that process is PID number 1 inside the container's PID namespace. That's how we solve the first part.

Then we have another, much more complex part, because it's not always even possible to figure out which effective limits a process has; we need to calculate that. Sometimes it's trivial: in the case of uptime, obviously, what we do is go to PID number 1 in that container and figure out for how long that process has been running, and that is the uptime of the container. But for some things, like memory, it can be really non-trivial, because depending on the cgroup version you rely on there can be no direct mapping: in cgroups we have soft limits, hard limits, and other things, and it's a really complex area.

And then basically, once we have figured all of that out, we can build the proper output, because all the files we emulate are textual-format files, and we just give the reply back to the kernel, and the kernel can answer the user-space request and unblock the process.
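Condensed into code, the trick looks roughly like this. This is a sketch of the approach rather than LXCFS's actual implementation: it assumes root privileges (sending a ucred with a PID other than your own needs CAP_SYS_ADMIN), omits all error handling, and the names `init_pid_on_host` and `target_pid` are illustrative.

```c
/* Sketch: find the host PID of a container's init, given `target_pid`,
 * a host PID known to live inside that container (e.g. the FUSE requester). */
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <unistd.h>

pid_t init_pid_on_host(pid_t target_pid)
{
    int sv[2], on = 1;
    socketpair(AF_UNIX, SOCK_DGRAM, 0, sv);
    /* Ask the kernel to attach (translated) credentials on receive. */
    setsockopt(sv[0], SOL_SOCKET, SO_PASSCRED, &on, sizeof(on));

    pid_t child = fork();
    if (child == 0) {
        char path[64];
        snprintf(path, sizeof(path), "/proc/%d/ns/pid", (int)target_pid);
        int nsfd = open(path, O_RDONLY);
        setns(nsfd, CLONE_NEWPID);
        /* setns(CLONE_NEWPID) only applies to children, so fork again. */
        if (fork() == 0) {
            struct ucred cred = { .pid = 1, .uid = 0, .gid = 0 };
            char c = 0, ctl[CMSG_SPACE(sizeof(cred))];
            struct iovec iov = { &c, 1 };
            struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                                  .msg_control = ctl,
                                  .msg_controllen = sizeof(ctl) };
            struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);
            cm->cmsg_level = SOL_SOCKET;
            cm->cmsg_type = SCM_CREDENTIALS;
            cm->cmsg_len = CMSG_LEN(sizeof(cred));
            memcpy(CMSG_DATA(cm), &cred, sizeof(cred));
            /* "PID 1" is translated by the kernel between namespaces. */
            sendmsg(sv[1], &msg, 0);
            _exit(0);
        }
        wait(NULL);
        _exit(0);
    }

    char c, ctl[CMSG_SPACE(sizeof(struct ucred))];
    struct iovec iov = { &c, 1 };
    struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                          .msg_control = ctl, .msg_controllen = sizeof(ctl) };
    recvmsg(sv[0], &msg, 0);
    waitpid(child, NULL, 0);

    struct ucred cred;
    memcpy(&cred, CMSG_DATA(CMSG_FIRSTHDR(&msg)), sizeof(cred));
    return cred.pid; /* host-side view of the container's PID 1 */
}
```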
But what I covered is really the basics, and we have seen some difficulties over time. For a long time, for example, /proc/diskstats was a problem, because it's also a complex thing: we have no device namespaces, and it's hard to formally define which block devices are used by a container and which are not, because there is no such concept in the kernel, right? And sometimes people struggle to see what they expect with regards to swap utilization inside a system container. Sometimes people complain about htop not showing the right, expected values when they run a stress load: for example, if you run a CPU stress load on the host, then htop inside the container will show you that all the CPUs are busy, while inside the container nothing is busy at the same time. But that's not something we can really fix just in user space; we obviously need some kernel support for it. And the problem is that in cgroup v1 we had a special controller called cpuacct, which allowed accounting the CPU time spent, but in cgroup v2, unfortunately, we don't have any analog for that, probably for performance reasons, because it's kind of a costly thing.

Another challenge is that not everything goes through procfs, right? There are syscalls that allow user space to retrieve information. The easiest example is probably the sysinfo syscall: it retrieves a bunch of information like the load average, uptime, total and free memory, and other things, and we cannot really hijack that using LXCFS. And there are also other sources of information, like sysfs, especially with regards to information about CPU topology, the number of CPUs, and CPU states, for example in the case of CPU hotplug, right?
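For reference, that topology information is exposed as plain text in sysfs; the kernel tracks several CPU masks (online, present, possible), and with hotplug they need not match. A small sketch that prints two of them:

```c
/* Print two of the kernel's CPU masks from sysfs. On most laptops they
 * are identical; on a VM with empty hotplug slots the possible set can
 * be larger than the present one. */
#include <stdio.h>

static void show(const char *path)
{
    char line[256];
    FILE *f = fopen(path, "r");
    if (f && fgets(line, sizeof(line), f))
        printf("%-38s %s", path, line);
    if (f)
        fclose(f);
}

int main(void)
{
    show("/sys/devices/system/cpu/present");  /* e.g. "0-2"  */
    show("/sys/devices/system/cpu/possible"); /* e.g. "0-11" */
    return 0;
}
```

On a laptop both lines typically show the same range; the mismatch is exactly what the following story is about.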
And just recently, maybe a year ago, we received a real-world use case, not a synthetic one. It was inside a VM; I don't remember, maybe it was an OpenStack VM, I'm not sure, but anyway, it was a VM, and the person was not happy about seeing really weird results in a monitoring tool. The monitoring tool was showing really weird values for how many CPUs the container had, because the container had just one CPU but this thing was showing much more. After analyzing it, I figured out that the problem was that the VM inside which this container was deployed had something like 12 CPU slots while only 3 CPUs were present. And the interesting thing was that the kernel was handling this in a really funny way: the number of CPUs a user might see in sysfs was less than the number of CPUs in the cgroup v1 cpuacct file. In the cpuacct file there were 12 CPUs, while in sysfs there were only 3 CPUs. And I was like, OK, that's weird, let's go look at the kernel code, and there it was kind of obvious: cpuacct was using the for_each_possible_cpu macro, the kernel iterator over the possible CPUs, which is the larger set, while sysfs was using for_each_present_cpu. So again, if you're just running stuff on your laptop, you probably have all of these sets equal and you probably won't face this, but in some cases, especially on servers, we do see things like that. That's it, I will just give the microphone to Stéphane.

Hello, it's me again. All right, so I just wanted to show the current state of things, and then we can talk a little bit about the future, depending on how much time we have. So let me just get a quick terminal up somewhere. OK, that's here.

All right, so if we're just looking at the basics: we get ourselves a Debian 13 container and we give it two CPUs, and then we do the same thing with Alpine. We go into the Debian one: top and free are both happy, they're both using the /proc files, so LXCFS is doing its job, everyone's happy. Let's go look at the other one. Oops, there's no bash in Alpine. free says 32 gigs of RAM, but /proc says something else; one of them is wrong, so effectively we get some inconsistent information there. And the reason is that some of the tools in Alpine are now using the sysinfo syscall instead of reading the /proc files. Thankfully, in Incus we have a workaround for that: we can set security.syscalls.intercept.sysinfo on the container. Let me restart this thing... and we're back to normal.
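For reference, this is the interface those tools hit; BusyBox applets like free typically call sysinfo(2) directly, and since the numbers come straight from the kernel, nothing LXCFS mounts over /proc can change them. A minimal sketch:

```c
/* sysinfo(2) returns host-wide numbers directly from the kernel,
 * bypassing the LXCFS over-mounts in /proc entirely, which is why
 * Incus resorts to seccomp-based syscall interception for it. */
#include <stdio.h>
#include <sys/sysinfo.h>

int main(void)
{
    struct sysinfo si;
    if (sysinfo(&si) < 0) {
        perror("sysinfo");
        return 1;
    }
    /* load averages are fixed-point, scaled by 2^16 = 65536 */
    printf("uptime:   %ld s\n", (long)si.uptime);
    printf("load1:    %.2f\n", si.loads[0] / 65536.0);
    printf("totalram: %lu MiB\n",
           (unsigned long)(si.totalram * si.mem_unit >> 20));
    printf("procs:    %u\n", si.procs);
    return 0;
}
```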
But now it means we've got two completely different systems that need to agree with each other: we've got the sysinfo interception code inside of Incus that needs to catch every single call to sysinfo and then go and get the data and return the right thing, as well as LXCFS also doing all of the procfs masking for all of that information. That's duplicated logic; we can share some amount of code here and there, but it's kind of annoying. It's also basically a losing battle: we know that there are more and more of those kinds of things that are going to keep popping up, and there is no real appetite in the Linux kernel community for having the /proc or sysinfo view respect cgroups or respect other kinds of limits. That's just not something that anyone wants to merge. So we need to look at what we can do about this stuff, and it's basically what I've been calling the ten-year plan; and that's not because it's going to take us ten years to implement, it's going to take ten years to get a big amount of adoption of this thing.

So basically, what we need is to get to the point where user space no longer parses /proc to figure out its resource limits, or no longer really tries to use that info to get its resource limits. Instead, the rough idea is to have probably a Rust crate, with then a C API library, as well as a very small daemon, that all provide the same information. So if you're written in Rust, you can use the crate to get a very reliable answer as to what your resource limits actually look like, as well as some other non-ambient system information, things like LSMs. And if Rust isn't for you but you're in some dynamically linked language running on the system, you can link to the library and get the same information. In both of those cases you can query about either yourself or any other process, because you're typically doing that query in-process, so if there are any security checks or whatever, they will apply properly. And then there's option number three, which is for what happens if you're dealing with cross-compiled static binaries, Go binaries, that kind of thing, where it's not really practical to depend on a shared library. There the idea is to effectively have a socket-activated process running on the host system, to which you can open a unix socket and then make the same query over that; we get the SCM credentials on the other side and can figure out who the caller process is and where they are, and then return the right response. In this scenario, because you're effectively doing your queries through a third-party process, you would only be able to do it against yourself: you only get your own resource information, and you can't use it to try and trick the system into getting information about processes that you're not otherwise allowed to access.
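To make that concrete, here is a rough sketch, under many simplifying assumptions, of the kind of computation such a crate or library (which doesn't exist yet) would hide from applications: deriving an effective CPU count from the affinity mask plus the cgroup v2 cpu.max quota. The cgroup path is hardcoded here; a real implementation would resolve it from /proc/self/cgroup and walk up the hierarchy.

```c
/* Sketch: an "effective CPU count" combining the scheduler affinity
 * mask with the cgroup v2 CPU quota. Many real-world details (cgroup
 * v1, nested limits, cpusets) are deliberately ignored. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;
    sched_getaffinity(0, sizeof(set), &set);
    double cpus = CPU_COUNT(&set);

    /* cgroup v2 cpu.max contains "max <period>" or "<quota> <period>" */
    long quota, period;
    FILE *f = fopen("/sys/fs/cgroup/cpu.max", "r");
    if (f && fscanf(f, "%ld %ld", &quota, &period) == 2 && period > 0) {
        double limit = (double)quota / period;
        if (limit < cpus)
            cpus = limit;
    }
    if (f)
        fclose(f);

    printf("effective CPUs: %.2f\n", cpus);
    return 0;
}
```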
Well, that's kind of the rough idea. We've been talking about that before; this is the second time we talk about it. The first time around it was a proposal from Netflix, whose name escapes me, which didn't really go anywhere, but it was roughly the same concept. We might try and do a bit more work on that these days: at FuturFusion we've recently got Alex on board and we've got a bit of time to do some low-level work, so we might try and get it done. LXCFS would be an obvious thing to port over to that, as a way to make sure that all of those kinds of APIs are correctly supported, and then we'd probably start pushing for specific runtimes, probably Go and Java: hey, instead of doing your own thing, which is basically always wrong right now (every time you run Java in a container it's getting slightly wrong data), use this library or use this daemon to get that data if it's available, and see if things get better. And if they do, then hopefully we can slowly get more and more of the programming languages and runtimes to handle that. That will still leave us with all of the other tools, you know, top, htop, ps, whatever; all of those will still need to be ported over, which is why it's going to take a long, long time. We need to find the best way to get a bit of adoption and try to get the ball rolling, and hopefully it snowballs and more and more folks start using this. Nobody likes parsing /proc: it's really tricky, it's often wrong, so I think people would actually welcome a clean library for this kind of stuff.

Another point, and thankfully the person is not in the room, because otherwise I would have been asked: hey, what about using the systemd API to figure that stuff out? And the answer to that is: yeah, that works great if you're on a system with systemd, running systemd things. But if you are, like, a static binary running inside of an OCI container, where you're actually the only binary on disk, and you want to know about your limits, you're not going to have a systemd to talk to. You don't have an init system; you're just a single binary running in a container. So in that scenario, being able to talk to that unix socket, which would just be re-exposed by Docker, re-exposed by anything else, will get you the information you want. Or, if you're written in a language that makes it easy, you can do the query yourself by using either the Rust crate or the library. So that's kind of the idea.
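And a purely hypothetical sketch of the daemon end of that unix socket; the path /run/resinfo.sock and the reply format are invented here, and a real service would return actual resource limits rather than echo identity. The important property is SO_PEERCRED: the caller's identity comes from the kernel, already translated into the daemon's namespaces, so a client can only ever ask about itself.

```c
/* Hypothetical host daemon answering resource queries over a unix
 * socket. Socket path and wire format are invented for illustration. */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

int main(void)
{
    int srv = socket(AF_UNIX, SOCK_STREAM, 0);
    struct sockaddr_un addr = { .sun_family = AF_UNIX };
    strncpy(addr.sun_path, "/run/resinfo.sock", sizeof(addr.sun_path) - 1);
    unlink(addr.sun_path);
    bind(srv, (struct sockaddr *)&addr, sizeof(addr));
    listen(srv, 8);

    for (;;) {
        int c = accept(srv, NULL, NULL);
        struct ucred peer;
        socklen_t len = sizeof(peer);
        /* Kernel-provided, namespace-translated caller identity. */
        getsockopt(c, SOL_SOCKET, SO_PEERCRED, &peer, &len);
        /* A real daemon would look up limits for peer.pid here. */
        char reply[64];
        int n = snprintf(reply, sizeof(reply), "pid=%d uid=%u\n",
                         (int)peer.pid, (unsigned)peer.uid);
        write(c, reply, n);
        close(c);
    }
}
```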
We're getting pretty close to out of time, so we can maybe try one question. Yeah? (Question from the audience, off-microphone.)

Sorry? Ah, it's on now. "Do you have any plans about dealing with LXCFS crashes?"

Oh, LXCFS crashes, yeah. So yeah, occasionally LXCFS crashes, and we are looking at the crashes these days. The main difference is that Alex used to work for Canonical and had maybe five minutes a month to look at those; he's now working for me and is going to get a bit more time to look at them. And we've had some ideas to handle LXCFS crashes better, with things like effectively being able to handle crashes and re-attach the FUSE filesystem. I think you did some work on that, right? Effectively supporting some kind of resume of FUSE, so that if there's a crash you can still keep the mounts as they are, restart LXCFS, and have it re-attach all of the mounts, so you don't need to restart all the containers, at least. We've done some work towards that, which will make things better. All right, we're done.