Hello everyone, my name is Wilson and I work for FuturFusion, doing mostly container stuff and Linux kernel work around that, together with Stéphane Graber.

Hello, so, Stéphane Graber, I'm the CTO of FuturFusion. I do some other stuff, but I'm indeed also the project leader for LinuxContainers and one of the maintainers of Incus, alongside Christian, Max, and a bunch of other folks. And we're going to be talking about resource management in containers.

So first of all, let's start by quickly looking at the ways to limit resources on Linux. We have plenty of ways. First of all, people can configure the Linux kernel through kernel parameters, for example maxcpus= to limit the number of CPUs that can be used. Then sysctls can be used to configure things, there are syscalls like sched_setaffinity, and obviously ulimits and cgroups; for containers, cgroups is probably the main one. At the same time, the kernel doesn't provide any clear way to tell user space things like, for example, how many CPUs I can use right now. User-space application developers need to understand a lot about how things work internally to find an answer to questions like that.

And in system containers, we take care of that. With LXC, we try to make any application run inside the container and feel like it has access to the full host, right? And we obviously need to emulate or virtualize a lot of things: we need to make sure that when the user issues a command like uptime or free or htop, the right numbers appear. And it can be a really, really complex problem, sometimes even an unsolvable one, because we don't have enough kernel APIs; in that case we need to develop those APIs ourselves. Here is the list of things, not a full one of course, that we need to virtualize in LXC to make system containers work properly.

And how does it work? Really quickly: an application runs inside the container. Of course, "container" is a philosophical term, because nobody really knows what a container is, right? But in the case of LXC it's well defined: we use all the namespaces, we have cgroups. Now let's say a process inside the container wants to read some file, say /proc/uptime. What we do right now is over-mount /proc/uptime with a file from a specially crafted synthetic filesystem called LXCFS, which is a FUSE-based filesystem. It's managed by the FUSE daemon, a tool that runs in user space, gets requests from the kernel, performs some actions on each request depending on which file it is, and provides the answer to the request.
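To make the over-mounting concrete: a minimal sketch, assuming LXCFS is already running with its usual /var/lib/lxcfs mount point, of the bind mount that puts the FUSE-backed file on top of the kernel one (LXC performs an equivalent mount for each virtualized file when setting up a container):

```c
/* Minimal sketch of the over-mounting described above, assuming LXCFS is
 * running and exposing its FUSE-backed files under /var/lib/lxcfs.
 * After this bind mount, any read of /proc/uptime in this mount namespace
 * is served by the user-space FUSE daemon instead of the kernel. */
#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
    if (mount("/var/lib/lxcfs/proc/uptime", "/proc/uptime",
              NULL, MS_BIND, NULL) < 0) {
        perror("mount");
        return 1;
    }
    return 0;
}
```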
So basically, the key point here is that we over-mount every single file or directory that we need to virtualize.

And internally, on the LXCFS FUSE daemon side, we usually need to follow this pattern. First, we need to determine which container the process belongs to, and that can be challenging because, again, there is no such thing as a container from the kernel's standpoint, right? So for us it usually means that we need to determine the init process of the container. We assume that PID namespaces are used, of course, and we use some trickery to figure out who the init process of that container is, meaning PID 1 of its PID namespace. We do this using a unix socket and the socket control message called SCM_CREDENTIALS. Basically, we create a unix socket pair, we fork a process, we enter the PID namespace of the process that issued the FUSE request, and then we send a specially crafted SCM message with PID number 1. And here we use what the kernel does for us: the kernel does proper PID translation. If you have a unix socket pair where one side lives in one PID namespace and the other side lives in another, and you send a PID through a socket control message, we know the kernel will translate that PID for us. So we use this to figure out which PID a process has on the host when that process is PID number 1 inside the container's PID namespace. That's how we solve the first part.

Then we have another, much more complex part, because it's not always even possible to figure out which effective limits a process has; we need to calculate that. Sometimes it's trivial: in the case of uptime, obviously, what we do is go to PID number 1 in that container and figure out for how long that process has been running, and that is the uptime of the container. But for some things, like memory, it can be really non-trivial, because depending on the cgroup version you rely on there can be no direct mapping: in cgroups we have soft limits, hard limits, and other things, and it's a really complex area.

And then basically, once we have figured all of that out, we can build the proper output, because all the files we emulate are textual-format files, and we just give the reply back to the kernel, and the kernel can answer the user-space request and unblock the process.
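Condensed into code, the trick looks roughly like this. This is a sketch of the approach rather than LXCFS's actual implementation: it assumes root privileges (sending a ucred with a PID other than your own needs CAP_SYS_ADMIN), omits all error handling, and the names `init_pid_on_host` and `target_pid` are illustrative.

```c
/* Sketch: find the host PID of a container's init, given `target_pid`,
 * a host PID known to live inside that container (e.g. the FUSE requester). */
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <unistd.h>

pid_t init_pid_on_host(pid_t target_pid)
{
    int sv[2], on = 1;
    socketpair(AF_UNIX, SOCK_DGRAM, 0, sv);
    /* Ask the kernel to attach (translated) credentials on receive. */
    setsockopt(sv[0], SOL_SOCKET, SO_PASSCRED, &on, sizeof(on));

    pid_t child = fork();
    if (child == 0) {
        char path[64];
        snprintf(path, sizeof(path), "/proc/%d/ns/pid", (int)target_pid);
        int nsfd = open(path, O_RDONLY);
        setns(nsfd, CLONE_NEWPID);
        /* setns(CLONE_NEWPID) only applies to children, so fork again. */
        if (fork() == 0) {
            struct ucred cred = { .pid = 1, .uid = 0, .gid = 0 };
            char c = 0, ctl[CMSG_SPACE(sizeof(cred))];
            struct iovec iov = { &c, 1 };
            struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                                  .msg_control = ctl,
                                  .msg_controllen = sizeof(ctl) };
            struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);
            cm->cmsg_level = SOL_SOCKET;
            cm->cmsg_type = SCM_CREDENTIALS;
            cm->cmsg_len = CMSG_LEN(sizeof(cred));
            memcpy(CMSG_DATA(cm), &cred, sizeof(cred));
            /* "PID 1" is translated by the kernel between namespaces. */
            sendmsg(sv[1], &msg, 0);
            _exit(0);
        }
        wait(NULL);
        _exit(0);
    }

    char c, ctl[CMSG_SPACE(sizeof(struct ucred))];
    struct iovec iov = { &c, 1 };
    struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                          .msg_control = ctl, .msg_controllen = sizeof(ctl) };
    recvmsg(sv[0], &msg, 0);
    waitpid(child, NULL, 0);

    struct ucred cred;
    memcpy(&cred, CMSG_DATA(CMSG_FIRSTHDR(&msg)), sizeof(cred));
    return cred.pid; /* host-side view of the container's PID 1 */
}
```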
But what I covered is really the basics, and we have seen some difficulties over time. For a long time, for example, /proc/diskstats was a problem, because it's also a complex thing: we have no device namespaces, and it's hard to formally define which block devices are used by a container and which are not, because there is no such concept in the kernel, right? And sometimes people struggle to see what they expect with regards to swap utilization inside a system container. Sometimes people complain about htop not showing the right, expected values when they run a stress load: for example, if you run a CPU stress load on the host, then htop inside the container will show you that all the CPUs are busy, while inside the container nothing is busy at the same time. But that's not something we can really fix just in user space; we obviously need some kernel support for it. And the problem is that in cgroup v1 we had a special controller called cpuacct, which allowed accounting the CPU time spent, but in cgroup v2, unfortunately, we don't have any analog for that, probably for performance reasons, because it's kind of a costly thing.

Another challenge is that not everything goes through procfs, right? There are syscalls that allow user space to retrieve information. The easiest example is probably the sysinfo syscall: it retrieves a bunch of information like the load average, uptime, total and free memory, and other things, and we cannot really hijack that using LXCFS. And there are also other sources of information, like sysfs, especially with regards to information about CPU topology, the number of CPUs, and CPU states, for example in the case of CPU hotplug, right?
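For reference, that topology information is exposed as plain text in sysfs; the kernel tracks several CPU masks (online, present, possible), and with hotplug they need not match. A small sketch that prints two of them:

```c
/* Print two of the kernel's CPU masks from sysfs. On most laptops they
 * are identical; on a VM with empty hotplug slots the possible set can
 * be larger than the present one. */
#include <stdio.h>

static void show(const char *path)
{
    char line[256];
    FILE *f = fopen(path, "r");
    if (f && fgets(line, sizeof(line), f))
        printf("%-38s %s", path, line);
    if (f)
        fclose(f);
}

int main(void)
{
    show("/sys/devices/system/cpu/present");  /* e.g. "0-2"  */
    show("/sys/devices/system/cpu/possible"); /* e.g. "0-11" */
    return 0;
}
```

On a laptop both lines typically show the same range; the mismatch is exactly what the following story is about.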
And just recently, maybe a year ago, we received a real-world use case, not a synthetic one. It was inside a VM; I don't remember, maybe it was an OpenStack VM, I'm not sure, but anyway, it was a VM, and the person was not happy about seeing really weird results in a monitoring tool. The monitoring tool was showing really weird values for how many CPUs the container had, because the container had just one CPU but this thing was showing much more. After analyzing it, I figured out that the problem was that the VM inside which this container was deployed had something like 12 CPU slots while only 3 CPUs were present. And the interesting thing was that the kernel was handling this in a really funny way: the number of CPUs a user might see in sysfs was less than the number of CPUs in the cgroup v1 cpuacct file. In the cpuacct file there were 12 CPUs, while in sysfs there were only 3 CPUs. And I was like, OK, that's weird, let's go look at the kernel code, and there it was kind of obvious: cpuacct was using the for_each_possible_cpu macro, the kernel iterator over the possible CPUs, which is the larger set, while sysfs was using for_each_present_cpu. So again, if you're just running stuff on your laptop, you probably have all of these sets equal and you probably won't face this, but in some cases, especially on servers, we do see things like that. That's it, I will just give the microphone to Stéphane.

Hello, it's me again. All right, so I just wanted to show the current state of things, and then we can talk a little bit about the future, depending on how much time we have. So let me just get a quick terminal up somewhere. OK, that's here.

All right, so if we're just looking at the basics: we get ourselves a Debian 13 container and we give it two CPUs, and then we do the same thing with Alpine. We go into the Debian one: top and free are both happy, they're both using the /proc files, so LXCFS is doing its job, everyone's happy. Let's go look at the other one. Oops, there's no bash in Alpine. free says 32 gigs of RAM, but /proc says something else; one of them is wrong, so effectively we get some inconsistent information there. And the reason is that some of the tools in Alpine are now using the sysinfo syscall instead of reading the /proc files. Thankfully, in Incus we have a workaround for that: we can set security.syscalls.intercept.sysinfo on the container. Let me restart this thing... and we're back to normal.
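For reference, this is the interface those tools hit; BusyBox applets like free typically call sysinfo(2) directly, and since the numbers come straight from the kernel, nothing LXCFS mounts over /proc can change them. A minimal sketch:

```c
/* sysinfo(2) returns host-wide numbers directly from the kernel,
 * bypassing the LXCFS over-mounts in /proc entirely, which is why
 * Incus resorts to seccomp-based syscall interception for it. */
#include <stdio.h>
#include <sys/sysinfo.h>

int main(void)
{
    struct sysinfo si;
    if (sysinfo(&si) < 0) {
        perror("sysinfo");
        return 1;
    }
    /* load averages are fixed-point, scaled by 2^16 = 65536 */
    printf("uptime:   %ld s\n", (long)si.uptime);
    printf("load1:    %.2f\n", si.loads[0] / 65536.0);
    printf("totalram: %lu MiB\n",
           (unsigned long)(si.totalram * si.mem_unit >> 20));
    printf("procs:    %u\n", si.procs);
    return 0;
}
```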
But now it means we've got two completely different systems that need to agree with each other: we've got the sysinfo interception code inside of Incus that needs to catch every single call to sysinfo and then go and get the data and return the right thing, as well as LXCFS also doing all of the procfs masking for all of that information. That's duplicated logic; we can share some amount of code here and there, but it's kind of annoying. It's also basically a losing battle: we know that there are more and more of those kinds of things that are going to keep popping up, and there is no real appetite in the Linux kernel community for having the /proc or sysinfo view respect cgroups or respect other kinds of limits. That's just not something that anyone wants to merge. So we need to look at what we can do about this stuff, and it's basically what I've been calling the ten-year plan; and that's not because it's going to take us ten years to implement, it's going to take ten years to get a big amount of adoption of this thing.

So basically, what we need is to get to the point where user space no longer parses /proc to figure out its resource limits, or no longer really tries to use that info to get its resource limits. Instead, the rough idea is to have probably a Rust crate, with then a C API library, as well as a very small daemon, that all provide the same information. So if you're written in Rust, you can use the crate to get a very reliable answer as to what your resource limits actually look like, as well as some other non-ambient system information, things like LSMs. And if Rust isn't for you but you're in some dynamically linked language running on the system, you can link to the library and get the same information. In both of those cases you can query about either yourself or any other process, because you're typically doing that query in-process, so if there are any security checks or whatever, they will apply properly. And then there's option number three, which is for what happens if you're dealing with cross-compiled static binaries, Go binaries, that kind of thing, where it's not really practical to depend on a shared library. There the idea is to effectively have a socket-activated process running on the host system, to which you can open a unix socket and then make the same query over that; we get the SCM credentials on the other side and can figure out who the caller process is and where they are, and then return the right response. In this scenario, because you're effectively doing your queries through a third-party process, you would only be able to do it against yourself: you only get your own resource information, and you can't use it to try and trick the system into getting information about processes that you're not otherwise allowed to access.
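To make that concrete, here is a rough sketch, under many simplifying assumptions, of the kind of computation such a crate or library (which doesn't exist yet) would hide from applications: deriving an effective CPU count from the affinity mask plus the cgroup v2 cpu.max quota. The cgroup path is hardcoded here; a real implementation would resolve it from /proc/self/cgroup and walk up the hierarchy.

```c
/* Sketch: an "effective CPU count" combining the scheduler affinity
 * mask with the cgroup v2 CPU quota. Many real-world details (cgroup
 * v1, nested limits, cpusets) are deliberately ignored. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;
    sched_getaffinity(0, sizeof(set), &set);
    double cpus = CPU_COUNT(&set);

    /* cgroup v2 cpu.max contains "max <period>" or "<quota> <period>" */
    long quota, period;
    FILE *f = fopen("/sys/fs/cgroup/cpu.max", "r");
    if (f && fscanf(f, "%ld %ld", &quota, &period) == 2 && period > 0) {
        double limit = (double)quota / period;
        if (limit < cpus)
            cpus = limit;
    }
    if (f)
        fclose(f);

    printf("effective CPUs: %.2f\n", cpus);
    return 0;
}
```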
Well, that's kind of the rough idea. We've been talking about that before; this is the second time we talk about it. The first time around it was a proposal from Netflix, whose name escapes me, which didn't really go anywhere, but it was roughly the same concept. We might try and do a bit more work on that these days: at FuturFusion we've recently got Alex on board and we've got a bit of time to do some low-level work, so we might try and get it done. LXCFS would be an obvious thing to port over to that, as a way to make sure that all of those kinds of APIs are correctly supported, and then we'd probably start pushing for specific runtimes, probably Go and Java: hey, instead of doing your own thing, which is basically always wrong right now (every time you run Java in a container it's getting slightly wrong data), use this library or use this daemon to get that data if it's available, and see if things get better. And if they do, then hopefully we can slowly get more and more of the programming languages and runtimes to handle that. That will still leave us with all of the other tools, you know, top, htop, ps, whatever; all of those will still need to be ported over, which is why it's going to take a long, long time. We need to find the best way to get a bit of adoption and try to get the ball rolling, and hopefully it snowballs and more and more folks start using this. Nobody likes parsing /proc: it's really tricky, it's often wrong, so I think people would actually welcome a clean library for this kind of stuff.

Another point, and thankfully the person is not in the room, because otherwise I would have been asked: hey, what about using the systemd API to figure that stuff out? And the answer to that is: yeah, that works great if you're on a system with systemd, running systemd things. But if you are, like, a static binary running inside of an OCI container, where you're actually the only binary on disk, and you want to know about your limits, you're not going to have a systemd to talk to. You don't have an init system; you're just a single binary running in a container. So in that scenario, being able to talk to that unix socket, which would just be re-exposed by Docker, re-exposed by anything else, will get you the information you want. Or, if you're written in a language that makes it easy, you can do the query yourself by using either the Rust crate or the library. So that's kind of the idea.
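And a purely hypothetical sketch of the daemon end of that unix socket; the path /run/resinfo.sock and the reply format are invented here, and a real service would return actual resource limits rather than echo identity. The important property is SO_PEERCRED: the caller's identity comes from the kernel, already translated into the daemon's namespaces, so a client can only ever ask about itself.

```c
/* Hypothetical host daemon answering resource queries over a unix
 * socket. Socket path and wire format are invented for illustration. */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

int main(void)
{
    int srv = socket(AF_UNIX, SOCK_STREAM, 0);
    struct sockaddr_un addr = { .sun_family = AF_UNIX };
    strncpy(addr.sun_path, "/run/resinfo.sock", sizeof(addr.sun_path) - 1);
    unlink(addr.sun_path);
    bind(srv, (struct sockaddr *)&addr, sizeof(addr));
    listen(srv, 8);

    for (;;) {
        int c = accept(srv, NULL, NULL);
        struct ucred peer;
        socklen_t len = sizeof(peer);
        /* Kernel-provided, namespace-translated caller identity. */
        getsockopt(c, SOL_SOCKET, SO_PEERCRED, &peer, &len);
        /* A real daemon would look up limits for peer.pid here. */
        char reply[64];
        int n = snprintf(reply, sizeof(reply), "pid=%d uid=%u\n",
                         (int)peer.pid, (unsigned)peer.uid);
        write(c, reply, n);
        close(c);
    }
}
```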
We're getting pretty close to out of time, so we can maybe try one question. Yeah? (Question from the audience, off-microphone.)

Sorry? Ah, it's on now. "Do you have any plans about dealing with LXCFS crashes?"

Oh, LXCFS crashes, yeah. So yeah, occasionally LXCFS crashes, and we are looking at the crashes these days. The main difference is that Alex used to work for Canonical and had maybe five minutes a month to look at those; he's now working for me and is going to get a bit more time to look at them. And we've had some ideas to handle LXCFS crashes better, with things like effectively being able to handle crashes and re-attach the FUSE filesystem. I think you did some work on that, right? Effectively supporting some kind of resume of FUSE, so that if there's a crash you can still keep the mounts as they are, restart LXCFS, and have it re-attach all of the mounts, so you don't need to restart all the containers, at least. We've done some work towards that, which will make things better. All right, we're done.