WEBVTT 00:00.000 --> 00:16.000 I will talk about what I learned by writing my own container manager from scratch, going 00:16.000 --> 00:17.000 rootless. 00:17.000 --> 00:22.000 There are some lessons I learned that I want to share with you. 00:22.000 --> 00:27.000 I'm Luca Di Maio, 89luca89 on GitHub, and I'm a software engineer at 00:27.000 --> 00:31.000 Chainguard. I do lots of open source, I like it. 00:31.000 --> 00:35.000 These are my contacts on various things. 00:35.000 --> 00:40.000 So what's Lilipod? Lilipod is the container manager that I wrote, 00:40.000 --> 00:45.000 because I needed something for my other open source project that I'm 00:45.000 --> 00:51.000 spending time on, which is called Distrobox, where I needed a fallback solution 00:51.000 --> 00:55.000 for when a host doesn't have Podman or Docker for some reason. 00:55.000 --> 00:59.000 And I wanted something that was self-contained, so easy to install. 00:59.000 --> 01:04.000 Ideally, just a single binary without any other external dependencies, 01:04.000 --> 01:08.000 being lightweight in size. 01:08.000 --> 01:14.000 And as fast as possible. I didn't need everything that Podman and Docker do, 01:14.000 --> 01:22.000 so just the bare minimum that will let me boot up a simple container for my 01:22.000 --> 01:24.000 other project. 01:24.000 --> 01:28.000 And also I wanted to learn about containers and improve my Go. 01:28.000 --> 01:30.000 So that's what I wrote. 01:30.000 --> 01:33.000 So what are containers? 01:33.000 --> 01:37.000 Containers are a way to virtualize a system without 01:37.000 --> 01:43.000 resorting to emulation or virtualization of a whole OS. 01:43.000 --> 01:48.000 So in a VM, you have your host operating system running a 01:48.000 --> 01:54.000 hypervisor, where you run another operating system from the bootloader up, 01:54.000 --> 01:59.000 and then in that operating system, you run whatever service or application you need.
01:59.000 --> 02:03.000 With a container, you have your host operating system, 02:03.000 --> 02:11.000 and then you use your operating system's main kernel to isolate other 02:11.000 --> 02:12.000 rootfses. 02:12.000 --> 02:18.000 We will see what they are, but basically other file systems where you run your apps or services. 02:18.000 --> 02:24.000 As you see, it basically removes a whole layer 02:24.000 --> 02:27.000 between the host and the guest. 02:27.000 --> 02:33.000 That's a pro because then it's very fast, very light, and it's easy to 02:33.000 --> 02:36.000 dispose of containers, just scrap them, create them. 02:36.000 --> 02:39.000 Very easy. On the security side, 02:39.000 --> 02:44.000 you're sharing your kernel between your host and workloads. 02:44.000 --> 02:49.000 It has implications, but we don't care about that for now. 02:49.000 --> 02:52.000 There are some building blocks of containers. 02:52.000 --> 02:55.000 So you have a rootfs, or base file system, for the container. 02:55.000 --> 03:01.000 You have namespaces, which is how we separate it from the main operating system. 03:01.000 --> 03:07.000 Capabilities, which are what stuff inside the container can and cannot do. 03:07.000 --> 03:11.000 Cgroups, a way to control and separate resources. 03:11.000 --> 03:17.000 So I can say this container cannot take more RAM than a set value. 03:17.000 --> 03:20.000 Seccomp filters, even more sandboxing. 03:20.000 --> 03:26.000 We can filter out a set of syscalls that 03:26.000 --> 03:29.000 we can deny to certain workloads. 03:29.000 --> 03:34.000 And then integration with various system modules, 03:34.000 --> 03:36.000 SELinux, AppArmor, or whatever. 03:36.000 --> 03:39.000 The first building block we will see is the rootfs. 03:39.000 --> 03:44.000 It's the base file system that is used by a Linux userland.
03:44.000 --> 03:49.000 In the case of Lilipod, I wanted to tap into the OCI registries, 03:49.000 --> 03:54.000 because they are the most widespread and biggest ones, where I can find everything. 03:54.000 --> 03:59.000 There is Docker Hub, Quay.io, GHCR, blah blah blah. 03:59.000 --> 04:01.000 Many of them. 04:01.000 --> 04:06.000 When we query an OCI registry, what we get is a manifest. 04:06.000 --> 04:12.000 We can see it with a little docker manifest inspect of the Ubuntu and Nginx images. 04:12.000 --> 04:18.000 We have a set of layers as objects in this JSON. 04:18.000 --> 04:24.000 This set of layers gives you a way to download each layer. 04:24.000 --> 04:27.000 And layers are shipped as tarballs. 04:27.000 --> 04:32.000 And you will have the checksum of the layer. 04:32.000 --> 04:37.000 So you can always verify that you haven't downloaded a corrupted layer. 04:37.000 --> 04:40.000 This also makes it easy to do that. 04:40.000 --> 04:45.000 So what I did with Lilipod is to use Crane as a library. 04:45.000 --> 04:50.000 It's very handy to interface with OCI container registries. 04:50.000 --> 04:58.000 You can pull down the manifest, open it, and basically go and download all the layers. 04:58.000 --> 05:07.000 What I do is use the checksum as a way, obviously, to know that what I downloaded is not corrupt or whatever. 05:07.000 --> 05:12.000 But I also rename each layer to the checksum itself. 05:12.000 --> 05:20.000 So for example, if I have an Ubuntu image and an Ubuntu-based Nginx image, the base Ubuntu layer will basically be named the same. 05:20.000 --> 05:27.000 I will know it, and I can use something like hard links to deduplicate identical layers. 05:27.000 --> 05:31.000 So you have a storage advantage. 05:32.000 --> 05:35.000 So how can we use the rootfs? 05:35.000 --> 05:37.000 You can chroot into it. 05:37.000 --> 05:41.000 chroot is very old.
05:41.000 --> 05:48.000 It's a Unix syscall that lets you change the root file system for a given process. 05:48.000 --> 05:54.000 You can basically enter it, and it becomes the new root of that process. 05:54.000 --> 05:57.000 It's good for restricting file system access. 05:57.000 --> 06:02.000 I don't want, I don't know, this process to access my whole host file system. 06:02.000 --> 06:09.000 And it's useful to bring your own dependencies, libraries and stuff like that for a given process. 06:09.000 --> 06:14.000 So we go and chroot, but that doesn't work without root. 06:14.000 --> 06:17.000 Permission denied, operation not permitted. 06:17.000 --> 06:21.000 Because chroot needs us to be the root user. 06:21.000 --> 06:26.000 And for convenience, we also want to mount additional file systems. 06:26.000 --> 06:32.000 For example, a sysfs, a procfs, various tmpfs or stuff like that. 06:32.000 --> 06:35.000 Well, I want to mount something inside. 06:35.000 --> 06:43.000 What we can use to go rootless is the other building block of containers, 06:43.000 --> 06:45.000 which is namespaces. 06:45.000 --> 06:55.000 So namespaces are a technology provided by the kernel itself 06:55.000 --> 07:04.000 to have some sort of isolated view of resources only for a given process. 07:04.000 --> 07:09.000 There are various types of namespaces. 07:09.000 --> 07:12.000 And it's basically how containers contain. 07:12.000 --> 07:22.000 So we have the mount namespace, which basically gives the process a local copy of the mount tree of the file system. 07:22.000 --> 07:29.000 So it can be manipulated by the process without affecting the mount tree of the whole system, 07:29.000 --> 07:31.000 just for the process. 07:31.000 --> 07:33.000 The same goes for users. 07:33.000 --> 07:39.000 And UTS is for hostnames, then PID, IPC, network, and time namespaces, 07:39.000 --> 07:41.000 I think the newest one.
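As a rough illustration of the namespace types just listed, the sketch below maps each one to the CLONE_NEW* flag that clone(2) and unshare(2) use for it. It is written in Python purely for brevity; Lilipod itself is written in Go. The dictionary NAMESPACE_FLAGS and the helper clone_flags are names made up for this example, and the cgroup namespace is included for completeness even though the talk does not mention it. The flag values are the ones defined in the Linux headers.

```python
# Namespace types and the flag each one uses in clone(2)/unshare(2).
# Values are the CLONE_NEW* constants from <linux/sched.h>.
NAMESPACE_FLAGS = {
    "time":    0x00000080,  # CLONE_NEWTIME
    "mount":   0x00020000,  # CLONE_NEWNS
    "cgroup":  0x02000000,  # CLONE_NEWCGROUP (not covered in the talk)
    "uts":     0x04000000,  # CLONE_NEWUTS, hostname and domain name
    "ipc":     0x08000000,  # CLONE_NEWIPC
    "user":    0x10000000,  # CLONE_NEWUSER
    "pid":     0x20000000,  # CLONE_NEWPID
    "network": 0x40000000,  # CLONE_NEWNET
}


def clone_flags(namespaces):
    """Combine the flags for the requested namespaces into one bitmask.

    Hypothetical helper for illustration: the result is what would be passed
    to unshare(2) or clone(2) to get a private copy of each listed namespace.
    """
    flags = 0
    for name in namespaces:
        flags |= NAMESPACE_FLAGS[name]
    return flags
```

For the rootless chroot case discussed next, only the user and mount namespaces are needed, so the mask would be `clone_flags(["user", "mount"])`.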
07:41.000 --> 07:49.000 So what we can do is call the unshare syscall to fork the process in a new namespace. 07:49.000 --> 07:56.000 In the case of chroot, we need the user namespace and the mount namespace. 07:56.000 --> 08:06.000 And in that namespace, the process that you are launching is able to change the mount tree and the user tree, 08:06.000 --> 08:11.000 because it's just a local modification that doesn't affect the rest of the system. 08:11.000 --> 08:13.000 This is a little example. 08:13.000 --> 08:20.000 When we unshare or when we clone, we can, in the new namespace, map the user to something else. 08:20.000 --> 08:31.000 Like, Alice can become UID 96, www-data, or Bob can become foo, number 1000, and stuff like that. 08:31.000 --> 08:38.000 For chroot, we just need Alice to become root. 08:39.000 --> 08:45.000 We unshare the mount namespace and the user namespace, and map ourselves to the root user. 08:45.000 --> 08:49.000 We chroot to our, sorry, to our rootfs. 08:49.000 --> 08:58.000 And success, we are in our new, very, very rudimentary and simple container. 08:58.000 --> 09:08.000 In Lilipod, we are using the SysProcAttr to unshare the various namespaces, and because it's configurable, 09:08.000 --> 09:11.000 we can share some and unshare others. 09:11.000 --> 09:19.000 And you will see here that I'm not using chroot, but pivot_root, which is another syscall 09:19.000 --> 09:26.000 with a similar purpose, because chroot can be escaped easily. 09:27.000 --> 09:33.000 With that simple line of code, you can basically escape any chroot. 09:33.000 --> 09:39.000 Because the mount tree inside the chroot does not change.
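The rootless chroot sequence described above (unshare the user and mount namespaces, map ourselves to root, then chroot) can be sketched as follows. This is a minimal sketch in Python rather than the Go code Lilipod uses, and it rests on some assumptions: Linux, Python 3.12 or newer (for os.unshare), and a kernel that allows unprivileged user namespaces. The function names single_id_map and rootless_chroot are hypothetical.

```python
import os


def single_id_map(host_id):
    """Build a one-line ID map: ID 0 inside the namespace is host_id outside.

    The format is "<inside start> <outside start> <count>".
    """
    return f"0 {host_id} 1\n"


def rootless_chroot(rootfs):
    """Hypothetical helper: become 'root' in new namespaces, then chroot.

    Assumes Linux, Python >= 3.12 and unprivileged user namespaces enabled.
    """
    # Read our real IDs first: after unsharing they show up as unmapped.
    uid, gid = os.getuid(), os.getgid()

    # New user namespace (to become root) and mount namespace (to mount freely).
    os.unshare(os.CLONE_NEWUSER | os.CLONE_NEWNS)

    # An unprivileged process must deny setgroups before writing gid_map.
    with open("/proc/self/setgroups", "w") as f:
        f.write("deny")
    with open("/proc/self/uid_map", "w") as f:
        f.write(single_id_map(uid))
    with open("/proc/self/gid_map", "w") as f:
        f.write(single_id_map(gid))

    # We now hold CAP_SYS_CHROOT inside the namespace, so chroot is permitted.
    os.chroot(rootfs)
    os.chdir("/")
```

Calling `rootless_chroot("/path/to/rootfs")` and then executing a shell would give the very rudimentary container from the talk. Note the single-ID map and the denied setgroups: these are exactly the limits that show up as problems later on.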
09:39.000 --> 09:44.000 With pivot_root, it's different, because it can leverage the fact that we are in a mount namespace 09:44.000 --> 09:52.000 that we can manipulate as we want, to really switch the root instead of just changing it, 09:52.000 --> 09:56.000 and then remove the original rootfs, so it's not accessible anymore. 09:56.000 --> 10:02.000 So you can't just escape by cd-ing up and up, right? 10:02.000 --> 10:07.000 How it works: we start with our rootfs on slash. 10:07.000 --> 10:12.000 And we have a new root, which is what we want to pivot to. 10:12.000 --> 10:21.000 With the syscall, we switch the new root with the old root at the same time, 10:21.000 --> 10:28.000 and you're left with the old root still mounted, and the new rootfs is the new root where the old one was before. 10:28.000 --> 10:36.000 But we can leverage the fact that we are in a mount namespace, and we can unmount the old root, 10:36.000 --> 10:43.000 and it disappears, so it's not accessible anymore from the process inside this namespace. 10:43.000 --> 10:48.000 We go for it, and for example, in Ubuntu, we have problems. 10:48.000 --> 10:51.000 So what's happening here? 10:51.000 --> 10:56.000 It cannot setgroups, it cannot setgid and setuid, 10:56.000 --> 11:01.000 and there is this _apt user inside here, 11:01.000 --> 11:08.000 because APT uses a sandbox of its own to download stuff for security purposes. 11:08.000 --> 11:19.000 And here we have setgroups set to deny, and we have only one user in this uid_map. 11:19.000 --> 11:25.000 We only mapped ourselves to root, and we didn't do anything else. 11:25.000 --> 11:34.000 So what we need here is to be able to map multiple users and to have the setgroups primitive enabled. 11:34.000 --> 11:36.000 We cannot do that. 11:36.000 --> 11:39.000 The only way to do that is to be root.
11:39.000 --> 11:45.000 So we have to use a little trick, which is using these newgidmap 11:45.000 --> 11:53.000 and newuidmap tools that are in the shadow package. 11:53.000 --> 11:58.000 Given that only root can do this stuff, 11:58.000 --> 12:05.000 those are all setuid binaries, so they actually run as root. 12:05.000 --> 12:13.000 But it's just for a brief moment, only for one thing, which is to map something for the child process. 12:13.000 --> 12:15.000 This is the unshared process. 12:15.000 --> 12:22.000 It's a very small tool with a very small codebase, so it's easily auditable. 12:22.000 --> 12:34.000 And it has security checks, where for example, newuidmap and newgidmap can be called only by a parent 12:34.000 --> 12:37.000 on its own child and nothing else. 12:37.000 --> 12:43.000 So I cannot just change the mappings of whatever PID I want, 12:43.000 --> 12:47.000 just from the parent to the child. 12:47.000 --> 12:52.000 So what we will do: in the host namespace, we have our main. 12:52.000 --> 12:58.000 We clone and unshare to a new namespace, and we have the child process. 12:58.000 --> 13:04.000 Immediately we use our helper that launches newuidmap on it. 13:04.000 --> 13:10.000 So it has the right mappings for users, setgroups and whatever. 13:10.000 --> 13:15.000 Then we do the pivot_root and run our entry point. 13:15.000 --> 13:22.000 It's very basic, so we are literally calling a shell command for it. 13:22.000 --> 13:25.000 That's the command that is run. 13:25.000 --> 13:31.000 But when we do that, the mappings here are changed. 13:31.000 --> 13:34.000 What are these numbers? Let's take a look. 13:34.000 --> 13:44.000 So the first number is the start of the range of IDs inside the namespace. 13:44.000 --> 13:51.000 This other number is the start of the IDs outside of the namespace, so on our host. 13:51.000 --> 13:55.000 And this is the range, so how many of them.
13:55.000 --> 14:07.000 The first line is what we also had before, so it's mapping from user 0 inside, so root, 14:07.000 --> 14:16.000 to user 1000 outside of the namespace, so ourselves most likely, for a range of one. 14:16.000 --> 14:19.000 So it's just a one-to-one mapping. 14:19.000 --> 14:27.000 User 0 to 1000 outside, so we are actually root inside the namespace, even if we are not root outside. 14:27.000 --> 14:34.000 And then it says from user 1 to 65,000, whatever. 14:34.000 --> 14:39.000 It maps them to user 100,000 and more. 14:39.000 --> 14:46.000 User 1 inside the container will be 100,001 outside of the container, and so on. 14:46.000 --> 14:53.000 So it's very far apart, so it doesn't interfere with real users on the host. 14:53.000 --> 15:00.000 But we actually have a mapping of all the users to real users on the host. 15:00.000 --> 15:03.000 And now APT works, we have setgroups. 15:03.000 --> 15:12.000 The _apt user can run, so it can download packages and it's happy. 15:12.000 --> 15:28.000 The same thing can be done for the PID namespace, so I want to make sure that the process inside our unshared namespace cannot look at other PIDs outside. 15:28.000 --> 15:39.000 So it can't interfere, I don't know, with my own processes, or with the real PID 1 or PID 2 or PID 3 or whatever. 15:39.000 --> 15:45.000 So it clones the process tree and maps it back to 1. 15:45.000 --> 15:55.000 So then all the children of 1 will be the children of 65 here, and are mapped to 2, 3 and 4 in the PID namespace. 15:56.000 --> 15:59.000 And the same can be done for the network namespace. 15:59.000 --> 16:07.000 It's a little bit tricky here because when we unshare the network namespace, we are just left with localhost and nothing else.
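To make the three columns of the ID map concrete, here is a small sketch (again in Python for brevity, not Lilipod's Go code) that parses map lines in the /proc/<pid>/uid_map format and translates an ID inside the namespace to the ID it has on the host. The function names and the example values are hypothetical; on a real system the outside range comes from the subordinate ID configuration, so the exact host IDs will differ.

```python
def parse_id_map(text):
    """Parse uid_map/gid_map content into (inside_start, outside_start, count) tuples."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        inside, outside, count = (int(field) for field in line.split())
        entries.append((inside, outside, count))
    return entries


def to_host_id(entries, inside_id):
    """Translate an ID seen inside the namespace to the host ID, or None if unmapped."""
    for inside, outside, count in entries:
        if inside <= inside_id < inside + count:
            return outside + (inside_id - inside)
    return None


# Hypothetical two-line map, in the same shape as the one discussed in the talk:
# root inside is our own user outside, everything else lands in a high range.
EXAMPLE_MAP = """\
         0       1000          1
         1     100000      65536
"""

if __name__ == "__main__":
    entries = parse_id_map(EXAMPLE_MAP)
    for inside_id in (0, 42, 70000):
        print(inside_id, "->", to_host_id(entries, inside_id))
```

An ID that falls outside every range, like 70000 in this example, simply has no counterpart on the host, which is why a single-line map was not enough for the _apt user.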
16:07.000 --> 16:24.000 So for now, Lilipod doesn't do this type of networking, but you can then create bridges and interfaces here and here, and then control the network access of the namespace. 16:24.000 --> 16:37.000 But that's not enough. Then we have capabilities. So what are capabilities? They were introduced in kernel 2.2, so they are quite old. 16:37.000 --> 16:48.000 But basically, before kernel 2.2, the root user could do everything and a non-root user could do just a subset of things. 16:48.000 --> 17:01.000 After that, they basically split all the privileges that the root user has into little capabilities that you can also have. 17:01.000 --> 17:12.000 And right now, what's happening is I have this set of capabilities, and when the container starts, it's inheriting all of them. 17:12.000 --> 17:20.000 And that's dangerous because capabilities are a way to access a set of syscalls. 17:20.000 --> 17:30.000 For example, with CAP_SYS_MODULE, you can modprobe stuff. With CAP_SYS_CHROOT, you can chroot. 17:30.000 --> 17:46.000 With CAP_SYS_PTRACE, you can ptrace, or you can chown and do dangerous stuff. And we don't want that, because this can lead to escaping a container pretty easily. 17:46.000 --> 17:52.000 And so we want to drop whatever is not needed to run our own container. 17:52.000 --> 18:00.000 This is a list of capabilities that I very much copied from Docker. 18:00.000 --> 18:12.000 It's a very restricted set of stuff, but it lets you do whatever is needed to enter a container or start whatever you have inside. 18:12.000 --> 18:20.000 And we can drop them, and we have fewer now, so it's more secure. 18:20.000 --> 18:31.000 We're almost there. So we can pull from our registry, we get our own [unclear], we create our own rootfs, we can run it, we are root inside it. 18:31.000 --> 18:42.000 We cannot see processes outside of it.
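The idea of keeping a small allow-list of capabilities and dropping the rest can be sketched with plain bit arithmetic. Each capability has a fixed bit number in the kernel, and masks such as CapEff in /proc/<pid>/status are bitmasks over those numbers. The sketch below is in Python for brevity and is not Lilipod's code. The allow-list shown is Docker's documented default set, used here only as an assumption about what "copied from Docker" looks like; the helper names are hypothetical, and actually dropping capabilities needs prctl(2)/capset(2), which this example does not call.

```python
# Capability names in kernel bit order (bit 0 is CAP_CHOWN), from <linux/capability.h>.
CAP_NAMES = [
    "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_DAC_READ_SEARCH", "CAP_FOWNER",
    "CAP_FSETID", "CAP_KILL", "CAP_SETGID", "CAP_SETUID", "CAP_SETPCAP",
    "CAP_LINUX_IMMUTABLE", "CAP_NET_BIND_SERVICE", "CAP_NET_BROADCAST",
    "CAP_NET_ADMIN", "CAP_NET_RAW", "CAP_IPC_LOCK", "CAP_IPC_OWNER",
    "CAP_SYS_MODULE", "CAP_SYS_RAWIO", "CAP_SYS_CHROOT", "CAP_SYS_PTRACE",
    "CAP_SYS_PACCT", "CAP_SYS_ADMIN", "CAP_SYS_BOOT", "CAP_SYS_NICE",
    "CAP_SYS_RESOURCE", "CAP_SYS_TIME", "CAP_SYS_TTY_CONFIG", "CAP_MKNOD",
    "CAP_LEASE", "CAP_AUDIT_WRITE", "CAP_AUDIT_CONTROL", "CAP_SETFCAP",
    "CAP_MAC_OVERRIDE", "CAP_MAC_ADMIN", "CAP_SYSLOG", "CAP_WAKE_ALARM",
    "CAP_BLOCK_SUSPEND", "CAP_AUDIT_READ", "CAP_PERFMON", "CAP_BPF",
    "CAP_CHECKPOINT_RESTORE",
]

# Docker's documented default capability set (assumed here for illustration;
# the list Lilipod actually keeps may differ).
DOCKER_DEFAULT_CAPS = [
    "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_FSETID", "CAP_FOWNER", "CAP_MKNOD",
    "CAP_NET_RAW", "CAP_SETGID", "CAP_SETUID", "CAP_SETFCAP", "CAP_SETPCAP",
    "CAP_NET_BIND_SERVICE", "CAP_SYS_CHROOT", "CAP_KILL", "CAP_AUDIT_WRITE",
]


def cap_mask(names):
    """Turn a list of capability names into a bitmask."""
    mask = 0
    for name in names:
        mask |= 1 << CAP_NAMES.index(name)
    return mask


def caps_in(mask):
    """List the capability names whose bits are set in the mask."""
    return [name for bit, name in enumerate(CAP_NAMES) if (mask >> bit) & 1]


def caps_to_drop(current_mask, keep):
    """Capabilities present in current_mask that are not on the keep list."""
    return caps_in(current_mask & ~cap_mask(keep))
```

Starting from a process that holds every capability, `caps_to_drop((1 << len(CAP_NAMES)) - 1, DOCKER_DEFAULT_CAPS)` lists what would have to go, including CAP_SYS_MODULE and CAP_SYS_PTRACE from the talk.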
We have the right mapping, we have the right setgroups, we have the right capabilities, and we don't have network. 18:42.000 --> 18:49.000 Like we wanted at the beginning, but that's not enough to have a really secure container. 18:49.000 --> 19:05.000 We are still missing cgroups, so we can limit the resource access of a container, and we are still missing seccomp filters to limit the syscalls that the container can make. 19:05.000 --> 19:13.000 We are missing SELinux or AppArmor integration, so that we can go even further beyond. 19:13.000 --> 19:20.000 This is the repo on GitHub, it's located at Lilipod, and thanks. 19:20.000 --> 19:30.000 Are there any questions? 19:30.000 --> 19:38.000 There's one. 19:38.000 --> 19:50.000 Okay, so I'm pretty curious: you mentioned that you needed root access for this helper, you know, for the group mapping, so do you know how projects like Podman do that? 19:50.000 --> 19:51.000 The same thing. 19:51.000 --> 20:04.000 So they use newuidmap and newgidmap, just for that brief moment, and then they are rootless otherwise. 20:04.000 --> 20:06.000 Actually, it's not truly rootless. 20:06.000 --> 20:11.000 It's rootless as much as it can be. 20:11.000 --> 20:25.000 Actually, just making a quick comment on that one because it's kind of funny: Alex, who is standing there, and myself, we've been working for probably a couple of years now on fully isolated unprivileged user namespaces. 20:25.000 --> 20:35.000 So with that, every single user on the system will be able to get an entire, well, an entire 32-bit UID and GID map, without needing the privileged helper at all. 20:35.000 --> 20:45.000 It will still be a bit tricky for file system access because you need to map that, but you'll be able to do FUSE, or tmpfs, or whatever, and get rid of that path completely. 20:45.000 --> 20:50.000 I look forward to it. 20:50.000 --> 20:53.000 Hey, I have a question.
20:53.000 --> 21:03.000 Basically, do you plan to add any Distrobox functionality that can be used only by Lilipod instead of Podman? 21:03.000 --> 21:09.000 I want to keep Distrobox container-manager agnostic. 21:09.000 --> 21:14.000 So there won't be special treatment for Lilipod. 21:14.000 --> 21:25.000 It will just be the fallback, because it's like an 8-megabyte binary without external dependencies outside of the basic syscalls that we are making. 21:25.000 --> 21:32.000 So this is the fallback solution for when I cannot find Docker or Podman. 21:32.000 --> 21:33.000 Thank you. 21:33.000 --> 21:36.000 Thank you. 21:36.000 --> 21:38.000 Anyone else?