WEBVTT 00:00.000 --> 00:14.600 Okay, let's start. 00:14.600 --> 00:21.200 My name is Marek Vasut, and I'm going to talk about boot blobs between U-Boot and Linux. 00:21.200 --> 00:23.320 The talk actually has two parts. 00:23.320 --> 00:29.180 I figured at the beginning I'll do some sort of a definition of terms, a couple of 00:29.180 --> 00:30.180 terms. 00:30.180 --> 00:36.180 I'll go through what is what, so we have that idea, and then once I'm through with this, 00:36.180 --> 00:40.540 I hope it will also become clear what the problem is, and in the second half I'll tell you 00:40.540 --> 00:41.860 how to solve it. 00:41.860 --> 00:47.300 So first thing is the bootloader stack, what is this? 00:47.300 --> 00:52.700 So this is all the software that runs when you power on your system, between that 00:52.700 --> 00:56.820 point and when the operating system can take over. 00:56.820 --> 00:59.020 Traditionally, it was a very small piece of 00:59.020 --> 01:02.740 software, or no software at all, because you just turned on your system, and maybe 01:02.740 --> 01:07.100 the kernel booted right away, but this was the 80s. 01:07.100 --> 01:11.980 These days, embedded systems are much more complex. 01:11.980 --> 01:17.140 When those started, the bootloader stack was usually a single project, U-Boot, coreboot, whatever, 01:17.140 --> 01:20.540 this was the 2000s, give or take. 01:20.540 --> 01:25.420 The system was powered up, started booting from a reset vector, there was a parallel 01:25.660 --> 01:31.660 flash at the reset vector, and the CPU was just executing instructions from that parallel 01:31.660 --> 01:32.660 flash. 01:32.660 --> 01:39.660 The software then initialized the RAM, loaded the kernel into the RAM, and the kernel 01:39.660 --> 01:41.380 was running at that point. 01:41.380 --> 01:46.020 Now, it got a little bit more complicated over time, because parallel flash has 01:46.020 --> 01:47.020 way too many pins. 
01:47.020 --> 01:49.500 It's difficult to route and so on. 01:49.500 --> 01:54.260 So the hardware vendors figured, okay, well, let's put a piece of software into the 01:54.340 --> 01:58.580 SoC itself, that's nowadays called BootROM. 01:58.580 --> 02:03.940 This is responsible for implementing complex protocols to communicate with storage, 02:03.940 --> 02:10.260 and ultimately for loading a small piece of software into internal memory, usually SoC 02:10.260 --> 02:16.660 internal SRAM. That small piece of software is the first bootloader which will run 02:16.660 --> 02:22.660 on your machine, and it will initialize the RAM, and it will load the next stage. Whatever 02:22.660 --> 02:29.460 is running in the SRAM, this could be either U-Boot SPL, it could be TF-A BL2 or something else. 02:29.460 --> 02:36.020 What is loaded next is usually loaded into the RAM, that could be potentially a TEE implementation, 02:36.020 --> 02:41.500 trusted execution environment, it could be U-Boot itself, it could also be a PSCI provider 02:41.500 --> 02:42.940 and then U-Boot. 
02:42.940 --> 02:48.540 So as you can see, at this point in the present, where we are kind of now, the bootloader 02:48.540 --> 02:53.980 stack is much more complex, it's a composite of multiple bootloader projects, some 02:53.980 --> 02:59.220 of them closed source, like the BootROM, some of them open source, like the TF-A, some 02:59.220 --> 03:06.380 of them free software, like U-Boot. And it didn't stop here, it's actually now slowly 03:06.380 --> 03:12.460 moving to the future where the bootloader stack is no longer running on a single 03:12.460 --> 03:17.020 core, it's running on multiple cores. When the system powers up, what actually comes 03:17.020 --> 03:25.740 up is a safety core, which is a small CPU core, Cortex-R, Cortex-M, that runs some sort 03:25.740 --> 03:33.580 of a bootloader of its own, it does the hardware initialization fully, it starts the RAM, it does 03:33.580 --> 03:38.220 initialization of the clock tree and so on, and only then it starts providing some sort 03:38.220 --> 03:44.740 of services to the application cores, it brings up the application cores and runs some sort 03:44.740 --> 03:49.140 of a bootloader on the application cores, which then starts the kernel on the application 03:49.140 --> 03:50.140 core. 03:50.140 --> 03:55.580 But I don't want to talk about this, today I'll go one step back to the present and I'll 03:55.580 --> 04:05.860 return to the situation where we are running the bootloader on one single CPU core entirely. 04:05.860 --> 04:11.420 And the next thing I would like to explain is exception levels, this is ARMv8 specific, 04:11.420 --> 04:20.900 but other ARM CPUs also, yeah, other ARM CPUs, the older cores, ARMv7 and 04:20.900 --> 04:25.080 stuff like that, they also have exception levels, it just looks slightly different. RISC-V 04:25.080 --> 04:31.540 also has a similar concept. I'll talk about ARMv8 because this is what I was working with 04:31.540 --> 04:34.980 recently and it's easy to understand. 
04:34.980 --> 04:41.220 So the situation is this, ARMv8 has four exception levels, or you can call them privilege 04:41.220 --> 04:49.940 rings basically, which allow some sort of a layering of hardware access effectively. 04:49.940 --> 04:56.060 The most privileged ring or exception level is EL3, this is basically software that has 04:56.060 --> 05:03.220 full unrestricted access to everything, all memory, all peripherals. Software in that exception level 05:03.220 --> 05:12.700 can set up memory protection, it can also set up IP protection, it can set up ARMv8 basically 05:12.700 --> 05:20.660 to prevent software in less privileged exception levels from accessing peripherals and memory, 05:20.660 --> 05:24.740 that's basically the whole idea behind this. 05:24.740 --> 05:29.900 The most privileged is EL3, this is where the secure monitor is running, usually some sort of a basic 05:29.900 --> 05:38.620 firmware, which has the highest access. EL2, that's where nowadays the bootloader runs, the 05:38.620 --> 05:44.900 Linux kernel on bare metal runs, EL1 is the virtualized kernel and EL0 is applications, 05:44.900 --> 05:47.780 unprivileged userspace applications. 05:47.780 --> 05:52.420 The one thing which happens here is that it is possible to switch between these exception 05:52.420 --> 05:57.900 levels. You can switch from a more privileged exception level to a less privileged one, simply 05:57.900 --> 06:03.600 by setting up CPU context and doing an exception return, that's easy because, hey, you are more 06:03.600 --> 06:07.660 privileged so you can switch to less privileged, that's trivial. But you can also go the other 06:07.660 --> 06:13.140 way around, and that requires the less privileged software to trigger an exception in 06:13.140 --> 06:16.940 the more privileged exception level. 
06:16.940 --> 06:21.740 For that, there are two instructions, SMC and HVC. When that instruction is issued in 06:21.740 --> 06:27.840 the less privileged software, it actually switches the CPU state to a more 06:27.840 --> 06:35.080 privileged state and it triggers an exception handler in the more privileged ring. 06:35.080 --> 06:38.480 The exception handler is supposed to do some sort of permission checking and so on and then 06:38.480 --> 06:48.400 maybe does some sort of an action, like something for which the system needs higher privileges. 06:48.400 --> 06:54.120 By something, I mean, maybe enable a clock, maybe bring up a CPU core. 06:54.120 --> 07:00.840 The thing is, these exceptions, where less privileged software can request stuff from more privileged 07:00.840 --> 07:02.840 software, 07:02.840 --> 07:15.080 they serve, or, yeah, these exceptions, where less privileged software can request 07:15.080 --> 07:22.080 services from more privileged software, serve as a way for the more privileged software to provide 07:22.080 --> 07:28.600 services to the less privileged software with some sort of access control. 07:28.600 --> 07:36.080 But the issue with that is that when there is this kind of a contract between less and 07:36.080 --> 07:43.520 more privileged software, there has to be some sort of an ABI effectively between those two software 07:43.520 --> 07:44.520 components. 07:44.600 --> 07:45.520 And that exists. 07:45.520 --> 07:53.400 There are actually some standardized ABIs already, PSCI and SCMI, but, unfortunately, vendor 07:53.400 --> 08:01.360 BSPs extend them in all kinds of weird ways, they introduce new shiny ABIs 08:01.360 --> 08:10.640 which then suddenly, in the next version, change ABI. So we have a problem, and the problem 08:10.680 --> 08:12.640 is this effectively. 
08:12.640 --> 08:17.840 If your bootloader is providing some sort of a service, say, to your operating system 08:17.840 --> 08:24.320 kernel, or some sort of a previous stage blob provides services to the next stage blob, 08:24.320 --> 08:26.320 it becomes an ABI. 08:26.320 --> 08:30.320 Once you change one of these blobs and it provides a different ABI, you have an ABI 08:30.320 --> 08:33.360 break and everything falls apart, that happens. 08:33.360 --> 08:37.600 It can happen in two ways, traditionally. 08:37.600 --> 08:42.160 If the bootloader is providing an ABI and you update the kernel and the kernel expects some 08:42.160 --> 08:47.600 sort of a new ABI which the bootloader doesn't provide, 08:47.600 --> 08:50.640 then the kernel will fail to boot, most likely. 08:50.640 --> 08:55.680 If you have A/B updates, this is still recoverable, because the kernel will fail to boot, okay, 08:55.680 --> 08:59.760 the bootloader will pick the other kernel, boot the B copy, and you can recover. 08:59.760 --> 09:01.840 This is the better case. 09:01.840 --> 09:07.600 If you have the bright idea then to say, okay, well, my kernel needs a new ABI, 09:07.600 --> 09:11.040 let's update the bootloader as well, and suddenly your kernel doesn't boot, 09:11.920 --> 09:17.520 then also your bootloader provides the new ABI, the B copy doesn't boot, and your system is basically 09:17.520 --> 09:18.640 unusable. 09:18.640 --> 09:20.640 That's a game over. 09:21.440 --> 09:27.280 So the issue I will talk about next is how to solve this ABI problem. 
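The two failure cases described here can be modelled in a few lines of C. This is only a sketch under stated assumptions: the `boot_slot` structure, the integer ABI versions and the `image_ok` flag are hypothetical and stand in for whatever a real A/B update scheme tracks. It shows that a kernel-only update stays recoverable, while a firmware update that changes the ABI can leave both copies unbootable.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of one bootable kernel copy (A or B). */
struct boot_slot {
    const char *name;
    int expected_abi;   /* firmware ABI version this kernel was built against */
    int image_ok;       /* 0 if the kernel image fails for an unrelated reason */
};

/* In this model a kernel boots only if it is intact and the ABI matches. */
int slot_boots(const struct boot_slot *slot, int firmware_abi)
{
    return slot->image_ok && slot->expected_abi == firmware_abi;
}

/*
 * Try copy A first, then fall back to copy B.
 * Returns the slot that booted, or NULL when neither copy can boot.
 */
const struct boot_slot *ab_boot(const struct boot_slot *a,
                                const struct boot_slot *b,
                                int firmware_abi)
{
    assert(a != NULL && b != NULL);

    if (slot_boots(a, firmware_abi))
        return a;
    if (slot_boots(b, firmware_abi))
        return b;
    return NULL;
}
```

With firmware at ABI 1, a new kernel expecting ABI 2 in slot A and the old kernel in slot B, `ab_boot()` returns slot B. Once the firmware is also moved to ABI 2 and the new kernel turns out to be broken, it returns `NULL`, which is the "game over" case.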
09:27.680 --> 09:33.520 But before I get into it, there is still one thing which was mentioned to me while I was 09:33.520 --> 09:40.960 preparing the work and slides by a colleague, and he mentioned, okay, well, you should also say 09:40.960 --> 09:47.280 that this isolation stuff and the privilege rings, they are not necessarily evil. That's true. 09:48.160 --> 09:56.800 The thing is, the higher privilege ring software can set up memory isolation, it can set up 09:57.360 --> 10:02.560 IP isolation to prevent the less privileged software from accessing these parts. 10:03.680 --> 10:08.240 If the higher privilege software is some sort of a proprietary closed-source blob and you do not know what 10:08.240 --> 10:13.760 it's doing, then that's probably a bad thing, because your less privileged, say, bootloader then 10:13.760 --> 10:20.320 cannot be used as a debug tool to analyze what the system is doing, and you cannot effectively 10:20.400 --> 10:28.080 debug the system. But if your more privileged software is something which you set up yourself, 10:28.080 --> 10:36.160 then you can set up the isolation and IP access restrictions in such a way that what you can do 10:36.160 --> 10:41.040 in the more privileged software is handle, for example, access faults, and actually have that act 10:41.040 --> 10:42.560 as a better debug tool. 10:42.880 --> 10:49.200 So, yeah, memory isolation, not necessarily evil, it can be used for very good stuff. 10:50.640 --> 10:57.440 But back to the ABI problem. The solution to the ABI problem is actually super simple, and that's: 10:57.440 --> 11:06.560 let's reorder the bootloader and blobs. The stack looks, give or take, like the one on top right now, 11:06.880 --> 11:15.440 so we power up the system, BootROM runs, TF-A, TEE, that potentially runs in EL3, but TF-A 11:16.720 --> 11:23.120 most likely switches the exception level to EL2, starts U-Boot in EL2, and then Linux runs in EL2. 
11:23.120 --> 11:30.000 What we will do is we will move the blobs, TF-A and TEE, after U-Boot, and then U-Boot 11:30.000 --> 11:35.520 runs in EL3, it will have access to everything, it can potentially even set up the memory restrictions, 11:35.520 --> 11:44.560 memory protection, if it desires to do so. We will start the TF-A BL31, which is the PSCI provider, from 11:44.560 --> 11:52.560 U-Boot, and then we will start the kernel. The benefit of this is that if we start the TF-A BL31 from 11:52.560 --> 11:59.120 U-Boot, we can start both TF-A BL31 and the kernel in lockstep, basically keep these two 12:00.960 --> 12:07.920 blobs, which have some sort of an ABI contract between them, started together, and if something 12:07.920 --> 12:15.120 fails we can take the other copy. Implementation is effectively simple, we have to do two steps. 12:15.120 --> 12:20.320 First step, we need to make sure that U-Boot doesn't depend on any services provided by the 12:20.320 --> 12:28.560 blobs, that means TF-A BL31 or maybe TEE, which means U-Boot must not depend 12:28.560 --> 12:35.360 on either PSCI or SCMI interfaces, and in the second step we need to teach U-Boot to boot the 12:35.360 --> 12:43.920 TF-A BL31. The first step is easy, because if you look at how the services are actually implemented, 12:43.920 --> 12:52.160 when you call some sort of a PSCI or SCMI function or exception, ultimately what the blob does 12:52.160 --> 12:58.080 internally is that it programs some sort of registers. So, U-Boot can also access these registers 12:58.080 --> 13:04.320 if it runs in EL3, and in order to remove the dependency on PSCI or SCMI all you have to do is 13:04.320 --> 13:09.360 make U-Boot poke these same registers from some sort of a U-Boot driver model driver. So, 13:09.920 --> 13:16.640 a U-Boot driver which matches that functionality, and you are effectively done. 13:18.080 --> 13:22.400 But there is a catch. 
Once you remove the PSCI provider from before U-Boot, 13:25.680 --> 13:33.200 when the Linux kernel boots on ARMv8 it mandatorily depends on PSCI being available. So, 13:33.520 --> 13:41.520 if the PSCI provider is not there, the Linux kernel will fail to boot. It will just 13:41.520 --> 13:49.760 not do anything. Luckily, U-Boot can act as a PSCI provider. But the PSCI provider implementation 13:49.760 --> 13:55.680 is effectively architecture or board specific. So, what can be done on the U-Boot side is 13:55.680 --> 14:02.960 enablement of some sort of a very basic, very initial, rudimentary PSCI provider which 14:03.040 --> 14:08.480 will only provide the Linux kernel the ability to say, okay, here is a PSCI. It will not be able to 14:08.480 --> 14:13.600 start any CPU cores, it will not be able to do any system reset or anything, but it will be there. 14:14.320 --> 14:21.120 So, in case the user has any problems, they can still at least boot the kernel if they, for example, fail 14:21.120 --> 14:27.680 to start the PSCI provider, and do some corrective action. Otherwise, at some later point, 14:28.320 --> 14:37.200 the PSCI provider will be started. So, about starting the PSCI provider, there are again two options. 14:37.200 --> 14:44.960 One of them, kind of the easy one, is to just load the TF-A BL31 into memory, disable caches, for that 14:44.960 --> 14:50.400 U-Boot has commands, and then jump into it. The only thing which you have to be careful about 14:50.400 --> 14:57.520 in that case is to make sure that the exit point, where the TF-A BL31 will jump after it's done 14:57.600 --> 15:05.360 doing its job, will be again the U-Boot entry point, because then the TF-A BL31 runs, 15:05.360 --> 15:14.480 drops EL from 3 to 2 and restarts U-Boot, which then runs in EL2, and then you have 15:14.480 --> 15:20.880 U-Boot running in EL2 with PSCI finally in place and you can start Linux. 
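The rudimentary PSCI provider mentioned above can be pictured as a handler that answers the version query and refuses everything else. The following is a minimal sketch, not U-Boot's implementation: the function IDs and the `NOT_SUPPORTED` return code are taken from the Arm PSCI specification, while `psci_stub_handle()`, the helper names and the choice to advertise version 0.2 are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Function IDs as defined by the Arm PSCI specification. */
#define PSCI_VERSION            0x84000000u
#define PSCI_CPU_ON_64          0xC4000003u
#define PSCI_SYSTEM_OFF         0x84000008u
#define PSCI_SYSTEM_RESET       0x84000009u

/* PSCI return code for calls the provider does not implement. */
#define PSCI_RET_NOT_SUPPORTED  (-1)

/* PSCI_VERSION result: major in bits [31:16], minor in bits [15:0]. */
uint32_t psci_make_version(uint32_t major, uint32_t minor)
{
    assert(major <= 0x7fffu && minor <= 0xffffu);
    return (major << 16) | minor;
}

uint32_t psci_version_major(uint32_t version) { return version >> 16; }
uint32_t psci_version_minor(uint32_t version) { return version & 0xffffu; }

/*
 * Hypothetical stub: it only tells the caller that PSCI exists.
 * Assumption: it advertises version 0.2; the talk does not say which
 * version the real stub reports. CPU bring-up, reset and power-off are
 * all refused, matching "it will not be able to start any CPU cores".
 */
int64_t psci_stub_handle(uint32_t function_id)
{
    switch (function_id) {
    case PSCI_VERSION:
        return psci_make_version(0, 2);
    default:
        return PSCI_RET_NOT_SUPPORTED;
    }
}
```

A full provider such as TF-A BL31 would implement `CPU_ON`, `SYSTEM_RESET` and the rest, and would report a higher version; a version 1.1 provider returns `0x00010001` from `PSCI_VERSION`.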
The other option is to use 15:21.680 --> 15:28.960 a FIT image, which is, I believe, the more practical option for deployment, and I'll talk about that. 15:29.520 --> 15:36.160 So, FIT image is a multi-component image based on device tree. Just briefly, it's a container which 15:36.160 --> 15:43.120 basically allows you to bundle together multiple blobs, kernels, device trees, firmware, FPGA 15:43.120 --> 15:50.240 bitstreams, into a single file. It can contain all these images, it also has a configurations section which allows 15:50.320 --> 15:55.840 you to tie together different images within the FIT image and instruct U-Boot which of these 15:55.840 --> 16:02.960 images to use for booting. U-Boot is capable of booting the FIT images, OpenEmbedded, for example, 16:02.960 --> 16:11.280 is capable of generating the FIT images for you. Now, when U-Boot is booting a FIT image, 16:12.800 --> 16:18.160 of course you have to instruct it which images from the FIT image it is supposed to pick, usually 16:18.240 --> 16:25.840 done by selecting a configuration which selects the kernel, selects the device trees, selects 16:25.840 --> 16:33.680 potentially another loadable. U-Boot relocates these images which are in the FIT image into their 16:33.680 --> 16:42.800 target memory addresses, when that's done it runs the loadable handler for all these images, and then 16:43.760 --> 16:48.480 at the very end it boots the kernel. That's basically how it works. 16:58.400 --> 17:08.080 So, the loadable handler for the TF-A BL31 unfortunately has to be board specific, because 17:08.080 --> 17:14.800 that loadable handler may have to do some additional configuration. The TF-A BL31 forks from vendors 17:14.800 --> 17:20.720 tend to have some special requirements, may require some relocation tables, hand-off tables, 17:20.720 --> 17:30.160 whatever, set up. So, these TF-A BL31 loadable handlers, they unfortunately will have to be 17:30.480 --> 17:39.360 board specific. 
Before U-Boot jumps to the Linux kernel there is now an extension which is called 17:39.360 --> 17:48.000 the jump prep handler, right, this is called just before U-Boot jumps into the Linux kernel, 17:48.560 --> 17:55.120 this can be used as a hook to finally jump into the TF-A BL31. I'll show you the implementation 17:55.200 --> 18:04.000 of this whole thing now, because it's actually rather simple, and how the TF-A BL31 support 18:04.000 --> 18:09.040 was added into U-Boot. It had three parts basically, and it fits on three slides. 18:09.840 --> 18:19.200 First part was to extend the FIT image loader so that it would actually recognize the BL31 image type, 18:19.280 --> 18:24.160 that's literally three entries in the right places in U-Boot, that's how the patch looks. 18:25.520 --> 18:33.040 The next part is the loadable handler. This is board specific, and it basically is executed 18:33.040 --> 18:40.080 after the TF-A BL31 loadable is placed somewhere in memory. Specifically for this platform, all it does 18:40.080 --> 18:46.400 is it checks whether the system is actually running in EL3 at all, because U-Boot might 18:46.400 --> 18:51.680 be running in EL2, and at that point you won't be able to start the TF-A BL31. 18:51.680 --> 18:58.480 If it is running in EL3, then it marks the load address of the blob and it stores it in some sort 18:58.480 --> 19:05.200 of a global variable. 
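The interplay of the two hooks described here, a loadable handler that records where BL31 was placed and a jump-prep hook that decides whether to take the detour, can be sketched as a small host-side model. Every name below is hypothetical and none of it is U-Boot API: the exception level is a plain variable so the logic can be exercised outside a bootloader.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for reading the current exception level. */
static unsigned int model_current_el = 3;

/* Load address of the BL31 blob; 0 means "nothing recorded". */
static uintptr_t bl31_load_addr;

void model_set_el(unsigned int el)
{
    model_current_el = el;
}

/*
 * Loadable handler: called after the BL31 image has been placed in memory.
 * Outside EL3 the blob cannot be started, so the address is ignored.
 */
void bl31_loadable_handler(uintptr_t load_addr)
{
    assert(load_addr != 0);   /* 0 is reserved as the "not recorded" marker */

    if (model_current_el != 3)
        return;
    bl31_load_addr = load_addr;
}

/*
 * Jump-prep hook: called just before the jump into Linux.
 * Returns the address to detour through, or 0 to boot Linux directly.
 */
uintptr_t jump_prep_bl31_target(void)
{
    if (model_current_el != 3)
        return 0;
    return bl31_load_addr;
}
```

In a real board port the jump-prep step would additionally fill in the vendor-specific hand-off structure before jumping; that part is left out here because its layout is not standardized.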
The most interesting part actually happens at the very end of this stuff, 19:06.080 --> 19:14.080 which is the jump prep handler. That one checks whether the TF-A BL31 is loaded, again if we are in 19:14.080 --> 19:23.520 EL3 [unclear], and finally after that, after we are through all this stuff, it 19:23.520 --> 19:33.680 sets up the board-specific hand-off table for the TF-A BL31, which instructs that one specific TF-A BL31 to 19:35.440 --> 19:42.320 actually return back into U-Boot, into this function which is called armv8_switch_to_el2. 19:42.320 --> 19:49.840 Now why do we do that? This is actually an assembler function. We do that because we want the TF-A BL31 19:49.840 --> 19:57.520 to run, then drop to EL2, and then return to U-Boot just before the assembler code in U-Boot 19:57.520 --> 20:04.800 which sets up the system for starting Linux. So essentially what happens just before 20:04.800 --> 20:08.800 U-Boot jumps into Linux is that 20:08.800 --> 20:15.680 it makes this detour into the TF-A BL31, then returns into U-Boot, sets up the system for booting Linux and 20:15.680 --> 20:27.360 then jumps into Linux. The integration of the TF-A BL31 blob into a FIT image looks like this, this is 20:27.360 --> 20:36.960 FIT image source. All you have to do is add another image into the FIT image source, this time 20:36.960 --> 20:45.760 of type TF-A BL31, this is the image type which was defined like three slides before, and it has to be marked 20:45.760 --> 20:50.640 as a loadable in the configuration section so that U-Boot would load it for you, run the loadable 20:50.640 --> 20:58.400 handler and then do the jump prep handler. And finally I actually have kind of a demo here in 20:58.400 --> 21:04.960 the slides. What happens when you build that FIT image using mkimage and start this kind of a FIT image 21:04.960 --> 21:12.880 is this: you will see literally almost no change when you boot such a FIT image. 
You will see that 21:12.880 --> 21:17.760 U-Boot does load one more blob, but there will be no additional output, you will just see the 21:17.760 --> 21:26.480 Linux kernel booting and that's it. The reason for that is because the TF-A BL31 doesn't print anything 21:26.480 --> 21:33.520 unless you enable debug output. But you will notice that things actually do work when the Linux 21:33.520 --> 21:38.800 kernel is booting, because at the very beginning it will report, oh look, I found some more advanced 21:38.800 --> 21:46.000 PSCI, version 1.1 I think, and if you look further you will notice that the Linux kernel actually 21:46.000 --> 21:52.720 managed to bring up all the CPU cores, not just CPU 0, and all the CPU core bring-up is done 21:52.720 --> 22:02.960 by calling into the PSCI. So, hey, cool, yeah, we could boot all 4 CPUs and it's great. And to wrap it up 22:02.960 --> 22:11.680 in some way I have this one last slide. So what did we achieve? Basically we achieved the ability 22:11.840 --> 22:19.120 to tie together the Linux kernel, device trees and the TF-A BL31, which is the PSCI provider, 22:19.120 --> 22:25.920 into one single FIT image, basically into one file which we can boot all together. This fits well 22:25.920 --> 22:32.960 with A/B updates, because then U-Boot is capable of picking this or that kernel image based on, for 22:32.960 --> 22:40.160 example, boot counter, and if we can now pick this or that, not just kernel image, but kernel image, 22:40.240 --> 22:48.320 device tree and the TF-A BL31 PSCI provider, then we can safely update all these things together 22:49.120 --> 22:59.200 as one single file or as one single root file system image. Now, if the FIT image fails to boot, then 22:59.840 --> 23:04.400 U-Boot would just boot the other FIT image, which even if it contains an older kernel version, it also 23:04.480 --> 23:12.000 contains the older TF-A BL31 with the old ABI, and this way we do not break the 23:16.480 --> 23:22.320 yeah, this way we do not end up with having any incompatible ABI 
between the kernel and the PSCI provider, 23:24.000 --> 23:30.320 and that's very much the point of this talk. So that's what I wanted to show you and this is all 23:30.400 --> 23:46.960 I have, thank you for your attention. And do you want to get it back, and then I'll have that one. 23:46.960 --> 23:57.360 Hey Fuzzy, this is really cool, super exciting to see, and I had two really small questions. One is, the 23:57.360 --> 24:04.400 U-Boot SPL code already has support for configuring, like, the BL31 parameters and jumping 24:04.400 --> 24:13.200 to the TF-A and launching U-Boot even, and... One question, so I'll actually go into that one. Yes, 24:13.200 --> 24:20.960 U-Boot SPL can start TF-A BL31, that's correct, and then it can start U-Boot, yes. So my question is, if, 24:20.960 --> 24:28.880 like, I did some similar kind of hacking on this and I just moved that code into the generic, 24:28.880 --> 24:35.360 like, bootm framework, and it seemed to, like, I don't get why you have it in the board specific code. 24:37.120 --> 24:44.080 Because the TF-A BL31 blobs, they have all kinds of vendor weirdnesses, that's why this is 24:44.080 --> 24:49.440 board specific, unfortunately. Okay. The hand-off structures which you have to set up to start the 24:49.440 --> 24:55.840 TF-A BL31, they are not necessarily standardized. I know there was a discussion, actually, if you 24:55.840 --> 25:03.120 look at these patches, that, yes, some of it seems almost standardized, but then you run the 25:03.120 --> 25:10.000 CCL and start TF-A BL31