WEBVTT

00:00.000 --> 00:14.600
Okay, let's start.

00:14.600 --> 00:21.200
My name is Marek Vasut, and I'm going to talk about booting blobs between U-Boot and Linux.

00:21.200 --> 00:23.320
The talk actually has two parts.

00:23.320 --> 00:29.180
I figured at the beginning I'll do some sort of a definition of terms, a couple of

00:29.180 --> 00:30.180
terms.

00:30.180 --> 00:36.180
I'll go through what is what, so we have that idea, and then once I'm through with this,

00:36.180 --> 00:40.540
I hope it will also become clear what the problem is, and in the second half I'll tell you

00:40.540 --> 00:41.860
how to solve it.

00:41.860 --> 00:47.300
So the first thing is the bootloader stack. What is this?

00:47.300 --> 00:52.700
So this is all the software that runs between the moment you power on your system and the

00:52.700 --> 00:56.820
point when the operating system can take over.

00:56.820 --> 00:59.020
Traditionally, it was a very small piece of

00:59.020 --> 01:02.740
software, or no software at all, because you just turned on your system, and maybe

01:02.740 --> 01:07.100
it could boot right away, but this was the 80s.

01:07.100 --> 01:11.980
These days, embedded systems are much more complex.

01:11.980 --> 01:17.140
When those started, the bootloader stack was usually a single project, U-Boot, Corb, whatever;

01:17.140 --> 01:20.540
this was the 2000s, give or take.

01:20.540 --> 01:25.420
The system was powered up and started booting from a reset vector; there was a parallel

01:25.660 --> 01:31.660
flash at the reset vector, and the CPU was just executing instructions from that parallel

01:31.660 --> 01:32.660
flash.

01:32.660 --> 01:39.660
That software initialized the RAM, loaded the kernel into the RAM, and the kernel

01:39.660 --> 01:41.380
was running at that point.

01:41.380 --> 01:46.020
Now, it got a little bit more complicated over time, because parallel flash has

01:46.020 --> 01:47.020
way too many pins.

01:47.020 --> 01:49.500
It's difficult to route and so on.

01:49.500 --> 01:54.260
So the hardware vendors figured, okay, well, let's put a piece of software into the

01:54.340 --> 01:58.580
SoC itself; that's nowadays called the BootROM.

01:58.580 --> 02:03.940
This is responsible for implementing complex protocols to communicate with storage,

02:03.940 --> 02:10.260
and ultimately for loading a small piece of software into internal memory, usually SoC

02:10.260 --> 02:16.660
internal SRAM. That small piece of software is the first bootloader which will run

02:16.660 --> 02:22.660
on your machine; it will initialize the RAM and it will load the next stage. Whatever

02:22.660 --> 02:29.460
is running in the SRAM could be either U-Boot SPL, it could be TF-A BL2, or something else.

02:29.460 --> 02:36.020
What is loaded next is usually loaded into the RAM. That could potentially be a TEE implementation,

02:36.020 --> 02:41.500
a trusted execution environment; it could be U-Boot itself; it could also be a PSCI provider

02:41.500 --> 02:42.940
and then U-Boot.

02:42.940 --> 02:48.540
So as you can see, at this point in the present, where we are kind of now, the bootloader

02:48.540 --> 02:53.980
stack is much more complex. It's a composite of multiple bootloader projects, some

02:53.980 --> 02:59.220
of them closed source, like the BootROM, some of them open source, like TF-A, some

02:59.220 --> 03:06.380
of them free software, like U-Boot. And it didn't stop here; it's actually now slowly

03:06.380 --> 03:12.460
moving to the future, where the bootloader stack is no longer running on a single

03:12.460 --> 03:17.020
core, it's running on multiple cores. When the system powers up, what actually comes

03:17.020 --> 03:25.740
up is a safety core, which is a small CPU core, Cortex-R, Cortex-M, that runs some sort

03:25.740 --> 03:33.580
of a bootloader on its own. It does the hardware initialization fully: it starts the RAM, it does

03:33.580 --> 03:38.220
initialization of the clock tree and so on. Then, and only then, it starts providing some sort

03:38.220 --> 03:44.740
of services to the application cores, it brings up the application cores and runs some sort

03:44.740 --> 03:49.140
of a bootloader on the application cores, which then starts the kernel on the application

03:49.140 --> 03:50.140
core.

03:50.140 --> 03:55.580
But I don't want to talk about this. Today I'll go one step back to the present and I'll

03:55.580 --> 04:05.860
return to the situation where we are running the bootloader on one single CPU core entirely.

04:05.860 --> 04:11.420
And the next thing I would like to explain is exception levels. This is ARMv8 specific,

04:11.420 --> 04:20.900
but other ARM CPUs also, yeah, other ARM CPUs, the older cores, ARMv7 and

04:20.900 --> 04:25.080
stuff like that, they also have exception levels, it just looks slightly different. RISC-V

04:25.080 --> 04:31.540
also has a similar concept. I'll talk about ARMv8 because this is what I was working with

04:31.540 --> 04:34.980
recently and it's easy to understand.

04:34.980 --> 04:41.220
So the situation is this: ARMv8 has four exception levels, or you can call them privilege

04:41.220 --> 04:49.940
rings basically, which allow some sort of a layering of hardware access effectively.

04:49.940 --> 04:56.060
The most privileged ring or exception level is EL3. This is basically software that has

04:56.060 --> 05:03.220
full unrestricted access to everything, all memory, all peripherals. Software in that exception level

05:03.220 --> 05:12.700
can set up memory protection, and it can also set up IP protection, basically,

05:12.700 --> 05:20.660
to prevent software in less privileged exception levels from accessing peripherals and memory.

05:20.660 --> 05:24.740
That's basically the whole idea behind this.

05:24.740 --> 05:29.900
The most privileged is EL3; this is where the secure monitor is running, usually some sort of a basic

05:29.900 --> 05:38.620
firmware, which has the highest access. EL2 is where nowadays the bootloader is run, and where the

05:38.620 --> 05:44.900
Linux kernel or a hypervisor runs. EL1 is the virtualized kernel, and EL0 is applications,

05:44.900 --> 05:47.780
unprivileged user-space applications.

05:47.780 --> 05:52.420
The one thing which happens here is that it is possible to switch between these exception

05:52.420 --> 05:57.900
levels. You can switch from a more privileged exception level to a less privileged one simply

05:57.900 --> 06:03.600
by setting up the CPU context and doing an exception return. That's easy because, hey, you are more

06:03.600 --> 06:07.660
privileged, so you can switch to less privileged, that's trivial. But you can also go the other

06:07.660 --> 06:13.140
way around, and that requires the less privileged software to trigger an exception in

06:13.140 --> 06:16.940
the more privileged exception level.

06:16.940 --> 06:21.740
For that, there are two instructions, SMC and HVC. When that instruction is issued in

06:21.740 --> 06:27.840
the less privileged software, it actually switches the CPU state to a more

06:27.840 --> 06:35.080
privileged state and it triggers an exception handler in the more privileged ring.

06:35.080 --> 06:38.480
The exception handler is supposed to do some sort of permission checking and so on, and then

06:38.480 --> 06:48.400
maybe does some sort of an action, something for which the system needs higher privileges.

06:48.400 --> 06:54.120
By something, I mean maybe enable a clock, maybe bring up a CPU core.
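
NOTE
The switching rules described above can be sketched as a toy Python model.
This is only an illustration under simplifying assumptions: the Cpu class, the
secure_monitor handler and the request names are hypothetical, and real hardware
saves and restores far more state than a single level number.
```python
# Toy model of ARMv8 exception-level switching; a simplification, not real CPU semantics.
EL0, EL1, EL2, EL3 = 0, 1, 2, 3
# Exception level targeted by each call instruction in this model.
TRAP_TARGET = {"hvc": EL2, "smc": EL3}
class Cpu:
    def __init__(self, level=EL3):
        self.level = level
    def eret(self, target):
        # Exception return may only keep or lower the privilege level.
        if target > self.level:
            raise PermissionError("ERET cannot raise the exception level")
        self.level = target
    def call(self, insn, handler, request):
        # Going up: trap into the handler at the higher level, then return via ERET.
        target = TRAP_TARGET[insn]
        if not EL1 <= self.level < target:
            raise ValueError("this model only covers calls from EL1 or above into a higher level")
        caller = self.level
        self.level = target
        try:
            return handler(caller, request)
        finally:
            self.eret(caller)
def secure_monitor(caller_level, request):
    # Hypothetical EL3 service: check permission first, then do the privileged action.
    if request not in {"enable_clock", "cpu_on"}:
        return "denied"
    return "done:" + request
```
For example, after Cpu().eret(EL2) a call("smc", secure_monitor, "cpu_on") is served
in EL3 and the CPU ends up back in EL2, while a further eret(EL3) is rejected.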

06:54.120 --> 07:00.840
The thing is, these exceptions, where less privileged software can request stuff from more privileged

07:00.840 --> 07:02.840
software,

07:02.840 --> 07:15.080
they serve, or, yeah, these exceptions, where less privileged software can request

07:15.080 --> 07:22.080
services from more privileged software, serve as a way for the more privileged software to provide

07:22.080 --> 07:28.600
services to the less privileged software with some sort of access control.

07:28.600 --> 07:36.080
But the issue with that is that when there is this kind of a contract between less and

07:36.080 --> 07:43.520
more privileged software, there has to be some sort of an ABI effectively between those two software

07:43.520 --> 07:44.520
components.

07:44.600 --> 07:45.520
And that exists.

07:45.520 --> 07:53.400
There are actually some standardized ABIs already, PSCI and SCMI, but unfortunately vendor

07:53.400 --> 08:01.360
BSPs extend them in all kinds of weird ways, or they introduce new shiny ABIs

08:01.360 --> 08:10.640
which then suddenly, in the next version, change. So we have a problem, and the problem

08:10.680 --> 08:12.640
is this, effectively.

08:12.640 --> 08:17.840
If your bootloader is providing some sort of a service to, say, your operating system

08:17.840 --> 08:24.320
kernel, or some sort of a previous-stage blob provides services to the next-stage blob,

08:24.320 --> 08:26.320
it becomes an ABI.

08:26.320 --> 08:30.320
Once you change one of these blobs and it provides a different ABI, you have an ABI

08:30.320 --> 08:33.360
break and everything falls apart. That happens.

08:33.360 --> 08:37.600
It can happen in two ways, traditionally.

08:37.600 --> 08:42.160
If the bootloader is providing an ABI and you update the kernel, and the kernel expects some

08:42.160 --> 08:47.600
sort of a new ABI which the bootloader doesn't provide,

08:47.600 --> 08:50.640
then the kernel will fail to boot most likely.

08:50.640 --> 08:55.680
If you have A/B updates, this is still recoverable, because the kernel will fail to boot, okay,

08:55.680 --> 08:59.760
the bootloader will pick the other kernel, boot the B copy, and you can recover.

08:59.760 --> 09:01.840
This is the better case.

09:01.840 --> 09:07.600
If you have the bright idea to say, okay, well, my kernel needs a new ABI,

09:07.600 --> 09:11.040
let's update the bootloader as well, and suddenly your kernel doesn't boot,

09:11.920 --> 09:17.520
then your bootloader also provides the new ABI, the B copy doesn't boot, and your system is basically

09:17.520 --> 09:18.640
unusable.

09:18.640 --> 09:20.640
That's game over.
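
NOTE
The two failure cases above can be played through in a small Python sketch.
Everything in it is an assumption made for illustration: the ABI numbers, the
slot dictionaries and the function names do not come from any real update system.
```python
# Toy model of the ABI break with A/B kernel slots; ABI numbers are made up.
def slot_boots(slot, firmware_abi):
    # A kernel comes up only if it is intact and the firmware speaks the ABI it expects.
    return slot["intact"] and slot["needs_abi"] == firmware_abi
def ab_boot(firmware_abi, slot_a, slot_b):
    # Try slot A first, fall back to slot B; None means the system is unusable.
    for name, slot in (("A", slot_a), ("B", slot_b)):
        if slot_boots(slot, firmware_abi):
            return name
    return None
old_kernel = {"needs_abi": 1, "intact": True}
new_kernel = {"needs_abi": 2, "intact": True}
bad_new_kernel = {"needs_abi": 2, "intact": False}
# Better case: only the kernel in slot A was updated, the firmware still provides ABI 1.
kernel_only_update = ab_boot(1, new_kernel, old_kernel)
# Worse case: the shared firmware was updated to ABI 2 as well and the new kernel is broken.
firmware_and_kernel_update = ab_boot(2, bad_new_kernel, old_kernel)
```
In the first case the fallback to slot B still works; in the second case neither
slot can boot, because the only firmware copy no longer matches the old kernel.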

09:21.440 --> 09:27.280
So the issue I will talk about next is how to solve this ABI problem.

09:27.680 --> 09:33.520
But before I get into it, there is still one thing which was mentioned to me while I was

09:33.520 --> 09:40.960
preparing the work and slides by a colleague and he mentioned, okay, well, you should also say

09:40.960 --> 09:47.280
that this isolation stuff and the privilege rings, they are not necessarily evil. That's true.

09:48.160 --> 09:56.800
The thing is, the higher-privilege software can set up memory isolation, it can set up

09:57.360 --> 10:02.560
IP isolation to prevent the less privileged software from accessing these parts.

10:03.680 --> 10:08.240
If the higher-privilege software is some sort of a proprietary closed-source blob and you do not know what

10:08.240 --> 10:13.760
it's doing, then that's probably a bad thing, because your less privileged, say, bootloader then

10:13.760 --> 10:20.320
cannot be used as a debug tool to analyze what the system is doing, and you cannot effectively

10:20.400 --> 10:28.080
debug the system. But if your more privileged software is something which you set up yourself,

10:28.080 --> 10:36.160
then you can set up the isolation and IP access restrictions in such a way that what you can do

10:36.160 --> 10:41.040
in the more privileged software is handle, for example, access faults, and actually have that act

10:41.040 --> 10:42.560
as a better debug tool.

10:42.880 --> 10:49.200
So, yeah, memory isolation is not necessarily evil; it can be used for very good stuff.

10:50.640 --> 10:57.440
But back to the ABI problem: the solution to the ABI problem is actually super simple, and that is,

10:57.440 --> 11:06.560
let's reorder the bootloader and blobs. The stack looks, give or take, like the one on top right now:

11:06.880 --> 11:15.440
so we power up the system, the BootROM runs, TF-A, TEE and so on potentially run in EL3, but TF-A

11:16.720 --> 11:23.120
most likely switches the exception level to EL2, then U-Boot runs in EL2 and Linux runs in EL2.

11:23.120 --> 11:30.000
What we will do is we will move the blobs, TF-A and TEE, after U-Boot, and then U-Boot

11:30.000 --> 11:35.520
runs in EL3, it will have access to everything, it can potentially even set up the memory restrictions,

11:35.520 --> 11:44.560
memory protection, if it desires to do so. We will start the TF-A BL31, which is the PSCI provider, from

11:44.560 --> 11:52.560
U-Boot, and then we will start the kernel. The benefit of this is that if we start the TF-A BL31 from

11:52.560 --> 11:59.120
U-Boot, we can start both TF-A BL31 and the kernel in lockstep, basically keep these two

12:00.960 --> 12:07.920
blobs, which have some sort of an ABI contract between them, started together, and if something

12:07.920 --> 12:15.120
fails we can take the other copy. Implementation is effectively simple; we have to do two steps.

12:15.120 --> 12:20.320
First step, we need to make sure that U-Boot doesn't depend on any services provided by the

12:20.320 --> 12:28.560
blobs, that means TF-A BL31 or maybe TEE, meaning U-Boot must not depend

12:28.560 --> 12:35.360
on either PSCI or SCMI interfaces. And in the second step, we need to teach U-Boot to boot the

12:35.360 --> 12:43.920
TF-A BL31. The first step is easy, because if you look at how the services are actually implemented,

12:43.920 --> 12:52.160
when you call some sort of a PSCI or SCMI function or exception, ultimately what the blob does

12:52.160 --> 12:58.080
internally is that it programs some sort of registers. So U-Boot can also access these registers

12:58.080 --> 13:04.320
if it runs in EL3, and in order to remove the dependency on PSCI or SCMI, all you have to do is

13:04.320 --> 13:09.360
make U-Boot poke these same registers from some sort of a U-Boot driver model driver. So, a

13:09.920 --> 13:16.640
U-Boot driver which matches that functionality, and you are effectively done.
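
NOTE
The idea that a firmware service call and a native driver end up programming the
same registers can be illustrated with a small Python sketch. The register address,
the bit layout and all names are made up; this is not U-Boot driver model code.
```python
# Toy model: a firmware service and a direct driver poke the same (made-up) register.
CLK_ENABLE_REG = 0x1000
class Soc:
    def __init__(self):
        self.regs = {CLK_ENABLE_REG: 0}
    def write32(self, addr, value):
        self.regs[addr] = value & 0xFFFFFFFF
    def read32(self, addr):
        return self.regs[addr]
def firmware_clock_enable(soc, clk_id):
    # What the blob does internally when it services a call from a lower exception level.
    soc.write32(CLK_ENABLE_REG, soc.read32(CLK_ENABLE_REG) | (1 << clk_id))
def driver_clock_enable(soc, clk_id):
    # What a native driver running in EL3 can do itself, without any firmware call.
    soc.write32(CLK_ENABLE_REG, soc.read32(CLK_ENABLE_REG) | (1 << clk_id))
via_firmware, via_driver = Soc(), Soc()
firmware_clock_enable(via_firmware, 3)
driver_clock_enable(via_driver, 3)
```
Both paths leave the hardware in the same state, which is why the dependency on
the firmware interface can be dropped once the bootloader itself runs in EL3.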

13:18.080 --> 13:22.400
But there is a catch once you remove the PSCI provider from before U-Boot:

13:25.680 --> 13:33.200
when the Linux kernel boots on ARMv8, it mandatorily depends on PSCI being available. So,

13:33.520 --> 13:41.520
if the PSCI provider is not there, the Linux kernel will fail to boot. It will just

13:41.520 --> 13:49.760
not do anything. Luckily, U-Boot can act as a PSCI provider. But the PSCI provider implementation

13:49.760 --> 13:55.680
is effectively architecture or board specific. So what can be done on the U-Boot side is

13:55.680 --> 14:02.960
enablement of some sort of a very basic, very initial, rudimentary PSCI provider, which

14:03.040 --> 14:08.480
will only provide the Linux kernel the ability to say, okay, here is a PSCI. It will not be able to

14:08.480 --> 14:13.600
start any CPU cores, it will not be able to do any system reset or anything, but it will be there.
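
NOTE
A rudimentary provider of the kind described above could look roughly like this
Python sketch. The function IDs, the NOT_SUPPORTED code and the version encoding
follow the Arm PSCI specification; the handler itself and the version it reports
(0.2) are assumptions for illustration, not U-Boot's actual implementation.
```python
# Sketch of a stub PSCI provider: it answers the version query and refuses everything else.
PSCI_VERSION = 0x84000000
PSCI_SYSTEM_RESET = 0x84000009
PSCI_CPU_ON_64 = 0xC4000003
PSCI_NOT_SUPPORTED = -1
def psci_version(major, minor):
    # PSCI_VERSION return value: major version in bits [31:16], minor in bits [15:0].
    return (major << 16) | minor
def stub_psci_handler(function_id, *args):
    # Enough for the kernel to see that a PSCI exists; no CPU bring-up, no reset.
    if function_id == PSCI_VERSION:
        return psci_version(0, 2)  # hypothetical version choice
    return PSCI_NOT_SUPPORTED
```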

14:14.320 --> 14:21.120
So in case the user has any problems, they can still at least boot the kernel if they, for example, fail

14:21.120 --> 14:27.680
to start the PSCI provider, and do some corrective action. Otherwise, at some later point,

14:28.320 --> 14:37.200
the PSCI provider will be started. So, about starting the PSCI provider, there are again two options.

14:37.200 --> 14:44.960
One of them, kind of the easy one, is to just load the TF-A BL31 into memory, disable caches, for that

14:44.960 --> 14:50.400
U-Boot has commands, and then jump into it. The only thing which you have to be careful about

14:50.400 --> 14:57.520
in that case is to make sure that the exit point where the TF-A BL31 will jump after it's done

14:57.600 --> 15:05.360
doing its job will be again the U-Boot entry point, because then the TF-A BL31 runs,

15:05.360 --> 15:14.480
drops the EL from 3 to 2 and restarts U-Boot, which then runs in EL2, and then you have

15:14.480 --> 15:20.880
U-Boot running in EL2 with PSCI finally in place and you can start Linux. The other option is to use

15:21.680 --> 15:28.960
a fitImage, which is, I believe, the more practical option for deployment, and I'll talk about that.

15:29.520 --> 15:36.160
So, fitImage is a multi-component image based on device tree. Just briefly, it's a container which

15:36.160 --> 15:43.120
basically allows you to bundle together multiple blobs, kernels, device trees, firmware, FPGA bitstreams,

15:43.120 --> 15:50.240
into a single file. It can contain all these images; it also has a configurations section which allows

15:50.320 --> 15:55.840
you to tie together different images within the fitImage and instruct U-Boot which of these

15:55.840 --> 16:02.960
images to use for booting. U-Boot is capable of booting the fitImages; OpenEmbedded, for example,

16:02.960 --> 16:11.280
is capable of generating the fitImages for you. Now, when U-Boot is booting a fitImage,

16:12.800 --> 16:18.160
of course you have to instruct it which images from the fitImage it is supposed to pick, usually

16:18.240 --> 16:25.840
done by selecting a configuration, which selects the kernel, selects the device trees, selects

16:25.840 --> 16:33.680
potentially other loadables. U-Boot relocates these images which are in the fitImage into their

16:33.680 --> 16:42.800
target memory addresses; when that's done, it runs the loadable handler for all these images, and then

16:43.760 --> 16:48.480
at the very end it boots the kernel. That's basically how it works.
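
NOTE
The boot sequence just described (select a configuration, relocate the images,
run the loadable handlers, boot the kernel last) can be sketched in Python as
below. The dictionary layout, the addresses and the function name are illustrative
assumptions and do not mirror U-Boot's real data structures.
```python
# Toy model of booting a multi-component image; layout and names are illustrative.
def boot_fit(fit, config_name, memory, handlers):
    conf = fit["configurations"][config_name]
    loadables = conf.get("loadables", [])
    log = []
    # 1. Relocate every selected image to its target load address.
    for name in [conf["kernel"], conf["fdt"]] + loadables:
        image = fit["images"][name]
        memory[image["load"]] = image["data"]
        log.append("load " + name)
    # 2. Run the handler registered for the type of each loadable, if there is one.
    for name in loadables:
        image = fit["images"][name]
        handler = handlers.get(image["type"])
        if handler is not None:
            handler(image)
            log.append("handler " + name)
    # 3. Only at the very end, boot the kernel.
    log.append("boot " + conf["kernel"])
    return log
```
A caller would pass a dictionary with an images section and a configurations
section, plus a mapping from image type to handler function.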

16:58.400 --> 17:08.080
So, the loadable handler for the TF-A BL31 unfortunately has to be board specific, because

17:08.080 --> 17:14.800
that loadable handler may have to do some additional configuration. The TF-A BL31 forks from vendors

17:14.800 --> 17:20.720
tend to have some special requirements, may require some relocation tables, hand-off tables,

17:20.720 --> 17:30.160
whatever, set up. So these TF-A BL31 loadable handlers, they unfortunately will have to be

17:30.480 --> 17:39.360
board specific. Before U-Boot jumps to the Linux kernel, there is now an extension which is called

17:39.360 --> 17:48.000
the jump prep handler. This is called just before U-Boot jumps into the Linux kernel;

17:48.560 --> 17:55.120
this can be used as a hook to finally jump into the TF-A BL31. I'll show you the implementation

17:55.200 --> 18:04.000
of this whole thing now, because it's actually rather simple, and how the TF-A BL31 support

18:04.000 --> 18:09.040
was added into U-Boot. It had three parts, basically, and it fits on three slides.

18:09.840 --> 18:19.200
First part was to extend the fitImage loader so that it would actually recognize the BL31 image type;

18:19.280 --> 18:24.160
that's literally three entries in the right places in U-Boot, that's what the patch looks like.

18:25.520 --> 18:33.040
The next part is the loadable handler. This is board specific, and it basically is executed

18:33.040 --> 18:40.080
after the TF-A BL31 loadable is placed somewhere in memory. Specifically for this platform, all it does

18:40.080 --> 18:46.400
is it checks whether the system is actually running in EL3 at all, because U-Boot might

18:46.400 --> 18:51.680
be running in EL2, and at that point you don't want to be able to start the TF-A BL31.

18:51.680 --> 18:58.480
If it is running in EL3, then it marks the load address of the blob and it stores it in some sort

18:58.480 --> 19:05.200
of a global variable. The most interesting part actually happens at the very end of this stuff,

19:06.080 --> 19:14.080
which is the jump prep handler. That one checks whether the TF-A BL31 was loaded, again whether we are in

19:14.080 --> 19:23.520
EL3 (for that, thanks, Quentin), and finally, after that, after we are through all this stuff, it

19:23.520 --> 19:33.680
sets up a board-specific hand-off table for the TF-A BL31, which instructs that one specific TF-A BL31 to

19:35.440 --> 19:42.320
actually return back into U-Boot, into this function which is called armv8_switch_to_el2.

19:42.320 --> 19:49.840
Now, why do we do that? This is actually an assembler function. We do that because we want the TF-A BL31

19:49.840 --> 19:57.520
to run, drop to EL2, and then return to U-Boot just before the assembler code in U-Boot

19:57.520 --> 20:04.800
which sets up the system for starting Linux. So essentially what happens just before the

20:04.800 --> 20:08.800
jump into Linux is that U-Boot makes this

20:08.800 --> 20:15.680
detour into the TF-A BL31, then returns into U-Boot, sets up the system for booting Linux, and

20:15.680 --> 20:27.360
then jumps into Linux. The integration of the TF-A BL31 blob into a fitImage looks like this.
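
NOTE
The two board hooks and the detour through the blob can be modelled with the
Python sketch below. It is a toy under stated assumptions: the Board class, the
hand-off dictionary and the load address are hypothetical; only the name
armv8_switch_to_el2 is taken from the talk, and here it is just a label.
```python
# Toy model of the loadable handler, the jump prep handler and the BL31 detour.
EL2, EL3 = 2, 3
class Board:
    def __init__(self, current_el):
        self.current_el = current_el
        self.bl31_addr = None  # stands in for the global variable holding the load address
        self.trace = []
    def bl31_loadable_handler(self, load_addr):
        # Runs right after the blob was placed in memory; only remember it when in EL3.
        if self.current_el != EL3:
            return
        self.bl31_addr = load_addr
    def jump_prep_handler(self):
        # Runs just before the jump into Linux; take the detour only if it makes sense.
        if self.bl31_addr is None or self.current_el != EL3:
            return
        handoff = {"return_to": "armv8_switch_to_el2", "return_el": EL2}
        self.enter_bl31(self.bl31_addr, handoff)
    def enter_bl31(self, addr, handoff):
        # Stand-in for the blob: it stays resident, drops to the requested EL and returns.
        self.trace.append("bl31@%#x" % addr)
        self.current_el = handoff["return_el"]
    def boot_linux(self):
        self.jump_prep_handler()
        # The existing path toward the kernel, which ends up in EL2 either way.
        self.trace.append("armv8_switch_to_el2")
        self.current_el = EL2
        self.trace.append("linux")
        return self.trace
```
When the board starts in EL3 and the blob was loaded, the trace shows the detour
first; when it starts in EL2, the handler ignores the blob and nothing changes.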

20:27.360 --> 20:36.960
In the fitImage source, all you have to do is add another image, this time

20:36.960 --> 20:45.760
of the TF-A BL31 type, which is the image type that was defined like three slides before, and it has to be marked

20:45.760 --> 20:50.640
as a loadable in the configuration section, so that U-Boot would load it for you, run the loadable

20:50.640 --> 20:58.400
handler, and then do the jump prep handler. And finally, I actually have kind of a demo here in

20:58.400 --> 21:04.960
the slides. What happens when you build that fitImage using mkimage and start this kind of a fitImage

21:04.960 --> 21:12.880
is this: you will see literally almost no change when you boot such a fitImage. You will see that

21:12.880 --> 21:17.760
U-Boot did start one more blob, but there will be no additional output; you will just see the

21:17.760 --> 21:26.480
Linux kernel booting, and that's it. The reason for that is because the TF-A BL31 doesn't print anything

21:26.480 --> 21:33.520
unless you enable debug output. But you will notice that things actually do work when the Linux

21:33.520 --> 21:38.800
kernel is booting, because at the very beginning it will report, oh look, I found some more advanced

21:38.800 --> 21:46.000
PSCI version, 1.1 I think. And if you look further, you will notice that the Linux kernel actually

21:46.000 --> 21:52.720
managed to bring up all the CPU cores, not just CPU 0, and all the CPU core bring-up is done

21:52.720 --> 22:02.960
by calling into the PSCI. So hey, cool, yeah, we could boot all four CPUs and it's great. And to wrap it up

22:02.960 --> 22:11.680
in some way, I have this one last slide. So what did we achieve? Basically, we achieved the ability

22:11.840 --> 22:19.120
to tie together the Linux kernel, device trees, and the TF-A BL31, which is the PSCI provider,

22:19.120 --> 22:25.920
into one single fitImage, basically into one file, which we can boot all together. This fits well

22:25.920 --> 22:32.960
with A/B updates, because then U-Boot is capable of picking this or that kernel image based on, for

22:32.960 --> 22:40.160
example, a boot counter, and if we can now pick this or that not just kernel image, but kernel image,

22:40.240 --> 22:48.320
device tree, and the TF-A BL31 PSCI provider, then we can safely update all these things together

22:49.120 --> 22:59.200
as one single file or as one single root file system image. Now, if the fitImage fails to boot, then

22:59.840 --> 23:04.400
U-Boot just boots the other fitImage, which, even if it contains an older kernel version, also

23:04.480 --> 23:12.000
contains an older TF-A BL31 with the old ABI, and this way we do not break the

23:16.480 --> 23:22.320
yeah, this way we do not end up with having any incompatible ABI between the kernel and the PSCI provider.

23:24.000 --> 23:30.320
And that's very much the point of this talk, what I wanted to show you, and this is all

23:30.400 --> 23:46.960
I have. Thank you for your attention. And do you want to get it back, and then I'll have that one.
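
NOTE
The benefit summarized above, that the fallback copy always carries a matching
kernel and PSCI provider, can be shown with one more toy Python sketch. The
bundle dictionaries and ABI numbers are made up for illustration.
```python
# Toy model of A/B selection when kernel and BL31 ship together in one bundle.
def bundle_boots(bundle):
    # The BL31 is started from the same file as the kernel, so only the bundle's own pairing matters.
    return bundle["intact"] and bundle["kernel_abi"] == bundle["bl31_abi"]
def pick_bundle(bundle_a, bundle_b):
    # Try bundle A first, then B; None would mean neither copy is usable.
    for name, bundle in (("A", bundle_a), ("B", bundle_b)):
        if bundle_boots(bundle):
            return name
    return None
old_bundle = {"kernel_abi": 1, "bl31_abi": 1, "intact": True}
broken_new_bundle = {"kernel_abi": 2, "bl31_abi": 2, "intact": False}
fallback = pick_bundle(broken_new_bundle, old_bundle)
```
Here the broken new bundle is skipped and the old bundle still boots, because
its old kernel is paired with its own old BL31.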

23:46.960 --> 23:57.360
Hey Fuzzy, this is really cool, super exciting to see, and I have two small questions. One is, the

23:57.360 --> 24:04.400
U-Boot SPL code already has support for configuring, like, the BL31 parameters and jumping

24:04.400 --> 24:13.200
to TF-A and launching U-Boot, even. And one question, so I'll actually go into that one: yes,

24:13.200 --> 24:20.960
U-Boot SPL can start TF-A BL31, that's correct, and then it can start U-Boot, yes. So my question is, if,

24:20.960 --> 24:28.880
like, I did some similar kind of hacking on this and I just moved that code into the generic,

24:28.880 --> 24:35.360
like, bootm framework, and it seemed to, like, I don't get why you have it in the board-specific code.

24:37.120 --> 24:44.080
Because the TF-A BL31 blobs, they have all kinds of vendor weirdness; that's why this is

24:44.080 --> 24:49.440
board specific, unfortunately. Okay. The hand-off structures which you have to set up to start the

24:49.440 --> 24:55.840
TF-A BL31, they are not necessarily standardized. I know there was a discussion, actually, if you

24:55.840 --> 25:03.120
look at these patches, that yes, some of it seems almost standardized, but then you run the

25:03.120 --> 25:10.000
CCL and start TF-A BL31