Hi everyone, I'm Jake Hillion. I was hoping to present this with Johannes Bechberger, but he hasn't quite made it yet, due to Deutsche Bahn delays mainly. He should hopefully be here by the end to say hello. We're here today to talk about a side project we're working on involving concurrency testing using custom Linux schedulers. I work at Meta, and I work mainly on custom Linux schedulers, so this is pretty related to what I do. Johannes is an OpenJDK developer who recently had to spend a lot of time debugging a race condition, so we hoped we could put those two things together. We've got a bit of a proof of concept today that we can show you and explain how it works: an attempt to make these bugs a little more likely to occur, which should make them easier to debug.

So, Heisenbugs. I imagine lots of us are familiar: you've got the same input, you hope for the same output, but instead you get a crash. This is not great, and it's especially not great when it happens 1 in 10,000, or 1 in 100,000, or 1 in a million invocations. As an application owner, debugging that from reports is very tricky.

We'll go through a simple example now. Very simple, because these things get complex in reality, but imagine we've got some data being produced by a producer thread and we're consuming it in a consumer thread. In our case, and in our example later on, there's an explicit expiry time on that data, which isn't what really happens in production. More likely you've got some reference to a pointer that you might clear in some other thread — all kinds of expiry reasons why that data might no longer be valid at some point in the future. It doesn't crash the vast majority of the time. The reason for this is that schedulers are pretty good. But when that interaction happens — when your machine gets a bit busy, when some processes get in the way that you weren't expecting, when the network gets slow — all of these things can add extra delay. So a large reason for these conditions is scheduling. We see this, for example, in the rr debugger, which has a chaos mode that's supposed to make these a lot more likely, but that has its own issues.

What is scheduling then? What do we actually do? In this case we're talking about CPU scheduling, one of the more common types. The problem we have: we've got many processes, likely in the order of thousands, and some number of CPUs, in the order of tens nowadays, and we need to somehow make sure all those processes make progress and share the system. The simplest way we might do this is to just schedule process A, and whenever it stops, schedule process B. Unfortunately, that really does not work.
There are classes of non-preemptive schedulers, and sometimes they make sense, but the vast majority of the time we're going to need to schedule B a bit sooner, or the issues we were talking about before will happen more often: network timeouts, all that sort of thing. So instead, as time goes on, we'll stop scheduling A for a bit. B might not be ready, so next time we might schedule A again. We'll schedule B, we'll schedule A, we'll flip back and forth, and we're doing this on the scale of many thousands of processes, with several likely ready at any point in time.

On an actual system, it might look something like this. We're not going to be able to see any of the detail on this chart, but on the left, on the y-axis, we have which CPU we're looking at. As we go across, we're looking at what's happening on that CPU: whether a process is scheduled, and the different colours are different processes. It all gets quite complex, but these sorts of charts are super interesting. This is a 6.12 Linux system, just running EEVDF, the default scheduler. We can see processes are darting around all over the place: they're coming in, they're running for a short time, some of them are long-running, they move about a bit, there's all sorts of complexity, and this is even on a pretty quiet system. When you start looking at big systems, hundreds of CPUs, all the interactions get way more complicated.

So when we look at our race condition: you're replicating this on your lovely dev machine, you've got a 32-core processor, it's nice and quiet, you don't want anything getting in the way of your testing, and the bug never happens. Ever. It's a nightmare. You know it's happening, people are reporting it, you're running the standard scheduler, and the bug never happens. You can try running stressers in the background, and that might make it a little bit more likely, but the bug still never happens.

Working on custom schedulers at Meta, I've got to work with some schedulers that are not too good, which is great. It turns out that when you write a scheduler yourself, and you add loads of configuration options, there are many ways to configure that scheduler badly. One example from a while ago: I was working on a service, I was writing a scheduler, I got the configuration terribly wrong, and the service failed — it really didn't work. There were three parts to the service; two of them hit massive timeout errors but came back to life, and one of them crashed. So 250 hosts died, all at once, because of my scheduler. It turns out this was a race condition: someone was storing the value of a shared pointer instead of copying the shared pointer, in C++, for efficiency reasons, and they got it wrong.
So if there was a long enough delay in scheduling, the service would crash. This does happen in the real application, but it's so rare that nobody would ever look at it. Or even if you did, you'd really struggle to replicate it. So: what if we wrote a scheduler that was deliberately bad? What if it was deliberately erratic and got us into these states more often, where these errors are likely to happen? That's what we've got a demo of today.

But how would you write an erratic scheduler? Well, there are some options here. You could write it in the Linux kernel. You might have a hard time doing that in general. The scheduler is very sensitive: if you get it slightly wrong, your system will hit the soft lockup detector and immediately reboot, which is a bit of a pain. It's hard to do. If you get it wrong in memory-unsafe ways, your system will crash even more quickly. And if you get it right, you're still waiting maybe tens of seconds for a kexec every time you want to change your kernel. This is a bit awkward.

Nowadays, we can do it in user space. Johannes was going to talk about Java; he's got a project that I'm not going to give enough credit to in this presentation, because I don't know enough about it, where you can write these schedulers in Java. I'm more familiar with the Rust ones. You can also do it in C, if that's what you like.

And it's all because of eBPF — that's the bee — and sched_ext, which I think we're supposed to pronounce "sched-ext"; an interesting choice of logo we've got there. And then there's the additional logo on the right: this is a photo of Brendan Gregg, supposedly shouting at hard drives. There's a quote about putting JavaScript into the Linux kernel here — there are many similarities between eBPF, the way it runs in the kernel, and a virtual machine for JavaScript that you might have in your browser. Here's a photo of him looking slightly more normal; I think he'd prefer that one.

eBPF — we're not going to go into it deeply, it's not super important how it works, but there are a few details we need to cover just for understanding. When you develop an eBPF program, you write your source code in some language. There are a few options: C is the standard one, Rust works reasonably well, there are some sort-of academic languages you can choose as well, and then there's the Java transpiler that Johannes has got, which is quite exciting. You compile that into BPF bytecode. It's like assembly, but it's its own language that works on effectively all Linux systems. We make a syscall, bpf(), to ask the kernel to load our program. You need a lot of privilege for this; it's a root-only operation. I think it was sort of available to unprivileged users for a while, but now it's all root.
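To make that pipeline a bit more concrete, here is a minimal, hypothetical eBPF program of the kind being described: C source that you would compile to BPF bytecode with clang's BPF target and then load into the kernel as root, typically through libbpf or bpftool rather than calling the bpf() syscall by hand. The file name and the choice of tracepoint are illustrative only.

```c
// minimal.bpf.c -- illustrative sketch only. Build with something like:
//   clang -O2 -g -target bpf -c minimal.bpf.c -o minimal.bpf.o
// and load it (as root) with e.g. `bpftool prog load` or a libbpf skeleton.
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

char LICENSE[] SEC("license") = "GPL";

/* Attach to the sched:sched_switch tracepoint and log each context switch. */
SEC("tracepoint/sched/sched_switch")
int log_switch(void *ctx)
{
	bpf_printk("context switch on cpu %u", bpf_get_smp_processor_id());
	return 0;
}
```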
Then it goes through the verifier. The verifier is a magic black box that's supposed to make sure your program is safe in certain ways. You can't access memory in a bad way that would cause your system to crash. You can't have unbounded loops that don't terminate, because we're running this in the scheduler hot path; if you have non-terminating code there, you're in trouble. Stuff like that. It's a bit of a beast to work with. But once you're verified, you get JIT-compiled and loaded into the kernel. You can hook sockets, network interfaces, and now scheduling. You're loaded as an x86 program, or Arm, whichever system you're on; there's basically no further runtime attached. Then you communicate with it, mostly using syscalls at the minute. We're getting some new stuff called arenas, which are more like mapped memory, and you can communicate back to user space. So we can write an application that spans user space and kernel space, which is pretty cool. The general way we write our production schedulers is that we write some Rust that talks to the BPF, and the BPF runs in the kernel and makes the quick scheduling decisions.

So, that's BPF. How do we use it for scheduling? This landed recently — I mentioned that chart was a kernel 6.12 system, and we're now on 6.13, so it's pretty recent. sched_ext is the extensible scheduler framework for jumping in as a scheduler from BPF. This is from the creator. A few key features: I mentioned some of the trouble of working in the kernel before, and the idea is that sched_ext makes those things better. It's by no means perfect, but it certainly makes them better. Ease of experimentation: we have a repo with in the order of ten schedulers, now maybe a few more. The Linux kernel has two-ish schedulers, and even then the old one had to be ripped out to make way for the new one. So we don't have a lot of optionality in the kernel, but you can run many different SCX schedulers on your machine, switching between them just by running a program and pressing Ctrl-C. It's super easy. Customisation, too: we can talk to user space, and you can do basically anything you want in these schedulers. Sure, some of it has to stay off the hot path, and you've got to communicate with user space a little bit — it turns out that's not as bad as you might think — but you can make loads of choices. You can use information from nvidia-smi, which the Linux kernel is never going to do, and stuff like that. And finally, rapid scheduler deployments. Deploying a new kernel at scale is tricky: we have to get it to millions of machines, and it takes in the order of weeks to get that kernel out. Deploying a new scheduler can take a day. It's really easy, and running it and stopping it is also easy; you don't have to reboot.
So if we find out weeks later that our scheduler is kind of bad, we can just stop it, we go back to the default, and everyone's safe. We don't have to worry about how badly we've broken all the systems.

For an SCX scheduler, maybe don't worry too much about the details, but there are a few bits we do have to worry about. On each CPU we have a local FIFO queue — it's just first in, first out — and that's effectively read by the kernel. If you've put stuff in that queue, the kernel side of SCX will make sure it gets run on that CPU, in that order, which is quite convenient. In SCX we generally have global queues as well when we write our own schedulers. In this picture we've got one; you can have a dozen, you can have as many as you like, and those queues can mean different things. In some schedulers we might have a different queue per LLC; in some we have a different queue per how much we want to prioritise the workload; various things like this. The job of the scheduler we write in SCX is to move things from the global queues into the local queues: accept new processes, make decisions about them, and let them run in the order we like.

Let's view a super simple scheduler on the BPF side of this framework. The first step — well, the first step is to license everything as GPL. That's an absolute requirement with BPF, which is pretty cool. After the license, there's this constant for a shared DSQ ID, nice and easy. Then we create a shared DSQ, which we need to be able to handle tasks in a more uniform way — handling it per CPU would end up with separate scheduling issues. So we create that DSQ, and now we've got our queue, and that's it. Next one: enqueue. This happens when you receive a task that is now runnable. You've got a task; ideally you want to put it on a CPU, but if you can't put it on a CPU, we're going to enqueue it. In this case we're using another kfunc, scx_bpf_dispatch. We take our task, we put it in our shared DSQ, we say that next time it runs it can have up to five milliseconds, and then we just pass through the enqueue flags. Pretty simple too, so far. And the final one is dispatch. This is what's called when a CPU goes idle. Your CPU has finished doing whatever it was doing and it doesn't know what to run next, because its little queue is empty. So we just call scx_bpf_consume on the shared queue, which takes a task from the shared queue and runs it on that CPU for us. That's it — that's a whole scheduler.
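For reference, here is a rough sketch of what that BPF side can look like, modelled on the scx_simple example that ships with sched_ext. The kfunc and macro names (scx_bpf_create_dsq, scx_bpf_dispatch, scx_bpf_consume, SCX_OPS_DEFINE) are the ones from the kernel 6.12-era headers and have been renamed in newer trees, so treat this as an illustration rather than something to copy verbatim.

```c
// minimal_global.bpf.c -- illustrative only, modelled on the scx_simple
// scheduler from the sched_ext tools; kfunc names as of ~kernel 6.12.
#include <scx/common.bpf.h>

char _license[] SEC("license") = "GPL";   /* GPL is mandatory for BPF schedulers */

#define SHARED_DSQ   0                    /* our single global dispatch queue */
#define SLICE_NS     (5 * 1000 * 1000ull) /* 5 ms per run, as in the talk */

/* Create the shared DSQ when the scheduler is attached. */
s32 BPF_STRUCT_OPS_SLEEPABLE(minimal_init)
{
	return scx_bpf_create_dsq(SHARED_DSQ, -1);
}

/* A task became runnable but couldn't go straight onto a CPU: queue it. */
void BPF_STRUCT_OPS(minimal_enqueue, struct task_struct *p, u64 enq_flags)
{
	scx_bpf_dispatch(p, SHARED_DSQ, SLICE_NS, enq_flags);
}

/* A CPU has run out of work: pull the next task from the shared DSQ. */
void BPF_STRUCT_OPS(minimal_dispatch, s32 cpu, struct task_struct *prev)
{
	scx_bpf_consume(SHARED_DSQ);
}

SCX_OPS_DEFINE(minimal_ops,
	       .init     = (void *)minimal_init,
	       .enqueue  = (void *)minimal_enqueue,
	       .dispatch = (void *)minimal_dispatch,
	       .name     = "minimal");
```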
It's not a very good scheduler. We're using FIFOs everywhere, there's no priority for any process, everything is completely equivalent, which it turns out doesn't work very well. There's also only one global queue, so on any sort of complicated CPU topology it will really struggle. If you've got two sockets on certain Intel machines, this will kill the machine, because the cross-socket communication is so slow that if you try to run this scheduler, you hit the soft lockup detector before the scheduler can get kicked out. It's normally very safe: normally, if you don't schedule stuff, the scheduler just gets kicked out and you go back to normal. But those Intel machines are so slow that it can't actually get kicked out, because that bit of kernel code can't run in time. So that's quite interesting. But in the general case you're pretty safe: this will run, and then you can extend it as you like.

Producing erratic scheduling orders — that's what this was all about. How can we make our race condition fire more often? Let's go, first of all, to the example. We have an example here written, I believe, in Java again. It's a super simple thing to crash: we just consume things from a queue that are only valid for a certain amount of time. It's missing a little bit of the code here, and I won't find it now, because you always need a bit of plumbing to make these things work. But effectively, a task comes from this producer thread, we've set a time on it that's just a limit, and if we try to read it beyond that, we're going to crash; and then we just keep reading. On a quiet system this is fine. This will run for days at a time and never crash. Even on a busy system we haven't yet seen a crash, though it can theoretically happen. We had to keep these examples quite simple to make them fit.
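The actual demo is in Java and isn't reproduced here; the following is a hypothetical C sketch of the same shape of program — a producer hands over items stamped with an expiry time, and the consumer aborts if it ever reads one past its deadline. On an idle machine the consumer keeps up easily; under a scheduler that parks either thread for long stretches, the assertion eventually fires.

```c
/* expiry_race.c -- hypothetical sketch, not the talk's Java demo.
 * Build: cc -O2 -pthread expiry_race.c -o expiry_race */
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define VALID_FOR_NS (2 * 1000 * 1000ull)   /* each item is valid for 2 ms */

struct item {
	uint64_t expires_ns;                /* absolute deadline for consuming it */
};

static struct item *slot;                   /* single-slot "queue" */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static uint64_t now_ns(void)
{
	struct timespec ts;
	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

static void *producer(void *arg)
{
	(void)arg;
	for (;;) {
		struct item *it = malloc(sizeof(*it));
		it->expires_ns = now_ns() + VALID_FOR_NS;

		pthread_mutex_lock(&lock);
		free(slot);                 /* drop anything the consumer missed */
		slot = it;
		pthread_mutex_unlock(&lock);

		usleep(1000);               /* produce roughly every millisecond */
	}
	return NULL;
}

static void *consumer(void *arg)
{
	(void)arg;
	for (;;) {
		pthread_mutex_lock(&lock);
		if (slot) {
			/* If the scheduler delayed us past the expiry, crash here. */
			assert(now_ns() < slot->expires_ns && "consumed an expired item");
			free(slot);
			slot = NULL;
		}
		pthread_mutex_unlock(&lock);
	}
	return NULL;
}

int main(void)
{
	pthread_t p, c;
	pthread_create(&p, NULL, producer, NULL);
	pthread_create(&c, NULL, consumer, NULL);
	pthread_join(p, NULL);              /* never returns; run until it crashes */
	pthread_join(c, NULL);
	return 0;
}
```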
Basically, I've got a video to show you from Johannes that I'm going to have to talk over, I believe. Thank you. Okay, so we've started our scheduler: we've got a shell script that just launches our scheduler with the correct arguments, and another script that runs the queue example. Here's our sample program, the Java one, and we're also getting some extra verbosity out of it: every time we make a scheduling decision, we print it here. The way we've set this up, it's going to sleep things for way longer than it needs to. We'll take runnable tasks that would get a CPU immediately on a normal scheduler and not schedule them for whatever amount of time this is showing. So that's between roughly half a second and a second and a half of not scheduling tasks that could be scheduled. Then when we do run one, we run it for 80 milliseconds or so, and it normally finishes. And we saw it crash, which is pretty good. This program, again — we've left it running for hours at a time on these machines and it doesn't crash. It just never hits these edge cases. But those edge cases are there, and they can be hit, and if you scaled this sufficiently — if that thread with the random delay was actually a network request, and your system got slow, and things were queueing up — this would crash. So there are a lot of situations where this could happen, and if someone reported it, you wouldn't be able to debug it comfortably on your machine. And that's it, we've got a crash. I think we've got a few minutes where I could briefly scroll through the code — there isn't too much of it, surprisingly — but before I do that, have we got any questions?

So the question was: are we running this in CI, or just locally at the minute? At this point it's very local. This scheduler is quite constrained, it only works on small machines at the minute, and it uses a lot of those simple queues. It's very new, it's very early. What we were excited about was seeing whether we could make it crash, and we can. The next steps would be to productionise it a bit more and get it able to run on a big service. That example I mentioned earlier — the machine it runs on is one of the new AMD chips, so it's got loads of LLCs, it's all a bit complicated. If we try to run this scheduler on there, it just doesn't work; it gets kicked out. The machine survives, but it doesn't work. So we need a more complex hierarchy in the scheduler, and then to inject the randomness, and we also need some seeding and bits like that to try to get it more consistent, and probably a bit of searching to find the right conditions to make it crash. So this is still very early. We'd be happy about contributions, too.

Have we tried it on an Arm machine? No, but it does work: sched_ext in general we have tried on Arm machines, and it works fine, there's nothing super worrying about it, which is great, because we've got a lot of Arm to cover at the minute.

[Audience question, partly inaudible: can you, from the scheduler, look at a process's memory to see what state it's in before deciding whether to schedule it on a CPU?] So the question was about looking at the memory of the processes, and what more information we can use from them to make a scheduling decision. That's super interesting. We haven't looked at that yet. The filtering we can do at the minute is based on parent PIDs, effectively. The way we're running the scheduler is that we schedule the whole machine, but we only care about messing up this specific process, because otherwise we'll start finding race conditions in the shell and we'll be in trouble. The reason we do it like that is that it's just easier.
We use the parent PID at the minute, and you can filter on other things: in other schedulers we filter on things like comm, the process name, thread groups, all these things. But I've never seen the option of actually looking at process memory; that's super interesting. We are doing some stuff where the application can tell the scheduler what it wants in a more fine-grained way. Currently we just use niceness, which is a bit weak, it's not very rich. So we're doing more communication from the process to the scheduler in our production schedulers, and I think that would be possible here too. We're running as BPF with root privileges; you can kind of do what you want, which is cool.

That's an excellent question. The question was about reproducing the crashes, and how we can make that happen. The answer at the minute is no, we don't have that. It's a lot of the goal of this project, but when we were working through it, we started looking at how we could build this scheduler to get rid of a huge amount of the non-determinism in the process. The way we saw it, we were going to slow things down too much if we tried to remove too much of the non-determinism, because we want to be in a position, long term, where we can run this on production applications without slowing them down to the point where they stop serving traffic. And that meant we made some compromises. The main one is that we put things onto cores pretty quickly now, when the original plan was to go one thread at a time, but that just isn't scalable. So there's definitely some work to be done on getting this seeded and making it more reproducible.

[Audience question, partly inaudible, about at which points the scheduler can decide to preempt a running process.] Yes — so the question, effectively summarised, was about when we can preempt things. Any time, which is good; we have full control over it, basically. We can cut the slices down, which helps, but we also have a kfunc called scx_bpf_kick_cpu that kicks a CPU quickly, which is pretty cool. How we'd integrate that is a different question; we haven't done it yet. We're purely working with slices at the minute, so that does — sorry — that does get preemption to kick in and stop the process, and we will get these interleavings eventually. But if you could look at the memory, see a bit that's flipped, and then kick the CPU — we have that option, which would be very exciting in the future.
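As a rough illustration of the "cut the slices down" part of that answer, here is a hypothetical variation on the minimal scheduler sketched earlier: every enqueue picks a short, randomised slice, and occasionally preempts a CPU with scx_bpf_kick_cpu. This is not the scheduler used in the demo (the half-second delays there need more machinery than this); it only shows where that kind of randomness can be hooked in, and the kfunc names again follow the 6.12-era headers.

```c
// chaos_slice.bpf.c -- illustrative only; a jittered variant of the earlier
// minimal scheduler, not the scheduler demoed in the talk.
#include <scx/common.bpf.h>

char _license[] SEC("license") = "GPL";

#define SHARED_DSQ    0
#define MIN_SLICE_NS  (1 * 1000 * 1000ull)   /* at least 1 ms per run */
#define MAX_JITTER_NS (4 * 1000 * 1000ull)   /* plus up to 4 ms of jitter */

s32 BPF_STRUCT_OPS_SLEEPABLE(chaos_init)
{
	return scx_bpf_create_dsq(SHARED_DSQ, -1);
}

void BPF_STRUCT_OPS(chaos_enqueue, struct task_struct *p, u64 enq_flags)
{
	/* Give every task a different, short slice so the run order keeps shifting. */
	u64 slice = MIN_SLICE_NS + bpf_get_prandom_u32() % MAX_JITTER_NS;

	scx_bpf_dispatch(p, SHARED_DSQ, slice, enq_flags);

	/* Roughly one enqueue in sixteen, preempt CPU 0 (an arbitrary choice here)
	 * so whatever is running gets interrupted sooner than its slice. */
	if ((bpf_get_prandom_u32() & 0xf) == 0)
		scx_bpf_kick_cpu(0, SCX_KICK_PREEMPT);
}

void BPF_STRUCT_OPS(chaos_dispatch, s32 cpu, struct task_struct *prev)
{
	scx_bpf_consume(SHARED_DSQ);
}

SCX_OPS_DEFINE(chaos_ops,
	       .init     = (void *)chaos_init,
	       .enqueue  = (void *)chaos_enqueue,
	       .dispatch = (void *)chaos_dispatch,
	       .name     = "chaos_slice");
```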
I'm hoping that we've opened up sched_ext and BPF for the world of testing, and that everyone will have great ideas now — because I'm a scheduler developer, and Johannes is an OpenJDK developer who likes schedulers, so it would be really cool for other people to see that schedulers are available to them, and not completely impossible to write now, and to use that in testing more widely.

[Audience question:] Can you prevent this soft lockup from killing Linux if, let's say, you want to explore all possible schedules that can occur in the system, and then it crashes? So there are two parts to that. We have two of these lockup detectors; I glossed over it earlier. sched_ext itself: if you give it a task that's runnable and you wait more than 30 seconds without running it, the SCX scheduler will get kicked out, and all those tasks move back to the fair scheduler in the kernel. There's also the soft lockup detector, which happens a bit later — I'm not super sure on the details, I just know we hit it — if the machine isn't making reasonable progress. It's not much later, I think maybe 40 or 45 seconds, and then it just reboots the machine. For that one, I would say turning it off probably isn't super productive, because if you were to hit it under the normal scheduler, without your custom one, the machine would do the same thing. The SCX one we haven't needed to turn off, because 30 seconds is such a long time. Technically, if we were making network requests they could take longer to come back than that, but for the vast majority of bugs 30 seconds should be plenty. If you do want to change it, there's a number in the kernel and you can always recompile to make it longer, but be careful: there are several systems that kick in to make things stop there.

Yeah, it's a good question. The question was about more erratic behaviour, beyond just scheduling timings. The short answer is no, basically. There's stuff we're interested in, like how memory latency changes on systems as they get more loaded. We haven't done any work to emulate things like that, and it's not easy to do with an SCX scheduler. There are ways to do it: you can kind of force things to mess up their caches more often with scheduling decisions, and introduce extra processes that do that too. But I think those races are a lot finer-grained, and we haven't started looking at that yet. That's great. Thank you very much.