WEBVTT 00:00.000 --> 00:13.840 So the next talk, my question is about the next talk by illadruse us about making the 00:13.840 --> 00:16.360 C-O-Mex con and so on. 00:16.360 --> 00:18.360 So, thanks a lot. 00:18.360 --> 00:20.360 Hello. 00:20.360 --> 00:22.760 Thank you for the introduction. 00:22.760 --> 00:33.320 Today I will be talking about what happens if you decide that working with your data structures 00:33.320 --> 00:40.440 in a convenient way is simply to boring. 00:40.440 --> 00:48.480 The agenda of this talk is, first one, I will talk about what was actually the trigger 00:48.480 --> 00:50.920 to study all these. 00:50.920 --> 00:59.600 And then we will briefly talk about the land of the lock free because all of this is 00:59.600 --> 01:03.280 actually inspired in the land of the lock free. 01:03.280 --> 01:05.360 And then I will outline the solution. 01:05.360 --> 01:14.040 If the time permits, I will also out mention news in the C-O-Mex C-O-Mex kernel since 01:14.040 --> 01:16.120 the last was them. 01:16.120 --> 01:17.120 Okay. 01:17.120 --> 01:18.920 So let's start. 01:18.920 --> 01:23.880 What was the problem which made me to think about this? 01:23.880 --> 01:32.120 The C-O-Mex kernel does its best to make life inside the interrupt service handler as hard 01:32.120 --> 01:33.120 as possible. 01:33.120 --> 01:39.480 The reason for this is that we run on very limited hardware where there is a rule that interrupt 01:39.480 --> 01:44.440 service handler has to run in kernel mode or in supervisor mode. 01:44.440 --> 01:50.520 So it essentially has kernel privileges and it can do a lot of bad stuff. 01:50.520 --> 01:56.840 So we only allow two calls to kernel so you are very motivated to defer your work to 01:56.840 --> 02:07.720 into user space, but this by this, you still can buy these two calls modify pretty much 02:07.720 --> 02:11.240 every data structure that is there. 02:11.240 --> 02:21.320 And while it is done from interrupt context, you have to somehow guard how to eliminate 02:21.320 --> 02:29.000 concurrent modification so still things remain consistent. 02:29.000 --> 02:33.920 You can't use mutex's because if you log mutex and you get interrupt, it's incentive that 02:33.920 --> 02:34.920 lock. 02:34.920 --> 02:38.480 What kernels do is they disable interrupts? 02:39.440 --> 02:47.360 As I said, it's safe, it's good, but it's boring and it introduces a lot of latency and 02:47.360 --> 02:56.040 also jitter if the thing you do with interrupt disabled varies in time. 02:56.040 --> 03:05.160 So I've been thinking how to avoid this and I did some investigation. 03:05.400 --> 03:14.680 I investigated how other kernels do it and I realized that many RTOs simply lock disabled 03:14.680 --> 03:17.480 interrupts and call it a day. 03:17.480 --> 03:24.920 Then I have a couple of friends and we have board game evenings with them and at one of 03:24.920 --> 03:33.160 these board game evenings, I mentioned this problem, the constraints and what I want to achieve 03:33.160 --> 03:40.680 and one of them told me, you know, this sounds a lot like you need a lock free approach. 03:40.680 --> 03:46.440 Okay, so I started studying lock free algorithms. 03:46.440 --> 03:53.000 I already knew some very basics of lock free and I guess you also know, but for example, 03:53.000 --> 04:02.840 the case of single producer, single consumer queue, that thing can run without mutex's. 04:04.040 --> 04:10.120 And it's inherently third safe because it's a lock free data structure in this one specific case. 04:12.200 --> 04:13.880 Unfortunately, it cannot be used here. 04:14.840 --> 04:21.960 So I've been searching for some other solution and while it was right around the time when LLMs 04:21.960 --> 04:31.000 became very popular, I asked GPP. Okay, show me some experience which satisfies these conditions. 04:31.160 --> 04:34.920 And it came bit roughly the similar, very similar code to these. 04:36.520 --> 04:44.840 And what this code essentially does is that it creates a local copy of shared data under mutex. 04:45.560 --> 04:52.920 But this is a one section, basically. Then doesn't modification on the local copy and then 04:52.920 --> 05:01.240 commits it into the shared data, which satisfies all my criteria, except with one. 05:01.240 --> 05:09.560 It has a lot of overhead. And it also looks quite a lot like database transaction. 05:11.000 --> 05:21.720 Because you have read, modify, and commit stage. Therefore, I decided I will use this concept 05:21.720 --> 05:31.240 general, but I will make it lightweight. Now, a bit about the lock free programming, what it is actually. 05:32.280 --> 05:36.440 A lot of people think that if you have lock free programming, it means that you are completely 05:37.400 --> 05:44.120 free of any locking. And nothing in the system ever gets blocked. Well, there's no true. 05:45.080 --> 05:52.760 The definition of lock free is that at least one execution context does not block. 05:53.480 --> 05:59.160 And here in the specific case, when we are talking kernel versus interrupt, 06:00.200 --> 06:05.000 this context which you don't want to interrupt SD interrupt service handler. 06:06.040 --> 06:11.480 So what we want to achieve here is essentially a fast line for interrupts, 06:12.440 --> 06:22.200 which kind of allows us to cheat. If you find this guarantee that you only have one context, 06:22.200 --> 06:28.840 which has fast line, to boring, then for you, there's a way to free programming. We actually 06:28.840 --> 06:37.080 does the part that nobody uses locks, and nobody also gets blocked. But that's a lot more complicated 06:37.720 --> 06:44.360 and you get a lot more overhead. Which is actually the reason why not everything is lock free, 06:44.360 --> 06:52.120 because lock free comes with a lot of baseline overhead, which kind of gets paid for 06:52.120 --> 06:58.120 if you do massive concurrency, where the overhead of locking is much higher than baseline overhead 06:58.120 --> 07:05.080 of lock free algorithms. Now, when we know roughly what the lock free is, 07:06.040 --> 07:13.560 and we know that the baseline idea was something like transaction, let's look to 07:15.400 --> 07:22.840 how the transacted part of kernel looks like compared to standard critical section. 07:22.840 --> 07:30.200 The thing on the left is a critical section, modeled by mutexes. So at the beginning of 07:30.920 --> 07:37.400 the access of shared data, you log a mutex, or internal, it could be 07:38.840 --> 07:46.520 you disable interrupts. Then you find what you want to modify, and then you modify it. 07:46.520 --> 07:54.040 And once you are done, you reenable interrupts. What transaction this does is that at the 07:54.120 --> 08:01.720 beginning, you open the transaction. This is a read on the action. Then you find the entry 08:01.720 --> 08:11.160 which you want to modify. This is not an exclusive action. You can be raised by someone else, 08:11.160 --> 08:16.200 both opening transaction and accessing the data. Then you try to commit this transaction, 08:16.760 --> 08:22.680 and you specify that you are modifying the data, that you are going to modify the data. 08:23.320 --> 08:31.240 And this transaction may either pass or fail. If it passes, then you get exclusive access 08:31.960 --> 08:40.120 to this data, and you can modify it. If it fails, you shall not modify it because you will 08:40.120 --> 08:46.120 break the contract. And then once you are done modifying it, you call the x and then 08:46.440 --> 08:53.240 you signalize, you are done modifying the data. So the difference is that the critical 08:53.240 --> 09:01.560 section is runs executably the whole time. Transaction only runs exclusively for the time you 09:01.560 --> 09:09.720 are modifying the entry. It gets even more interesting if we consider the read only case, 09:09.720 --> 09:23.080 because this is a read wide case. So, if you read some non atomic data from shared data 09:23.080 --> 09:30.520 storage in concurrent system, you also use critical section to get the data in consistent state. 09:30.520 --> 09:42.920 With transactions, you perform the wall data access in a shared state, so you can be raised. 09:43.880 --> 09:48.680 And you can be raised even by someone who will modify the data. And the question is, okay, 09:50.440 --> 09:56.600 why would you do that if you have synchronization primitive, which doesn't actually guarantee 09:56.920 --> 10:05.480 consistency of the data? Well, it doesn't guarantee you that nobody will modify the data 10:05.480 --> 10:14.920 structure, but at least if you survive it, it will tell you that it was modified. And the 10:14.920 --> 10:22.360 beginning I mentioned that transaction start is a read only operation. And this read only 10:22.360 --> 10:28.920 transaction has a nice feature that as a wall, it is a read only operation, including the 10:28.920 --> 10:38.600 overhead of transactions up system. And that means that readers don't block each other, readers don't 10:38.600 --> 10:46.200 block writers. And if you write something which someone else is trying to read in between, 10:46.280 --> 10:57.240 readers will know that the data has been modified. And also there's if, which means that you can choose 10:57.240 --> 11:13.400 how you react to the event that you've been raised by someone else. Okay, to detail a bit about 11:13.480 --> 11:21.160 the transactions. What do you usually know as a transaction? Is an acid compliance transaction 11:21.160 --> 11:31.000 from the database systems? These transactions provide a serialization isolation level. 11:32.120 --> 11:40.440 So you can essentially take a list of transactions, replay them in the same order, and you get the same 11:40.440 --> 11:48.200 state each time. Which is actually used by the database systems to do Keter Schroek Recovery. 11:49.400 --> 11:59.800 And this level of isolation is nice for a database, but a bit of an overkill here, 11:59.800 --> 12:07.960 because we would have to do the coppy from shared to local modify and then commit. And it will 12:08.040 --> 12:14.600 also have to provide some temporary storage. And for microcorner running on micro control, 12:14.600 --> 12:20.360 that's not really viable. And I believe it's not really viable for any microcorner anywhere, 12:20.360 --> 12:27.080 because that's really high level of overhead. But there are other over separation levels. 12:29.800 --> 12:35.560 And one of them is so called read committed. In this separation level, 12:36.440 --> 12:42.200 all readers only see the data, which was committed by someone else. 12:44.040 --> 12:52.840 And it turns out that this is just enough separation level, you need to live in kernel 12:52.840 --> 12:58.440 in this weird partially synchronized or partially consistent state. 12:59.160 --> 13:07.880 And therefore, if you run unicore, this framework offers this separation level, 13:08.520 --> 13:18.440 which means that the data structures, as a wall, if you take some hash table or let's say 13:18.840 --> 13:31.160 scheduling table, they may develop some unusual properties, like you have two threats running 13:31.160 --> 13:38.680 on the same core at the same time. But the entry for individual threat is consistent. 13:38.680 --> 13:50.200 Then, as I mentioned, the fact that read-only transactions are purely read-only actions as a 13:50.200 --> 13:59.560 wall, they don't block each other. And you only pay for the overhead of transactions once you get 13:59.560 --> 14:11.000 content shown. And for the just for sake of terminology, if a transaction 14:13.400 --> 14:19.560 if you commit a read-bright transaction, it will invalidate any other transaction which is 14:19.560 --> 14:27.720 implied. So it was opened, but it wasn't committed yet. And when transaction is invalidated, 14:27.800 --> 14:38.120 it will fail to commit. Read-only transactions don't invalidate anything, which is the point of 14:38.120 --> 14:47.640 non-booking each other. So, now, it isn't all roses, of course. 14:47.640 --> 15:03.480 While this is all based on walk-free, we are essentially trading walking for iteration. So when 15:03.480 --> 15:13.080 you get content, the thing you probably want to do is retry. But there are cases, as I mentioned, 15:13.160 --> 15:20.040 you are free to decide what to do. And there is one case. It is actually the first case where 15:20.040 --> 15:24.360 we implemented this, and it's scheduler. Because it may happen that you have normal 15:26.520 --> 15:32.200 scheduler run. So we are deciding the next threat. And then interrupts comes, we change this 15:34.760 --> 15:42.200 state of threat. And it fires another scheduler run, and they run concurrent, one on top of 15:42.200 --> 15:50.360 that another. And in this case, what happens is that the scheduler from interrupt changes state of 15:50.360 --> 15:58.920 threat table, decides new threat, set seat, and returns. Then kernel finishes, and his transaction 15:58.920 --> 16:09.240 for scheduler fails. So what does it do? It gives up. Because someone else already did scheduling for it. 16:12.680 --> 16:21.880 That's a nice feature. For the interesting properties, it means that your algorithms 16:21.880 --> 16:34.760 have to be resilient. You cannot assume that there will be only one max value in something which 16:34.760 --> 16:42.920 has unique values. Which is still good if you are running a unit processor, because there's 16:42.920 --> 16:50.200 this read committed separation. So like internals of the data entry itself will be consistent. 16:50.200 --> 16:57.160 Because whatever race to you did the change automatically because interrupts were disabled. 16:57.880 --> 17:07.320 But it gets very funny when you switch to a multi core because there this assumption doesn't 17:07.320 --> 17:14.680 hold any more. While interrupts are disabled on one core, someone else may read on another core. 17:15.880 --> 17:22.360 But you also need to another problem. It's actually next slide. 17:22.360 --> 17:35.800 That, as it is implemented right now, it only works in Unicorn setups. Because currently, 17:37.240 --> 17:41.560 we commit the transaction at the beginning when you call an transaction commit. 17:43.000 --> 17:51.320 And if you were running multi core, there would be an ambiguity that if anyone opens 17:51.320 --> 17:55.640 transaction while you are inside commit, which is technically possible. 17:58.040 --> 18:06.440 You couldn't say if you have to abort this transaction or not, as an effect of the transaction 18:06.440 --> 18:10.920 which is currently being committed. And there are two possible solutions to this. 18:11.480 --> 18:15.240 Synchronize opening transactions, starting transactions with commit. 18:15.880 --> 18:23.160 Well, it will remove the feature that your read only transactions are read only operations. 18:25.320 --> 18:29.480 But it's the safer way because you get this synchronization and it's clear. 18:30.120 --> 18:40.440 And another option is that we will mark the transaction as committed or the internals synchronization 18:40.440 --> 18:46.120 point will be marked at the end of commit, at the point when you call transaction done, 18:46.920 --> 18:53.320 which also provides the synchronization, but you don't need to use mute access. 18:54.360 --> 18:56.600 So this is the next update which will be done. 19:02.920 --> 19:08.520 Maybe why do you consider transactions? Maybe you think that this must be like 19:08.600 --> 19:13.960 crazy complicated framework. Actually, it's like 95 lines including commands. 19:15.640 --> 19:24.120 And the memory overhead is like you need one field for to remember which was the last committed 19:24.120 --> 19:30.680 transaction ID. It's really trivial. And as long as your data structures and your algorithms 19:30.680 --> 19:39.400 can survive this violent, inconsistent, internalized, internal state, you may benefit from it. 19:42.600 --> 19:47.640 I don't know what will be the general performance of this for two reasons. 19:48.200 --> 19:57.000 Some external case is kind of specific. Our kernel is designed that many actions which we 19:57.080 --> 20:05.080 transacted have quite a long phase where they are finding the entry to modify. Like we are 20:05.080 --> 20:13.320 searching, okay, you want to change the thread state. So search for a thread, take some time. 20:13.320 --> 20:19.160 It's not a one operation and then you change one flag which is one operation. It's a one. 20:19.880 --> 20:31.240 Your mileage may vary if you use different code organization. And then, which we don't know yet, 20:31.240 --> 20:40.040 also, is how funny it will become when we go multi-core. Because there's a lot of unknowns, 20:42.200 --> 20:47.720 which I cannot predict what will happen, what synchronization problems will pop up. 20:49.240 --> 21:00.040 Okay, and we have some time. So let me to list out some news in the kernel, 21:00.680 --> 21:08.120 kernel itself, got support for FPU, which is nice because we can use more performance. 21:08.120 --> 21:13.800 New Texas were moved out completely into user space. For the happy path, 21:13.880 --> 21:22.680 the thing you are watching right now is actually running inside CMRX. So it's not the presentation. 21:22.680 --> 21:28.360 It's an actual application, running on an actual operating system, running on top of Linux. 21:28.360 --> 21:35.720 It's an automation. We are using Linux as a hardware platform. Another platform which was 21:35.720 --> 21:44.200 added is Rmv8 and community slowly growing, with more ports coming, namely risk 5 port, 21:45.160 --> 21:53.720 which is contributed by another guy. So thank you. Are there any questions? 21:53.800 --> 21:57.800 Questions? 22:01.800 --> 22:08.600 Correctively, you introduce an intentional race possibility in the transaction period, and I'm just wondering, 22:08.600 --> 22:14.120 so isn't that technically undefined behavior, and does that product problems in practice, or was? 22:14.120 --> 22:21.400 Uh, depends on what you ask. If you ask something from academic sphere, they will tell you yes. 22:22.360 --> 22:32.280 Uh, if you ask someone from the field, they will tell you yes, but if you survive it, 22:32.840 --> 22:39.720 you will be, you will be told that it was an, and you, you went through the field of undefined 22:40.600 --> 22:47.080 circumstances, so you should probably throw the result away. And it's up to you to decide what you 22:47.160 --> 22:55.320 actually do, because you may say, okay, whatever I computed is a garbage, but the conditions, 22:55.320 --> 23:00.840 which led to this garbage, which you actually control in that application, because you decides 23:00.840 --> 23:07.240 the transaction domain, with using it, what, what is the semantics of the data you cover by 23:07.240 --> 23:13.880 transaction, like in the scheduler case, it may mean, okay, someone did it for me, so I just did up. 23:17.320 --> 23:23.080 Have you looked into read copy update, because that's a scheme that sort of should provide you with 23:23.080 --> 23:29.080 some more properties like readers, not being synchronized among each other, writers not being 23:29.080 --> 23:34.520 brought by the readers, readers not being brought by the writers, and I would say the main difference 23:34.520 --> 23:39.800 is that compared to transactions where you have to tolerate failed transactions, in the read copy 23:39.800 --> 23:46.120 update, you have to tolerate that the readers might see still data sometimes. So would that serve you 23:47.160 --> 23:53.560 well or not, and if not, why? Okay, the question was if I check the read copy update, 23:54.600 --> 24:01.480 I think it's from Linux, or, okay, yeah, about the Linux is it, the answer is I haven't, 24:03.000 --> 24:14.600 because somehow when doing my research, it led me to some obscure mechanism for very specific 24:15.080 --> 24:23.240 situations, and then I switch to book free, so I skip this, but it's not read as well, 24:23.240 --> 24:29.320 it's a different approach to the same problem, and you have different trade movements. 24:29.320 --> 24:34.120 Yeah, so now I haven't studied, but you are actually the second person who mentioned it, so I will 24:34.120 --> 24:42.600 definitely study it. In the forward row, yep, when you say like, 24:42.680 --> 24:47.960 lock free decision, like you introduce what I would call intersectional memory, or in this case, 24:47.960 --> 24:54.360 of intersectional memory in the front row, and when you talk about lock free, you talk about 24:56.360 --> 25:02.200 that's not very interesting, I'm sorry, I don't know, it's not fair if you're talking about 25:02.200 --> 25:10.360 the good meaning of lock free, or something else. Yeah, so like it's not completely lock free, 25:10.440 --> 25:16.600 it's inspired by lock free, and you actually have the good, you have good points, and the question 25:16.600 --> 25:26.360 was, if it is lock free, if it actually locks, right, kind of, and the truth is that it actually, 25:27.400 --> 25:35.480 you know, when you go multicore, this locking will actually be quite useless, because another 25:35.560 --> 25:44.360 core can bypass you, unless you do interprocess, signaling, and lock everything down, and when 25:44.360 --> 25:51.960 you are resilient enough to survive the hell before the comet, then you actually don't need to lock. 25:53.960 --> 25:58.600 I think I'll just suggest that you take a look at the sort of transactional memory. 25:58.600 --> 26:07.080 I think that we are out of time, so we will take questions outside. Thank you.