So I'm Dorian. I did this project as part of my studies in HPC in the city of Santiago de Compostela, and nowadays I'm a programmer for XWiki on a totally unrelated project. The topic of today is implementing OpenMP for Python. It's a new library called openmpy, and I'm also going to talk about other libraries that do the same thing.

So in Python we have a built-in threading library, called threading. It introduces a Thread class and, among other things, functions about threads: the current thread, how many threads are currently running, and so on. It introduces the concept of thread-local memory, and we have synchronization features. But we have had, for a long time, the problem that only a single thread is able to run at any given moment. I'm sure many of you are aware of the GIL; I'm going to come back to that. (OK, yes, sure, I'll speak up.)

Despite that limitation, there are some use cases for threading: parallelizing input/output, or having a GUI with an event-based system. And you can still run parallelized code using threading, by releasing the GIL in functions implemented in other languages.

We already have some implementations of OpenMP for Python, namely PyOMP and Pythran. Maybe more exist, but those are the ones I came across. PyOMP is an extension of Numba, and it compiles the Python code on the fly to an LLVM intermediate representation, and then to machine code. But again, this comes with limitations, because it can't run arbitrary code written in Python; it's not transparent. You need to use libraries supported by Numba, and libraries implemented in other languages need dedicated work to be compatible. Pythran is another solution that has built-in support for OpenMP. However, it's highly limited in its support of Python itself; you can't create classes, for instance. So my understanding of the project is that it's aimed at prototyping for mathematical usage.

We also have other ways of parallelizing code in Python. There is multiprocessing, for instance: running different processes, different instances of the Python interpreter, and transferring data with inter-process communication. There is the library concurrent.futures, which wraps threading and multiprocessing as independent implementations of the same concept. And I can mention asyncio, which is an event-based approach similar to what you can have in JavaScript: not multithreading, but a way to interleave tasks and input/output.
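As a quick illustration of that last point, here is a minimal sketch of the concurrent.futures API just mentioned, using two made-up worker functions; it shows how ThreadPoolExecutor and ProcessPoolExecutor expose the same interface, and why threads only help when the work releases the GIL:

```python
# Sketch (not from the talk): concurrent.futures wraps threading and
# multiprocessing behind the same Executor interface, so the two pools
# below are drop-in replacements for each other.
import math
import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def io_bound(delay):
    # Stand-in for blocking I/O; sleeping (like real I/O) releases the GIL,
    # so threads give real concurrency here even with the GIL in place.
    time.sleep(delay)
    return delay

def cpu_bound(n):
    # Pure-Python number crunching; under the GIL this only scales
    # across processes, not threads.
    return sum(math.sqrt(i) for i in range(n))

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=4) as pool:
        print(list(pool.map(io_bound, [0.5] * 4)))       # ~0.5 s total, not 2 s
    with ProcessPoolExecutor(max_workers=4) as pool:
        print(list(pool.map(cpu_bound, [2_000_000] * 4)))
```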
Recently, about a year ago I think, Sam Gross, an engineer at Meta, opened PEP 703, and went on to fork CPython to remove the global interpreter lock and make threading in Python actually useful for improving the performance of programs. It started as a fork of 3.9, was rebased onto 3.12, and was merged in 3.13. So PEP 703 defines that removal of the GIL, and it was implemented in 3.13, as I said, which was released recently, in October 2024. You need to compile Python with a specific option, which is --disable-gil, and at runtime you also need to specify that you are running without the GIL, because the removal of the GIL has impacts on existing programs. If you run a library that wasn't written with the absence of the GIL in mind, you're going to run into concurrency issues that you didn't have with the GIL, because there is more concurrency than before. The GIL is a global interpreter lock, so when removing it, some instructions that you could consider atomic no longer are. Existing libraries already use threads in Python, and those will break when removing the GIL. So there is a lot of work to do to make the Python ecosystem support that removal of the GIL.

So I'm introducing a library that's called omp. You can find it on PyPI, the Python package index, although not under the name omp, unfortunately; that one was already claimed, with the AI hype and researchers thinking it's a great idea to take the name. The name of the library on the Python package index is openmpy; if you're looking for it, I'll show the links later.

So the library implements the OpenMP API, and I'm going to go over the different directives supported. Its usage is similar to OpenMP in C, but taking into account the specificities of Python. One of the requirements that we had in that project was to use only the Python standard library, so the library doesn't have any dependencies.

For the usage of the library, once it's imported, you can call the omp class with the directive as a string parameter for simple directives, and for constructs you can use that same class as a context manager. We also implement the runtime primitives, simply under the omp namespace. So I'm going to show some examples of code, so that you can have an idea of what it looks like.

Okay. So here I'm importing the library omp. I need to decorate my main function with omp.enable, and what that does is read the source code of that function and do a sort of preprocessing, to do the correct translation of the OpenMP directives, so that it handles the threading and manages the variables. This example is a pretty simple one: I'm just summing the numbers from one to a number N defined here, which is big so that we can see the performance.
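For reference, a minimal sketch of what that example might look like, based on the description above; the module name, the omp.enable decorator, and the exact directive string are assumptions on my part (following OpenMP's clause syntax), so check the project's documentation for the real API:

```python
# Hedged reconstruction of the talk's first example; module name and
# directive strings are assumed from the description, not verified.
import omp  # the talk's library, published on PyPI as "openmpy"

N = 10_000_000  # big enough that the parallel speedup is visible

@omp.enable  # preprocesses the function source and translates the directives
def main():
    total = 0
    # Combined parallel-for construct with a sum reduction; the schedule's
    # chunk size matters a lot for performance in pure Python.
    with omp("parallel for reduction(+:total) schedule(dynamic, 10000)"):
        for i in range(1, N + 1):
            total += i
    print(total)

if __name__ == "__main__":
    main()
```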
So I'm using a reduction, and the schedule parameter is something that's important to tweak for performance, especially in Python. That's a big difference that you will see compared with the usage of OpenMP you can have in C, because you need to take into account the base performance of Python. This is a native Python implementation of OpenMP, so we're still going at the pace of Python; we're not going faster, we're just parallelizing what's happening.

So that's the usage of the library. When removing the omp.enable decorator, the code runs seamlessly, as if we didn't have the OpenMP directives; those don't do anything when the decorator is not present.

On the performance: here I tried running that code with different chunk sizes, and what we can see is that we do have a speedup. So that's great news, but it's not great. Here you have the time on a log scale, and here the number of threads. I was able to run my code on a supercomputer node of FinisTerrae III in Galicia, at CESGA, so thanks to them. I don't have access to that computer anymore, so I can't run more advanced examples, but I'm sure you have many ideas of usage for that library that fit your purpose.

About the chunk size, what's interesting is that there is a sweet spot where we gain some time. So we're using dynamic scheduling, and Python's syntax is tricky to manage here, because this is a for loop, so this is an iterator, and the way I implemented the library, it works for any iterator, not only range. So for now, the dynamic scheduling is going to unpack that iterator and distribute it to the different threads as we go. When running with a short body like this one, which only does one operation, we're going to lose time simply unpacking the range itself; unpacking the range takes most of the time, compared to simply updating the accumulator.

So I have another example, where I defined a function, a very simple one, that checks if a number is prime. It's meant as a heavier computation than just adding a number to an accumulator, since it has to test primality on bigger and bigger numbers. Here we can see the difference in runtime again, depending on the number of threads, and we achieve a much better speedup, which I think showcases the time we lose unpacking the range. So there are some optimizations that can be done to that unpacking, by having an implementation specifically for ranges, and still falling back to that unpacking distribution for other iterators. Yeah, it uses multithreading under the hood, nothing fancy here.
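To make that unpacking overhead concrete, here is an illustrative sketch (not the library's actual code) of dynamic scheduling over a generic iterator: worker threads pull chunks from a shared iterator under a lock, and for a trivial loop body that pulling dominates the runtime:

```python
# Illustrative sketch of dynamic scheduling over an arbitrary iterator;
# under the GIL this shows the bookkeeping cost, not a real speedup.
import threading
from itertools import islice

def parallel_sum(iterable, num_threads=4, chunk_size=1000):
    source = iter(iterable)
    lock = threading.Lock()          # guards the shared iterator
    partials = [0] * num_threads     # one private accumulator per thread

    def worker(tid):
        while True:
            with lock:
                # "Unpacking" step: materialize the next chunk. With a
                # trivial loop body, this dominates the total runtime.
                chunk = list(islice(source, chunk_size))
            if not chunk:
                break
            for i in chunk:
                partials[tid] += i   # trivial body: one addition per item

    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(partials)             # reduction over the private partials

print(parallel_sum(range(1, 1_000_001)))  # 500000500000
```

A range-specialized implementation could instead hand each thread its own index bounds to iterate locally, skipping the shared iterator entirely, which is the optimization suggested above.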
But we do have to manage variable scopes. The nice thing about that library, compared to other multithreading techniques, is that you don't have to care much about variable scope; it's exactly like OpenMP. That accumulator, I can access it here, inside the parallel region, and I can access it outside of the threads, when I print my result.

So that's that. On the supported directives from OpenMP: not all of them. We have parallel with the private clause; we have for with the nowait, private, reduction and schedule clauses; we have the barrier directive; we have critical and single; and that's it. So that's the support that's currently implemented. I'm interested if you have ideas of usage, how it could be useful to you, and what could be missing in that library. And yeah, that's it.

[Audience question] Yeah, all right. So the question is: there are a number of clauses you can pass; is there documentation for what the clauses do? We're just following the OpenMP specification here. (Is this on? No, sorry, it's not ready to go, it's blinking. Let me find it. Yeah.) So you can read the OpenMP API specification, and there you have the explanation of what the clauses do; I'm just following this. I did make one modification, for the for directive: I changed the default scheduling and switched it to dynamic, because of how the fact that I don't have a specific range implementation affects the performance. So I decided that dynamic was a better default. Other than that, it's what you can see in the specification. Yes, another question, go ahead.

[Audience question] So it depends on the task. Yes, sorry, good point. So the question is about the runtimes I showed: it seems that it doesn't scale above six cores. We can see that here; maybe I'll put that document online, so you can access it on the presentation page. Yes, it depends on the kind of task. On that task specifically, yes, we do reach a minimum pretty quickly, where adding more cores doesn't bring any benefit, and that's because of the unpacking of the range. So what I get from that is that we really need to have an implementation specifically for range, and not unpack the range like any iterator and distribute the elements.

So the question is: with a different workload, could I scale to 64 cores? The answer is yes. Here is the different workload, and you can see we keep gaining time. I do have the speedup chart here, which maybe says more: for 64 cores, we're at about a speedup of 50. So that's a heavier task, a different one, and we do achieve a speedup; we are able to compute different things at the same time.
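Since the library is pure Python with no dependencies, directives like critical and barrier presumably map onto the standard library's threading primitives. Here is a rough illustration of that idea, a sketch of the general technique rather than the library's actual translation:

```python
# Rough illustration of how OpenMP constructs can map to Python's
# standard threading primitives; not the library's actual code.
import threading

NUM_THREADS = 4
critical_lock = threading.Lock()           # omp("critical") ~ a shared lock
barrier = threading.Barrier(NUM_THREADS)   # omp("barrier") ~ threading.Barrier
results = []

def worker(tid):
    local = tid * tid        # thread-private work, like a private variable
    barrier.wait()           # all threads sync up before the next phase
    with critical_lock:      # only one thread at a time in this section
        results.append(local)

threads = [threading.Thread(target=worker, args=(t,))
           for t in range(NUM_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(results))       # [0, 1, 4, 9]
```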
More questions? Go ahead. Okay.

[Audience question] So the question is: given that we read the iterator elements and distribute them across threads, what happens if the iterator is infinite? Pretty much the same thing as if you weren't using OpenMP: you have an infinite iterator, so your for loop is going to run infinitely. That's it. Other questions?

[Audience question] So the question is: how does it compare with the multiprocessing library? That's an interesting question. The goal of this library is purely syntactic sugar; it's not about making multithreading performance better in Python. So I didn't do a comparison of the performance with multiprocessing. I think that's more something that's interesting to look at in the context of the removal of the GIL itself. Sorry, I can't answer that.

[Audience question] One more thing: you're showing your report; can you put it in the slides? Yes, I'm going to put it online. Sorry, I didn't know.

[Inaudible question] Great question. Okay?