So I'm Dorian. I did this project as part of my studies in HPC in the city of Santiago de Compostela, and nowadays I'm a programmer for XWiki on a totally unrelated project. The topic of today is implementing OpenMP for Python. It's a new library called openmpy, and I'm also going to talk about other libraries that do the same thing.

So in Python we have a built-in threading library, called threading. It introduces a Thread class and, among other things, functions about threads: the current thread, how many threads are currently running, and so on. It introduces the concept of thread-local memory, and we have synchronization features. But we have had, for a long time, the problem that only a single thread is able to run at any given moment. I'm sure many of you are aware of the GIL; I'm going to come back to that. (OK, yes, sure, I'll speak up.)

Despite that limitation, there are some use cases for threading: parallelizing input/output, or having a GUI with an event-based system. And you can still run parallelized code using threading, by releasing the GIL in functions implemented in other languages.

We already have some implementations of OpenMP for Python, namely PyOMP and Pythran. Maybe more exist, but those are the ones I came across. PyOMP is an extension of Numba, and it compiles the Python code on the fly to an LLVM intermediate representation, and then to machine code. But again, this comes with limitations, because it can't run arbitrary code written in Python; it's not transparent. You need to use libraries supported by Numba, and libraries implemented in other languages need dedicated work to be compatible. Pythran is another solution that has built-in support for OpenMP. However, it's highly limited in its support of Python itself; you can't create classes, for instance. So my understanding of the project is that it's aimed at prototyping for mathematical usage.

We also have other ways of parallelizing code in Python. There is multiprocessing, for instance: running different processes, different instances of the Python interpreter, and transferring data with inter-process communication. There is the library concurrent.futures, which wraps threading and multiprocessing as independent implementations of the same concept. And I can mention asyncio, which is an event-based approach similar to what you can have in JavaScript: not multithreading, but a way to interleave tasks and input/output.
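As a quick illustration of that last point, here is a minimal sketch of the concurrent.futures API just mentioned, using two made-up worker functions; it shows how ThreadPoolExecutor and ProcessPoolExecutor expose the same interface, and why threads only help when the work releases the GIL:

```python
# Sketch (not from the talk): concurrent.futures wraps threading and
# multiprocessing behind the same Executor interface, so the two pools
# below are drop-in replacements for each other.
import math
import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def io_bound(delay):
    # Stand-in for blocking I/O; sleeping (like real I/O) releases the GIL,
    # so threads give real concurrency here even with the GIL in place.
    time.sleep(delay)
    return delay

def cpu_bound(n):
    # Pure-Python number crunching; under the GIL this only scales
    # across processes, not threads.
    return sum(math.sqrt(i) for i in range(n))

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=4) as pool:
        print(list(pool.map(io_bound, [0.5] * 4)))       # ~0.5 s total, not 2 s
    with ProcessPoolExecutor(max_workers=4) as pool:
        print(list(pool.map(cpu_bound, [2_000_000] * 4)))
```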
Recently, about a year ago I think, Sam Gross, an engineer at Meta, opened PEP 703, and went on to fork CPython to remove the global interpreter lock and make threading in Python actually useful for improving the performance of programs. It started as a fork of 3.9, was rebased onto 3.12, and was merged in 3.13. So PEP 703 defines that removal of the GIL, and it was implemented in 3.13, as I said, which was released recently, in October 2024. You need to compile Python with a specific option, which is --disable-gil, and at runtime you also need to specify that you are running without the GIL, because the removal of the GIL has impacts on existing programs. If you run a library that wasn't written with the absence of the GIL in mind, you're going to run into concurrency issues that you didn't have with the GIL, because there is more concurrency than before. The GIL is a global interpreter lock, so when removing it, some instructions that you could consider atomic no longer are. Existing libraries already use threads in Python, and those will break when removing the GIL. So there is a lot of work to do to make the Python ecosystem support that removal of the GIL.

So I'm introducing a library that's called omp. You can find it on PyPI, the Python package index, although not under the name omp, unfortunately; that one was already claimed, with the AI hype and researchers thinking it's a great idea to take the name. The name of the library on the Python package index is openmpy; if you're looking for it, I'll show the links later.

So the library implements the OpenMP API, and I'm going to go over the different directives supported. Its usage is similar to OpenMP in C, but taking into account the specificities of Python. One of the requirements that we had in that project was to use only the Python standard library, so the library doesn't have any dependencies.

For the usage of the library, once it's imported, you can call the omp class with the directive as a string parameter for simple directives, and for constructs you can use that same class as a context manager. We also implement the runtime primitives, simply under the omp namespace. So I'm going to show some examples of code, so that you can have an idea of what it looks like.

Okay. So here I'm importing the library omp. I need to decorate my main function with omp.enable, and what that does is read the source code of that function and do a sort of preprocessing, to do the correct translation of the OpenMP directives, so that it handles the threading and manages the variables. This example is a pretty simple one: I'm just summing the numbers from one to a number N defined here, which is big so that we can see the performance.
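For reference, a minimal sketch of what that example might look like, based on the description above; the module name, the omp.enable decorator, and the exact directive string are assumptions on my part (following OpenMP's clause syntax), so check the project's documentation for the real API:

```python
# Hedged reconstruction of the talk's first example; module name and
# directive strings are assumed from the description, not verified.
import omp  # the talk's library, published on PyPI as "openmpy"

N = 10_000_000  # big enough that the parallel speedup is visible

@omp.enable  # preprocesses the function source and translates the directives
def main():
    total = 0
    # Combined parallel-for construct with a sum reduction; the schedule's
    # chunk size matters a lot for performance in pure Python.
    with omp("parallel for reduction(+:total) schedule(dynamic, 10000)"):
        for i in range(1, N + 1):
            total += i
    print(total)

if __name__ == "__main__":
    main()
```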
So I'm using a reduction, and the schedule parameter is something that's important to tweak for performance, especially in Python. That's a big difference that you will see compared with the usage of OpenMP you can have in C, because you need to take into account the base performance of Python. This is a native Python implementation of OpenMP, so we're still going at the pace of Python; we're not going faster, we're just parallelizing what's happening.

So that's the usage of the library. When removing the omp.enable decorator, the code runs seamlessly, as if we didn't have the OpenMP directives; those don't do anything when the decorator is not present.

On the performance: here I tried running that code with different chunk sizes, and what we can see is that we do have a speedup. So that's great news, but it's not great. Here you have the time on a log scale, and here the number of threads. I was able to run my code on a supercomputer node of FinisTerrae III in Galicia, at CESGA, so thanks to them. I don't have access to that computer anymore, so I can't run more advanced examples, but I'm sure you have many ideas of usage for that library that fit your purpose.

About the chunk size, what's interesting is that there is a sweet spot where we gain some time. So we're using dynamic scheduling, and Python's syntax is tricky to manage here, because this is a for loop, so this is an iterator, and the way I implemented the library, it works for any iterator, not only range. So for now, the dynamic scheduling is going to unpack that iterator and distribute it to the different threads as we go. When running with a short body like this one, which only does one operation, we're going to lose time simply unpacking the range itself; unpacking the range takes most of the time, compared to simply updating the accumulator.

So I have another example, where I defined a function, a very simple one, that checks if a number is prime. It's meant as a heavier computation than just adding a number to an accumulator, since it has to test primality on bigger and bigger numbers. Here we can see the difference in runtime again, depending on the number of threads, and we achieve a much better speedup, which I think showcases the time we lose unpacking the range. So there are some optimizations that can be done to that unpacking, by having an implementation specifically for ranges, and still falling back to that unpacking distribution for other iterators. Yeah, it uses multithreading under the hood, nothing fancy here.
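To make that unpacking overhead concrete, here is an illustrative sketch (not the library's actual code) of dynamic scheduling over a generic iterator: worker threads pull chunks from a shared iterator under a lock, and for a trivial loop body that pulling dominates the runtime:

```python
# Illustrative sketch of dynamic scheduling over an arbitrary iterator;
# under the GIL this shows the bookkeeping cost, not a real speedup.
import threading
from itertools import islice

def parallel_sum(iterable, num_threads=4, chunk_size=1000):
    source = iter(iterable)
    lock = threading.Lock()          # guards the shared iterator
    partials = [0] * num_threads     # one private accumulator per thread

    def worker(tid):
        while True:
            with lock:
                # "Unpacking" step: materialize the next chunk. With a
                # trivial loop body, this dominates the total runtime.
                chunk = list(islice(source, chunk_size))
            if not chunk:
                break
            for i in chunk:
                partials[tid] += i   # trivial body: one addition per item

    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(partials)             # reduction over the private partials

print(parallel_sum(range(1, 1_000_001)))  # 500000500000
```

A range-specialized implementation could instead hand each thread its own index bounds to iterate locally, skipping the shared iterator entirely, which is the optimization suggested above.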
But we do have to manage variable scopes. The nice thing about that library, compared to other multithreading techniques, is that you don't have to care much about variable scope; it's exactly like OpenMP. That accumulator, I can access it here, inside the parallel region, and I can access it outside of the threads, when I print my result.

So that's that. On the supported directives from OpenMP: not all of them. We have parallel with the private clause; we have for with the nowait, private, reduction and schedule clauses; we have the barrier directive; we have critical and single; and that's it. So that's the support that's currently implemented. I'm interested if you have ideas of usage, how it could be useful to you, and what could be missing in that library. And yeah, that's it.

[Audience question] Yeah, all right. So the question is: there are a number of clauses you can pass; is there documentation for what the clauses do? We're just following the OpenMP specification here. (Is this on? No, sorry, it's not ready to go, it's blinking. Let me find it. Yeah.) So you can read the OpenMP API specification, and there you have the explanation of what the clauses do; I'm just following this. I did make one modification, for the for directive: I changed the default scheduling and switched it to dynamic, because of how the fact that I don't have a specific range implementation affects the performance. So I decided that dynamic was a better default. Other than that, it's what you can see in the specification. Yes, another question, go ahead.

[Audience question] So it depends on the task. Yes, sorry, good point. So the question is about the runtimes I showed: it seems that it doesn't scale above six cores. We can see that here; maybe I'll put that document online, so you can access it on the presentation page. Yes, it depends on the kind of task. On that task specifically, yes, we do reach a minimum pretty quickly, where adding more cores doesn't bring any benefit, and that's because of the unpacking of the range. So what I get from that is that we really need to have an implementation specifically for range, and not unpack the range like any iterator and distribute the elements.

So the question is: with a different workload, could I scale to 64 cores? The answer is yes. Here is the different workload, and you can see we keep gaining time. I do have the speedup chart here, which maybe says more: for 64 cores, we're at about a speedup of 50. So that's a heavier task, a different one, and we do achieve a speedup; we are able to compute different things at the same time.
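Since the library is pure Python with no dependencies, directives like critical and barrier presumably map onto the standard library's threading primitives. Here is a rough illustration of that idea, a sketch of the general technique rather than the library's actual translation:

```python
# Rough illustration of how OpenMP constructs can map to Python's
# standard threading primitives; not the library's actual code.
import threading

NUM_THREADS = 4
critical_lock = threading.Lock()           # omp("critical") ~ a shared lock
barrier = threading.Barrier(NUM_THREADS)   # omp("barrier") ~ threading.Barrier
results = []

def worker(tid):
    local = tid * tid        # thread-private work, like a private variable
    barrier.wait()           # all threads sync up before the next phase
    with critical_lock:      # only one thread at a time in this section
        results.append(local)

threads = [threading.Thread(target=worker, args=(t,))
           for t in range(NUM_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(results))       # [0, 1, 4, 9]
```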
More questions? Go ahead. Okay.

[Audience question] So the question is: given that we read the iterator elements and distribute them across threads, what happens if the iterator is infinite? Pretty much the same thing as if you weren't using OpenMP: you have an infinite iterator, so your for loop is going to run infinitely. That's it. Other questions?

[Audience question] So the question is: how does it compare with the multiprocessing library? That's an interesting question. The goal of this library is purely syntactic sugar; it's not about making multithreading performance better in Python. So I didn't do a comparison of the performance with multiprocessing. I think that's more something that's interesting to look at in the context of the removal of the GIL itself. Sorry, I can't answer that.

[Audience question] One more thing: you're showing your report; can you put it in the slides? Yes, I'm going to put it online. Sorry, I didn't know.

[Inaudible question] Great question. Okay?