WEBVTT 00:00.000 --> 00:07.000 Well, so I'm Felix, and this is Vasileios. 00:07.000 --> 00:15.000 We're going to talk about the work we've done in ReFrame, with open-source tools, to support basic performance testing. 00:15.000 --> 00:18.000 So, what do we do at NVIDIA? 00:18.000 --> 00:22.000 We are both in a team called Applied Systems at NVIDIA. 00:22.000 --> 00:31.000 What we do is build internal supercomputers with the latest GPUs and CPUs and Mellanox hardware. 00:31.000 --> 00:38.000 We have created internal clusters like EOS, which was number nine on the TOP500 in November 2023. 00:38.000 --> 00:45.000 We do that to enable internal users, so that they can run their deep learning workloads or their HPC workloads. 00:45.000 --> 00:52.000 And as such, we also help our customers build their own clusters at a later point. 00:52.000 --> 01:00.000 Our internal clusters run benchmarks such as HPL and MLPerf Training, if you're familiar with those. 01:00.000 --> 01:09.000 They also host deep learning research, such as training the Megatron model on the Selene supercomputer, which was the generation before EOS. 01:09.000 --> 01:13.000 So that's the kind of thing we are doing in our team. 01:13.000 --> 01:26.000 And we've been using ReFrame for a few years now because, if you're familiar with ReFrame, it's very useful for the performance validation of your cluster and for regression testing. 01:26.000 --> 01:33.000 ReFrame is an open-source project, developed initially at CSCS by some people around here. 01:33.000 --> 01:43.000 You express your tests in a declarative manner, you write them in Python, and you can express dependencies and constraints on your tests, 01:43.000 --> 01:52.000 you define what you're expecting the output to be, and you define performance targets for each test. 01:52.000 --> 01:59.000 ReFrame takes care of generating the scripts and launching jobs through Slurm automatically — you don't need to care about Slurm — and it will gather the results, 01:59.000 --> 02:07.000 execute everything concurrently as much as it can, respect the dependencies, and then you get the results: green, it's passing; red, it's failing, obviously. 02:07.000 --> 02:12.000 And yes, it's open source and has great documentation; you can take a look. 02:12.000 --> 02:18.000 So, performance testing: what does it look like in ReFrame? Here is a very simple test where we run the STREAM benchmark, 02:18.000 --> 02:27.000 and you declare methods, decorated as performance functions, and you say: I have a metric called copy bandwidth for STREAM.
02:27.000 --> 02:32.000 So I'm going to execute the binary called stream.x, and I'm declaring a regex here, 02:32.000 --> 02:40.000 and I'm saying: take the first matching group, cast it to float, and that's going to be my copy bandwidth; 02:40.000 --> 02:45.000 and the same for the triad bandwidth, because STREAM contains multiple benchmarks. 02:45.000 --> 02:55.000 Then, for the performance targets, you define a dictionary of dictionaries, and you say, for the system — here the default system; you can set different targets per system — 02:55.000 --> 03:07.000 I want my copy bandwidth to be around 23,000 megabytes per second, minus 10%, plus 13%; so you give percentage bounds around this middle value. 03:07.000 --> 03:14.000 And that's what you usually do for performance testing in ReFrame, and when you execute ReFrame, it looks like this. 03:14.000 --> 03:19.000 That's in a terminal. It looks like this. 03:19.000 --> 03:24.000 You have the name of the test; ReFrame tells you what it's doing: it's executing this. 03:24.000 --> 03:28.000 As I said, the green means it passes, and then you have this summary at the end. 03:28.000 --> 03:34.000 And where ReFrame is very useful is that you really have the choice of where you send the logs. 03:34.000 --> 03:38.000 You can send the logs to an Elasticsearch, to a Graylog, 03:38.000 --> 03:42.000 and you can also configure exactly the format of what you output. So here, 03:42.000 --> 03:49.000 we're defining a format for the output, which is CSV — I mean, separated by pipes, but still CSV. 03:49.000 --> 03:56.000 And your output file here, produced by ReFrame after executing this, will contain the timestamp, 03:56.000 --> 04:03.000 the name of the test, the metric, like triad, or read, or write. 04:03.000 --> 04:06.000 I had to redact some of that, but there is the actual performance value right here. 04:06.000 --> 04:16.000 That was the performance we were targeting, and minus 3%, plus 3% are the performance bounds around the value we were targeting. 04:16.000 --> 04:22.000 And how do we usually define those performance bounds in ReFrame? 04:22.000 --> 04:28.000 They have to be fixed per system, so basically we're going to run on a few machines, 04:28.000 --> 04:34.000 we're going to see what the performance looks like, compute the mean and the deviation, and then decide: okay, how do I define my bounds?
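For reference, the kind of test described above looks roughly like this minimal sketch, assuming a prebuilt stream.x binary; the class name, regexes, and reference numbers are illustrative, not the exact code from the slide:

    import reframe as rfm
    import reframe.utility.sanity as sn


    @rfm.simple_test
    class StreamTest(rfm.RunOnlyRegressionTest):
        valid_systems = ['*']
        valid_prog_environs = ['*']
        executable = './stream.x'   # assumes a prebuilt STREAM binary

        # Performance targets per system ('*' is the default):
        # (reference, lower fraction, upper fraction, unit), i.e. the
        # percentage bounds around a middle value described in the talk.
        reference = {
            '*': {
                'copy_bw': (23000, -0.10, 0.13, 'MB/s'),
                'triad_bw': (17000, -0.10, 0.13, 'MB/s'),
            }
        }

        @sanity_function
        def validate(self):
            return sn.assert_found(r'Solution Validates', self.stdout)

        @performance_function('MB/s')
        def copy_bw(self):
            # First matching group of the regex, cast to float
            return sn.extractsingle(r'Copy:\s+(\S+)', self.stdout, 1, float)

        @performance_function('MB/s')
        def triad_bw(self):
            return sn.extractsingle(r'Triad:\s+(\S+)', self.stdout, 1, float)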
04:34.000 --> 04:42.000 And the problem we're facing is that if your bounds are too narrow, you might get spurious failures, because you didn't really test the full population initially, 04:42.000 --> 04:46.000 and you're going to get false positives, and admins do not like that, 04:46.000 --> 04:53.000 so they will tell you: can you please increase the bounds? It's failing by a very, very small margin; we need it to pass; it's not really a performance problem. 04:53.000 --> 05:02.000 The problem if the bounds are too large — here I show a normal distribution, because performance results across the population often actually follow a normal distribution — 05:02.000 --> 05:12.000 if they are too large, like three sigma, three standard deviations, you can obviously get a regression that is still within your bounds, and that actually happens quite a lot. 05:14.000 --> 05:15.000 Thank you. 05:15.000 --> 05:26.000 And so a common problem we face, of course, is that we need to validate our clusters: the performance compared to one week ago, two weeks ago, 05:26.000 --> 05:32.000 or, if we install new software, we want to have confidence that we have the same performance as before, 05:32.000 --> 05:41.000 otherwise users will report problems, and that's going to take us a lot of time, because they're going to report problems in very complex applications, where it's much harder to find what's going on. 05:41.000 --> 05:46.000 So we need a very robust test suite, and performance bounds are, honestly, great. 05:46.000 --> 05:49.000 We find a lot of bugs with ReFrame and performance bounds, 05:49.000 --> 05:56.000 but for tests that have a wide variation, you tend to have wide ranges, and you're going to miss some regressions. 05:56.000 --> 06:01.000 So people usually build their own tools on top of ReFrame: 06:01.000 --> 06:08.000 they were developing their own pandas scripts, like we did, or they were sending things to Splunk or Elasticsearch, and building tools on top of this.
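To make the calibration step and the wide-variation problem concrete, here is a small sketch (the sample numbers are invented) of turning measurements from a few machines into the fractional bounds a ReFrame reference tuple expects — the wider the measured deviation, the wider the bounds, and the larger the regression that can hide inside them:

    import statistics

    # Hypothetical calibration data: copy-bandwidth measurements (MB/s)
    # taken on a handful of machines before fixing the bounds.
    samples = [22950, 23100, 23020, 22880, 23240, 23010]

    mean = statistics.mean(samples)
    sigma = statistics.stdev(samples)

    # Three-sigma bounds expressed as the fractional offsets used in a
    # ReFrame reference tuple: (ref, lower_frac, upper_frac, unit).
    lower, upper = -3 * sigma / mean, 3 * sigma / mean
    print(f"reference = ({mean:.0f}, {lower:+.3f}, {upper:+.3f}, 'MB/s')")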
06:08.000 --> 06:14.000 And we wanted to see — okay, but that's not very portable; it really depends on your architecture, on your cluster — 06:14.000 --> 06:18.000 what can we do in ReFrame to have that built in, baked in? 06:18.000 --> 06:22.000 And Vasileios is going to take over and talk to you about this. 06:28.000 --> 06:29.000 Can you hear me? 06:29.000 --> 06:37.000 So, yeah, as Felix said — oh, okay, sorry. 06:37.000 --> 06:52.000 As Felix said, users of ReFrame have been using ad-hoc solutions, often one-liners, to do historical analysis. 06:52.000 --> 06:57.000 We thought that could be useful for other users in the community as well: 06:57.000 --> 07:05.000 to be able to compare past results, inspect past test results, get performance metrics, 07:05.000 --> 07:11.000 aggregate performance across different characteristics, like node lists, test parameters, time periods, 07:11.000 --> 07:21.000 and also to be able to compare performance between runs, between different configurations, or current versus historical data, 07:21.000 --> 07:25.000 and, for example, also different time periods. 07:26.000 --> 07:32.000 We also wanted, as a key goal, to store as much test information as we can, 07:32.000 --> 07:38.000 because experience shows that you later regret the information you haven't collected. 07:38.000 --> 07:44.000 So it's better if you already have all the test information that you can get. 07:44.000 --> 07:53.000 We still want to allow external post-processing, because we will never do all the post-processing that everybody else would like to do. 07:53.000 --> 07:59.000 So the idea is to provide a basic analytics layer, let's say. 07:59.000 --> 08:10.000 Also, be backward compatible: we didn't want users of ReFrame to come back complaining, "you changed this option, you broke that interface, you broke my tests" — so we want backward compatibility.
08:10.000 --> 08:19.000 And also, we want to provide an easy, intuitive command-line interface to be able to do some basic analytics. 08:19.000 --> 08:22.000 So, we considered two options. 08:23.000 --> 08:29.000 As Felix said, one way of storing performance data is what we call, in ReFrame, perflogs, 08:29.000 --> 08:37.000 which are those usually-CSV files that contain the essential performance data of the tests. 08:37.000 --> 08:43.000 But although they are compact, there are two disadvantages to that. 08:43.000 --> 08:49.000 Important test information may be lost, because they don't carry the whole information. 08:49.000 --> 09:00.000 And the information is really bound to the log format that the user defined, which basically selects what information is important. 09:00.000 --> 09:12.000 Then the second option, which ReFrame does internally, stores the full test-case information in a JSON report, which it can then dump to a file. 09:12.000 --> 09:21.000 The advantage of this is that it contains the whole test information: the test parameters, the test variables, where they ran, and so on. 09:21.000 --> 09:29.000 On the other hand, it's quite verbose, and it's also unstructured data, because every test may have different variables. 09:29.000 --> 09:43.000 For those of you that have used ReFrame: each test can define its own parameters, its own new variables, which are usually important to the performance you get. 09:43.000 --> 09:47.000 Nonetheless, we selected option two, the more complete one. 09:47.000 --> 09:54.000 I'm now going to briefly describe a bit of the design and architecture of this feature. 09:54.000 --> 10:05.000 Essentially, it's layered, with interfaces between the layers, so that we can choose and plug in different implementations for each layer. 10:05.000 --> 10:12.000 At the top level, there is a new CLI interface, where we added some new command-line options. 10:13.000 --> 10:24.000 There's a couple of them to list stored test cases and sessions, which will list the data of previous runs or of specific tests in tabular form. 10:24.000 --> 10:40.000 There is its counterpart to describe stored test cases and sessions, which returns raw data in JSON, which you can then ingest elsewhere and post-process yourself.
10:40.000 --> 10:44.000 There is a new option, performance compare, that compares past results, 10:44.000 --> 10:55.000 and there are also two other utility options: to attach new information to the session, with session extras, or to control the data format. 10:55.000 --> 11:03.000 The analytics layer does, essentially, the test-case grouping, the performance aggregations, and the performance differences, 11:03.000 --> 11:08.000 and returns either tabular data or JSON data to the layer above. 11:08.000 --> 11:13.000 At the bottom, there is a storage layer, which stores the results in the database, 11:13.000 --> 11:26.000 and is also responsible for retrieving the raw results out of the database, doing the filtering based on the various criteria, and then giving the upper layer some JSON data. 11:26.000 --> 11:42.000 Now, some of the implementation details. There is already the ReFrame report, a big JSON file that ReFrame produces with all the details, and this is its structure. 11:42.000 --> 11:51.000 The structure is a bit hierarchical: you start with the session, which is essentially one "reframe -r" (run) invocation, 11:51.000 --> 11:53.000 and it has a session info. 11:53.000 --> 12:04.000 The session info has a unique identifier, plus information about the session, which now also includes the information passed with session extras, 12:04.000 --> 12:06.000 and then the session contains runs. 12:06.000 --> 12:15.000 Now, if you run ReFrame, your tests may run multiple times, depending on the options that you pass. 12:15.000 --> 12:25.000 For example, if you have max retries, your failing tests will be retried a couple of times, or you may want to just rerun the tests multiple times; that's why there are multiple runs within a session. 12:25.000 --> 12:31.000 Within a run, there is a set of test cases, which are the tests that actually ran, 12:31.000 --> 12:41.000 with all the information that your test has: parameters, variables, the performance reference and thresholds, the actual performance that you got, and so on. 12:41.000 --> 12:44.000 This is the information we need.
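Abridged, the hierarchy just described looks roughly like this; the field names and values are illustrative placeholders, and the exact report schema is the one ReFrame itself produces:

    {
      "session_info": {
        "uuid": "...",
        "time_start": "...",
        "extras": {"driver_version": "..."}
      },
      "runs": [
        {
          "run_index": 0,
          "testcases": [
            {
              "name": "StreamTest",
              "system": "...",
              "partition": "...",
              "environ": "...",
              "check_params": {},
              "check_vars": {},
              "perfvalues": {
                "copy_bw": [23150.0, 23000, -0.10, 0.13, "MB/s"]
              },
              "result": "pass"
            }
          ]
        }
      ]
    }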
12:44.000 --> 12:50.000 Now, we store the results in SQLite — that's what we started using as the backing database — 12:50.000 --> 12:56.000 but the deal with the layers is that, if the need comes up in the future, this could be easily replaced. 12:56.000 --> 13:03.000 So, essentially, we index test cases and sessions. 13:04.000 --> 13:13.000 Practically, we store in the database the full JSON blob of the report. 13:13.000 --> 13:20.000 Then we index the sessions by their UUID and by their time, so we can easily do time-based queries. 13:20.000 --> 13:27.000 And the test cases themselves we again index by their completion time, 13:27.000 --> 13:37.000 and also by a pseudo-UID, let's say: the session unique identifier, the run index, and the test index inside the session. 13:37.000 --> 13:46.000 So then you have the unique coordinates of a test case in a specific ReFrame run, you can retrieve all the test-case information, and then you can apply filtering on it. 13:46.000 --> 13:52.000 Time-based queries use this index to retrieve the sessions of interest; 13:52.000 --> 14:00.000 then the test cases are decoded, then filtered, and returned to the upper layer for analytics processing; 14:01.000 --> 14:10.000 and similarly for sessions, where only the session information is decoded, to save space, because you don't need the whole session data. 14:10.000 --> 14:18.000 Going on a bit with the syntax: the general syntax of all those options has three parts — 14:18.000 --> 14:25.000 a select spec, an aggregation spec, and a columns spec, which is like a presentation spec, somehow. 14:25.000 --> 14:30.000 The select spec defines which results we want to select. 14:30.000 --> 14:33.000 It can have different forms. One is timestamps. 14:33.000 --> 14:39.000 So you can say here: from the 25th of January to the 31st of January, give me all the results. 14:39.000 --> 14:45.000 Or, with an abbreviation, there is also "the last seven days until now"; or by UUID.
14:45.000 --> 14:49.000 So you just say: give me the results from that specific session. 14:49.000 --> 14:54.000 Or you can select through session properties, which usually you set with session extras. 14:54.000 --> 15:05.000 So here it says: all the tests that have run with driver version 576 on that hostname. 15:05.000 --> 15:12.000 Then the aggregation spec defines how we want to group and aggregate the performance results. 15:12.000 --> 15:14.000 Oh, yeah. 15:14.000 --> 15:23.000 By default, there is a grouping by name, system, partition, environment, and the performance variable and its unit. 15:24.000 --> 15:31.000 And we can use custom groupings, and there is a set of available aggregations that you can use. 15:31.000 --> 15:39.000 Then there is the columns spec, where we define what to show. By default, those are the fields that we have grouped our results by, 15:39.000 --> 15:45.000 but you can add additional fields, or you can use completely custom columns. 15:45.000 --> 15:51.000 One thing to note is that some common filtering options from ReFrame, like -n or -E, 15:52.000 --> 15:59.000 can be reused when you do such queries. 15:59.000 --> 16:01.000 So here I have some examples. 16:01.000 --> 16:10.000 For example: list me the mean performance of a specific benchmark, like the STREAM code, for the last seven days — this is how you can do that. 16:10.000 --> 16:18.000 Then imagine you have a parameterized test, where your test has different modes and is also parameterized over the GPUs on the node, 16:18.000 --> 16:27.000 and you say: the mean across all GPUs on the node, and I want all the nodes that I have tested, and all the modes. 16:27.000 --> 16:34.000 So here is a query where, for a specific driver version, we can get the information we want. 16:34.000 --> 16:43.000 Then: okay, I want to compare all the benchmark data that you have between two driver versions — and yeah. 16:43.000 --> 16:49.000 And then there are also some examples here, which I'm going to skip, of getting some information. 16:49.000 --> 16:58.000 And if you want the raw JSON report, you can still get it by describing the stored session, and then you can post-process it the way you want. 16:58.000 --> 17:02.000 Or just list it as CSV, and get just the information that you need. 17:02.000 --> 17:04.000 And here is an example.
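The option names below are the ones these slides refer to in ReFrame 4.7; the spec strings are only approximations of the three-part select/aggregation/columns syntax described above (the exact grammar is in the ReFrame documentation), and the test name and extras key are made up for illustration:

    # Mean performance of tests matching 'stream' over the last seven days
    reframe --list-stored-testcases='now-7d:now/mean:/' -n stream

    # Compare two one-week periods, aggregating each group by mean
    reframe --performance-compare='now-14d:now-7d/now-7d:now/mean:/'

    # Attach extra metadata to a session when running, to filter on later
    reframe -C settings.py -c checks/ -r --session-extras=driver_version=576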
17:04.000 --> 17:10.000 So this feature is available in ReFrame 4.7, which is the latest version. 17:10.000 --> 17:17.000 It's disabled by default, so you have to enable it, and then you can also customize where you want the results to be stored. 17:17.000 --> 17:26.000 And here is an actual query, and you see how it shows up as a performance table. 17:26.000 --> 17:33.000 So you have the value for your first set, the value for your second set, and also the difference between the two. 17:33.000 --> 17:42.000 So you can easily spot regressions that are small — especially ones still within the thresholds. 17:42.000 --> 17:46.000 And we have like two or three minutes, so if I'm very quick: 17:46.000 --> 17:50.000 I'm going to really quickly describe how we use ReFrame and this feature. 17:50.000 --> 17:55.000 As I said, it's very important for us to check each hardware component, 17:55.000 --> 18:00.000 because if one of them is behaving anomalously, it can slow down your whole HPC or deep learning training. 18:00.000 --> 18:08.000 So we have ReFrame tests running, basically, on each GPU, each HCA, the GPU memory, every SSD. 18:08.000 --> 18:16.000 Every box, basically, on the DGX H100 diagram needs to be properly checked for performance and stability, 18:16.000 --> 18:19.000 and that's what we're using ReFrame for. 18:19.000 --> 18:22.000 So we're using Slurm — ReFrame with Slurm — 18:22.000 --> 18:26.000 and we have our own container runtime, also open source, called Enroot, 18:26.000 --> 18:29.000 and Pyxis is the Slurm integration for this container runtime. 18:29.000 --> 18:33.000 And we use a lot of open-source projects for the testing: 18:33.000 --> 18:39.000 things like the NCCL tests, nvbandwidth for GPU memory, or DMA perf tests, 18:39.000 --> 18:41.000 or the famous STREAM benchmark, and FIO for the disks. 18:41.000 --> 18:45.000 We have single-node tests that test, as I said, each component, 18:45.000 --> 18:53.000 and we have kind of higher-level tests that are closer, maybe, to what users are running, but are very important for performance prediction, 18:53.000 --> 18:59.000 and also things that are multi-node, because you cannot test multi-node things — obviously the network — on just one node.
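As a sketch of what one of those single-node, per-component checks can look like — the benchmark binary, its output format, and the GPU count are invented, and a real test would also handle container and launcher setup:

    import reframe as rfm
    import reframe.utility.sanity as sn


    @rfm.simple_test
    class GpuMemBandwidthTest(rfm.RunOnlyRegressionTest):
        '''Hypothetical single-node check, one test instance per GPU.'''

        gpu_id = parameter(range(8))   # e.g. 8 GPUs in a DGX node
        valid_systems = ['*']
        valid_prog_environs = ['*']
        executable = 'gpu_bw_bench'    # invented benchmark binary

        @run_before('run')
        def select_gpu(self):
            # Pin the benchmark to this test instance's GPU
            self.env_vars['CUDA_VISIBLE_DEVICES'] = str(self.gpu_id)

        @sanity_function
        def validate(self):
            return sn.assert_found(r'bandwidth', self.stdout)

        @performance_function('GB/s')
        def d2d_bw(self):
            return sn.extractsingle(r'bandwidth:\s+(\S+)\s+GB/s',
                                    self.stdout, 1, float)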
18:59.000 --> 19:00.000 So we have these two types of tests. 19:00.000 --> 19:08.000 And actually, the only people in our team that use the ReFrame CLI are me and Vasileios, 19:08.000 --> 19:11.000 and our users actually use GitLab CI. 19:11.000 --> 19:18.000 They say: I want to run on this cluster, ABC; I want to run the single-node flavor, and I want the short version, 19:18.000 --> 19:21.000 so like 30 minutes; and they click "run pipeline" in GitLab CI, 19:21.000 --> 19:29.000 and boom, they get the per-node runs, and if they get a node failing, they can click on it and look at the ReFrame log. 19:29.000 --> 19:37.000 So that's the way we integrated the ReFrame CLI tool into something that our admins can use, without requiring them to know ReFrame. 19:38.000 --> 19:43.000 And ReFrame even supports JUnit export, and GitLab CI also supports importing JUnit reports. 19:43.000 --> 19:52.000 So you can click on the node and — actually, this is the ReFrame log directly; you see the ReFrame log shown directly to the user right here. 19:52.000 --> 19:57.000 And I think that's the last slide — right on time. 19:57.000 --> 20:03.000 I think this is great, because it allows us to have more insights. 20:03.000 --> 20:08.000 When we run in GitLab CI, we populate the database, 20:08.000 --> 20:17.000 and then when people in our team ask us, "hey, can you compare between this NVIDIA driver version and that driver version, to verify everything is fine?", 20:17.000 --> 20:22.000 we just run one command and give them the table in ASCII-art format. 20:22.000 --> 20:29.000 But obviously there are a lot of next steps: to get more insights into the statistics, to get more comparisons of the statistics. 20:29.000 --> 20:33.000 A big open question is also how we automatically make that more accessible to users, as we did for GitLab CI, 20:33.000 --> 20:39.000 and also the latency of the queries is still a bit slow, 20:39.000 --> 20:43.000 so that's something that Vasileios will work on. 20:43.000 --> 20:45.000 Thank you. 20:46.000 --> 20:48.000 Thank you.
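A hypothetical sketch of the kind of GitLab CI job described above; the job name, paths, and test selection are invented, while the JUnit wiring uses ReFrame's --report-junit option and GitLab's junit artifact report:

    run-single-node-short:
      # Assumes a runner with access to the cluster's Slurm frontend
      script:
        - >
          reframe -C config/cluster.py -c checks/
          -n 'single_node.*short' -r --report-junit=report.xml
      artifacts:
        when: always          # keep the report even when nodes fail
        reports:
          junit: report.xml   # GitLab renders per-test pass/fail from this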
20:54.000 --> 20:57.000 Any questions for Felix and Vasileios? Yeah. 20:57.000 --> 21:11.000 Can ReFrame handle the case where your CI might not have identical nodes — so it might be a worse CPU on one, for example? 21:11.000 --> 21:19.000 In this case — ah yes, so the question is: do we support, can we support, in ReFrame, the case where we have multiple types of nodes? 21:19.000 --> 21:20.000 Yeah. 21:20.000 --> 21:29.000 We do support this use case: in ReFrame you can already have multiple performance targets, saying if the node is this type, you get this performance target; if it's that type, that performance target. 21:29.000 --> 21:43.000 And for this feature, you can add arbitrary metadata to the database, saying: I want to run this workload, and I'm going to add an arbitrary tag called, say, RWA1; 21:43.000 --> 21:47.000 and then if you have a different node, you'll say: I want to run with RWA2; 21:47.000 --> 21:52.000 and then you can ask this feature: compare all the results just on RWA1. 21:52.000 --> 21:56.000 So, yeah — anything to add? 21:56.000 --> 22:03.000 No — also, from the ReFrame test side, you can support multiple clusters at the same time, 22:03.000 --> 22:08.000 and then, in your tests, you can have constraints for your tests; 22:08.000 --> 22:18.000 you can say, for example, this test is for GPUs, and then ReFrame will automatically select your test only for a configuration that has, for example, a GPU. 22:18.000 --> 22:21.000 So, yeah, that's it. 22:21.000 --> 22:25.000 Were there other questions? Yeah, any more questions? 22:29.000 --> 22:33.000 One — one, and then... 22:33.000 --> 22:35.000 Thank you very much. 22:35.000 --> 22:39.000 Thank you. Thank you.