WEBVTT 00:00.000 --> 00:05.000 Can you hear me? 00:05.000 --> 00:09.000 Everybody can hear me? 00:09.000 --> 00:12.000 So hi, I'm Anton Khirnov. 00:12.000 --> 00:16.000 I am, or maybe I was, an FFmpeg developer. 00:16.000 --> 00:19.000 I work with FFlabs, or maybe I used to. 00:19.000 --> 00:25.000 And this past year, I implemented multi-view decoding in avcodec. 00:25.000 --> 00:29.000 And in the ffmpeg CLI, the transcoder tool. 00:29.000 --> 00:33.000 In this talk, I will tell you: what is multi-view? 00:33.000 --> 00:35.000 Why might you care? 00:35.000 --> 00:38.000 Why might you care even if you don't care about multi-view? 00:38.000 --> 00:45.000 And some interesting technical aspects, hopefully interesting, of this work. 00:45.000 --> 00:52.000 It was sponsored by Vimeo and Meta, so thanks for making this possible. 00:52.000 --> 01:00.000 It has wide implications beyond just multi-view, but multi-view in itself is also quite interesting. 01:00.000 --> 01:03.000 So, to start, what is multi-view? 01:03.000 --> 01:06.000 I think the picture really says it all. 01:06.000 --> 01:14.000 You have two or maybe more video streams that are kind of independent, but not really. 01:14.000 --> 01:20.000 So, they are independent in the sense that you treat them as two parallel video streams, 01:20.000 --> 01:22.000 but there's a lot of redundancy. 01:22.000 --> 01:26.000 So, if you look really, really closely, you probably can't see it from there, 01:26.000 --> 01:30.000 but they are actually not the same. 01:30.000 --> 01:31.000 They are actually different. 01:31.000 --> 01:35.000 And the canonical example of multi-view is stereoscopic 3D. 01:36.000 --> 01:44.000 So, this is the left-eye view, this is the right-eye view, 01:44.000 --> 01:50.000 so, yeah, that's the way people generally use this, 01:50.000 --> 01:53.000 but you can do other things with it. 01:53.000 --> 01:55.000 So, now you want to code this thing. 01:55.000 --> 01:58.000 The naive way: you code two video streams. 01:58.000 --> 02:02.000 This is very simple and obvious, but your bitrate is doubled. 02:02.000 --> 02:04.000 So, you don't want that. 02:04.000 --> 02:08.000 So, what you do want is to make use of the redundancy, 02:08.000 --> 02:12.000 and somehow predict one of the images from the other, 02:12.000 --> 02:16.000 and just encode the differences. 02:16.000 --> 02:20.000 You could use some kind of hacks like, well, maybe you put them side by side, 02:20.000 --> 02:24.000 and use intra-frame prediction, 02:24.000 --> 02:28.000 or you can interleave the frames, 02:28.000 --> 02:30.000 and put them into one stream. 02:30.000 --> 02:33.000 These are possible. 02:33.000 --> 02:35.000 People sometimes do them. 02:35.000 --> 02:37.000 But it's quite hacky, 02:37.000 --> 02:40.000 and, for instance, it forces you to 02:40.000 --> 02:42.000 always decode both of them, 02:42.000 --> 02:44.000 which you don't always want. 02:44.000 --> 02:46.000 Maybe you sometimes want just one of them. 02:46.000 --> 02:51.000 So, multi-view is a set of tools to deal with that. 02:51.000 --> 02:55.000 So, the thing I actually implemented is called MV-HEVC, 02:55.000 --> 02:58.000 which is multi-view for the HEVC codec, 02:58.000 --> 03:01.000 also known as H.265. 03:01.000 --> 03:07.000 As you all know, H.265 is the successor to AVC, or H.264, 03:07.000 --> 03:10.000 which we all know and love, the best codec ever, 03:10.000 --> 03:12.000 objectively true.
03:12.000 --> 03:16.000 In H.264, there used to be a thing which was called MVC, 03:16.000 --> 03:19.000 which was multi-view coding for H.264. 03:19.000 --> 03:22.000 I think it was used in 3D Blu-rays, 03:22.000 --> 03:25.000 and we had a longstanding feature request to implement that 03:25.000 --> 03:27.000 in avcodec, and that never happened. 03:27.000 --> 03:31.000 For a bunch of reasons, which I will elaborate on later. 03:33.000 --> 03:35.000 But so, yeah, it existed. 03:35.000 --> 03:37.000 It was used a little bit in the wild, 03:37.000 --> 03:40.000 but it's not really supported very much. 03:40.000 --> 03:43.000 So, in HEVC, there is a similar thing, 03:43.000 --> 03:48.000 which people call MV-HEVC. 03:48.000 --> 03:53.000 And it is a way of doing exactly what I showed you 03:53.000 --> 03:58.000 on the previous slide: packing multiple semi-independent streams 03:58.000 --> 04:01.000 into a single HEVC bitstream such that the streams 04:01.000 --> 04:04.000 can predict from each other, but otherwise you can sort of 04:04.000 --> 04:08.000 treat them as independent, which is exactly what you want. 04:08.000 --> 04:12.000 It is based on the multi-layer extensions. 04:12.000 --> 04:16.000 So, I think in H.264, all of this was separate: multi-view 04:16.000 --> 04:20.000 was a separate thing, scalability was a separate thing, 04:20.000 --> 04:22.000 or dancing animated ponies -- 04:22.000 --> 04:24.000 each was a separate thing. 04:24.000 --> 04:26.000 In HEVC, I think they unified it: 04:26.000 --> 04:31.000 there is sort of a general multi-layer extensions 04:31.000 --> 04:34.000 specification, and then it's specialized 04:34.000 --> 04:37.000 into multi-view, scalable coding, alpha, 04:37.000 --> 04:40.000 some kind of depth-texture 04:40.000 --> 04:42.000 thingy, I didn't look into it. 04:42.000 --> 04:45.000 So there's a bunch of purposes 04:45.000 --> 04:47.000 it can be used for, but generally, 04:47.000 --> 04:50.000 people care about multi-view, and about alpha. 04:50.000 --> 04:55.000 I think some people care about scalable, who knows. 04:55.000 --> 04:59.000 If you remember what a NAL unit header 04:59.000 --> 05:03.000 looks like, which you should, there is a field in it, 05:03.000 --> 05:05.000 which is always zero. 05:05.000 --> 05:07.000 And if it's non-zero, you scream and run away, 05:07.000 --> 05:10.000 and the point of this work is that you don't scream, 05:10.000 --> 05:14.000 you don't run away, you face it, 05:15.000 --> 05:18.000 and do something useful with it. 05:18.000 --> 05:21.000 The full specification is insanely complex, 05:21.000 --> 05:24.000 because all of these things can be used together. 05:24.000 --> 05:28.000 So you can sort of have a multi-view, scalable stream 05:28.000 --> 05:32.000 with alpha, which has up to 63 layers, 05:32.000 --> 05:34.000 one of which is the base one. 05:34.000 --> 05:38.000 That is the layer with ID zero, which can be 05:38.000 --> 05:41.000 decoded on its own by a decoder that 05:41.000 --> 05:43.000 doesn't know anything about multi-layer, 05:43.000 --> 05:46.000 just ignores everything else, and decodes the base layer. 05:46.000 --> 05:49.000 But the other layers sort of predict from it, 05:49.000 --> 05:52.000 and there can be a complex dependency graph.
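(A quick aside on the header field mentioned above. The following is a generic illustration of the two-byte HEVC NAL unit header layout from the H.265 spec, not code from the decoder; the struct and function names are made up for illustration. The nuh_layer_id field is the one that is always zero in ordinary single-layer streams.)

#include <stdint.h>

/* Illustration only: the 2-byte HEVC NAL unit header per ITU-T H.265.
 * Bit layout: forbidden_zero_bit (1) | nal_unit_type (6) |
 *             nuh_layer_id (6) | nuh_temporal_id_plus1 (3). */
typedef struct NALHeader {
    unsigned nal_unit_type;
    unsigned nuh_layer_id;
    unsigned temporal_id;
} NALHeader;

static int parse_nal_header(const uint8_t buf[2], NALHeader *h)
{
    if (buf[0] & 0x80)                      /* forbidden_zero_bit must be 0 */
        return -1;
    h->nal_unit_type = (buf[0] >> 1) & 0x3f;
    h->nuh_layer_id  = ((buf[0] & 1) << 5) | (buf[1] >> 3);
    h->temporal_id   = (buf[1] & 0x07) - 1; /* nuh_temporal_id_plus1 - 1 */
    return 0;
}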
05:52.000 --> 05:55.000 And as far as I know, nothing supports that, 05:55.000 --> 05:57.000 not even the reference implementation; 05:57.000 --> 06:01.000 there's the base one, which only does the base layer, 06:01.000 --> 06:04.000 and there are like three forks of it, 06:04.000 --> 06:06.000 one of which does multi-view, another one 06:06.000 --> 06:09.000 does scalable, and another one does 3D, 06:09.000 --> 06:12.000 and I'm not sure which one does alpha, 06:12.000 --> 06:16.000 maybe somebody knows, and they are separate, 06:16.000 --> 06:19.000 and none of them can do all of these at once. 06:19.000 --> 06:21.000 But in principle, per the specification, 06:21.000 --> 06:23.000 you can do all of these together, 06:23.000 --> 06:25.000 and if you look at the specification, 06:25.000 --> 06:28.000 which I highly recommend, it's just completely insane. 06:28.000 --> 06:30.000 So of course we decided to not support 06:30.000 --> 06:32.000 any of that; we only support two layers, 06:32.000 --> 06:35.000 with the second one depending on the first. 06:35.000 --> 06:40.000 Although, with alpha being interesting for people, 06:40.000 --> 06:44.000 maybe there will be a use case where you have 06:44.000 --> 06:46.000 a multi-view stream with alpha. 06:46.000 --> 06:49.000 Somebody would have to create it, but I think 06:49.000 --> 06:53.000 the demand for this is driven by VR, 06:53.000 --> 06:57.000 so it probably has to be hardware that does this, 06:57.000 --> 06:59.000 so, probably not, but it would be fun. 06:59.000 --> 07:02.000 But so far, we can do two layers, 07:02.000 --> 07:07.000 and that's it. 07:07.000 --> 07:09.000 Why do you care? 07:09.000 --> 07:12.000 So, one possibility: you care about stereoscopic 3D. 07:12.000 --> 07:17.000 You have VR glasses, an Oculus Quest, 07:17.000 --> 07:19.000 an Apple Vision Pro, one of these things, 07:19.000 --> 07:24.000 and you really like to record and watch videos on them. 07:24.000 --> 07:27.000 So that's one possibility, that's the canonical use case. 07:27.000 --> 07:30.000 You might care about alpha; I didn't implement that, 07:30.000 --> 07:33.000 but I opened the door to that, and somebody else 07:33.000 --> 07:36.000 has already written the patches. 07:36.000 --> 07:40.000 So, that will be possible soon, probably. 07:40.000 --> 07:44.000 But more generally, multi-view decoding -- 07:44.000 --> 07:47.000 the reason why it was never implemented for H.264, 07:47.000 --> 07:50.000 or one of the reasons, and why it was hard to implement here, 07:50.000 --> 07:53.000 is that it challenges a bunch of assumptions we make internally, 07:53.000 --> 07:57.000 and also in the APIs, about how video is decoded. 07:57.000 --> 08:01.000 For instance, you have a single input packet, 08:01.000 --> 08:04.000 the coded HEVC data that you send to a decoder, 08:04.000 --> 08:07.000 and that contains all the views. 08:07.000 --> 08:09.000 So, it decodes into multiple frames. 08:09.000 --> 08:13.000 Two in our case, but we don't make that assumption in the API. 08:13.000 --> 08:15.000 So, in principle, N frames. 08:15.000 --> 08:20.000 So, that was not really supported in a bunch of ways before, 08:20.000 --> 08:23.000 now it is, and that has implications, 08:23.000 --> 08:26.000 so maybe it allows some things which were not possible before.
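(To make the packet-to-frames point concrete, here is a minimal sketch of a libavcodec decode loop that drains every frame a packet produces, rather than assuming one frame per packet. handle_frame() is a hypothetical caller-side function; error handling is abbreviated.)

#include <libavcodec/avcodec.h>

void handle_frame(const AVFrame *frame);   /* hypothetical caller code */

/* Send one packet, then drain all frames it produced. With MV-HEVC a
 * single packet can yield one frame per view, so the receive loop must
 * keep going until the decoder asks for more input. */
static int decode_packet(AVCodecContext *dec, const AVPacket *pkt, AVFrame *frame)
{
    int ret = avcodec_send_packet(dec, pkt);
    if (ret < 0)
        return ret;

    while (1) {
        ret = avcodec_receive_frame(dec, frame);
        if (ret == AVERROR(EAGAIN) || ret == AVERROR_EOF)
            return 0;          /* needs more input, or fully drained */
        if (ret < 0)
            return ret;        /* a real decoding error */

        handle_frame(frame);
        av_frame_unref(frame);
    }
}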
08:26.000 --> 08:29.000 And the other thing is that now you have a single decoder, 08:29.000 --> 08:33.000 which produces frames for several independent streams, 08:33.000 --> 08:36.000 which, again, has implications. 08:36.000 --> 08:39.000 It might allow some use cases which were not possible before. 08:39.000 --> 08:42.000 So, you might care even if you don't care about 3D. 08:42.000 --> 08:47.000 So, what was hard about implementing it? 08:47.000 --> 08:51.000 So, first, inside the HEVC decoder itself -- 08:51.000 --> 08:55.000 avcodec has generic code, 08:55.000 --> 08:58.000 which is codec-independent, and then below that, 08:58.000 --> 09:01.000 there is the decoder-specific stuff. 09:01.000 --> 09:05.000 So, this is about the decoder-specific stuff. 09:05.000 --> 09:08.000 The main thing that you encounter, or the first one, 09:08.000 --> 09:12.000 is that a bunch of state that used to be per codec 09:12.000 --> 09:14.000 is now per layer. 09:14.000 --> 09:17.000 So, you have your decoder context, which is a big struct 09:17.000 --> 09:19.000 with a bunch of state in it, 09:19.000 --> 09:22.000 and now a lot of that state is per layer. 09:22.000 --> 09:25.000 So, you need to have multiple of these contexts, 09:25.000 --> 09:28.000 one for each layer we want to decode. 09:28.000 --> 09:31.000 A common approach, I don't know about your project, 09:31.000 --> 09:35.000 but an approach that people very often take, 09:35.000 --> 09:40.000 is that you add a bunch of children, 09:40.000 --> 09:44.000 a bunch of copies of the same struct, inside it, 09:44.000 --> 09:49.000 which seems like it saves you work. 09:49.000 --> 09:52.000 Because you don't really have to do anything, 09:52.000 --> 09:54.000 you just do that very simple thing, 09:54.000 --> 09:59.000 and from now on, some things are per codec, 09:59.000 --> 10:01.000 and some are per layer. 10:01.000 --> 10:03.000 This is a horrible, horrible, evil 10:03.000 --> 10:06.000 obfuscation method, which you should never ever do, 10:06.000 --> 10:08.000 and if you do that, please stop. 10:08.000 --> 10:11.000 Because immediately, when you do that, 10:11.000 --> 10:14.000 you lose the information about which fields of the struct 10:14.000 --> 10:16.000 are meaningful in the parent, 10:16.000 --> 10:18.000 and which are meaningful in the child. 10:18.000 --> 10:21.000 Now, everybody who's reading the code later has to reverse 10:21.000 --> 10:26.000 engineer the struct, check all the places 10:26.000 --> 10:28.000 where some field is used, 10:28.000 --> 10:32.000 and only then do you discover which is which. 10:32.000 --> 10:35.000 And you might think, oh, but if I document it, 10:35.000 --> 10:39.000 surely this will fix it. Ha ha. 10:39.000 --> 10:41.000 Of course, nobody ever documents things, 10:41.000 --> 10:44.000 and if you do, it will get out of date 10:44.000 --> 10:46.000 eventually, because somebody changes the code 10:46.000 --> 10:48.000 and doesn't update the documentation. 10:48.000 --> 10:52.000 So documentation helps a little, not that much. 10:52.000 --> 10:55.000 But also, another problem is 10:55.000 --> 10:57.000 you have a bunch of dead fields. 10:57.000 --> 10:59.000 In the parent context and in the children, 10:59.000 --> 11:01.000 you have a bunch of fields that are just sitting there, 11:01.000 --> 11:04.000 wasting memory, wasting cache, and doing nothing.
11:04.000 --> 11:07.000 And in the end, the amount of work it saves you 11:07.000 --> 11:09.000 is very little. 11:09.000 --> 11:12.000 It looks like a lot, but not really, 11:12.000 --> 11:15.000 and it's work that's very straightforward. 11:15.000 --> 11:16.000 You don't have to think about it. 11:16.000 --> 11:19.000 In the future, probably, ChatGPT will do it. 11:19.000 --> 11:22.000 So, please never ever do this. 11:22.000 --> 11:25.000 The thing you actually should do is 11:25.000 --> 11:27.000 you check all the fields. 11:27.000 --> 11:30.000 You find out which ones are actually per layer, 11:30.000 --> 11:33.000 which you have to do anyway in the end. 11:33.000 --> 11:36.000 In this approach, you just do it more systematically. 11:36.000 --> 11:38.000 And then you add a per-layer context. 11:38.000 --> 11:42.000 You move the things into that per-layer context, 11:42.000 --> 11:45.000 one by one, and hopefully you're done. 11:45.000 --> 11:49.000 In this work, this was the majority of the work by patch volume, 11:49.000 --> 11:53.000 but it was mostly really straightforward. 11:53.000 --> 11:56.000 If your code is really crappy and entangled and spaghetti-fied, 11:56.000 --> 11:59.000 this might not be trivial, because moving one thing 11:59.000 --> 12:01.000 can depend on some other thing, 12:01.000 --> 12:03.000 which happened here to some extent, 12:03.000 --> 12:06.000 but not as much as it could have. 12:06.000 --> 12:11.000 For instance, the H.264 decoder has more history. 12:11.000 --> 12:12.000 Let's say. 12:12.000 --> 12:17.000 And doing the same thing there would be more complicated. 12:17.000 --> 12:21.000 If you feel like tackling that problem, be prepared for some pain. 12:21.000 --> 12:25.000 So, that was the biggest thing I had to do. 12:25.000 --> 12:29.000 Another thing was the frame output logic. 12:29.000 --> 12:32.000 As you all know, HEVC, 12:32.000 --> 12:35.000 and also AVC, they have frame reordering. 12:35.000 --> 12:38.000 So, when you decode a frame, you don't output it immediately. 12:38.000 --> 12:40.000 You put it in a decoded picture buffer, 12:40.000 --> 12:44.000 and then, depending on some conditions, 12:44.000 --> 12:46.000 you look at the decoded picture buffer, 12:46.000 --> 12:49.000 you select some specific frame from it, 12:49.000 --> 12:54.000 and then you maybe output it. 12:54.000 --> 12:58.000 One complicating factor is that there are things 12:58.000 --> 13:00.000 which are called sequences. 13:00.000 --> 13:06.000 A sequence is basically a segment of coded video 13:06.000 --> 13:08.000 which has the same parameters. 13:08.000 --> 13:14.000 Like a single video that was encoded all at once, 13:14.000 --> 13:15.000 for instance. 13:15.000 --> 13:18.000 And this can change at any time. 13:18.000 --> 13:20.000 So, you can concatenate a bunch of videos, 13:20.000 --> 13:24.000 and you get two sequences, or multiple sequences. 13:24.000 --> 13:28.000 And whenever you switch sequences, 13:28.000 --> 13:33.000 you have a bunch of frames buffered for output later. 13:33.000 --> 13:35.000 And so, you could be decoding a frame 13:35.000 --> 13:38.000 from one sequence, 13:38.000 --> 13:40.000 but still be outputting frames from a previous sequence, 13:40.000 --> 13:42.000 or in more pathological cases, 13:42.000 --> 13:45.000 you could be two sequences back, or 16 sequences back. 13:45.000 --> 13:48.000 Probably not 16, I think 15 is the limit.
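(Going back to the per-layer split described above, here is a generic sketch of the two approaches. The struct names are made up for illustration and are not the actual HEVC decoder's; the point is only the difference between blindly embedding copies of the whole context and splitting out an explicit per-layer struct.)

/* Illustration with hypothetical names, not the real decoder structs. */

/* The lazy approach: embed copies of the whole context and let readers
 * guess which fields are meaningful in the parent vs. the children. */
typedef struct BadDecContext {
    int    width, height;              /* per layer? per decoder? unclear */
    void  *dpb;                        /* ditto */
    struct BadDecContext *layers;      /* copies of *everything* above */
    int    nb_layers;
} BadDecContext;

/* The explicit approach: identify the state that is genuinely per layer
 * and move only that into a dedicated struct. */
typedef struct LayerContext {
    void *dpb;                         /* each layer keeps its own DPB */
    void *per_layer_tables;
} LayerContext;

typedef struct DecContext {
    int          width, height;        /* genuinely shared by all layers */
    LayerContext layers[2];            /* one per decoded layer */
    int          nb_layers;
} DecContext;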
13:48.000 --> 13:52.000 If you really want pain. 13:52.000 --> 13:57.000 And we had a lot of complicated logic to handle that. 13:57.000 --> 14:01.000 And what I would have had to do now -- 14:01.000 --> 14:04.000 because now, not only do we have to handle this, 14:04.000 --> 14:06.000 but we also have two views. 14:06.000 --> 14:09.000 And when you switch sequences, you can also switch the number of views. 14:09.000 --> 14:12.000 You can switch from single view to multi-view 14:12.000 --> 14:15.000 and back, or you can switch the positions of the views, 14:15.000 --> 14:18.000 or the properties, blah, blah, blah. 14:18.000 --> 14:21.000 So, this had to be added on top of that logic, 14:21.000 --> 14:24.000 which would be very complicated. 14:24.000 --> 14:29.000 And because I'm not smart enough to think about all that, 14:29.000 --> 14:32.000 I noticed that we actually don't have to do any of it. 14:32.000 --> 14:35.000 Because all that complicated logic is only there 14:35.000 --> 14:40.000 because of the constraint that a single input packet 14:40.000 --> 14:42.000 has to output at most one frame. 14:42.000 --> 14:45.000 But we don't have that constraint anymore. 14:45.000 --> 14:48.000 So, I changed the logic to output multiple frames at once, 14:48.000 --> 14:51.000 which we can do, and all of that horror goes away. 14:51.000 --> 14:55.000 So, this work actually simplified a lot of things, 14:55.000 --> 14:58.000 even though it now still has to interleave 14:58.000 --> 15:01.000 frames from multiple views. 15:01.000 --> 15:04.000 Now, what it does when it encounters a sequence switch is 15:04.000 --> 15:07.000 it just flushes the decoded picture buffer completely, 15:07.000 --> 15:10.000 which we can do, which is great. 15:10.000 --> 15:14.000 I also noticed frame threading is inefficient for multi-view. 15:14.000 --> 15:17.000 If you care, you might want to fix that. 15:17.000 --> 15:19.000 That's welcome. 15:19.000 --> 15:24.000 Now, moving a layer up, into the decoder-generic code, 15:24.000 --> 15:29.000 there were also a bunch of issues. 15:29.000 --> 15:36.000 As I said, we have a single input packet, 15:36.000 --> 15:39.000 and we need it to produce multiple frames, which need to be output, 15:39.000 --> 15:42.000 which is fine as far as the public API is concerned, 15:42.000 --> 15:48.000 because the new API, which is 10 years old at this point, 15:48.000 --> 15:52.000 was added by the infamous wm4 15:53.000 --> 16:00.000 exactly for this, to handle arbitrary M-to-N packet-to-frame mapping. 16:00.000 --> 16:05.000 So, on the public API level, this is fine, but internally -- 16:05.000 --> 16:09.000 internally, frame threading did not support that. 16:09.000 --> 16:14.000 Frame threading was working on the old API model. 16:14.000 --> 16:19.000 So, it could only do one packet to at most one frame. 16:19.000 --> 16:21.000 So, I had to change that. 16:21.000 --> 16:27.000 I had to port frame threading to the new API -- new, though not so new anymore. 16:27.000 --> 16:33.000 Actually, I started doing that back in 2017 for my work on MVC, 16:33.000 --> 16:37.000 which I never finished, but 16:37.000 --> 16:42.000 most of the work was, in theory, done, 16:42.000 --> 16:45.000 but actually polishing it was quite complicated, 16:45.000 --> 16:49.000 because there was a certain unnamed decoder 16:49.000 --> 16:53.000 which abused frame threading quite a lot. 16:53.000 --> 16:55.000 It did a bunch of things wrong.
16:55.000 --> 16:57.000 It did that thing which I told you not to do. 16:57.000 --> 16:59.000 It did exactly this. 16:59.000 --> 17:05.000 So, I had to reverse engineer it and undo that. 17:05.000 --> 17:09.000 It did that for slice threading, and just to make it readable and possible 17:09.000 --> 17:11.000 to understand for myself, I had to fix that, 17:11.000 --> 17:15.000 which incidentally made it 4% faster in single-threaded decoding, 17:15.000 --> 17:18.000 which I didn't intend, but yeah. 17:18.000 --> 17:24.000 But also, it had some hacks, like the generic codec-independent frame threading code 17:24.000 --> 17:30.000 would have code like: if the codec is this one, do something insane. 17:30.000 --> 17:35.000 And by insane I mean: reduce the number of threads by one. 17:35.000 --> 17:38.000 So, if you had two threads, it was running single-threaded. 17:38.000 --> 17:44.000 And there were also some races, found by ThreadSanitizer, and so on. 17:44.000 --> 17:49.000 So, in order to implement multi-view for HEVC, 17:49.000 --> 17:53.000 I had to fix this decoder, unfortunately, 17:53.000 --> 17:56.000 or maybe fortunately, if you care about it, because now it's faster, 17:56.000 --> 17:58.000 now it doesn't have any races. 17:58.000 --> 18:02.000 It's faster single-threaded, it's faster in frame-threaded mode. 18:02.000 --> 18:05.000 Now, frame threading is actually always faster than slice threading, 18:05.000 --> 18:12.000 which makes sense, it should always be faster, otherwise there's no reason to use it. 18:12.000 --> 18:14.000 So, yep. 18:14.000 --> 18:19.000 One thing that helped me a lot here is this new API 18:19.000 --> 18:22.000 we have, which is called RefStruct, which is new. 18:22.000 --> 18:25.000 It's about a year old now, I think. 18:25.000 --> 18:27.000 It was written by Andreas, thank you, Andreas. 18:27.000 --> 18:31.000 It's great, and it recently became public, so you can use it. 18:31.000 --> 18:36.000 It's an API for reference-counted structs with very little overhead, 18:36.000 --> 18:42.000 and very little boilerplate on top of it. 18:42.000 --> 18:44.000 So, it's very convenient. 18:44.000 --> 18:47.000 So, I highly recommend it, it's great. 18:47.000 --> 18:52.000 So, I had to fix that, and then finish this patch, 18:52.000 --> 18:58.000 and frame threading is now finally able to handle multiple output frames per packet. 18:58.000 --> 19:01.000 All that for just a small thing. 19:01.000 --> 19:06.000 Another challenge, or a bunch of challenges, is the public API part. 19:06.000 --> 19:11.000 As I said, the output part is not problematic, 19:11.000 --> 19:15.000 because we do support multiple output frames per packet. 19:15.000 --> 19:17.000 We have done that for a long time. 19:17.000 --> 19:20.000 I think many callers actually don't get that right. 19:20.000 --> 19:24.000 I saw an example recently that assumes that one packet produces at most one frame. 19:24.000 --> 19:29.000 So, all these callers are broken, but that's their problem, unfortunately. 19:29.000 --> 19:31.000 But it's not ours. 19:31.000 --> 19:36.000 The problem we actually do have is that all the multi-layer properties are per sequence. 19:36.000 --> 19:42.000 So, this basically never happens, but it had to be implemented properly, of course. 19:42.000 --> 19:48.000 So, in principle, you have to consider the case where a multi-view video is concatenated with a single-view one, 19:48.000 --> 19:53.000 or you can have a multi-view video with different properties.
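(A brief aside on the RefStruct API mentioned a moment ago: a rough sketch of what using it looks like. The function names below, av_refstruct_allocz(), av_refstruct_ref() and av_refstruct_unref(), are my reading of the now-public API in libavutil/refstruct.h; check the header for the exact names and semantics before relying on this.)

#include <libavutil/refstruct.h>

/* Hypothetical example struct; RefStruct makes it reference-counted
 * without any AVBufferRef boilerplate around it. */
typedef struct SharedParams {
    int width, height;
} SharedParams;

static void refstruct_example(void)
{
    SharedParams *a = av_refstruct_allocz(sizeof(*a)); /* refcount 1 */
    if (!a)
        return;
    a->width  = 1920;
    a->height = 1080;

    SharedParams *b = av_refstruct_ref(a);  /* second owner, no deep copy */

    av_refstruct_unref(&a);  /* drop one reference, a is set to NULL */
    av_refstruct_unref(&b);  /* last reference gone, the struct is freed */
}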
19:53.000 --> 20:02.000 So, you need to tell the caller what view IDs there are, what view positions there are, which view is right or left, 20:02.000 --> 20:04.000 and this can change dynamically. 20:04.000 --> 20:09.000 So, for that, I use the get_format callback, which is named that way for historical reasons. 20:09.000 --> 20:14.000 It is currently used to configure hardware acceleration, 20:14.000 --> 20:22.000 but now it's also used to negotiate multi-view properties with the caller. 20:22.000 --> 20:28.000 So, when that callback is called, the caller gets the information about the stream, 20:28.000 --> 20:34.000 and can tell us that it wants either one or both views to be decoded, 20:34.000 --> 20:42.000 or in principle as many as there are, which can be up to 63, though in the API there's no limit. 20:42.000 --> 20:44.000 Well, INT_MAX. 20:44.000 --> 20:49.000 Also, I added array-type options, because we want to export multiple view IDs 20:49.000 --> 20:55.000 and multiple view positions, and previously we didn't have an array-type option. 20:55.000 --> 21:01.000 So, what was done before was we communicated by using comma-separated strings, 21:01.000 --> 21:05.000 and parsing strings in C is great fun, everybody loves it, 21:05.000 --> 21:08.000 but because I hate fun, I took that away from you. 21:08.000 --> 21:11.000 So, yeah. 21:11.000 --> 21:17.000 And this will also be used heavily in other places, like in libavfilter, where we do that, 21:17.000 --> 21:20.000 and we do it everywhere all the time. 21:20.000 --> 21:25.000 And the frames that are produced by the decoder have side data, 21:25.000 --> 21:30.000 which tells you which view it is. 21:30.000 --> 21:34.000 So, that's quite simple, and then the caller can deal with it as it likes. 21:34.000 --> 21:38.000 I will skip that, because I'm actually going very slowly. 21:38.000 --> 21:43.000 That was a recap of my last year's talk, which you can look up. 21:44.000 --> 21:49.000 There is native support for multi-view in the CLI, the transcoder tool. 21:49.000 --> 21:56.000 The intention is not to just, well, have the codec output all the frames interleaved, 21:56.000 --> 22:02.000 and then let the user deal with it, which would be completely painful, because the users don't know anything. 22:02.000 --> 22:08.000 And so, because I do, I extended stream specifiers, which everybody loves, 22:08.000 --> 22:13.000 with view specifiers. So before, you could say: I want to decode the fourth video stream. 22:13.000 --> 22:18.000 Now, you can say: I want the left view of the fourth video stream, and you can pipe that to an output stream, 22:18.000 --> 22:23.000 or to a complex filter graph; typically you want to put the frames side by side, 22:23.000 --> 22:28.000 or mux them into different streams, or different files, or whatever. 22:28.000 --> 22:35.000 So, that's up to you. That can be done with the new view specifiers. 22:35.000 --> 22:40.000 One feature people might be interested in: 22:40.000 --> 22:46.000 so, now, a single decoder in the CLI can produce multiple streams. 22:46.000 --> 22:49.000 It is technically possible. 22:49.000 --> 22:54.000 This could be generalized, for instance, to support closed captions, and other features like that. 22:54.000 --> 22:57.000 You could have a video stream which has embedded closed captions.
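(Going back to the per-frame side data: a small sketch of how a caller might tell the views apart. It assumes the side data type is AV_FRAME_DATA_VIEW_ID and that its payload is a single int holding the view ID; treat that as my assumption and check the libavutil documentation for the exact semantics.)

#include <libavutil/frame.h>

/* Return the view ID attached to a decoded frame, or -1 if the frame
 * carries none (e.g. an ordinary single-layer stream). */
static int frame_view_id(const AVFrame *frame)
{
    const AVFrameSideData *sd =
        av_frame_get_side_data(frame, AV_FRAME_DATA_VIEW_ID);
    if (!sd || sd->size < sizeof(int))
        return -1;
    return *(const int *)sd->data;
}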
22:57.000 --> 23:08.000 And before, we supported extracting them by using insane hacks: you use the lavfi input device pseudo-demuxer, 23:08.000 --> 23:17.000 which uses the movie source filter, which opens a file and somehow decodes the video internally inside the filter graph, 23:17.000 --> 23:25.000 and extracts the video stream and the closed captions, and then gives them back to you as a demuxer, 23:25.000 --> 23:28.000 which sort of works, but just no. 23:28.000 --> 23:37.000 And now, the CLI can, in principle -- it's not implemented, but it could in principle be done straightforwardly. 23:37.000 --> 23:40.000 And other things like that. 23:40.000 --> 23:43.000 Okay, I think I'm done. Thank you. 23:43.000 --> 23:50.000 Yes, Kevin? 23:50.000 --> 23:55.000 The splitting out closed captions notion, wouldn't that mean that a decoder -- 23:55.000 --> 24:00.000 you know, right now we have video decoders, audio decoders, and subtitle decoders. 24:00.000 --> 24:06.000 You don't have a decoder that can produce both video and subtitle data. 24:06.000 --> 24:14.000 Not on the avcodec level; on the avcodec level, I imagine it would give you a frame with side data. 24:14.000 --> 24:16.000 Oh, sorry. I have to repeat the question. 24:16.000 --> 24:19.000 Could you repeat the question, sir? 24:19.000 --> 24:27.000 The question was, you know, that in FFmpeg today, the decoders are typically categorized as video or audio. 24:27.000 --> 24:28.000 Right, right, right. 24:28.000 --> 24:34.000 So the question is: in avcodec, a decoder is a video or audio or a subtitle decoder. 24:34.000 --> 24:37.000 How do we make that work with closed captions? 24:37.000 --> 24:41.000 And the answer is, we don't do that in avcodec. 24:41.000 --> 24:44.000 We do that in the CLI. That's my point. 24:44.000 --> 24:47.000 Right, so a decoder remains a video decoder. 24:47.000 --> 24:50.000 It gives you a frame, and the frame has side data with closed captions. 24:50.000 --> 24:53.000 And then the CLI pretends that it's actually two streams. 24:53.000 --> 24:56.000 It's a video stream and a subtitle stream. 24:56.000 --> 25:00.000 So the decoder is actually producing multiple output streams? 25:00.000 --> 25:05.000 The avcodec decoder isn't. The CLI decoder object is. 25:05.000 --> 25:08.000 Those are different things. 25:08.000 --> 25:10.000 Other questions? 25:10.000 --> 25:11.000 Yep. 25:11.000 --> 25:19.000 I'm wondering if you plan to expand this multi-view support to DVD angles. 25:19.000 --> 25:22.000 What about that? 25:22.000 --> 25:29.000 So the question is, am I planning to extend this to DVD angles? 25:29.000 --> 25:33.000 I don't think it's... 25:33.000 --> 25:36.000 I am not sure how DVD angles actually work, 25:36.000 --> 25:40.000 so I am not sure that this would be applicable to them. 25:40.000 --> 25:45.000 We do have a lot of activity on DVD demuxing right now. 25:45.000 --> 25:50.000 Maybe you should ask the author of that code. 25:51.000 --> 25:53.000 Sorry. 25:53.000 --> 25:55.000 Other questions? 25:55.000 --> 25:58.000 Yeah, Victoria. 25:58.000 --> 26:03.000 Can we do hardware-accelerated decoding of MV-HEVC? 26:03.000 --> 26:06.000 Sadly, the windows do not open. 26:07.000 --> 26:10.000 Yep. 26:18.000 --> 26:23.000 So the question was, how do we hardware-accelerate MV-HEVC? 26:23.000 --> 26:24.000 I don't know. 26:24.000 --> 26:26.000 I didn't try. 26:27.000 --> 26:30.000 Yeah, there's an encoder, at least on that one device.
26:30.000 --> 26:35.000 So it enables, with the MV-HEVC mode, encoding directly into that kind of content. 26:35.000 --> 26:40.000 For decoding, in principle, I think the low-level hardware 26:40.000 --> 26:42.000 doesn't really care about multi-view. 26:42.000 --> 26:44.000 It only gets the reference frames. 26:44.000 --> 26:49.000 So the difference in decoding, at the actual pixel-level 26:49.000 --> 26:54.000 or macroblock-level decoding, is that you have more frames in your reference 26:54.000 --> 26:58.000 picture lists, or set, or whatever it is. 26:58.000 --> 27:01.000 So if the high-level code just adds one more frame to that, 27:01.000 --> 27:03.000 the hardware doesn't care, in theory. 27:03.000 --> 27:04.000 I didn't try. 27:04.000 --> 27:09.000 But I would hope that it can work. 27:12.000 --> 27:13.000 Yep. 27:13.000 --> 27:17.000 I was wondering what kind of requirements there are 27:17.000 --> 27:22.000 for multi-view. 27:22.000 --> 27:29.000 Do the two inputs need to be the same size? 27:29.000 --> 27:33.000 The same aspect ratio, you know, 27:33.000 --> 27:36.000 are they always left and right, or can they be up and down, 27:36.000 --> 27:40.000 or in the corner? 27:40.000 --> 27:46.000 So the question is whether there are restrictions on dimensions 27:46.000 --> 27:52.000 and aspect ratio and positions, and the answer is: kind of. 27:52.000 --> 27:57.000 So positions are just metadata, right? 27:57.000 --> 27:59.000 So it's just a field. 27:59.000 --> 28:01.000 It's actually optional. 28:01.000 --> 28:06.000 You can have a stream that doesn't tell you what the position is. 28:06.000 --> 28:13.000 Actually, the spec doesn't mandate that it has to be somehow oriented. 28:13.000 --> 28:16.000 That it really has to be two eyes. 28:16.000 --> 28:17.000 It could be anything. 28:17.000 --> 28:22.000 The interpretation is really in some metadata, which may or may not be present. 28:22.000 --> 28:26.000 I think the allowed positions are left-right and top-bottom. 28:26.000 --> 28:30.000 I don't really remember if it can have more complex orientations. 28:30.000 --> 28:34.000 For formats and resolutions and aspect ratios, in principle, 28:34.000 --> 28:35.000 they don't have to match. 28:35.000 --> 28:42.000 And actually the spec allows you to have different sizes for the different views. 28:42.000 --> 28:52.000 But we don't support that, because that is insane. 28:52.000 --> 28:57.000 More questions? 28:57.000 --> 28:59.000 Thank you, then.