WEBVTT
00:00.000 --> 00:07.840 Hello.
00:07.840 --> 00:12.240 So our next speaker is Marco Castelluccio.
00:12.240 --> 00:18.080 He's a self-described hacker and an engineer.
00:18.080 --> 00:25.120 He is also an engineering manager at Mozilla, where he is leading the CI and Quality Tools,
00:25.120 --> 00:27.760 PDF, and translation teams.
00:27.760 --> 00:33.200 Today, he'll be talking to us about how Mozilla is using machine learning and large language
00:33.200 --> 00:41.360 models to develop an experimental tool to make the code review experience more pleasant and easy
00:41.360 --> 00:42.800 for developers.
00:42.800 --> 00:47.680 He will also be talking about the architecture and the iterative development process of
00:47.680 --> 00:51.200 the tool, and the future plans for it.
00:51.200 --> 00:55.680 Welcome, Marco.
00:55.680 --> 00:57.600 Thank you for the introduction.
00:57.600 --> 00:59.280 Thanks.
00:59.280 --> 01:05.440 So I will skip the introduction, since, though I missed your name,
01:05.440 --> 01:08.720 Danica did it for me.
01:08.720 --> 01:14.240 So the slides are not visible.
01:14.240 --> 01:31.120 Where is the... I forgot to, of course.
02:14.240 --> 02:26.560 Perfect.
02:26.560 --> 02:28.800 Sorry.
02:28.800 --> 02:36.960 So, yes, I'm Marco Castelluccio, as Danica said, and as you can guess from the slides, I'm working
02:36.960 --> 02:39.120 at Mozilla.
02:39.120 --> 02:46.000 I'm an engineering manager managing the CI and Quality Tools team, the translations team, and the PDF team.
02:46.000 --> 02:52.640 In particular, this talk is more related to CI and Quality Tools, where we use both classical
02:52.640 --> 02:59.520 tools, like static analysis and linting, and also machine-learning-based tools.
02:59.520 --> 03:03.440 We've been using them for a few years, since before they were cool.
03:03.840 --> 03:13.920 Now we're continuing to use them, along with additional tools like LLMs.
03:13.920 --> 03:22.480 Before we dive into the topic, I want to first give you an idea of the Firefox scale.
03:22.480 --> 03:29.760 We have two million bug reports; actually, they are not all defects.
03:29.760 --> 03:35.600 We use bug reports for everything, like tasks or announcements or feature requests, so two million
03:35.600 --> 03:44.320 bug reports are not actually two million defects, but still, we celebrate when we arrive at
03:44.320 --> 03:46.160 these kinds of numbers.
03:46.160 --> 03:53.600 We release every four weeks, with thousands of changes: we had 12 major releases in 2024.
03:54.480 --> 04:00.880 Firefox is one of the biggest and most complex pieces of software; we recently celebrated its
04:00.880 --> 04:08.640 20th birthday in November, and Netscape was open-sourced 27 years ago, but it was
04:09.200 --> 04:13.680 started even before then, so there's a lot of legacy and tech debt.
04:14.240 --> 04:24.960 We have 950,000 commits made by 10,000 contributors, with 32 million lines of code,
04:26.000 --> 04:35.680 and 46,000 commits last year alone, with 1,100 unique contributors every year; this number keeps increasing,
04:35.680 --> 04:36.320 which is nice.
04:36.960 --> 04:48.240 We use lots of languages, starting from C++ and JavaScript, and Rust is taking more and more space over time.
04:51.200 --> 04:58.640 So the idea we had was not to automate reviews; it was not even to make reviews faster.
04:59.280 --> 05:06.080 It was to try to identify more issues and improve code quality through code review,
05:06.080 --> 05:11.120 by supporting developers. We don't want to automate the code review process.
05:11.120 --> 05:14.320 We want to support developers in this process.
05:16.480 --> 05:24.080 This was a collaboration between us, Ubisoft, and two universities, Queen's School of Computing
05:24.080 --> 05:28.720 and the University of Montreal; we all worked together on the open source project.
05:30.080 --> 05:37.280 We didn't have access to the Ubisoft game source code, but they could work together with us
05:37.280 --> 05:41.600 on the open source project, and we're working on a paper to show the results.
05:43.600 --> 05:49.040 So let's dive in. First of all, a trip down prompt engineering lane.
05:49.600 --> 05:55.520 So we started with a simple prompt. We just asked the LLM,
05:55.520 --> 06:00.480 "please find issues in the patch", and we showed the patch to the LLM.
06:01.520 --> 06:10.640 Of course, this generated lots of weird comments, comments that usually developers don't really
06:10.640 --> 06:20.320 want to see. So we added human examples. Initially, these were comments that we identified as good
06:20.320 --> 06:27.200 by default. Then later on, we tried to use code similarity to find comments that were made
06:27.200 --> 06:36.400 previously by humans on code that is similar to the patch. Then we had to fight against the
06:36.480 --> 06:43.360 urge of the LLM to be nice. As you might have noticed, they always want you to be happy.
06:44.640 --> 06:52.320 So we saw comments like "this is a good addition to the code". Of course, reviewers don't really care
06:52.320 --> 07:01.120 about these kinds of comments. So we had to explicitly say, "please don't add praising comments".
07:02.080 --> 07:09.840 Then we had to ask it to focus. It was commenting on everything. Sometimes it was generating
07:09.840 --> 07:15.120 comments that were related to things that the patch was already doing. So it was suggesting to
07:15.120 --> 07:22.560 do things that the patch was already doing. So we had to ask it to focus on additions only, on the added lines.
07:24.560 --> 07:30.720 Then we had to add more and more and more filters. For example, there were many comments
07:31.040 --> 07:38.320 focusing on making sure that something exists. But of course, if a developer is writing a patch,
07:39.440 --> 07:46.240 they are using functions that exist.
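The prompt iterations described so far (few-shot human examples, no praising comments, focus only on added lines) could be assembled roughly like the sketch below. The wording and structure are illustrative only, not Mozilla's actual prompt:

```python
def build_review_prompt(patch, good_examples, bad_examples):
    """Assemble a code-review prompt following the iterations described
    in the talk: rules against praise, focus on added lines, and
    few-shot examples of wanted and unwanted comments."""
    parts = [
        "You are an expert code reviewer. Find genuine issues in the patch.",
        "Rules:",
        "- Do not add praising comments.",
        "- Comment only on added lines (lines starting with '+').",
        "- Do not suggest changes the patch already makes.",
        "- Do not ask whether a function or variable exists; assume it does.",
    ]
    if good_examples:
        parts.append("Examples of useful review comments:")
        parts += [f"  - {c}" for c in good_examples]
    if bad_examples:
        parts.append("Examples of comments you must NOT produce:")
        parts += [f"  - {c}" for c in bad_examples]
    parts.append("Patch to review:\n" + patch)
    return "\n".join(parts)
```

In the real system the good examples would come from similar past human comments, and the bad examples from rejected comments, as described in the talk.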
They will not use things that do not exist in the source code.
07:46.240 --> 07:53.520 So these kinds of comments are kind of useless. This already gave us some initial interesting results,
07:54.160 --> 08:03.040 but we needed to do a bit more. We had to try to elicit reasoning from the LLM. So, initially,
08:03.040 --> 08:09.520 before asking it to review the patch, we asked the LLM to summarize the patch, so it could
08:09.520 --> 08:18.720 understand what the changes were and then review based on the summary. Then we noticed it was still
08:18.720 --> 08:26.480 confused about the usage of variables, for example if it couldn't see where a variable was defined,
08:26.480 --> 08:35.440 in the case of patches that, for example, are changing a function. The variable definition is way earlier,
08:35.440 --> 08:42.480 and you cannot see it unless you go back to the initial part of the function. So basically,
08:42.480 --> 08:49.120 we gave a tool to the LLM to be able to see more context around the changes, if it wanted to see more
08:49.120 --> 08:57.520 context. Then we also gave it a tool to search code using our Searchfox tool, which is a code
08:57.520 --> 09:04.240 search tool for the Firefox source code. So, if it wanted to see the
09:04.240 --> 09:11.920 functions that were being used in the patch, we gave it the ability to search for those
09:11.920 --> 09:19.040 functions and see their implementation in the source code. This was particularly useful for
09:19.040 --> 09:25.120 Ubisoft, for example, because the LLMs are already trained on the Firefox source code, while
09:25.840 --> 09:31.360 Ubisoft's code is closed source. So they really needed ways for the LLM to see what
09:32.320 --> 09:41.920 the referenced functions were doing. We also added bad examples,
09:43.760 --> 09:52.240 other bad examples, to teach the LLM not to add them.
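The two tools just described, expanding the context around the changed lines and searching for a function's implementation, imply a simple tool-use loop. A minimal sketch follows; the tool names and the model's call/response protocol are hypothetical, and in the real system `search_code` would query Searchfox and `model` would be an LLM API call:

```python
def review_with_tools(model, patch, tools, max_steps=5):
    """Minimal tool-use loop: the model may request a tool call
    (e.g. "expand_context" or "search_code") one or more times
    before emitting its final list of review comments."""
    transcript = [("patch", patch)]
    for _ in range(max_steps):
        kind, payload = model(transcript)
        if kind == "comments":             # final answer reached
            return payload
        result = tools[kind](payload)      # run the requested tool
        transcript.append((kind, result))  # feed the result back
    return []                              # give up after max_steps
```

Any callable with this shape works, which also makes the loop easy to test with a fake model.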
For example, still praising comments, like
09:52.240 --> 09:58.960 "overall the patch seems to be well implemented with no major concerns". And again, here,
09:58.960 --> 10:08.720 just like the human comments, initially we used predefined examples that we identified by ourselves,
10:09.360 --> 10:14.560 and then later on we started using data from the field. I will talk about that later.
10:16.480 --> 10:23.520 So it seems linear. It actually wasn't this linear. It went more like: okay, go back,
10:24.400 --> 10:31.920 go forward, go back, refine, refine, refine, until we got something that we were more or less happy with,
10:32.720 --> 10:43.360 and we were satisfied enough to offer it to a set of developers. So what does it look like?
10:45.120 --> 10:50.960 This is an example from before we deployed the tool. So there was already a review comment.
10:50.960 --> 10:57.200 The LLM didn't have access to it, and it generated a comment which was saying more or less the
10:57.200 --> 11:04.400 same thing in other words. So it was very promising. And you can see the interface here. The developers
11:04.400 --> 11:10.560 had the ability to approve comments, mark them as invalid, or mark them as trivial. The UI
11:10.560 --> 11:19.280 changed a little later, but I will show you an example later on. This is another example from when
11:19.360 --> 11:22.880 the tool was actually deployed. It was actually maybe from two days ago.
11:24.320 --> 11:32.480 It was nice, so I added it. The LLM identified a potential overflow in this code,
11:33.200 --> 11:37.360 and so the developer fixed it by adding a check for the overflow.
11:39.760 --> 11:46.160 This is another interesting example. In this case, the LLM identified a possible division by zero,
11:47.120 --> 11:55.520 and it was actually true. But it wasn't clear how to fix it, because there were tests that were
11:55.520 --> 12:01.760 actually using images with size zero. So the developer didn't exactly know how to fix it.
12:01.760 --> 12:07.200 The patch landed as it is, and the developer will investigate more to figure out what to do exactly.
12:10.160 --> 12:14.960 And this is an example with JavaScript. You can see the first part of the patch
12:15.600 --> 12:22.240 is in one hunk of the patch, at lines 15 and 16, and the second part of the patch is
12:22.240 --> 12:29.360 way later. And the LLM was able to identify a problem with the accessing of this variable.
12:31.760 --> 12:36.400 It's a different language: the earlier examples were C++, this one is JavaScript.
12:37.040 --> 12:42.080 Of course, it's able to identify issues in both languages, and it can identify issues
12:43.040 --> 12:48.800 spanning different parts of the patch. It basically has the full context.
12:50.480 --> 12:58.160 And this is the UI when users accept or reject comments. So when they approve a comment, it looks like this:
12:59.360 --> 13:05.840 the contents of the generated comment, and then we say "this comment was generated
13:05.840 --> 13:12.640 automatically and was approved by this person". Or, if they edited the comment, as I will show later,
13:12.640 --> 13:21.120 it will say it was generated automatically and was edited by someone. We want to make sure that developers
13:21.120 --> 13:24.800 who receive these kinds of comments know that they were generated automatically.
13:27.040 --> 13:35.280 This is the interface, this one actually from the Ubisoft side, for what happens when you reject comments:
13:35.840 --> 13:44.080 users could select the reason for rejection. In most cases, when they rejected, it was a trivial comment.
13:44.080 --> 13:50.160 But there were also many cases of "useful in the development phase,
13:51.040 --> 13:55.840 not in the review phase" or "useful as a tip for reviewers, not as a review comment".
13:56.240 --> 14:07.120 So, do people like it? How many comments are accepted? 8% of comments are accepted
14:07.120 --> 14:13.600 at Mozilla, 7% at Ubisoft.
It's not that huge a number, but it's also very promising, because
14:13.600 --> 14:21.920 it means it is finding valuable, highly valuable cases. There is still a lot of noise that we need
14:21.920 --> 14:28.000 to address, but still, it's very promising, as you saw in the examples before.
14:30.000 --> 14:36.960 Some developers said it detects perfectly possible flaws in the code, but most of the suggestions
14:36.960 --> 14:40.880 are more in the development realm than the review realm, and that's why we added those two
14:41.760 --> 14:46.560 rejection reasons: "useful as a tip for reviewers, not as a review comment" and "useful in the development phase, not in the
14:46.640 --> 14:56.480 review phase". These are not counted in the 8% or the 7%, because they were not accepted,
14:56.480 --> 15:04.720 they were rejected, but they were still useful in some way. So we had another percentage, "appreciated",
15:04.720 --> 15:10.640 which contains both the accepted ones, which were published, and the ones that were considered
15:10.640 --> 15:18.080 useful even though not useful as comments in the review. It was more or less a fourth of the
15:18.080 --> 15:26.880 comments at Mozilla, almost a third of the comments at Ubisoft. It also depended on the seniority of the
15:26.880 --> 15:35.520 developers. More junior developers appreciated these review tips more than more senior developers,
15:35.600 --> 15:45.120 for example. The other thing we were wondering is whether reviews would be slower or faster with
15:45.120 --> 15:51.600 this kind of tool. As I said at the beginning, we were not aiming to make reviews faster,
15:51.600 --> 15:58.720 but we also didn't want to make them slower. And developers said things like: "often looking at
15:58.720 --> 16:04.000 the suggestion helped me think more about the code and reflect a bit more on what is done and why
16:04.080 --> 16:10.320
Overall I think it is a bit slower to do reviews but my reviews are actually better 16:10.320 --> 16:17.040 now that I used the tool. We didn't track the review time at Modzilla but Ubisoft 16:17.040 --> 16:24.720 tracked the review time and we saw that the additional time for this kind of comments was not that 16:24.720 --> 16:33.440 long. So for accepted comments it was around 28 seconds for rejected it was 16 seconds. Probably 16:33.520 --> 16:39.920 it was obvious that they were bad comments and in general it took 42 seconds more per review. 16:39.920 --> 16:48.000 So it didn't add too much review time. Of course there are some cases where it took way longer 16:48.880 --> 16:56.560 so towards 150 seconds and this is probably because sometimes the LLM could hallucinate 16:56.560 --> 17:02.720 problems that do not actually exist but then if you're not familiar enough with the code you 17:02.720 --> 17:08.000 need to actually go and try to find out if it is actually the case or not. It actually happened 17:08.000 --> 17:14.640 to me once. I was not familiar with the API that the developer was using so I had to search for 17:14.640 --> 17:22.560 documentation, try to figure out if it was real or not etc etc and I think it turned out to be real 17:22.560 --> 17:32.080 but still it was a long rabbit hole and our comments edited we were we were wondering if they were 17:32.160 --> 17:40.240 edited or not and what happens when they are edited we found that most comments are just accepted 17:40.240 --> 17:47.760 as they are or they are shortened and this is because LLM is often two verbose so in the most 17:47.760 --> 17:56.960 reason prompt we are asking it to be super concise. There were very few examples of length and 17:57.280 --> 18:02.080 comments at the time we wrote the paper it was only one. Now there are a few more. 
18:03.520 --> 18:10.800 You can see, in the first case, it was just because the reviewer wanted to be nice to the developer,
18:10.800 --> 18:19.840 and they added a suggestion on how to fix the issue to the comment. In the second case, instead, the reviewer
18:19.920 --> 18:28.720 was not sure about the comment, but they thought it could be valuable, so they said: "this was suggested
18:28.720 --> 18:34.000 by the review helper add-on; however, maybe this should handle errors, but please make sure".
18:37.680 --> 18:42.240 I guess this is a case where you could consider it might have been a waste of time,
18:42.240 --> 18:47.280 if it turned out not to be true; we don't know, I need to look into it.
18:49.120 --> 18:54.720 The other thing we wanted to see is: are accepted comments as valuable
18:55.600 --> 19:02.720 as human comments? So we set out to see if they led to the same number of changes, and it turned out
19:02.720 --> 19:11.440 that yes, more or less: 74% of LLM comments lead to changes and 73% of human comments lead to
19:11.440 --> 19:15.360 changes. So they lead to changes in more or less the same way.
19:18.240 --> 19:26.960 Next steps. We saw a few promising results; at the same time, we saw there is noise in the results,
19:26.960 --> 19:34.880 so we want to improve them.
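One of the next steps described relies on Bugzilla's "regressed by" links: the original patch, the regression it caused, and the patch that fixed the regression together form a training example. Pairing them up could look roughly like this sketch, where the record field names are hypothetical, not the actual Bugzilla API schema:

```python
def synthetic_examples(bugs):
    """From Bugzilla-style records with a 'regressed_by' link, pair the
    original (buggy) patch with the regression's fix, so an LLM can then
    be asked to write the review comment that would have caught the bug."""
    by_id = {b["id"]: b for b in bugs}
    out = []
    for bug in bugs:
        for cause_id in bug.get("regressed_by", []):
            cause = by_id.get(cause_id)
            # Keep only pairs where both patches are available.
            if cause and cause.get("patch") and bug.get("fix_patch"):
                out.append({
                    "buggy_patch": cause["patch"],
                    "fix_patch": bug["fix_patch"],
                })
    return out
```

As the talk notes, not every pair is usable: some regressions are due to external reasons a reviewer could never have caught, so a real pipeline would filter those out.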
So the first step is improving bug detection capabilities.
19:35.840 --> 19:43.040 Right now, as I said in the prompt engineering phase, we're using past human
19:43.040 --> 19:50.400 comments plus the knowledge of the LLM from pre-training, and we're thinking of building a dataset
19:50.400 --> 20:00.160 of synthetic comments. Basically, on Bugzilla, a few years ago, we were smart enough to think
20:00.160 --> 20:13.920 of adding a new field called "regressed by", where developers can say which patch caused a bug.
20:13.920 --> 20:20.880 This has been used for metadata purposes, to know how many regressions there are
20:20.880 --> 20:26.560 over time, but it can also be used in this case, for example, for building a dataset of
20:26.560 --> 20:35.520 fake comments to teach the LLM about highly valuable cases. Basically, we have the original
20:35.520 --> 20:40.880 bug, we have the patch associated with the original bug, we know this patch caused another bug, which
20:40.880 --> 20:47.760 is the regression, and we have the patch that fixes the regression. So, by giving to the LLM the original
20:47.760 --> 20:54.960 patch, where the bug was not caught by the reviewer, and the fix patch, which fixes the regression
20:55.040 --> 21:01.920 caused by the original patch, we can generate fake comments that would have prevented these
21:01.920 --> 21:07.920 regressions. Of course, it's not always possible, because sometimes the regression might be due
21:07.920 --> 21:15.200 to external reasons, and it would not have been possible for a reviewer to find it. But there are
21:15.200 --> 21:21.200 many cases where it is possible, and these are extremely highly valuable, because they were missed
21:21.280 --> 21:30.320 during the review by humans. This is to improve the acceptance rate, but we also want to
21:30.320 --> 21:36.880 reduce the rejection rate due to noise, because the LLM, even though we added a lot of
21:36.880 --> 21:44.880 filtering, etc., to remove
praise comments and stuff like that, is still making some of them.
21:44.880 --> 21:52.160 So we want to reduce trivial and unwanted comments. First of all, we want to allow customization
21:52.160 --> 21:58.960 by type. As I said, there are some developers who like to see the review tips; some other developers
21:58.960 --> 22:05.920 don't want to see review tips. So we want to categorize them, and give the option to some reviewers to
22:05.920 --> 22:14.720 see all of them and to other reviewers to see only a subset. And we want to introduce more filtering
22:14.720 --> 22:21.680 using past rejection examples. With the deployment of the tool, as I showed, we collect
22:22.640 --> 22:28.800 feedback from developers, and we have a lot of accepted and rejected comments. We can
22:28.800 --> 22:36.160 ingest the rejected comments into a vector database, so that whenever we generate new comments
22:36.160 --> 22:44.080 with the LLM, we can look for similar-looking comments in the rejected-comments vector database,
22:44.080 --> 22:48.960 by similarity, and we can add them into the prompt and ask the LLM not to generate
22:50.400 --> 22:55.920 things that look like that. We're still working on it, so I don't know exactly what
22:55.920 --> 23:11.840 the solution will look like.
23:25.920 --> 23:38.000 Yes, I think it's probably this. Maybe let's try to remove it.
24:08.960 --> 24:25.280 Oh, it's working, sorry about that. So, yeah, we don't know exactly what the solution will look like,
24:25.280 --> 24:32.800 but we have a big, well, not big, but big enough, dataset of rejected comments from developers,
24:32.800 --> 24:39.600 which we can use for clustering, filtering, and so on and so forth, and we hope this will greatly
24:39.600 --> 24:47.920 reduce the number of noise comments. The other thing we want to do is auto-fixing comments, a bit
24:47.920 --> 24:54.400 like the developer did
to help, sorry, like the reviewer did to help the developer. This is an example:
24:55.040 --> 25:01.600 here on the left there is a comment by a reviewer. Of course, we only want to do this in simple cases,
25:02.560 --> 25:07.680 because the LLM probably won't be able to address very complex cases, at least for now.
25:08.400 --> 25:12.640 But for very simple cases, where developers can just accept without having to
25:13.920 --> 25:19.520 edit the code locally, push the patch again, ask for review again, etc., etc., we could generate
25:19.520 --> 25:25.360 suggestions directly in the review system, and they could just accept them and land the patch.
25:26.000 --> 25:31.200 There's an example here where the reviewer said "variables" is actually required,
25:31.920 --> 25:38.400 pointed to some documentation, and said it can be an empty dict, and so the LLM just added a "variables"
25:39.280 --> 25:40.800 attribute with an empty dict.
25:43.440 --> 25:50.320 There are also many developers who said: this is very useful,
25:51.120 --> 25:56.720 but I would like to have it while I'm working on the patch, not after I submit the patch for
25:56.720 --> 26:03.440 review. So we are thinking of building an IDE integration, so that they can use it before.
26:04.160 --> 26:10.480 For these kinds of issues, especially the tips, for example the ones useful in the development phase,
26:11.440 --> 26:16.640 they can actually use them and avoid wasting the time of the reviewers. In this case we could actually
26:16.640 --> 26:23.520 speed up the review. We wouldn't see it, because the initial review by the LLM would happen before
26:23.520 --> 26:34.160 submission, but it would still be a win overall. All right, this is more or less it when it comes to
26:34.160 --> 26:41.680 using LLMs for code review, but I also wanted to show a bit of what we're doing in addition to code
26:41.680 --> 26:50.560 review to still support developers with various tasks. First of all, QA test case generation. So LLMs
26:50.560 --> 26:57.760
can often have hallucinations, so you could say that they're very creative. So we thought
26:58.480 --> 27:03.600 maybe we could use them for test case generation, where you need a lot of creativity to break
27:04.560 --> 27:14.400 the software. So we are giving the LLM the role of an expert quality assurance engineer,
27:15.440 --> 27:21.120 we're giving examples of test cases, we specify the feature that the LLM should test,
27:22.480 --> 27:27.680 with the details of the feature, the scope of the testing, and so on and so forth, and then we
27:27.680 --> 27:35.760 ask it to generate test cases. And we tested it on eight
27:35.760 --> 27:50.320 Firefox features, which were both UI features and more backend features. It went very well in general.
27:51.040 --> 28:01.600 It generated 54 novel test cases, 27% of the generated test cases, meaning no QA engineer had thought of them;
28:02.240 --> 28:08.400 one hundred, 50%, already existed, which means they were good but already covered;
28:09.600 --> 28:18.000 29 were invalid; and 16 had other reasons for rejection. But in general it was very valuable for the QA teams.
28:18.720 --> 28:28.320 You can see that in some cases, where the feature is more obvious and there are fewer chances for
28:28.320 --> 28:34.160 corner cases, like Reader View, for example, there was only one novel case, but for other
28:34.720 --> 28:42.720 features which could lead to more corner cases, like AI-generated alt text for images in PDFs, there were way more
28:43.440 --> 28:52.080 novel cases found. In the case of the back button intervention, which was mostly backend stuff,
28:52.080 --> 29:02.320 there was no novel test case found. Another aspect is smart documentation search. We're building
29:02.320 --> 29:10.800 a RAG architecture to ingest the Firefox source code, the bugs from Bugzilla, the history of
29:10.800 --> 29:20.480 development, basically, and then building a chatbot for developers. This was mainly thought of for
29:20.480 --> 29:27.840 new
contributors who might come in, or new employees, etc. And we started testing it with the PDF team,
29:28.720 --> 29:38.160 by taking questions and answers from Matrix, from the PDF.js Matrix channel, and comparing the
29:39.280 --> 29:48.400 human answers with the LLM answers. So we have an example here, which was good, where the
29:50.000 --> 29:57.600 person was asking how to download things in parallel, etc. The human answer, it was not our answer,
29:58.080 --> 30:03.120 I guess it was somebody else from the channel, just said to download it and use
30:03.120 --> 30:13.600 another viewer, while the LLM answer was that this is already handled automatically by the browser
30:13.600 --> 30:21.040 network stack. And we had a developer from the PDF team who helped evaluate the cases, and
30:21.040 --> 30:28.800 he preferred the second, the LLM answer, to the human answer. In the second case, instead, it was
30:28.800 --> 30:37.760 wrong. It was a question related to offsets of the page, etc. The human answer was very short but to the
30:37.760 --> 30:47.040 point; the LLM answer was wrong, because it got the offsets wrong. But then we actually found that the
30:47.120 --> 30:53.600 problem was in the documentation, because the documentation was wrong. There are many other cases
30:53.600 --> 30:58.080 where the LLM is actually wrong, but in this specific case the documentation was wrong.
31:00.720 --> 31:07.040 Another interesting thing, I think for us but also for other open source
31:08.880 --> 31:14.000 projects: we have lots of commits, and it's very hard to identify what is important
31:14.960 --> 31:21.520 for end users, or what is important for other developers, and so on and so forth. So we're experimenting
31:21.520 --> 31:30.400 with using an LLM to ingest the commit history of the software and identify what is important
31:30.400 --> 31:37.040 for various audiences. In this case, we're starting with the user release notes, so we're focusing,
31:37.040 --> 31:43.920 we're
asking the LLM to focus on user-facing changes, but in the future we're also planning to do the same
31:43.920 --> 31:49.440 for web developers, so we could identify commits which are changing APIs in ways that
31:50.320 --> 31:57.040 web developers care about, and the same for identifying valuable contributions from new
31:57.040 --> 32:07.040 contributors, for example. Another aspect which is very, very important for Mozilla is web
32:07.040 --> 32:15.440 compat issue detection. We have many issues due to differences between browsers, in some cases
32:16.160 --> 32:23.680 due to websites which only test in one browser and not the other. So there are many cases
32:23.680 --> 32:32.160 where a website might be working well in Chrome, for example, but not in Firefox. There's an
32:32.160 --> 32:39.760 example here, where you can see the graphs on the Chrome side, and in Firefox the content is almost
32:39.760 --> 32:50.240 empty, basically. And we're using a multimodal model to identify potential issues. We're only
32:50.240 --> 32:55.840 starting this project now, but it's already showing some promising results. Of course, there are
32:55.840 --> 33:00.240 lots of difficulties in this problem. It's not just a matter of comparing screenshots, because you might
33:00.240 --> 33:07.760 have differences just because of, for example, different advertisements when you reload the page, or if you have
33:07.760 --> 33:13.440 a carousel, for example: if you take a screenshot now, it's not going to be the same in two seconds, because
33:13.440 --> 33:20.480 the carousel will move. So you need to identify things that are actually important versus things that
33:20.480 --> 33:30.080 are just part of the website behaving normally. We also want to expand the auto-fixing that
33:30.080 --> 33:36.720 I talked about before regarding comments; we want to expand it to additional cases. For example, when
33:36.720 --> 33:47.920 we get crashes from users, we can identify potential causes for the crash and suggest
33:47.920 --> 33:56.800 a potential fix. The other thing we would like to try is analyzing build failure logs for a given patch,
33:57.680 --> 34:05.680 to try to determine and resolve the build failure. And both of these, in theory, could have a feedback
34:05.680 --> 34:11.760 cycle, so that, for example, in the build case, we can build, get the error log,
34:12.960 --> 34:18.640 try to fix it through the LLM, and if it still fails, give it the new error log, and so on and so forth,
34:18.640 --> 34:30.640 and just try a few times until the issue is fixed. And this gets us to the end. If you're interested in
34:30.640 --> 34:37.040 experimenting with our stuff, any of the things that I mentioned, and seeing where it could go in
34:37.040 --> 34:43.200 Firefox, or if you want to extend support for these kinds of tools to other review systems, right now
34:43.200 --> 34:48.320 we only support our review system, plus the Ubisoft one, and ours is maybe only used by
34:48.960 --> 34:55.760 us, or if other projects want to adopt
34:55.760 --> 35:03.760 our tools in your project, come talk to us, either later or in our devroom tomorrow,
35:04.000 --> 35:08.640 the Mozilla devroom. Thank you.
35:24.640 --> 35:31.840 Thanks, great talk. Maybe I missed it, but is Mozilla hosting the LLM, or are you using an external service?
35:32.800 --> 35:38.720 We're using an external service. We're actually running multiple LLMs at the same time,
35:41.680 --> 35:48.480 to compare results across models, both in size and type.
35:49.440 --> 35:57.040 Have you thought about hosting it yourself? Yes, we haven't done it yet, but the plan would actually be to
35:57.120 --> 36:03.920 use an open source model, fine-tune it with the data that we are collecting, so that we can
36:03.920 --> 36:09.360 host it ourselves, and it can be an open source model.
36:13.760 --> 36:20.640 What are the costs of running those services? I haven't looked at the costs recently; they were not
36:21.280 -->
36:28.800 super high, but if you want, I can share the numbers with you later. Okay, and especially the
36:28.800 --> 36:35.840 environmental costs, please. The environmental costs, yeah.
36:39.280 --> 36:45.680 Thanks for this great talk. I suppose you selected Ubisoft as a partner
36:45.680 --> 36:49.760 because you have common points with them, for instance, they code in C++.
36:49.760 --> 36:56.080 Without divulging industry secrets, of course, I would like to ask you how the general public
36:56.080 --> 37:00.480 could make use of your research. Do they have to be programming in a specific way,
37:00.480 --> 37:07.120 or is there something that can be used generally? Sorry, I missed the last part of the question.
37:07.120 --> 37:13.520 The question is: how reusable is your work for people who work with C++ code bases?
37:15.680 --> 37:22.880 I think it should be pretty reusable. The collaboration with Ubisoft was because we were already collaborating
37:22.880 --> 37:28.240 with them. We have a lot of touchpoints, like you mentioned, because they also have big, large
37:28.240 --> 37:37.120 code bases in C++, and legacy code bases, and the collaboration was both because we had similar
37:37.120 --> 37:43.360 problems, but also because we wanted to try to make the tool generic enough so that it could be applied
37:43.360 --> 37:57.040 by other projects as well. Hello. When generating the prompt, you know, the steps of
37:57.040 --> 38:03.920 gathering the context and creating the whole prompt, is there any limitation you faced, like
38:03.920 --> 38:12.960 the size of the pull request, for example? Like, when you have multiple issues
38:13.040 --> 38:19.360 correlated with that patch, how would you decide which context to inject or not, and what was
38:19.360 --> 38:27.520 the limitation there? Yeah, when we started, we had way more limitations because
38:27.520 --> 38:36.720 of the context size, so we kept, over time, increasing
and increasing the limits. So we 38:36.800 --> 38:43.760 have a limitation on the patch size: if it is too large we don't review it. This limitation was 38:44.800 --> 38:50.240 pretty small at the beginning, and then we kept increasing it over time 38:50.880 --> 38:59.120 because of the larger context sizes that new models support. It's certainly important, especially 38:59.120 --> 39:04.640 as we build more and more tools. Right now we have only two tools, which are asking for more 39:04.640 --> 39:10.400 context around the changed lines and searching for code. In the future, if we add more and 39:10.400 --> 39:18.880 more tools, for example searching for documentation, running tests or whatever, we will have to choose 39:18.880 --> 39:26.480 what is most valuable, because we will have limitations on the context anyway at some point. 40:04.720 --> 40:13.840 I was surprised by the low acceptance rate of those comments; my assumption 40:13.840 --> 40:21.440 would have been that it would be a lot higher, and I was wondering if the AI is kind of forced to 40:21.440 --> 40:27.120 comment on every PR, or is it free to just say, I got nothing on this one, I don't 40:27.120 --> 40:34.720 have to add a comment? Just curious about that. Can you repeat, sorry? Does the AI have to comment 40:34.720 --> 40:41.120 on every PR, or can it say, I have nothing to add to this PR? And maybe that contributes to 40:41.120 --> 40:48.240 the low rate. So in general, even in cases where there are some issues, it will still generate some 40:48.240 --> 40:54.240 trivial comments, so that's why we're working on reducing that. But yes, what you say is true: 40:54.240 --> 41:04.160 the LLM has an urge to say something all the time. So in some cases, when it 41:04.160 --> 41:08.880 finds something, it still generates something that is not so interesting, 41:08.880 --> 41:15.920 but many of them
are interesting; but when it doesn't find anything, it still really wants to say 41:15.920 --> 41:21.840 something, and so it will fall back to its behavior. Even if we say, please don't say this, please 41:21.840 --> 41:28.320 don't say that, it will still fall back to its behavior of saying something. And yes, so there are still 41:28.320 --> 41:42.720 many of these cases, and we're working on reducing them. I have a question as well, I think. 41:42.880 --> 41:53.680 Do you dogfood the AI tool to also review the code that is being used to build 41:53.680 --> 42:03.680 the tool? No, not currently. Right now we are only using it on Firefox source code, not on any other 42:03.680 --> 42:11.040 Mozilla project. We have a small group of developers who are participating in this 42:11.040 --> 42:18.000 experiment, and they're all Firefox developers working on Firefox source code. Currently we only 42:18.000 --> 42:23.840 show the results to the reviewer, because we want to collect as much feedback as possible. If we 42:23.840 --> 42:30.080 showed it to the developer, the developer could just fix the issue without clicking the button to 42:30.080 --> 42:34.480 accept or reject it; it wouldn't make sense for the developer to accept the 42:34.480 --> 42:40.560 comment, publish the comment, and then fix it. So that's why in the experimentation phase we're 42:40.560 --> 42:54.160 just showing it to reviewers, and only Firefox reviewers. Hello, yes, another question. So I may have 42:54.160 --> 43:01.360 misunderstood, but right now the tool only goes over the added lines, the lines that were added 43:01.440 --> 43:08.000 in the patch. What about the lines that were removed? Why are they not being reviewed by the tool?
43:08.000 --> 43:17.200 The issue was that, before we asked it to focus on added lines, it was focusing on changes that 43:17.200 --> 43:25.440 were already made by the patch, so it was suggesting to do what the patch was already doing. So it was 43:25.760 --> 43:32.000 kind of a hack to avoid the LLM focusing on things that the patch was already doing, which 43:32.000 --> 43:37.280 wouldn't make sense to comment about. I think it's a limitation that we should resolve. 43:38.960 --> 43:46.880 The prompt engineering process sounds easy, but it's extremely complex. Like I said, 43:46.880 --> 43:53.600 we go back and forth, back and forth. We built some tools to validate changes, so whenever 43:53.760 --> 44:03.360 we make a change, we have a tool that runs on many past reviews and 44:03.360 --> 44:08.400 compares the comments that are generated by the new version of the prompt with the old version 44:08.400 --> 44:16.400 of the prompt, automatically, and then we manually label the ones that are completely new 44:16.480 --> 44:20.800 compared to before. So we can see, okay: is the new version of the prompt 44:22.640 --> 44:29.440 generating the same comments as before? Is it generating more comments 44:29.440 --> 44:34.240 that were previously accepted? Is it generating fewer comments that were previously rejected? So 44:34.240 --> 44:42.400 now we have all of these tools around the tool itself, and we can more easily iterate on the prompt. 44:42.480 --> 44:48.320 So this is one of the things that I want to try next: try to remove this restriction, or rephrase it, to ask 44:48.320 --> 44:56.480 it to focus on new changes, not only the added lines but also the removed lines. But we need to find 44:56.480 --> 45:03.040 a good way so that it doesn't fall back to adding comments on things that are already 45:03.120 --> 45:12.880 done by the patch. Thanks. Any more questions? You have described some comments that
were 45:14.080 --> 45:23.200 good, but not good review comments, rather good tips for development. How we can have a comment given 45:23.200 --> 45:29.680 during a review which is good for development but is not a good review comment is a little bit 45:29.680 --> 45:39.440 curious to me. So basically there were things like... I could find some examples later and share them 45:39.440 --> 45:46.960 with you, but one example that comes to mind is things like: ensure that this code path is not 45:46.960 --> 45:54.080 leading to a crash. For example, this is something that the reviewer should do; it's not for the developer 45:54.160 --> 46:02.160 to do. So it's not a comment, it's something like, please double check this, 46:02.160 --> 46:07.680 but it's a suggestion to the reviewer, actually. I mean, it could be considered as a suggestion for 46:07.680 --> 46:11.840 both, but usually when you are reviewing a patch, you don't add a comment to say, please 46:11.840 --> 46:17.120 make sure this is correct: you will make sure that this is correct, and then if it is not correct 46:18.000 --> 46:24.560 you will say something, but you will never comment saying, please make sure 46:24.560 --> 46:26.560 that this is correct. 46:26.560 --> 46:56.000 Okay, I wonder, does your system support iterative development and review? Is the LLM comment 46:56.000 --> 47:05.920 only done once per review, and will it have context on previous comments it made? 47:07.200 --> 47:13.120 At the moment it's one review and that's it. We are thinking of 47:13.120 --> 47:23.520 re-architecting it to take into account the previous review 47:23.520 --> 47:31.600 as well, but also to maybe make it conversational in the case of those reviewer tips: for 47:31.600 --> 47:38.480 example, if the LLM says, please make sure that this is correct, you could ask, 47:38.960 --> 47:46.720 why are you saying that, or can
you look 47:46.720 --> 47:54.160 up this function for me? You know, you could interact with the comment itself to dig deeper. 47:55.760 --> 48:01.200 The other thing that comes to mind when I see this is, we're also planning to... 48:01.200 --> 48:06.240 right now we're generating the comment and that's it, like you saw before. We're planning to also 48:06.240 --> 48:14.160 generate the reasoning behind the comment. There are two reasons why 48:14.160 --> 48:19.600 we're doing that: one is so that the LLM maybe will realize by itself that a comment is 48:19.600 --> 48:24.800 wrong by generating the reasoning behind it, and the other reason is that if we generate both the 48:24.800 --> 48:30.720 comment and the reasoning, the developer does not have to spend as much time validating it, because 48:30.720 --> 48:35.840 they will see the reasoning and it will be easier to see if there is something wrong 48:35.840 --> 48:52.400 in the reasoning. Okay, any last questions? There is one there. 49:00.960 --> 49:13.680 Yeah, awesome talk. I had a question about tokens: how many tokens generally are being 49:13.680 --> 49:19.840 computed during each review, and is it a multi-step review to get the final result, or is it just 49:19.840 --> 49:27.520 a single conversation that happens once? So, I don't know about the tokens, I'd need to 49:27.600 --> 49:35.760 look into that. It's a multi-step process, so it's not just a single prompt and the result. It's: 49:36.720 --> 49:44.240 summarize the changes; okay, now please find issues; okay, now please filter these issues based on 49:44.240 --> 49:53.440 these other results. So there are multiple steps to get to the answer. There is the link to the project 49:54.000 --> 50:02.480 in the presentation, so anybody can have a look at how the prompting 50:02.480 --> 50:19.280 process works. Thanks. Just checking to see if anyone has any more questions. Okay,
looks like we 50:19.280 --> 50:24.320 are done. Thank you so much, Marco, that was a really nice presentation.
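[Editor's note] The multi-step review process Marco describes (summarize the changes, then find issues, then filter them) can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `review_patch`, `ask_llm`, and `stub_llm` names and the prompt wording are all hypothetical, and the real tool's prompts live in the linked project.

```python
# Minimal sketch of a multi-step LLM review pipeline, assuming a generic
# ask_llm(prompt) -> str callable standing in for whatever LLM service is used.

def review_patch(patch: str, ask_llm) -> list[str]:
    """Run summarize -> find issues -> filter, returning candidate comments."""
    # Step 1: summarize the changes so later prompts have compact context.
    summary = ask_llm(f"Summarize the following patch:\n{patch}")

    # Step 2: ask for potential issues, grounded in the summary.
    issues = ask_llm(
        f"Given this summary:\n{summary}\n"
        f"List potential issues in the patch:\n{patch}"
    )

    # Step 3: filter out trivial or redundant comments before publishing.
    filtered = ask_llm(
        f"From these candidate comments, keep only substantive ones:\n{issues}"
    )
    return [line for line in filtered.splitlines() if line.strip()]


# A stub LLM so the sketch runs without any external service.
def stub_llm(prompt: str) -> str:
    if prompt.startswith("Summarize"):
        return "renames a variable"
    if "List potential issues" in prompt:
        return "possible null deref\nnit: trailing whitespace"
    return "possible null deref"

print(review_patch("- int x;\n+ int y;", stub_llm))  # prints ['possible null deref']
```

Splitting the work into separate prompts mirrors what is described in the talk: each step gets a smaller, more focused context, which matters given the patch-size and context-window limits discussed above.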