Thank you very much for the interest. Again, I'm Dio; I work at TikTok in the US and I work on this project. This project is closer to the application side, whereas the other talks focused on how you enable TEEs in VMs and so on. So it's a higher-level application: how we are going to use confidential computing in real work. Let me get started.

Here is the outline: I'm going to start with why we need private data analytics and why it's hard, then talk about our project, including a demo.

So, private data analytics. Why is private data so important? Everybody will agree it is very good for value extraction: personalized content, training recommendation models for better recommendations, and so on. But not many people talk about the public interest, so I'm going to focus on the public-interest case.

Sharing private data is actually really important for the public interest, especially public health. Researchers want to use medical data, even though personal medical data is very strictly protected, or personal health data collected by personal medical devices. For public safety, even very sensitive private data like PII — a personal address or phone number — can be used to identify public-safety issues, for example when they are associated with crimes or illegal activities. In education, personal academic performance — scores, attendance, engagement information — in combination with other private data such as addresses, can be used to find correlations between academic performance and where people live, their background, and so on. For civic engagement: personal beliefs and social activities, and how those beliefs affect a person's social activities, and so on. These are not the only examples.

To give a very concrete example, one piece of research published at CCS last year set out to understand illicit drug promotion using cross-platform data. The researchers figured out that there is a pattern among illicit drug promoters.
What the promoters do is basically use cross-platform referral traffic to draw people into their drug promotion without getting detected by moderation. It is very hard to detect this kind of cross-platform behavior, because YouTube and Instagram each have only their own data, and each has to make decisions based on its own data. In this case, the researchers were able to identify those cases by leveraging data from both organizations.

Another example is a very big initiative in the UK called HDR UK. What they are trying to build is a trusted research environment, where they want to combine all the medical data from health providers to allow public researchers to get insights from that data. These are examples of efforts to provide access to private data for the public interest.

But why is it hard? The first challenge, of course, is the data privacy risk. There is a trust issue, because a lot of entities may have conflicts of interest; YouTube and Instagram might not want to share their data with each other. This goes together with the risk of data abuse: even a researcher who claims to be benign may still do things with the private data that they never promised to do, like extracting private information about individuals from the data, and so on.

There is another issue, which is the different-trust-domain problem. A lot of the time data is processed in places that are not owned or controlled by the data owner. This is especially true when you deal with multiple trust domains, and there is a big compliance issue. Beyond the security requirements — you definitely need to protect the data — you also have to keep all the privacy policies enforced, such as data retention or purpose limitation. Providing the raw data might not be legally allowed in some countries or areas, and changing the geolocation or the form of the data could also be legally restricted, and so on.

The second challenge is that nowadays data is distributed across multiple places, even within a single organization. In the old days, an organization that owned data kept it on its own servers, which it managed and controlled. These days that is not true, because organizations often delegate the data to a third-party data warehouse such as Snowflake or Databricks, or store it in cloud-provider resources like storage buckets.
And the compute exists not only on the organization's own servers but also at a cloud provider; they run their workloads in GCP or Azure and so on. This raises a challenge around accountability and transparency. When things go wrong — say a data breach happens — who takes responsibility? It is very hard to determine what caused the breach; it is all about accountability. It is also very important to make it possible to verify every single data transfer and every single processing step on the compute nodes. That is what we need.

So we concluded that we need a standard approach that provides a strong privacy protection mechanism using various PETs. And it is not enough for these mechanisms to exist; we also need to enforce them technically. Terms and conditions for researchers are not enough: they cannot prevent researchers from abusing the data or violating the privacy policies. We also need accountability and transparency, so we need to provide a tool that lets data owners confidently audit or verify what is happening with their data. And finally there is usability. With all these guarantees, you should not sacrifice the results. Some PET technologies sacrifice the accuracy of the results for the sake of privacy, but that was not acceptable for us when we designed this system. In addition, we wanted it to be very easy to deploy, very easy to use, and very easy to customize.

There are existing solutions that we looked at for these problems. One is the data clean room: industry already uses this kind of framework, where you define a policy on every single SQL statement — who can access which table, who can run which kind of query, and so on — and it is operated by some third party that has no conflict of interest with any of the data owners. The second option we considered was differential privacy, which pre-processes the data or adds noise to the result of an aggregate SQL query to limit the information leakage, with a theoretical bound on that leakage. The final option was the trusted execution environment. We assessed the pros and cons of each of these techniques.
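To make the differential-privacy option concrete, here is a minimal sketch of the Laplace mechanism applied to an aggregate query result. The dataset and epsilon value are made up for illustration; this is not code from the project, just the standard mechanism the talk alludes to.

```python
import numpy as np

def laplace_count(true_count: int, epsilon: float) -> float:
    """Return a differentially private COUNT: a counting query has
    sensitivity 1, so Laplace noise with scale 1/epsilon gives epsilon-DP."""
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Toy example: how many people in the table are over 65.
ages = np.array([42, 67, 71, 35, 80, 66, 59])
true_count = int((ages > 65).sum())
print("true count:", true_count)
print("DP count (epsilon=0.5):", laplace_count(true_count, epsilon=0.5))
```

The trade-off the talk mentions is visible here: a smaller epsilon means more noise and therefore less accurate aggregate results.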
First, SQL-powered data clean rooms: although they provide very good usability — they let researchers and users run arbitrary queries on the data — and very high accuracy, they lack a way to technically enforce the privacy policies, and they often miss privacy protection in general. Differential privacy, on the other hand, provides a very well-defined technical guarantee on privacy, but it sacrifices accuracy. The trusted execution environment not only provides technical enforcement and high accuracy, it can also provide some transparency in terms of the accountability I mentioned before. The one issue is its usability, which, let's say, could be better. By that I mean that for this type of data analytics, we found trusted execution environments very hard to use, because the way analysts work with data does not match the model in which a trusted execution environment handles workloads. I will talk a little more about that later. That is why we started this project, ManaTEE.

To this end, we built the framework with the following goals. First, technical enforcement of the privacy policy via various PET technologies. Second, we wanted it to be usable, so we wanted to provide an interactive tool for working with the data. Third, accuracy: we should not sacrifice accuracy for the sake of anything else. Then, transparency and accountability. And actually the last thing is deployment: we wanted to make it easy to deploy into the cloud. These are our design goals.

One observation we made is that data analytics actually happens in two stages: a programming stage and an execution stage. Each stage has very different requirements. In the programming stage you usually need only a very small dataset and a very small amount of compute; you don't need a thousand GPUs or anything like that. And it is better for it to be very interactive, because when you program, you usually try some code with your data and play with the data to get some initial insights before you do the full analysis. But because of that, it is very hard to control the data: researchers or users can do anything with the data. So it carries a very high privacy risk.
On the other hand, once they are done with programming, they run one very large batch job on the larger dataset with more compute. It only happens once, after you have programmed everything and made sure it works; you run it once to get the final output. This stage is easier to control, so it carries a lower privacy risk.

So the approach we took is: why don't we separate these two stages and focus on a different problem in each of them? For protecting the execution stage, where you run the workload as a large batch on the actual data, we can use confidential computing. And for the programming stage, we can use other PET technologies with different trade-offs. Synthetic data is one example: you can use differentially private synthetic data to mock the actual data, so it has the same statistical characteristics but no risk of privacy leakage, for example. That is the basic idea.

The benefit is that you are effectively separating the data policy and the code policy. You can flexibly choose the data policy for the programming stage — LDP perturbation, sampled data, or DP synthetic data, whatever the data owner wants to use to protect data privacy, with whatever privacy budget they want — while still allowing users to get accurate results in the execution stage. And you enforce the code policy in the execution stage. You get accurate results there because the job runs on the full dataset, which is made secure by confidential computing.

Specifically, why is confidential computing so useful here? First, it provides a transition of trust, which makes it work with various trust models. Take cross-organization data providers as an example: not all of the data providers may trust each other. In this case, you can move the execution to a cloud provider that has no conflict of interest and run the workloads there, without needing to complicate the trust model. And the integrity of the execution is guaranteed by remote attestation plus the trusted execution environment.
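As an illustration of the kind of stage-one data described above, here is a minimal sketch of one simple way to produce differentially private synthetic data for numeric columns: add Laplace noise to per-column histograms and resample from them. This is only a toy independent-marginals synthesizer under those assumptions, not the method ManaTEE or the demo actually uses.

```python
import numpy as np
import pandas as pd

def dp_synthetic(df: pd.DataFrame, epsilon: float, n_rows: int, bins: int = 10) -> pd.DataFrame:
    """Toy DP synthesizer: noisy histogram per numeric column, sampled independently.
    Each individual changes one bin of each column's histogram by 1, so splitting
    the budget evenly across columns keeps the whole output epsilon-DP."""
    eps_col = epsilon / df.shape[1]
    out = {}
    for col in df.columns:
        values = df[col].to_numpy(dtype=float)
        counts, edges = np.histogram(values, bins=bins)
        noisy = counts + np.random.laplace(scale=1.0 / eps_col, size=counts.shape)
        probs = np.clip(noisy, 0, None)
        probs = probs / probs.sum() if probs.sum() > 0 else np.full(bins, 1.0 / bins)
        idx = np.random.choice(bins, size=n_rows, p=probs)
        # Sample uniformly within each chosen bin.
        out[col] = np.random.uniform(edges[idx], edges[idx + 1])
    return pd.DataFrame(out)
```

A synthesizer like this preserves per-column distributions but loses cross-column correlations; practical DP synthetic data generators model the joint structure, which is why they can stand in for the real data during interactive programming.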
Also, one very interesting thing we found is that the attestation report can be used to prove that a workload was executed in a legitimate environment. Why is that useful in our use case? Scientific research results often need to be reproducible, but instead of having someone reproduce the entire study, you can provide an attestation report and say: this is the script, and this is the output produced by this script in a particular environment. That can serve as proof of the experiment and proof of the integrity of the evaluation and the research.

This is the ManaTEE data and code pipeline. We use JupyterHub to provide a Jupyter Lab interface to the user, and the user interacts with the API through a Jupyter Lab extension. Data access goes through the data SDK, which accesses different data at different stages. When the API eventually receives a job, it schedules the container in the executor on a TEE backend, and we made this flexible so that you can choose a different TEE backend depending on your needs. The platform is very easy to deploy: it deploys into a Kubernetes cluster, either in GCP or in minikube, and it leverages cloud resources if necessary.

Here is the use case at TikTok. At TikTok we have exactly this problem, because we have to provide data to public researchers in order to provide transparency. TikTok has launched a product called VCE based on this solution, built on top of the open source. There are other potential use cases, obviously, that we are exploring. Now let me quickly show the demo.

The demo uses the insurance charges dataset from Kaggle, which is an open dataset. The task is to train a model that predicts insurance charges based on the data. We use differentially private synthetic data in the first stage; we provisioned that. Let me show the video. In the Jupyter Lab interface, you create a notebook and initialize the environment. You import the data SDK and initialize it. Then, with the data SDK, you can access the stage-one data, which here is the synthetic data rather than the raw data.
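The notebook flow in the demo might look roughly like the following sketch. The module and method names (manatee_data_sdk, Client, get_dataset, submit_job) are hypothetical stand-ins, since the talk does not spell out the SDK's actual API; only the overall flow — initialize the SDK, load the stage-one synthetic data, explore it, then submit the job to the second stage — follows the demo.

```python
# Hypothetical notebook cells: the SDK names below are illustrative placeholders,
# not the real ManaTEE API.
import manatee_data_sdk as sdk   # hypothetical import

client = sdk.Client()                          # initialize the data SDK
df = client.get_dataset("insurance", stage=1)  # stage one = DP synthetic data

# Explore the synthetic data interactively, e.g. pairwise correlations.
print(df.corr(numeric_only=True))

# Once the code is ready, submit it as a stage-two job; the backend builds a
# container image and schedules it onto the TEE backend.
job = client.submit_job(script="train_model.py", dataset="insurance")
print(job.status())
```

The point of the split is that everything above the submission line runs against synthetic data only, so the interactive, hard-to-control part of the work never touches the raw dataset.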
Then you can explore the synthetic data by printing a correlation heatmap and other views. Once you are ready, you submit the job to the second stage. The job goes to the API, and you can see that the image is building: the backend builds the container image, and once the image is built, it schedules it onto the TEE backend. The TEE backend we are using here is Confidential Space. Once the VM finishes, you can download the output, and in the output you can see the results from the real data. Here, output privacy is not guaranteed, but you can add an additional step before the output download to make sure that nothing private leaves in the output — that is part of the code policy. Let me skip this part.

The later part runs XGBoost to train the model. Then, for attestation, you can download the attestation report and see that it is the attestation report from Google Confidential Space. You can verify its signature, and compare the output hash contained in the attestation report, to prove that this output was generated by this script — with a certain hash — in a Confidential Space environment with SEV enabled.

So that is pretty much it. I only have one minute left, so: you can actually try this out. It is fully open source; you can deploy it locally with minikube — although the minikube version does not really use a TEE, you can still try the interface. And you can follow the tutorial to reproduce what I have shown here on GCP, if you have a GCP account. We are collaborating with Google on this project, and you are always welcome to join us for more collaboration. That is it. Thank you.
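The verification step at the end of the demo — checking the attestation report's signature and comparing the output hash — might look roughly like this. Confidential Space attestation reports are JWTs signed by Google; however, the JWKS URL, the file names, and the claim that carries the output hash below are assumptions for illustration, not details given in the talk.

```python
import hashlib
import jwt  # PyJWT, with the crypto extra installed

# Hash the downloaded output file (hypothetical file name).
with open("output.csv", "rb") as f:
    output_hash = hashlib.sha256(f.read()).hexdigest()

# Verify the attestation token's signature against the issuer's published keys.
# The JWKS URL below is an assumed placeholder for Google's key endpoint.
jwks_url = "https://confidentialcomputing.googleapis.com/.well-known/jwks"
with open("attestation.jwt") as f:  # hypothetical file name
    token = f.read().strip()

signing_key = jwt.PyJWKClient(jwks_url).get_signing_key_from_jwt(token)
claims = jwt.decode(token, signing_key.key, algorithms=["RS256"],
                    options={"verify_aud": False})

# Compare the hash recorded in the report with the hash of what we downloaded,
# assuming the pipeline records it under a claim like this hypothetical one.
assert claims.get("output_sha256") == output_hash
print("Signature valid and output hash matches the attestation report.")
```

In practice you would also check the audience, issuer, and the image/workload measurements in the claims, so the report ties the output to a specific script hash and a specific Confidential Space environment, which is exactly the reproducibility argument made earlier in the talk.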
People will be moving in and out, but we can take some quick Q&A. Please speak up a little so we can hear you while people move around.

Audience: I'm wondering: in a TEE, do we have to run constant-time code, or at least code whose branching pattern does not depend on the data being processed? Otherwise you leak information about the data, for example through timing side channels. It seems that you are running the code here just as it is, without that kind of hardening. What do you think about that?

Okay. To repeat, the question was: it seems we are not protecting against side channels, because if the execution time is not constant, it may be susceptible to a timing-channel attack, right? Yeah, that's a good question. I think it is a separate issue that can be addressed with additional techniques. What we are trying to solve here is not the side-channel problem, or the scope of the TEE itself. What we are doing is building a general private data analytics platform using existing TEEs; that is why we focused on this in this work. But of course, if a workload were very susceptible to timing-channel or other side-channel attacks, I think it should be addressed case by case.

Audience: What is the difference between this and just giving researchers the data? Since you don't give them the data, only the resulting statistics, how can they be sure the analysis actually worked when they only get to run that stage once?

Yeah, I think the question is: what is the difference between just giving the data to the researchers and taking this approach, in terms of making sure they get the correct insights? In the second stage you get the full output using the real data, so you do get the insight. The issue is the first stage, the programming stage, where you basically sacrifice some accuracy by using some data protection techniques, right?
And I think our argument is that you can run this job multiple times before you actually produce the final result. Yeah, yeah — that's the model. Thank you very much. All right. Sorry — we can talk offline. Yeah.