WEBVTT 00:00.000 --> 00:30.000 [inaudible] 00:30.000 --> 00:37.000 We get a chance to talk about this urgent topic of small data. 00:37.000 --> 00:46.000 I believe that we all are, or have at some point been, hoarding data in our lives. 00:46.000 --> 00:54.000 When I first started to work with data a little more seriously, about 10 or 15 years ago, 00:54.000 --> 00:58.000 at that time I was a product manager for a big media site. 00:58.000 --> 01:01.000 We were a bit naive. 01:01.000 --> 01:07.000 We were trying to gather as much data as possible about our readers, 01:07.000 --> 01:10.000 because we wanted to improve the reading experience. 01:10.000 --> 01:13.000 We wanted to improve the reading time and the business. 01:13.000 --> 01:19.000 I think we didn't do too much harm at that time, because the systems were also very immature. 01:19.000 --> 01:25.000 But looking back, I wonder now: why did no one ask the question, 01:25.000 --> 01:31.000 can you find these insights in other ways, are there other options? 01:31.000 --> 01:35.000 And it's getting worse, it's getting worse. 01:35.000 --> 01:42.000 Today, if we work with health data, for instance, 01:43.000 --> 01:48.000 there are people I know who are really, really engaged in that, 01:48.000 --> 01:54.000 who are hoarding data to make better medicines, to make more precise medicines, 01:54.000 --> 01:56.000 which is a really good cause. 01:56.000 --> 02:01.000 I have several people close to me who suffer from severe diseases 02:01.000 --> 02:06.000 and who could really use some good, more precise medicines, 02:06.000 --> 02:10.000 but no one really asks that question today either. 02:10.000 --> 02:14.000 So that's what we're going to talk about today. 02:14.000 --> 02:18.000 We're working on the project DNS TAPIR, 02:18.000 --> 02:25.000 and one of its purposes is to find cyber threats in the data, 02:25.000 --> 02:29.000 in this DNS query data, and in cyber crime 02:29.000 --> 02:35.000 the hoarding of data is really getting troublesome. 02:35.000 --> 02:42.000 So my name is Ulrika Vincent, and I've been working on this project, 02:42.000 --> 02:47.000 DNS TAPIR, for about two years, and I recently picked up coding again, 02:47.000 --> 02:52.000 a couple of years ago, which was a very good idea; my life got better. 02:52.000 --> 02:56.000 And I'm working together with Michael Kulberg, 02:56.000 --> 03:00.000 who is one of the founders of the project, and also a data architect, 03:00.000 --> 03:03.000 and a lot of other things too. 03:04.000 --> 03:11.000 So first I'm going to give you a short overview of what our project is about, 03:11.000 --> 03:16.000 and then Michael will go into a little more detail. 03:16.000 --> 03:25.000 So, first, before introducing the DNS TAPIR project, 03:25.000 --> 03:28.000 I want to introduce our way of working. 03:29.000 --> 03:32.000 In big data, a very common idea has become that gathering everything 03:32.000 --> 03:37.000 is the salvation of all knowledge in the world.
03:37.000 --> 03:40.000 You gather as much data as possible, 03:40.000 --> 03:45.000 you try to comply with laws by ticking checkboxes, 03:45.000 --> 03:50.000 and you try to protect the sensitive data with all kinds of shields, 03:50.000 --> 03:55.000 and try to make customers and others trust you. 03:55.000 --> 03:59.000 But what we are trying to do, or what we are doing, actually, 03:59.000 --> 04:05.000 not just trying, is that we work with very sensitive data, 04:05.000 --> 04:09.000 but we want to collect the minimum needed to get the insights. 04:09.000 --> 04:14.000 Oh, sorry, I can't walk around. 04:14.000 --> 04:16.000 Sorry. 04:16.000 --> 04:21.000 And we try every day to find ways to throw away data, 04:21.000 --> 04:24.000 as soon as possible. 04:24.000 --> 04:27.000 We want to distribute the storage of the data, 04:27.000 --> 04:31.000 and instead of just filling in checkboxes, 04:31.000 --> 04:35.000 we want to be compliant by design, 04:35.000 --> 04:41.000 and we do protection by, sort of, differential privacy. 04:41.000 --> 04:44.000 So, what is DNS TAPIR? 04:44.000 --> 04:47.000 DNS TAPIR is a privacy-first, 04:47.000 --> 04:52.000 open source platform, with local installations, for analytics 04:52.000 --> 04:57.000 on DNS query data. 04:57.000 --> 05:05.000 And TAPIR runs next to the recursive resolver; 05:05.000 --> 05:09.000 for those of you who might not know, the queries 05:09.000 --> 05:13.000 that are sent when you or your application 05:13.000 --> 05:16.000 do a lookup on the internet 05:16.000 --> 05:19.000 pass a recursive resolver. 05:19.000 --> 05:26.000 And we upload events and aggregates to a cloud analytics platform, 05:26.000 --> 05:30.000 and publish observations back to the edge, 05:30.000 --> 05:33.000 which can take some action on them. 05:33.000 --> 05:39.000 Just to give a very quick view of what the data looks like, 05:39.000 --> 05:43.000 if you're not doing this every day: 05:43.000 --> 05:48.000 this is what happens when I load BrusselsTimes.org. 05:48.000 --> 05:51.000 These are all the DNS queries sent. 05:51.000 --> 05:56.000 So, here you can find a lot of interesting things, both threats, 05:56.000 --> 05:58.000 such as botnets, etc., 05:58.000 --> 06:04.000 but also leaks of identifiable information about you. 06:04.000 --> 06:09.000 We want to look at this data to observe it 06:09.000 --> 06:16.000 and to find strange things and bad actors. 06:16.000 --> 06:20.000 And it's toxic. 06:20.000 --> 06:27.000 Some design principles: we are using aggregation, 06:27.000 --> 06:33.000 where we really separate data sets. 06:33.000 --> 06:43.000 And we try to make, or we are making, individual tracking 06:43.000 --> 06:47.000 impossible by design. 06:47.000 --> 06:50.000 And after our aggregation, 06:50.000 --> 06:56.000 you can't do reverse engineering to find an individual. 06:56.000 --> 07:00.000 So, we also work with minimization. 07:00.000 --> 07:08.000 Other solutions often do minimization after extraction, 07:08.000 --> 07:11.000 in the ETL process. 07:11.000 --> 07:16.000 But we do transformation and minimization at the source. 07:16.000 --> 07:27.000 But the main idea is that when we do aggregation, 07:27.000 --> 07:34.000 instead of looking at individual traces, 07:34.000 --> 07:39.000 then we won't have any data that can be misused. 07:39.000 --> 07:41.000 I try to teach my children this: 07:41.000 --> 07:46.000 data you share will probably be leaked some day. 07:46.000 --> 07:51.000 And we as developers or analysts or product people 07:51.000 --> 07:57.000 should be aware of this. 07:57.000 --> 08:01.000 We are also, kind of, working with differential privacy, 08:01.000 --> 08:05.000 which means that the results from our observations 08:05.000 --> 08:09.000 won't differ in a significant way 08:09.000 --> 08:13.000 whether your individual browsing behavior is in the data 08:13.000 --> 08:16.000 or not, to simplify it. 08:16.000 --> 08:19.000 And it has that reliability, 08:19.000 --> 08:25.000 so you can always, always state that it's not possible 08:25.000 --> 08:29.000 to find you in the data.
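To make this aggregate-then-add-noise idea concrete, here is a minimal Python sketch of differentially private per-domain counts. The field names, the epsilon value, and the Laplace mechanism shown here are illustrative assumptions, not DNS TAPIR's actual schema or parameters.

```python
import math
import random
from collections import Counter

def laplace_noise(scale: float) -> float:
    """Draw Laplace(0, scale) noise via inverse transform sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def aggregate_at_source(queries, epsilon=1.0):
    """Minimize at the source: keep only per-domain counts, then add noise.

    Client IPs, timestamps, and query context never leave this function.
    Sensitivity is taken as 1 (one query moves one count by one); a real
    deployment would need a careful per-user sensitivity analysis.
    """
    counts = Counter(q["domain"] for q in queries)
    scale = 1.0 / epsilon
    # Rounding and clamping are post-processing, so they preserve privacy.
    return {domain: max(0, round(n + laplace_noise(scale)))
            for domain, n in counts.items()}

queries = [
    {"domain": "example.com", "client": "10.0.0.1"},
    {"domain": "example.com", "client": "10.0.0.2"},
    {"domain": "ads.example.net", "client": "10.0.0.1"},
]
print(aggregate_at_source(queries))  # noisy counts, e.g. {'example.com': 2, 'ads.example.net': 1}
```

Because only the noisy aggregates leave the source, whether any one person's queries were included cannot be inferred from the published counts.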
08:29.000 --> 08:34.000 Another thing we're doing is that we want to publish 08:34.000 --> 08:40.000 these observations to the public and to partners and third parties. 08:40.000 --> 08:42.000 By, from the start, 08:42.000 --> 08:45.000 designing the system to share the data, 08:45.000 --> 08:50.000 we will make better decisions every day, 08:50.000 --> 08:56.000 because the ISPs who run DNS TAPIR on their resolvers 08:56.000 --> 08:58.000 have to be able to trust us. 08:58.000 --> 09:01.000 They would never share the data, 09:01.000 --> 09:04.000 knowing we're going to publish it, if the privacy didn't hold. 09:04.000 --> 09:08.000 So it has to work in our architecture. 09:08.000 --> 09:10.000 So, to summarize: 09:10.000 --> 09:14.000 stop the pathological hoarding, we're saying. 09:14.000 --> 09:17.000 Well, over to Michael. 09:17.000 --> 09:19.000 Thank you. 09:19.000 --> 09:20.000 Okay. 09:20.000 --> 09:23.000 So I need to get into the box. 09:23.000 --> 09:28.000 I don't like boxes, but I'll stick to this one. 09:28.000 --> 09:32.000 So I'm just going to go into the technical stuff for a little bit, 09:32.000 --> 09:35.000 because everyone keeps asking me about the technical bit. 09:35.000 --> 09:38.000 And we've created this analysis platform. 09:38.000 --> 09:41.000 And this is how the current sausages are made, 09:41.000 --> 09:46.000 and some useless details: we land the data, 09:46.000 --> 09:49.000 and we use Spark and NATS and stuff to analyze it. 09:49.000 --> 09:51.000 And we have microservices. 09:51.000 --> 09:55.000 And JupyterHub is the interface for the analysts. 09:55.000 --> 10:02.000 So this was the quick review of the technical part of the analysis. 10:02.000 --> 10:07.000 When it comes to the segmentation part that we've mentioned, 10:07.000 --> 10:09.000 this is an overview. 10:09.000 --> 10:14.000 Basically, you have internal stuff that ISPs don't want us to see; 10:14.000 --> 10:16.000 that we throw away. 10:16.000 --> 10:20.000 And then we have the ones that we sort of already know about. 10:20.000 --> 10:25.000 Those we gather up and aggregate, and throw away stuff. 10:25.000 --> 10:29.000 And basically say: oh, I really can't put up with Google this week, 10:29.000 --> 10:32.000 so everything under Google goes into a bucket, 10:32.000 --> 10:35.000 and I just know it's in the Google query store. 10:35.000 --> 10:42.000 Then we have unique events, where we're interested in unique events. 10:42.000 --> 10:47.000 Because in cybersecurity, 10:47.000 --> 10:53.000 typically, 90% of all new domains are malicious in some way. 10:53.000 --> 11:02.000 So having those domains available is a very good way of predicting what's going to go bad in the short term. 11:02.000 --> 11:05.000 So the first time we see any of these domains, we send them. 11:05.000 --> 11:10.000 But we disconnect them from any other queries or any user information, etc. 11:10.000 --> 11:13.000 So we just basically get a domain that says: oh, this one's new. 11:13.000 --> 11:16.000 Or at least that server thought it was new. 11:16.000 --> 11:19.000 And then we need to figure out if it's actually new.
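A minimal sketch of this first-seen flow, assuming a hypothetical edge detector and a central deduplication step; the event structure and names are illustrative, not DNS TAPIR's actual protocol.

```python
import time

class FirstSeenDetector:
    """Edge-side memory of domains already seen by this resolver."""

    def __init__(self):
        # A real edge would bound memory with a Bloom filter or an LRU.
        self.seen = set()

    def observe(self, domain: str):
        """Emit a bare first-seen event: no client IP, no query context."""
        if domain in self.seen:
            return None
        self.seen.add(domain)
        return {"event": "new_domain", "domain": domain, "ts": time.time()}

# Central side: a domain is only globally new if no edge reported it
# before; late arrivals from other resolvers are dropped.
global_first_seen = {}

def is_globally_new(event) -> bool:
    domain = event["domain"]
    if domain in global_first_seen:
        return False  # it just showed up late on another resolver
    global_first_seen[domain] = event["ts"]
    return True

edge = FirstSeenDetector()
for d in ["shop.example", "brand-new-domain.example", "shop.example"]:
    ev = edge.observe(d)
    if ev and is_globally_new(ev):
        print("globally new:", ev["domain"])
```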
11:19.000 --> 11:21.000 And then we have the things that are in between. 11:21.000 --> 11:26.000 Now, when you go shopping, or you're walking down the street and you're looking at all the places you pass, 11:26.000 --> 11:30.000 you create a pattern of where you are. 11:30.000 --> 11:34.000 And that pattern is in itself an identifier. 11:34.000 --> 11:38.000 So, as you saw from the queries that came from this newspaper, 11:39.000 --> 11:46.000 one of the newspapers I know about, they have about 380 queries for their front page, basically. 11:46.000 --> 11:52.000 And there are a number of different ad tokens that will identify you in different ways. 11:52.000 --> 11:57.000 And when your computer connects to a network, if you're running Windows, 11:57.000 --> 11:59.000 it's going to ask: are there any updates available? 11:59.000 --> 12:05.000 And then you have all your software that also wants to know about updates; all these things create patterns. 12:05.000 --> 12:14.000 And those patterns are both interesting, but also identifying and, well, toxic. 12:14.000 --> 12:20.000 So those end up in the bucket over at the end, where we're currently throwing them away, 12:20.000 --> 12:23.000 because we built all the other stuff first. 12:23.000 --> 12:34.000 But we do need a local analysis platform that aggregates this data and removes the patterns and identifying information. 12:34.000 --> 12:39.000 So this is the segmentation part. 12:39.000 --> 12:45.000 And for this data, well, this is where it ends up, and it's probably not readable. 12:45.000 --> 12:52.000 But here we have our histograms, and it's basically domains, like Google, and counts. 12:52.000 --> 12:58.000 Well, these are sketches, and this one is a null sketch. 12:59.000 --> 13:02.000 The sketches would identify you if there were too few users in them. 13:02.000 --> 13:12.000 So we have a hard cutoff for these, where you won't really have a sketch until the number of users passes a specific number. 13:12.000 --> 13:14.000 And our current number is 20. 13:14.000 --> 13:27.000 So at the point where you're one of 21 or so users, there's going to be a sketch here, to be able to handle the, 13:27.000 --> 13:33.000 well, I lost the word, but anyway, user count is one of those cardinality problems, right? 13:33.000 --> 13:46.000 So to handle the cardinality of users, each of those sketches is a HyperLogLog sketch, maintaining an approximation of the number of users across time.
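A minimal sketch of that thresholding rule, using the HyperLogLog implementation from the third-party datasketch library as a stand-in. The class, names, and publishing logic are illustrative assumptions; only the cutoff of 20 comes from the talk.

```python
from datasketch import HyperLogLog  # third-party: pip install datasketch

CUTOFF = 20  # the hard cutoff mentioned in the talk

class ThresholdedUserCount:
    """Approximate per-domain user counts, published only above a cutoff."""

    def __init__(self):
        self.hll = HyperLogLog(p=12)  # ~1.6% relative error on cardinality

    def add_user(self, user_id: str) -> None:
        self.hll.update(user_id.encode("utf-8"))

    def publishable(self):
        """Return the approximate user count, or None while below the cutoff."""
        estimate = self.hll.count()
        return estimate if estimate >= CUTOFF else None

counter = ThresholdedUserCount()
for i in range(25):
    counter.add_user(f"user-{i}")
print(counter.publishable())  # roughly 25; with fewer than 20 users it stays None
```

The sketch itself never stores user identifiers, only hashed register values, and withholding it entirely below the cutoff is what keeps small user groups from being identifiable.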
13:46.000 --> 13:50.000 When it comes to events, these are the events. 13:50.000 --> 13:55.000 Like, well, this one is really interesting. 13:55.000 --> 14:01.000 So a query like this one will come with mixed upper- and lower-case letters. 14:01.000 --> 14:05.000 That's a strategy for adding more bits of entropy into DNS. 14:05.000 --> 14:08.000 That's why it's upper- and lower-case letters. 14:08.000 --> 14:12.000 They just show up once from each server. 14:13.000 --> 14:22.000 And we aggregate them centrally to see if they're actually completely new, or if they just showed up late somewhere. 14:22.000 --> 14:30.000 And the method we use for doing this, well, it's Apache Spark. 14:30.000 --> 14:38.000 I picked this particular one because an interesting fact that we learned the hard way is that Apache Spark is running on the JVM, 14:38.000 --> 14:43.000 and it actually doesn't know about 64-bit unsigned numbers. 14:43.000 --> 14:47.000 So you actually have to do some really weird stuff. 14:47.000 --> 14:53.000 If you're using a bit string in your analysis or your data collection platform, 14:53.000 --> 14:59.000 and you're sending that up, and you're using a 64-bit string to indicate things, 14:59.000 --> 15:06.000 there are only going to be 63 of them available in Scala. 15:06.000 --> 15:11.000 So basically we just replace them with letters and use that.
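A minimal sketch of that letters workaround: instead of packing flags into a 64-bit integer, where the JVM's signed Long leaves only 63 comfortable bit positions, each flag position maps to a character. The alphabet and the flag positions here are illustrative assumptions, not DNS TAPIR's actual encoding.

```python
import string

# 64 characters, one per possible flag position 0..63.
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"
assert len(ALPHABET) == 64

def flags_to_letters(flag_bits) -> str:
    """Encode set flag positions (0..63) as a short letter string."""
    return "".join(ALPHABET[i] for i in sorted(flag_bits))

def letters_to_flags(s: str):
    """Decode the letter string back into flag positions."""
    return {ALPHABET.index(c) for c in s}

packed = flags_to_letters({0, 5, 63})   # bit 63 is the one a signed Long mangles
print(packed)                           # "AF/"
print(letters_to_flags(packed))         # {0, 5, 63}
```

A string of letters survives any signed-integer pipeline unchanged, which is the whole point of the trick.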
15:11.000 --> 15:15.000 So, when it comes to sharing this data, 15:15.000 --> 15:18.000 data commons is a fabulous idea. 15:18.000 --> 15:21.000 I love it. I really want to share this data. 15:21.000 --> 15:23.000 But sharing data is a bit tricky. 15:23.000 --> 15:25.000 I don't know if you guys remember this, 15:25.000 --> 15:31.000 but Strava's shared heatmaps drew interesting boxes on the maps in Afghanistan, 15:31.000 --> 15:34.000 and in other sensitive places. 15:34.000 --> 15:38.000 So that was unfortunate. 15:38.000 --> 15:40.000 And this is because they were sharing data, 15:40.000 --> 15:46.000 and they weren't really thinking that this would in any way be problematic. 15:46.000 --> 15:54.000 Netflix also had this prize where you were supposed to improve their recommendation system. 15:54.000 --> 16:02.000 And it turns out that if you have very specific tastes in which movies you watch, 16:02.000 --> 16:07.000 yes, it is possible to find you in aggregated data. 16:07.000 --> 16:11.000 And this is one of the concerns we have for our data, 16:11.000 --> 16:19.000 because, I mean, the chance that there is someone asking some really odd questions is pretty high. 16:19.000 --> 16:23.000 I know for a fact that I would probably stand out, 16:23.000 --> 16:28.000 and that makes it hard to share the data. 16:29.000 --> 16:37.000 So, as we've been working on this, to create data that we believe we can share, 16:37.000 --> 16:41.000 we've had a very basic design strategy. 16:41.000 --> 16:44.000 We design it as well as we can, 16:44.000 --> 16:47.000 and we try to break it, like really break it, 16:47.000 --> 16:52.000 and then we redesign it, and then we go back to trying to break it. 16:52.000 --> 16:57.000 And at some point we'll probably want someone else to try to break it as well, 16:57.000 --> 17:01.000 before we hand over all the data. 17:01.000 --> 17:10.000 There's also a tricky part about this data, since it's related to security research, et cetera. 17:10.000 --> 17:17.000 Chances are that any publication of the data needs to be somewhat delayed, 17:17.000 --> 17:26.000 so that all the security people get a chance to make sure that they have fixed all the stuff that could potentially fall out of this data. 17:26.000 --> 17:30.000 But having this as your goal 17:30.000 --> 17:38.000 actually puts some pressure on you to think a number of times about your data before you release it to anyone. 17:38.000 --> 17:46.000 And this, I don't think, is unique to DNS data. 17:46.000 --> 17:51.000 I think this is something that can be applied to a number of different types of data 17:51.000 --> 17:54.000 coming from networks and computers. 17:54.000 --> 18:02.000 I can think of a number of things there, but it would be interesting to know about other areas where this could be applied. 18:02.000 --> 18:09.000 So, if you have ideas where you can use these strategies in some really wild other field, 18:09.000 --> 18:12.000 I would definitely be interested in knowing. 18:12.000 --> 18:20.000 So, I believe that is the main gist of what we're doing. 18:20.000 --> 18:25.000 So, I think we'll leave a lot of time for questions, I hope. 18:25.000 --> 18:38.000 Yeah, I can just mention also that we would love for you to try to break our model. 18:38.000 --> 18:46.000 So, if you'd like to contribute in some way, or just follow our work, even if you're not in the DNS world, 18:46.000 --> 18:49.000 it could be interesting to exchange experiences. 18:49.000 --> 18:54.000 Please reach out to us by email, or most of us are on LinkedIn, 18:54.000 --> 18:57.000 or check out the site and the repo, because, 18:57.000 --> 19:03.000 I didn't put it here, but we have the repo on GitHub. So just reach out, 19:03.000 --> 19:09.000 and have a go at trying to break our model. 19:09.000 --> 19:12.000 Yeah, questions? 19:21.000 --> 19:24.000 Then maybe I have a question for you. 19:25.000 --> 19:33.000 [Audience question, partly inaudible:] Does the project make filtering decisions on what users can reach, 19:33.000 --> 19:38.000 or do you just log and research what they are looking for? 19:44.000 --> 19:49.000 So, what do we do? 19:49.000 --> 19:51.000 Yeah, good question. 19:51.000 --> 19:58.000 The question was whether we are filtering 19:58.000 --> 20:04.000 the domains for the user. DNS TAPIR doesn't make filtering decisions, 20:04.000 --> 20:11.000 which is a bit different from other solutions that do make filtering decisions. 20:11.000 --> 20:17.000 We publish observations to the resolver operator, where, for instance, 20:17.000 --> 20:22.000 an observation can be that it's a new domain, a low rank, 20:22.000 --> 20:25.000 a sudden ramp-up, and things like that. 20:25.000 --> 20:29.000 And then there is a module in DNS TAPIR 20:29.000 --> 20:34.000 we call the policy processor, which can take these observations 20:34.000 --> 20:39.000 and decide, in a policy together with other sources 20:39.000 --> 20:45.000 such as lists, etc., and set the response for that domain. 20:45.000 --> 20:48.000 Okay, now I've got three observations from TAPIR, 20:48.000 --> 20:53.000 plus this domain has these characteristics: 20:53.000 --> 20:55.000 okay, let's filter it. 20:55.000 --> 20:58.000 So, TAPIR just observes. 20:58.000 --> 21:04.000 We don't want to decide whether it's a bad or a good domain. 21:04.000 --> 21:07.000 Yeah. 21:07.000 --> 21:12.000 So, the question I would like you to bring home is: 21:12.000 --> 21:17.000 when did I last throw away data? 21:18.000 --> 21:32.000 Thank you. 21:32.000 --> 21:33.000 Thank you. 21:47.000 --> 21:52.000 Thank you.