WEBVTT 00:00.000 --> 00:25.440 We're switching topic a bit now. We're back in 2020. It's on OI here. At least not 00:26.360 --> 00:50.080 we're back in 2023 and we're talking about the Digital Services Act, which is, well, new legislation, 00:50.080 --> 00:57.920 but one that kicked in at the end of 2023. We are specifically focusing on a package that 00:57.920 --> 01:07.040 Luca, who is over there, and I developed to analyze the transparency database of the Digital 01:07.040 --> 01:15.760 Services Act. A quick overview: the Digital Services Act is a legal framework that basically promotes 01:15.760 --> 01:22.640 transparency online for platforms, and especially for very large platforms. It has 01:22.640 --> 01:30.320 many layers of transparency and, let's say, rights that have been given to the users of online 01:30.320 --> 01:36.720 platforms. For example, there is now the obligation to have transparent terms and 01:36.720 --> 01:42.720 conditions, which are clear to read, and to explain and risk-assess the algorithms that 01:42.720 --> 01:50.720 the platform uses, such as the recommender system. Platforms have to explain the content 01:50.720 --> 01:57.760 moderation policies they apply to user content. Then there are additional consumer protection 01:57.760 --> 02:04.000 rights. For example, there cannot be targeted advertising for minors. There is now a mechanism, 02:04.000 --> 02:09.920 which is mandatory for platforms, for users to report illegal or incompatible content online. 02:10.480 --> 02:15.440 And then there are new transparency and data access provisions, which we are going to focus on. 02:16.240 --> 02:23.120 For example, now, every time you go to an online shop, there is the obligation to share the 02:23.120 --> 02:29.360 seller information, which was not mandatory before, quite interestingly, and you can also have 02:29.360 --> 02:38.480 the details of the advertisement that is proposed to you by the platform. 
There are many transparency 02:38.480 --> 02:44.640 provisions. I will just quickly show them to you. There are transparency reports, which are 02:44.640 --> 02:50.880 biannual reports about the content moderation activities of the platforms. The transparency 02:50.880 --> 02:56.880 database, which I will present later in detail. Then, as we said, the terms and conditions, which we are 02:56.880 --> 03:02.240 tracking in collaboration with the Open Terms Archive team. So we are tracking the changes that 03:02.240 --> 03:07.440 platforms apply to them, how they treat user data, how they present content to the user. 03:08.400 --> 03:14.480 And then there are advertisement libraries, so every platform, every large platform, 03:14.480 --> 03:19.840 has to present to the user a repository where all the information about all the advertisements 03:19.840 --> 03:25.440 that have been run on the platform can be freely seen by everyone at any time. And then there are 03:25.440 --> 03:32.480 others, which are very technical. I won't go into details, just know that for researchers, there is 03:32.480 --> 03:40.000 a new data access provision, meaning that vetted researchers can get access to closed data or 03:40.000 --> 03:47.600 private data from the companies under very strict conditions, but it is an unprecedented measure 03:47.600 --> 03:58.320 to scrutinize the activities of the platforms. What do the provisions do? For example, 03:58.320 --> 04:04.240 the transparency reports gave for the first time an overview of the content moderation 04:05.520 --> 04:11.840 human resources that platforms are allocating to moderate content online, so we have a 04:11.840 --> 04:17.920 breakdown by language. We have the accuracy of the content moderation activities, etc. And so far, 04:17.920 --> 04:26.560 we had three rounds of these reports, and one is due next spring. And then we focus on the transparency 04:26.640 --> 04:34.240 database. 
The transparency database collects the anonymized version of all the content moderation 04:34.240 --> 04:40.320 decisions that the platforms take against user content. So the data life cycle is: the user 04:40.320 --> 04:45.200 of course creates content on the platform, and then you have the platform that, either by 04:46.160 --> 04:54.000 proactive decision or by notice from a user, takes down or moderates the content, and then the 04:54.000 --> 05:04.320 platform under the DSA is obliged to notify the user of the causes and the reasons 05:04.320 --> 05:11.920 why the content was moderated. And then it sends this to the user and an anonymized version 05:11.920 --> 05:18.000 to the transparency database. As for what it looks like, it's a very big JSON with all the 05:18.000 --> 05:24.320 information about the statement of reasons for content moderation, specifically defining the category, 05:24.320 --> 05:31.840 the automation that was used in the process, the free text outlining, for example, the legal 05:31.840 --> 05:39.200 grounds, etc. And then once it's in the database, you currently have three ways to look at this data. 05:39.200 --> 05:44.880 There is a website search that we offer, which is kind of real time, but it's very limited in 05:44.880 --> 05:50.960 scope and only covers the last six months of data. And then there is an online dashboard 05:50.960 --> 05:57.360 that gives us an aggregate view of the data, but it's very limited in functionality if you 05:57.360 --> 06:02.720 want to go deeper in the analysis. And then there are these daily dumps, which are basically the 06:02.720 --> 06:10.880 daily volume of statements received by the database in a CSV dump, which are very big and 06:10.880 --> 06:17.840 require a lot of pre- and post-processing. So our package basically focuses on this part 06:20.240 --> 06:29.040 of the pipeline and tries to optimize and streamline this kind of analysis. 
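Editor's sketch, not the speakers' package: the pre-processing of a daily dump described here can be approximated by streaming the CSV row by row, so memory use stays flat however large the file is. The column names (`platform_name`, `category`), the category labels and the sample rows are assumptions made for illustration; check them against the header of a real dump.

```python
import csv
import io
from collections import Counter

# Assumed column names; verify against the header of a real daily dump.
PLATFORM_COL = "platform_name"
CATEGORY_COL = "category"


def count_by_platform_and_category(csv_file):
    """Stream a CSV dump row by row and count statements per (platform, category)."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        counts[(row[PLATFORM_COL], row[CATEGORY_COL])] += 1
    return counts


# Tiny in-memory stand-in for one daily dump (made-up rows, not real data).
sample = io.StringIO(
    "platform_name,category\n"
    "PlatformA,SCOPE_OF_PLATFORM_SERVICE\n"
    "PlatformA,SCOPE_OF_PLATFORM_SERVICE\n"
    "PlatformA,ILLEGAL_OR_HARMFUL_SPEECH\n"
    "PlatformB,SCOPE_OF_PLATFORM_SERVICE\n"
)

counts = count_by_platform_and_category(sample)
for (platform, category), n in counts.most_common():
    print(platform, category, n)
```

For a real dump, the same function would take an open file handle instead of the in-memory sample. Only the counter is kept in memory, so the output is the small categorical aggregate rather than the full statements.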
06:29.040 --> 06:35.760 So the content of the database is quite big, and our package cannot do 06:36.400 --> 06:44.320 miracles. For example, we are now at about 25 billion statements in the database. 06:44.320 --> 06:51.840 And as you can see, the biggest share of it is from one specific player. So if you want to 06:51.840 --> 06:57.520 analyze them, even with the package you still need a very big machine, let's say, 06:59.120 --> 07:05.040 with a lot of throughput, if you want to analyze the daily dumps. And even the aggregated view of it, 07:05.440 --> 07:13.760 by the categorical variables, is about two gigabytes in the end. 07:13.760 --> 07:19.280 So if we just remove the biggest player, you see that we have a breakdown of the content which is 07:19.280 --> 07:29.760 more heterogeneous. And so that was just to say that this database is quite big, 07:30.080 --> 07:36.480 if you also account, for example, for the free text data that is in it. So coming back to our 07:36.480 --> 07:42.720 package, it's a package that you can install in different ways. We provide different 07:43.760 --> 07:49.760 avenues. There is a Python package that you can directly install. We also provide, out of the box, 07:50.480 --> 07:56.240 a Docker container image that exposes different ways to interact with it. I will show 07:56.240 --> 08:02.960 it in the last one. One of these is the dashboarding capabilities. And we also offer, of course, 08:02.960 --> 08:08.800 interactive online documentation. As said, there are three ways to interact with the package, 08:08.800 --> 08:16.400 if you, for example, run the container image. 
There is an API interface, which offers a standard 08:16.400 --> 08:26.480 FastAPI interactive interface to try out the different queries that you can perform, 08:26.480 --> 08:30.960 which are basically the download, the filtering and the aggregation of the data that are 08:30.960 --> 08:37.120 found in the database. The same functionalities are mirrored by a command-line interface, 08:37.120 --> 08:44.720 which is easily configured with a configuration file. And then you have an interactive way. 08:45.600 --> 08:50.240 Just to say that we will be in the workshop later. So if you want more details or you want a 08:50.240 --> 08:58.800 small demo, you can stay and we will be happy to provide one. So coming back to the third 08:58.800 --> 09:05.360 way to interact, there is also dashboarding built on Superset, the Apache dashboarding system, 09:06.240 --> 09:14.560 framework. And we just show some of the possible breakdowns that you might 09:14.640 --> 09:24.880 be interested in. For example, I'm sorry, but the default font of Superset is quite small, 09:24.880 --> 09:31.120 I have to say, but you can have, for example, a breakdown, very easily, across the platforms 09:31.120 --> 09:37.680 that submit the content. For example, you have TikTok, Amazon, Pinterest, Facebook, etc. 09:37.680 --> 09:44.080 And the category of the content that they were moderating. So for example, for most of the platforms, 09:44.160 --> 09:50.640 this is just scope of platform service, which is kind of a part two for them. And then you have 09:50.640 --> 10:00.400 other categories. You can also have breakdowns as a time series, or compare manual 10:00.400 --> 10:06.240 and automated content moderation across different platforms. So you can see daily patterns 10:06.240 --> 10:12.480 and where platforms are using automated content moderation or not. 
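Editor's sketch, not the Superset dashboard or the speakers' package: the manual versus automated comparison described here boils down to a per-day, per-platform share of automated decisions. The column names (`created_at`, `platform_name`, `automated_decision`), the value labels and the sample rows are assumptions made for illustration.

```python
import csv
import io
from collections import defaultdict

# Assumed label for a fully manual decision; verify against real data.
NOT_AUTOMATED = "AUTOMATED_DECISION_NOT_AUTOMATED"


def daily_automation_share(csv_file):
    """Return {(day, platform): share of statements with an automated decision}."""
    totals = defaultdict(int)
    automated = defaultdict(int)
    for row in csv.DictReader(csv_file):
        # Keep only the YYYY-MM-DD part of the (assumed) timestamp column.
        key = (row["created_at"][:10], row["platform_name"])
        totals[key] += 1
        if row["automated_decision"] != NOT_AUTOMATED:
            automated[key] += 1
    return {key: automated[key] / totals[key] for key in totals}


# Made-up rows standing in for statements of reasons, not real data.
sample = io.StringIO(
    "created_at,platform_name,automated_decision\n"
    "2024-05-01 08:00:00,PlatformA,AUTOMATED_DECISION_FULLY\n"
    "2024-05-01 09:30:00,PlatformA,AUTOMATED_DECISION_NOT_AUTOMATED\n"
    "2024-05-01 10:00:00,PlatformB,AUTOMATED_DECISION_NOT_AUTOMATED\n"
    "2024-05-02 11:00:00,PlatformA,AUTOMATED_DECISION_FULLY\n"
)

shares = daily_automation_share(sample)
for (day, platform), share in sorted(shares.items()):
    print(day, platform, f"{share:.0%}")
```

Plotting these shares over time per platform gives the kind of daily pattern mentioned in the talk. Anything not labelled as not automated is counted as automated here, which is a simplification.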
10:12.560 --> 10:17.680 There are other breakdowns that I can show you later in the workshop. Just to mention that 10:17.680 --> 10:25.120 there is a flourishing community around this in the academic research 10:25.120 --> 10:33.840 community. And there will be an update of the database, late in July 2025, which mainly 10:33.840 --> 10:40.240 will introduce content identifiers for illegal products that are moderated. These are our 10:40.240 --> 10:45.280 coordinates, if you are interested, and as I said, stay around for the workshop and the 10:45.280 --> 10:48.480 panel if you want more information about this. Thank you very much. 10:55.040 --> 10:57.120 I don't think there is time for questions. 11:10.640 --> 11:16.640 Yes. I forgot to mention that I'm Enrico from the European Commission. I 11:16.640 --> 11:20.800 work in DG Connect, in the DSA enforcement team, as a data scientist.