WEBVTT 00:00.000 --> 00:12.440 Okay, hi everyone. For the next talk we have Steven, who will be talking about 00:12.440 --> 00:17.000 document conversion; I think it shouldn't be that hard. 00:17.000 --> 00:22.440 Thank you. This talk was originally planned to have a co-speaker, but he couldn't 00:22.440 --> 00:24.640 make it because he fell ill. 00:24.640 --> 00:31.800 So it's just myself. I'm Steven, I work on a project called DocSpec, and I'm going to talk 00:31.800 --> 00:36.040 about that project. 00:36.040 --> 00:41.440 DocSpec originates from the Dutch government, and we have an issue there because we have quite 00:41.440 --> 00:49.800 a lot of documents in government that are not accessible, that are not interoperable, and that are 00:49.800 --> 00:56.520 hard to convert. Within Dutch law and European law, it's basically a requirement 00:56.520 --> 01:02.480 that any information you make public as a government should also be available to people 01:02.480 --> 01:08.360 who, for example, can't read or have problems with eyesight or have some other reason 01:08.360 --> 01:15.440 why they can't access information in the same format. We took on the task at 01:15.520 --> 01:21.120 Logius, which is part of the Dutch Ministry of the Interior, to turn them into accessible 01:21.120 --> 01:26.680 and reusable HTML. 01:26.680 --> 01:32.920 So you might think the obvious solution is to just use Pandoc. That actually worked 01:32.920 --> 01:45.240 for a while, until it didn't. Pandoc is a very good conversion tool with a lot of features, 01:45.240 --> 01:50.200 and it handles a lot of formats, but if you need specific features or you need 01:50.200 --> 01:54.640 a specific kind of output, then it's harder to use. 
01:54.640 --> 02:01.000 So we built around it. We created a piece of software that we first called P4, and 02:01.040 --> 02:12.000 P4 is a Pandoc preprocessor, and we also ended up doing post-processing, which was 02:12.000 --> 02:15.880 actually more complex than doing the conversion ourselves. As you can see, we have the 02:15.880 --> 02:22.240 input, a preprocessor before it goes into Pandoc, and when it came out of Pandoc, we 02:22.240 --> 02:29.280 used the same kind of software to post-process it, to clean it up. Pandoc would do the 02:29.280 --> 02:35.240 core conversion, the preprocessor would normalize things to make it a bit easier for Pandoc, and 02:35.240 --> 02:43.880 post-processing would fix it up and put it in the format that we actually wanted. But you 02:43.880 --> 02:49.440 lose control in this stack, because you basically use Pandoc as a black box that you 02:49.480 --> 02:57.960 pre-process and post-process around. It's complex, and the annoying part is that every new requirement 02:57.960 --> 03:04.680 in our application requires a change in the Pandoc filters, or in the input we put in, 03:04.680 --> 03:10.360 or in the output we got out and need to process, so you basically keep making brittle 03:10.360 --> 03:17.440 changes. As for the core issues: of course, Pandoc is primarily a command-line interface. 03:17.440 --> 03:24.080 It does have a web server, but it's not feature-complete, and it's for sure not designed for 03:24.080 --> 03:31.600 collaborative editing, and that's what I'm going to show you next. But first: 03:31.600 --> 03:38.920 we designed a JSON-based AST that we can convert to, sitting in between the inputs and 03:38.920 --> 03:45.000 outputs, and yeah, we did our conversion ourselves. You can find a recent version of the 03:45.000 --> 03:53.400 AST at this link. I'm now actually going to show something and what it does, so you have 03:53.400 --> 04:07.760 a better picture of it. So our project was basically a tool to 
make documents more accessible. 04:07.760 --> 04:14.360 You could upload a document in our editor, and we would have validation on things like 04:14.360 --> 04:27.520 heading levels or machine objects. You can fix them, the message would disappear, and then we can 04:27.520 --> 04:38.480 convert it back to other formats, and a user can, for example, just upload it to a content 04:38.480 --> 04:44.040 management system or another system. What is very cool about this, I will zoom back a 04:44.080 --> 04:50.040 little bit, is that the validation message you see here is also basically an output of the 04:50.040 --> 04:57.000 converter we built. We created an input reader for the Word document file, and we also 04:57.000 --> 05:03.280 created an output for the editor, so you can convert to the editor and from the editor back 05:03.280 --> 05:12.840 to other formats. This mostly revolved around accessibility issues in 05:12.840 --> 05:23.040 documents, as a tool that would help governments make documents accessible. And as is tradition 05:23.040 --> 05:32.840 in this room, I will also give a shout-out to BlockNote and La Suite. We also built this tool 05:32.840 --> 05:41.560 for BlockNote, basically, where you can use it to import documents into BlockNote in La Suite. 05:41.560 --> 05:45.880 It happens in the same manner, so the DOCX reader is always the same; it doesn't 05:45.880 --> 05:52.000 matter if you export to HTML or another content management system. This was merged to production 05:52.000 --> 06:05.680 last week, and it helps to migrate from Microsoft Word in general, of course. It works kind 06:05.680 --> 06:16.320 of the same way as you saw in the previous demo: you just upload a document, it's being 06:16.320 --> 06:29.880 converted, and then you can basically manipulate your content that was also in the original 06:29.880 --> 06:58.620 document. So in order to convert, we created an AST based on JSON, where you can define 06:58.620 --> 
07:05.060 elements. It's very similar to other ASTs, such as the BlockNote one we heard about earlier. 07:05.060 --> 07:11.980 It's also typed: I use TypeSpec nowadays to type it, and you can basically create 07:11.980 --> 07:16.940 your own emitter to other programming languages, which means that you only have to maintain one 07:16.940 --> 07:26.180 spec and you can use it in other languages. I wrote the code base of the current converter 07:26.180 --> 07:34.580 in Elixir, so that's also the current language I convert my types to. And basically, with 07:34.580 --> 07:42.020 this AST you can easily map to BlockNote or Tiptap elements. The goal with this 07:42.020 --> 07:49.620 spec was really an AST to describe any and every element, so that means if it happens 07:49.620 --> 07:55.740 that your input document has a strange element inside of it, or a strange layout, we want 07:55.740 --> 08:04.940 to be able to describe it, so we can at least try to convert with the least amount of loss 08:04.940 --> 08:14.700 possible. We currently focus on DOCX. It is possible to also convert PDFs, but that's 08:14.700 --> 08:21.420 a bit more on the research side, as it has to do with machine learning. And we convert to 08:21.420 --> 08:28.540 editors like BlockNote and Tiptap and to formats like HTML and EPUB, and more is planned. You 08:28.540 --> 08:35.740 can see in this chart, I'm not sure if it's very visible, but you can see basically what we 08:35.740 --> 08:41.980 want to implement this year and what we have implemented. So currently for input, we support 08:41.980 --> 08:48.620 DOCX and Tiptap, but we also want to do HTML, Markdown of course, and OpenDocument. 08:48.700 --> 08:54.060 We also want to import from BlockNote, so you can basically export your BlockNote documents 08:54.060 --> 09:01.820 to other formats through our system. We want to make an export to DOCX, Markdown and everything; 09:01.820 --> 09:10.060 this is the plan for the coming year, 
and we want to go to PDF output using Typst, 09:10.060 --> 09:19.820 so Typst as a tool to basically render out your documents as PDFs. And the direction I also 09:19.820 --> 09:26.380 really, really want to go in is rewriting the Elixir code in Rust, which also means you would be able 09:26.380 --> 09:34.540 to run it in browsers with WebAssembly. We'd be able to run it as a command-line interface, that's 09:34.540 --> 09:45.340 also currently possible; as an API, that's also currently possible; as a queue worker; as a library, of course; 09:47.740 --> 09:53.580 and with FFI bindings to any language. This would practically mean that you can use this in any 09:53.580 --> 09:59.180 project, you can use this in any programming language, and most importantly, you don't need a 09:59.260 --> 10:04.620 server. If you are an editor that needs document conversion, you can just do it in the browser, 10:05.340 --> 10:11.420 which is quite important for end-to-end encrypted systems, because they can't expose 10:12.380 --> 10:16.060 documents to the server, because that would basically breach the encryption. 10:19.260 --> 10:23.260 And that makes it quite interesting for projects like CryptPad and NextGraph, 10:23.660 --> 10:30.700 where the server can't see the document contents, and by ensuring the conversion happens 10:30.700 --> 10:38.700 client-side, we basically preserve the privacy of the user. It would mean that you can 10:38.700 --> 10:43.420 convert without back-end infrastructure, which would reduce latency and would be quite real time, 10:44.140 --> 10:52.940 and, as I've said three times, end-to-end. I have some URLs here where you can see the projects: 10:54.380 --> 11:00.060 you can see La Suite Docs, you can see the DocSpec project itself, you can see NLdoc, that's the 11:00.060 --> 11:06.700 old code base that we used at Logius, part of the ministry, and I added links, of course, 11:06.780 --> 11:18.380 to Pandoc. 
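To make the idea of a JSON-based AST with validation as "just another output" a bit more concrete, here is a minimal TypeScript sketch. It is an illustration under stated assumptions, not DocSpec's actual schema or API: the node types (`Doc`, `Block`, `Inline`), the functions `validate` and `toHtml`, the heading-level rule, and the missing-alt-text check are all hypothetical names and rules chosen for this example.

```typescript
// Hypothetical node types for illustration; the real DocSpec AST may differ.
type Inline = { type: "text"; value: string };
type Block =
  | { type: "heading"; level: number; children: Inline[] }
  | { type: "paragraph"; children: Inline[] }
  | { type: "image"; src: string; alt?: string };
type Doc = { type: "document"; children: Block[] };

// A validation message points at the block it concerns.
type Issue = { index: number; message: string };

// One "output" of the tree: accessibility issues instead of a file format.
function validate(doc: Doc): Issue[] {
  const issues: Issue[] = [];
  let previousLevel = 0;
  doc.children.forEach((block, index) => {
    if (block.type === "heading") {
      // Assumed rule: a heading may go at most one level deeper than the last one.
      if (block.level > previousLevel + 1) {
        issues.push({
          index,
          message: `Heading level jumps from ${previousLevel} to ${block.level}`,
        });
      }
      previousLevel = block.level;
    } else if (block.type === "image" && !block.alt) {
      // Assumed rule: every image needs alternative text.
      issues.push({ index, message: "Image has no alternative text" });
    }
  });
  return issues;
}

function escapeHtml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

function inlinesToHtml(nodes: Inline[]): string {
  return nodes.map((node) => escapeHtml(node.value)).join("");
}

// Another output of the same tree: plain HTML.
function toHtml(doc: Doc): string {
  return doc.children
    .map((block): string => {
      switch (block.type) {
        case "heading": {
          // HTML only defines h1 to h6, so clamp the level.
          const level = Math.min(Math.max(block.level, 1), 6);
          return `<h${level}>${inlinesToHtml(block.children)}</h${level}>`;
        }
        case "paragraph":
          return `<p>${inlinesToHtml(block.children)}</p>`;
        case "image":
          return `<img src="${escapeHtml(block.src)}" alt="${escapeHtml(block.alt ?? "")}">`;
      }
    })
    .join("\n");
}

// A small document with two deliberate accessibility problems.
const example: Doc = {
  type: "document",
  children: [
    { type: "heading", level: 1, children: [{ type: "text", value: "Annual report" }] },
    { type: "heading", level: 3, children: [{ type: "text", value: "Budget" }] },
    { type: "image", src: "chart.png" },
    { type: "paragraph", children: [{ type: "text", value: "Costs < income" }] },
  ],
};

const exampleIssues = validate(example);
const exampleHtml = toHtml(example);
```

Because `validate` and `toHtml` both walk the same tree, a reader for a new input format only has to produce this tree once to get both the validation messages and every export target.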
All right, this was my talk. You can see my contact details here in case you 11:19.340 --> 11:22.940 want to reach out, and I would like to take your questions. 11:24.940 --> 11:28.700 I have a first question. So when we're talking about Word conversion, 11:28.780 --> 11:37.260 the question is how far can you go, because Word is a very large format with a lot of 11:37.260 --> 11:43.420 features. So what are the limits in terms of what is supported and what is not supported, 11:43.420 --> 11:53.020 which is probably more important in the export? And linked to that, BlockNote has now recently got 11:53.100 --> 12:00.540 comments and suggestions; are you planning to support comments and suggestions in the Word format? 12:02.380 --> 12:07.580 That's the problem. Let's take on your first question first. Can you repeat your question 12:07.580 --> 12:13.500 for me? Yeah, I also know today. Yeah, the first question was how far can you go on Word 12:13.500 --> 12:19.100 import? Yeah, how far can we go on Word import? You can basically go as far as the 12:19.100 --> 12:26.700 budget reaches, but that's not your technical question. It's quite a hard format, because there is 12:27.980 --> 12:34.380 quite a collection of versions of the format that you can use to describe the exact same element, 12:35.260 --> 12:44.380 and that's quite intense. There will always be some cases that you can't really cover, 12:45.260 --> 12:49.180 and you basically have to work with a lot of test documents to import. 12:53.180 --> 13:01.340 I think I'm almost on the level of Pandoc, almost. I think they do a little more with footnotes 13:01.340 --> 13:07.900 and endnotes, which I don't cover yet, but it's, yeah, planned to be covered. Your second question 13:07.980 --> 13:15.900 was about comments in BlockNote. It would be very feasible. I also haven't seen their 13:15.900 --> 13:21.740 implementation yet, so I'm not sure how hard it would be. 
I think their implementation 13:21.740 --> 13:31.980 is quite new also. Maybe also to your question: Pandoc also likes to cover some of 13:32.540 --> 13:41.900 these formats, like Org mode, and also LaTeX; are you planning to have 13:41.900 --> 13:49.820 feature parity on this side? Yeah, the question was: Pandoc covers other popular formats, 13:49.820 --> 13:56.940 like Org mode and LaTeX; am I planning to be basically on par with features? It's hard, 13:57.020 --> 14:04.300 because they support so many formats. It would at least take a full year to actually be on par, 14:05.500 --> 14:13.580 at least. I try to cover as many formats as I can, but I focus on the more popular ones, because they are 14:13.580 --> 14:26.460 more important. I'm not 100% sure, I just do the really good. You said that you have millions of 14:26.620 --> 14:33.900 old documents. Is the idea to convert them all into, you know, a more modern format, or is it just 14:33.900 --> 14:42.300 so that people can still access the old formats? Because if it's the first one, maybe there should 14:42.300 --> 14:50.060 be some automated batch converting. Yeah, the question was about having 14:50.140 --> 14:57.900 millions of documents. This was the case in the Dutch government, and your question was... 14:57.900 --> 15:05.260 sorry? Are you intending to convert them all, which would mean running plenty of 15:05.260 --> 15:11.660 them through this, instead of doing what you showed, you know, manually? Yeah, so the question is, 15:11.740 --> 15:20.700 are we planning to convert them all? The Dutch government has quite a lot of documents. It's also 15:20.700 --> 15:26.300 hard to have a pipeline for it, because it's also separate governments. 
It would be quite feasible to 15:26.300 --> 15:36.460 have a pipeline, but this product that I showed was really focused on the user basically doing the work, 15:36.540 --> 15:43.100 the user being guided, and that's quite necessary, because the source can't always be made accessible automatically: 15:44.220 --> 15:49.580 heading structures are not always in the correct order. So that automatically means that we need to 15:51.580 --> 15:59.500 make guesses about what it should be, or we should just ask the user. That's the hard part of the 15:59.500 --> 16:05.820 pipeline: it won't all be accessible. You would need to validate it, you would need to 16:05.900 --> 16:12.380 let the user choose, so as to make it accessible. 16:13.660 --> 16:18.380 I have another question. In your list of potential features in the future, you mentioned PDF 16:19.980 --> 16:27.900 as output, I'm not sure if it was input also. Yeah. Why focus, for example, on output PDF, 16:27.900 --> 16:33.900 when there are quite a lot of solutions that already exist, including just printing, for example, 16:33.900 --> 16:41.580 or, so, lots of solutions that already exist to convert HTML to PDF. 16:41.580 --> 16:46.460 Yeah, but I think there's something good to do. Yeah, the question was, why 16:46.460 --> 16:56.300 focus on output PDF, because there are quite a few solutions. I would say it depends on money. 16:56.860 --> 17:03.500 This is interesting for La Suite, so that's why it would be interesting for me to build. 17:03.500 --> 17:05.500 That's the short answer. 17:08.380 --> 17:10.380 Yeah, I'll explain. 17:27.260 --> 17:42.220 I've looked at unified.js. 17:42.220 --> 17:45.260 They're seeing objects, but I will after I'll just talk now. 17:45.260 --> 17:47.260 Thank you. 17:51.500 --> 17:55.740 It's great to hear this. 
We actually did something like this 20 years ago, so we decided to take a first 17:55.740 --> 18:01.580 route and sort of handle the optimised word and made it for us to stay alive and to come there. 18:02.380 --> 18:07.580 So I learned a bit about the last few format. We'll just talk in the DOCX to look at it. 18:08.620 --> 18:13.660 But are you confident that you can ever become feature complete, connecting to the first question? 18:13.740 --> 18:17.660 Are you not always in the life of someone who's trapped, so you speak about that always 18:17.660 --> 18:22.780 to another breaking feature, and then so you can't quite get there, because I don't even seem to say, 18:22.780 --> 18:27.340 this is not a problem. Yeah, the question: am I confident that I can be feature complete? 18:28.540 --> 18:35.660 Basically, no. Well, yeah, DOCX contains so many specifications that it's practically impossible. 18:36.620 --> 18:43.980 I do think it will be feasible to fully convert most of the documents that users make, 18:44.620 --> 18:48.620 because they won't use the entire set of options available. 18:50.220 --> 18:55.500 Being fully on par, feature complete with the external spec, 18:56.220 --> 18:58.940 the spec is so big that it's practically impossible. 18:59.900 --> 19:05.580 That's why we just want to provide the users with a set of templates that you will speak 19:05.580 --> 19:15.020 template for. Yeah, using templates makes it easier, because you actually control the inputs, 19:15.020 --> 19:18.940 so it also makes it easier to convert, because you know what to expect. 19:20.940 --> 19:25.980 And the result, does it include some sort of check of provenance and approval of it? 19:26.940 --> 19:28.300 So, what do you mean? 19:28.300 --> 19:34.540 You make a transformation of content; does that result? 
The content, does it contain 19:34.540 --> 19:40.540 metadata that says its origin is here, and it was transformed by this and this, and, I 19:40.540 --> 19:47.180 guess, that it's converted correctly? If it has metadata? It does include some metadata 19:47.180 --> 19:53.980 for the moment. I think authors. I mean metadata generated by your transformation, 19:54.300 --> 20:02.700 that it is done correctly. It does not describe if it's done correctly. But does that mean 20:02.700 --> 20:07.500 you need something like that? Sorry? Do you see demand for anything like that? 20:08.540 --> 20:14.940 Yeah, it would be very interesting to have something that also checks if the conversion was done 20:14.940 --> 20:20.220 correctly, but I'm not sure how this would work in practice. 20:20.940 --> 20:24.940 It sounds like you need a second application to check if the conversion is correct. 20:27.500 --> 20:33.260 But I've heard it before in government circles, especially, where they want 20:35.180 --> 20:40.460 the conversion to be as correct as possible. It could be interesting, but I'm not sure 20:40.460 --> 20:42.060 what it would look like in practice. 20:50.540 --> 20:51.020 Sorry? 21:03.020 --> 21:06.300 Yeah, the question was, do I provide round-trip conversions? 21:08.300 --> 21:08.700 Sorry? 21:09.740 --> 21:10.460 It didn't happen. 21:16.460 --> 21:18.300 What exactly do you mean by it? 21:20.220 --> 21:22.780 I don't know if there's anything that changes, and then it's sent back. 21:22.780 --> 21:25.900 And really, other parts, which you don't support, still, it says, like, 21:25.900 --> 21:27.260 nested tables or whatever. 21:29.260 --> 21:33.660 It really depends on the output formats, on what is actually supported. 21:34.380 --> 21:35.900 Not sure if that answers your question. 21:36.940 --> 21:39.980 But, like, is there actually some support for just keeping the data 21:39.980 --> 21:40.940 that you don't understand? 
21:41.740 --> 21:45.580 Oh, do I try to support keeping the data that I don't understand? 21:46.540 --> 21:49.500 At this moment, no, but it would be interesting, 21:49.500 --> 21:53.820 but it would also be hard to create a spec around that. 21:54.540 --> 21:58.220 But it could be quite interesting, for example, if you later improve 21:58.940 --> 21:59.900 document conversion. 22:00.540 --> 22:05.020 Because then you can basically convert it again with the current knowledge. 22:05.580 --> 22:07.260 But no, I don't do that at this moment. 22:11.900 --> 22:13.900 Okay, do we have any questions? 22:16.300 --> 22:17.340 No? 22:19.020 --> 22:19.980 Okay, so thank you. 22:28.300 --> 22:28.780 Sorry. 22:28.780 --> 22:30.780 Okay.