WEBVTT

00:00.000 --> 00:08.240
Hey, welcome to our talk about Servfail Sync.

00:08.240 --> 00:13.120
This is not really about DNS in general, but more about orchestration, orchestrating DNS.

00:13.120 --> 00:22.000
It's all a bit silly, that's why we're here, so be prepared.

00:22.000 --> 00:35.120
Well, Tommy, I hope you can make it, really likes Bash, and you will find that in the talk again.

00:35.120 --> 00:37.200
And I like distributed systems.

00:37.200 --> 00:40.520
I read internet standards for fun and implement them.

00:40.520 --> 00:45.000
That's kind of how I got into this whole DNS thing.

00:45.000 --> 00:52.680
And we both run a project, Servfail, which is an authoritative DNS service.

00:52.680 --> 00:58.200
Everything we have is open source: we run PowerDNS, which is open source software,

00:58.200 --> 01:05.240
our back end is open source, and our web interface is as well.

01:05.240 --> 01:06.240
It's fun.

01:06.240 --> 01:11.000
We support multi-tenancy, so when you delegate a domain, you can choose the primary

01:11.000 --> 01:18.360
name server, and all the important stuff, like key material, never leaves the primary

01:18.360 --> 01:24.440
name server, and it's just delegated, and the rest of the zone is just AXFR'd to all the secondary

01:24.440 --> 01:27.440
name servers in the network.

01:27.440 --> 01:35.800
We have both project-hosted name servers, which we run as a project, and we

01:35.800 --> 01:40.400
also have community members: when we trust them enough, and they have been active for

01:40.400 --> 01:45.720
a while, and they have time, they can also run their own DNS server, which

01:45.720 --> 01:53.360
then joins the network, and people can also put zones there if they

01:53.360 --> 01:55.400
want to.

01:55.400 --> 02:00.680
We have both stable name servers, which are production ready, they have been running

02:00.680 --> 02:05.960
for a while, and the unstable name servers, which are

02:05.960 --> 02:11.280
still in the testing phase, or even temporary name servers, when someone brings some

02:11.280 --> 02:16.160
silly hardware to an event, and we happen to have an IP address, just plug it in, and

02:16.160 --> 02:21.400
oh yeah, we run a DNS server for a few days. So we can just plug it into the project,

02:21.400 --> 02:27.080
and people can use that name server if they want to. And as mentioned before, our

02:27.080 --> 02:35.960
tech, in quotation marks, stack is built on PowerDNS and, as mentioned before, Bash. There's

02:35.960 --> 02:42.160
a lot of Bash: the back end is written in Bash, and the API proxy is written

02:42.160 --> 02:50.800
in Bash, and it works surprisingly well, better than some other solutions. And of course

02:50.800 --> 03:01.760
there's also curl because we need, don't be right, there's also curl, and one and a half

03:01.760 --> 03:09.080
hours in front of it, it will be faster to walk, yes, I will give it to you in this work,

03:09.080 --> 03:17.760
I will need to lean in really close, because PowerDNS has a lovely API we need to

03:17.760 --> 03:24.120
somehow access. Why do we need to orchestrate? With a handful of servers you can just copy

03:24.120 --> 03:29.720
config around, it's fine, you can get by with that. When you get a handful more

03:29.720 --> 03:36.320
servers, that becomes more tedious, more painful, you start getting misconfigurations,

03:36.320 --> 03:42.080
and at a certain point you just don't even bother with manually trying to copy configuration

03:42.080 --> 03:50.120
to systems. No, we have 12 servers now, just forget it. And not all of them are hosted by

03:50.120 --> 03:56.240
us, they're hosted by volunteers who have spent time, and we
don't just want to bug them

03:56.240 --> 04:01.480
all the time: hey, can you please, can you please add this new name server we just added

04:01.480 --> 04:06.960
for a few days. So we really need to have something to sync. An important part

04:06.960 --> 04:11.840
is that we don't have root access to all of the servers, which is by design, because

04:11.840 --> 04:18.480
we want this to be sustainable, and we want to divide the work among as many people as possible,

04:18.480 --> 04:23.520
instead of just: oh yeah, I will fix it, and we have 100 servers to fix, and then it's like

04:23.520 --> 04:30.880
us two people. That would suck, so we want to avoid that. So what do we need? As mentioned

04:30.880 --> 04:35.760
before, the network state is assumed to change: we may have new servers which people want

04:35.840 --> 04:42.640
to host, we may have to remove servers, because they were just temporary, or like, oh,

04:42.640 --> 04:49.040
someone doesn't want to host a server anymore, which is fine, or like IP migrations, because

04:49.040 --> 04:53.760
everything is also done, we don't really have a lot of our own infrastructure, we grant a lot

04:53.760 --> 05:01.600
of these servers, so sometimes we just move as well. We want something that is easy to configure

05:01.680 --> 05:07.600
and run for volunteers. So when we just have to push new name servers, remove servers,

05:07.600 --> 05:12.960
we have to change IP addresses, and, something we will get into later, we have to change IP addresses

05:12.960 --> 05:21.440
for the primaries of the different zones. It's not really feasible to do at this scale with

05:21.440 --> 05:29.280
volunteers, and we just really need that automated. We need to manage something which I have

05:29.360 --> 05:34.800
called static and dynamic configuration. The static configuration is just in the config files:

05:34.800 --> 05:39.200
which servers are allowed to AXFR, which servers are allowed to notify, which
servers should

05:39.200 --> 05:45.280
you notify on zone updates. And then there's a second part: in PowerDNS there's something called

05:45.280 --> 05:50.880
autoprimaries: when you have the right name server set and you AXFR a zone which the server

05:50.880 --> 05:56.000
has previously not seen, it just automatically creates it. We make a lot of use of that, because

05:56.960 --> 06:02.080
we don't need to script an extra API endpoint, it just takes some work off us, it's very nice.

06:02.880 --> 06:08.560
And that's stored in SQL, so we need to update that with pdnsutil.

06:08.560 --> 06:14.160
And everything should run, as previously mentioned, with reasonably low privileges.

06:14.960 --> 06:21.840
We don't just want to dump some wget script, run some wget script we wrote,

06:22.480 --> 06:28.480
yeah, run it as root, don't worry about it. That's not really nice.

06:29.280 --> 06:34.880
Some existing solutions would be Ansible, which I unfortunately make extensive use of.

06:35.440 --> 06:39.760
I don't like it. You kind of describe the state, and just hope everything

06:40.960 --> 06:45.440
matches that state, and if there are extra files in the directory, just don't worry about it,

06:45.520 --> 06:51.760
just don't worry about the extra files, they just belong there. And Ansible just wants

06:51.760 --> 06:55.840
to be root all the time, so it doesn't really fit in the whole, yeah, we should have minimal

06:55.840 --> 07:05.040
privileges. It's also a bit hard to integrate. And there's Nix. I know Nix, I make extensive

07:05.040 --> 07:10.640
use of it on my own hardware, but I would never recommend it to anyone else. There are very

07:11.200 --> 07:18.560
few circumstances in which I would. It is missing a lot of documentation, it's slow: a proper

07:18.560 --> 07:31.440
Nix config is slower than an Ansible deploy, and you're not doing it right,

07:33.440 --> 07:38.560
if you have multiple targets, you need to build one configuration for each target, and that's

07:38.640 --> 07:46.000
even slower. Not fun. Dynamic configuration is not really something that's designed into Nix.

07:46.000 --> 07:53.440
People have written some scripts which fetch the state and update the state for some databases,

07:53.440 --> 08:00.160
but mostly Nix is fine when managing static configuration files, and not whatever the hell

08:00.160 --> 08:05.200
we are trying to do. It's also hard to integrate, again, because people need to run a Nix daemon,

08:05.280 --> 08:09.200
and then you connect to the daemon with some user to deploy the config. Again,

08:10.960 --> 08:18.640
not really feasible. So: build our own. You can steal the microphone and continue

08:18.640 --> 08:27.200
it, you can steal the microphone for that. You won't be, no, it's fine, you can just stand

08:27.200 --> 08:40.560
really close to me. It's okay, hold on, I think we're good. So we have devised a solution, of course

08:40.560 --> 08:49.200
in Bash, and the solution has two parts. It's a client-server architecture. The clients are deployed

08:49.200 --> 08:56.240
on individual PowerDNS servers, as in on the same machine, because they need to modify some state

08:56.240 --> 09:03.680
that is on the same machine, so that just makes sense. And the servers, or the server right now,

09:03.680 --> 09:13.440
is a centrally located control plane. We are slowly looking into, we are thinking about

09:13.440 --> 09:19.680
distributing this somehow, but distributed systems are not really super pragmatic, so we have started

09:19.680 --> 09:25.040
with a simple solution where there is a central authority, purely because that's easier for us.

09:27.200 --> 09:33.360
We don't like saying that essentially this is a blockchain architecture, we don't like saying that

09:33.360 --> 09:40.720
this is blockchain, but you have a chain of events, which is to say, when you add an event to

09:40.720 --> 09:45.120
the timeline, you can never remove it, you can just add another one that modifies the state.

09:46.000 --> 09:57.680
This is done to ensure that all servers can know what all the other servers are, and at what point of

09:57.680 --> 10:03.520
the timeline they are, so they can request changes from a specific point, from a specific height

10:03.520 --> 10:12.960
in the timeline, in the chain. And we implement two simple instructions right now, which are

10:13.280 --> 10:21.680
modify and delete. Modify either adds or modifies an existing record, and delete removes

10:21.680 --> 10:30.240
a record, or removes the server. And the Servfail Sync client, as FanFa has already explained,

10:30.240 --> 10:36.320
modifies both the static configuration, so it generates a config file, and the dynamic part,

10:36.320 --> 10:44.960
so it calls a PowerDNS command saying, hey, add this autoprimary. And generally,

10:46.240 --> 10:53.360
yeah, we have deployed with doas on most systems, which is to say, it is running as a low

10:53.360 --> 11:03.280
privilege user, and we gave that user specific commands that they can run. Some of this can

11:03.280 --> 11:13.520
probably be changed from root to pdns, I don't think we need to run pdns, but this is

11:13.520 --> 11:20.480
roughly what we run, and this nicely lowers the amount of surface area that we need to

11:21.280 --> 11:30.160
worry about. So in case of a security breach, potentially there will be a little bit less

11:30.240 --> 11:37.680
carnage, if there is anything running on the server except for PowerDNS. Thanks to the polling

11:37.680 --> 11:45.120
architecture, where we have clients connecting to the central server, we can ask the clients to report

11:45.680 --> 11:55.200
specific data back, which is to say, we report the script version, we report the PowerDNS

11:55.280 --> 12:02.960
version, and we have some small error reporting, so in case something goes wrong, we may be able

12:02.960 --> 12:11.120
to
know just from this control plane. Separately from this we have Grafana, and we are also thinking about

12:11.120 --> 12:16.640
introducing Grafana support, or rather Prometheus support, to Servfail Sync.

12:16.640 --> 12:24.800
This is something for the future. Other than that, we are thinking about improving the admin

12:25.520 --> 12:31.440
UI, because currently it's just a web page, and it is the smallest web page I could make,

12:31.440 --> 12:38.480
which is to say HTML forms, and very, very little error handling or anything else. It just fails

12:38.480 --> 12:44.800
outright, because it's not supposed to be used by a user, it's supposed to be used by us. And yeah,

12:44.800 --> 12:53.280
so we deployed it, we had our first successful server migration, everything seemed to go all right,

12:53.280 --> 13:00.240
until like 30 minutes later, someone modifies their zone, and suddenly the AXFR is not accepted,

13:00.240 --> 13:07.600
like all of the servers are rejecting AXFRs for zones that were mastered on the

13:07.600 --> 13:14.000
server that we migrated. What gives? So it turns out that PowerDNS, apart from the

13:15.040 --> 13:21.040
static and dynamic configuration that we have discussed previously, also has

13:21.360 --> 13:30.160
a field in the database that specifies a list of IPs of the master of a certain zone. This

13:30.160 --> 13:38.400
is populated on the first notify, and we were unaware of this field until that very migration.

13:40.080 --> 13:47.120
So we needed to hot patch, and the fastest way to hot patch was: we don't want to know

13:47.120 --> 13:52.320
how to solve this properly, we want to solve this right now. So we just logged in to all servers

13:52.320 --> 14:00.880
manually, the thing that we were trying to avoid having to do, and we manually updated the database

14:00.880 --> 14:08.320
to fix this. We restarted the servers, everything started propagating all right. So that was

14:08.320 --> 14:18.880
maybe 30, maybe 40 or 45 minutes of downtime, from the beginning, through us discovering

14:18.880 --> 14:25.680
something is wrong, up to the resolution. That was really, really fast, regardless. The proper

14:25.680 --> 14:32.960
fix is apparently to use a dedicated PowerDNS command for this, but this shows us another

14:33.280 --> 14:40.640
problem: Servfail Sync clients don't really have any access to the list of domains,

14:40.640 --> 14:49.280
because they are fully separate from all our other infra, they are integrated only with each other,

14:49.280 --> 14:56.960
and we wanted to keep them separate to maybe allow others to deploy them without having to

14:57.280 --> 15:08.160
deploy all of our stack. And that's a problem that we had to solve. We were trying to solve it through,

15:09.280 --> 15:18.800
maybe pdnsutil can also show us the list of primaries? Well, it can, but I wouldn't want to

15:18.800 --> 15:26.320
expect this output to never change, so my assumption is that this is not a stable API, so to speak.

15:27.680 --> 15:35.040
And I went looking for further solutions. Unfortunately the only solution I found was to

15:35.040 --> 15:42.480
parse the config file, extract the API key for the HTTP API, and try to curl localhost.

15:43.600 --> 15:50.160
This is not yet integrated into our code base, as can be seen by the very nice "mockup code, do not run".

15:51.120 --> 15:57.920
This is because I didn't have any servers to test it on in our test environment, but yeah,

15:57.920 --> 16:07.920
generally it should work, hopefully. And that was one of our wildest, like the whole engineering

16:07.920 --> 16:15.360
disaster, as I would want to call it, was one of the wildest things that we didn't expect

16:15.360 --> 16:19.440
in a production scenario. For everything else we were able to prepare quite nicely,

16:19.440 --> 16:24.720
and here there are just some things where, even if they are documented, there is so much
documentation

16:24.720 --> 16:28.720
that sifting through all of it, when you don't know that you will have a problem,

16:29.440 --> 16:37.440
is just not feasible, so we missed this. In summary, Bash is a hammer, and we made

16:37.440 --> 16:43.520
nails out of pdnsutil, and I hope you liked the presentation. If anyone has any questions,

16:43.600 --> 16:47.600
I would love to hear them.

16:54.320 --> 16:56.960
We have a lot of time for questions, so.

16:58.960 --> 17:06.720
Usually it's recommended to do XFRs over a secure connection, because DNS is funny,

17:06.720 --> 17:09.840
it tends to do things at my next level that you do, saying the truth, and so on.

17:10.800 --> 17:16.480
My understanding is that right now it's just going directly through my idea?

17:16.480 --> 17:26.720
Currently, yes. The question is: the general recommendation is to proxy XFRs through some

17:26.720 --> 17:34.320
secure connection, so how are we doing this? We are currently just answering XFRs over the web

17:35.200 --> 17:43.200
without any encryption. The thing is, we are planning to move to WireGuard, but this is

17:44.080 --> 17:49.280
an undertaking when you want to make this automatic, so we essentially will need to extend

17:49.280 --> 17:56.000
Servfail Sync to also provision WireGuard, which hopefully is something for the near future.

17:56.480 --> 18:09.040
For right now, we have not seen trouble with this. The XFRs are propagating and, as far as I'm concerned,

18:10.080 --> 18:18.080
they are signed, right? I don't think that they could be tampered with.

18:18.480 --> 18:27.600
Plus, for users who are worried about tampering, we do offer DNSSEC, and the way we

18:27.600 --> 18:36.560
offer DNSSEC is we pre-sign the zone on the master, and all the secondary servers have

18:38.880 --> 18:45.360
the raw NSEC3 records and everything else.
Which is to say that if you modify any

18:45.360 --> 18:49.600
part of the zone, it will be noticed by the DNSSEC implementation on the other side.

18:51.440 --> 18:57.440
Hope that answers it. Anyone else?
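
The append-only timeline described in the talk (events are only ever added, clients request changes from a specific height, and the two instructions are modify and delete) could be sketched roughly as follows. This is an illustration in Python and not the project's Bash code; the names `Event`, `Timeline`, `events_since`, and `apply_events`, the record fields, and the example host names and addresses are all assumptions made for the sketch.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    """One immutable timeline entry (field names are hypothetical)."""
    op: str            # "modify" or "delete"
    server: str        # name of the name server the event is about
    address: str = ""  # IP address; unused for "delete"


@dataclass
class Timeline:
    """Append-only list of events; the height is the number of events."""
    events: list = field(default_factory=list)

    def append(self, event: Event) -> int:
        if event.op not in ("modify", "delete"):
            raise ValueError(f"unknown instruction: {event.op}")
        self.events.append(event)  # events are never removed or rewritten
        return len(self.events)    # the new height

    def events_since(self, height: int) -> list:
        """Everything a client that is at `height` has not seen yet."""
        return self.events[height:]


def apply_events(state: dict, events: list) -> dict:
    """Replay events onto a client's view of the network (server -> address)."""
    new_state = dict(state)
    for event in events:
        if event.op == "modify":
            new_state[event.server] = event.address  # add or overwrite
        else:  # "delete"
            new_state.pop(event.server, None)
    return new_state


timeline = Timeline()
timeline.append(Event("modify", "ns1.example", "192.0.2.1"))
timeline.append(Event("modify", "ns2.example", "192.0.2.2"))
timeline.append(Event("modify", "ns1.example", "192.0.2.10"))  # IP migration
timeline.append(Event("delete", "ns2.example"))                # server leaves

# A client that already applied the first two events polls from height 2.
client_state = {"ns1.example": "192.0.2.1", "ns2.example": "192.0.2.2"}
client_state = apply_events(client_state, timeline.events_since(2))
print(client_state)
```

A client that starts from height 0 and replays the whole timeline ends up in the same state as one that polls incrementally, which is the property that lets each server catch up from whatever point it last saw.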