Hi everybody. Again, thank you for coming to my talk. I'm going to be talking about my experiences trying to make asynchronous I/O work on all of the BSDs and Solaris, illumos, and actually ten operating systems in total. Most of it didn't work, but some of it gave me a lot of ideas about work that could be done, and I've done some experimentation. I want to talk about that and see if I can make contact with other people who might be interested in the same sort of problems.

I actually do most of my Postgres development on a FreeBSD box, and I've made a few small patches to FreeBSD, and I was punished with a commit bit for that, but I'm not a particularly prolific contributor to FreeBSD. I would like to do more work, and some of this is experimentation in that direction.

Okay, so I wonder if I can page down; sorry, we're going to have to scroll. The talk is cut up into five sections. The first three have slides and the last two are just sort of discussion.

So, what do databases want from the I/O stack? We can watch the giant wall of text scroll past.

Firstly, if you go back to the origins of Postgres: it was started in 1986 at Berkeley, and that's the same time and place that BSD was happening. So a lot of the early work on Postgres, and other related projects, was done on BSD systems and various spin-off systems, Sun machines and other machines in that sort of general family. The people who worked on it wrote a lot of papers about how terrible the whole file system interface of Unix was, because Unix didn't allow them to have direct access to storage devices, to disks in particular. That goes back to a decision made at the very beginning of Unix, when they were trying to fit it into eight kilobytes of kernel memory or whatever it was, back in the day, on a PDP-7 or whatever it was. There was a lot more control in Multics, and as the AT&T guys famously stripped the system down to the absolute minimum, they threw out a lot of stuff that other contemporary operating systems could do, and it's taken a long time for the Unix world to kind of catch up with that, which I think is an interesting bit of history.

So I'm going to go through four things that all databases want from a disk subsystem. You can see this in Oracle and DB2 and SQL Server and MySQL, everything; Postgres is fairly late coming to this game of trying to do direct I/O and the other three things that make up this group of four things we want from disks. The first thing is direct I/O, which means not using kernel buffers for data.
On the left you can see Postgres reading a bunch of blocks from disk and pulling them all the way up, just by calling the read system call, or pread, or one of those kinds of system calls: pulling the data through the kernel's buffer pool and into Postgres's own buffer pool. It's fairly unusual for user-space programs to have a buffer pool, and that's kind of where a lot of these special requirements come from; almost all applications don't have these requirements. We have a lot of opinions about when I/O should be happening, the rate of I/O, and all kinds of stuff like that, which most applications just leave to the kernel, and we don't really want to have a secondary buffer pool that we're fighting with for resources, for RAM. We also don't really want the CPU involved in every read or write; we don't want to be copying stuff in and out of the kernel page cache, or buffer pool as you might call it. I think I'm using Linux terminology when I say page cache, maybe.

And so on the right, unfortunately I have to, sorry, scroll. Using direct I/O means that you basically turn all of that off, and you hope, depending on the file system and drivers and so on, that all of your I/O becomes DMA transfers, moving data directly from storage devices into or out of user-space-accessible memory that's mapped into the process, in this case the Postgres buffer pool. But the interesting thing about direct I/O is that it's both an optimization and a pessimization. It's an optimization in the sense that the CPU is not involved: the kernel just builds an NVMe or SCSI read or write command and sends it to the device, which means pretty much putting it into another queue somewhere and telling the device to do things, down through the stack of the CAM system and the device driver and so on. But unfortunately, if you use synchronous calls to do direct I/O, you then have to wait the longest time possible. There's no cache helping you, so every single read or write call is going to pay the full latency of the storage. So direct I/O gets a whole lot of CPU cycles out of the way, but you then have to stall. That's bad, and we'll address that with the slide after this one.
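To make that concrete, here is a minimal sketch of a single direct I/O read using a POSIX-style interface, assuming the O_DIRECT open flag as found on FreeBSD and Linux; the file name, block size and alignment below are invented for illustration, and real code would need to respect the device's actual alignment rules.

    /* Minimal direct I/O sketch: read one 8 kB block with O_DIRECT.
     * Hypothetical file name and sizes; error handling kept short. */
    #define _GNU_SOURCE             /* for O_DIRECT on Linux; not needed on FreeBSD */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define BLOCK_SIZE 8192

    int
    main(void)
    {
        int fd = open("base/12345/16384", O_RDONLY | O_DIRECT);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        /* Direct I/O generally requires the buffer, offset and length to be
         * aligned, typically to the device's logical block size. */
        void *buf;
        if (posix_memalign(&buf, 4096, BLOCK_SIZE) != 0)
            return 1;

        /* The data is DMA'd straight into our buffer, bypassing the kernel's
         * page cache -- and we stall here for the full device latency. */
        ssize_t n = pread(fd, buf, BLOCK_SIZE, 0);
        if (n < 0)
            perror("pread");
        else
            printf("read %zd bytes directly\n", n);

        free(buf);
        close(fd);
        return 0;
    }

With a synchronous pread like this, the stall at the end is exactly the pessimization just described; the later parts of the talk are about removing it.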
The second of the four things I wanted to talk about is vectored I/O, or scatter/gather I/O. Not that many user-space programs care about scatter/gather I/O, because normally if you want to read a large amount of data into memory you arrange to have a buffer that's the right size and you just do your read or write. But databases, again because they have buffer pools, are really opinionated about where memory should go. They have a bunch of buffer replacement algorithm problems, the same problems that exist in the kernel's buffer pool, and the buffers that they find, or are able to allocate, for the data they're trying to read in might not be contiguous. And yet we still want to generate the minimum number of I/Os. If you're using a cloud provider you might be paying for a certain number of I/Os, or if you're using physical hardware there's a certain number of I/Os it can do. It would be ridiculous to generate more I/Os and take longer just because the buffers you found to put the data into aren't contiguous, when you're reading a large contiguous chunk of disk. So we want to generate scatter/gather I/O commands at the driver level and have them executed as a single DMA transfer straight into your non-contiguous buffers in memory. All the drivers can do that; it's part of the SCSI and NVMe protocols and so on. It's just something we want to be able to express from user space without it getting lost somewhere in the I/O stack and converted into a for loop that's actually doing multiple I/Os. So that's the second of the four things that databases want from a disk storage stack.
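As a sketch of what that looks like from user space: the preadv and pwritev calls take an array of iovec structs, so one contiguous range of a file can be read into several non-contiguous buffers with a single call. The file name and sizes below are invented; the three buffers stand in for pages found in a buffer pool.

    /* Scatter/gather read sketch: one preadv() call fills three
     * non-contiguous 8 kB buffers from one contiguous file range. */
    #include <sys/uio.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define BLOCK_SIZE 8192

    int
    main(void)
    {
        int fd = open("base/12345/16384", O_RDONLY);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        /* Three buffers that happen not to be adjacent in memory,
         * standing in for free pages found in a buffer pool. */
        struct iovec iov[3];
        for (int i = 0; i < 3; i++) {
            iov[i].iov_base = malloc(BLOCK_SIZE);
            iov[i].iov_len = BLOCK_SIZE;
        }

        /* One system call, and ideally one DMA transfer, for 24 kB
         * starting at file offset 0. */
        ssize_t n = preadv(fd, iov, 3, 0);
        if (n < 0)
            perror("preadv");
        else
            printf("read %zd bytes into 3 separate buffers\n", n);

        for (int i = 0; i < 3; i++)
            free(iov[i].iov_base);
        close(fd);
        return 0;
    }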
The third thing is asynchronous I/O, and we'll talk in a moment about what that looks like, but this is about separating submission, kind of starting I/Os, from completion, which means finding out the result: is it an error? Possibly waiting for it to happen if it hasn't happened yet, but if your program is sufficiently clever and it does things ahead of time, then hopefully you get to consume the completions later without going off CPU. So the goal of asynchronous I/O here is really to be able to use direct I/O effectively, to hide the latency problem and solve that pessimization problem. I think it's really interesting how many other operating systems had this, even including MS-DOS; that was one of my early learning experiences. Even MS-DOS had asynchronous I/O as a first-class thing that you could use in an application, so I remember games that managed to get a couple of floppy drives going at once without difficulty, because they weren't blocking in the traditional way. I have to speed this up because of the debacle with my Mac video not working.

The fourth thing that I've grouped into this set of things that are a little bit special about database disk I/O is that we want to make sure that all writes can be done concurrently. If we're writing data out of our buffer pool to disk and it consists of many separate chunks, we want to control the amount of concurrency, and we don't want anything like an inode-level lock to cause those writes to be serialized, which is something that happens in a lot of operating systems when you're using direct I/O. That becomes terrible, because serializing a bunch of actual hardware transfers, each waiting for completion, will be very bad. So that's writing from the buffer pool; the other kind of disk I/O that databases all do is writing logs, and that's, I think, a fairly complicated sort of thing, but to get the maximum throughput and the lowest latency there's some kind of heuristic where you want to be able to start writing a chunk of transaction log, and then more transactions happen, and before the previous log write has finished you start more, and you want those to be in flight at the same time as well. That's an area where a lot of file systems have funny locking that doesn't allow that to happen, for various reasons. So that's the fourth of the four things that I think are a little bit unusual about database disk I/O.

There are some more things, and I've done some experimentation in this area, but since time's running short I'm actually going to skip the stuff on the left there and just mention the stuff on the right: being able to register buffers and register file descriptors so that they don't have to be pinned or looked up each time you do a read from a file. If you've opened a file and you're going to be doing a whole lot of reads from it, each time you call the read system call, inside the kernel the file descriptor has to be looked up, and there are reference count adjustments and things like that. There are some small improvements like that that can be done, and you can see that happening on Windows or Linux in very high-performance programs, where they're really getting down to removing unnecessary CPU work. So I'd like to be able to find some programming interfaces that would allow all of those tricks, and these other tricks over here that I haven't got time to talk about.

The reason I'm interested in all this is because in Postgres 18, which just came out, we've finally released something that we've been working on for a few years, which is an asynchronous I/O subsystem; we're in the process of converting all of Postgres's disk I/O to it. Eventually we'll do network I/O as well, we have prototypes, but I'm not talking about network here; it's pretty similar in many ways, but I'm just going to focus on disks.
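To illustrate the separation of submission from completion, and several writes in flight at once, here is a minimal POSIX AIO sketch (POSIX AIO comes up again later in the talk). The file name and sizes are invented, and real code would do proper error handling and keep the buffers stable until completion.

    /* POSIX AIO sketch: submit several writes, do other work, then collect
     * the completions.  Invented file name and sizes; minimal checks. */
    #include <aio.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define NWRITES 4
    #define BLOCK_SIZE 8192

    int
    main(void)
    {
        int fd = open("pg_wal/000000010000000000000001", O_WRONLY);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        static char buffers[NWRITES][BLOCK_SIZE];
        struct aiocb cbs[NWRITES];
        const struct aiocb *list[NWRITES];

        /* Submission: start four writes at different offsets.  They can all
         * be in flight at once; nothing blocks here in the happy case. */
        for (int i = 0; i < NWRITES; i++) {
            memset(&cbs[i], 0, sizeof(cbs[i]));
            cbs[i].aio_fildes = fd;
            cbs[i].aio_buf = buffers[i];
            cbs[i].aio_nbytes = BLOCK_SIZE;
            cbs[i].aio_offset = (off_t) i * BLOCK_SIZE;
            if (aio_write(&cbs[i]) != 0)
                perror("aio_write");
            list[i] = &cbs[i];
        }

        /* ... the program can do useful work here while the device works ... */

        /* Completion: wait until each write has finished and fetch its result. */
        for (int i = 0; i < NWRITES; i++) {
            while (aio_error(&cbs[i]) == EINPROGRESS)
                aio_suspend(list, NWRITES, NULL);
            printf("write %d completed: %zd bytes\n", i, aio_return(&cbs[i]));
        }

        close(fd);
        return 0;
    }

One detail worth knowing: glibc's POSIX AIO is implemented with user-space helper threads, whereas FreeBSD has an in-kernel implementation, which is part of why the experiments described later target FreeBSD.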
So the idea is that in the past, going all the way back to 1986, we did all our I/O by simply calling read and write, or pread and pwrite, and in the last couple of years we've released, in a series of steps, support for doing vectored I/O. And in 18, which just came out recently, we've got this thing where you can set a new setting, io_method. On Linux there's an option, io_uring, which uses this new kind of universal system call interface that they have in Linux that allows you to start all kinds of operations; the interesting ones here are just reads and writes. And that's done without, well, in the best of cases, it depends on the file system and all kinds of other things, if there's compression involved this might not be true, but in simple cases, when you start an I/O it doesn't use any kind of kernel thread to do it, and it doesn't use all the traditional code paths that would implement a synchronous read or write. It just converts the logical blocks into physical blocks, does bureaucracy like that, and then it pretty much just starts the I/O, pushing it down through the driver, and returns control. Then later you can wait for the completion events, and all of that happens without any extra scheduling overheads due to in-kernel worker pools or anything like that; there's nothing like that in the best of cases, which is what we want to happen. We also have io_method = worker, and this is what's used at the moment on FreeBSD, or any of the other BSDs, or illumos. That's just a simple portable system that we need as a baseline, and it has a pool of worker processes that do the I/Os for you, so they end up spending their time just consuming from a queue and sitting in read and write calls, doing them for you, so you can get on with doing other stuff.
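For comparison, this is roughly what submitting and reaping a single read looks like through io_uring on Linux, using the liburing helper library; a minimal sketch with an invented file name, not how Postgres's io_method = io_uring support is actually structured.

    /* io_uring sketch with liburing: submit one read, then reap its completion.
     * Build with -luring on Linux; file name and sizes are invented. */
    #include <liburing.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define BLOCK_SIZE 8192

    int
    main(void)
    {
        struct io_uring ring;
        if (io_uring_queue_init(64, &ring, 0) < 0)
            return 1;

        int fd = open("base/12345/16384", O_RDONLY);
        void *buf = malloc(BLOCK_SIZE);

        /* Submission: fill in a submission queue entry and hand it to the
         * kernel.  This returns as soon as the I/O has been started. */
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, fd, buf, BLOCK_SIZE, 0);
        io_uring_submit(&ring);

        /* ... do other useful work while the read is in flight ... */

        /* Completion: wait for the completion queue entry and check the result. */
        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);
        printf("read completed: %d bytes\n", cqe->res);
        io_uring_cqe_seen(&ring, cqe);

        io_uring_queue_exit(&ring);
        free(buf);
        close(fd);
        return 0;
    }

The important property is that submission and completion are decoupled: io_uring_submit returns once the request has been handed over, and completions can be collected later, batched with others.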
I've got an experimental POSIX AIO patch which works only on FreeBSD, and I'll talk about why that is. I tried to get it working on a whole bunch of other systems; you would think from the name that it would be portable, but POSIX AIO is, I think, essentially a failure. I've got a few theories for why it failed. The basic reason, I think, is that not that many applications, or even programming languages, were ready to deal with asynchronous I/O, and not that many application spaces needed that kind of stuff. The reason I grouped together those four things at the beginning of the talk was to kind of explain that: since common applications just don't really have a need for vectored I/O, and don't really have a need for direct I/O, which is a pessimization unless you also have a whole bunch of other infrastructure to drive enough concurrency, asynchronous I/O also just wasn't that interesting to typical applications. There were a few big database companies, and in the '90s they figured out how to use POSIX AIO on the big Unix systems, but if you look closely they didn't just use POSIX AIO: they also secretly negotiated with all the commercial Unix vendors and had them add all kinds of secret interfaces so they could make it work properly. I think we need to find the right set of interfaces and do them in a nice way that would ideally work the same across the BSDs and illumos; they have so much code in common that it would be a shame if they ended up incompatible. I also experimented with some other operating systems that aren't relevant to this talk.

So yeah, worker mode is what you can see here: if you use it on any of these operating systems, you're going to see these I/O worker processes running. In version 18 you have to say how many there are; in 19 we're going to try to make it scale automatically. That's where all the reads and writes are going to be happening on behalf of the normal query execution processes, which will hopefully be threads eventually as well, but that's a separate topic; at the moment these are processes.

Sorry about this, I managed to copy in some of the wrong slides when I was making the PDF. Okay, I think my time is nearly up, so I'm going to skip a lot of slides and come straight to the question topics. I'm trying to make contact with people who might be able to help make decisions in this area. There are kind of two levels to doing really good asynchronous I/O on all of these systems, two levels that need some significant hacking, or at least decisions. There's the user-space level, where I think the main choices are either to actually implement io_uring, or a subset of io_uring as it's been defined on Linux, either exactly the same for compatibility or something generally similar where a thin layer could work the same way, or to look at the work that was done on kqueue. kqueue supports consuming completion events. The funny thing about that is that of the five operating systems that took FreeBSD's kqueue system, and kqueue's great, I love kqueue, all of the other systems, like macOS and NetBSD and OpenBSD, took kqueue but they deleted the completion event support, so they can only really do readiness for sockets and so on. All of that works fine and is very widely used across a lot of software, but it means that while it looks like you could write something that would use kqueue for completion events, it wouldn't be portable to those other systems. I was a little bit disappointed by that.
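For reference, this is roughly how that FreeBSD combination can be used today: POSIX AIO submission with the completion delivered through kqueue as an EVFILT_AIO event. It's a minimal, FreeBSD-specific sketch based on the aio(4) and kqueue(2) interfaces, with an invented file name, and its non-portability is exactly the problem just described.

    /* FreeBSD sketch: a POSIX AIO read whose completion arrives as an
     * EVFILT_AIO kevent on a kqueue.  Invented file name; minimal checks. */
    #include <sys/types.h>
    #include <sys/event.h>
    #include <sys/time.h>
    #include <aio.h>
    #include <fcntl.h>
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define BLOCK_SIZE 8192

    int
    main(void)
    {
        int kq = kqueue();
        int fd = open("base/12345/16384", O_RDONLY);
        static char buf[BLOCK_SIZE];

        struct aiocb cb;
        memset(&cb, 0, sizeof(cb));
        cb.aio_fildes = fd;
        cb.aio_buf = buf;
        cb.aio_nbytes = BLOCK_SIZE;
        cb.aio_offset = 0;

        /* Deliver the completion to our kqueue instead of a signal. */
        cb.aio_sigevent.sigev_notify = SIGEV_KEVENT;
        cb.aio_sigevent.sigev_notify_kqueue = kq;

        if (aio_read(&cb) != 0)
            perror("aio_read");

        /* ... other work could happen here; then reap the completion ... */
        struct kevent ev;
        if (kevent(kq, NULL, 0, &ev, 1, NULL) == 1 && ev.filter == EVFILT_AIO) {
            struct aiocb *done = (struct aiocb *) ev.ident;
            printf("aio read completed: %zd bytes\n", aio_return(done));
        }

        close(fd);
        close(kq);
        return 0;
    }

On the other BSDs and macOS this program won't build, because SIGEV_KEVENT and EVFILT_AIO aren't there.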
And then, yeah, the choice of how we should respond to io_uring: to me it seems like either we implement it, or we figure out how to make kqueue able to do all the things you need to do that stuff efficiently. And I'll finish up with the final thing. That's the user-space-level question of how it should be done; we could also do both of those things. It might be that you make kqueue a little bit better, but also provide an interface that lets software that uses io_uring's interfaces be ported over from Linux. I'm not sure about that, but that's not really the hard part.

I'm just going to super quickly do the last thing... oh, sorry; we have two minutes before the next talk? Yes, sure, so I'll just close up then, we are running very, very late. Okay, so I'll just say goodbye then. The final thing is that to actually drive asynchronous I/O all the way through to the driver, we need to change the VFS layer, and that's going to be kind of tricky. The NetBSD guys seem to have a project plan for that. I would like to figure out how we could make that work the same across the extremely similar code in FreeBSD, and presumably OpenBSD as well. If anyone knows anything about this project, I'd be interested in talking to them about it. And that's it, yeah.