Hi everybody. Again, thank you for coming to my talk. I'm going to be talking about my experiences trying to make asynchronous I/O work on all of the BSDs and Solaris, illumos, and actually ten operating systems in total. Most of it didn't work, but some of it gave me a lot of ideas about work that could be done, and I've done some experimentation. I want to talk about that and see if I can make contact with other people who might be interested in the same sort of problems.

I actually do most of my Postgres development on a FreeBSD box, and I've made a few small patches to FreeBSD, and I was punished with a commit bit for that, but I'm not a particularly prolific contributor to FreeBSD. I would like to do more work, and some of this is experimentation in that direction.

Okay, so I wonder if I can page down; sorry, we're going to have to scroll. The talk is cut up into five sections. The first three have slides and the last two are just sort of discussion.

So, what do databases want from the I/O stack? We can watch the giant wall of text scroll past.

Firstly, if you go back to the origins of Postgres: it was started in 1986 at Berkeley, and that's the same time and place that BSD was happening. So a lot of the early work on Postgres, and other related projects, was done on BSD systems and various spin-off systems, Sun machines and other machines in that sort of general family. The people who worked on it wrote a lot of papers about how terrible the whole file system interface of Unix was, because Unix didn't allow them to have direct access to storage devices, to disks in particular. That goes back to a decision made at the very beginning of Unix, when they were trying to fit it into eight kilobytes of kernel memory or whatever it was, back in the day, on a PDP-7 or whatever it was. There was a lot more control in Multics, and as the AT&T guys famously stripped the system down to the absolute minimum, they threw out a lot of stuff that other contemporary operating systems could do, and it's taken a long time for the Unix world to kind of catch up with that, which I think is an interesting bit of history.

So I'm going to go through four things that all databases want from a disk subsystem. You can see this in Oracle and DB2 and SQL Server and MySQL, everything; Postgres is fairly late coming to this game of trying to do direct I/O and the other three things that make up this group of four things we want from disks. The first thing is direct I/O, which means not using kernel buffers for data.
On the left you can see Postgres reading a bunch of blocks from disk and pulling them all the way up, just by calling the read system call, or pread, or one of those kinds of system calls: pulling the data through the kernel's buffer pool and into Postgres's own buffer pool. It's fairly unusual for user-space programs to have a buffer pool, and that's kind of where a lot of these special requirements come from; almost all applications don't have these requirements. We have a lot of opinions about when I/O should be happening, the rate of I/O, and all kinds of stuff like that, which most applications just leave to the kernel, and we don't really want to have a secondary buffer pool that we're fighting with for resources, for RAM. We also don't really want the CPU involved in every read or write; we don't want to be copying stuff in and out of the kernel page cache, or buffer pool as you might call it. I think I'm using Linux terminology when I say page cache, maybe.

And so on the right, unfortunately I have to, sorry, scroll. Using direct I/O means that you basically turn all of that off, and you hope, depending on the file system and drivers and so on, that all of your I/O becomes DMA transfers, moving data directly from storage devices into or out of user-space-accessible memory that's mapped into the process, in this case the Postgres buffer pool. But the interesting thing about direct I/O is that it's both an optimization and a pessimization. It's an optimization in the sense that the CPU is not involved: the kernel just builds an NVMe or SCSI read or write command and sends it to the device, which means pretty much putting it into another queue somewhere and telling the device to do things, down through the stack of the CAM system and the device driver and so on. But unfortunately, if you use synchronous calls to do direct I/O, you then have to wait the longest time possible. There's no cache helping you, so every single read or write call is going to pay the full latency of the storage. So direct I/O gets a whole lot of CPU cycles out of the way, but you then have to stall. That's bad, and we'll address that with the slide after this one.
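To make that concrete, here is a minimal sketch of a single direct I/O read using a POSIX-style interface, assuming the O_DIRECT open flag as found on FreeBSD and Linux; the file name, block size and alignment below are invented for illustration, and real code would need to respect the device's actual alignment rules.

    /* Minimal direct I/O sketch: read one 8 kB block with O_DIRECT.
     * Hypothetical file name and sizes; error handling kept short. */
    #define _GNU_SOURCE             /* for O_DIRECT on Linux; not needed on FreeBSD */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define BLOCK_SIZE 8192

    int
    main(void)
    {
        int fd = open("base/12345/16384", O_RDONLY | O_DIRECT);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        /* Direct I/O generally requires the buffer, offset and length to be
         * aligned, typically to the device's logical block size. */
        void *buf;
        if (posix_memalign(&buf, 4096, BLOCK_SIZE) != 0)
            return 1;

        /* The data is DMA'd straight into our buffer, bypassing the kernel's
         * page cache -- and we stall here for the full device latency. */
        ssize_t n = pread(fd, buf, BLOCK_SIZE, 0);
        if (n < 0)
            perror("pread");
        else
            printf("read %zd bytes directly\n", n);

        free(buf);
        close(fd);
        return 0;
    }

With a synchronous pread like this, the stall at the end is exactly the pessimization just described; the later parts of the talk are about removing it.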
The second of the four things I wanted to talk about is vectored I/O, or scatter/gather I/O. Not that many user-space programs care about scatter/gather I/O, because normally if you want to read a large amount of data into memory you arrange to have a buffer that's the right size and you just do your read or write. But databases, again because they have buffer pools, are really opinionated about where memory should go. They have a bunch of buffer replacement algorithm problems, the same problems that exist in the kernel's buffer pool, and the buffers that they find, or are able to allocate, for the data they're trying to read in might not be contiguous. And yet we still want to generate the minimum number of I/Os. If you're using a cloud provider you might be paying for a certain number of I/Os, or if you're using physical hardware there's a certain number of I/Os it can do. It would be ridiculous to generate more I/Os and take longer just because the buffers you found to put the data into aren't contiguous, when you're reading a large contiguous chunk of disk. So we want to generate scatter/gather I/O commands at the driver level and have them executed as a single DMA transfer straight into your non-contiguous buffers in memory. All the drivers can do that; it's part of the SCSI and NVMe protocols and so on. It's just something we want to be able to express from user space without it getting lost somewhere in the I/O stack and converted into a for loop that's actually doing multiple I/Os. So that's the second of the four things that databases want from a disk storage stack.
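As a sketch of what that looks like from user space: the preadv and pwritev calls take an array of iovec structs, so one contiguous range of a file can be read into several non-contiguous buffers with a single call. The file name and sizes below are invented; the three buffers stand in for pages found in a buffer pool.

    /* Scatter/gather read sketch: one preadv() call fills three
     * non-contiguous 8 kB buffers from one contiguous file range. */
    #include <sys/uio.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define BLOCK_SIZE 8192

    int
    main(void)
    {
        int fd = open("base/12345/16384", O_RDONLY);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        /* Three buffers that happen not to be adjacent in memory,
         * standing in for free pages found in a buffer pool. */
        struct iovec iov[3];
        for (int i = 0; i < 3; i++) {
            iov[i].iov_base = malloc(BLOCK_SIZE);
            iov[i].iov_len = BLOCK_SIZE;
        }

        /* One system call, and ideally one DMA transfer, for 24 kB
         * starting at file offset 0. */
        ssize_t n = preadv(fd, iov, 3, 0);
        if (n < 0)
            perror("preadv");
        else
            printf("read %zd bytes into 3 separate buffers\n", n);

        for (int i = 0; i < 3; i++)
            free(iov[i].iov_base);
        close(fd);
        return 0;
    }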
The third thing is asynchronous I/O, and we'll talk in a moment about what that looks like, but this is about separating submission, kind of starting I/Os, from completion, which means finding out the result: is it an error? Possibly waiting for it to happen if it hasn't happened yet, but if your program is sufficiently clever and it does things ahead of time, then hopefully you get to consume the completions later without going off CPU. So the goal of asynchronous I/O here is really to be able to use direct I/O effectively, to hide the latency problem and solve that pessimization problem. I think it's really interesting how many other operating systems had this, even including MS-DOS; that was one of my early learning experiences. Even MS-DOS had asynchronous I/O as a first-class thing that you could use in an application, so I remember games that managed to get a couple of floppy drives going at once without difficulty, because they weren't blocking in the traditional way. I have to speed this up because of the debacle with my Mac video not working.

The fourth thing that I've grouped into this set of things that are a little bit special about database disk I/O is that we want to make sure that all writes can be done concurrently. If we're writing data out of our buffer pool to disk and it consists of many separate chunks, we want to control the amount of concurrency, and we don't want anything like an inode-level lock to cause those writes to be serialized, which is something that happens in a lot of operating systems when you're using direct I/O. That becomes terrible, because serializing a bunch of actual hardware transfers, each waiting for completion, will be very bad. So that's writing from the buffer pool; the other kind of disk I/O that databases all do is writing logs, and that's, I think, a fairly complicated sort of thing, but to get the maximum throughput and the lowest latency there's some kind of heuristic where you want to be able to start writing a chunk of transaction log, and then more transactions happen, and before the previous log write has finished you start more, and you want those to be in flight at the same time as well. That's an area where a lot of file systems have funny locking that doesn't allow that to happen, for various reasons. So that's the fourth of the four things that I think are a little bit unusual about database disk I/O.

There are some more things, and I've done some experimentation in this area, but since time's running short I'm actually going to skip the stuff on the left there and just mention the stuff on the right: being able to register buffers and register file descriptors so that they don't have to be pinned or looked up each time you do a read from a file. If you've opened a file and you're going to be doing a whole lot of reads from it, each time you call the read system call, inside the kernel the file descriptor has to be looked up, and there are reference count adjustments and things like that. There are some small improvements like that that can be done, and you can see that happening on Windows or Linux in very high-performance programs, where they're really getting down to removing unnecessary CPU work. So I'd like to be able to find some programming interfaces that would allow all of those tricks, and these other tricks over here that I haven't got time to talk about.

The reason I'm interested in all this is because in Postgres 18, which just came out, we've finally released something that we've been working on for a few years, which is an asynchronous I/O subsystem; we're in the process of converting all of Postgres's disk I/O to it. Eventually we'll do network I/O as well, we have prototypes, but I'm not talking about network here; it's pretty similar in many ways, but I'm just going to focus on disks.
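To illustrate the separation of submission from completion, and several writes in flight at once, here is a minimal POSIX AIO sketch (POSIX AIO comes up again later in the talk). The file name and sizes are invented, and real code would do proper error handling and keep the buffers stable until completion.

    /* POSIX AIO sketch: submit several writes, do other work, then collect
     * the completions.  Invented file name and sizes; minimal checks. */
    #include <aio.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define NWRITES 4
    #define BLOCK_SIZE 8192

    int
    main(void)
    {
        int fd = open("pg_wal/000000010000000000000001", O_WRONLY);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        static char buffers[NWRITES][BLOCK_SIZE];
        struct aiocb cbs[NWRITES];
        const struct aiocb *list[NWRITES];

        /* Submission: start four writes at different offsets.  They can all
         * be in flight at once; nothing blocks here in the happy case. */
        for (int i = 0; i < NWRITES; i++) {
            memset(&cbs[i], 0, sizeof(cbs[i]));
            cbs[i].aio_fildes = fd;
            cbs[i].aio_buf = buffers[i];
            cbs[i].aio_nbytes = BLOCK_SIZE;
            cbs[i].aio_offset = (off_t) i * BLOCK_SIZE;
            if (aio_write(&cbs[i]) != 0)
                perror("aio_write");
            list[i] = &cbs[i];
        }

        /* ... the program can do useful work here while the device works ... */

        /* Completion: wait until each write has finished and fetch its result. */
        for (int i = 0; i < NWRITES; i++) {
            while (aio_error(&cbs[i]) == EINPROGRESS)
                aio_suspend(list, NWRITES, NULL);
            printf("write %d completed: %zd bytes\n", i, aio_return(&cbs[i]));
        }

        close(fd);
        return 0;
    }

One detail worth knowing: glibc's POSIX AIO is implemented with user-space helper threads, whereas FreeBSD has an in-kernel implementation, which is part of why the experiments described later target FreeBSD.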
So the idea is that in the past, going all the way back to 1986, we did all our I/O by simply calling read and write, or pread and pwrite, and in the last couple of years we've released, in a series of steps, support for doing vectored I/O. And in 18, which just came out recently, we've got this thing where you can set a new setting, io_method. On Linux there's an option, io_uring, which uses this new kind of universal system call interface that they have in Linux that allows you to start all kinds of operations; the interesting ones here are just reads and writes. And that's done without, well, in the best of cases, it depends on the file system and all kinds of other things, if there's compression involved this might not be true, but in simple cases, when you start an I/O it doesn't use any kind of kernel thread to do it, and it doesn't use all the traditional code paths that would implement a synchronous read or write. It just converts the logical blocks into physical blocks, does bureaucracy like that, and then it pretty much just starts the I/O, pushing it down through the driver, and returns control. Then later you can wait for the completion events, and all of that happens without any extra scheduling overheads due to in-kernel worker pools or anything like that; there's nothing like that in the best of cases, which is what we want to happen. We also have io_method = worker, and this is what's used at the moment on FreeBSD, or any of the other BSDs, or illumos. That's just a simple portable system that we need as a baseline, and it has a pool of worker processes that do the I/Os for you, so they end up spending their time just consuming from a queue and sitting in read and write calls, doing them for you, so you can get on with doing other stuff.
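For comparison, this is roughly what submitting and reaping a single read looks like through io_uring on Linux, using the liburing helper library; a minimal sketch with an invented file name, not how Postgres's io_method = io_uring support is actually structured.

    /* io_uring sketch with liburing: submit one read, then reap its completion.
     * Build with -luring on Linux; file name and sizes are invented. */
    #include <liburing.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define BLOCK_SIZE 8192

    int
    main(void)
    {
        struct io_uring ring;
        if (io_uring_queue_init(64, &ring, 0) < 0)
            return 1;

        int fd = open("base/12345/16384", O_RDONLY);
        void *buf = malloc(BLOCK_SIZE);

        /* Submission: fill in a submission queue entry and hand it to the
         * kernel.  This returns as soon as the I/O has been started. */
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, fd, buf, BLOCK_SIZE, 0);
        io_uring_submit(&ring);

        /* ... do other useful work while the read is in flight ... */

        /* Completion: wait for the completion queue entry and check the result. */
        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);
        printf("read completed: %d bytes\n", cqe->res);
        io_uring_cqe_seen(&ring, cqe);

        io_uring_queue_exit(&ring);
        free(buf);
        close(fd);
        return 0;
    }

The important property is that submission and completion are decoupled: io_uring_submit returns once the request has been handed over, and completions can be collected later, batched with others.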
I've got an experimental POSIX AIO patch which works only on FreeBSD, and I'll talk about why that is. I tried to get it working on a whole bunch of other systems; you would think from the name that it would be portable, but POSIX AIO is, I think, essentially a failure. I've got a few theories for why it failed. The basic reason, I think, is that not that many applications, or even programming languages, were ready to deal with asynchronous I/O, and not that many application spaces needed that kind of stuff. The reason I grouped together those four things at the beginning of the talk was to kind of explain that: since common applications just don't really have a need for vectored I/O, and don't really have a need for direct I/O, which is a pessimization unless you also have a whole bunch of other infrastructure to drive enough concurrency, asynchronous I/O also just wasn't that interesting to typical applications. There were a few big database companies, and in the '90s they figured out how to use POSIX AIO on the big Unix systems, but if you look closely they didn't just use POSIX AIO: they also secretly negotiated with all the commercial Unix vendors and had them add all kinds of secret interfaces so they could make it work properly. I think we need to find the right set of interfaces and do them in a nice way that would ideally work the same across the BSDs and illumos; they have so much code in common that it would be a shame if they ended up incompatible. I also experimented with some other operating systems that aren't relevant to this talk.

So yeah, worker mode is what you can see here: if you use it on any of these operating systems, you're going to see these I/O worker processes running. In version 18 you have to say how many there are; in 19 we're going to try to make it scale automatically. That's where all the reads and writes are going to be happening on behalf of the normal query execution processes, which will hopefully be threads eventually as well, but that's a separate topic; at the moment these are processes.

Sorry about this, I managed to copy in some of the wrong slides when I was making the PDF. Okay, I think my time is nearly up, so I'm going to skip a lot of slides and come straight to the question topics. I'm trying to make contact with people who might be able to help make decisions in this area. There are kind of two levels to doing really good asynchronous I/O on all of these systems, two levels that need some significant hacking, or at least decisions. There's the user-space level, where I think the main choices are either to actually implement io_uring, or a subset of io_uring as it's been defined on Linux, either exactly the same for compatibility or something generally similar where a thin layer could work the same way, or to look at the work that was done on kqueue. kqueue supports consuming completion events. The funny thing about that is that of the five operating systems that took FreeBSD's kqueue system, and kqueue's great, I love kqueue, all of the other systems, like macOS and NetBSD and OpenBSD, took kqueue but they deleted the completion event support, so they can only really do readiness for sockets and so on. All of that works fine and is very widely used across a lot of software, but it means that while it looks like you could write something that would use kqueue for completion events, it wouldn't be portable to those other systems. I was a little bit disappointed by that.
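For reference, this is roughly how that FreeBSD combination can be used today: POSIX AIO submission with the completion delivered through kqueue as an EVFILT_AIO event. It's a minimal, FreeBSD-specific sketch based on the aio(4) and kqueue(2) interfaces, with an invented file name, and its non-portability is exactly the problem just described.

    /* FreeBSD sketch: a POSIX AIO read whose completion arrives as an
     * EVFILT_AIO kevent on a kqueue.  Invented file name; minimal checks. */
    #include <sys/types.h>
    #include <sys/event.h>
    #include <sys/time.h>
    #include <aio.h>
    #include <fcntl.h>
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define BLOCK_SIZE 8192

    int
    main(void)
    {
        int kq = kqueue();
        int fd = open("base/12345/16384", O_RDONLY);
        static char buf[BLOCK_SIZE];

        struct aiocb cb;
        memset(&cb, 0, sizeof(cb));
        cb.aio_fildes = fd;
        cb.aio_buf = buf;
        cb.aio_nbytes = BLOCK_SIZE;
        cb.aio_offset = 0;

        /* Deliver the completion to our kqueue instead of a signal. */
        cb.aio_sigevent.sigev_notify = SIGEV_KEVENT;
        cb.aio_sigevent.sigev_notify_kqueue = kq;

        if (aio_read(&cb) != 0)
            perror("aio_read");

        /* ... other work could happen here; then reap the completion ... */
        struct kevent ev;
        if (kevent(kq, NULL, 0, &ev, 1, NULL) == 1 && ev.filter == EVFILT_AIO) {
            struct aiocb *done = (struct aiocb *) ev.ident;
            printf("aio read completed: %zd bytes\n", aio_return(done));
        }

        close(fd);
        close(kq);
        return 0;
    }

On the other BSDs and macOS this program won't build, because SIGEV_KEVENT and EVFILT_AIO aren't there.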
And then, yeah, the choice of how we should respond to io_uring: to me it seems like either we implement it, or we figure out how to make kqueue able to do all the things you need to do that stuff efficiently. And I'll finish up with the final thing. That's the user-space-level question of how it should be done; we could also do both of those things. It might be that you make kqueue a little bit better, but also provide an interface that lets software that uses io_uring's interfaces be ported over from Linux. I'm not sure about that, but that's not really the hard part.

I'm just going to super quickly do the last thing... oh, sorry; we have two minutes before the next talk? Yes, sure, so I'll just close up then, we are running very, very late. Okay, so I'll just say goodbye then. The final thing is that to actually drive asynchronous I/O all the way through to the driver, we need to change the VFS layer, and that's going to be kind of tricky. The NetBSD guys seem to have a project plan for that. I would like to figure out how we could make that work the same across the extremely similar code in FreeBSD, and presumably OpenBSD as well. If anyone knows anything about this project, I'd be interested in talking to them about it. And that's it, yeah.