Thank you, everyone. Thank you for having me today. It's a great opportunity and a great honor to be with you all. My name is Frédéric Nass. I'm a Ceph ambassador for France, I've been working on Ceph since 2014 at a French university in Lorraine, and I joined Clyso last year.

My talk today is on Ceph S3 storage with dynamic placement and optimized retention. We'll go over the different points here: we'll start with storage classes, then see how we can use Lua scripting in the RADOS Gateway for dynamic placement, and lifecycle policies for optimized retention. Then we'll look at the configuration to set all of this up, and finally go through a demo.

So, storage classes. There can be multiple storage classes per Ceph cluster, linked to different data placement schemes: replication and erasure coding. We can also use compression with some of those storage classes if we want. I also mention Intel QAT here, because that's how you can offload the compression work when you don't want your RADOS gateways to be too loaded with compression compute. This enables S3 clients to place objects based on criteria like performance, access frequency, durability, and cost. You can have multiple storage classes with multiple data placement schemes, and you won't get the same latency depending on whether a class uses erasure coding or replication. And regarding cost, you use less raw storage when objects end up on erasure-coded pools rather than replicated ones, for example.

S3 clients set the storage class upon writing an object, but they often choose incorrectly, or sometimes do not specify any storage class at all. So the idea is: what if we chose for them? When the client does not specify it, or incorrectly forces it, we could automatically assign a storage class to an object based on the object type (its name, for example), the object size, the tenant or even the user, the bucket the object is going to, and the upload method. By doing this, storage efficiency and access performance would be ensured right from the moment the data is written. That's the idea with Lua scripting.
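To make the contrast concrete, this is roughly what it looks like when an S3 client does choose a storage class itself, versus simply omitting it and leaving the decision to the gateway. This is a minimal boto3 sketch added for illustration, not something from the talk; the endpoint, credentials, bucket and class names are placeholders.

```python
import boto3

# Placeholder RGW endpoint, credentials and bucket; adjust to your own setup.
s3 = boto3.client(
    "s3",
    endpoint_url="http://rgw.example.com:8080",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

# A client *can* pick a storage class explicitly on upload...
s3.put_object(
    Bucket="backups",
    Key="catalog/index-0001",
    Body=b"small metadata object",
    StorageClass="STANDARD_IA",  # whatever class the zone defines under this name
)

# ...but many clients simply omit StorageClass, which is exactly the case
# where a server-side Lua script can choose for them.
s3.put_object(Bucket="backups", Key="data/chunk-0001", Body=b"x" * (4 * 1024 * 1024))
```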
So RADOS gateways can execute Lua scripts on the fly when a request arrives, a PUT request for instance, depending on the context: preRequest, postRequest, background, getData, putData. You can find examples and explanations for these different contexts on the Lua scripting page of the documentation. The Lua scripts can read and modify object metadata on the fly, and therefore we can dynamically set or modify an object's storage class upon writing, according to certain criteria.

Where can Lua scripting help? Here is one example: some solutions expect uniform performance regardless of the size and the number of objects they write to a single S3 bucket. A typical example is backup and recovery software, which stores data objects but also metadata objects (backup catalogs, indexes, locks, pointers) in the very same bucket, and they are all of very different sizes: very small objects next to big objects, the actual content of the backups. So the challenge is: how do we keep the performance, even when the bucket grows to millions of objects? The idea here is to differentiate those objects and automatically direct small objects to replicated pools and bigger ones to EC pools. Note that since version 12, Veeam Backup & Replication groups small metadata items into single S3 objects to address this kind of design flaw, but there are other applications that could benefit from this kind of scenario.

Lua scripts could also provide an additional protection layer, beyond bucket ACLs, to block requests based on specific criteria, for example user or tenant. This could enforce read-only or write-only bucket access on publicly exposed S3 gateways, as opposed to internal gateways that you could use to write or to read. My colleagues here, and also Marshall, have been working on a PR to eventually add this ability for Lua scripts to block requests. And there are other use cases that you can find in the upstream documentation, along with some code samples: trace requests for a specific bucket, apply default metadata when it is not specified by the client, log operations only when errors appear, or capture operation traces for analytics, for instance.

What about Lua performance and reliability? For each request we now add a script that needs to be run, so that might have some consequences. What about CPU and RAM consumption? Initial tests showed that Lua does not add much latency, on the order of tens of microseconds. Also, since Squid, the execution of a script in a context can use a bounded amount of RAM, around 128 KB, and this is configurable. And what if the script fails, on a timeout (one second by default) or on a syntax error? Well, the script failure is non-fatal for the request: clients will receive a normal response, as if the script had not been applied at all. So it won't break the upload or the activity of the client.
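Either way, the upload succeeds from the client's point of view; one simple way to check what the gateway actually decided is to read the storage class back after uploading without one. A minimal boto3 sketch, added for illustration, with a placeholder endpoint and bucket:

```python
import boto3

# Placeholder endpoint/bucket; the point is only to observe, from the client
# side, which storage class the gateway (via its Lua script) ended up assigning.
s3 = boto3.client("s3", endpoint_url="http://rgw.example.com:8080")

# Upload without specifying StorageClass: the decision is left to the gateway.
s3.put_object(Bucket="backups", Key="probe/small", Body=b"a" * (16 * 1024))
s3.put_object(Bucket="backups", Key="probe/large", Body=b"a" * (4 * 1024 * 1024))

# HEAD the objects back: the response carries the storage class they were
# stored under (S3-compatible services may omit it for plain STANDARD).
for key in ("probe/small", "probe/large"):
    head = s3.head_object(Bucket="backups", Key=key)
    print(key, head.get("StorageClass", "STANDARD"))
```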
There is also something to note here: a performance improvement has been worked on recently, in the last two weeks, which improves performance by caching the Lua bytecode and also reduces the number of network calls needed to read the RADOS object containing the Lua script.

So that was data placement optimization. Now let's talk about lifecycle policies, which is the optimized retention part. The idea is to move objects to other storage classes based on some criteria. That's the first point: transition objects between storage classes based on criteria, which can be days (how long the objects have been in the cluster) or object size greater than or less than a threshold. That's interesting because you can do both at the same time, so that if you move the size threshold, some objects will transition to the other pool and some the other way around. That's good to know, and it has been available since Squid. The idea is to optimize storage for rarely accessed data. Lifecycle policies can also delete non-current versions of objects, possibly retaining a few versions, and free up space by cleaning up incomplete multipart uploads. You know, when a multipart upload fails and the client never comes back to resume its workload, it leaves a lot of parts in the bucket that you have to take care of removing. Lifecycle rules can also use tags or prefixes to apply only to specific objects.

This is an example of a rule that you apply on a specific bucket: it will clean up multipart upload parts after 10 days, move objects to a DEEP_ARCHIVE storage class (backed by an 8+3 erasure-coded pool, as we'll see) after 30 days, and expire the objects after 365 days. I added a "be careful" here because it will also empty your bucket of all of your objects, so that may not be the best thing to do; this is just for a proof of concept.
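Expressed the way an S3 client could apply it with boto3, that rule looks roughly like this. This is an illustrative sketch, not the exact policy from the slides; the bucket and storage class names are placeholders matching the ones used in the talk.

```python
import boto3

# Placeholder endpoint and bucket; "DEEP_ARCHIVE" is the class name used in the talk.
s3 = boto3.client("s3", endpoint_url="http://rgw.example.com:8080")

s3.put_bucket_lifecycle_configuration(
    Bucket="backups",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "cleanup-stale-multipart-uploads",
                "Filter": {"Prefix": ""},
                "Status": "Enabled",
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 10},
            },
            {
                "ID": "tier-then-expire",
                # Size-based filters (ObjectSizeGreaterThan / ObjectSizeLessThan)
                # could be combined here as well.
                "Filter": {"Prefix": ""},
                "Status": "Enabled",
                "Transitions": [{"Days": 30, "StorageClass": "DEEP_ARCHIVE"}],
                # Be careful: this really does delete the objects after a year.
                "Expiration": {"Days": 365},
            },
        ]
    },
)
```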
So the configuration looks like this. We start by creating Ceph pools of different kinds: replicated, erasure-coded, and another type of erasure coding, 8+3. We also add the storage classes and a default storage class, and you can verify the configuration with the command on the right. Then we configure the data, index and EC pools in the zone placement, and this is also where we can set the compression, LZ4 or ZSTD for example, and then apply the configuration.

We also need a configuration file for the different rules that the Lua script will use: for example, choose this storage class if the object has this size, or matches this name pattern, or is going to this bucket, or the connection comes from this tenant. We can add this configuration file to the RGW service specification, then apply the service and redeploy the RADOS gateways. We can then check inside the container and actually list the configuration file. You might prefer to simply mount a static file on the RADOS gateway hosts instead, so you don't have to redeploy the service to apply a new rule: you would just log into the RADOS gateway host and modify the file. Just make sure the file is not replaced by your editor, because vim, for example, writes a temporary file and swaps it in, so the mounted link ends up broken and, inside the container, you won't see the modifications you made on the host. That's a typical thing I've been running into. So you can do the configuration through the service specification, or directly on the host by mounting the file.

Next we give our RADOS gateways the Lua script. Here is an example script that I've made; you can find it on my GitHub if you want to use it. I've been using it myself, but please make sure to double-check and re-read the script before running it.

Then we apply a lifecycle policy; I'm not going to go deep into that. By the way, if you download the slides, you will get all of these as text boxes, so you can copy and paste the commands, which makes things easier. We can check the lifecycle status: it's a background process that runs from time to time, looks over the different objects and figures out what needs to be done with them. At the beginning the bucket is in its initial state, and you can start the process manually with this command and then list again to see that the lifecycle ran.

Lua scripts log everything at debug rgw level 20, so if you want to see how your script is working, whether your rules are being matched or not, make sure to raise the debug level up to that, one of the highest levels, to see what it does. The idea then is to monitor the pool activity, create objects of different sizes, from 16 KB to 4.7 MB in this example, push them to the S3 storage and see whether our rules match or not. We can also list every object in the bucket with the command below and see which storage class was assigned to each of them.
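A minimal boto3 version of that test, added for illustration with placeholder names: push a few objects of different sizes without a storage class, then list them back together with the class each one got.

```python
import boto3

# Push objects of a few different sizes without any storage class, then list
# the bucket to see which class each one was assigned. Placeholder names.
s3 = boto3.client("s3", endpoint_url="http://rgw.example.com:8080")
bucket = "lua-demo"

for size in (16 * 1024, 512 * 1024, 2 * 1024 * 1024, 4 * 1024 * 1024):
    s3.put_object(Bucket=bucket, Key=f"test/obj-{size}", Body=b"0" * size)

# list_objects_v2 reports each object's storage class next to its size.
resp = s3.list_objects_v2(Bucket=bucket, Prefix="test/")
for obj in resp.get("Contents", []):
    print(f"{obj['Key']:<20} {obj['Size']:>9}  {obj.get('StorageClass', 'STANDARD')}")
```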
So here are the logs. For example, an object bigger than two megabytes: anything above two megabytes is assigned the default storage class, which in this example is STANDARD_IA (infrequent access), and it goes to the EC 4+2 pool. Another one here is 475 kilobytes, so below two megabytes: it is assigned the STANDARD storage class and goes to the 3x replicated pool. We can see here how the matching goes: no match here, so default storage class; there is a match there, so the storage class from rule one, and the object goes to the STANDARD storage class.

And every multipart upload goes to a specific storage class, which I named DEEP_ARCHIVE here, and which is forced. Why is it forced? Because when an object is uploaded with the multipart upload method, the Lua script cannot know the size of the object at the moment the client starts sending it. So you have to pre-choose some storage class for multipart uploads. Since it is a multipart upload, it is probably a big object, so the idea is to send it to DEEP_ARCHIVE. But then again, if it turns out not to be a big object, you can also set a lifecycle policy on the bucket so that the object will be moved, transitioned, to another storage class afterwards.

Okay, so how can we check the optimized retention? If you're using Ceph, you know these commands, rados df or ceph df: they show how many objects you have in the different pools and also how much space they use in your cluster. But of course, if you have to wait 30 days or 365 days to check whether an object has been transitioned from one storage class to another, it would take a lot of time just to make sure it works properly. So there is a setting you can use, rgw_lc_debug_interval. This is a dev setting that was meant exactly for this purpose: checking that a lifecycle policy applies correctly. By setting it, you can say, for example, that a day, 24 hours, is turned into a single second. So if you set a policy that transitions objects after 15 days, you get the result in 15 seconds rather than having to wait for two weeks.
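With rgw_lc_debug_interval shrinking days to seconds, you can also watch a single object from the client side instead of reading the gateway logs. A hypothetical boto3 helper for that, with placeholder names; it assumes RGW reports the transitioned class on HEAD and returns a 404 once the expiration rule has removed the object.

```python
import time

import boto3
from botocore.exceptions import ClientError

# Watch one object through the accelerated lifecycle: print its storage class
# every few seconds until the expiration rule removes it.
s3 = boto3.client("s3", endpoint_url="http://rgw.example.com:8080")
bucket, key = "lua-demo", "watch/me"

s3.put_object(Bucket=bucket, Key=key, Body=b"0" * (3 * 1024 * 1024))
while True:
    try:
        head = s3.head_object(Bucket=bucket, Key=key)
        print(time.strftime("%H:%M:%S"), head.get("StorageClass", "STANDARD"))
    except ClientError as err:
        # A 404 here means the expiration rule has deleted the object.
        print(time.strftime("%H:%M:%S"), "gone:", err.response["Error"]["Code"])
        break
    time.sleep(5)
```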
And so in this example we set a transition after 15 days, so 15 seconds, and an expiration after 30 days, so 30 seconds. Here the object is written at 08:12:56, and about 15 seconds later, at 08:13:11, the lifecycle process checks whether the object is expired, meaning whether it has been here for more than 30 days: is_expired says no, 0. For the 15-day rule it says yes, it has expired, so it transitions the file to DEEP_ARCHIVE, which is the rule that we set. About 15 seconds later, at 08:13:26, the object matches both is_expired checks, so now the object is deleted. That's how you read these logs; this is the standard lifecycle log.

So we can have this quick demo. Here we have a Ceph cluster, ceph status on the left, and the pools here, only 32 PGs each; this is a small lab cluster, nothing big. I didn't expect it to be that quick, but cool. So we've made this series of files of different sizes and we will upload them using rclone. Right now we see that our pools are empty: zero objects, nothing stored, no space used in the cluster. So I'll just start. What is it, 2000? Okay. So I'll send objects of different sizes, and then we get to see that some data is now being stored here. I'm surprised, these were not the numbers I was expecting, but anyway, that's the demo effect.

So you see that, by default, objects above two megabytes went to the warm data pool and objects below this two-megabyte threshold went to the hot data pool. And then, after some time, 15 seconds, we see the data moving to the archive pool. It was fast, but after 15 seconds the data has already been moved to the S3 archive data pool, which uses the 8+3 placement scheme. So the idea, again, is that data which has been there for 15 days in this example, and is not expected to be accessed anymore, or accessed less frequently, is better moved to another storage class.

Then you could ask: why do we still have some data here? It should have moved, right, because this policy moves any data to the archive data pool after 15 days. The reason is that the lifecycle policy only copies objects from one storage class to another, so from one pool to another. It does not move, it copies. What gets rid of the data here is the garbage collector: the garbage collector will see that there is no point in keeping this data. So if we run the garbage collector, we'll see almost all data, probably all of it, being removed. Why? Because the 30 seconds have also passed already, so that's expected. But if I run the rclone command again, we'll see the data going to... oh, I'm not in the right folder with the dd files, and that's the reason why I was uploading more than expected. So let's do this again. Now the GC will come into play and remove some of the data, but not all of it, because the 30 seconds are not quite there yet.
Let's remove this manually; so now all pools should be empty. Yeah, demo effect; I've been doing this four times this morning at least, that's why. You see some of the data goes to the hot pool and some goes to the warm pool depending on the size, and then... well, that's the demo effect, it should have worked better than that, but still, that's the idea: use Lua to optimize data placement right from the start, I mean right from the moment the data is ingested into the cluster, and then get old data to occupy less space in the cluster as it gets accessed less.

Five minutes left, so it's time to say thank you. Acknowledgements: many thanks to Yuval Lifshitz, he's the one who coded the Lua support in the RADOS Gateway; to Steven for his work on RGW auto-tiering; to the authors of a talk on RGW Lua scripting that you can find on YouTube; and also to a French friend of mine, for sharing his experience on using Veeam Backup & Replication with S3 storage on his blog, where you can find good stuff. The idea there is useful for any storage setup: since many software products expect storage classes named after the Amazon ones, you'd better go with the Amazon storage class names. So, for Veeam for example: use a STANDARD storage class, for example on NVMe drives, for objects smaller than 64 KB; use a STANDARD_IA storage class on HDDs or hybrid drives for objects above this threshold; and set the default storage class to STANDARD. That's the right way to use Veeam with S3 storage.

Okay, thank you. Any questions?

Thank you.