Thank you, everyone. Thank you for having me today. It's a great opportunity and a great honor to be with you all. My name is Frédéric Nass. I'm a Ceph ambassador for France, I've been working on Ceph since 2014 at a French university in Lorraine, and I joined Clyso last year.

My talk today is on Ceph S3 storage with dynamic placement and optimized retention. We'll go over the different points here: we'll start with storage classes, then see how we can use Lua scripting in the RADOS Gateway for dynamic placement, and lifecycle policies for optimized retention. Then we'll look at the configuration to set all of this up, and finally go through a demo.

So, storage classes. There can be multiple storage classes per Ceph cluster, linked to different data placement schemes: replication and erasure coding. We can also use compression with some of those storage classes if we want. I also mention Intel QAT here, because that's how you can offload the compression work when you don't want your RADOS gateways to be too loaded with compression compute. This enables S3 clients to place objects based on criteria like performance, access frequency, durability, and cost. You can have multiple storage classes with multiple data placement schemes, and you won't get the same latency depending on whether a class uses erasure coding or replication. And regarding cost, you use less raw storage when objects end up on erasure-coded pools rather than replicated ones, for example.

S3 clients set the storage class upon writing an object, but they often choose incorrectly, or sometimes do not specify any storage class at all. So the idea is: what if we chose for them? When the client does not specify it, or incorrectly forces it, we could automatically assign a storage class to an object based on the object type (its name, for example), the object size, the tenant or even the user, the bucket the object is going to, and the upload method. By doing this, storage efficiency and access performance would be ensured right from the moment the data is written. That's the idea with Lua scripting.
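To make the contrast concrete, this is roughly what it looks like when an S3 client does choose a storage class itself, versus simply omitting it and leaving the decision to the gateway. This is a minimal boto3 sketch added for illustration, not something from the talk; the endpoint, credentials, bucket and class names are placeholders.

```python
import boto3

# Placeholder RGW endpoint, credentials and bucket; adjust to your own setup.
s3 = boto3.client(
    "s3",
    endpoint_url="http://rgw.example.com:8080",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

# A client *can* pick a storage class explicitly on upload...
s3.put_object(
    Bucket="backups",
    Key="catalog/index-0001",
    Body=b"small metadata object",
    StorageClass="STANDARD_IA",  # whatever class the zone defines under this name
)

# ...but many clients simply omit StorageClass, which is exactly the case
# where a server-side Lua script can choose for them.
s3.put_object(Bucket="backups", Key="data/chunk-0001", Body=b"x" * (4 * 1024 * 1024))
```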
So RADOS gateways can execute Lua scripts on the fly when a request arrives, a PUT request for instance, depending on the context: preRequest, postRequest, background, getData, putData. You can find examples and explanations for these different contexts on the Lua scripting page of the documentation. The Lua scripts can read and modify object metadata on the fly, and therefore we can dynamically set or modify an object's storage class upon writing, according to certain criteria.

Where can Lua scripting help? Here is one example: some solutions expect uniform performance regardless of the size and the number of objects they write to a single S3 bucket. A typical example is backup and recovery software, which stores data objects but also metadata objects (backup catalogs, indexes, locks, pointers) in the very same bucket, and they are all of very different sizes: very small objects next to big objects, the actual content of the backups. So the challenge is: how do we keep the performance, even when the bucket grows to millions of objects? The idea here is to differentiate those objects and automatically direct small objects to replicated pools and bigger ones to EC pools. Note that since version 12, Veeam Backup & Replication groups small metadata items into single S3 objects to address this kind of design flaw, but there are other applications that could benefit from this kind of scenario.

Lua scripts could also provide an additional protection layer, beyond bucket ACLs, to block requests based on specific criteria, for example user or tenant. This could enforce read-only or write-only bucket access on publicly exposed S3 gateways, as opposed to internal gateways that you could use to write or to read. My colleagues here, and also Marshall, have been working on a PR to eventually add this ability for Lua scripts to block requests. And there are other use cases that you can find in the upstream documentation, along with some code samples: trace requests for a specific bucket, apply default metadata when it is not specified by the client, log operations only when errors appear, or capture operation traces for analytics, for instance.

What about Lua performance and reliability? For each request we now add a script that needs to be run, so that might have some consequences. What about CPU and RAM consumption? Initial tests showed that Lua does not add much latency, on the order of tens of microseconds. Also, since Squid, the execution of a script in a context can use a bounded amount of RAM, around 128 KB, and this is configurable. And what if the script fails, on a timeout (one second by default) or on a syntax error? Well, the script failure is non-fatal for the request: clients will receive a normal response, as if the script had not been applied at all. So it won't break the upload or the activity of the client.
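Either way, the upload succeeds from the client's point of view; one simple way to check what the gateway actually decided is to read the storage class back after uploading without one. A minimal boto3 sketch, added for illustration, with a placeholder endpoint and bucket:

```python
import boto3

# Placeholder endpoint/bucket; the point is only to observe, from the client
# side, which storage class the gateway (via its Lua script) ended up assigning.
s3 = boto3.client("s3", endpoint_url="http://rgw.example.com:8080")

# Upload without specifying StorageClass: the decision is left to the gateway.
s3.put_object(Bucket="backups", Key="probe/small", Body=b"a" * (16 * 1024))
s3.put_object(Bucket="backups", Key="probe/large", Body=b"a" * (4 * 1024 * 1024))

# HEAD the objects back: the response carries the storage class they were
# stored under (S3-compatible services may omit it for plain STANDARD).
for key in ("probe/small", "probe/large"):
    head = s3.head_object(Bucket="backups", Key=key)
    print(key, head.get("StorageClass", "STANDARD"))
```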
There is also something to note here: a performance improvement has been worked on recently, in the last two weeks, which improves performance by caching the Lua bytecode and also reduces the number of network calls needed to read the RADOS object containing the Lua script.

So that was data placement optimization. Now let's talk about lifecycle policies, which is the optimized retention part. The idea is to move objects to other storage classes based on some criteria. That's the first point: transition objects between storage classes based on criteria, which can be days (how long the objects have been in the cluster) or object size greater than or less than a threshold. That's interesting because you can do both at the same time, so that if you move the size threshold, some objects will transition to the other pool and some the other way around. That's good to know, and it has been available since Squid. The idea is to optimize storage for rarely accessed data. Lifecycle policies can also delete non-current versions of objects, possibly retaining a few versions, and free up space by cleaning up incomplete multipart uploads. You know, when a multipart upload fails and the client never comes back to resume its workload, it leaves a lot of parts in the bucket that you have to take care of removing. Lifecycle rules can also use tags or prefixes to apply only to specific objects.

This is an example of a rule that you apply on a specific bucket: it will clean up multipart upload parts after 10 days, move objects to a DEEP_ARCHIVE storage class (backed by an 8+3 erasure-coded pool, as we'll see) after 30 days, and expire the objects after 365 days. I added a "be careful" here because it will also empty your bucket of all of your objects, so that may not be the best thing to do; this is just for a proof of concept.
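Expressed the way an S3 client could apply it with boto3, that rule looks roughly like this. This is an illustrative sketch, not the exact policy from the slides; the bucket and storage class names are placeholders matching the ones used in the talk.

```python
import boto3

# Placeholder endpoint and bucket; "DEEP_ARCHIVE" is the class name used in the talk.
s3 = boto3.client("s3", endpoint_url="http://rgw.example.com:8080")

s3.put_bucket_lifecycle_configuration(
    Bucket="backups",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "cleanup-stale-multipart-uploads",
                "Filter": {"Prefix": ""},
                "Status": "Enabled",
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 10},
            },
            {
                "ID": "tier-then-expire",
                # Size-based filters (ObjectSizeGreaterThan / ObjectSizeLessThan)
                # could be combined here as well.
                "Filter": {"Prefix": ""},
                "Status": "Enabled",
                "Transitions": [{"Days": 30, "StorageClass": "DEEP_ARCHIVE"}],
                # Be careful: this really does delete the objects after a year.
                "Expiration": {"Days": 365},
            },
        ]
    },
)
```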
So the configuration looks like this. We start by creating Ceph pools of different kinds: replicated, erasure-coded, and another type of erasure coding, 8+3. We also add the storage classes and a default storage class, and you can verify the configuration with the command on the right. Then we configure the data, index and EC pools in the zone placement, and this is also where we can set the compression, LZ4 or ZSTD for example, and then apply the configuration.

We also need a configuration file for the different rules that the Lua script will use: for example, choose this storage class if the object has this size, or matches this name pattern, or is going to this bucket, or the connection comes from this tenant. We can add this configuration file to the RGW service specification, then apply the service and redeploy the RADOS gateways. We can then check inside the container and actually list the configuration file. You might prefer to simply mount a static file on the RADOS gateway hosts instead, so you don't have to redeploy the service to apply a new rule: you would just log into the RADOS gateway host and modify the file. Just make sure the file is not replaced by your editor, because vim, for example, writes a temporary file and swaps it in, so the mounted link ends up broken and, inside the container, you won't see the modifications you made on the host. That's a typical thing I've been running into. So you can do the configuration through the service specification, or directly on the host by mounting the file.

Next we give our RADOS gateways the Lua script. Here is an example script that I've made; you can find it on my GitHub if you want to use it. I've been using it myself, but please make sure to double-check and re-read the script before running it.

Then we apply a lifecycle policy; I'm not going to go deep into that. By the way, if you download the slides, you will get all of these as text boxes, so you can copy and paste the commands, which makes things easier. We can check the lifecycle status: it's a background process that runs from time to time, looks over the different objects and figures out what needs to be done with them. At the beginning the bucket is in its initial state, and you can start the process manually with this command and then list again to see that the lifecycle ran.

Lua scripts log everything at debug rgw level 20, so if you want to see how your script is working, whether your rules are being matched or not, make sure to raise the debug level up to that, one of the highest levels, to see what it does. The idea then is to monitor the pool activity, create objects of different sizes, from 16 KB to 4.7 MB in this example, push them to the S3 storage and see whether our rules match or not. We can also list every object in the bucket with the command below and see which storage class was assigned to each of them.
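A minimal boto3 version of that test, added for illustration with placeholder names: push a few objects of different sizes without a storage class, then list them back together with the class each one got.

```python
import boto3

# Push objects of a few different sizes without any storage class, then list
# the bucket to see which class each one was assigned. Placeholder names.
s3 = boto3.client("s3", endpoint_url="http://rgw.example.com:8080")
bucket = "lua-demo"

for size in (16 * 1024, 512 * 1024, 2 * 1024 * 1024, 4 * 1024 * 1024):
    s3.put_object(Bucket=bucket, Key=f"test/obj-{size}", Body=b"0" * size)

# list_objects_v2 reports each object's storage class next to its size.
resp = s3.list_objects_v2(Bucket=bucket, Prefix="test/")
for obj in resp.get("Contents", []):
    print(f"{obj['Key']:<20} {obj['Size']:>9}  {obj.get('StorageClass', 'STANDARD')}")
```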
So here are the logs. For example, an object bigger than two megabytes: anything above two megabytes is assigned the default storage class, which in this example is STANDARD_IA (infrequent access), and it goes to the EC 4+2 pool. Another one here is 475 kilobytes, so below two megabytes: it is assigned the STANDARD storage class and goes to the 3x replicated pool. We can see here how the matching goes: no match here, so default storage class; there is a match there, so the storage class from rule one, and the object goes to the STANDARD storage class.

And every multipart upload goes to a specific storage class, which I named DEEP_ARCHIVE here, and which is forced. Why is it forced? Because when an object is uploaded with the multipart upload method, the Lua script cannot know the size of the object at the moment the client starts sending it. So you have to pre-choose some storage class for multipart uploads. Since it is a multipart upload, it is probably a big object, so the idea is to send it to DEEP_ARCHIVE. But then again, if it turns out not to be a big object, you can also set a lifecycle policy on the bucket so that the object will be moved, transitioned, to another storage class afterwards.

Okay, so how can we check the optimized retention? If you're using Ceph, you know these commands, rados df or ceph df: they show how many objects you have in the different pools and also how much space they use in your cluster. But of course, if you have to wait 30 days or 365 days to check whether an object has been transitioned from one storage class to another, it would take a lot of time just to make sure it works properly. So there is a setting you can use, rgw_lc_debug_interval. This is a dev setting that was meant exactly for this purpose: checking that a lifecycle policy applies correctly. By setting it, you can say, for example, that a day, 24 hours, is turned into a single second. So if you set a policy that transitions objects after 15 days, you get the result in 15 seconds rather than having to wait for two weeks.
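With rgw_lc_debug_interval shrinking days to seconds, you can also watch a single object from the client side instead of reading the gateway logs. A hypothetical boto3 helper for that, with placeholder names; it assumes RGW reports the transitioned class on HEAD and returns a 404 once the expiration rule has removed the object.

```python
import time

import boto3
from botocore.exceptions import ClientError

# Watch one object through the accelerated lifecycle: print its storage class
# every few seconds until the expiration rule removes it.
s3 = boto3.client("s3", endpoint_url="http://rgw.example.com:8080")
bucket, key = "lua-demo", "watch/me"

s3.put_object(Bucket=bucket, Key=key, Body=b"0" * (3 * 1024 * 1024))
while True:
    try:
        head = s3.head_object(Bucket=bucket, Key=key)
        print(time.strftime("%H:%M:%S"), head.get("StorageClass", "STANDARD"))
    except ClientError as err:
        # A 404 here means the expiration rule has deleted the object.
        print(time.strftime("%H:%M:%S"), "gone:", err.response["Error"]["Code"])
        break
    time.sleep(5)
```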
And so in this example we set a transition after 15 days, so 15 seconds, and an expiration after 30 days, so 30 seconds. Here the object is written at 08:12:56, and about 15 seconds later, at 08:13:11, the lifecycle process checks whether the object is expired, meaning whether it has been here for more than 30 days: is_expired says no, 0. For the 15-day rule it says yes, it has expired, so it transitions the file to DEEP_ARCHIVE, which is the rule that we set. About 15 seconds later, at 08:13:26, the object matches both is_expired checks, so now the object is deleted. That's how you read these logs; this is the standard lifecycle log.

So we can have this quick demo. Here we have a Ceph cluster, ceph status on the left, and the pools here, only 32 PGs each; this is a small lab cluster, nothing big. I didn't expect it to be that quick, but cool. So we've made this series of files of different sizes and we will upload them using rclone. Right now we see that our pools are empty: zero objects, nothing stored, no space used in the cluster. So I'll just start. What is it, 2000? Okay. So I'll send objects of different sizes, and then we get to see that some data is now being stored here. I'm surprised, these were not the numbers I was expecting, but anyway, that's the demo effect.

So you see that, by default, objects above two megabytes went to the warm data pool and objects below this two-megabyte threshold went to the hot data pool. And then, after some time, 15 seconds, we see the data moving to the archive pool. It was fast, but after 15 seconds the data has already been moved to the S3 archive data pool, which uses the 8+3 placement scheme. So the idea, again, is that data which has been there for 15 days in this example, and is not expected to be accessed anymore, or accessed less frequently, is better moved to another storage class.

Then you could ask: why do we still have some data here? It should have moved, right, because this policy moves any data to the archive data pool after 15 days. The reason is that the lifecycle policy only copies objects from one storage class to another, so from one pool to another. It does not move, it copies. What gets rid of the data here is the garbage collector: the garbage collector will see that there is no point in keeping this data. So if we run the garbage collector, we'll see almost all data, probably all of it, being removed. Why? Because the 30 seconds have also passed already, so that's expected. But if I run the rclone command again, we'll see the data going to... oh, I'm not in the right folder with the dd files, and that's the reason why I was uploading more than expected. So let's do this again. Now the GC will come into play and remove some of the data, but not all of it, because the 30 seconds are not quite there yet.
Let's remove this manually; so now all pools should be empty. Yeah, demo effect; I've been doing this four times this morning at least, that's why. You see some of the data goes to the hot pool and some goes to the warm pool depending on the size, and then... well, that's the demo effect, it should have worked better than that, but still, that's the idea: use Lua to optimize data placement right from the start, I mean right from the moment the data is ingested into the cluster, and then get old data to occupy less space in the cluster as it gets accessed less.

Five minutes left, so it's time to say thank you. Acknowledgements: many thanks to Yuval Lifshitz, he's the one who coded the Lua support in the RADOS Gateway; to Steven for his work on RGW auto-tiering; to the authors of a talk on RGW Lua scripting that you can find on YouTube; and also to a French friend of mine, for sharing his experience on using Veeam Backup & Replication with S3 storage on his blog, where you can find good stuff. The idea there is useful for any storage setup: since many software products expect storage classes named after the Amazon ones, you'd better go with the Amazon storage class names. So, for Veeam for example: use a STANDARD storage class, for example on NVMe drives, for objects smaller than 64 KB; use a STANDARD_IA storage class on HDDs or hybrid drives for objects above this threshold; and set the default storage class to STANDARD. That's the right way to use Veeam with S3 storage.

Okay, thank you. Any questions?

Thank you.