Yeah, it's 4 p.m., so let's start with the next talk. Let's go.

Awesome, thank you. For the last four years I have been responsible for running our PostgreSQL infrastructure, which is supporting GitLab.com. The eight years before that, I worked at an open source, and especially PostgreSQL, consulting company. Today I would like to talk to you about how we execute PostgreSQL and operating system upgrades at GitLab with zero downtime. And I would like to do that by answering a few questions. We start with PostgreSQL upgrades: how do they work and why are they hard? We talk about OS upgrades, what's the deal there, and then I will show you how we minimized the impact for our users.

First of all, PostgreSQL upgrades — I mean major upgrades here. Why are they hard? If you have a stateless service like an application server, you might get away with just changing the running process. For example, for a pooling agent like PgBouncer, you can just switch out the running process without any interference. Or if you have basically any application server, you just start up new application servers, fade out all the old ones, and you can easily upgrade without having any impact on your users. But if you have a service that comes with state, it's much, much harder, and especially for a complex relational database like PostgreSQL: when you do a major upgrade, you don't just change the process, not just the code that is running, you might need to change the data that is stored. And if you have a large database, you maybe have a lot of data to change, or at least to go through. For a PostgreSQL major upgrade it often occurs that the internal structure — how the data is stored on disk — is optimized and gets changed. So you have to rewrite your state, and that's not something that happens instantly.

So, the recommended default method to upgrade a PostgreSQL database, which is still the default — and if that works for you, you should definitely do it — is dump and restore. You start out with your database. Your data is stored in a binary format on disk that might depend on the libraries you use on your system. It depends on your architecture, so the data will look different on an x86 machine than on a RISC machine — endianness, for example — so this data cannot be easily transferred to a completely different architecture, and you cannot use it with a different major version of PostgreSQL. The default method is: you export this data into a logical format that is independent of your architecture and your version. For example, plain SQL will do the trick, but there's also an internal format that is slightly more optimal.
So you start out by exporting your data — you go from the binary stored data to a logical representation — and then on your new server you just import the data and it gets transformed again into the binary format on your system. Afterwards you have to recreate all the helping structures: that's mostly indexes, but also statistics and so on. This is the safest method, because you can be sure that all the data is valid — it was exported once and then parsed again. For many data types you have special validation; for example, if you have JSON stored in your database, it gets parsed again, so if anything is fishy there, you will notice it. And also all your helping structures are fresh — you get squeaky-fresh indexes, no bloat. But the problem is, if you have a large database, this takes quite some time. Our main database currently has like 40 terabytes of data, and for us this operation would take multiple days. If it works for your database, please do that. You can listen to the rest of my talk, but if that works, please do it.

The next method is still fairly safe, but you have to think about a few things, and that's pg_upgrade. pg_upgrade is a tool that gets released with every PostgreSQL release, and it knows the current data structure and all previous data structures, so it knows which parts have to change. It goes through all your heap data — your binary data on disk — and changes all the parts that need to be changed. Quite simple; it's reasonably fast, it's reasonably safe. You have to think of a few things; for example, the helping structures like indexes will not be updated. So if the new PostgreSQL version comes with an optimized version of B-tree indexes and you want to profit from it, you have to recreate the indexes yourself. Or if you would like to go to a new operating system or a different architecture, you cannot use this method safely — keep that in mind. But still, if this fulfills your needs, awesome, use this method and don't look deeper into the next thing.

So why can't we use these methods for GitLab? We have actually done that before: when I joined GitLab, we used pg_upgrade. But we had the business requirement, before going to a new PostgreSQL version, to make sure that it's operational, that all our data is correct, and that we don't have a performance degradation. So we needed to run a significant number of tests, and the overall downtime for an upgrade was like four to six hours.
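To make the two classic methods above a bit more concrete, here is a minimal command-line sketch (not from the talk); host names, paths, database name and parallelism are illustrative placeholders, not GitLab's actual setup.

```python
#!/usr/bin/env python3
"""Sketch of the two 'classic' PostgreSQL major-upgrade paths described above.
All host names, directories and settings are illustrative only."""
import subprocess

# Method 1: dump and restore (logical export; safest, but slowest).
# -Fc uses pg_dump's custom format, the "internal format" that is a bit
# more compact than plain SQL; -j parallelises the restore.
subprocess.run(["pg_dump", "-Fc", "-h", "old-db", "-d", "appdb",
                "-f", "/backup/appdb.dump"], check=True)
subprocess.run(["pg_restore", "-h", "new-db", "-d", "appdb",
                "-j", "8", "/backup/appdb.dump"], check=True)

# Method 2: pg_upgrade (rewrites only what changed between major versions).
# --link avoids copying data files; note that indexes and statistics are
# NOT rebuilt, and the result stays tied to the architecture and C library.
subprocess.run(["pg_upgrade",
                "--old-datadir", "/var/lib/postgresql/16/main",
                "--new-datadir", "/var/lib/postgresql/17/main",
                "--old-bindir", "/usr/lib/postgresql/16/bin",
                "--new-bindir", "/usr/lib/postgresql/17/bin",
                "--link"], check=True)
```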
So that four-to-six-hour downtime was not really feasible for us. Just to give you a little insight into our usage profile: we have over 50 million users around the world, basically in all time zones, and even our own team — over two and a half thousand people — all use the platform for their daily work. So the requirement we got was basically: we can't afford any downtime, we had no budget for downtime. And also the requirement is that after we upgrade any component, if we then realize that we get a strong performance degradation, we have to be able to roll back. In many situations, if you do an upgrade and you have a performance degradation, you have other means of optimizing — for example, you can buy the next larger cloud instance if you're running on a hyperscaler, or you can buy the next larger piece of hardware. But because we are basically running on the largest machines already, that's not an option. So if we experience a performance degradation, we need to be able to roll back quite swiftly.

So yeah, I mentioned zero downtime, but what is zero downtime? Because we are talking about a software-as-a-service application, there's no instant reaction. If you press a button, like creating an issue, the issue is not there instantly. It will take like 100 milliseconds, 200 milliseconds, depending on where you are. So we cannot aim for literally zero — that's not possible — so we need a metric to decide what zero downtime is. And what we do is define that we don't want any noticeable user impact. The good thing here is we already had a metric for user impact, and that is called Apdex, the application performance index. We take samples for different user interactions — for example, creating an issue or running a CI job — and we define what is satisfying for the user. Like, you click the button, new issue, and after a few hundred milliseconds the issue pops up: it feels snappy, that feels satisfying. If you click on create new issue and you have to wait maybe one or two seconds, that does not feel too great — you're looking at a blank screen or a progress indicator — but you tolerate it. If the action takes significantly longer, the user might get annoyed, press F5, try again, and that's a frustrating experience. So for a lot of actions within our application we define these thresholds, and then we continuously take samples and calculate the Apdex.

So for example, here — you see my laser pointer is not really visible — but you see on the top line there's the satisfied count, so if one hundred percent of all requests were satisfied...
...if that is the case, then we have like one hundred samples, and one hundred divided by one hundred is one. So if everything is perfect, we get one. And if all samples were frustrating, they are multiplied by zero, so we get a zero — divided by whatever, we get a zero. So it's scaled from zero to one: zero is nobody satisfied, everyone is frustrated, and one would be everyone satisfied. That's the metric we already have in place, and we page on this metric. So if it goes below 98-point-something percent, people get paged.

Yeah, how do we achieve that? There's a really cool method in PostgreSQL called logical replication. Basically the thing we saw at the beginning — where you transform the physically stored data into a logical representation — can be used in a replicating fashion. And we use quite some automation to make it actually feasible. So what is logical replication, or what does it give us? Unlike streaming replication, logical replication — because it uses the logical format — can replicate between different major PostgreSQL versions, and even different infrastructures, different architectures. So we can just clone our current production, upgrade it, bring it in sync again with the main production system, and then switch over later.

One thing — doesn't it come with restrictions? Yeah, it comes with quite some restrictions. In a previous talk, where I presented the method we used before, I go into more detail on the restrictions; you can watch the recording if you like. But for the scope of this talk, I will only go into the main restriction, which is: while logical replication is enabled, we can't use any DDL. DDL is data definition language, so CREATE TABLE, ALTER TABLE, DROP TABLE. That's not possible, unfortunately. For GitLab, we solved that with two features. One is a process feature: we block our delivery colleagues from deploying new GitLab versions that would alter the schema. And they also get a one-week heads-up: hey, next week we're not allowed to change the database. And also we have a feature flag that you can use to block DDL from happening. That includes the deployments, but we also have some background workers, for example, that do partitioning — we have really large multi-terabyte tables that get partitioned in the background — and all these jobs are frozen during that time period. If you want to do a zero-downtime upgrade for GitLab, you can use the same feature flag.
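Before moving on, here is a rough illustration (not GitLab's actual tooling) of the logical replication mechanism described a moment ago: a publication on the source and a subscription on the already-upgraded target. All names and connection strings are placeholders, and the sketch glosses over the careful replication-slot and LSN coordination a real zero-downtime procedure needs.

```python
"""Minimal sketch of PostgreSQL's built-in logical replication between two
different major versions. Names and DSNs are placeholders."""
import psycopg2

# On the source cluster (e.g. PostgreSQL 16): publish all tables.
src = psycopg2.connect("host=source-db dbname=appdb user=replicator")
src.autocommit = True
with src.cursor() as cur:
    cur.execute("CREATE PUBLICATION upgrade_pub FOR ALL TABLES;")

# On the target cluster (e.g. PostgreSQL 17, already pg_upgrade'd and in the
# same schema state): subscribe. CREATE SUBSCRIPTION must run outside a
# transaction block, hence autocommit.
tgt = psycopg2.connect("host=target-db dbname=appdb user=replicator")
tgt.autocommit = True
with tgt.cursor() as cur:
    cur.execute("""
        CREATE SUBSCRIPTION upgrade_sub
        CONNECTION 'host=source-db dbname=appdb user=replicator'
        PUBLICATION upgrade_pub
        WITH (copy_data = false);  -- data is already there via streaming + pg_upgrade
    """)
```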
Coming back to DDL: if you use any other software, you might not have the problem in the first place, because most standard software does not do a lot of DDL in the background. But you definitely have to check for your concrete application beforehand.

So let me walk you through a simplified version of the process. First we start out with our current version of the database — PostgreSQL 16 in this case — and our application talking to it. Then we create a one-to-one copy. In our case, we just create new virtual machines from snapshots of the currently running ones, and we use the standard method of streaming replication to get the second instance synchronized. That's a nice thing: PostgreSQL, by design, writes all of its data changes into something called the write-ahead log, in a binary format, and you can just stream that to a different server to keep it in the same state as the source. So our starting system is called the source, and the system we are replicating to is called the target here. Then we can stop the replication and upgrade our target system. So we run the program you saw before, pg_upgrade: we upgrade the data on disk, and we can recreate the indexes. Afterwards we use logical replication to sync it again. So now we have the old database version running and a new cluster with the new version, and both have the same data. And once we are satisfied, we can just switch the application from connecting to the old one to connecting to the new one.

That's basically the state we had at the end of 2023. And now, to give a small look into the actual user impact: this is our Apdex. To give you a good statistical view — that's basically the nitpicking view — I'm looking at the top one percent, so we are seeing from 0.99 to 1.0 Apdex. Our paging threshold, 98.8 percent — the point where people get paged — is not even visible in this graph; it's a little bit below. You see that's one week of data, and you see that it's not constantly at the same level. We sometimes have degradations: it can be the application, the database, or maybe a server fails and we need to start a new instance, or we have more load and we start more pods, things like that. And this window here was one of our switchovers. So you can see the impact on our users is barely measurable. During the switchover we have a short degradation window, but it's less than the normal noise. So it's measurable, but it's not significant.
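To make the metric concrete, here is a minimal sketch of the standard Apdex formula described earlier — satisfied samples count fully, tolerating samples count half, frustrated samples count zero. The thresholds in the example are made up, not GitLab's actual values.

```python
def apdex(latencies_ms, satisfied_ms=300, tolerating_ms=1200):
    """Standard Apdex: satisfied samples count 1, tolerating count 0.5,
    frustrated count 0; the sum is divided by the total sample count.
    Thresholds here are illustrative only."""
    if not latencies_ms:
        return 1.0
    satisfied = sum(1 for t in latencies_ms if t <= satisfied_ms)
    tolerating = sum(1 for t in latencies_ms if satisfied_ms < t <= tolerating_ms)
    return (satisfied + tolerating / 2) / len(latencies_ms)

# 1.0 means everyone is satisfied, 0.0 means everyone is frustrated.
print(apdex([120, 250, 900, 4000]))  # -> 0.625
```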
So yeah, we were able to do PostgreSQL upgrades with basically zero downtime and zero user impact. But that's old news. So today the focus is: what did we do for improvement since then, or what can we improve further? And we have two main things here. One is more of a business requirement: the switchover — when we are switching over from the old version to the new version — was a point of no return. We couldn't switch back. And in the past we had the problem that after a switchover we had a performance degradation. The requirement here is to remove this uncertainty, this risk. So we need to change the process in order to be able to roll back even after the switchover. And the second thing is something that I really want, because this whole process only upgrades PostgreSQL — not any libraries, not the operating system. So I would like to combine both to reduce the labor involved.

Okay, how do we remove, or rather move, the point of no return to a later state? That's relatively easy; all the technology we need is already in place. Because after the switchover, we can reverse the replication and stream all the data back to the old cluster. So the old cluster is kept in sync, and if we come to the conclusion that we can't fix the new cluster, we can roll back. In reality, we would put a lot of people into optimizing the queries and making the new cluster perform, but if that's no longer possible, we have an option left. So after the upgrade, we leave the old cluster alive and replicate, via logical replication, back into the old cluster. We can operate and monitor, and if we come to the conclusion that we really have to roll back, we can switch back to the old one without losing any data.

So now to the OS upgrade. Why are OS upgrades a problem? Why can't I just create a new server with a new Linux version of my favorite distribution and just switch over to it? There can be multiple problems with libraries, but the major thing here is something called collation. New operating system versions normally come with a new C library — for most it's glibc — and glibc provides something called the system-wide collation. That's the sort order: how you order strings. At first glance it should be super easy. If you want to order single characters, like a, b, c, it's obvious: first the A, then the B, then the C.
It becomes much more problematic if you have lower case and upper case. If you want to order a list of lower and upper case letters, do you want the lowercase 'a' first, or do you want to order a, b, c, d until the end and then start with the uppercase letters? It's more complicated if you have, for example, strings with numbers — what should come first, 0, 1, or 2 — or with special characters. And unfortunately, there's not one perfect collation that never changes; the collation changes regularly. So when you get a new operating system, you have to accept that your strings will be sorted differently than before. If you have a stateless application, that's not a big deal. But if you have indexes that were created with one collation — all your data was ordered by one ordering pattern — and then you switch to a new operating system and it starts to order things differently, then you have the problem that you can't find values that are already in your index. If you use the index for search, that's already annoying, because you can't find the stuff you already have. But if you use the index to, for example, enforce a constraint, like a unique constraint, you are now able to put duplicates in your database, even though they should be unique. And that will break and corrupt your data.

Yeah, how do we solve this problem? The first thing is, it does not apply to all indexes. For example, there are some simple data types where you don't have a special collation, like an integer. It's quite easy to sort an integer — you can do it on a decimal representation or binary, there's not much to it. But basically, the simple solution is to rebuild all of your indexes on complex or collation-based data types, mostly strings. The good thing is, if you do the upgrade I suggested in the beginning, the dump-restore upgrade, you don't have the problem — all of the indexes will be rebuilt anyway. But if you use any of the other upgrade methods, you have to do it yourself, manually. And if you would just rebuild all indexes for GitLab, it would also take multiple days, so that's not feasible for us.

So what do we do to make it feasible? Ahead of the upgrade, we create a new system, we do a test upgrade with a production copy, and then we use PostgreSQL-internal functions called amcheck. We check which indexes would be corrupted if we upgrade. Then we make a list of all those indexes and recreate them. If that's a fairly cheap operation, we just take the list and start with our upgrade process. If the recreation of all of those would take too long, we have to optimize.
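A rough sketch (not the actual playbooks) of what collecting that list on the test-upgraded cluster could look like, using the amcheck extension's bt_index_check(); connection details are placeholders, and the index selection is simplified — in practice you would narrow it down to indexes on collatable types such as text.

```python
"""Sketch: find B-tree indexes that amcheck flags after a test upgrade.
Run against the upgraded test cluster; names and DSN are placeholders."""
import psycopg2

conn = psycopg2.connect("host=test-upgraded-db dbname=appdb")
conn.autocommit = True
corrupted = []

with conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS amcheck;")
    # All B-tree indexes; a real run would filter to collation-dependent
    # columns (text, varchar, ...) to save time.
    cur.execute("""
        SELECT c.oid::regclass
        FROM pg_class c
        JOIN pg_am am ON am.oid = c.relam
        WHERE c.relkind = 'i' AND am.amname = 'btree';
    """)
    indexes = [row[0] for row in cur.fetchall()]

for idx in indexes:
    with conn.cursor() as cur:
        try:
            # heapallindexed=true also verifies every heap tuple can be found
            # via the index, which is what collation breakage tends to hit.
            cur.execute("SELECT bt_index_check(%s::regclass, true);", (idx,))
        except psycopg2.Error as err:
            corrupted.append((idx, err.pgerror))

print(f"{len(corrupted)} indexes would need to be rebuilt:")
for idx, _ in corrupted:
    print(" ", idx)
```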
To optimize, you can, for example, try to find out whether some of these indexes might not be needed at all, or you can try a different index type, or maybe you can break down an index: instead of one large index for the whole table, you can have a lot of partial indexes that only index certain spans of the table, things like that. And you can also do something like a lazy recreate. For example, you come to the conclusion: we have this really large index, it's used to optimize full-text search on all of your issues, and the negative effect if this index were slightly corrupted would be that, in a totally weird edge case, somebody couldn't find an issue that has German umlauts in it. You can decide, okay, that's not super critical. We are fine with not recreating this index during the upgrade window; we are fine if it takes until the next Monday morning, because we will have no data corruption, only a slight functional degradation. So if you look into that beforehand, you can make sensible decisions.

To give you a perspective on how we normally do it: we do all these upgrades on the weekend, because that's the lowest-load phase for us — still not zero load, unfortunately, but the lowest. On Saturday morning we start with all the steps required for an upgrade, and then we still have until Sunday to do additional maintenance operations like index recreations. So depending on how long the first steps take, we have at least 12 to 24 hours for that. We recreate all the affected indexes, and afterwards we run this internal function, amcheck, again to make sure that we really have no corruption. And most of the time we also have time for running additional sanity checks.

Cool, and now I would like to walk through the full process, how we did it last year, with all the improvements. We are going from PostgreSQL 16 to 17 and from Ubuntu 20.04 to 22.04. Okay, we start out with our database again and the application talking to it. And because it's oversimplified, I'll give you a little glimpse of what's behind the symbols. The GitLab icon on the left, that's our application stack; most of it runs on Kubernetes, except our Redis. And the database, in this case, is nine large instances and one smaller one for taking snapshots, distributed across three availability zones. We start out with Ubuntu 20.04 and PostgreSQL 16. On the top right, I have a traffic-light icon to show you whether, in the current phase, data definition language is allowed. For us that's quite important, because multiple times a week my colleagues deploy new versions, and sometimes they want to change the schema. So one of my requirements was to keep the phase where they can't execute DDL to the minimum.
The first step is: we create a test cluster to get all our metrics, to know what we are dealing with. The test cluster can be minimal; in our case it's like three nodes. That doesn't affect production at all, and it already starts out on the new OS version, ready for the new PostgreSQL version. Then we do a mock upgrade: we upgrade our test cluster, we get all the concrete execution times — we need to know exactly how long each step takes — and we also get a list of all the corrupted indexes, because we run the amcheck tooling. Once we have all the metrics we need, we can schedule our upgrade: we remove the test cluster and we create the actual target cluster. In our case, again, that's nine large nodes over three availability zones and one backup node.

And now comes the step where we have to disable DDL. Because now we have to switch from streaming replication — where we just send the actual byte data from the source to the target cluster — to logical replication, where the source cluster has to translate this binary data into a logical representation and send it to the target, and DDL would break that process. So now begins the phase where my colleagues are no longer allowed to deploy schema changes. That is, by the way, Saturday morning in this case. Now we stop the replication and we upgrade our target cluster, so we run pg_upgrade — that runs like 20-something minutes, I guess. Then we re-synchronize again, so we have the same state, we get fresh data again. And then we can do all the additional steps: we do a full re-index to make sure we don't have the corrupted indexes. Afterwards we run ANALYZE. ANALYZE goes through all your data and creates statistics, which the planner uses. It's super important to have fresh statistics, so the planner knows: can I use an index, or do I have to read the full table? Or, if you have partitioned tables, it knows: oh, for this query I have to look into these partitions. And we also run the corruption check again to make sure that our index recreations were successful.

And now we can start with the switchover. At this point the application still talks to the source cluster, the PostgreSQL 16 one. Then we gradually start to load-balance read-only queries to the new cluster. We start with one replica, so all of the read queries get scattered across all of the standbys in the old cluster, the source cluster, and one standby of the target cluster gets into the load balancing. And that's really cool, because we have a nice dashboard and we can make an apples-to-apples comparison: we make sure all of the standbys get roughly the same number of connections, and we can compare the performance apples to apples.
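The full re-index and ANALYZE step mentioned above might, in a very reduced form, look like the sketch below; the index names and connection details are placeholders, and the real playbooks do considerably more.

```python
"""Sketch: rebuild the previously identified indexes and refresh planner
statistics on the upgraded target cluster. Names are placeholders."""
import psycopg2

# List produced by the amcheck run on the test upgrade.
corrupted_indexes = [
    "public.index_issues_on_title",
    "public.index_users_on_username",
]

conn = psycopg2.connect("host=target-db dbname=appdb")
conn.autocommit = True  # REINDEX CONCURRENTLY cannot run inside a transaction block

with conn.cursor() as cur:
    for idx in corrupted_indexes:
        # CONCURRENTLY (PostgreSQL 12+) avoids blocking writes, at the cost of
        # taking longer; while the target does not yet serve traffic, a plain
        # REINDEX would also be an option.
        cur.execute(f"REINDEX INDEX CONCURRENTLY {idx};")

    # Fresh statistics so the planner makes sensible choices from the start.
    cur.execute("ANALYZE;")
```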
On that comparison dashboard we can see: does the new standby have more CPU load, does it do more IOPS, things like that. In general, when we go to a new PostgreSQL version, performance increases — but sometimes not, because a query plan flips. And then we have hours and hours of time, without any impact, in which we can optimize the queries. So during the process we have backend developers on call, and we call them in if we find a performance problem, and we can optimize the queries long before we do the full switchover. If we are satisfied, we move all of the read load to the target cluster. So all of the writes still happen on the source cluster and then propagate via logical replication to the target cluster, while all the read load already goes to the target cluster. And we can monitor the performance.

If it looks great, we run our full end-to-end QA tests, and they are quite significant: they basically use all of the features that GitLab has — they create issues and so on, basically calling a lot of the functions, including write operations — and they measure not only that it works, but also the performance. And that's really tricky, because with this construct, where all the data is written to the source cluster and then replicated to the target cluster, the latency increases. We even needed to make our QA tests a bit more resilient so that they still work. But now that's a fairly great method to test the complete functionality and also to get insight into the performance, because we can compare again.

If we are satisfied with the performance and all tests are successful, we can do the full switchover. Now we break the replication from source to target, and we load-balance all of the write queries to the target cluster as well. This part is basically what we already had two years ago. But now comes the important upgrade: now we don't have to pray, because we still have a way to roll back should performance degrade. We keep that running for the whole of Monday — because during Sunday, as mentioned, we don't have too much load, so if there were edge cases that only happen during peak hours, we might not find them on Sunday. So on Monday, when the load starts, we first see a peak during European work hours: when Europeans start work, we see the load rising. Then we see when the US east coast starts to work, and then when the US west coast starts working. And that is the critical phase: the US west coast starts to work, people on the east coast still work, people in Europe still work — that's normally our peak hour of the day.
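Keeping that rollback option alive depends on the old cluster not falling behind. A monitoring check in that spirit — not our actual tooling — could look roughly like this, run against the new primary, which is now the publishing side; the host name and threshold are invented.

```python
"""Sketch: check how far the old cluster (now a logical replication
subscriber) lags behind the new primary, so a rollback stays viable.
Connection details and the threshold are placeholders."""
import psycopg2

conn = psycopg2.connect("host=new-primary dbname=appdb")
with conn.cursor() as cur:
    cur.execute("""
        SELECT application_name,
               pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes,
               replay_lag
        FROM pg_stat_replication;
    """)
    for name, lag_bytes, replay_lag in cur.fetchall():
        print(f"{name}: {lag_bytes or 0} bytes behind, replay lag {replay_lag}")
        # Flag it if the old cluster falls too far behind to be a realistic
        # rollback target (512 MiB is an arbitrary example threshold).
        if lag_bytes and lag_bytes > 512 * 1024 * 1024:
            print("  -> lag above 512 MiB, investigate before peak hours")
```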
And during those peak hours we really have to monitor that everything is optimal, or optimize on the fly. Yeah, and we run that for the whole of Monday. Logical replication is still needed for that, so we can keep the old cluster in sync. On Tuesday morning, European time, if everything worked as expected, we stop the replication and we remove the old cluster. Now we are fully on 17, and because we stopped the logical replication, DDL is possible again.

So, I rushed quite a bit through that, so I really hope you have some questions for me. If you want to deep-dive into it, we have here an overview of our database infrastructure. Underneath — I didn't mention that before — all of the actions you have seen are not executed manually. We have everything in Ansible playbooks. We do that multiple times a year; it would be super tedious to do it manually, and it would introduce new human error every time. So every single step ends up in playbooks, in the repo you can see there. And also, for us, a fairly large company from my point of view, I have to coordinate this with a lot of people. I have to coordinate with delivery; some of our larger customers would like to be informed of such things; so I have to coordinate with a lot of different stakeholders in the company. And I also have to set up schedules for on-call. We at GitLab use GitLab to organize ourselves, so we use GitLab issues and epics to organize such upgrades, and we have a large issue template. When we do a new upgrade, we create a new issue based on this template, and it has all the checklists: whom to inform, which merge requests to create in order to change the roles, and basically all the steps are in there. This thing is fairly large, and it will most likely not work for your organization as-is, but if you have to organize such an upgrade, it could be really interesting for you to go through it.

Also, I oversimplified a few steps to fit it into the time. If you are more interested in, for example, the actual caveats of logical replication — there are a few things about why sequences are a problem — I have a recording for that there. The slides are already updated and uploaded, and I have two versions: one is the one you are seeing right now, and one is an extended version with roughly double the number of slides, with additional explanations for you. Yeah. And now I really hope you have questions. You can approach me during the event — I'm sometimes at the PostgreSQL booth...
...sometimes at the GitLab booth, or running around. You can also write me something, or you can ask right now, because apparently we have some time for that. Yes. Thank you. Who is first? Somewhere right there.

Okay, I have the following question. You said you are going to 17, so basically you are staying one version behind the latest Postgres, as far as I understand. But during those periods when you do the upgrade, do you also try to see whether you could upgrade to the newest version — not in order to upgrade, just to see whether you would have any problems with that?

Okay, let me summarize the question. You mentioned that we are going to PostgreSQL 17, which is one version behind the current stable, and you ask whether we have considered going to the latest version. Okay, awesome. In general, we have the policy that we want to move to a new PostgreSQL version before the next version comes out — that's the ideal — but we would not want to go to a version before it's older than like half a year. I really like every new feature, but I don't have the capacity to find all the cool new bugs in production, so I'm really happy if somebody else finds them first. So we have a certain mandatory delay before we go to a new version. For PostgreSQL 18 it's a bit different, because PostgreSQL 18 has optimizations that we would really love to have. We have a problem called lightweight lock contention: PostgreSQL has explicit locks, which are used in queries — you can say LOCK TABLE or something — but there's also an internal construct called lightweight locks, and that's biting us in our peak hours sometimes. PostgreSQL 18 comes with some optimizations there, so I would really like to go to 18. So we might make an out-of-band upgrade and are basically starting to plan for that right now. Awesome, thank you for the question.

Hi. Did I understand correctly that the period during which you cannot perform DDL is three days? And if so, is that not a problem for your change processes?

Sorry, I couldn't acoustically catch that.

Did I understand correctly that the period in which you cannot perform DDL is three days? And if so, is that not a problem for your change processes?

Yes, that's correct.
The time frame in which we can't perform DDL is from Saturday to Tuesday morning; that's the phase. And that's something we chose — we could make it shorter, but then we wouldn't have a quick rollback option anymore. Before, we only had it until Sunday evening, but then we couldn't roll back on Monday. So for us the trade-off was: we want a larger rollback window, and therefore we are fine with having the DDL freeze for longer. And it's not a large problem. There are basically two things that can't happen. We can't deploy new GitLab versions that would change the schema — and not every GitLab minor update changes the schema, so that's not a huge problem. And we can't do background re-indexing, which is also not a problem; we just pause it for that time. And, in the same category, we can't do background partitioning — we have fairly large tables which get partitioned in the background, creating new sub-tables — and that's also not a problem: we pre-create them for like a month or so, so in this case we could pause DDL for months without any problem. Awesome, thank you for the question.

I have a question related to the rebuilding of the indexes and the corruption you mentioned that is maybe possible after the glibc upgrade. Did it actually happen to you — do you have that experience, or is it purely theoretical? And in your steps, you upgraded the database first, then enabled logical replication, and then rebuilt the indexes. So I wonder whether this data corruption could happen, or whether the order should be different — like first rebuilding the indexes and then enabling logical replication.

Okay, awesome. The first question was regarding index corruption: did we have index corruption? And the answer is: during our tests, yes. For the current upgrade, going from Ubuntu 20.04 to 22.04, there were like 20 corrupted indexes. But it never materialized as a problem, because we found out during our test upgrade and we never went to production with that. We made the list of indexes, we optimized, and we recreated all of them before going to production. And the second question was regarding the sequence of events. So here, when we start out, when we create the test cluster, that's streaming replication, because it's the most efficient one.
And then we break it; we stop... — oh, that's the test cluster, that's irrelevant. Okay, here. We start out, we create the test cluster — no, we create the target cluster. That starts out with streaming replication, because it's just the most efficient, and then we switch it to logical replication. During this phase — I mean, already here — the indexes on the target cluster are corrupted. You couldn't send live traffic to this cluster, because you would get wrong answers; you couldn't switch over to it. But it does not matter, because it's not serving production: the application does not talk to it. So at this point in time it is corrupted, but the application does not talk to it. Is that an answer to your question? Awesome. Thank you.

Can you hear me? So, you mentioned briefly that you use the write-ahead log of PostgreSQL for managing the replication. Do you use any additional tool to handle that replication, and for managing the n-plus-one server nodes you use, perhaps when handling a failover? Or just the write-ahead log by itself?

The question is: do we use the write-ahead log for logical replication, and do we use any additional tools — something like PgBouncer, for instance? Okay. First of all, to clear up a misconception: we're not using the write-ahead log for logical replication directly — the write-ahead log basically contains the data changes themselves. But you had an additional question. Okay, awesome. So: the write-ahead log is not used for logical replication directly; for the replication itself, we use the PostgreSQL built-in feature of logical replication. But we have a lot of additional tooling. For example, we have connection pooling: we use PgBouncer for connection pooling, and we have a fleet of PgBouncers because it's CPU-bound. And for managing our high availability we use Patroni. Patroni is a PostgreSQL high-availability toolkit, and it's also controlling the PgBouncers. There's a centralized single source of truth, in our case Consul.
And when we do the switchover, we basically just say: hey, Patroni, switch over to this host. And Patroni tells PgBouncer: hey, pause connections here, and redirects them to the other ones. Awesome. If you want to take a look at our Ansible, you'll see where we integrate, where we hook into those systems. Awesome, thank you.

Do we have a mic here? Hi — oh, you have new people, sorry. Thank you for the talk. I was really surprised to hear about that collation issue; I hadn't heard of it before. I wondered whether Postgres had ever considered dealing with collation itself somehow, to make this whole upgrade process easier. Has that ever been considered?

Yeah, definitely. First of all, I'm quite lucky that I have known about the collation issue for quite a long time, because some years ago I worked for a PostgreSQL consulting company, and the collation for the German umlauts changed once before, so we had a lot of German customers who had this problem many years ago. And now it's biting more people, because the collation for different characters was changed. And yeah, in newer PostgreSQL versions there are different methods. Back in the day, you were bound to use the system-wide collation, but now you can use different collations. There are ICU collations, where you can say: hey, PostgreSQL, ignore the system collation, use this one. But that also comes with caveats — to sum it up in one sentence. Okay. Yeah. Thank you. The default is that it uses the collation of your glibc, your system C library, and that's where the problem comes from: if you're doing an operating system upgrade and anything changes in glibc, you are about to find interesting behavior.

Before the switchover, how did you distinguish the read and the write queries?

Could you hold it a bit closer to your mouth? — Before the switchover, you had the read and write queries. How did you distinguish which are read and which are write queries?

Awesome question. So, before the switchover, we first switch over the read queries and then the write queries — how do we distinguish them? That's, unfortunately for most of you, a built-in GitLab feature. In GitLab, the application — our Ruby on Rails stack — knows that certain queries will read and write and certain queries will only read.
So it connects to two pools: one read-write pool, which only talks to the primary, and one read pool, where we have a lot of standbys and the Ruby application talks to all the standbys. And Ruby on Rails decides where to send each query. Okay.

Hi, here. Can you explain a bit more how the upgrade worked — the distribution upgrade, actually? I didn't quite catch that. For the test you said you cloned it; so I guess, do you have compute and storage separate, so you could just clone it onto 22.04? Or did you do a distribution upgrade? Or did you just provision empty VMs and then re-clone or re-sync everything?

Okay — how do we get to the target cluster? We create new virtual machines with Ubuntu 22.04, and we take snapshots from production — storage snapshots, sorry — and create the new machines from those production snapshots. We run on GCP, so we take a storage snapshot and then create machines from the snapshots, which, with our currently 48-terabyte disks, takes like half an hour — but it's way faster than restoring a backup with pg_basebackup.

I have two questions. One is: how much effort did you put into building this automation?

The first iteration of it was more than half a year, and we have been working on it since 2022 or something — so we iterate over it over the years. It's a lot of work. Because if you get it wrong, there are certain error cases here that bring downtime, and there are error cases that would bring data corruption. So you have to test it really thoroughly, and if we do a test on a copy of our production, it takes multiple days. So yeah, there's a lot of effort in that.

And since it's years — how big of a team do you have?

Currently it's five people, and the majority of our work last year went into that. But we have some fluctuation in the team, so if it had been a standing team, maybe we would have had a bit more headroom for other things. But it's multiple person-years, if you have a really large system like we have.

And the second part of the question is: would this be reusable outside?

Yes. I presented a very oversimplified architecture, and this architecture you can use for a lot of different use cases. Our automation is made for us. It's open source — you can clone the repo, you can look into it — but you most likely would not be able to use our Ansible playbooks one-to-one.
But the concept, for sure. But to repeat that: if you don't need zero downtime — if you can take an hour or so of downtime — I would go for one of the simpler approaches and optimize that to get minimal downtime, and only take on this endeavor if you have a hard requirement for zero. Thank you. Here we go, I've got a question over here.

All right. So before, you mentioned that most of the services of GitLab run on Kubernetes, but Redis was set off to the side. And I wanted to ask what exactly Redis is used for — is it mostly for caching and job processing, or something else? What are the use cases for Redis?

Okay. First, a disclaimer: my focus is PostgreSQL. To my understanding, we use Redis for application caching — Ruby on Rails caches stuff in Redis, so it doesn't have to query the database for everything, it just asks Redis. And for full disclosure, I didn't talk too much about Gitaly. Gitaly is our backend for storing the actual Git data. That's also not in Kubernetes at the moment; it can run there, and we have some use cases for that, but for GitLab.com it still runs on virtual machines as well.

Okay, last question here. What is the procedure for a minor version upgrade — is it the same?

No. For minor version upgrades, the data format does not change. So for minor version upgrades we just create new standbys with the new minor version, put them into the load balancer, and then fade out the old ones. And when only the primary is left on the old version, we have to do a switchover — which, as we have seen, we can do without noticeable user impact — then we switch over and decommission the old primary. So it's a bit tedious, because we have to create all these new nodes, put them into the load balancer, take the old ones out, but it's something we do just during normal working hours, with only a fraction of this preparation effort here.
We've got one more question here, time for it.

Hi, thanks for the talk. A small question: I assume that GCP has one form or another of managed Postgres. What were your considerations for not using that?

Okay, the question is about GCP's own offering for managed Postgres here. I hope I don't step on somebody's toes, but the GCP offering is like an 80% offering — made to catch a lot of people who just want to have a PostgreSQL instance — but for our scale it's unfortunately not feasible. I guess that's a good summary. Thank you.

Thank you.