Outage - Oct 14/15 (programming.dev)

Over the weekend we had a large intermittent outage, followed up by unplanned maintenance that I had put off for way too long.

Lemmy runs with several different services.

  • lemmy-ui (the reactesque frontend)
  • lemmy (the rust backend)
  • postgres (the data store for operations, comments, posts, etc)
  • pictrs (the image data store)

The outage was caused by the last one. We always knew we'd eventually need to migrate to an object-based store, but Lemmy defaults to file-based picture storage and that's what we stuck with up until now. The stored images eventually filled the disk of the VPS that programming.dev runs on, causing it to seize up and resulting in the outage over the weekend.

Saturday night I spent several hours testing the object migration on the beta.programming.dev site to validate that it worked. During this time I struggled with some very obscure ansible errors that I hadn't encountered before, so I was not able to start the migration that night. I delayed until the next morning (thank goodness).

I began work Sunday morning at 10:00 America/Denver time. Initially the migration started off quite well, but it was moving incredibly slowly. Looking back on it now, the migration would have taken over 144 hours if I had left it to do its thing. I let it run for about an hour before messaging the pictrs dev to understand why logs weren't showing up for the migration (even though objects were showing up in the store). Apparently lemmy-ansible is set to use 0.4.0 of pictrs, which is not only quite old but also lacks the ability to run migrations concurrently. That was the problem. I asked the dev if it was possible to stop a migration partway through, upgrade, and continue. They told me what changes I'd need to make, I made them, did the upgrade, and restarted the migration. It immediately failed. This was the start of my issues.
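For anyone wanting to sanity-check a figure like the 144 hours above, here is a minimal sketch of the arithmetic, assuming the migration proceeds at a constant rate. The function name and the object counts are made up for illustration; they are not the real numbers from this migration.

```python
def estimate_total_hours(done: int, total: int, elapsed_hours: float) -> float:
    """Project the total migration time from progress so far.

    Assumes a constant migration rate, which is only a rough approximation.
    """
    if done <= 0 or elapsed_hours <= 0:
        raise ValueError("need some progress and elapsed time to estimate a rate")
    rate = done / elapsed_hours  # objects migrated per hour
    return total / rate


if __name__ == "__main__":
    # Hypothetical counts: 10,000 of 1,440,000 objects moved in the first hour.
    print(estimate_total_hours(done=10_000, total=1_440_000, elapsed_hours=1.0))
```

Counting the objects that have landed in the store after the first few minutes and running this kind of projection is a cheap way to spot a migration that will not finish in a reasonable window.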

The server was now too full of data to do anything, including running apt update or apt install to install tools to assist me. I was able to attach more block storage, but I'm not enough of a Linux guru to figure out how to mount it where the current pictrs filesystem would be able to take advantage of it. I had to resort to copying the entire pictrs filesystem to a fresh ~500 GB mount, fixing permissions, and then rerunning the migration from there. By the time I got to this point, it was about 12:30 PM. The migration from then on took several hours.

After the migration completed, I needed to deploy the new stack with the correct settings. The ansible script needed to run apt though, and, well, that wouldn't work while the server was still full. At this point I was not confident in the migration, and I also hadn't realized that you could do the migration while the site was running (big oversight on my part). I therefore wanted to keep the entire pictrs file store until I had proved the object store was working. I created another block storage volume, copied the entire pictrs directory over to it again (another 20 minutes or so) and then deleted the original directory. I was now able to run the ansible script and deploy the new settings for pictrs, confident that I had a backup available in case something went wrong (this is not the main backup method; the server is backed up externally as well, but I didn't want to have to resort to those backups during the migration).
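The copy-then-delete step is the risky part of that sequence, so here is a minimal sketch of how the copy could be checked before removing the original. It is not what was actually run during the outage; the function names are hypothetical, and comparing file names and sizes is a cheap check, not a full checksum.

```python
import shutil
from pathlib import Path
from typing import Dict


def tree_summary(root: Path) -> Dict[str, int]:
    """Map each file's path (relative to root) to its size in bytes."""
    return {
        str(p.relative_to(root)): p.stat().st_size
        for p in root.rglob("*")
        if p.is_file()
    }


def copy_and_verify(src: Path, dst: Path) -> bool:
    """Copy src to dst (dst must not exist yet), then compare both trees."""
    shutil.copytree(src, dst)
    return tree_summary(src) == tree_summary(dst)


# Hypothetical usage; both paths are placeholders:
# if copy_and_verify(Path("/srv/pictrs"), Path("/mnt/backup/pictrs")):
#     shutil.rmtree("/srv/pictrs")
```

Note that `shutil.copytree` preserves permission bits but not file ownership, so a separate ownership fix-up may still be needed after the copy.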

That completed the migration, some 5 hours after it originally started.

Several things exacerbated the issue and made it take several hours longer than I wanted:

  1. I waited so long to do the migration to object storage that the server was too full to even perform an apt update. This meant I couldn't install the tools I needed, along with a host of other issues as mentioned.
  2. pict-rs was at a very suboptimal version. If it had been just two minor versions newer, it would have migrated perfectly fine, in a few hours.
  3. My limited knowledge of ansible led me on wild goose chases several times.

Things I would change if I had to do it again:

  1. Dig in a bit deeper on the concurrency flag in the pictrs docs. It was not present in the original guide I followed (from a lemmy post on another instance), and thus I didn't realize that it wouldn't run with concurrency at all.
  2. Don't wait until the server is full.
  3. Migrate while the server is running. That would have been dumb in this case, since the server wouldn't stay up anyway, and could have caused other issues. But there was no reason to take the server down if it had been stable, and other instances have done so with no problems.
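
On the second point, a small check run from cron could have flagged the disk filling up well before apt stopped working. This is a minimal sketch, assuming an 80% warning threshold; the function name and the threshold are arbitrary choices for illustration, not something programming.dev actually runs.

```python
import shutil
from typing import Optional


def disk_alert(path: str = "/", warn_at: float = 0.80) -> Optional[str]:
    """Return a warning message if the filesystem holding path is too full."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    used_fraction = usage.used / usage.total
    if used_fraction >= warn_at:
        free_gb = usage.free / 1e9
        return f"{path} is {used_fraction:.0%} full ({free_gb:.1f} GB free)"
    return None


if __name__ == "__main__":
    message = disk_alert("/")
    if message:
        print(message)
```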
[-] towerful@programming.dev 15 points 1 year ago

The server was now too full of data to do anything,

This reminds me of something that I always mean to do but totally forget.
Allocate 1 GB of space for a blank/dummy file on every VM I run.
When you run into a VM that locks up due to disk space, delete (or resize) the file, get to work fixing the VM, then put the empty file back.
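
A minimal sketch of that ballast-file idea, assuming a plain filesystem without compression (on a compressing filesystem a file of zeros may reserve almost nothing, and random data would be needed instead). The function names and the default size are just for illustration.

```python
import os
from pathlib import Path


def create_ballast(path: Path, size_bytes: int = 1024**3) -> None:
    """Write size_bytes of zeros so the space is really allocated on disk."""
    chunk = b"\0" * (1024 * 1024)  # write 1 MiB at a time
    remaining = size_bytes
    with open(path, "wb") as f:
        while remaining > 0:
            n = min(len(chunk), remaining)
            f.write(chunk[:n])
            remaining -= n
        f.flush()
        os.fsync(f.fileno())  # make sure the data has reached the disk


def release_ballast(path: Path) -> None:
    """Delete the ballast file to free its space in an emergency."""
    path.unlink(missing_ok=True)
```

Writing real zeros instead of just seeking to the end avoids creating a sparse file, which would not actually reserve any space.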

[-] snowe@programming.dev 2 points 1 year ago

Oh that’s a good idea lol.

this post was submitted on 20 Oct 2023