this post was submitted on 06 Jan 2025
277 points (99.3% liked)


Hey everyone, and happy new year!

Sorry about that super long downtime there. Yesterday (Sunday) morning at 10:03am PST our server suffered a physical hardware failure, apparently a power supply failure. Unfortunately, despite us opening a ticket with our hosting vendor (OVH) a few minutes later, and despite their claim of 24/7 support, nobody looked at our ticket until this morning, when their phone support lines opened and I called them.

They've now replaced a defective power supply and we're back online, after ~26 hours of being offline. Some pretty disappointing response times, to put it nicely.

We're planning to move away from OVH at the end of this month, onto proper enterprise-grade hardware that we own and control. This will give us a HUGE boost in server resources and allow us to scale for the foreseeable future, while also giving us the control to resolve problems like this much quicker. Expect another follow-up post about this in the next couple of weeks, once I've put together the migration plan.

Timeline:

  • Jan 5th 10:03am PST - We get alerts to the server being non-responsive.
  • Jan 5th 10:05am PST - I pull up the console via IPMI and it's completely non-responsive. Attempting to power the server off / on, or do anything else, does not work.
  • Jan 5th 10:15am PST - Initial support ticket created with OVH. I followed up a couple times over the next few hours, and got no response.
  • Jan 6th 6:32am PST - Called OVH, gave them the case number and asked them to investigate
  • Jan 6th 7:34am PST - I get notified they'll start their "intervention" in 15 minutes.
  • Jan 6th 11:04am PST - Call them again, the tech is still working on it and they'll get back to me with an update
  • Jan 6th 11:34am PST - "I was informed by our data centre technician that there is an issue with the power supply unit for the rack on which your server resides. Your server will come back online once they have replaced the power supply."
  • Jan 6th 12:17pm PST - We're back up finally!
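
For anyone who wants to check the "~26 hours" figure, here is a minimal Python sketch that works out the gaps from the timestamps in the timeline above. The year (2025) is taken from the post date, all times are treated as PST, and the variable and function names are purely illustrative.

```python
from datetime import datetime

# Timestamps copied from the timeline above. Everything is in PST,
# so plain (naive) datetimes are enough for the subtraction.
alert = datetime(2025, 1, 5, 10, 3)        # first alerts
ticket = datetime(2025, 1, 5, 10, 15)      # support ticket created
phone_call = datetime(2025, 1, 6, 6, 32)   # called OVH
back_up = datetime(2025, 1, 6, 12, 17)     # server back online


def hours_minutes(delta):
    """Split a timedelta into whole hours and leftover minutes."""
    total_minutes = int(delta.total_seconds()) // 60
    return divmod(total_minutes, 60)


outage = hours_minutes(back_up - alert)
ticket_wait = hours_minutes(phone_call - ticket)

print(f"Total outage: {outage[0]}h {outage[1]}m")                        # 26h 14m
print(f"Ticket open before the phone call: {ticket_wait[0]}h {ticket_wait[1]}m")  # 20h 17m
```

So the ticket sat for a bit over 20 hours before the phone call, and the whole outage ran a little over 26 hours.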

Edit on Jan 7th @ 8:40am PST: We just had another outage of about an hour. Investigating with OVH.

[–] saigot@lemmy.ca 95 points 3 days ago (1 children)

Lemmy.ca goes down and the PM resigns just a few hours later, obviously it's all a conspiracy to bury the news /s

[–] wise_pancake@lemmy.ca 6 points 2 days ago

I think this is causation. Trudeau must be an avid Lemmy.ca user and the stress of this instance being down pushed him over the edge.

[–] Rusty@lemmy.ca 68 points 3 days ago
[–] mp3@lemmy.ca 69 points 3 days ago (1 children)

Thank you for handling the issue :)

I can't count how many times I opened my Lemmy.ca bookmark out of habit during the outage 😬

[–] OutlierBlue@lemmy.ca 16 points 3 days ago (1 children)

I kept thinking "I wonder how the Lemmy.ca outage is doing. I should get on and find ou-- oh yeah"

[–] saigot@lemmy.ca 24 points 3 days ago* (last edited 3 days ago) (1 children)

In case you're unaware: https://status.lemmy.ca/ has the server status and updates on outages.

[–] OutlierBlue@lemmy.ca 13 points 3 days ago

Yeah I did a search for "is lemmy.ca down" and it listed that site. The updates were good, but I really wanted a board where everyone could complain and speculate.

[–] Skyline969@lemmy.ca 39 points 3 days ago* (last edited 3 days ago) (3 children)

Make sure you figure out what kind of service credit you’re eligible for due to their SLA snafu. Over 24 hours to replace a simple power supply… what a bunch of jokers. OVH has gone downhill quite a bit over the years sadly.

[–] Shadow@lemmy.ca 32 points 3 days ago

Yep, planning on it. It looks like we should qualify for one free month, which would be nice to make our migration a little easier.

[–] chargen@lemmy.ca 10 points 3 days ago (1 children)

Not the server power supply per the notes above, but some component that supplies power to the rack. Perhaps the PDU, or other distribution components? Depending on the rack, it could take a couple of hours to replace, but certainly the ticket should have been picked up far earlier for service!


Nice quality hardware they have. It seems they don't even have redundant power supplies, and no monitoring either. I’ll keep my distance from them.

[–] wise_pancake@lemmy.ca 6 points 2 days ago

Glad we're back up!

Thanks for doing what you do, I don't know what it takes to run an instance but you all do a great job!

[–] pedz@lemmy.ca 21 points 3 days ago

I was following on status.lemmy.ca and saw that your hosting provider was OVH so I knew it could be a while.

I worked there a few years ago, as a "Customer Advocate", and although their infra is usually nice and affordable, it also comes with the major downside of having practically no support, and being at their mercy for any physical issue.

I remember trying to help people that didn't know about 'screen' and being told by my boss that I was there to help people with billing issues, not to do tech support. I could help with what was offered by OVH, like installing a new OS, but not tell people about screen. Every day we had reports with how many minutes we spent in the bathroom. Once I sent a meme to a co-worker and was warned that I could be fired for doing it again. It's one of the most horrible places I've worked; both for the customers and for the employees.

I still have a dedicated server with them and it's been trouble free for many years. However I'm aware that as soon as anything happens, I'll probably need to move my stuff somewhere else if time is of the essence.

[–] cybermass@lemmy.ca 7 points 3 days ago

Thanks for the update, I was wondering what was going on.

Glad to hear you're getting your own hardware now. I offered before but again if you need any help I can offer up some free labour on the weekends. I am a net admin so if you need any help with networking or security give me a shout.

[–] cheeseburger@lemmy.ca 16 points 3 days ago

Thank you for all your hard work and top notch comms, Lemmy.ca admin peeps. I spent the day at dbzer0, which is always fun, but happy to be back home!

[–] Adderbox76@lemmy.ca 21 points 3 days ago

I didn't know what to do with my hours at work. I was almost at the point of actually...working...gross.

[–] krnl386@lemmy.ca 20 points 3 days ago (1 children)

Wow, that’s pretty shitty service, considering that renting a physical server is likely not a small client thing. I’m also floored that a rack PDU failure was not detected and acted upon more proactively by their datacenter operations team, and necessitated a ticket to be opened by you. OVH really does seem to be the Temu of colo/hosting providers. Yikes!

[–] Shadow@lemmy.ca 17 points 3 days ago* (last edited 3 days ago)

They have a nice status page where you can see how many servers are down in a rack, I think that was a mis-comm between the tech + support since it only showed 1-2 down. No way a whole rack pdu failure should be down for that long.

[–] Rentlar@lemmy.ca 15 points 3 days ago

Ahhh been wanting to comment on the Troodoe shitstorm all day today. Super helpful that http://status.lemmy.ca/ kept us in the know.

Curious what the monthly price difference is going to be, before and after moving away from ovh.

[–] IronKrill@lemmy.ca 14 points 3 days ago

Thanks for all your hard work!

[–] AceTKen@lemmy.ca 17 points 3 days ago (1 children)

Pretty piss-poor "24 hr. service" from OVH. We're an MSP and will make sure to steer clear of them for future projects.

Glad to see everything back however!

[–] adespoton@lemmy.ca 10 points 3 days ago (1 children)

Took them 24 hours to respond and 7 minutes to fix. Sounds like 24/7 service to me?

[–] AceTKen@lemmy.ca 6 points 3 days ago (1 children)

Shit. I didn't read your comment as a joke initially and almost had a mini-rant about how we know if a server 3 provinces away is down for 5 minutes and go into a red alert.

That was dumb of me. Carry on.

[–] adespoton@lemmy.ca 2 points 2 days ago (1 children)

Sorry, that humour is from my 3 decades of asking people “what do you mean it failed? WHAT WEREN’T WE LOGGING THAT IT COULD HAVE SAT IN THAT STATE FOR 24 HOURS WITH NOBODY TELLING ME????”

Unfortunately I’ve been at my current company long enough that I start work with hundreds of improperly tuned notifications now.

It’s enough to have left me a bit fatalistic about the many ways monitoring can be screwed up both through good intentions and through ignorance and inattention.

[–] AceTKen@lemmy.ca 1 points 2 days ago

Totally get that. It's what made me spin up my company instead of staying a tech lead at massive MSPs. They don't give a fuck about procedures and properly taking care of systems, they care about making sure the client signs the next contract. That is all.

We do a full onboarding at every client and make sure every profile on every desktop and every server, switch, printer, and router are fully updated and up to spec. Everything is fucking perfect. It's why we only sit on (checks RMM) 12 tickets for over 1000 seats at any given time. NOTHING waits and we know everything inside and out. Some clients pay for perfectionism and adore it.

So yeah. Long-winded way of saying I totally appreciated the humour once I pulled my head out of my ass and in fact realized that it was humour.

[–] Mereo@lemmy.ca 17 points 3 days ago (1 children)

Welcome back! I was looking into OVH but after this outage... Naaah.

[–] corsicanguppy@lemmy.ca 10 points 3 days ago
[–] Jerry@feddit.online 14 points 3 days ago

Oh my. I do feel your pain. My Friendica instance was down at the same time as your Lemmy instance, for 2 days as well, except my issue was a corrupted database engine and corrupted database tables. Kindred spirits.

I'm glad you got it resolved!

[–] rbos@lemmy.ca 14 points 3 days ago

[Opens lemmy the 20th time today] Ahhhhhhhhhhh, that's the stuff. I was entirely too productive!

[–] GlassHalfHopeful@lemmy.ca 13 points 3 days ago

Well, thanks to this outage, I got a pretty good understanding of how often I turn on my Lemmy app. It turns out to be a whole heck of a lot. Haha.

Glad things are back up and running and that you all have a path forward to mitigate this in the future.

[–] Luci@lemmy.ca 13 points 3 days ago

Welcome back! OVH sucks, really happy to hear you're moving away from them!

[–] Sunshine@lemmy.ca 13 points 3 days ago

We’re back home!

[–] BlemboTheThird@lemmy.ca 11 points 3 days ago

I went back to using my .ee account during the downtime and had to see... shudder... Hexbear posts... I guess it was nice to see the clownshow hasn't changed at all

[–] jerkface@lemmy.ca 12 points 3 days ago
[–] phoenixz@lemmy.ca 10 points 3 days ago (2 children)

Any hosting you can recommend?

I need enterprise level hosting guaranteed in Canada and currently, our servers are on Rogers.

To say that Rogers is the fucking worst is an understatement as at this point I'm seriously considering the word "sabotage"

That does not include the fact that they're extremely overcharging for very low hardware specs, their support staff literally doesn't know nor care, even though we're paying through the nose, and their "enterprise level" support literally means that they might look at your ticket within 5 business days and their SLA is "fuck you"

So with that in mind, and previous great experiences with ovh, I got myself a nice big fat OVH server for testing which is single-handedly flying circles around 10 of our Rogers servers for a fraction of the cost of those servers at Rogers and so far it's been awesome.

Not having a ticket looked at for 24 hours, however, is not possible for us, we need 24 support...

Anything you might be able to recommend?

[–] Shadow@lemmy.ca 9 points 3 days ago* (last edited 3 days ago)

Unfortunately no recommendations unless you want colo space to run your own infra in. I'd avoid any large corp like Rogers or Telus.

I'm sure ovh is fine if you have multiple servers and roll your own redundancy. You can also pay them for premium support.

[–] RandAlThor@lemmy.ca 3 points 2 days ago

Lemmy went down again! WTH is going on?

[–] brianpeiris@lemmy.ca 7 points 3 days ago

Thanks for staying on top of it. Glad to hear you're moving to a better situation!

[–] veeesix@lemmy.ca 9 points 3 days ago

I was anxiously following the updates on status.lemmy.ca. Glad you guys are back online!

[–] ZC3rr0r@lemmy.ca 8 points 3 days ago

I am simultaneously happy to see you're all back up and running again and mourning the loss of the sudden increase in productivity I had while you were down.

[–] hefty4871@lemmy.ca 9 points 3 days ago

Phew! Welcome back!

[–] Camus@lemmy.ca 8 points 3 days ago

Good to see we are back

[–] JohnnyCanuck@lemmy.ca 9 points 3 days ago

Thanks for getting it back and all the work to keep it running!

[–] roserose56@lemmy.ca 9 points 3 days ago

We are finally back!!! Thank you guys!!

[–] LolaCat@lemmy.ca 9 points 3 days ago
[–] Album@lemmy.ca 8 points 3 days ago

Thank you! Slow response from OVH - you'd think they'd have monitoring for that.

[–] avidamoeba@lemmy.ca 8 points 3 days ago (1 children)

Happy new year!

They’ve now replaced a defective power supply and we’re back online, after ~26 hours of being offline.

It was a nice break from the firehose.

[–] gianni@lemmy.ca 7 points 3 days ago (2 children)

I’m curious what the new hosting setup will look like?

[–] OutlierBlue@lemmy.ca 13 points 3 days ago

What would it look like?

[–] Shadow@lemmy.ca 12 points 3 days ago

Expect a post with details sometime in the next few weeks, once I stop procrastinating and start deploying things =)
