Oh sorry, I should have mentioned that they host it on Proxmox and I have access to view the dashboard. I can see the resource usage you mentioned, including history.

I have access to and full control over the Proxmox container, but I don't have any specific monitoring beyond the logging.

Unfortunately neither the resource usage nor the logs have given away anything. Resource usage is often all over the place. CPU spikes are common and always have been, and whenever there is downtime it's followed by a resource spike as federation catches up. Plus, federation is pretty random, especially when kbin fires a bunch of stuff at lemmy.world and messes everyone up.

Over the course of today I've done a lot of log reading, and I have identified one possible problem and made a tweak tonight. Time will tell if it helps.

Today was also particularly rocky, as the host had various spots of downtime mixed in with Lemmy itself being down at times. I'll keep monitoring tomorrow and see if it's better.

BalpeenHammer 3 points 6 months ago (3 children)

If the federation is done by a helper app then it may be possible to throttle it. At least it wouldn't choke out the machine and slow down other running processes.

Dave 2 points 6 months ago (2 children)

The federation doesn't generally seem to be a problem, but many of the large instances do run inbound federation in a separate container.

My problem has been that I haven't managed to narrow it down to a component, so splitting out the containers may not help me troubleshoot. It's definitely on my list of things to try though, if I don't manage to narrow it down.

We've now had a run of 5 hours with no outages! So we are doing much better than yesterday. But I suspect that's just the host having solved their issues.

BalpeenHammer 3 points 6 months ago (1 children)

Maybe the host was the problem in the first place!

Dave 2 points 6 months ago

To my knowledge, the host issues were only yesterday. The problems over the week or more before that were probably not caused by the host. But now that those are solved, I can get back to trying to solve the original issue 😅