I agree with you. If it helps, here’s roughly what the whole company is doing right now:
- Product engineering is entirely focused on reliability and communications
- Support is doing ongoing support work, and pitching in extra to help us brute force customer communications
- Infra ops is entirely focused on reliability, incident management, and adding more people
- Framework teams are still working on Frameworks
Nobody is working on new features. Reliability work sometimes manifests as features (like the status page), but those still contribute to reliability. We’re not working on anything except stuff that makes apps on our infrastructure more resilient, and stuff that helps us communicate “are we broken or is your app broken?” to y’all.
This week’s outages have been pretty specialized Nomad/Consul/Raft issues that not everyone can address. We have managed to add folks to help out and get them ramped up quicker than I expected, which is helpful. Still, we’re in an architectural hole that we can’t “ops” our way out of. Fingers crossed we can get out from under this stuff any day now.