During a routine restart of our production environment, the servers did not come back up as expected. The operations team investigated and found that one of the required resources did not respond during the restart sequence. We had recently moved that resource behind a firewall, so the servers were unable to reach the service, and the failover mechanism could not continue without it.
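For illustration only, here is a minimal sketch of the kind of pre-flight reachability check that could surface this failure mode early instead of letting the restart sequence hang. The address, port, and timeout are placeholders and assumptions, not our actual configuration or implementation.

```go
package main

import (
	"fmt"
	"net"
	"os"
	"time"
)

// checkDependency attempts a TCP connection to a required service and
// returns an error quickly instead of blocking the restart sequence.
func checkDependency(addr string, timeout time.Duration) error {
	conn, err := net.DialTimeout("tcp", addr, timeout)
	if err != nil {
		return fmt.Errorf("dependency %s unreachable: %w", addr, err)
	}
	conn.Close()
	return nil
}

func main() {
	// "resource.internal:443" is a hypothetical address, not the real service.
	if err := checkDependency("resource.internal:443", 5*time.Second); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1) // fail fast so the problem is visible before failover begins
	}
	fmt.Println("all required dependencies reachable; continuing restart")
}
```

A check like this, run at the start of the restart sequence, would have reported the firewall misconfiguration immediately rather than leaving the failover mechanism stuck waiting on an unreachable resource.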
A temporary fix was made to get the system back online, and a permanent fix was deployed the next day to prevent a recurrence. We have identified several areas where we can improve to reduce the risk and impact of similar situations in the future, and we will continue to improve our service going forward.
We were down for about 32 minutes, from 18:40 UTC to 19:12 UTC.