On November 8th, 2023 at approximately 09:47 UTC, SMC suffered a complete outage. The outage took down all services hosted on SMC, as well as the SMC Management Engine and the SMC dashboard.
The incident lasted 38 minutes, after which it resolved automatically and all services were restored. This is SMC's first outage event of 2023.
SMC uses several tactics to ensure uptime, including load balancing and failover. This service is currently provided by Cloudflare at the DNS level. Cloudflare sends health check requests to SMC servers at specified intervals. If it detects that one of the servers is down, it removes that server's A record from entry.nws.nickorlow.com (the domain to which all services on SMC direct their traffic via a CNAME).
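The failover behavior described above can be sketched roughly as follows. This is a minimal illustration, not Cloudflare's implementation: the `Server` type, the `published_a_records` function, the server names, and the IP addresses (taken from the RFC 5737 documentation ranges) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Server:
    name: str                 # hypothetical label, for illustration only
    ip: str                   # documentation-range address, not a real SMC IP
    in_rotation: bool = True  # False while a server is under maintenance


def published_a_records(
    servers: Iterable[Server],
    is_healthy: Callable[[Server], bool],
) -> List[str]:
    """Return the IPs that would stay published as A records.

    A server keeps its record only if it is in rotation and its most
    recent health check succeeded. `is_healthy` stands in for the
    provider's health check result.
    """
    return [s.ip for s in servers if s.in_rotation and is_healthy(s)]


# Hypothetical pool: two active servers and one under maintenance.
POOL = [
    Server("server-a", "192.0.2.10"),
    Server("server-b", "198.51.100.20"),
    Server("server-c", "203.0.113.30", in_rotation=False),
]
```

Under this model, if every server that is in rotation fails its health check, the function returns an empty list, meaning the entry domain has no A records left and no request can be routed.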
At around 09:47 UTC, Cloudflare detected that our servers in Texas (Austin and Hill Country) were down. It did not receive an error response, but rather an HTTP timeout, which indicates that the servers may have lost network connectivity. When Cloudflare detected that the servers were down, it removed their A records from the entry.nws.nickorlow.com domain. Because SMC's Pennsylvania servers have been undergoing maintenance since August 2023, this left no servers able to serve requests routed to entry.nws.nickorlow.com, resulting in the outage.
SMC uses UptimeRobot to monitor the uptime of SMC servers and of the services hosted on them. This is the source of the statistics shown on the SMC status page.
UptimeRobot did not detect either of the Texas SMC servers as being offline at any point during the outage. This is unusual, because it means UptimeRobot and Cloudflare disagreed about the status of the same servers. Logs on the SMC servers showed that requests from UptimeRobot were being served throughout, while no requests from Cloudflare appeared in the logs.
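A log comparison of this kind could be done along the following lines. This is a hedged sketch only: the function name and the marker substrings are assumptions for illustration, and the actual User-Agent strings sent by each monitor should be confirmed against real log entries before relying on them.

```python
from typing import Dict, Iterable, Mapping

# Assumed substrings identifying each monitor's requests in an access log.
# These are illustrative; verify them against the real User-Agent values.
MONITOR_MARKERS = {
    "uptimerobot": "UptimeRobot",
    "cloudflare": "Cloudflare-Healthchecks",
}


def count_monitor_requests(
    log_lines: Iterable[str],
    markers: Mapping[str, str] = MONITOR_MARKERS,
) -> Dict[str, int]:
    """Count how many log lines mention each monitor's marker substring."""
    counts = {name: 0 for name in markers}
    for line in log_lines:
        for name, marker in markers.items():
            if marker in line:
                counts[name] += 1
    return counts
```

Run over the log lines from the outage window, a non-zero count for one monitor alongside a zero count for the other would match the observation above: one monitor's requests reached the server while the other's never arrived.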
No firewall rules existed that could have blocked the health check traffic from Cloudflare on either of the SMC servers, and no other configuration was found that would have blocked these requests. Because these servers are on different networks, in different buildings, in different parts of Texas, their networking equipment is entirely separate. This rules out a failure of networking equipment owned by SMC. We therefore believe the issue may have been caused by an internet traffic anomaly, although we are currently unable to confirm this.
We are actively investigating to find a more concrete root cause. This postmortem will be updated if any new information is found.
A similar event occurred on November 12th, 2023, lasting 2 seconds.
The two affected servers have two things in common: both use Spectrum as their ISP, and both are located near Austin, Texas. The Pennsylvania server maintenance will be expedited so that we have servers online that share neither of these factors.
SMC will also investigate other methods of failover and load balancing.
Last updated on November 16th, 2023