<h1>NWS Incident Postmortem 11/08/2023</h1>
<p>
On November 8th, 2023 at approximately 09:47 UTC, NWS suffered
a complete outage. The outage took down all services hosted on
NWS, as well as the NWS Management Engine and the NWS dashboard.
</p>
<p>
The incident lasted 38 minutes, after which it resolved
automatically and all services were restored. This is NWS' first
outage event of 2023.
</p>
<h2>Cause</h2>
<p>
NWS uses several tactics to ensure uptime. One of these is load
balancing and failover, which Cloudflare currently provides at the
DNS level. Cloudflare sends health check requests to NWS servers at
specified intervals. If it detects that a server is down, it removes
that server's A record from entry.nws.nickorlow.com (the domain that
all services on NWS direct their traffic to via a CNAME).
</p>
<p>
At around 09:47 UTC, Cloudflare detected that our servers in
Texas (Austin and Hill Country) were down. The health checks did not
receive an error response; they hit an HTTP timeout, which indicates
that the servers may have lost network connectivity. Cloudflare then
removed the servers' A records from the entry.nws.nickorlow.com
domain. Since NWS Pennsylvania servers have been undergoing
maintenance since August 2023, this left no servers able to serve
requests routed to entry.nws.nickorlow.com, resulting in the outage.
</p>
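<p>
The failure mode described above can be illustrated with a minimal
Python sketch. This is not Cloudflare's implementation; the
<code>Server</code> type, the server labels, and the addresses
(taken from the RFC 5737 documentation range) are hypothetical and
only model the rule that a server keeps its A record while it is out
of maintenance and passing health checks.
</p>
<pre>
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Server:
    name: str              # hypothetical label, not a real hostname
    address: str           # RFC 5737 documentation address, not a real NWS IP
    in_maintenance: bool
    health_check_ok: bool  # result of the most recent health check


def active_a_records(servers: list[Server]) -> list[str]:
    """Return the addresses that would remain in the DNS pool.

    A server keeps its A record only if it is not in maintenance and
    its most recent health check succeeded.
    """
    return [
        s.address
        for s in servers
        if not s.in_maintenance and s.health_check_ok
    ]


# State during the incident as described in this postmortem:
# Pennsylvania in maintenance, both Texas health checks timing out.
incident = [
    Server("austin", "192.0.2.10", in_maintenance=False, health_check_ok=False),
    Server("hill-country", "192.0.2.20", in_maintenance=False, health_check_ok=False),
    Server("pennsylvania", "192.0.2.30", in_maintenance=True, health_check_ok=False),
]

print(active_a_records(incident))  # prints []
```
</pre>
<p>
With every record removed, the pool is empty, so nothing can answer
requests sent to the shared entry domain.
</p>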
<p>
NWS uses UptimeRobot to monitor the uptime of NWS servers and of the
services hosted on them. This is the source of the
statistics shown on the NWS status page.
</p>
<p>
UptimeRobot did not detect either of the Texas NWS servers as being
offline at any point during the outage. This is odd, as it means
UptimeRobot and Cloudflare disagreed on the status of NWS servers.
Logs on NWS servers showed that requests from UptimeRobot were being
served, while no requests from Cloudflare appeared in the logs.
</p>
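<p>
A log comparison of this kind could be sketched as follows. This is
an illustration only, not the tooling NWS used: the User-Agent
markers are assumptions that should be checked against real access
logs, and the sample log lines are made up for the example.
</p>
<pre>
```python
from collections import Counter

# Assumed User-Agent substrings for each monitor; verify these
# against your own access logs before relying on them.
MONITORS = {
    "uptimerobot": "UptimeRobot",
    "cloudflare": "Cloudflare-Healthchecks",
}


def count_monitor_requests(log_lines):
    """Count log lines per monitor using a case-insensitive substring match."""
    counts = Counter({name: 0 for name in MONITORS})
    for line in log_lines:
        lowered = line.lower()
        for name, marker in MONITORS.items():
            if marker.lower() in lowered:
                counts[name] += 1
    return counts


# Made-up sample lines in a simplified access log format.
sample_log = [
    '198.51.100.7 "GET / HTTP/1.1" 200 "Mozilla/5.0+(compatible; UptimeRobot/2.0)"',
    '198.51.100.8 "GET / HTTP/1.1" 200 "Mozilla/5.0+(compatible; UptimeRobot/2.0)"',
    '203.0.113.5 "GET /about HTTP/1.1" 200 "Mozilla/5.0 (X11; Linux x86_64)"',
]

print(dict(count_monitor_requests(sample_log)))
# prints {'uptimerobot': 2, 'cloudflare': 0}
```
</pre>
<p>
A count of zero for one monitor while the other is still being served
matches the disagreement described above.
</p>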
<p>
No firewall rules existed that could have blocked the health check
traffic from Cloudflare on either of the NWS servers, and no other
configuration was found that would have blocked these requests. As
these servers are on different networks inside different buildings in
different parts of Texas, their networking equipment is entirely
separate. This rules out a failure of networking equipment owned by
NWS. This leads us to believe that the issue may have been caused by
an internet traffic anomaly, although we are currently unable to
confirm that this is the cause.
</p>
<p>
This is being actively investigated to find a more concrete root
cause. This postmortem will be updated if any new information is
found.
</p>
<p>
A similar event occurred on November 12th, 2023, lasting 2 seconds.
</p>
<h2>Fix</h2>
<p>
The common factors between these two servers are that both use
Spectrum as their ISP and both are located near Austin, Texas.
The Pennsylvania server maintenance will be expedited so that we have
servers online that have no factors in common with the Texas servers.
</p>
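<p>
As a rough sketch of how shared failure factors can be checked, the
Python below lists the attributes that every server in a group has in
common. The attribute names and the Pennsylvania values are
hypothetical placeholders; only the Spectrum ISP and the Austin area
location come from this postmortem.
</p>
<pre>
```python
def shared_attributes(servers):
    """Return the attribute/value pairs that every server has in common."""
    if not servers:
        return {}
    first, *rest = servers
    return {
        key: value
        for key, value in first.items()
        if all(other.get(key) == value for other in rest)
    }


texas = [
    {"name": "austin", "isp": "Spectrum", "region": "Austin area, Texas"},
    {"name": "hill-country", "isp": "Spectrum", "region": "Austin area, Texas"},
]

# Placeholder values: the postmortem does not state the Pennsylvania ISP.
pennsylvania = {
    "name": "pennsylvania",
    "isp": "placeholder-isp",
    "region": "Pennsylvania",
}

print(shared_attributes(texas))
# prints {'isp': 'Spectrum', 'region': 'Austin area, Texas'}

print(shared_attributes(texas + [pennsylvania]))
# prints {}
```
</pre>
<p>
An empty result for the full group means no single listed factor is
shared by every server in it.
</p>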
<p>
NWS will also investigate other methods of failover and load
balancing.
</p>
<p>Last updated on November 16th, 2023</p>