??
parent 29aec69d4a
commit 10782342f2
111
out/blogs/nws-postmortem-11-8-23.html
Normal file
@@ -0,0 +1,111 @@
<head>
<title>Nicholas Orlowsky</title>
<link rel="stylesheet" href="/style.css">
<link rel="icon" type="image/x-icon" href="/favicon.ico">
</head>
<body>
<nav>
<a href="/">[ Home ]</a>
<a href="/blog.html">[ Blog ]</a>
<a href="/projects.html">[ Projects ]</a>
<a href="/extra.html">[ Extra ]</a>
<hr/>
</nav>

<h1>NWS Incident Postmortem 11/08/2023</h1>

<p>
On November 8th, 2023 at approximately 09:47 UTC, NWS suffered
a complete outage. The outage took down all services hosted on
NWS, as well as the NWS Management Engine and the NWS dashboard.
</p>

<p>
The incident lasted 28 minutes, after which it resolved
automatically and all services were restored. This is NWS' first
outage event of 2023.
</p>

<h2>Cause</h2>
<p>
NWS uses several tactics to ensure uptime, including load
balancing and failover. This service is currently provided by
Cloudflare at the DNS level. Cloudflare sends health check
requests to NWS servers at specified intervals. If it detects
that one of the servers is down, it removes that server's A
record from entry.nws.nickorlow.com (the domain to which all
services on NWS direct their traffic via a CNAME).
</p>

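<p>
The failover behavior described above can be sketched in a few lines
of shell. This is only an illustration, not Cloudflare's implementation:
the function name healthy_records, the "address status" input format,
and the addresses (taken from the 192.0.2.0/24 documentation range) are
all assumptions made for the example.
</p>

```shell
#!/bin/sh
# Hypothetical sketch of DNS-level failover (not Cloudflare's actual logic).
# Reads lines of "<address> <status>" on stdin, where <status> is "up" or
# "down" (an assumed format), and prints the addresses whose A records
# would stay published.
healthy_records() {
  while read -r address status; do
    if [ "$status" = "up" ]; then
      printf '%s\n' "$address"
    fi
  done
}

# Example: one server fails its health check, so only the other
# keeps its A record.
printf '%s\n' '192.0.2.10 up' '192.0.2.20 down' | healthy_records
```

<p>
If every server is reported as down, the function prints nothing. That
mirrors the situation in this incident, where no A records were left to
answer for entry.nws.nickorlow.com.
</p>
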
<p>
At around 09:47 UTC, Cloudflare detected that our servers in
Texas (Austin and Hill Country) were down. The health checks did
not return an error; they timed out at the HTTP level, which
normally indicates that a server has lost network connectivity.
Cloudflare then removed the servers' A records from the
entry.nws.nickorlow.com domain. Because NWS' Pennsylvania servers
have been undergoing maintenance since August 2023, this left no
servers able to serve requests routed to entry.nws.nickorlow.com,
resulting in the outage.
</p>

<p>
NWS uses UptimeRobot to monitor the uptime of NWS servers and of
the services hosted on NWS. This is the source of the
statistics shown on the NWS status page.
</p>

<p>
UptimeRobot did not detect either of the Texas NWS servers as being
offline during the outage. This is odd, as it means UptimeRobot
and Cloudflare disagreed on the status of the NWS servers. Logs
on the NWS servers showed that requests from UptimeRobot were being
served, while no requests from Cloudflare appeared in the logs.
</p>

<p>
No firewall rules existed on either NWS server that could have
blocked this traffic, and no other configuration was found that
would have blocked these requests. As these servers are on
different networks inside different buildings in different
parts of Texas, their networking equipment is entirely separate.
This rules out a hardware failure of networking equipment owned
by NWS. This leads us to believe that the issue may have been
caused by an internet traffic anomaly, although we are currently
unable to confirm this.
</p>

<p>
This is being actively investigated to find a more concrete root
cause. This postmortem will be updated if any new information is
found.
</p>

<p>
A similar event occurred on November 12th, 2023, lasting 2 seconds.
</p>

<h2>Fix</h2>
<p>
The common factors between these two servers are that they both use
Spectrum as their ISP and that they are located near Austin, Texas.
The Pennsylvania server maintenance will be expedited so that we have
servers online that share no commonalities with the Texas servers.
</p>

<p>
NWS will also investigate other methods of failover and load
balancing.
</p>

<p>Last updated on November 16th, 2023</p>

<footer>
<hr />
<p style="margin-bottom: 0px;">Copyright © Nicholas Orlowsky 2023</p>
<p style="margin-top: 0px; margin-bottom: 0px;">Hosting provided by <a href="https://nws.nickorlow.com">NWS</a></p>
<p style="margin-top: 0px;">Powered by <a href="https://github.com/nickorlow/anthracite">Anthracite Web Server</a></p>
</footer>
</body>
121
out/blogs/side-project-10-20-23.html
Normal file
@@ -0,0 +1,121 @@
<head>
<title>Nicholas Orlowsky</title>
<link rel="stylesheet" href="/style.css">
<link rel="icon" type="image/x-icon" href="/favicon.ico">
</head>
<body>
<nav>
<a href="/">[ Home ]</a>
<a href="/blog.html">[ Blog ]</a>
<a href="/projects.html">[ Projects ]</a>
<a href="/extra.html">[ Extra ]</a>
<hr/>
</nav>

<h1>Side Project Log 10/20/2023</h1>
<p>This side project log covers work done from 8/15/2023 to 10/20/2023.</p>

<h2 id="anthracite">Anthracite</h2>
<a href="https://github.com/nickorlow/anthracite">[ GitHub Repo ]</a>
<p>
Anthracite is a web server written in C++. The site you're reading this on
right now is hosted on Anthracite. I wrote it to deepen my knowledge of C++ and networking protocols. My
main focus with Anthracite is performance. While developing Anthracite,
I have been exploring different optimization techniques and benchmarking
Anthracite against popular web servers such as NGINX and Apache.
Anthracite supports HTTP/1.1 and handles only GET requests for
files stored on the server.
</p>

<p>
Anthracite currently performs on par with NGINX and Apache when making
1000 requests for a 50MB file using 100 threads in a Docker container.
To achieve this performance, I used memory profilers to find
out what caused large or repeated memory copies to occur. I then updated
those sections of code to remove or minimize these copies. I also
made it so that Anthracite caches all files it can serve in memory. This
avoids unnecessary and costly disk reads. The implementation is
subpar: the server must be restarted before Anthracite picks up
changes to the files it serves.
</p>

<p>
I intend to make further performance improvements, specifically in the request
parser. I also plan to implement HTTP/2.
</p>

<h2 id="yacemu">Yet Another Chip Eight Emulator (yacemu)</h2>
<a href="https://github.com/nickorlow/yacemu">[ GitHub Repo ]</a>
<p>
YACEMU is an interpreter for the CHIP-8 instruction set written in C. My main
goal when writing it was to gain more insight into how emulation works. I had
previous experience with this from when I worked on an emulator for a slimmed-down
version of x86 called <a href="https://web.cse.ohio-state.edu/~reeves.92/CSE2421sp13/PracticeProblemsY86.pdf">Y86</a>.
So far, I've been able to get most instructions working. I still need to add
input support so that users can interact with programs running in yacemu. It has
been fairly uncomplicated and easy to write thus far. After I complete it, I would
like to work on an emulator for a real device such as the Game Boy (this might be
biting off more than I can chew).
</p>

<h2 id="nick-vim">Nick VIM</h2>
<p>
Over the summer while I was interning, I began using VIM as my primary
text editor. I used a preconfigured version of it (<a href="https://nvchad.com/">NvChad</a>) to save time, as
setting everything up can take a while. After using it for a few months, I began
making my own configuration for VIM, taking what I liked from NvChad and leaving
behind the parts that I didn't like as much.
</p>

<img src="/blog-images/NickVIM_Screenshot.png" alt="Screenshot of an HTML file open for editing in NickVIM"/>

<p>
One important part of Nick VIM was ensuring that it was portable between different
machines. I wanted it to need as few dependencies on the host machine as possible so that I
could get Nick VIM set up on any computer in a couple of minutes. This will be especially
useful when working on my school's lab machines and when switching to new computers
in the future. I achieved this by dockerizing Nick VIM. This is based on what one of
my co-workers does with their VIM setup. The Docker container contains
all the dependencies for each language server. Whenever you edit a file with Nick VIM,
the following script runs:
</p>

<code lang="bash">
# Derive a container name from the current path: every "/" becomes "_",
# the leading "_" is dropped, and a random suffix is appended.
echo "Starting container..."
cur_dir="$(pwd)"
container_name="${cur_dir//\//_}"
container_name="${container_name:1}_$RANDOM"
# Share the host's X11 socket and bind-mount the current directory at /work.
docker run --name "$container_name" --network host -e DISPLAY="$DISPLAY" -v /tmp/.X11-unix:/tmp/.X11-unix --mount type=bind,source="$cur_dir",target=/work -d nick-vim &> /dev/null

echo "Execing into container..."
docker exec -w /work -it "$container_name" bash

echo "Stopping container in background..."
docker stop "$container_name" &> /dev/null &
</code>

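<p>
This code creates a new container, shares the host's X11 display socket with it
(which is how the host's clipboard reaches the container), and
mounts the current directory inside the container at /work for editing. Once the
shell inside the container exits, the container is stopped in the background.
</p>
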
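<p>
To make the naming step concrete, here is a small sketch of how the container
name is derived from the working directory. It uses tr and POSIX parameter
expansion instead of the bash-specific substitution in the script above, and
the function name container_name_for and the example path are hypothetical.
</p>

```shell
#!/bin/sh
# Sketch of the container-name derivation (assumed helper, not part of Nick VIM).
# $1: absolute directory path, $2: numeric suffix (the script uses $RANDOM).
container_name_for() {
  name=$(printf '%s' "$1" | tr '/' '_')   # every "/" becomes "_"
  name=${name#_}                          # drop the leading "_"
  printf '%s_%s\n' "$name" "$2"
}

container_name_for /home/nick/projects/site 12345
# prints: home_nick_projects_site_12345
```

<p>
Docker requires container names to be unique, so the random suffix presumably
exists to let several containers be started from the same directory without
their names colliding.
</p>
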
<h2 id="secane">Secane</h2>
<p><a href="https://www.youtube.com/watch?v=tKRehO7FH_s">[ Video Demo ]</a></p>
<p>
Secane was a simple ChatGPT wrapper that I wrote to practice for the behavioral part of
job interviews. It takes your resume, information about the company, and information about
the role you're interviewing for. It also integrates with OpenAI's Whisper, allowing you
to simulate talking out your answers. I made it with Next.js.
</p>

<hr/>
<p><strong>These projects had minimal/no work done on them:</strong> NWS, RingGold, SQUIRREL</p>
<p><strong>These projects I will no longer be working on:</strong> Olney</p>

<footer>
<hr />
<p style="margin-bottom: 0px;">Copyright © Nicholas Orlowsky 2023</p>
<p style="margin-top: 0px; margin-bottom: 0px;">Hosting provided by <a href="https://nws.nickorlow.com">NWS</a></p>
<p style="margin-top: 0px;">Powered by <a href="https://github.com/nickorlow/anthracite">Anthracite Web Server</a></p>
</footer>
</body>
89
src/blogs/nws-postmortem-11-8-23.filler.html
Normal file
@@ -0,0 +1,89 @@
<h1>NWS Incident Postmortem 11/08/2023</h1>

<p>
On November 8th, 2023 at approximately 09:47 UTC, NWS suffered
a complete outage. The outage took down all services hosted on
NWS, as well as the NWS Management Engine and the NWS dashboard.
</p>

<p>
The incident lasted 28 minutes, after which it resolved
automatically and all services were restored. This is NWS' first
outage event of 2023.
</p>

<h2>Cause</h2>
<p>
NWS uses several tactics to ensure uptime, including load
balancing and failover. This service is currently provided by
Cloudflare at the DNS level. Cloudflare sends health check
requests to NWS servers at specified intervals. If it detects
that one of the servers is down, it removes that server's A
record from entry.nws.nickorlow.com (the domain to which all
services on NWS direct their traffic via a CNAME).
</p>

<p>
At around 09:47 UTC, Cloudflare detected that our servers in
Texas (Austin and Hill Country) were down. The health checks did
not return an error; they timed out at the HTTP level, which
normally indicates that a server has lost network connectivity.
Cloudflare then removed the servers' A records from the
entry.nws.nickorlow.com domain. Because NWS' Pennsylvania servers
have been undergoing maintenance since August 2023, this left no
servers able to serve requests routed to entry.nws.nickorlow.com,
resulting in the outage.
</p>

<p>
NWS uses UptimeRobot to monitor the uptime of NWS servers and of
the services hosted on NWS. This is the source of the
statistics shown on the NWS status page.
</p>

<p>
UptimeRobot did not detect either of the Texas NWS servers as being
offline during the outage. This is odd, as it means UptimeRobot
and Cloudflare disagreed on the status of the NWS servers. Logs
on the NWS servers showed that requests from UptimeRobot were being
served, while no requests from Cloudflare appeared in the logs.
</p>

<p>
No firewall rules existed on either NWS server that could have
blocked this traffic, and no other configuration was found that
would have blocked these requests. As these servers are on
different networks inside different buildings in different
parts of Texas, their networking equipment is entirely separate.
This rules out a hardware failure of networking equipment owned
by NWS. This leads us to believe that the issue may have been
caused by an internet traffic anomaly, although we are currently
unable to confirm this.
</p>

<p>
This is being actively investigated to find a more concrete root
cause. This postmortem will be updated if any new information is
found.
</p>

<p>
A similar event occurred on November 12th, 2023, lasting 2 seconds.
</p>

<h2>Fix</h2>
<p>
The common factors between these two servers are that they both use
Spectrum as their ISP and that they are located near Austin, Texas.
The Pennsylvania server maintenance will be expedited so that we have
servers online that share no commonalities with the Texas servers.
</p>

<p>
NWS will also investigate other methods of failover and load
balancing.
</p>

<p>Last updated on November 16th, 2023</p>