
Commit 5df7652

blog: r2 migration blog post
Signed-off-by: flakey5 <[email protected]>
1 parent 8124f8d commit 5df7652

3 files changed: +199 −0

Diff for: apps/site/authors.json (+5)

@@ -253,5 +253,10 @@
     "id": "AugustinMauroy",
     "name": "Augustin Mauroy",
     "website": "https://github.com/AugustinMauroy"
+  },
+  "flakey5": {
+    "id": "flakey5",
+    "name": "flakey5",
+    "website": "https://github.com/flakey5"
   }
 }

@@ -0,0 +1,194 @@

---
date: '2024-12-28T12:00:00.000Z'
category: announcements
title: Making Node.js Downloads Reliable
layout: blog-post
author: flakey5
---

Last year, we shared [the details behind Node.js' brand new website](https://nodejs.org/en/blog/announcements/diving-into-the-nodejs-website-redesign).
Today we're back, this time to talk about the new infrastructure serving Node.js' release assets.

This blog post digs into the nitty-gritty details of the web infrastructure behind Node.js, covering its history, where it stands today, and what we had in mind and prioritized with this overhaul.

## Some Brief History

At the start of the project in 2009, Node.js release assets (binaries, documentation) were stored in a publicly accessible S3 bucket.
Around May of 2010, this changed to a VPS that hosted both the release assets and the Node.js website.

When io.js forked from Node.js in 2014, it served its releases from a VPS as well.

After Node.js and io.js merged in 2015, io.js' VPS (referred to as the origin server from here on) was repurposed to host both io.js and Node.js releases along with the Node.js website, and it remained that way up until recently.

The architecture looked like this:

![A diagram of the old infrastructure. Cloudflare is used as a cache and NGINX is used for serving static files.](/static/images/blog/announcements/old-release-asset-infra.png)

## Growing Pains

Nowadays, the nodejs.org domain sees over 3 billion requests and 2+ petabytes of traffic per month, with the majority of that going towards release assets.

This averages out to about 1,157 requests per second, at an average bandwidth of 771 MB per second.

<details>
<summary>Math</summary>
3,000,000,000 requests per month / 30 days / 24 hours / 60 minutes / 60 seconds = ~1,157 requests/second.
2,000,000,000 MB per month / 30 days / 24 hours / 60 minutes / 60 seconds = ~771 MB/second.
</details>

The origin server didn't have enough resources for this and struggled to keep up with the demand.
Moving the website off of the origin server as part of the redesign effort did help, but it still wasn't enough.

Cloudflare caching was also being used rather inefficiently.
Because there wasn't a good way to purge only what we needed when a release was made, we ended up having to purge everything.
And since Node.js has nightly releases, that meant that every day at midnight UTC the cache was purged and the origin server was flooded with requests, effectively DDoSing it.

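To make the difference concrete, here is a rough sketch of the two purge strategies as calls to Cloudflare's cache purge API. This is illustrative only, not the project's actual tooling; the zone ID and token are placeholders.

```ts
// Hedged sketch: the purge we were stuck with versus the purge we wanted.
// ZONE_ID and API_TOKEN are placeholders, not real credentials.
const ZONE_ID = '<zone-id>';
const API_TOKEN = '<api-token>';
const PURGE_URL = `https://api.cloudflare.com/client/v4/zones/${ZONE_ID}/purge_cache`;

async function purge(body: object): Promise<void> {
  await fetch(PURGE_URL, {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${API_TOKEN}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify(body),
  });
}

// What happened on every nightly release: drop the entire cache.
await purge({ purge_everything: true });

// What we wanted: purge only the assets a release actually changed.
await purge({ files: ['https://nodejs.org/dist/index.json' /* , ... */] });
```
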
There were also a handful of other issues with the origin server's maintenance:

- Documentation of what was running on the server was spotty; some things were well documented, others not at all.
- Changes made in the [nodejs/build](https://github.com/nodejs/build) repository needed to be deployed manually by a Build WG member with access. There was also no guarantee that what was in the build repository was what was actually on the server.
- There was no staging environment other than a backup instance.
- Rollbacks could only be done via disk images through the VPS provider's web portal.

## Attempts Were Made

All of these issues combined created scenarios where the origin server wasn't touched unless absolutely necessary, including a period in which it had over 3 years of uptime.
These factors also contributed to incidents such as the one that occurred from [March 15th, 2023 to March 17th, 2023](/en/blog/announcements/node-js-march-17-incident), where the Node.js release assets were unavailable for 2 days due to the origin server being overloaded and improper caching rules.
Between incidents like that and the daily outages, users were affected and painfully aware of the unreliability of the infrastructure.

This needed to be fixed.

However, attempts to remediate these issues within the existing infrastructure could only go so far.
The NGINX configuration was tuned to lower its resource consumption, and missing documentation was filled in.
Cloudflare WAF was integrated to block spammy requests from repository managers such as Artifactory.
Load balancing changes were made to try to lessen the load.
But these ultimately weren't as effective as they needed to be.

## The Rewrite

In August of 2023, [Claudio Wunder](https://github.com/ovflowd) and [I (flakey5)](https://github.com/flakey5) started working on a proof-of-concept for a new way to serve Node.js release assets.

The idea was "simple": create a new service that solved all of the issues we had with the previous infrastructure, while showing no noticeable difference to users.
To meet these requirements, we prioritized three main goals for the new service:

1. Reliability: The service needs to be as close to 100% uptime as possible.
2. Maintainability: Maintainers should not have to worry about things toppling over because they changed something. The service needs to be well-documented and as clean and simple as possible.
3. Efficiency: Whatever platform we used, the service would need to make full use of it to provide the best performance and experience possible.

With these goals in mind, we ended up choosing [Cloudflare Workers](https://developers.cloudflare.com/workers) and [R2](https://developers.cloudflare.com/r2) for a handful of reasons:

1. Workers and R2 are, and historically have been, reliable and fast.
2. Workers takes care of the infrastructure for us, so we only need to maintain the service itself. This heavily lessens the cost of maintenance, especially for a team of volunteers.
3. Node.js already made use of Cloudflare services, so it made sense to look into expanding that usage.
4. Pricing. Cloudflare was gracious enough to provide Node.js with free access to Workers and R2 under [Project Alexandria](https://www.cloudflare.com/lp/project-alexandria).

In September of 2023, the proof-of-concept was ready to be reviewed, and an issue was opened in the nodejs/build repository ([#3461](https://github.com/nodejs/build/issues/3461)) seeking approval from the Build WG.

After the Build WG discussed the change, it was approved, and we started working on getting the service, now referred to as the Release Worker, deployed to the nodejs.org domain.

## The Journey to Production

Developing the Release Worker took a lot of trial and error, and a lot of learning, across numerous iterations of the service.
It needed to have all of the same features and similar, if not exactly the same, behaviors as its predecessor, NGINX.

### What It Needed To Do

For starters, it needed to be able to provide the latest releases of Node.js as soon as they're released.

Secondly, it needed to handle routing correctly.
Most assets don't have a 1:1 mapping between their URL and where they're located in the file system.
Where a URL maps to can even change depending on the Node.js version being requested.
For instance:

| URL                               | File Path                              |
| --------------------------------- | -------------------------------------- |
| `/dist/index.json`                | `nodejs/release/index.json`            |
| `/dist/latest-iron/...`           | `nodejs/release/v20.x.x/...`           |
| `/docs/v0.1.20/...`               | `nodejs/docs/v0.1.20/...`              |
| `/docs/v22.0.0/...`               | `nodejs/release/v22.0.0/docs/api/...`  |
| `/dist/v0.4.9/node-v0.4.9.tar.gz` | `nodejs/release/node-v0.4.9.tar.gz`    |
| `/dist/v0.4.9/SHASUMS256.txt`     | `nodejs/release/v0.4.9/SHASUMS256.txt` |

This behavior accumulated over multiple changes to the release cycle and to the way release assets were distributed, and it was implemented on the origin server with symlinks.
However, R2 doesn't support symlinks, meaning we needed to come up with a solution of our own.

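To give a feel for what replacing those symlinks with code looks like, here is a minimal sketch of URL-to-R2-key mapping. The helper, the codename table, and the docs-layout cutoff are all illustrative; the actual Release Worker's routing is more involved.

```ts
// Illustrative sketch of mapping request URLs to R2 object keys in code,
// standing in for the origin server's symlinks. Names here are hypothetical.
const LATEST = /^\/dist\/latest-(\w+)\/(.*)$/;
const DOCS = /^\/docs\/(v\d+\.\d+\.\d+)\/(.*)$/;

// Hypothetical lookup from a release codename to its release directory.
function resolveCodename(codename: string): string {
  const codenames: Record<string, string> = { iron: 'v20.x.x' /* placeholder */ };
  return codenames[codename];
}

function urlPathToR2Key(path: string): string {
  const latest = LATEST.exec(path);
  if (latest !== null) {
    // /dist/latest-iron/... -> nodejs/release/v20.x.x/...
    return `nodejs/release/${resolveCodename(latest[1])}/${latest[2]}`;
  }

  const docs = DOCS.exec(path);
  if (docs !== null) {
    // Older docs live under nodejs/docs/; newer ones ship inside the release
    // directory. The cutoff used here is made up purely for illustration.
    const major = Number(docs[1].slice(1).split('.')[0]);
    return major >= 1
      ? `nodejs/release/${docs[1]}/docs/api/${docs[2]}`
      : `nodejs/docs/${docs[1]}/${docs[2]}`;
  }

  // Everything else under /dist/ maps into nodejs/release/ (modulo further
  // special cases, like very old tarball locations, omitted here).
  return path.replace(/^\/dist\//, 'nodejs/release/');
}
```
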
Finally, we needed to meet the reliability goal.
To do this, we implemented three things:

1. Any request to R2 that fails is retried 3 times (in addition to the retries that Workers already performs).
2. A "fallback" system: any request to R2 that fails all of its retries is rewritten to the old infrastructure (a sketch of this flow follows below).
3. When an error does happen, it's recorded in [Sentry](https://sentry.io/welcome) and we're notified so we can take appropriate action.

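Here is a minimal sketch of that retry-and-fallback flow. The bucket binding, fallback hostname, and error reporting are placeholders, not the Release Worker's actual code.

```ts
// Minimal sketch of the retry-and-fallback flow; all names are placeholders.
interface Env {
  RELEASE_BUCKET: R2Bucket; // R2 binding configured in wrangler.toml
}

// Placeholder for wherever the old infrastructure is reachable.
const FALLBACK_ORIGIN = 'https://origin.example.org';
const R2_RETRIES = 3;

async function serveAsset(request: Request, env: Env, key: string): Promise<Response> {
  for (let attempt = 1; attempt <= R2_RETRIES; attempt++) {
    try {
      const object = await env.RELEASE_BUCKET.get(key);
      if (object === null) {
        // A clean "not found" isn't an R2 failure, so it isn't retried.
        return new Response('Not Found', { status: 404 });
      }
      return new Response(object.body);
    } catch (error) {
      if (attempt === R2_RETRIES) {
        // Every retry failed: report the error (e.g. Sentry's captureException)
        // and rewrite the request to the old infrastructure.
        const { pathname } = new URL(request.url);
        return fetch(new Request(`${FALLBACK_ORIGIN}${pathname}`, request));
      }
    }
  }
  throw new Error('unreachable');
}
```
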
### The Iterations

We first started off with an incredibly simple worker.
Given the path in the URL, it checked whether the requested file existed in the R2 bucket.
If it did, the worker returned it.
Otherwise, it rewrote the request back to the origin server.
Requests that resulted in a directory listing were simply forwarded to the origin server.
This iteration covered hardly any of the requirements, so it was back to the drawing board.

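Reconstructed for illustration (this isn't the original code, and the bucket binding and origin hostname are placeholders), that first iteration amounted to little more than:

```ts
// Rough reconstruction of the first iteration, for illustration only.
interface Env {
  RELEASE_BUCKET: R2Bucket;
}

const ORIGIN = 'https://origin.example.org'; // placeholder hostname

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const { pathname } = new URL(request.url);
    const key = pathname.slice(1); // naive 1:1 path-to-key mapping

    // Directory listings (trailing slash) were simply forwarded to the origin.
    if (!pathname.endsWith('/')) {
      const object = await env.RELEASE_BUCKET.get(key);
      if (object !== null) {
        return new Response(object.body);
      }
    }

    // Everything else was rewritten back to the origin server.
    return fetch(new Request(`${ORIGIN}${pathname}`, request));
  },
};
```
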
The second iteration was based on the popular R2 directory listing library [render2](https://www.npmjs.com/package/render2), developed by [Kotx](https://github.com/Kotx).
The library worked well for the more generic use cases we needed to cover; however, it fell short on the more unique ones.
So we forked it, adding what we needed to make it work for us.
The fork became rather messy, though, and thus fell short of our maintainability goal.

This led us to our third iteration: a complete rewrite that still kept some aspects of render2.
It worked for the most part, but it too was a mess and didn't meet our maintainability goal.
It was also designed around the exact behavior we needed from the service at the time.
If we needed to change that behavior in any way, we would have had to refactor significant portions of the codebase.
We knew we could do better than this.

This led us to the fourth and current iteration of the Release Worker, yet another rewrite.
This time, however, it was designed to be much more modular, with a middleware-centric design.
This made the code a lot easier to follow and maintain, and, as of November 2024, it is what's deployed to production.

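As a sketch of what "middleware-centric" means here (simplified and hypothetical; the Release Worker's actual interfaces differ):

```ts
// Simplified sketch of a middleware chain; names and interfaces are
// illustrative, not the Release Worker's actual ones.
type Next = () => Promise<Response>;

interface Middleware {
  handle(request: Request, next: Next): Promise<Response>;
}

// Each concern is one small, independently testable unit...
const logger: Middleware = {
  async handle(request, next) {
    console.log(`${request.method} ${request.url}`);
    return next();
  },
};

// ...ending in a terminal middleware that always produces a response.
const notFound: Middleware = {
  async handle() {
    return new Response('Not Found', { status: 404 });
  },
};

// The worker walks the chain until some middleware responds; changing
// behavior means swapping middlewares rather than refactoring the codebase.
function run(request: Request, chain: Middleware[], index = 0): Promise<Response> {
  return chain[index].handle(request, () => run(request, chain, index + 1));
}

export default {
  fetch: (request: Request) => run(request, [logger, notFound]),
};
```
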
## Maintainability

As we said in our previous blog post on the website redesign, an open source project is only as good as its documentation.
For the Release Worker to be maintainable, it needed to be well documented.

This was achieved not only through thorough comments in the codebase but also through documents such as:

- the [README](https://github.com/nodejs/release-cloudflare-worker/tree/main/README.md)
- the [Collaborator Guide](https://github.com/nodejs/release-cloudflare-worker/tree/main/COLLABORATOR_GUIDE.md)
- the [Contributing Guide](https://github.com/nodejs/release-cloudflare-worker/tree/main/CONTRIBUTING.md)
- [General Documentation](https://github.com/nodejs/release-cloudflare-worker/blob/main/docs)
- [Standard Operating Procedures](https://github.com/nodejs/release-cloudflare-worker/blob/main/docs/sops) for things such as incident flows

## What's next?

The work isn't done _just_ yet.
We still want to:

- Look into any performance improvements that could be made.
  - This includes looking into integrating [Cloudflare KV](https://developers.cloudflare.com/kv/) for directory listings.
- Have better tests and a better development environment ([PR](https://github.com/nodejs/release-cloudflare-worker/pull/252))
- Add metrics to give us more visibility into how the Release Worker is behaving and whether there's anything we can improve.

## Thanks

Many people and organizations have contributed to this effort in many different ways.
We'd like to thank:

- All of the [contributors and collaborators](https://github.com/nodejs/release-cloudflare-worker/graphs/contributors) who make this project possible.
- [Cloudflare](https://cloudflare.com), for providing the infrastructure that serves the Node.js website and the Release Worker, and specifically the R2 team for the technical support they have given us.
- [Sentry](https://sentry.io/welcome/), for providing an open source license for their error reporting, monitoring, and diagnostic tools.
- The [OpenJS Foundation](https://openjsf.org), for their support and guidance.

> Here is your weekly reminder that the Node.js project is driven by volunteers. Therefore every feature that lands is because someone spent time (or money) to make it happen. This is called Open Governance.
> <cite>[Matteo Collina, via social media](https://x.com/matteocollina/status/1770496526424351205?s=46&t=22eoAstJVk5l46KQXYEk5Q)</cite>

Want to get involved? [Check out the project on GitHub](https://github.com/nodejs/release-cloudflare-worker).

> Fun fact! Did you know that Node.js has [a status page](https://status.nodejs.org)? No? You're not alone! We've been rather bad at using it to communicate these issues to users, but we're working on improving that.