Thu Apr 02 2020

Just How Fast is EJS?


With FastComments, most of our server-side rendered pages are served using the EJS ExpressJS plugin. Since we're using NodeJS, this makes sense for us. EJS is fast and easy to write. The amount of abstraction is also low since the library is fairly small, and I think everyone knows how big a fan I am of the right (and fewer) abstractions.

So EJS works well for our marketing site and administration app, with pages rendering in single-digit milliseconds. But what about the really heavily hit stuff - the comment widget itself?

Quick Background - How The Comment Widget Works

To load the comment widget, your browser has to make the following requests:

  1. embed.min.js
  2. /embed endpoint
  3. comment-ui.min.js
  4. /comments API

The reason there are a couple more calls than you might expect is the need for an iframe, for security reasons. That's explained in more detail in this post.

The script (#1) is needed to control the iframe - its styling, auto-resizing, and so on.

The /embed endpoint (#2) loads our iframe content.

The comment-ui script (#3) contains the code for the widget, and #4 is the API call that actually fetches the comments, the current logged-in user's information, and translations.
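Concretely, the relationship between #1 and #2 boils down to something like this - a heavily simplified sketch, where the container element ID is illustrative and not our actual embed code:

// Rough sketch of what embed.min.js does (simplified; 'fastcomments-widget' is an illustrative ID).
const config = { tenantId: 'DTGXwVKr', urlId: window.location.href };
const iframe = document.createElement('iframe');
iframe.src = 'https://fastcomments.com/embed?config=' + encodeURIComponent(JSON.stringify(config));
iframe.style.width = '100%';
iframe.style.border = 'none';
document.getElementById('fastcomments-widget').appendChild(iframe);
// ...plus message listeners to auto-resize the iframe as its content changes.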

So what we're talking about here is #2.

Just cache it dude

The problem with caching in this scenario is what happens when items fall out of the cache. For FastComments to properly live up to its name, we don't want to risk bad 99th-percentile latency. So that /embed endpoint should always be fast.

So let's benchmark.

Benchmark - Template Literal

First, let's get a baseline. Let's compare EJS to the fastest thing we can do - returning a string. No caching for now, although in production we currently use an LRU cache.

What if the endpoint was just a string literal?

FastComments Embed Endpoint as String Literal
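There's not much to it - here's a minimal sketch of the idea, with the real markup trimmed way down (the route handler and asset paths are illustrative, not our actual code):

// Minimal sketch: serving /embed as one big template literal.
const express = require('express');
const app = express();

app.get('/embed', function (req, res) {
    // The config arrives as URL-encoded JSON in the query string.
    const config = JSON.parse(req.query.config);
    res.send(`<!DOCTYPE html>
<html>
<head><link rel="stylesheet" href="${config.apiHost}/css/comment-ui.min.css"></head>
<body>
<div id="comments"></div>
<script src="${config.apiHost}/js/comment-ui.min.js"></script>
</body>
</html>`);
});

app.listen(3000);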

We'll use autocannon with the default settings to run our test, hitting an endpoint with actual parameters taken from production.
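For reference, a run is just something like this - assuming the app is listening locally on port 3000 (the real config query string is much longer than shown here):

npx autocannon 'http://localhost:3000/embed?config=%7B%22tenantId%22%3A%22DTGXwVKr%22%7D'

The defaults are 10 connections for 10 seconds, which is where the roughly ten-second run times below come from.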

┌─────────┬──────┬──────┬───────┬──────┬─────────┬─────────┬─────────┐
│ Stat    │ 2.5% │ 50%  │ 97.5% │ 99%  │ Avg     │ Stdev   │ Max     │
├─────────┼──────┼──────┼───────┼──────┼─────────┼─────────┼─────────┤
│ Latency │ 2 ms │ 2 ms │ 4 ms  │ 5 ms │ 2.25 ms │ 0.69 ms │ 15.7 ms │
└─────────┴──────┴──────┴───────┴──────┴─────────┴─────────┴─────────┘
┌───────────┬─────────┬─────────┬────────┬─────────┬─────────┬────────┬─────────┐
│ Stat      │ 1%      │ 2.5%    │ 50%    │ 97.5%   │ Avg     │ Stdev  │ Min     │
├───────────┼─────────┼─────────┼────────┼─────────┼─────────┼────────┼─────────┤
│ Req/Sec   │ 2175    │ 2175    │ 3869   │ 4001    │ 3631    │ 555.19 │ 2174    │
├───────────┼─────────┼─────────┼────────┼─────────┼─────────┼────────┼─────────┤
│ Bytes/Sec │ 3.54 MB │ 3.54 MB │ 6.3 MB │ 6.52 MB │ 5.91 MB │ 904 kB │ 3.54 MB │
└───────────┴─────────┴─────────┴────────┴─────────┴─────────┴────────┴─────────┘

Req/Bytes counts sampled once per second.

36k requests in 10.08s, 59.1 MB read

┌─────────┬──────┬──────┬───────┬──────┬─────────┬─────────┬──────────┐
│ Stat    │ 2.5% │ 50%  │ 97.5% │ 99%  │ Avg     │ Stdev   │ Max      │
├─────────┼──────┼──────┼───────┼──────┼─────────┼─────────┼──────────┤
│ Latency │ 2 ms │ 2 ms │ 3 ms  │ 3 ms │ 2.08 ms │ 0.31 ms │ 10.98 ms │
└─────────┴──────┴──────┴───────┴──────┴─────────┴─────────┴──────────┘
┌───────────┬─────────┬─────────┬─────────┬─────────┬─────────┬────────┬─────────┐
│ Stat      │ 1%      │ 2.5%    │ 50%     │ 97.5%   │ Avg     │ Stdev  │ Min     │
├───────────┼─────────┼─────────┼─────────┼─────────┼─────────┼────────┼─────────┤
│ Req/Sec   │ 3755    │ 3755    │ 3945    │ 4015    │ 3920.82 │ 79.38  │ 3755    │
├───────────┼─────────┼─────────┼─────────┼─────────┼─────────┼────────┼─────────┤
│ Bytes/Sec │ 6.12 MB │ 6.12 MB │ 6.43 MB │ 6.54 MB │ 6.39 MB │ 129 kB │ 6.12 MB │
└───────────┴─────────┴─────────┴─────────┴─────────┴─────────┴────────┴─────────┘

Req/Bytes counts sampled once per second.

43k requests in 11.08s, 70.2 MB read

┌─────────┬──────┬──────┬───────┬──────┬─────────┬─────────┬──────────┐
│ Stat    │ 2.5% │ 50%  │ 97.5% │ 99%  │ Avg     │ Stdev   │ Max      │
├─────────┼──────┼──────┼───────┼──────┼─────────┼─────────┼──────────┤
│ Latency │ 2 ms │ 2 ms │ 3 ms  │ 3 ms │ 2.06 ms │ 0.28 ms │ 11.24 ms │
└─────────┴──────┴──────┴───────┴──────┴─────────┴─────────┴──────────┘
┌───────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│ Stat      │ 1%      │ 2.5%    │ 50%     │ 97.5%   │ Avg     │ Stdev   │ Min     │
├───────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ Req/Sec   │ 3887    │ 3887    │ 3965    │ 4033    │ 3969.73 │ 41.4    │ 3887    │
├───────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ Bytes/Sec │ 6.33 MB │ 6.33 MB │ 6.46 MB │ 6.57 MB │ 6.47 MB │ 66.9 kB │ 6.33 MB │
└───────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┘

Req/Bytes counts sampled once per second.

44k requests in 11.08s, 71.1 MB read

You can see it gets faster with each run as we make TurboFan happy and the JIT kicks in.

We can see that after JIT compilation our max latency is 11.24 ms, with the lowest requests/second over ten seconds being 3887.

On a somewhat unrelated tangent, I wanted to show something real quick. Logging in NodeJS is slow, even with the best of frameworks. How slow?

Well, using log4js's connect middleware and logging a line like this with each request:

[2020-04-02T00:04:05.642] [DEBUG] express - ::1 - - "GET /embed?config=%7B%22tenantId%22%3A%22DTGXwVKr%22%2C%22apiHost%22%3A%22http%253A%252F%252Flocalhost%253A3001%22%2C%22urlId%22%3A%22http%253A%252F%252Flocalhost%253A63342%252Fwatchly-models%252Fscripts%252Flocal-test-lifetime.html%253F_ijt%253Dlh9s0h0fgv1t2n2ceibdot1c1g%22%7D HTTP/1.1" 200 1065 "http://localhost:63342/watchly-models/scripts/local-test-lifetime.html?_ijt=lh9s0h0fgv1t2n2ceibdot1c1g" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36"

It has an impact - surprisingly. So let's disable logging for a second.

I ran it twice; here's the second run.

┌─────────┬──────┬──────┬───────┬──────┬─────────┬─────────┬──────────┐
│ Stat    │ 2.5% │ 50%  │ 97.5% │ 99%  │ Avg     │ Stdev   │ Max      │
├─────────┼──────┼──────┼───────┼──────┼─────────┼─────────┼──────────┤
│ Latency │ 1 ms │ 1 ms │ 2 ms  │ 3 ms │ 1.15 ms │ 0.44 ms │ 13.68 ms │
└─────────┴──────┴──────┴───────┴──────┴─────────┴─────────┴──────────┘
┌───────────┬─────────┬─────────┬─────────┬─────────┬─────────┬────────┬─────────┐
│ Stat      │ 1%      │ 2.5%    │ 50%     │ 97.5%   │ Avg     │ Stdev  │ Min     │
├───────────┼─────────┼─────────┼─────────┼─────────┼─────────┼────────┼─────────┤
│ Req/Sec   │ 5051    │ 5051    │ 5467    │ 5623    │ 5429.28 │ 172.67 │ 5049    │
├───────────┼─────────┼─────────┼─────────┼─────────┼─────────┼────────┼─────────┤
│ Bytes/Sec │ 7.78 MB │ 7.78 MB │ 8.42 MB │ 8.67 MB │ 8.36 MB │ 266 kB │ 7.78 MB │
└───────────┴─────────┴─────────┴─────────┴─────────┴─────────┴────────┴─────────┘

Req/Bytes counts sampled once per second.

60k requests in 11.08s, 92 MB read

Wow! Our max latency isn't much better, but our average is cut in half, and we go from 44k total requests to 60k over the same ten-second run. Anyway, you probably don't want to disable your logging - I'd rather just deploy an extra server - so let's continue the benchmarks with logging enabled. Also, the impact is only this profound because what we're testing is so light.

Benchmark - EJS

┌─────────┬──────┬──────┬───────┬──────┬────────┬─────────┬──────────┐
│ Stat    │ 2.5% │ 50%  │ 97.5% │ 99%  │ Avg    │ Stdev   │ Max      │
├─────────┼──────┼──────┼───────┼──────┼────────┼─────────┼──────────┤
│ Latency │ 4 ms │ 4 ms │ 7 ms  │ 8 ms │ 4.4 ms │ 0.94 ms │ 19.83 ms │
└─────────┴──────┴──────┴───────┴──────┴────────┴─────────┴──────────┘
┌───────────┬─────────┬─────────┬─────────┬────────┬─────────┬────────┬─────────┐
│ Stat      │ 1%      │ 2.5%    │ 50%     │ 97.5%  │ Avg     │ Stdev  │ Min     │
├───────────┼─────────┼─────────┼─────────┼────────┼─────────┼────────┼─────────┤
│ Req/Sec   │ 1327    │ 1327    │ 2223    │ 2269   │ 2072.73 │ 286.91 │ 1327    │
├───────────┼─────────┼─────────┼─────────┼────────┼─────────┼────────┼─────────┤
│ Bytes/Sec │ 1.81 MB │ 1.81 MB │ 3.03 MB │ 3.1 MB │ 2.83 MB │ 391 kB │ 1.81 MB │
└───────────┴─────────┴─────────┴─────────┴────────┴─────────┴────────┴─────────┘

Req/Bytes counts sampled once per second.

23k requests in 11.08s, 31.1 MB read

┌─────────┬──────┬──────┬───────┬──────┬─────────┬─────────┬──────────┐
│ Stat    │ 2.5% │ 50%  │ 97.5% │ 99%  │ Avg     │ Stdev   │ Max      │
├─────────┼──────┼──────┼───────┼──────┼─────────┼─────────┼──────────┤
│ Latency │ 4 ms │ 4 ms │ 5 ms  │ 5 ms │ 4.11 ms │ 0.37 ms │ 12.36 ms │
└─────────┴──────┴──────┴───────┴──────┴─────────┴─────────┴──────────┘
┌───────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│ Stat      │ 1%      │ 2.5%    │ 50%     │ 97.5%   │ Avg     │ Stdev   │ Min     │
├───────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ Req/Sec   │ 2079    │ 2079    │ 2225    │ 2259    │ 2201.73 │ 57.87   │ 2078    │
├───────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ Bytes/Sec │ 2.83 MB │ 2.83 MB │ 3.04 MB │ 3.08 MB │ 3 MB    │ 79.2 kB │ 2.83 MB │
└───────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┘

Req/Bytes counts sampled once per second.

24k requests in 11.08s, 33 MB read

┌─────────┬──────┬──────┬───────┬──────┬─────────┬─────────┬──────────┐
│ Stat    │ 2.5% │ 50%  │ 97.5% │ 99%  │ Avg     │ Stdev   │ Max      │
├─────────┼──────┼──────┼───────┼──────┼─────────┼─────────┼──────────┤
│ Latency │ 4 ms │ 4 ms │ 5 ms  │ 5 ms │ 4.07 ms │ 0.31 ms │ 13.22 ms │
└─────────┴──────┴──────┴───────┴──────┴─────────┴─────────┴──────────┘
┌───────────┬─────────┬─────────┬─────────┬────────┬─────────┬─────────┬─────────┐
│ Stat      │ 1%      │ 2.5%    │ 50%     │ 97.5%  │ Avg     │ Stdev   │ Min     │
├───────────┼─────────┼─────────┼─────────┼────────┼─────────┼─────────┼─────────┤
│ Req/Sec   │ 2173    │ 2173    │ 2239    │ 2271   │ 2240.6  │ 28.83   │ 2173    │
├───────────┼─────────┼─────────┼─────────┼────────┼─────────┼─────────┼─────────┤
│ Bytes/Sec │ 2.97 MB │ 2.97 MB │ 3.05 MB │ 3.1 MB │ 3.06 MB │ 38.9 kB │ 2.96 MB │
└───────────┴─────────┴─────────┴─────────┴────────┴─────────┴─────────┴─────────┘

Req/Bytes counts sampled once per second.

22k requests in 10.07s, 30.6 MB read

Wow, a huge difference. We can only serve half the requests in the same amount of time, and latency is doubled across the board.

Again, it may seem crazy to optimize this. But the nature of FastComments - a widget hosted on third-party pages - means we can get flooded with tons of requests for uncached pages at once. Even with autoscaling this could cause an outage, or at least a hiccup. Being able to handle twice the uncached traffic in a spike is a big deal.

So let's take a step back. What if we also add an LRU cache to our template literal?

FastComments Embed Endpoint as String Literal w/ Cache
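Roughly like this - a sketch using the lru-cache module, with an illustrative size limit and a hypothetical buildEmbedHtml() standing in for the template literal from earlier:

// Sketch: the template-literal endpoint, fronted by an LRU cache.
const LRU = require('lru-cache');
const renderCache = new LRU({ max: 1000 }); // max entries - illustrative

app.get('/embed', function (req, res) {
    const cacheKey = req.query.config;
    let html = renderCache.get(cacheKey);
    if (html === undefined) {
        html = buildEmbedHtml(JSON.parse(cacheKey)); // hypothetical helper wrapping the template literal
        renderCache.set(cacheKey, html);
    }
    res.send(html);
});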

First it starts off not so great, since the JIT hasn't warmed up yet.

┌─────────┬──────┬──────┬───────┬──────┬─────────┬─────────┬──────────┐
│ Stat    │ 2.5% │ 50%  │ 97.5% │ 99%  │ Avg     │ Stdev   │ Max      │
├─────────┼──────┼──────┼───────┼──────┼─────────┼─────────┼──────────┤
│ Latency │ 2 ms │ 2 ms │ 4 ms  │ 5 ms │ 2.24 ms │ 0.72 ms │ 24.13 ms │
└─────────┴──────┴──────┴───────┴──────┴─────────┴─────────┴──────────┘
┌───────────┬────────┬────────┬─────────┬─────────┬─────────┬────────┬─────────┐
│ Stat      │ 1%     │ 2.5%   │ 50%     │ 97.5%   │ Avg     │ Stdev  │ Min     │
├───────────┼────────┼────────┼─────────┼─────────┼─────────┼────────┼─────────┤
│ Req/Sec   │ 2145   │ 2145   │ 3897    │ 4087    │ 3659    │ 546.44 │ 2145    │
├───────────┼────────┼────────┼─────────┼─────────┼─────────┼────────┼─────────┤
│ Bytes/Sec │ 3.5 MB │ 3.5 MB │ 6.35 MB │ 6.66 MB │ 5.96 MB │ 890 kB │ 3.49 MB │
└───────────┴────────┴────────┴─────────┴─────────┴─────────┴────────┴─────────┘

Req/Bytes counts sampled once per second.

40k requests in 11.07s, 65.6 MB read

But then!

┌─────────┬──────┬──────┬───────┬──────┬─────────┬────────┬──────────┐
│ Stat    │ 2.5% │ 50%  │ 97.5% │ 99%  │ Avg     │ Stdev  │ Max      │
├─────────┼──────┼──────┼───────┼──────┼─────────┼────────┼──────────┤
│ Latency │ 2 ms │ 2 ms │ 3 ms  │ 3 ms │ 2.06 ms │ 0.3 ms │ 11.41 ms │
└─────────┴──────┴──────┴───────┴──────┴─────────┴────────┴──────────┘
┌───────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│ Stat      │ 1%      │ 2.5%    │ 50%     │ 97.5%   │ Avg     │ Stdev   │ Min     │
├───────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ Req/Sec   │ 3849    │ 3849    │ 3967    │ 4037    │ 3970.8  │ 53.39   │ 3848    │
├───────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ Bytes/Sec │ 6.27 MB │ 6.27 MB │ 6.46 MB │ 6.58 MB │ 6.47 MB │ 86.9 kB │ 6.27 MB │
└───────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┘

Req/Bytes counts sampled once per second.

40k requests in 10.08s, 64.7 MB read

Wait, isn't that the same as our string literal test? Let's check again.

┌─────────┬──────┬──────┬───────┬──────┬─────────┬─────────┬──────────┐
│ Stat    │ 2.5% │ 50%  │ 97.5% │ 99%  │ Avg     │ Stdev   │ Max      │
├─────────┼──────┼──────┼───────┼──────┼─────────┼─────────┼──────────┤
│ Latency │ 2 ms │ 2 ms │ 3 ms  │ 3 ms │ 2.08 ms │ 0.36 ms │ 11.58 ms │
└─────────┴──────┴──────┴───────┴──────┴─────────┴─────────┴──────────┘
┌───────────┬────────┬────────┬─────────┬─────────┬─────────┬────────┬────────┐
│ Stat      │ 1%     │ 2.5%   │ 50%     │ 97.5%   │ Avg     │ Stdev  │ Min    │
├───────────┼────────┼────────┼─────────┼─────────┼─────────┼────────┼────────┤
│ Req/Sec   │ 3699   │ 3699   │ 3975    │ 4051    │ 3932.28 │ 96.96  │ 3699   │
├───────────┼────────┼────────┼─────────┼─────────┼─────────┼────────┼────────┤
│ Bytes/Sec │ 5.7 MB │ 5.7 MB │ 6.13 MB │ 6.25 MB │ 6.06 MB │ 150 kB │ 5.7 MB │
└───────────┴────────┴────────┴─────────┴─────────┴─────────┴────────┴────────┘

Req/Bytes counts sampled once per second.

43k requests in 11.08s, 66.6 MB read

Yup. Just res.send(`${some} stuff`) is faster than checking whether an item is in the cache and sending it back, at least in our micro benchmark.

Just for giggles, what if we used a regular Map for the cache instead of this fancy LRU cache?
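That version is about as dumb as it gets - no eviction at all, which is fine for a bounded benchmark but would leak memory in production (again, buildEmbedHtml() is a hypothetical stand-in for the template literal):

// Sketch: a plain Map as the cache - no eviction, benchmark use only.
const renderCache = new Map();

app.get('/embed', function (req, res) {
    const cacheKey = req.query.config;
    if (!renderCache.has(cacheKey)) {
        renderCache.set(cacheKey, buildEmbedHtml(JSON.parse(cacheKey)));
    }
    res.send(renderCache.get(cacheKey));
});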

┌─────────┬──────┬──────┬───────┬──────┬─────────┬────────┬──────────┐
│ Stat    │ 2.5% │ 50%  │ 97.5% │ 99%  │ Avg     │ Stdev  │ Max      │
├─────────┼──────┼──────┼───────┼──────┼─────────┼────────┼──────────┤
│ Latency │ 2 ms │ 2 ms │ 3 ms  │ 3 ms │ 2.06 ms │ 0.3 ms │ 11.84 ms │
└─────────┴──────┴──────┴───────┴──────┴─────────┴────────┴──────────┘
┌───────────┬─────────┬─────────┬─────────┬────────┬─────────┬─────────┬─────────┐
│ Stat      │ 1%      │ 2.5%    │ 50%     │ 97.5%  │ Avg     │ Stdev   │ Min     │
├───────────┼─────────┼─────────┼─────────┼────────┼─────────┼─────────┼─────────┤
│ Req/Sec   │ 3779    │ 3779    │ 3977    │ 4023   │ 3961.73 │ 62.58   │ 3779    │
├───────────┼─────────┼─────────┼─────────┼────────┼─────────┼─────────┼─────────┤
│ Bytes/Sec │ 5.82 MB │ 5.82 MB │ 6.13 MB │ 6.2 MB │ 6.1 MB  │ 96.6 kB │ 5.82 MB │
└───────────┴─────────┴─────────┴─────────┴────────┴─────────┴─────────┴─────────┘

Req/Bytes counts sampled once per second.

44k requests in 11.08s, 67.1 MB read

At this point, what we're seeing is that most of the time is spent not constructing the value, but getting the response through the event loop and sending it over the wire.

So where do we go from here? Well, aside from making major architecture changes, let's revisit that logging thing. What if we disable logging for just that route? It's not a big deal - we have the logs in Nginx.

┌─────────┬──────┬──────┬───────┬──────┬─────────┬─────────┬──────────┐
│ Stat    │ 2.5% │ 50%  │ 97.5% │ 99%  │ Avg     │ Stdev   │ Max      │
├─────────┼──────┼──────┼───────┼──────┼─────────┼─────────┼──────────┤
│ Latency │ 0 ms │ 1 ms │ 1 ms  │ 1 ms │ 0.69 ms │ 0.49 ms │ 11.32 ms │
└─────────┴──────┴──────┴───────┴──────┴─────────┴─────────┴──────────┘
┌───────────┬───────┬───────┬─────────┬─────────┬─────────┬────────┬───────┐
│ Stat      │ 1%    │ 2.5%  │ 50%     │ 97.5%   │ Avg     │ Stdev  │ Min   │
├───────────┼───────┼───────┼─────────┼─────────┼─────────┼────────┼───────┤
│ Req/Sec   │ 9103  │ 9103  │ 9447    │ 9663    │ 9438.91 │ 147.37 │ 9102  │
├───────────┼───────┼───────┼─────────┼─────────┼─────────┼────────┼───────┤
│ Bytes/Sec │ 14 MB │ 14 MB │ 14.6 MB │ 14.9 MB │ 14.5 MB │ 227 kB │ 14 MB │
└───────────┴───────┴───────┴─────────┴─────────┴─────────┴────────┴───────┘

Req/Bytes counts sampled once per second.

104k requests in 11.09s, 160 MB read

Wow! Double the requests, half the average latency. But why? I originally thought the reason was that this request has a giant query string that gets logged. Observe - if we remove that config from the query string and keep the request logs:

┌─────────┬──────┬──────┬───────┬──────┬─────────┬─────────┬──────────┐
│ Stat    │ 2.5% │ 50%  │ 97.5% │ 99%  │ Avg     │ Stdev   │ Max      │
├─────────┼──────┼──────┼───────┼──────┼─────────┼─────────┼──────────┤
│ Latency │ 2 ms │ 2 ms │ 3 ms  │ 3 ms │ 2.04 ms │ 0.25 ms │ 11.12 ms │
└─────────┴──────┴──────┴───────┴──────┴─────────┴─────────┴──────────┘
┌───────────┬─────────┬─────────┬─────────┬─────────┬─────────┬────────┬─────────┐
│ Stat      │ 1%      │ 2.5%    │ 50%     │ 97.5%   │ Avg     │ Stdev  │ Min     │
├───────────┼─────────┼─────────┼─────────┼─────────┼─────────┼────────┼─────────┤
│ Req/Sec   │ 4035    │ 4035    │ 4347    │ 4423    │ 4324.64 │ 103.49 │ 4034    │
├───────────┼─────────┼─────────┼─────────┼─────────┼─────────┼────────┼─────────┤
│ Bytes/Sec │ 4.68 MB │ 4.68 MB │ 5.04 MB │ 5.13 MB │ 5.02 MB │ 119 kB │ 4.68 MB │
└───────────┴─────────┴─────────┴─────────┴─────────┴─────────┴────────┴─────────┘

Req/Bytes counts sampled once per second.

48k requests in 11.08s, 55.2 MB read

A little better, but not much! Logging, and Express middleware in general, is just slow.

I think what we're going to do is the thing that would buy us the lowest latency - rewriting in Rust! Just kidding. Here's the actual plan:

  1. Be careful caching in NodeJS. It thrashes the GC, and once the JIT has warmed up it can take longer to fetch a string from the cache than to reconstruct it.
  2. Don't log these requests in the Node layer.
  3. Use template literals.
  4. Cache at the proxy layer (see the Nginx sketch after this list).
  5. COMMENT YOUR OPTIMIZATIONS
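On #4: since Nginx already sits in front of Node, it can absorb repeat hits before they ever reach the event loop. A minimal sketch - the zone name, sizes, and TTLs are all illustrative, not our production config:

# Sketch: cache /embed responses at the Nginx layer.
proxy_cache_path /var/cache/nginx/embed levels=1:2 keys_zone=embed_cache:10m max_size=100m inactive=60s;

server {
    listen 80;

    location /embed {
        proxy_cache embed_cache;
        proxy_cache_key $request_uri;  # includes the config query string
        proxy_cache_valid 200 30s;     # short TTL so config changes propagate quickly
        proxy_pass http://127.0.0.1:3000;
    }
}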

I wouldn't apply these optimizations to everything, but we really are desperate to make sure FastComments is as fast as it can be in certain areas. Engineering is all about trade-offs. Also, be a good developer. If you're going to write code that's tough to deal with, explain why:

worker.use('/embed', require('./routes/embed')); // OPTIMIZATION This is here so we don't log the requests, which is slow.
worker.use(express.static(path.join(__dirname, 'public'))); // OPTIMIZATION This is here so we don't log the requests, which is slow.
worker.use(log4js.connectLogger(logger, { level: 'debug' }));

Also, interestingly, having the express.static router first slows things down by about 30%.

We could also probably get better performance out of optimizing some of the code returned by the /embed endpoint, but that's for another day. :)