Recent Posts

Sun Apr 26 2020

Making FastComments Realtime

One great part of trying to sell a product is that you get a lot of great, free ideas. Sometimes the person you pitch even decides to become your customer - even after you've bugged them and your product wasn't good enough yet.

I contacted one company that basically sold pub-sub solutions and was using a competitor's comment system on their site. When I tried to sell them on FastComments they politely said no - but they suggested it'd be cool if FastComments were real time, and said they'd give me a great discount on their pub-sub service.

Sales tactics aside, I looked at my TODO list. I had some things I really wanted to do, but I couldn't stop thinking about making FastComments real time. It bugged me in the shower, and while I was working on other features. So I pulled the trigger, and after a weekend of work - FastComments is now real time!

Did I use the potential customer's pub-sub service? No - it was far too expensive. Running an Nchan/Nginx cluster is a very easy, secure, and scalable way to handle fan-out, and from experience Nchan can handle way more connections and throughput than most applications will ever see. So, Nchan will handle our WebSocket connections.

I decided on a pub-sub approach - there's a publicly exposed /subscribe endpoint that your browser connects to. It cannot send messages over this connection - it can only receive. Then there's a /publish endpoint that only listens on a local IP that the backend instances have access to.

Scaling Nchan

Scaling Nchan is relatively easy - nothing very interesting here. You deploy more instances, add a Redis cluster, and tune your OS to accept more connections than the default limits allow.

Deploying Nchan

Deployment takes about two seconds using Orcha, the build/deployment software I've been working on. For example, here's the complete file structure for defining a deployment of Nchan right now. I just push a change to the repo and done - the cluster updates with the latest configuration, similar to Terraform or Ansible.

Orcha will be open sourced at some point.

FastComments WS Config
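A minimal Nchan configuration along the lines described above would look something like this. This is an illustrative sketch, not FastComments' actual config - the channel-id variable ($arg_urlId), ports, and location names are all assumptions:

```nginx
# Public server: browsers connect here over WebSocket; receive-only.
server {
    listen 80;

    location = /subscribe {
        nchan_subscriber;
        nchan_channel_id $arg_urlId;          # one channel per URL ID
        nchan_subscriber_first_message newest; # don't replay old messages
    }
}

# Internal server: only reachable by the backend instances.
server {
    listen 127.0.0.1:8081;

    location = /publish {
        nchan_publisher;
        nchan_channel_id $arg_urlId;
        # nchan_redis_pass my_redis_upstream; # add when scaling out
    }
}
```

Because the publisher location only listens on a local address, clients can never push messages into a channel - exactly the receive-only property described above.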

Architecture and Problems

The following problems had to be solved and decisions made:

  1. When to push new comment and vote events.
  2. What the UI should do when a new comment or vote has been received.
  3. How to re-render part of the widget safely to update the "Show New X Comments" button.
  4. How to do this without increasing the client side script size substantially.

When to push new comment and vote events.

This might sound like an easy question, but the problem is that when an event is pushed we need to know who to fire it to. We solve that with channels: the client listens for new messages on a channel whose ID represents the page URL (basically a cleaned-up version of the URL).

The backend says "here's a new comment for this URL ID," and Nchan forwards the message to all the respective clients.
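A hypothetical sketch of deriving a channel ID from a page URL - the cleaning rules and tenant namespacing here are illustrative assumptions, not FastComments' actual implementation:

```javascript
// Hypothetical: derive a stable channel ID from a page URL.
function getChannelId(tenantId, url) {
  // Strip the protocol and trailing slashes so http/https and
  // "/page" vs "/page/" map to the same channel.
  const cleaned = url
    .replace(/^https?:\/\//, '')
    .replace(/\/+$/, '')
    .toLowerCase();
  // Namespace by tenant so two sites with the same path don't collide.
  return `${tenantId}:${encodeURIComponent(cleaned)}`;
}
```

With something like this, `getChannelId('demo', 'https://example.com/posts/42/')` and `getChannelId('demo', 'http://example.com/posts/42')` resolve to the same channel, so everyone on that page listens in the same place.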

So here's our first problem. The person that sent the comment is also on that page, so they'll get the event too. How do you efficiently de-dupe in the UI to prevent that extra comment from showing? We can't do it server-side - our fan-out mechanism should remain very simple and efficient, and Nginx/Nchan does not support this behavior. It couldn't anyway, since it doesn't know which user the event originated from; from Nchan's perspective, it originates from our servers.

Obviously we need a check - see if that comment is already in memory on the client. The approach I went with was to create a broadcast id on the client, pass it along with the request, and include that id in the event sent to each client. If the broadcast id is in the client's list of broadcast ids, we know we can ignore that event.
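The broadcast-id de-dupe can be sketched like this (the names and shapes are illustrative, not the actual widget code):

```javascript
// Broadcast ids this client has generated for its own posts.
const myBroadcastIds = new Set();

function createBroadcastId() {
  // Attached to the POST request; the server echoes it back in the
  // event pushed to every subscriber on the channel.
  const id = Math.random().toString(36).slice(2);
  myBroadcastIds.add(id);
  return id;
}

function handleCommentEvent(event, renderComment) {
  if (event.broadcastId && myBroadcastIds.has(event.broadcastId)) {
    return; // We originated this comment - it's already rendered locally.
  }
  renderComment(event.comment);
}
```

Everyone else's client sees an unknown broadcast id and renders the comment; the author's client recognizes its own id and skips it.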

This may seem like an extra step. Why not skip rendering the comment locally and let the event from the pub-sub system cause it to be rendered? The first reason is that this adds lag, which does not feel satisfying. The other reason is that the event will not be broadcast at all if the comment was marked as spam in real time, for example - but you still want to show the user the comment they just posted. Relying on the message API for the commenter's own feedback is asking for trouble.

We do the same thing for vote events, to prevent a single click from causing the count to jump twice locally.

So, to recap, this approach does two things:

  1. Prevents race conditions.
  2. Ensures the correct people see the correct comments and vote changes in real time.

What the UI should do when a new comment or vote has been received.

First of all, we definitely don't want to render new comments as they are posted. That would be distracting for someone trying to read or interact with the content. So what we need to do is render little buttons like "5 New Comments - Click to View" in the right places.

What we want to do is show a count on the first visible parent comment like:

"8 New Comments - Click to show"

This means you could have:

Visible Comment (2 new comments, click to show)
    - Hidden New Comment
        - Hidden New Comment

So we need to walk the tree to find the first visible parent comment and then calculate how many hidden comments are under that parent.

We also need to insert the new comment at the right point in the tree, and mark it as hidden for now. This way if something triggers a full re-render of the tree the comment is not rendered.

If the received comment doesn't have a parent - like when replying to a page instead of a parent comment - we can just treat the page as the root and don't have to walk any tree.

We make an assumption: events should always arrive in order, so we can just walk the tree backwards.
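The "find the first visible parent, then count the hidden comments under it" logic can be sketched like this. The data shapes (parentId, hidden, a comments-by-id map) are assumptions for illustration, not the actual widget code:

```javascript
// For a visible parent, count the hidden comments whose first visible
// ancestor is that parent - this is the number shown on the
// "X New Comments - Click to show" button.
function countHiddenUnder(commentsById, visibleParentId) {
  let count = 0;
  for (const id in commentsById) {
    const c = commentsById[id];
    if (!c.hidden) continue;
    // Walk up from this hidden comment to its first visible ancestor.
    let ancestor = commentsById[c.parentId];
    while (ancestor && ancestor.hidden) {
      ancestor = commentsById[ancestor.parentId];
    }
    if (ancestor && ancestor.id === visibleParentId) {
      count++;
    }
  }
  return count;
}
```

For the example tree above - a visible comment with a hidden reply, which itself has a hidden reply - this counts 2, matching the "(2 new comments, click to show)" label.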

Then we just need to render that text/button. Clicking it then goes and changes the hidden flag on said comments and re-renders that part of the tree.

Here it is in action! It's pretty satisfying to watch and use.

How to do this without increasing the client side script size substantially?

This is always the fun part of FastComments - optimization. In this case the story is really boring: it just wasn't much code. The logic is pretty straightforward. There's some simple graph/tree traversal, but that stays compact since our rendering mechanism had been kept simple.

I'm very happy with how things turned out - let me know what questions you have and I'll happily update this article. Cheers.

Thu Apr 02 2020

Just How Fast is EJS?


With FastComments, most of our server-side rendered pages are served using the EJS ExpressJS plugin. Since we're using NodeJS this makes sense for us. EJS is fast and easy to write. The amount of abstraction is also low, since the library is fairly small - and I think everyone knows how big a fan I am of the right (and fewer) abstractions.

So EJS works well for our marketing site and administration app, with pages rendering in single-digit milliseconds. But what about the really heavily hit stuff - the comment widget itself?

Quick Background - How The Comment Widget Works

To load the comment widget, your browser has to make the following requests:

  1. embed.min.js
  2. /embed endpoint
  3. comment-ui.min.js
  4. /comments API

The reason for a couple more calls than you might expect is the need for an iframe, for security reasons. That's explained in more detail in this post.

The script (#1) is needed to control the iframe - its styling, the auto resizing, etc.

The /embed endpoint (#2) loads our iframe content.

The comment-ui script (#3) contains the code for the widget, and #4 is the API call to actually get the comments, the current logged-in user's information, and translations.

So what we're talking about here is #2.

Just cache it dude

The problem with caching in this scenario is what happens when items fall out of the cache. For FastComments to properly live up to its name, we don't want to risk bad 99th-percentile latency. So that /embed endpoint should always be fast.

So let's benchmark.

Benchmark - Template Literal

First let's get a baseline by comparing EJS to the fastest thing we can do - returning a string. No caching, although in production we currently use an LRU cache.

What if the endpoint was just a string literal?

FastComments Embed Endpoint as String Literal
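A minimal sketch of what such an endpoint looks like - the markup and config handling here are illustrative assumptions, not the real /embed response:

```javascript
// Sketch: render the embed page as a single template literal.
function renderEmbedPage(configJson) {
  // NOTE: a real implementation must escape/validate configJson to
  // avoid XSS; it is interpolated raw here only for brevity.
  return `<!DOCTYPE html>
<html>
<head><meta charset="utf-8"></head>
<body>
<script>window.fastCommentsConfig = ${configJson};</script>
<script src="/js/comment-ui.min.js"></script>
</body>
</html>`;
}

// In Express the route body is then just one line:
// app.get('/embed', (req, res) => res.send(renderEmbedPage(req.query.config)));
```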

We'll use autocannon with the default settings to run our test, using an endpoint with actual parameters taken from production.

┌─────────┬──────┬──────┬───────┬──────┬─────────┬─────────┬─────────┐
│ Stat    │ 2.5% │ 50%  │ 97.5% │ 99%  │ Avg     │ Stdev   │ Max     │
├─────────┼──────┼──────┼───────┼──────┼─────────┼─────────┼─────────┤
│ Latency │ 2 ms │ 2 ms │ 4 ms  │ 5 ms │ 2.25 ms │ 0.69 ms │ 15.7 ms │
└─────────┴──────┴──────┴───────┴──────┴─────────┴─────────┴─────────┘
┌───────────┬─────────┬─────────┬────────┬─────────┬─────────┬────────┬─────────┐
│ Stat      │ 1%      │ 2.5%    │ 50%    │ 97.5%   │ Avg     │ Stdev  │ Min     │
├───────────┼─────────┼─────────┼────────┼─────────┼─────────┼────────┼─────────┤
│ Req/Sec   │ 2175    │ 2175    │ 3869   │ 4001    │ 3631    │ 555.19 │ 2174    │
├───────────┼─────────┼─────────┼────────┼─────────┼─────────┼────────┼─────────┤
│ Bytes/Sec │ 3.54 MB │ 3.54 MB │ 6.3 MB │ 6.52 MB │ 5.91 MB │ 904 kB │ 3.54 MB │
└───────────┴─────────┴─────────┴────────┴─────────┴─────────┴────────┴─────────┘

Req/Bytes counts sampled once per second.
36k requests in 10.08s, 59.1 MB read

┌─────────┬──────┬──────┬───────┬──────┬─────────┬─────────┬──────────┐
│ Stat    │ 2.5% │ 50%  │ 97.5% │ 99%  │ Avg     │ Stdev   │ Max      │
├─────────┼──────┼──────┼───────┼──────┼─────────┼─────────┼──────────┤
│ Latency │ 2 ms │ 2 ms │ 3 ms  │ 3 ms │ 2.08 ms │ 0.31 ms │ 10.98 ms │
└─────────┴──────┴──────┴───────┴──────┴─────────┴─────────┴──────────┘
┌───────────┬─────────┬─────────┬─────────┬─────────┬─────────┬────────┬─────────┐
│ Stat      │ 1%      │ 2.5%    │ 50%     │ 97.5%   │ Avg     │ Stdev  │ Min     │
├───────────┼─────────┼─────────┼─────────┼─────────┼─────────┼────────┼─────────┤
│ Req/Sec   │ 3755    │ 3755    │ 3945    │ 4015    │ 3920.82 │ 79.38  │ 3755    │
├───────────┼─────────┼─────────┼─────────┼─────────┼─────────┼────────┼─────────┤
│ Bytes/Sec │ 6.12 MB │ 6.12 MB │ 6.43 MB │ 6.54 MB │ 6.39 MB │ 129 kB │ 6.12 MB │
└───────────┴─────────┴─────────┴─────────┴─────────┴─────────┴────────┴─────────┘

Req/Bytes counts sampled once per second.
43k requests in 11.08s, 70.2 MB read

┌─────────┬──────┬──────┬───────┬──────┬─────────┬─────────┬──────────┐
│ Stat    │ 2.5% │ 50%  │ 97.5% │ 99%  │ Avg     │ Stdev   │ Max      │
├─────────┼──────┼──────┼───────┼──────┼─────────┼─────────┼──────────┤
│ Latency │ 2 ms │ 2 ms │ 3 ms  │ 3 ms │ 2.06 ms │ 0.28 ms │ 11.24 ms │
└─────────┴──────┴──────┴───────┴──────┴─────────┴─────────┴──────────┘
┌───────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│ Stat      │ 1%      │ 2.5%    │ 50%     │ 97.5%   │ Avg     │ Stdev   │ Min     │
├───────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ Req/Sec   │ 3887    │ 3887    │ 3965    │ 4033    │ 3969.73 │ 41.4    │ 3887    │
├───────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ Bytes/Sec │ 6.33 MB │ 6.33 MB │ 6.46 MB │ 6.57 MB │ 6.47 MB │ 66.9 kB │ 6.33 MB │
└───────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┘

Req/Bytes counts sampled once per second.
44k requests in 11.08s, 71.1 MB read

You can see it gets faster each run as TurboFan warms up and the JIT kicks in.

We can see that after JIT compilation our max latency is 11.24ms with the lowest requests/second over ten seconds being 3887.

On a somewhat unrelated tangent, I wanted to show something real quick. Logging in NodeJS is slow, even with the best of frameworks. How slow?

Well, using connect w/ log4js, logging a line like this with each request:

[2020-04-02T00:04:05.642] [DEBUG] express - ::1 - - "GET /embed?config=%7B%22tenantId%22%3A%22DTGXwVKr%22%2C%22apiHost%22%3A%22http%253A%252F%252Flocalhost%253A3001%22%2C%22urlId%22%3A%22http%253A%252F%252Flocalhost%253A63342%252Fwatchly-models%252Fscripts%252Flocal-test-lifetime.html%253F_ijt%253Dlh9s0h0fgv1t2n2ceibdot1c1g%22%7D HTTP/1.1" 200 1065 "http://localhost:63342/watchly-models/scripts/local-test-lifetime.html?_ijt=lh9s0h0fgv1t2n2ceibdot1c1g" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36"

It has an impact - surprisingly. So let's disable logging for a second.

I ran it twice, here's the second run.

┌─────────┬──────┬──────┬───────┬──────┬─────────┬─────────┬──────────┐
│ Stat    │ 2.5% │ 50%  │ 97.5% │ 99%  │ Avg     │ Stdev   │ Max      │
├─────────┼──────┼──────┼───────┼──────┼─────────┼─────────┼──────────┤
│ Latency │ 1 ms │ 1 ms │ 2 ms  │ 3 ms │ 1.15 ms │ 0.44 ms │ 13.68 ms │
└─────────┴──────┴──────┴───────┴──────┴─────────┴─────────┴──────────┘
┌───────────┬─────────┬─────────┬─────────┬─────────┬─────────┬────────┬─────────┐
│ Stat      │ 1%      │ 2.5%    │ 50%     │ 97.5%   │ Avg     │ Stdev  │ Min     │
├───────────┼─────────┼─────────┼─────────┼─────────┼─────────┼────────┼─────────┤
│ Req/Sec   │ 5051    │ 5051    │ 5467    │ 5623    │ 5429.28 │ 172.67 │ 5049    │
├───────────┼─────────┼─────────┼─────────┼─────────┼─────────┼────────┼─────────┤
│ Bytes/Sec │ 7.78 MB │ 7.78 MB │ 8.42 MB │ 8.67 MB │ 8.36 MB │ 266 kB │ 7.78 MB │
└───────────┴─────────┴─────────┴─────────┴─────────┴─────────┴────────┴─────────┘

Req/Bytes counts sampled once per second.
60k requests in 11.08s, 92 MB read

Wow! Our max latency isn't much better, but our average is cut in half and we go from 44k requests to 60k in the same window. Anyway, you probably don't want to disable your logging - I'd rather just deploy an extra server - so let's continue the benchmarks with logging enabled. Also, the impact is only this profound because what we're testing is so light.

Benchmark - EJS

┌─────────┬──────┬──────┬───────┬──────┬────────┬─────────┬──────────┐
│ Stat    │ 2.5% │ 50%  │ 97.5% │ 99%  │ Avg    │ Stdev   │ Max      │
├─────────┼──────┼──────┼───────┼──────┼────────┼─────────┼──────────┤
│ Latency │ 4 ms │ 4 ms │ 7 ms  │ 8 ms │ 4.4 ms │ 0.94 ms │ 19.83 ms │
└─────────┴──────┴──────┴───────┴──────┴────────┴─────────┴──────────┘
┌───────────┬─────────┬─────────┬─────────┬────────┬─────────┬────────┬─────────┐
│ Stat      │ 1%      │ 2.5%    │ 50%     │ 97.5%  │ Avg     │ Stdev  │ Min     │
├───────────┼─────────┼─────────┼─────────┼────────┼─────────┼────────┼─────────┤
│ Req/Sec   │ 1327    │ 1327    │ 2223    │ 2269   │ 2072.73 │ 286.91 │ 1327    │
├───────────┼─────────┼─────────┼─────────┼────────┼─────────┼────────┼─────────┤
│ Bytes/Sec │ 1.81 MB │ 1.81 MB │ 3.03 MB │ 3.1 MB │ 2.83 MB │ 391 kB │ 1.81 MB │
└───────────┴─────────┴─────────┴─────────┴────────┴─────────┴────────┴─────────┘

Req/Bytes counts sampled once per second.
23k requests in 11.08s, 31.1 MB read

┌─────────┬──────┬──────┬───────┬──────┬─────────┬─────────┬──────────┐
│ Stat    │ 2.5% │ 50%  │ 97.5% │ 99%  │ Avg     │ Stdev   │ Max      │
├─────────┼──────┼──────┼───────┼──────┼─────────┼─────────┼──────────┤
│ Latency │ 4 ms │ 4 ms │ 5 ms  │ 5 ms │ 4.11 ms │ 0.37 ms │ 12.36 ms │
└─────────┴──────┴──────┴───────┴──────┴─────────┴─────────┴──────────┘
┌───────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│ Stat      │ 1%      │ 2.5%    │ 50%     │ 97.5%   │ Avg     │ Stdev   │ Min     │
├───────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ Req/Sec   │ 2079    │ 2079    │ 2225    │ 2259    │ 2201.73 │ 57.87   │ 2078    │
├───────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ Bytes/Sec │ 2.83 MB │ 2.83 MB │ 3.04 MB │ 3.08 MB │ 3 MB    │ 79.2 kB │ 2.83 MB │
└───────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┘

Req/Bytes counts sampled once per second.
24k requests in 11.08s, 33 MB read

┌─────────┬──────┬──────┬───────┬──────┬─────────┬─────────┬──────────┐
│ Stat    │ 2.5% │ 50%  │ 97.5% │ 99%  │ Avg     │ Stdev   │ Max      │
├─────────┼──────┼──────┼───────┼──────┼─────────┼─────────┼──────────┤
│ Latency │ 4 ms │ 4 ms │ 5 ms  │ 5 ms │ 4.07 ms │ 0.31 ms │ 13.22 ms │
└─────────┴──────┴──────┴───────┴──────┴─────────┴─────────┴──────────┘
┌───────────┬─────────┬─────────┬─────────┬────────┬─────────┬─────────┬─────────┐
│ Stat      │ 1%      │ 2.5%    │ 50%     │ 97.5%  │ Avg     │ Stdev   │ Min     │
├───────────┼─────────┼─────────┼─────────┼────────┼─────────┼─────────┼─────────┤
│ Req/Sec   │ 2173    │ 2173    │ 2239    │ 2271   │ 2240.6  │ 28.83   │ 2173    │
├───────────┼─────────┼─────────┼─────────┼────────┼─────────┼─────────┼─────────┤
│ Bytes/Sec │ 2.97 MB │ 2.97 MB │ 3.05 MB │ 3.1 MB │ 3.06 MB │ 38.9 kB │ 2.96 MB │
└───────────┴─────────┴─────────┴─────────┴────────┴─────────┴─────────┴─────────┘

Req/Bytes counts sampled once per second.
22k requests in 10.07s, 30.6 MB read

Wow, a huge difference. We can only serve half the requests in the same amount of time and latency is doubled across the board.

It may seem crazy to optimize this. But the nature of FastComments - a widget hosted on third-party pages - means we can get flooded with tons of requests for uncached pages at once. Even with autoscaling this could cause an outage, or at least a hiccup. Being able to handle twice the uncached traffic in a spike is a big deal.

So let's take a step back. What if we also add an LRU cache to our template literal?

FastComments Embed Endpoint as String Literal w/ Cache

First it starts off not so great, since the JIT hasn't warmed up yet.

┌─────────┬──────┬──────┬───────┬──────┬─────────┬─────────┬──────────┐
│ Stat    │ 2.5% │ 50%  │ 97.5% │ 99%  │ Avg     │ Stdev   │ Max      │
├─────────┼──────┼──────┼───────┼──────┼─────────┼─────────┼──────────┤
│ Latency │ 2 ms │ 2 ms │ 4 ms  │ 5 ms │ 2.24 ms │ 0.72 ms │ 24.13 ms │
└─────────┴──────┴──────┴───────┴──────┴─────────┴─────────┴──────────┘
┌───────────┬────────┬────────┬─────────┬─────────┬─────────┬────────┬─────────┐
│ Stat      │ 1%     │ 2.5%   │ 50%     │ 97.5%   │ Avg     │ Stdev  │ Min     │
├───────────┼────────┼────────┼─────────┼─────────┼─────────┼────────┼─────────┤
│ Req/Sec   │ 2145   │ 2145   │ 3897    │ 4087    │ 3659    │ 546.44 │ 2145    │
├───────────┼────────┼────────┼─────────┼─────────┼─────────┼────────┼─────────┤
│ Bytes/Sec │ 3.5 MB │ 3.5 MB │ 6.35 MB │ 6.66 MB │ 5.96 MB │ 890 kB │ 3.49 MB │
└───────────┴────────┴────────┴─────────┴─────────┴─────────┴────────┴─────────┘

Req/Bytes counts sampled once per second.
40k requests in 11.07s, 65.6 MB read

But then!

┌─────────┬──────┬──────┬───────┬──────┬─────────┬────────┬──────────┐
│ Stat    │ 2.5% │ 50%  │ 97.5% │ 99%  │ Avg     │ Stdev  │ Max      │
├─────────┼──────┼──────┼───────┼──────┼─────────┼────────┼──────────┤
│ Latency │ 2 ms │ 2 ms │ 3 ms  │ 3 ms │ 2.06 ms │ 0.3 ms │ 11.41 ms │
└─────────┴──────┴──────┴───────┴──────┴─────────┴────────┴──────────┘
┌───────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│ Stat      │ 1%      │ 2.5%    │ 50%     │ 97.5%   │ Avg     │ Stdev   │ Min     │
├───────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ Req/Sec   │ 3849    │ 3849    │ 3967    │ 4037    │ 3970.8  │ 53.39   │ 3848    │
├───────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ Bytes/Sec │ 6.27 MB │ 6.27 MB │ 6.46 MB │ 6.58 MB │ 6.47 MB │ 86.9 kB │ 6.27 MB │
└───────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┘

Req/Bytes counts sampled once per second.
40k requests in 10.08s, 64.7 MB read

Wait, isn't that the same as our string literal test? Let's check again.

┌─────────┬──────┬──────┬───────┬──────┬─────────┬─────────┬──────────┐
│ Stat    │ 2.5% │ 50%  │ 97.5% │ 99%  │ Avg     │ Stdev   │ Max      │
├─────────┼──────┼──────┼───────┼──────┼─────────┼─────────┼──────────┤
│ Latency │ 2 ms │ 2 ms │ 3 ms  │ 3 ms │ 2.08 ms │ 0.36 ms │ 11.58 ms │
└─────────┴──────┴──────┴───────┴──────┴─────────┴─────────┴──────────┘
┌───────────┬────────┬────────┬─────────┬─────────┬─────────┬────────┬────────┐
│ Stat      │ 1%     │ 2.5%   │ 50%     │ 97.5%   │ Avg     │ Stdev  │ Min    │
├───────────┼────────┼────────┼─────────┼─────────┼─────────┼────────┼────────┤
│ Req/Sec   │ 3699   │ 3699   │ 3975    │ 4051    │ 3932.28 │ 96.96  │ 3699   │
├───────────┼────────┼────────┼─────────┼─────────┼─────────┼────────┼────────┤
│ Bytes/Sec │ 5.7 MB │ 5.7 MB │ 6.13 MB │ 6.25 MB │ 6.06 MB │ 150 kB │ 5.7 MB │
└───────────┴────────┴────────┴─────────┴─────────┴─────────┴────────┴────────┘

Req/Bytes counts sampled once per second.
43k requests in 11.08s, 66.6 MB read

Yup. Just res.send(`${some} stuff`) is faster than checking whether an item is in the cache and sending it back, in the case of our micro-benchmark.

Just for giggles, what if we used a regular Map for the cache instead of the fancy LRU cache?
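The Map-based variant is just a few lines - a sketch with hypothetical names, not the production code:

```javascript
// Plain-Map cache variant (hypothetical names). Unlike the LRU cache,
// nothing ever evicts entries, so this would grow unbounded in production.
const cache = new Map();

function getEmbedPage(cacheKey, render) {
  let html = cache.get(cacheKey);
  if (html === undefined) {
    html = render(); // e.g. the template-literal renderer
    cache.set(cacheKey, html);
  }
  return html;
}
```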

┌─────────┬──────┬──────┬───────┬──────┬─────────┬────────┬──────────┐
│ Stat    │ 2.5% │ 50%  │ 97.5% │ 99%  │ Avg     │ Stdev  │ Max      │
├─────────┼──────┼──────┼───────┼──────┼─────────┼────────┼──────────┤
│ Latency │ 2 ms │ 2 ms │ 3 ms  │ 3 ms │ 2.06 ms │ 0.3 ms │ 11.84 ms │
└─────────┴──────┴──────┴───────┴──────┴─────────┴────────┴──────────┘
┌───────────┬─────────┬─────────┬─────────┬────────┬─────────┬─────────┬─────────┐
│ Stat      │ 1%      │ 2.5%    │ 50%     │ 97.5%  │ Avg     │ Stdev   │ Min     │
├───────────┼─────────┼─────────┼─────────┼────────┼─────────┼─────────┼─────────┤
│ Req/Sec   │ 3779    │ 3779    │ 3977    │ 4023   │ 3961.73 │ 62.58   │ 3779    │
├───────────┼─────────┼─────────┼─────────┼────────┼─────────┼─────────┼─────────┤
│ Bytes/Sec │ 5.82 MB │ 5.82 MB │ 6.13 MB │ 6.2 MB │ 6.1 MB  │ 96.6 kB │ 5.82 MB │
└───────────┴─────────┴─────────┴─────────┴────────┴─────────┴─────────┴─────────┘

Req/Bytes counts sampled once per second.
44k requests in 11.08s, 67.1 MB read

At this point we can see that most of the time is not spent constructing the value, but enqueueing the response on the event loop and sending it over the wire.

So where do we go from here? Well, aside from making major architecture changes, let's revisit that logging thing. What if we disable logging for just that route? It's not a big deal - we have the logs in Nginx.

┌─────────┬──────┬──────┬───────┬──────┬─────────┬─────────┬──────────┐
│ Stat    │ 2.5% │ 50%  │ 97.5% │ 99%  │ Avg     │ Stdev   │ Max      │
├─────────┼──────┼──────┼───────┼──────┼─────────┼─────────┼──────────┤
│ Latency │ 0 ms │ 1 ms │ 1 ms  │ 1 ms │ 0.69 ms │ 0.49 ms │ 11.32 ms │
└─────────┴──────┴──────┴───────┴──────┴─────────┴─────────┴──────────┘
┌───────────┬───────┬───────┬─────────┬─────────┬─────────┬────────┬───────┐
│ Stat      │ 1%    │ 2.5%  │ 50%     │ 97.5%   │ Avg     │ Stdev  │ Min   │
├───────────┼───────┼───────┼─────────┼─────────┼─────────┼────────┼───────┤
│ Req/Sec   │ 9103  │ 9103  │ 9447    │ 9663    │ 9438.91 │ 147.37 │ 9102  │
├───────────┼───────┼───────┼─────────┼─────────┼─────────┼────────┼───────┤
│ Bytes/Sec │ 14 MB │ 14 MB │ 14.6 MB │ 14.9 MB │ 14.5 MB │ 227 kB │ 14 MB │
└───────────┴───────┴───────┴─────────┴─────────┴─────────┴────────┴───────┘

Req/Bytes counts sampled once per second.
104k requests in 11.09s, 160 MB read

Wow! Double the requests, half the average latency. But why? I originally thought the reason was that this request has a giant query string that gets logged. Observe - if we remove that config from the query string and keep the request logs:

┌─────────┬──────┬──────┬───────┬──────┬─────────┬─────────┬──────────┐
│ Stat    │ 2.5% │ 50%  │ 97.5% │ 99%  │ Avg     │ Stdev   │ Max      │
├─────────┼──────┼──────┼───────┼──────┼─────────┼─────────┼──────────┤
│ Latency │ 2 ms │ 2 ms │ 3 ms  │ 3 ms │ 2.04 ms │ 0.25 ms │ 11.12 ms │
└─────────┴──────┴──────┴───────┴──────┴─────────┴─────────┴──────────┘
┌───────────┬─────────┬─────────┬─────────┬─────────┬─────────┬────────┬─────────┐
│ Stat      │ 1%      │ 2.5%    │ 50%     │ 97.5%   │ Avg     │ Stdev  │ Min     │
├───────────┼─────────┼─────────┼─────────┼─────────┼─────────┼────────┼─────────┤
│ Req/Sec   │ 4035    │ 4035    │ 4347    │ 4423    │ 4324.64 │ 103.49 │ 4034    │
├───────────┼─────────┼─────────┼─────────┼─────────┼─────────┼────────┼─────────┤
│ Bytes/Sec │ 4.68 MB │ 4.68 MB │ 5.04 MB │ 5.13 MB │ 5.02 MB │ 119 kB │ 4.68 MB │
└───────────┴─────────┴─────────┴─────────┴─────────┴─────────┴────────┴─────────┘

Req/Bytes counts sampled once per second.
48k requests in 11.08s, 55.2 MB read

A little better, but not much! Logging is just slow.

I think what we are going to do is what bought us the lowest latency - rewriting in Rust! Just kidding. Here's the plan:

  1. Don't bother caching in NodeJS. It thrashes the GC, and fetching from the cache takes longer than reconstructing the string once the JIT has warmed up.
  2. Don't log these requests in the Node layer.
  3. Template literals.
  4. Cache at the proxy layer.
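Caching at the proxy layer (#4) can be a small piece of Nginx config. This is an illustrative sketch - the paths, zone name, TTL, and upstream port are assumptions, not our production values:

```nginx
# One shared on-disk cache zone for embed pages.
proxy_cache_path /var/cache/nginx/embed levels=1:2 keys_zone=embed:10m
                 max_size=1g inactive=10m;

server {
    location /embed {
        proxy_cache embed;
        proxy_cache_key $request_uri;   # includes the config query string
        proxy_cache_valid 200 1m;       # short TTL - widget config can change
        proxy_pass http://127.0.0.1:3000; # the Node backend
    }
}
```

This keeps the hot path out of Node entirely: repeat requests for the same URL never reach the application layer.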

I wouldn't apply these optimizations everywhere, but we really are desperate to make FastComments as fast as it can be in certain areas. Engineering is all about trade-offs. Also, be a good developer: if you're going to write code that's tough to deal with, explain why:

worker.use('/embed', require('./routes/embed')); // OPTIMIZATION This is here so we don't log the requests, which is slow.
worker.use(express.static(path.join(__dirname, 'public'))); // OPTIMIZATION This is here so we don't log the requests, which is slow.
worker.use(log4js.connectLogger(logger, { level: 'debug' }));

Also, interestingly, having the express.static router first slows things down by about 30%.

We could also probably get better performance by optimizing some of the code returned by the /embed endpoint, but that's for another day. :)

Mon Feb 17 2020

Efficient Data Structures for MMO Game Back-ends in Java


... and how I almost rewrote the backend in Rust, but didn't.

Java gurus beware - none of this is news. However, it is interesting, and I ran into it recently for the first time while dealing with large heaps. In most Java applications I've worked on, heap data was either very short-lived (per-request) or only allocated temporarily (batch processing).

The Context

However, for the past six months I've been working on TDWorld. TDWorld can best be described as a global-scale tower defense MMO. That's a mouthful, but that's what it is. It's a location-based game where you claim buildings and connect them. For example, let's say you claim your house and then the supermarket down the street. You connect them, and if your creeps make it from your house to the supermarket you get in-game currency. The game plays on a real map - using real roads and buildings. Your enemies can take a building between your house and that supermarket, level it up, and stop your creeps from reaching their destination. Hence the strategy part.

The Problem

We have a few requirements. We need to be able to simulate vehicle traffic, in the form of in-game creeps, and also support queries that allow towers to efficiently find the nearby creeps to attack. On top of this we need to sync state with your phone. It's not a small task.

On top of that, we don't have a lot of money. I started with MapBox for the traffic routing and quickly racked up thousands of dollars in API requests during testing. This was a big no-no, so we now run our own instance of the routing server.

For a quick example, here's a very early version of the client, which rendered an Android TextureView over MapBox's SurfaceView by using reflection to grab its contents and overwrite it (Android devs will know how dirty of a hack this is). But it lets us demonstrate the game. Don't worry, I'm getting to the technical stuff soon.

This example is playing the game in Redwood City. Ignore the graphics. Just see how the creeps follow the roads and the towers have to shoot them. Everyone on their phone should see the same thing as you.

TDWorld Example!

This demonstrates the kind of queries we need to support. For this we make heavy use of David Moten's rtree.

That's not so much what this post is about - I can write a separate post about immutable rtrees. This one is about reducing memory, because oh man did we have a problem with that.

Technical Juicy Stuff

Here are all the problems I had, as of last Friday:

  1. Seeding was super slow. On launch we want to have NPCs in every major city in the US (before we launch globally) proportional to the population of said city, randomly distributed from the city's center. It took almost 30 minutes to seed this data and build the indexes, which was too slow to iterate and develop on.
  2. Saving and restoring game state was super slow. Like a minute to checkpoint to disk and ten minutes to restore.
  3. We were using way too much memory. For around 300k towers and a few thousand creeps we saw heap sizes of 14gb. This could equal hundreds of gigabytes in the real world - needlessly expensive! Also, the connections themselves will take a lot of memory, so we can't spend all our resources on the game engine alone.

Efficiency is important here because of scale. A poorly coded backend could mean six months of work wasted if we go down on launch day and nobody comes back to the app - and we don't have the money or time to scale horizontally. So to "get it out there" we need to have a backend that'll scale on one machine, up to a million concurrent connections, and handle tens of millions of actors (creeps) moving around concurrently. For concurrency we shard in-memory geographically, but I'll cover this in a separate post. If we need to scale horizontally it's just a matter of building an orchestrator to move the shards between instances - and no we can't back this with a traditional database. Millions of concurrent writes with a high write concern basically mean that the app layer needs to have its own in-memory database that checkpoints to disk (and that's what we have).

There was a point where I blamed Java for my poor performance, and I decided on a trial rewrite in Rust. I figured I'd just port the Java code 1:1 and see what the performance difference was.

I ported ALL of the structures and then the seeding code. It was a super painful two days, and I am not a Rust fan yet. I think I get Rust; it's just stressful and not much fun to write.

Anyway, the Rust code blew away the Java code. It could seed the data for a thousand cities, and create all the NPC users, in seconds.

Then I realized I had forgotten to pull out my profiler. N00b mistake. A quick dig into JDK Mission Control showed a number of problems.

I had read the hot path in the Java code a hundred times and couldn't see why the seeding was so slow. In fact, the code was so trivial that I just didn't think to use a profiler. I suppose I blamed Java because I had been fighting the garbage collector for a while, but I should have known better, because switching to Shenandoah fixed that.

(By the way, Shenandoah is awesome - we went from multi-second pauses to never seeing more than a few hundred milliseconds of stop-the-world pause, even during serialization/checkpointing to disk.)

So anyway. The profiler. Here is the cause of my first problem, the seeding being slow. During seeding we enqueue a request to route between the source and target towers - and there can be hundreds of thousands of these requests. The downstream routing server is in C++ but it can still only handle so many requests concurrently so we queue them up in memory in the game backend using a custom task executor.

And that executor - uses a ConcurrentLinkedQueue.

Router Class Fields

So what's wrong with that? Well, every time we enqueue a request we check the count of the queued items to see if we should execute that one right away:

Router Class Oopsie

The problem is that ConcurrentLinkedQueue is implemented as a linked list. We call size() on it A LOT, and each call iterates through the whole queue. The JDK can't easily do better here - maintaining an exact count would add contention to an otherwise lock-free structure. All we had to do was use an AtomicInteger to keep track of the count ourselves.
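The fix is language-neutral, so here it is sketched in JavaScript (the real code uses Java's ConcurrentLinkedQueue plus an AtomicInteger; all names here are illustrative). The key point: don't ask a linked-list queue for its size - track it yourself.

```javascript
// Linked-list queue that tracks its own element count, so size() is
// O(1) instead of walking the whole list on every call.
class CountingQueue {
  constructor() {
    this.head = null;
    this.tail = null;
    this.count = 0; // updated on every enqueue/dequeue
  }
  enqueue(value) {
    const node = { value, next: null };
    if (this.tail) this.tail.next = node; else this.head = node;
    this.tail = node;
    this.count++;
  }
  dequeue() {
    if (!this.head) return undefined;
    const { value } = this.head;
    this.head = this.head.next;
    if (!this.head) this.tail = null;
    this.count--;
    return value;
  }
  size() {
    return this.count; // was: traverse the list - O(n) per call
  }
}
```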

Router Class Fix

So that's #1 solved. The Java version was now faster than Rust.

So now, #2.

Here's where we talk about data structures, sorry.

Ok - I wrote the title of this and then realized it takes a while to actually get to what the title's about. Thanks for getting this far.

Most of what we need to store in memory is the routing information. Each route - for example between your house and the grocery store - consists of a bunch of steps.

These steps are basically objects with the properties {duration, speed, lat/lng}. And we store A LOT of these. Whenever a creep is primed to move again, we get the next step and then update its position in the R-tree.

We had originally represented geographical position information using doubles. This gives us a very high degree of accuracy - but at a cost!
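To put rough numbers on that trade-off (assuming ~111km per degree of latitude - my estimate, not a figure from this post), Math.ulp gives the spacing between adjacent representable values:

```java
// Rough worst-case position error from storing coordinates as float vs double.
// Math.ulp(x) is the gap to the next representable value at magnitude x.
class FloatAccuracy {
    static final double METERS_PER_DEGREE = 111_000.0; // approx, at the equator

    static double errorMeters(float degrees) {
        return Math.ulp(degrees) * METERS_PER_DEGREE;
    }

    static double errorMeters(double degrees) {
        return Math.ulp(degrees) * METERS_PER_DEGREE;
    }
}
```

At a latitude of 76.4 the float error comes out to under a meter, while a double's is around a nanometer - so single-digit meters of accuracy given up is a safe upper bound anywhere in the ±180 range.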

Before we optimize, let's look at some stats. We can optimize forever, so we have to quantify when we're done. Let's say 5gb of heap is where we'd be happy with our testing dataset.

Note that the creep/tower count is just some info about one shard that I have running; it doesn't mean that much.

Before Optimization:


Size on disk: 2.322gb

Memory: 14gb

Serialization (checkpoint and app startup) Time

To Disk (nvme): 69540ms

From Disk: 10 minutes

Wow, over a minute to save! That may not sound too bad, but it's taking that long because it's doing something intensive. The game engine will already check how many threads are free and pause one of the shards if needed to serialize to disk - but we don't want a shard paused for that long.

So first things first: a double on the JVM is 8 bytes, a float is 4. Also, we store a lot of Strings but don't really make use of the String class. Let's change everything to float and byte[].

GitHub Changes

Not too bad. Now let's measure what the effect was.

After changing double to float and String to byte[]

CREEPS=2910 TOWERS=263278 DURATION=202.0

Size on disk: 1.49gb

Memory: 12gb

Serialization (checkpoint and app startup) Time

To Disk: 64443ms

From Disk: 9.5 minutes

Alright, we gave up about 10 meters of accuracy (by changing the geographical coordinate representation from double to float) and got back 2gb of RAM and minor serialization improvements.

Not great.

However, one thing I realized was that each Route has many Step[] arrays. Each Step just contains a few floats and a Position object now (which also has two floats).

Route Obj Before Optimization

So basically our route information is a series of floats. Each entry in that Step[] array also costs eight bytes for the pointer, and every Step and Position object carries its own object header on top. That really adds up. Let's get rid of those objects.

Route Obj Using float[]

Then we just need some utilities to access this newly formatted data:

Route Obj Utils

So what in the Kotlin is going on here? Well, we're basically serializing our data at rest into an array. This is much more compact than a bunch of objects and is a common tactic for databases written in Java - use primitives as much as possible.
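Since the screenshots above aren't copy-pasteable, here's a self-contained sketch of the technique - the stride layout and accessor names are assumptions, not the real Route code:

```java
// Pack each step's four floats - duration, speed, lat, lng - into one flat
// float[] per route, replacing a Step[] of small objects with primitives.
class PackedRoute {
    static final int STRIDE = 4;

    static float[] pack(float[][] steps) {
        float[] packed = new float[steps.length * STRIDE];
        for (int i = 0; i < steps.length; i++) {
            System.arraycopy(steps[i], 0, packed, i * STRIDE, STRIDE);
        }
        return packed;
    }

    static int stepCount(float[] route) { return route.length / STRIDE; }
    static float duration(float[] route, int step) { return route[step * STRIDE]; }
    static float speed(float[] route, int step)    { return route[step * STRIDE + 1]; }
    static float lat(float[] route, int step)      { return route[step * STRIDE + 2]; }
    static float lng(float[] route, int step)      { return route[step * STRIDE + 3]; }
}
```

A route of n steps now costs one array header plus 16n bytes, instead of n Step objects, n Position objects, and a Step[] full of pointers - each with its own header.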

This means that this:

  "route": [
    {
      "duration": 2,
      "speed": 5,
      "position": {
        "lat": 76.4,
        "lng": -170
      }
    },
    {
      "duration": 4,
      "speed": 3,
      "position": {
        "lat": 76.2,
        "lng": -170.1
      }
    }
  ]

becomes this:

  "route": [2, 5, 76.4, -170, 4, 3, 76.2, -170.1]

Not only is this much nicer for the garbage collector, it's much better for serialization in general - and not only to disk, but over the network too.

You'd think the JVM might already optimize things like this - it does plenty of magical things. But behold.

After redesigning the Route storage

                 Before       After
Size on disk     2.322gb      873mb
Memory           14gb         1.8gb
To Disk Time     69540ms      4183ms
From Disk Time   10 minutes   24 seconds

Amazing. Our shard size on disk has gone down from over two gigabytes to 873mb. Note that the app is still running in the background and at 527314 creeps the size on disk is still less than a gigabyte.

Our memory usage went from 14gb to less than two.

Startup time went from ten minutes to 24 seconds, and writing to disk went from over a minute to four seconds.

This was a super productive past couple days and I'm very happy with the result. I'm hoping that soon we can show off the full game and prepare for launch.

Next I'll be doing a load test with simulating millions of active clients and having the backend continually keep them in sync. I suspect that will be much harder. The websocket library I'm using currently dies at 1k connections on my Windows machine (that's 32 threads and 64gb of ram). Hopefully it performs well in a POSIX environment.

If it doesn't, we'll most likely front the game engine with Nginx/Nchan. That's what we did for FastComments and it works well.


Sat Feb 08 2020

Being on the Homepage of Hacker News

Article size is 4.2 kB - 187 kB and is a 3 min read

I have to say, this was a big moment for me. Gathering feedback for FastComments was terrifying at first. I started small on Reddit and got around 25 comments. When I first opened that thread I almost didn't want to read the feedback. What if it was terrible? What if it made me think I was wasting my time?

But, for the most part it was helpful. I think people found some issues right away that I fixed that night. So then the next step was Hacker News.

The post is here: Show HN: FastComments

Over 111 votes and 65 comments! Wow! I had gotten over the anxiety of sharing the project at this point. I posted the link on Friday night on a whim, expecting to only get a few up votes and comments. Instead I woke up and it was #13 on the homepage:

On the front page!

Heck, I was happy it wasn't flagged or something. You never know what will happen.

Overall it was really helpful. I got feedback on:

  • The pricing model
  • The design
  • The marketing site text
  • The feature set and also various competitors I hadn't known about, including open source projects

Also, I got to see how much traffic you actually get when this happens. I think the peak load on the server (not enough traffic to enable horizontal scaling yet, although that's coming) was around 0.3, and this isn't a big server. The app is well optimized though, so it was expected not to fall over. At the peak I was seeing something like five requests a second. Not terribly high.

People also left hundreds of comments on the Demo Page, so they were pretty engaged.

You'll see the comment count on the demo page is less now since a lot of them expired, which is one of the features of FastComments.

Using the Analytics page of the dashboard on FastComments we can see when the spike occurred:


I posted the link on Jan 31st, and on Feb 1st over four thousand people viewed the demo page, 67 people left comments, 5 people voted, and 141 commenter accounts were created.

I also got one paying customer. I added the "add FastComments to a website" link to the demo page a little too late.

Another self plug - the Daily shows the overall traffic to the homepage and user retention time: Email Top Email Bottom

As you can see there were 12k hits to the homepage, around 11k of them unique. It also shows session time. It looks like someone left the tutorial open in one of my apps, so that's why that's at the top.

I also got a ton of requests from random scripts that people were writing. One person sent thousands of requests at once as some kind of test.

I saw requests from HTTP clients written in Go, Ruby, and Node. Who knows what people were doing. If you make an API people will abuse it.

Overall this was a really good and useful experience. I don't think you should expect to get your users from HN but the feedback was useful in my case.

Sat Feb 08 2020

Fixing Nginx 502 Bad Gateway With proxy_pass During Deployments

Article size is 2.6 kB and is a 2 min read

I use Nginx a lot. One issue I've never been able to figure out: when I use it as a proxy to a backend like Node or Java, Nginx thinks the service is down for ~10 seconds whenever I restart/redeploy it.

Recently I realized that Nginx implicitly creates an "upstream" declaration when you use proxy_pass.

View documentation for upstream here.

The thing about ngx_http_upstream_module is that it supports a configuration parameter called fail_timeout. If an upstream server is found to be unavailable, Nginx won't send calls to it for this duration. The upstream module also lets you define many backends for a single proxy, meaning you basically create a load balancer.

When you use proxy_pass without explicitly using ngx_http_upstream_module you can think of that as creating a load balancer with only one available node.

So, if you restart your service and Nginx tries to proxy a request to it - and it fails - Nginx will consider that server unavailable for fail_timeout. And since you only have one server in your "load balancer" Nginx can't forward that request to anything. So it returns a 502.

What you can do is set fail_timeout to 0. Then Nginx will never consider the backend unavailable. I'd only do this if you have a single upstream server. Even then it's risky in a high-load environment, because if the server becomes overwhelmed it won't get a grace period to recover.

However, if you're in a high load environment I trust you have more than one upstream node :).

So here's what you came for:

upstream backend {
    server localhost:3001 fail_timeout=0;
}

server {
    location / {
        proxy_pass http://backend;
    }
}

Wed Feb 05 2020

How We Localized FastComments Without Slowing It Down

Article size is 3.6 kB and is a 3 min read

To start - note that I would not recommend the techniques outlined here to most companies. FastComments is an unusual beast where we are optimizing for milliseconds. Most web applications will be fine with adding 100ms of latency as a trade-off for localization support... but not us.

So what did we do for FastComments?

What I've seen in web apps so far is that if you want to localize the client, it does a lot of parsing of browser/system settings and then fetches a JSON file containing a map of translation id to localized string. The application is usually structured in a way that this network request is part of the framework setting itself up, meaning you can't fetch your translations concurrently with your application state/data. The translations do get cached - but we need to optimize for the initial page load.

FastComments already doesn't use a framework (other than what a native browser provides) and fetching initial state is relatively simple for us so we have a lot of freedom.

We've already gotten to a state where our /comments endpoint is no longer RESTful in that it returns more than just comments. It returns configuration for rendering the widget, for starters. We can do this since fetching the comment objects is very simple and fast - if it took seconds we'd have to have a separate request for initializing the widget so we could show some content quickly.

So we already have a fast endpoint that follows a convention where we can tell the API to include additional information via query params. For example, for localization we added an '?includei10n' flag; when it's present, locale detection is performed and the appropriate set of translations is sent back with the response. This response is gzipped by Nginx just like the JS file. We only pass the flag on the initial /comments call from the widget.

So that's basically it. The client script dropped a few hundred bytes in size (gzipped) and the API response grew a little bit - but we didn't add any additional requests to load the widget.

So far it seems this has increased our response time by a couple milliseconds. However after repeated requests the JIT seems to make this difference even smaller.

Another thing we looked at was addressing the endpoint that loads the initial iframe. Right now it's rendered with EJS. It turns out that EJS is indeed pretty fast. We tried replacing the template engine with just string concatenation - it's hardly faster than using EJS. Request response time still hovered around 20ms (including dns resolution). Initial requests with EJS seem to spike around 30 milliseconds though, so we may explore this again in the future.

Sorry if you were looking for something super fancy. Sometimes the fastest solution is also the simplest. :)

Mon Feb 03 2020

How Optimized Threaded Pagination Works

Article size is 3.2 kB - 173 kB and is a 3 min read

For FastComments one of our challenges was dealing with long comment threads. For example, our demo page already has over 200 comments at time of writing. Long threads make us miss our SLA of 40ms for a /comments api call. We still miss this occasionally for other reasons, but this was a big hit.

Ideally, the call finishes in 30ms. That's our internal goal.

There are a lot of easy solutions to pagination, nothing revolutionary there. However it gets a little bit more fun when you throw nested replies (threads) into the mix.

Threads Example

Reason being, if you cop out and do it the easy (and potentially bad) way that works 90% of the time, you end up with something that's very slow the other 10% of the time. Being Fast, we don't want that. Some examples which don't make us happy:

  1. Fast - Bad Interaction - not showing the threads and instead just showing the "top-level" comments.
  2. Mostly Fast - Bad Interaction - paginating by the top-level comments and always querying all replies (could potentially show hundreds instead of just 30 a page like we want to).

What we wanted to do is paginate by "content breaks". To illustrate let's break up the example from earlier:

Threads Example Sliced Up

A page can occur at any point in the thread. This gives us a lot of flexibility. But how do you optimize this? An obvious solution is to do what we did originally - query the whole set of comments - then just run DFS through the threads and short-circuit when you reach your pagination limit.

We take it a step further and only do this operation during writes. So any vote, comment, or moderator action triggers a job that updates the pageNumber at the comment level. Those from the SQL world can think of this as a stored procedure that runs "ON UPDATE".

Of course, for us it's about 20 lines of JavaScript. This job so far seems to execute in around 300ns for threads with ~300 comments. This includes reading from and writing to the DB.

Some OK Code
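Their job is ~20 lines of JavaScript we can't paste here, but the idea can be sketched in Java - the types and names below are mine:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the write-time job: walk the thread depth-first and
// stamp each comment with the page it falls on, so reads can filter by pageNumber.
class Comment {
    final List<Comment> replies = new ArrayList<>();
    int pageNumber;
}

class Paginator {
    static void assignPages(List<Comment> topLevel, int pageSize) {
        int[] index = {0}; // DFS position across the whole thread
        for (Comment c : topLevel) {
            visit(c, index, pageSize);
        }
    }

    private static void visit(Comment c, int[] index, int pageSize) {
        c.pageNumber = index[0] / pageSize + 1; // a page break can land mid-thread
        index[0]++;
        for (Comment reply : c.replies) {
            visit(reply, index, pageSize);
        }
    }
}
```

Because the page numbers are stored on the comments, the read path is a simple "where pageNumber = N" query with no tree traversal at request time.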

Another option would have been to use a graph database. However, wanting to build a stable product and not having the operational experience with a graph database made that not a good idea.

Sat Jan 25 2020

Making an Embeddable Widget Safe from Cross-Site Attacks

Article size is 2.3 kB - 26.6 kB and is a 2 min read

Recently we launched FastComments, and one challenge was security. We wanted our widgets to be easy to use, but also to protect people who write comments from CSRF attacks. The widget needs to be authenticated to take actions like commenting and voting, but we wanted the same session cookie to authenticate you on our main site - so if you're logged into the widget to comment, you can manage your account, and vice-versa. This allows for some nice interactions, like being able to go from the widget directly to your account page.

The obvious solution is to use an iframe, right? That way the cookies can have SameSite=None and Secure=true set, and we simply omit the CORS headers from our HTTP responses. Well, not so fast. Iframes are no stranger to me, but usually they're either a specific size, or absolutely positioned and taking up the whole viewport. The FastComments widget can have a fixed horizontal size, but it expands vertically - so how do you deal with that?

So what I decided to do was just use cross-frame communication. The script running inside the iframe can post a message to the parent with its current size:

Height Broadcast Code

Then the parent embeddable code, the snippet that our customers include on their sites, simply captures that event and updates the iframe:

Height Capture Code

And that's it! There is one downside which is that there's another network request to load the iframe, however we can optimize that easily. Happy coding.

Fri Jan 24 2020

How was FastComments built?

Article size is 9.3 kB - 103 kB and is a 8 min read

So let's start with some gloating.

Recently launched, FastComments was built in a month.

For a high-level view this includes the marketing site, blog, client-side comment plugin, integration with Stripe, data import/exports, comment moderation, and a basic analytics page. Here's the github activity for the repository (yes there's only one, which I'll get into):

FastComments Github Stats

Anyway, enough gloating. As you can tell by the name the product had to be fast. That was a major goal. The second goal was UX.

The main driver for this was that, during my month-long stint in China, I thought about how I'd write about my trip. Then I thought about how I'd add comments to my blog. No biggie, right - just use Disqus? Well, based on some reading I decided against that and figured I'd build my own.

I didn't want people to go and comment and have to deal with popups to login to Google, Facebook, etc. In fact I didn't want users to deal with any popups at all, or loading screens, high memory usage, or any of that.

FastComments is light and easy. Those are the goals.

People who have worked with me know I'm a huge fan of type safety, so I seriously considered Java for this project. I even considered bringing the data store largely into memory and attempting to get comment fetch times down to single-digit milliseconds. An LRU cache at least would have worked, but I decided to hold off on that for now.

Right now - FastComments is a monolith (except for the blog which is in its own repo and uses the same static site generation library I wrote for this site).

That monolith is written in NodeJS. I know, I know. When people hear NodeJS monolith they shudder. But hear me out, the code is very simple, easy to maintain, and fast. That doesn't mean it's not sophisticated - there are a lot of different crons with distributed locks to offload work from the event loop and schedule things.

Probably when I finally learn Vertx or Spring MVC I'll dump Express. Right now just my game backends and simple APIs are in Java, I haven't invested the time into learning a Java web framework yet.

Speaking of crons, here are the crons FastComments uses.

  • Daily expiration of exported files on disk
  • Daily submission of usage of the product to Stripe.
  • Hourly comment notifications (to commenters and site owners).
  • Minutely batch exports
  • Minutely batch imports

Here's what the code for one of the crons looks like. This is the first one listed above.

FastComments Batch Job

The import/export ones probably caught your eye. I've worked with enough enterprise software with terrible imports and exports, where you sit there waiting while some backend ten services away from the web layer synchronously processes your data. The page won't finish loading until it does, and it'll probably time out. You know the feeling.

So imports and exports in FastComments are asynchronous. When you upload a file to migrate from another platform we enqueue a job and a cron checks for jobs every minute. You'll get an email when it's done, and this way separate servers can be deployed to handle imports rather than impacting the site if need be.

Importing or Exporting creates a job object like this:

FastComments Batch Job Object

Let's take a step back and talk about the Fast part of FastComments. How fast is it?

It seems comment api calls take around 30-50ms. We make one of these when you load the page, and it's the only API request we make to get all the comments and any other information (like, is this request for a valid tenant, etc). So all authentication, validation, and data fetching happens in those ~30ms. Limiting the number of requests reduces latency. This is currently with no load, so we'll see how things scale. But so far it seems to scale well based on my testing.

This also includes DNS lookups and such, which suck up a few milliseconds here and there, as well as the Nginx reverse proxy. Also, some middleware is eating CPU time, because Express has to check whether each request falls into those routes. ExpressJS executes middleware sequentially, and right now the order is static routes -> API routes. I'm considering moving the static routes last and using Nginx to cache the static assets.

This project is very IO bound. It does very little on the CPU pretty much everywhere so that's why Node is perfect.

The layers of abstraction have also been kept way down. In fact, I wrote zero API middleware for this project. This leads to some code duplication, but it's very minor - just a couple lines here and there. I really wanted this to perform well at scale without spending a ton of money, and I was willing to sacrifice some of my sanity to do it.

Performance also means you have to be very aware of your dependencies at runtime - like when you are fetching data from a source like a database. In FastComments we only ever ask Mongo for what we need (and yes, we're using MongoDB). Projections are used everywhere we fetch data - like SELECT a, b, c FROM in the SQL world - we tell MongooseJS to .select('a b c') everywhere we query objects. This helps scaling feature-wise. Increasing sizes of objects will still increase index size, but it won't add overhead to our existing queries/features.

The main aspect of the product - fetching that comment list and rendering it - is fast even with Mongo because we just make a query like "get me all of the comments for this page id". Mongo's doing very little in this scenario, and in fact I think most of our time is spent in serialization in the JS layer. That is one downside of using Mongoose for this, however it's still very fast.

So, Mongo works fine since we don't have to do many joins and where we do it's in less important areas of the product like admin dashboards and such. Those pages still load very fast, which I'll get into, so I'm not worried.

Client side we use zero frameworks. That's required to keep the file size of the client below 15kb (it is 2.6kb gzipped at time of writing). This isn't very hard - even larger apps like Github don't use client-side frameworks anymore. document.querySelector/querySelectorAll works well enough that you don't need jQuery.

The client library is a mix of raw HTML manipulation and a bunch of optimizations. Nothing fancy really, but that's the idea! You can read the source code here:

In the admin dashboard (My Account, Analytics, Comment Moderation) we use mostly server-side rendering. EJS plays that role and it does it very well. The syntax integrates well with Webstorm and it's pretty quick. SSR can be tricky with state management but since one goal is to keep the complexity down it results in very maintainable code.

I'm also hoping that people that have to use the admin dashboard - like for moderating comments - appreciate how fast it is.

When I say I use EJS, I mean I really use EJS. Also, since there is very little abstraction, I've gone with a very flat file structure in this project. I have projects with lots of abstraction, very nested folders, etc, so this was really some fresh air.

For example, here's my "views" folder:

FastComments EJS Files

Deployment wise, FastComments uses the same Orchestrator that all WinrickLabs projects use. I haven't written about the Orchestrator yet but I will. It doesn't support autoscaling yet but that's coming soon.

There are lots more features coming too so keep an eye out for those :) I think the first one is going to be Avatars since every single competitor has that. I just have to make sure it doesn't slow things down.

Anyway, it was really fun to build something with performance and usability in mind. Surprisingly, we don't get enough of that today. I think that's going to change, though. People are going to get tired of how slow their software is. Enjoy FastComments.

Fri May 17 2019

Pixmap Release: Launching Rewind

Article size is 4.2 kB - 22 MB and is a 4 min read

Today I'm excited to announce a new feature for Pixmap. While I had planned on mostly focusing on the native iOS port in Swift a couple friends brought up an idea that sounded too fun to implement, so here it is.

Pixmap rewind demo

Think version control for pixel art, except it's just linear. It's useful and fun to use. If you buy Pixmap Ultimate you can use rewind to download snapshots or rewind an image.

When you rewind an image it modifies the history like this:

Pixmap rewind how it works

How it works is pretty simple. Luckily, we chose to store every change everyone makes, which made this possible. When you start a rewind the application simply paginates the event history from the backend, and when you rewind an image we use the same architecture as drawing: all of the changes get added to the same queue, processed by the same servers, etc. The only real difference is that a lock is created on the image, and released once the rewind is done, so that no editing can happen during the rewind.

The main challenge client side was performance.

If you drag the bar all the way to the beginning of time and start dragging it forward that is very "light" since each "tick" of the bar is simply applying changes to the scene from the last tick. However, when you drag backwards you have to re-render the whole scene from the beginning of time to where the user dragged the bar. Doing this on the UI thread resulted in lots of lag so I had to implement an "observer" thread that waits for things to render, does the processing, and pushes the result to the UI thread. I also ended up changing how concurrency worked in the renderer to reduce CPU usage, making both rewind and drawing smoother.

There were three challenges server side.

The first was that, since I'm using websockets, the original plan was for the client to send one event and then the server would paginate through the data and push the pages to the client. This didn't work because, no matter the frame size or delay between messages, the connection could get clogged - resulting in high memory usage server side, delays in WS handshakes, and ultimately the connection closing. The solution was to implement traditional pagination over websockets, where the client says "give me the next 10k after this id". At that point you don't even need websockets, but I already had the infrastructure in place for other features.

The second problem was the pagination itself. We use MongoDB, and for Pixmap it runs on a $5 VPS. That instance can handle tens of thousands of people drawing at once, but with my first implementation one rewind could destroy it. You see, Mongo doesn't optimize offset pagination queries like "find X, sort by Y, skip 1000, limit 100". It will literally find all of X, sort it all, and then select the page from the result, even though we're sorting in the same direction as the index (this is another thing about Mongo - or B-Trees, I guess - that I'll write up later: optimizing for sort direction). This would kill the instance.

The solution was to still paginate, but without "skip N" - instead we "start after this _id", since Mongo's ids are naturally sortable (assuming you let the DB create the id and don't create it application-side). This dropped CPU usage during a rewind of "Mid Peninsula" from around 80% to a negligible amount.
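The access pattern can be sketched with an in-memory stand-in - here a TreeMap plays the role of Mongo's _id index, and the names are illustrative:

```java
import java.util.List;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Sketch of seek-based pagination over an ordered key: "give me the next N
// after this id" seeks straight to the start of the page, unlike offset-based
// "skip 1000, limit 100" which finds and discards everything before it.
class EventStore {
    private final TreeMap<String, String> eventsById = new TreeMap<>();

    void put(String id, String event) {
        eventsById.put(id, event);
    }

    // Up to `limit` events with id strictly greater than `afterId`.
    List<String> pageAfter(String afterId, int limit) {
        return eventsById.tailMap(afterId, false).values().stream()
                .limit(limit)
                .collect(Collectors.toList());
    }
}
```

The client keeps the last id it saw and passes it back as `afterId` for the next page, so every page is an index seek plus a short scan.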

Anyway, I had fun building it and hope people enjoy it. Cheers.

Sun May 05 2019

What is winrickLabs?

Article size is 639 B and is a 1 min read

winrickLabs is a place for executing ambitious ideas. Things that may be deemed "too hard" or "not profitable" or "too fun" end up here. This doesn't mean elegant solutions are rejected. In fact they are encouraged. But the end product should be something to be proud of and the journey should have a story worth telling.

Sun May 05 2019

How does this blog work?

Article size is 1.3 kB - 8.39 kB and is a 1 min read

For those curious, I decided against having this blog be dynamically generated at runtime (think Wordpress, Pico, etc). It really doesn't need to be interactive, doesn't need any client-side JavaScript, and doesn't even need a database. However, writing HTML manually for each post is a pain to maintain and not worth my time.

So to get the best of both worlds - maintainability and simplicity - I do dynamically generate the blog but only at build time.

What this means is that I write the posts in Markdown and check them into the same GitHub repo. The repository contains a NodeJS app, and when a GitHub event fires, the build server runs the NodeJS app once to generate the site. Then it sends the build result off to the appropriate server running nginx.

Blog diagram