Board game app: overcoming poor performance in monolithic function

I’ve been working on a hobby project for about half a year. The guts of the app work. The issue is finding the right platform to host those guts. I built initially for Supabase and everything works there. I’ve got a working database, working email notifications, user logins, everything I need. Well, almost. The last bit, and the sore spot, is compute.

Each of my games is a program which is itself a functional core. You pass in config, players, events, and commands, and the program computes one of two things: a snapshot of the game state, or (if you provided any commands) a snapshot along with any newly created events. The entire premise of the app is built on this model. Thus, I need to host a monolithic function.
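To illustrate the shape of that model, here is a minimal sketch. None of these names come from the actual app; the reducer, command handler, and payload fields are all made up for illustration.

```javascript
// The reducer folds one event into the game state.
const reduce = (state, event) => ({...state, score: state.score + event.points});

// A command is checked against the current state and yields zero or more new events.
const decide = (state, command) =>
  command.type === "play" ? [{points: command.points}] : [];

// The whole program as a pure function: replay the events, apply any
// commands, and return the snapshot plus any newly created events.
function simulate(config, events, commands = []) {
  const replayed = events.reduce(reduce, {score: 0, ...config});
  const created = commands.flatMap(c => decide(replayed, c));
  const snapshot = created.reduce(reduce, replayed);
  return {snapshot, events: created};
}

const out = simulate({}, [{points: 2}, {points: 3}], [{type: "play", points: 5}]);
// out.snapshot.score === 10; out.events has the one newly created event
```

The point is that the whole game is this one pure function, which is why it has to be hosted as a monolith.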

I described this journey here:
https://www.reddit.com/r/serverless/comments/12ztu2w/developing_on_the_free_tier_until_you_cant_or_how/?utm_source=share&utm_medium=web2x&context=3

I put that function in a Cloudflare Worker. It couldn’t even stand up. I put it in the Postgres database (150KB) and it worked, for a while. The first game was a card game, so it was simple. The next was a full-fledged modern strategy board game, so more compute was needed. Eventually, the Postgres function timed out. I didn’t want to immediately leap to the paid tier, so I moved the code to a Supabase function, and that was hit or miss. Sometimes it succeeds; other times it fails with out-of-memory or too-much-compute errors. I hit free-tier walls.

So I looked elsewhere, for a more traditional VPS where the code could be hosted 24/7, a typical server. I ported the code to a Deno app and got it up and running on Fly.io. So I succeeded in at least putting it somewhere where limits didn’t prevent it from finishing its compute.

The whole function is pure compute. It takes in data, it spits out data. It makes no network calls. This seems simple to me.

But when I send it the payload, it takes about 6s to process, which is too long. I tried upping the memory and CPUs using Fly’s scale features, hoping it would help, but even though I scaled (vertically) to performance-16x, it made almost no difference. It saved 0.5-1s.

What I don’t understand is what happens when I host the computation locally. When I hit it locally on the Deno server, it takes 2s to process. When I hit it locally in the browser, it takes 0.5-1s. It’s very fast. All I have is a MacBook Air, M1.

I wasn’t certain whether some of the delay had to do with network travel time, so I ruled that out: I added timestamps to the log and verified that the majority of the time is spent in actual compute.

I bought into the idea of serverless hoping it would simplify things, but so far all I’ve done is port working code from one spot to another, trying to find a place with enough power to process a simple reduction.

The work I’m doing isn’t serious. I process a fold over events. The events are pretty lean (think the size of a Tweet at most). It runs very fast in the browser. It only slows down when I log snapshots of state (thus, the main overhead is from displaying data, not processing it).

I don’t want to keep shifting things around but I’m struggling to figure out how to get my monolithic function to run. It’s the size it is as a matter of design constraints, so splitting the monolith is not on the table. Size details are in my linked post.

Any thoughts on how to handle compute? I thought using a VPS (e.g., Fly) would work, and it does, but it’s too slow to be practical. Whether I process a state snapshot (a blob of JSON representing game state) plus 1 event, or no snapshot and 225 events, it takes about 6s. I can’t understand why a single-record fold performs as poorly as a 200+ record fold.


Hey @mlanza

0.5-1s locally on an M1 sounds quite long. Maybe you need to profile/benchmark the function to see which parts of it are slow? How much CPU/memory does it need? Benchmarking libraries can run a function thousands of times and calculate the average CPU and memory use per invocation.
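A crude version of that idea looks like the sketch below (a real benchmarking library would also handle warmup, variance, and memory measurement; this just averages wall-clock time):

```javascript
// Run a function many times and report the average wall-clock time per
// invocation. performance.now() is available in both Deno and Node.
function bench(fn, runs = 1000) {
  const t0 = performance.now();
  for (let i = 0; i < runs; i++) fn();
  return (performance.now() - t0) / runs; // average ms per invocation
}

// Example: measure one suspect step in isolation.
const avg = bench(() => JSON.stringify({a: 1, b: [1, 2, 3]}));
console.log(`~${avg.toFixed(4)}ms per call`);
```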

If scaling CPU didn’t shave much time, maybe the problem isn’t the CPU and something else is the bottleneck? Could it also be that the payload is heavy, so it takes time to send to the browser and render?


0.5-1s is under maximum workload, 225 events processed. But as mentioned, whether I process 1 event or 225 on the Fly node, it takes 6s. That’s what’s boggling my mind: 1 piece of work or 225 pieces, and 6s either way? And vertically scaling the machine has almost no impact. I’m at a loss right now.

I’d suspect JavaScript parsing was the issue if this were a serverless function which ramps up from zero on each invocation, but I moved to a long-running instance to eliminate the parsing that happens during a cold start.

Also, this isn’t a single small function but rather a whole program wrapped in a function. In a manner of speaking, it’s a console app. The crux of the design problem is doing exactly this. This is the part of the problem that isn’t up for grabs. I’m proceeding with a program which stays intact and looking for what in the stack/platform can be varied. Why does scaling the instance make no difference?

I’ve written plenty of web apps at this size. It’s just that most are packaged and delivered to the client. In this instance I’m shifting a similarly sized app to the server and just not understanding why the server is substantially weaker than the clients I’ve used.

Other details:
https://www.reddit.com/r/serverless/comments/133z6io/performant_compute_for_monolithic_functions_at/

Also, to be clear, I’m not blaming Fly. I’ve had a rather nice experience with your CLI tool. This is me finding the right way to host a monolith in a compute service.

Hi @mlanza

Does the app connect to anything externally? For instance a database or API?

That could explain why it takes so long.

The fact that scaling CPU had almost no impact means that your bottleneck is not computation, most likely it’s some sort of IO or timer.


It does not. It only runs compute with the data it’s given.

I ran it in an Azure Function last night. Similar response time. I would expect it to be slower in a serverless function where it has to parse the JavaScript on every invocation.

That is why I chose a container. I thought it would at least cut out the parse step.

Here’s everything in its entirety, the function and the call.

For what it’s worth, the game performed fine all the way up to around 220-ish moves as a function inside the Postgres database. It was almost as if adding the 225th move was too much for it.

Hey there,

This is puzzling. Information that may be helpful:

  • Can you provide the non-minified version of the code?
  • Can you provide your Dockerfile?

I’ve blotted out those endpoints. Of course, without pulling the dependencies, this is of limited value.

I’ve deployed as a bundle in some spots, but on Fly it uses ES6 imports under Deno, so it’s not a bundle.

// @ts-nocheck
import {
  app,
  get,
  post,
  options,
  redirect,
  contentType,
} from "https://denopkg.com/syumai/dinatra/mod.ts";
import {log, count, comp, uident, date, period, elapsed} from "~~~/lib/atomic/core.js";
import * as g from "~~~/lib/game.js";
import mexica from "~~~/games/mexica/core.js";
import ohHell from "~~~/games/oh-hell/core.js";

const games = {
  "mexica": mexica,
  "oh-hell": ohHell
}

const headers = {
  "access-control-allow-origin": "*",
  "access-control-allow-methods": "GET, POST, OPTIONS",
  "access-control-allow-credentials": true,
  "access-control-allow-headers": "Origin, X-Requested-With, Content-Type, Accept",
  "access-control-max-age": 10,
  "content-type": "application/json"
};

app(
  post("/mind/:game", async function(req){
    const {params} = req;
    const {game, seats, config, events, commands, seen, snapshot} = params;
    const id = uident(5);
    const start = date();
    log("req", game, id, count(events), snapshot ? "snapshot" : "", JSON.stringify(req), "\n\n");
    const simulate = comp(g.effects, g.simulate(games[game]));
    try {
      const results = simulate(seats, config, events, commands, seen, snapshot);
      const stop = date();
      const ms = elapsed(period(start, stop)).valueOf();
      log("resp", game, id, `${ms}ms`, JSON.stringify(results), "\n\n");
      return [200, headers, JSON.stringify(results)];
    } catch (ex) {
      // JSON.stringify on an Error yields "{}", so serialize the message explicitly.
      return [500, headers, JSON.stringify({error: ex instanceof Error ? ex.message : String(ex)})];
    }
  }),
  options("/mind/:game", function(){
    return [200, headers, ""];
  }),
  get("/", () => "<h1>The Game Mind lives!</h1>"),
  get("/info", () => [
    200,
    contentType("json"),
    JSON.stringify({ app: "mind", version: "1.1.0" })
  ])
);

log("The Game Mind lives!");

This is going to be painful, but I think you have a few options for debugging this:

  • Add logging that shows how long each function took, until you find the slow one(s).
  • Profile your code with something like 0x under Node.js (the `0x` package on npm).
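The first option could look something like this hypothetical wrapper (none of these names come from the actual app):

```javascript
// Wrap a function so each call logs its elapsed wall-clock time.
function timed(name, fn) {
  return (...args) => {
    const t0 = performance.now();
    const result = fn(...args);
    console.log(`${name}: ${(performance.now() - t0).toFixed(1)}ms`);
    return result;
  };
}

// Wrap the suspect functions, then watch the logs to see where the 6s goes.
const square = timed("square", n => n * n);
square(12); // logs the elapsed time and returns 144
```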

Profiling locally is probably good enough. An Apple M1 machine is very powerful, much more than our instances in many cases. Maybe there’s something about aarch64 that makes your app faster, too; our instances are x86_64.

If your compute-heavy operations don’t run in parallel, then even using a performance-16x instance on Fly won’t change anything compared to a performance-1x instance. If this were Rust, I’d tell you to use rayon to parallelize your code. With Node.js or Deno, you’ll probably have to split the work (if possible) and run it in multiple threads concurrently.

Anyway, even locally you should be able to find out what’s slow. As @Elder mentioned, 500ms-1s is already very slow for an M1.


Thank you for taking the time to provide some feedback. I am investigating many things. Unfortunately, the nature of this design does not allow for parallelization.

Have you tried passing much smaller changes through and keeping the full state in memory? One neat thing about Fly Machines is that you can route users to the exact same memory space, which means game state can exist entirely in server memory and the clients can just ship mutations. Roughly like this example:


Initially, I designed the compute to do event sourcing, replaying everything from zero, and everything worked great up until around 225 events. Then it started choking the Postgres function which was performing the compute.

So I redesigned to allow snapshots of game state to be periodically cached and to perform only a fraction of the event sourcing, from the snapshot plus the remaining events, but even that didn’t help.
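In sketch form (illustrative names, not the actual code), the snapshot optimization amounts to starting the fold from the cached state instead of from zero:

```javascript
// Instead of replaying every event from the beginning, start the fold from a
// cached snapshot and apply only the events recorded after it.
function currentState(reduce, initial, events, snapshot = null) {
  const start = snapshot ? snapshot.state : initial;
  const remaining = snapshot ? events.slice(snapshot.upTo) : events;
  return remaining.reduce(reduce, start);
}

const add = (state, e) => state + e;
const events = [1, 2, 3, 4];
currentState(add, 0, events);                      // full replay: 10
currentState(add, 0, events, {state: 3, upTo: 2}); // snapshot + tail: also 10
```

Either path yields the same state; the snapshot just shortens the fold.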

Ever since I’ve been shopping for another compute service.

I treat objects/arrays as immutable, never mutating them (since Records and Tuples have not yet landed in JavaScript). So in each frame of my reduction, I have to recompute the game state object. I reuse as much of the data as I can, the same way Clojure does with its persistent data structures.
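For example, a spread-based update copies only the path being changed and shares every untouched branch by reference (a sketch with made-up state fields):

```javascript
// Hypothetical game state: a board and some players.
const state = {board: {tiles: [1, 2, 3]}, players: {a: {score: 0}, b: {score: 0}}};

// Only the path being changed is copied; `board` is reused by reference.
const next = {...state, players: {...state.players, a: {...state.players.a, score: 5}}};

next.board === state.board;     // true: untouched branch is shared
next.players === state.players; // false: changed branch was copied
```

So each frame of the fold allocates new objects only along the changed path, while the old state remains intact.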

I’ll look at your article. I’m open to anything. But I’m trying low hanging fruit first before trying major design pivots.

One thing you might try is running it in Docker locally on your Mac and throttling resources.

If you profile your code, you may find an obvious bottleneck. Or it might just be a million little things. Parsing and generating JSON has surprised me when I’ve had CPU issues; it’s more expensive than I would have guessed.
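A quick way to sanity-check that cost (the payload here is made up, roughly 225 tweet-sized events as described earlier in the thread):

```javascript
// Time JSON round-trips of a representative payload: 225 lean events.
const blob = {events: Array.from({length: 225}, (_, i) => ({id: i, text: "x".repeat(140)}))};

const t0 = performance.now();
for (let i = 0; i < 100; i++) JSON.parse(JSON.stringify(blob));
console.log(`100 round-trips: ${(performance.now() - t0).toFixed(1)}ms`);
```

If this alone accounts for a meaningful slice of the request time, the serialization is worth attacking before the fold itself.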


At one point I was concerned that the JSON.stringify used for logging might pose a cost, but I removed it and it made little difference.

I can definitely add logging to key spots in my compute to time things.

One thing about scaling on Fly: I assumed the scale commands were near-immediate, but I wasn’t sure, so I routinely restarted the app after such a change. Was the restart needed? I’m thinking no.

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.