Long cold starts (FLAME)

My FLAME cold starts take a really long time - like around 20 seconds. I’m using them to run headless browsers (not super unlike https://worldpagespeed.fly.dev/), and I basically want my FLAME nodes to only do that.

I don’t understand exactly what all factors into cold start boot times. My suspicion is that my main app just has a lot of dependencies that take time to compile, even if they aren’t going to be used by the FLAME node.

Is that true? If so, would it then make sense to just spin up a separate Fly app specifically to run headless browsers? I see that with FLAME.FlyBackend we also have the option of pointing at whatever Docker image we want.

#flame

What is the size of your app image? There’s nothing to compile or build on the FLAME node, since it launches the prebuilt Docker image from the parent. Are you certain the time is the cold start and not something app-specific, like the time to load your headless chromedriver process? worldpagespeed cold starts are in the 5-10s range, and most of that is the actual time to start chromedriver and start driving the browser. You can also look into starting with a warm pool by setting min to something above zero, and setting min_idle_shutdown_after so the pool can idle down below min when no work is needed. That way deploys don’t cause users to hit a cold pool.
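A warm pool along those lines might be configured roughly like this. This is only a sketch: the pool name MyApp.BrowserRunner and all the numbers are made up for illustration, and the right values depend on your traffic.

```elixir
# Child spec for the parent app's supervision tree.
# MyApp.BrowserRunner is a hypothetical pool name; all values are illustrative.
{FLAME.Pool,
 name: MyApp.BrowserRunner,
 # Boot one runner when the pool starts, so the first request after a deploy
 # does not have to wait for a cold start.
 min: 1,
 max: 10,
 max_concurrency: 10,
 # Runners above min shut down after this much idle time.
 idle_shutdown_after: :timer.minutes(5),
 # The min runners are also allowed to idle down, just on a longer timer.
 min_idle_shutdown_after: :timer.minutes(30)}
```

The tradeoff is cost versus latency: a longer min_idle_shutdown_after keeps a machine running (and billed) for longer after the last request.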


Hey Chris!

I actually chatted with you about this project a bit at ElixirConf :smiley: I have borrowed quite a bit from the world page speed repo in setting this up and I massively appreciate your responding here.

So the size of the app image is 1.08GB. I got that by doing fly auth docker and then pulling the image locally. Not sure there’s another way to see it. Is that a lot?

The actual machine boot time is pretty snappy. I get this message in the logs when I do a cold start:

Machine created and started in 3.514s

But then after that it seems like it takes another ~10 seconds for it to start handling the request. Is that to be expected? WPS always seemed faster to me.

The request processing itself takes about 4 seconds, which is OK and has as much to do with the site being tested as anything else. So the total cold start round trip is hovering in the 18 second range.

Seems like the most likely solution down the road will be to just keep at least one BrowserRunner FLAME node warm at all times.

That image size is reasonable and about what worldpagespeed is. I can’t say where the time is spent, but I would check your app supervision tree to ensure you aren’t waiting on extra services that you don’t need. You can also enable more logging to see if the reported FLAME times match what you are experiencing:

    {FLAME.Pool,
     ...
     log: :info}

Then your Fly logs will show times like:

syd [info]19:20:24.116 [info] runner connect: completed in 8493ms
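To compare that number against what the caller actually sees, one option is to time a no-op FLAME.call from an IEx session on the parent while the pool is cold. A minimal sketch, assuming a pool named MyApp.BrowserRunner (a placeholder, substitute your own pool name):

```elixir
# Run from IEx on the parent node while no runner is warm.
# MyApp.BrowserRunner is a placeholder pool name.
{micros, result} =
  :timer.tc(fn ->
    FLAME.call(MyApp.BrowserRunner, fn ->
      # Executes on the FLAME node. It does no browser work, so the elapsed
      # time is mostly machine boot plus runner connect.
      :ok
    end)
  end)

IO.puts("FLAME.call returned #{inspect(result)} in #{div(micros, 1000)}ms")
```

If this round trip is close to the logged runner connect time, the remaining delay is probably in the work itself (for example, starting chromedriver) rather than in the cold start.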

Thank you Chris.

This definitely gives me enough visibility to be able to fine-tune this in the future.

For the supervision tree, I’m actually using your children function and have it set like this:

    children(
      always: TwfexWeb.Telemetry,
      parent: {DNSCluster, query: Application.get_env(:twfex, :dns_cluster_query) || :ignore},
      parent: {Phoenix.PubSub, name: Twfex.PubSub},
      # Start the Finch HTTP client for sending emails
      parent: {Finch, name: Twfex.Finch},
      # Start a worker by calling: Twfex.Worker.start_link(arg)
      # {Twfex.Worker, arg},
      flame: Twfex.Browser.HeadlessDriver,
      parent: Twfex.Repo,
      parent: {AshAuthentication.Supervisor, otp_app: :example},
      parent:
        {FLAME.Pool,
         name: Twfex.BrowserRunner,
         min: 0,
         max: 10,
         max_concurrency: 10,
         idle_shutdown_after: :timer.hours(2)},
      # Start to serve requests, typically the last entry
      parent: TwfexWeb.Endpoint
    )

So in theory the only thing running in the FLAME node is telemetry and the BrowserRunner.
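For anyone reading along, the filtering that a children helper like this performs can be sketched as follows. This is a simplified, hypothetical version, not the actual function from the worldpagespeed repo: it takes the parent/FLAME decision as an argument (in a real app that would likely come from checking whether FLAME.Parent.get() returns nil), and plain atoms stand in for the real child specs.

```elixir
defmodule ChildFilter do
  # Hypothetical helper: keeps only the child specs that apply to this node.
  # parent? is true on the main app and false on a FLAME runner.
  def children(child_specs, parent?) do
    Enum.flat_map(child_specs, fn
      # Started everywhere.
      {:always, spec} -> [spec]
      # Started only on the parent app.
      {:parent, spec} -> if parent?, do: [spec], else: []
      # Started only on FLAME runners.
      {:flame, spec} -> if parent?, do: [], else: [spec]
    end)
  end
end

# Atoms used as stand-ins for real child specs.
specs = [
  always: :telemetry,
  parent: :pubsub,
  flame: :headless_driver,
  parent: :repo,
  parent: :endpoint
]

IO.inspect(ChildFilter.children(specs, false))
# prints [:telemetry, :headless_driver]
```

Applied to the list above, a FLAME node would start only TwfexWeb.Telemetry and Twfex.Browser.HeadlessDriver, so the supervision tree itself should not be adding much to the boot time.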