Elixir/Phoenix app unreachable - [PM05] failed to connect to machine: gave up after 15 attempts

Hi all,

So, I seem to have tangled myself in a bit of a knot and have been seeing a lot of errors like the ones below in our logs lately.

We’re running an Elixir Phoenix (LiveView) app with Oban, dns_cluster, and Postgres.

Appreciate any help/insight on this, thank you. :blush:

14:57:18 [PM05] failed to connect to machine: gave up after 15 attempts (in 8.467671389s)

14:57:20 [PM05] failed to connect to machine: gave up after 15 attempts (in 8.292206616s)

We also see errors like these:

[PR03] could not find a good candidate within 1 attempts at load balancing. last error: [PM05] failed to connect to machine: gave up after 15 attempts (in 8.301339195s)

And our app is unreachable.

Dockerfile:

# Find eligible builder and runner images on Docker Hub. We use Ubuntu/Debian
# instead of Alpine to avoid DNS resolution issues in production.
#
# https://hub.docker.com/r/hexpm/elixir/tags?page=1&name=ubuntu
# https://hub.docker.com/_/ubuntu?tab=tags
#
# This file is based on these images:
#
#   - https://hub.docker.com/r/hexpm/elixir/tags - for the build image
#   - https://hub.docker.com/_/debian?tab=tags&page=1&name=bullseye-20230612-slim - for the release image
#   - https://pkgs.org/ - resource for finding needed packages
#   - Ex: hexpm/elixir:1.19.4-erlang-28.2-debian-bullseye-20251117-slim
#
ARG ELIXIR_VERSION=1.19.4
ARG OTP_VERSION=28.2
ARG DEBIAN_VERSION=bullseye-20251117-slim

ARG BUILDER_IMAGE="hexpm/elixir:${ELIXIR_VERSION}-erlang-${OTP_VERSION}-debian-${DEBIAN_VERSION}"
ARG RUNNER_IMAGE="debian:${DEBIAN_VERSION}"

FROM ${BUILDER_IMAGE} as builder

# install build dependencies
RUN apt-get update -y && apt-get install -y build-essential git \
  && apt-get install -y libsodium-dev && apt install -y libvips-dev && apt-get clean && rm -f /var/lib/apt/lists/*_*

# prepare build dir
WORKDIR /app

# install hex + rebar
RUN mix local.hex --force && \
    mix local.rebar --force

# make bumblebee cache dir
RUN mkdir /app/.bumblebee

# set build ENV
ENV MIX_ENV="prod"
ENV BUMBLEBEE_OFFLINE=false
ENV BUMBLEBEE_CACHE_DIR="/app/.bumblebee"

# install mix dependencies
COPY mix.exs mix.lock ./

# install build dependencies
RUN apt-get update -y && apt-get install -y build-essential git nodejs npm \
  && apt-get clean && rm -f /var/lib/apt/lists/*_*

RUN mix deps.get --only $MIX_ENV
RUN mkdir config

# copy compile-time config files before we compile dependencies
# to ensure any relevant config change will trigger the dependencies
# to be re-compiled.
COPY config/config.exs config/${MIX_ENV}.exs config/

RUN mix deps.compile

COPY priv priv
COPY priv/dict/eff_large_wordlist.txt priv/dict/eff_large_wordlist.txt

COPY lib lib

COPY assets assets

# compile assets
RUN mix assets.deploy

# Compile the release
RUN mix compile

# Download and cache the NSFW detection model
RUN mix run --no-start -e 'Application.ensure_all_started(:exla); Application.ensure_all_started(:bumblebee); Mosslet.AI.NsfwImageDetection.load()'

# Changes to config/runtime.exs don't require recompiling the code
COPY config/runtime.exs config/

COPY rel rel
RUN mix release

# start a new build stage so that the final image will only contain
# the compiled release and other runtime necessities
FROM ${RUNNER_IMAGE}

RUN apt-get update -y && apt-get install -y libstdc++6 openssl libncurses5 libsodium-dev locales \
  && apt install -y libvips-dev libheif-examples && apt-get clean && rm -f /var/lib/apt/lists/*_*

# Set the locale
RUN sed -i '/en_US.UTF-8/s/^# //g' /etc/locale.gen && locale-gen

ENV LANG en_US.UTF-8
ENV LANGUAGE en_US:en
ENV LC_ALL en_US.UTF-8
ENV ECTO_IPV6 true
ENV ERL_AFLAGS "-proto_dist inet6_tcp"

WORKDIR "/app"
RUN chown nobody /app

ENV BUMBLEBEE_CACHE_DIR="/app/.bumblebee"

# set runner ENV
ENV MIX_ENV="prod"
ENV BUMBLEBEE_OFFLINE=true


# Only copy the final release from the build stage
COPY --from=builder --chown=nobody:root /app/_build/${MIX_ENV}/rel/mosslet ./
COPY --from=builder --chown=nobody:root /app/.bumblebee/ ./.bumblebee

USER nobody

CMD ["/app/bin/server", "start"]

fly.toml

# fly.toml app configuration file generated for mosslet on 2024-06-29T13:57:04-04:00
#
# See https://fly.io/docs/reference/configuration/ for information about how to use this file.
#

app = "mosslet"
kill_signal = "SIGTERM"
swap_size_mb = 512
primary_region = "ewr"

[deploy]
  release_command = "/app/bin/migrate"

[env]
  DNS_CLUSTER_QUERY = "mosslet.internal"
  PHX_HOST = "mosslet.com"
  PORT = "8080"
  BUMBLEBEE_CACHE_DIR = "/app/.bumblebee"
  

[http_service]
  internal_port = 8080
  force_https = true
  min_machines_running = 1
  auto_stop_machines = "suspend"
  auto_start_machines = true
  processes = ["app"]

  [http_service.concurrency]
    type = "connections"
    hard_limit = 500
    soft_limit = 400

[[vm]]
  memory = "4gb"
  cpu_kind = "shared"
  cpus = 2

/rel/overlays/bin/server

#!/bin/sh
cd -P -- "$(dirname -- "$0")"
PHX_SERVER=true exec ./mosslet start

env.sh.eex

#!/bin/sh

# configure node for distributed erlang with IPV6 support
export ERL_AFLAGS="-proto_dist inet6_tcp"
export ECTO_IPV6="true"
export DNS_CLUSTER_QUERY="${FLY_APP_NAME}.internal"
export RELEASE_DISTRIBUTION="name"
export RELEASE_NODE="${FLY_APP_NAME}-${FLY_IMAGE_REF##*-}@${FLY_PRIVATE_IP}"

application.ex

defmodule Mosslet.Application do
  @moduledoc false

  use Application
  require Logger

  @impl true
  def start(_type, _args) do
    if Mosslet.Platform.native?() do
      Desktop.identify_default_locale(MossletWeb.Gettext)
      Mosslet.Platform.Config.ensure_data_directory!()
    end

    unless Mosslet.Platform.native?() do
      Logger.add_backend(Sentry.LoggerBackend)
      Oban.Telemetry.attach_default_logger()
      Mosslet.ObanReporter.attach()
    end

    children = build_children()

    opts = [strategy: :one_for_one, name: Mosslet.Supervisor]
    Supervisor.start_link(children, opts)
  end

  @impl true
  def config_change(changed, _new, removed) do
    MossletWeb.Endpoint.config_change(changed, removed)
    :ok
  end

  defp build_children do
    if Mosslet.Platform.native?() do
      native_children()
    else
      web_children()
    end
  end

  defp native_children do
    [
      MossletWeb.Telemetry,
      {Phoenix.PubSub, name: Mosslet.PubSub},
      MossletWeb.Presence,
      {Task.Supervisor, name: Mosslet.BackgroundTask},
      {Finch, name: Mosslet.Finch},
      ExMarcel.TableWrapper,
      Mosslet.Extensions.AvatarProcessor,
      Mosslet.Extensions.BannerProcessor,
      Mosslet.Extensions.MemoryProcessor,
      Mosslet.Repo.SQLite,
      Mosslet.Vault.Native,
      Mosslet.Session.Native,
      Mosslet.Sync,
      MossletWeb.Endpoint,
      MossletWeb.Desktop.Window.child_spec()
    ]
  end

  defp web_children do
    flame_parent = FLAME.Parent.get()

    [
      {Fly.RPC, []},
      Mosslet.Repo.Local,
      {Fly.Postgres.LSN.Supervisor, repo: Mosslet.Repo.Local},
      Mosslet.Vault,
      !flame_parent &&
        {DNSCluster, query: Application.get_env(:mosslet, :dns_cluster_query) || :ignore},
      MossletWeb.Telemetry,
      {Phoenix.PubSub, name: Mosslet.PubSub},
      MossletWeb.Presence,
      {Task.Supervisor, name: Mosslet.BackgroundTask},
      {Finch, name: Mosslet.Finch},
      {Finch, name: Mosslet.OpenAIFinch},
      ExMarcel.TableWrapper,
      Mosslet.Extensions.AvatarProcessor,
      Mosslet.Extensions.BannerProcessor,
      Mosslet.Extensions.MemoryProcessor,
      Mosslet.Extensions.URLPreviewServer,
      Mosslet.Timeline.Performance.TimelineCache,
      Mosslet.Notifications.EmailNotificationsProcessor,
      {Mosslet.Notifications.EmailNotificationsGenServer, []},
      {Mosslet.Notifications.ReplyNotificationsGenServer, []},
      {Mosslet.Timeline.Performance.TimelineGenServer, []},
      {Task.Supervisor, name: Mosslet.StorjTask},
      {PlugAttack.Storage.Ets, name: MossletWeb.PlugAttack.Storage, clean_period: 3_600_000},
      {Oban, oban_config()},
      {Mosslet.Extensions.PasswordGenerator.WordRepository, %{}},
      Mosslet.Security.BotDefense,
      Mosslet.Security.BotDetector,
      {FLAME.Pool,
       name: Mosslet.MediaRunner,
       min: 0,
       max: 5,
       max_concurrency: 10,
       min_idle_shutdown_after: :timer.minutes(5),
       idle_shutdown_after: :timer.minutes(2),
       log: :info},
      !flame_parent && MossletWeb.Endpoint,
      !flame_parent &&
        {Mosslet.DelayedServing,
         serving_name: NsfwImageDetection,
         serving_fn: fn -> Mosslet.AI.NsfwImageDetection.serving() end}
    ]
    |> Enum.filter(& &1)
  end

  defp oban_config do
    primary_region = System.get_env("PRIMARY_REGION")
    fly_region = System.get_env("FLY_REGION")

    cond do
      is_nil(primary_region) or is_nil(fly_region) ->
        Logger.info("Oban running in dev/test (no FLY_REGION set). Activated.")
        Application.fetch_env!(:mosslet, Oban)

      primary_region == fly_region ->
        Logger.info("Oban running in primary region. Activated.")
        Application.fetch_env!(:mosslet, Oban)

      true ->
        Logger.info("Oban disabled when running in non-primary region.")

        [
          repo: Mosslet.Repo,
          queues: false,
          plugins: false,
          peer: false,
          notifier: Oban.Notifiers.PG
        ]
    end
  end
end

runtime.exs

import Config

if System.get_env("MOSSLET_DESKTOP") == "true" do
  config :phoenix_live_view,
    debug_heex_annotations: false,
    debug_tags_location: false,
    debug_attributes: false

  Mosslet.Platform.Config.ensure_data_directory!()

  config :mosslet, Mosslet.Repo.SQLite,
    database: Mosslet.Platform.Config.sqlite_database_path(),
    pool_size: 5,
    journal_mode: :wal,
    cache_size: -64_000,
    temp_store: :memory,
    synchronous: :normal

  config :mosslet, MossletWeb.Endpoint,
    adapter: Bandit.PhoenixAdapter,
    http: [port: 0],
    server: true,
    secret_key_base: Mosslet.Platform.Config.generate_secret(),
    render_errors: [
      formats: [html: MossletWeb.ErrorHTML, json: MossletWeb.ErrorJSON],
      layout: false
    ],
    pubsub_server: Mosslet.PubSub,
    live_view: [signing_salt: Mosslet.Platform.Config.generate_salt()]
end

# config/runtime.exs is executed for all environments, including
# during releases. It is executed after compilation and before the
# system starts, so it is typically used to load production configuration
# and secrets from environment variables or elsewhere. Do not define
# any compile-time configuration in here, as it won't be applied.
# The block below contains prod specific runtime configuration.

# ## Using releases
#
# If you use `mix release`, you need to explicitly enable the server
# by passing the PHX_SERVER=true when you start it:
#
#     PHX_SERVER=true bin/Mosslet start
#
# Alternatively, you can use `mix phx.gen.release` to generate a `bin/server`
# script that automatically sets the env var above.
if System.get_env("PHX_SERVER") do
  config :mosslet, MossletWeb.Endpoint, server: true
end

if config_env() == :prod do
  config :flame, :terminator, log: :info
  config :flame, :backend, FLAME.FlyBackend

  config :flame, FLAME.FlyBackend,
    token: System.fetch_env!("FLY_API_TOKEN"),
    env: %{
      "DATABASE_URL" => System.get_env("DATABASE_URL"),
      "RELEASE_COOKIE" => System.fetch_env!("RELEASE_COOKIE")
    }

  config :mosslet, dns_cluster_query: System.get_env("DNS_CLUSTER_QUERY")

  # Configure plug_attack
  config :mosslet, plug_attack_ip_secret: System.get_env("PLUG_ATTACK_IP_SECRET")

  database_url =
    System.get_env("DATABASE_URL") ||
      raise """
      environment variable DATABASE_URL is missing.
      For example: ecto://USER:PASS@HOST/DATABASE
      """

  maybe_ipv6 = if System.get_env("ECTO_IPV6") in ~w(true 1), do: [:inet6], else: []

  config :mosslet, Mosslet.Repo.Local,
    # ssl: true,
    url: database_url,
    pool_size: String.to_integer(System.get_env("POOL_SIZE") || "10"),
    socket_options: maybe_ipv6,
    connect_timeout: 30_000,
    timeout: 30_000,
    queue_target: 5_000,
    queue_interval: 1_000

  # The secret key base is used to sign/encrypt cookies and other secrets.
  # A default value is used in config/dev.exs and config/test.exs but you
  # want to use a different value for prod and you most likely don't want
  # to check this value into version control, so we use an environment
  # variable instead.
  secret_key_base =
    System.get_env("SECRET_KEY_BASE") ||
      raise """
      environment variable SECRET_KEY_BASE is missing.
      You can generate one by calling: mix phx.gen.secret
      """

  host = System.get_env("PHX_HOST") || "mosslet.com"
  port = String.to_integer(System.get_env("PORT") || "4000")

  # Configure the canonical host for redirects.
  config :mosslet,
    canonical_host: host

  config :mosslet, MossletWeb.Endpoint,
    adapter: Bandit.PhoenixAdapter,
    url: [host: host, port: 443, scheme: "https"],
    check_origin: true,
    force_ssl: [rewrite_on: [:x_forwarded_proto]],
    live_view: [
      signing_salt: System.get_env("LIVE_VIEW_SIGNING_SALT"),
      encryption_salt: System.get_env("LIVE_VIEW_ENCRYPTION_SALT")
    ],
    http: [
      # Enable IPv6 and bind on all interfaces.
      # Set it to  {0, 0, 0, 0, 0, 0, 0, 1} for local network only access.
      # See the documentation on https://hexdocs.pm/plug_cowboy/Plug.Cowboy.html
      # for details about using IPv6 vs IPv4 and loopback vs public addresses.
      ip: {0, 0, 0, 0, 0, 0, 0, 0},
      port: port
    ],
    secret_key_base: secret_key_base

  config :mosslet,
    server_public_key: System.get_env("SERVER_PUBLIC_KEY"),
    server_private_key: System.get_env("SERVER_PRIVATE_KEY"),
    env: :prod

  # Configure Swoosh for production.
  config :mosslet, Mosslet.Mailer,
    adapter: Swoosh.Adapters.Mailgun,
    api_key: System.get_env("MAILGUN_API_KEY"),
    domain: System.get_env("MAILGUN_DOMAIN")

  config :swoosh,
    api_client: Swoosh.ApiClient.Finch,
    finch_name: Mosslet.Finch

  # Configure Oban for fly_postgres.
  # We want to ensure we're only running on
  # the primary database.
  unless System.get_env("FLY_REGION") do
    System.put_env("FLY_REGION", "ewr")
  end

  unless System.get_env("PRIMARY_REGION") do
    System.put_env("PRIMARY_REGION", "ewr")
  end

  primary? = System.get_env("FLY_REGION") == System.get_env("PRIMARY_REGION")

  unless primary? do
    config :oban_met, auto_start: false

    config :mosslet, Oban,
      queues: false,
      plugins: false,
      peer: false
  end

  # Configure langchain OpenAI key
  config :langchain,
    openai_key: System.get_env("OPENAI_KEY"),
    openai_org_id: System.get_env("OPENAI_ORG_ID")

  # Configure image nsfw detection
  config :image, :classifier,
    model: {:hf, "Falconsai/nsfw_image_detection"},
    featurizer: {:hf, "Falconsai/nsfw_image_detection"},
    featurizer_options: [module: Bumblebee.Vision.VitFeaturizer],
    name: Image.Classification.Server,
    autostart: true

  config :bumblebee,
    offline: System.get_env("BUMBLEBEE_OFFLINE", "true") == "true"

  # Configure Stripe
  config :stripity_stripe,
    api_key: System.get_env("STRIPE_API_KEY"),
    signing_secret: System.get_env("STRIPE_WEBHOOK_SECRET")

  csp =
    System.get_env("CSP_HEADER") ||
      "default-src 'none'; form-action 'self'; script-src 'self' 'unsafe-eval' https://unpkg.com/@popperjs/core@2.11.8/dist/umd/popper.min.js https://unpkg.com/tippy.js@6.3.7/dist/tippy-bundle.umd.min.js https://unpkg.com/trix@2.1.13/dist/trix.umd.min.js https://cdn.usefathom.com/script.js; style-src 'self' 'unsafe-inline' https://unpkg.com/trix@2.1.13/dist/trix.css; img-src 'self' data: blob: https://cdn.usefathom.com/ https://mosslet-prod.fly.storage.tigris.dev/ https://res.cloudinary.com/; font-src 'self' https://fonts.gstatic.com; connect-src 'self' wss://mosslet.com https://mosslet.com; frame-ancestors 'self'; object-src 'self'; base-uri 'self'; frame-src 'self'; manifest-src 'self';"

  config :mosslet, MossletWeb.Plugs.ContentSecurityPolicy, csp: csp

  # ## SSL Support
  #
  # To get SSL working, you will need to add the `https` key
  # to your endpoint configuration:
  #
  #     config :mosslet, MossletWeb.Endpoint,
  #       https: [
  #         ...,
  #         port: 443,
  #         cipher_suite: :strong,
  #         keyfile: System.get_env("SOME_APP_SSL_KEY_PATH"),
  #         certfile: System.get_env("SOME_APP_SSL_CERT_PATH")
  #       ]
  #
  # The `cipher_suite` is set to `:strong` to support only the
  # latest and more secure SSL ciphers. This means old browsers
  # and clients may not be supported. You can set it to
  # `:compatible` for wider support.
  #
  # `:keyfile` and `:certfile` expect an absolute path to the key
  # and cert in disk or a relative path inside priv, for example
  # "priv/ssl/server.key". For all supported SSL configuration
  # options, see https://hexdocs.pm/plug/Plug.SSL.html#configure/1
  #
  # We also recommend setting `force_ssl` in your endpoint, ensuring
  # no data is ever sent via http, always redirecting to https:
  #
  #     config :mosslet, MossletWeb.Endpoint,
  #       force_ssl: [hsts: true]
  #
  # Check `Plug.SSL` for all available options in `force_ssl`.

  # ## Configuring the mailer
  #
  # In production you need to configure the mailer to use a different adapter.
  # Also, you may need to configure the Swoosh API client of your choice if you
  # are not using SMTP. Here is an example of the configuration:
  #
  #     config :mosslet, Mosslet.Mailer,
  #       adapter: Swoosh.Adapters.Mailgun,
  #       api_key: System.get_env("MAILGUN_API_KEY"),
  #       domain: System.get_env("MAILGUN_DOMAIN")
  #
  # For this example you need include a HTTP client required by Swoosh API client.
  # Swoosh supports Hackney and Finch out of the box:
  #
  #     config :swoosh, :api_client, Swoosh.ApiClient.Hackney
  #
  # See https://hexdocs.pm/swoosh/Swoosh.html#module-installation for details.
end

prod.exs

import Config

# Note we also include the path to a cache manifest
# containing the digested version of static files. This
# manifest is generated by the `mix assets.deploy` task,
# which you should run after static files are built and
# before starting your production server.
config :mosslet, MossletWeb.Endpoint,
  cache_static_manifest: "priv/static/cache_manifest.json",
  force_ssl: [rewrite_on: [:x_forwarded_proto]]

# Configures Swoosh API Client
config :swoosh, api_client: Swoosh.ApiClient.Finch, finch_name: Mosslet.Finch

# Disable Swoosh Local Memory Storage
config :swoosh, local: false

# Do not print debug messages in production
config :logger, level: :info

config :bumblebee, progress_bar_enabled: false

# Runtime production configuration, including reading
# of environment variables, is done on config/runtime.exs.

endpoint.ex

defmodule MossletWeb.Endpoint do
  use Sentry.PlugCapture
  use Phoenix.Endpoint, otp_app: :mosslet

  # Enable concurrent testing for Wallaby
  if sandbox = Application.compile_env(:mosslet, :sandbox, false) do
    plug Phoenix.Ecto.SQL.Sandbox, sandbox: sandbox
  end

  # The session will be stored in the cookie and signed,
  # this means its contents can be read but not tampered with.
  # Set :encryption_salt if you would also like to encrypt it.
  @session_options [
    store: :cookie,
    key: "_mosslet_key",
    signing_salt: {Mosslet.Encrypted.Session, :signing_salt, []},
    encryption_salt: {Mosslet.Encrypted.Session, :encryption_salt, []},
    same_site: "Lax"
  ]

  # We pass the `:user_agent` in the websocket for Wallaby testing
  # We also pass `:peer_data` and `:x_headers` for IP-based bot defense at socket level
  socket "/live", Phoenix.LiveView.Socket,
    websocket: [connect_info: [:peer_data, :x_headers, :user_agent, session: @session_options]],
    longpoll: [connect_info: [:peer_data, :x_headers, session: @session_options]]

  # Serve at "/" the static files from "priv/static" directory.
  #
  # When code reloading is disabled (e.g., in production),
  # the `gzip` option is enabled to serve compressed
  # static files generated by running `phx.digest`.
  plug Plug.Static,
    at: "/",
    from: :mosslet,
    gzip: not code_reloading?,
    only: MossletWeb.static_paths()

  # Tidewave ai support
  if Code.ensure_loaded?(Tidewave) do
    plug Tidewave
  end

  # Code reloading can be explicitly enabled under the
  # :code_reloader configuration of your endpoint.
  if code_reloading? do
    socket "/phoenix/live_reload/socket", Phoenix.LiveReloader.Socket
    plug Phoenix.LiveReloader
    plug Phoenix.CodeReloader
    plug Phoenix.Ecto.CheckRepoStatus, otp_app: :mosslet
  end

  plug Phoenix.LiveDashboard.RequestLogger,
    param_key: "request_logger",
    cookie_key: "request_logger"

  plug Plug.RequestId
  plug Plug.Telemetry, event_prefix: [:phoenix, :endpoint]

  plug Stripe.WebhookPlug,
    at: "/webhooks/stripe",
    handler: Mosslet.Billing.Providers.Stripe.WebhookHandler,
    secret: {Application, :get_env, [:stripity_stripe, :signing_secret]}

  plug Plug.Parsers,
    parsers: [:urlencoded, :multipart, :json],
    pass: ["*/*"],
    json_decoder: Phoenix.json_library(),
    # 8 MB by trix.js calculations
    length: 8_388_608

  plug Sentry.PlugContext

  plug MossletWeb.Plugs.ContentSecurityPolicy

  plug Plug.MethodOverride
  plug Plug.Head
  plug Plug.Session, @session_options

  plug RemoteIp, headers: ~w[fly-client-ip]

  plug :canonical_host

  plug MossletWeb.Plugs.BotDefense

  plug MossletWeb.Router

  defp canonical_host(conn, _opts) do
    canonical = Application.get_env(:mosslet, :canonical_host)
    request_host = get_request_host(conn)

    cond do
      is_nil(canonical) or canonical == "" ->
        conn

      request_host == canonical ->
        conn

      true ->
        location = build_canonical_url(conn, canonical)

        conn
        |> Plug.Conn.put_resp_header("location", location)
        |> Plug.Conn.send_resp(301, "Moved Permanently")
        |> Plug.Conn.halt()
    end
  end

  defp get_request_host(conn) do
    conn.host
  end

  defp build_canonical_url(conn, canonical_host) do
    scheme =
      case Plug.Conn.get_req_header(conn, "x-forwarded-proto") do
        [proto | _] -> proto
        [] -> to_string(conn.scheme)
      end

    query =
      case conn.query_string do
        "" -> ""
        qs -> "?" <> qs
      end

    "#{scheme}://#{canonical_host}#{conn.request_path}#{query}"
  end
end

Okay, an update in case it helps someone else: we made the following changes and the app is now reachable! :blush: We had a timing issue with our machine learning logic for detecting NSFW images, introduced when we updated how we handle it (although I think the issue was present before that, because we were already seeing strange Fly Doctor error messages about not being on the right port).

I asked an AI agent to explain why our changes work:

The fix works because of timing and initialization order:
The Problem: Calling has_cpu_access?() invoked Nx.tensor(0) which triggered EXLA's initialization immediately during application startup. EXLA init is slow and resource-intensive - it was happening before your health checks passed, causing the container to be killed as "unhealthy."
The Solution: By removing the check and just calling serving_fn.() directly inside the spawned process, EXLA initialization now happens:
- After the GenServer returns {:ok, state} (app is "healthy")
- In a separate process that doesn't block startup
- Only when actually needed to load the model
The whole point of DelayedServing was to defer heavy ML work until after the app is up - the redundant check was defeating that by front-loading EXLA init.

delayed_serving.ex

spawn(fn ->
-       if has_cpu_access?() do
-         Logger.info("Elixir has CPU access! Starting serving #{inspect(state.serving_name)}.")
+       Logger.info("Starting serving #{inspect(state.serving_name)}.")
  
-         serving = state.serving_fn.()
-         Logger.info("Serving #{inspect(state.serving_name)} started")
-         send(server, {:serving_loaded, serving})
-       else
-         Logger.warning("Elixir does not have CPU access. Serving will NOT be started.")
-       end
- 
-       :ok
+       serving = state.serving_fn.()
+       Logger.info("Serving #{inspect(state.serving_name)} started")
+       send(server, {:serving_loaded, serving})
      end)
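
For anyone who wants the pattern from the diff above in a self-contained form, here is a minimal sketch. It is not our actual DelayedServing module: `DeferredLoader` and `load_fn` are hypothetical names, and it uses `spawn_link` where our code uses `spawn`. The idea is only that `init/1` returns right away while the slow load runs in another process.

```elixir
defmodule DeferredLoader do
  @moduledoc "Sketch: return from init/1 right away, do the slow load in a separate process."
  use GenServer

  # `load_fn` is a hypothetical zero-arity function standing in for the slow model load.
  def start_link(load_fn) when is_function(load_fn, 0) do
    GenServer.start_link(__MODULE__, load_fn)
  end

  # Returns :loading until the load has finished, then {:ok, value}.
  def status(pid), do: GenServer.call(pid, :status)

  @impl true
  def init(load_fn) do
    server = self()

    # The heavy work runs outside the GenServer, so init/1 returns immediately
    # and the supervisor can finish starting the rest of the application.
    spawn_link(fn -> send(server, {:loaded, load_fn.()}) end)

    {:ok, :loading}
  end

  @impl true
  def handle_call(:status, _from, state), do: {:reply, state, state}

  @impl true
  def handle_info({:loaded, value}, _state), do: {:noreply, {:ok, value}}
end
```

Anything that touches the ML backend (like the `Nx.tensor(0)` call in our old check) would go inside `load_fn`, so none of it runs before `init/1` has returned.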

endpoint.ex (build_canonical_url, for our canonical host redirect)

defp build_canonical_url(conn, canonical_host) do
    scheme =
      case Plug.Conn.get_req_header(conn, "x-forwarded-proto") do
        [proto | _] ->
          proto

        [] ->
          if Application.get_env(:mosslet, :env) == :prod,
            do: "https",
            else: to_string(conn.scheme)
      end

    query =
      case conn.query_string do
        "" -> ""
        qs -> "?" <> qs
      end

    "#{scheme}://#{canonical_host}#{conn.request_path}#{query}"
  end

I think I spoke too soon. The app is reachable at times, so things have improved quite a bit, but we still see significantly delayed access from, say, another session, along with these errors:

16:29:35 [PC01] instance refused connection. is your app listening on 0.0.0.0:8080? make sure it is not only listening on 127.0.0.1 (hint: look at your startup logs, servers often print the address they are listening on)
16:29:38 [PC01] instance refused connection. is your app listening on 0.0.0.0:8080? make sure it is not only listening on 127.0.0.1 (hint: look at your startup logs, servers often print the address they are listening on)
16:29:38 16:29:38.320 [info] Discovered node :"mosslet-01KDR0NV50NF41PWG5XDMP5RS2@fdaa:0:85f4:a7b:546:aef4:cf06:2" in region fra
16:29:48 [PC01] instance refused connection. is your app listening on 0.0.0.0:8080? make sure it is not only listening on 127.0.0.1 (hint: look at your startup logs, servers often print the address they are listening on)
16:29:48 [PC01] instance refused connection. is your app listening on 0.0.0.0:8080? make sure it is not only listening on 127.0.0.1 (hint: look at your startup logs, servers often print the address they are listening on)
16:29:48 [PC01] instance refused connection. is your app listening on 0.0.0.0:8080? make sure it is not only listening on 127.0.0.1 (hint: look at your startup logs, servers often print the address they are listening on)

And notices about Fly Proxy waiting a lot or too much:


Fly proxy is waiting a lot for your machine to become reachable...
Check logs for more info.
Fly proxy waited too much for your machine to become reachable then gave up.
Check logs for more info.
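
Since the [PC01] message asks whether anything is listening on port 8080, one rough way to check from inside the machine (for example from a remote IEx session) is to try opening a TCP connection. This is only a sketch: `PortCheck` is a made-up helper, not part of Phoenix or Fly, and a successful connect only shows that the port accepts connections, not that the endpoint is healthy.

```elixir
defmodule PortCheck do
  @moduledoc "Sketch: test whether something accepts TCP connections on host:port."

  # Returns true if a TCP connection can be opened within `timeout` milliseconds.
  def listening?(host, port, timeout \\ 1_000) when is_binary(host) and is_integer(port) do
    case :gen_tcp.connect(String.to_charlist(host), port, [:binary, active: false], timeout) do
      {:ok, socket} ->
        # We only wanted to know that the connect succeeded, so close right away.
        :gen_tcp.close(socket)
        true

      {:error, _reason} ->
        false
    end
  end
end
```

For our setup the call would be `PortCheck.listening?("127.0.0.1", 8080)`, with 8080 taken from the PORT value in fly.toml.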

My feeling is that our ML logic is making the app too slow when machines spin up? Although I thought the ‘suspend’ state would help with that, so it feels like I’m doing something else wrong. I’ll update again if I figure it out.

Okay: there was a missing function in our adapters/web.ex that was not being detected as missing. I’m not even sure how that happened but it did. :blush:
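
In case it is useful to anyone hitting the same thing, here is a sketch of one way a missing function could be caught at boot instead of at request time. It is not what we actually run: `AdapterCheck` and the module and function names in the usage line are hypothetical. Declaring a `@behaviour` on the adapter is another option, since the compiler then warns about missing callbacks.

```elixir
defmodule AdapterCheck do
  @moduledoc "Sketch: list the functions a module is expected to export but does not."

  # `required` is a list of {function_name, arity} tuples.
  # Returns the subset of `required` that `module` does not export.
  def missing(module, required) when is_atom(module) and is_list(required) do
    case Code.ensure_loaded(module) do
      {:module, _} ->
        Enum.reject(required, fn {name, arity} ->
          function_exported?(module, name, arity)
        end)

      {:error, _reason} ->
        # The module itself could not be loaded, so everything is missing.
        required
    end
  end
end
```

A call such as `AdapterCheck.missing(MyApp.Adapters.Web, list_posts: 1)` during application start could log or raise when the result is not an empty list.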

Glad to hear you got it working again! As a small side note, in case you want to do some fine-tuning in the future, your swap_size_mb setting is probably preventing suspend from ever occurring. (This is one of the limitations of suspend, :snowflake:.)

If you try a manual fly m suspend, you should get an error message.

Also…

The root partition is super-slow these days, throttled at 8 MiB/s, so it doesn’t generally make a good cache.

I’d suggest SSHing in and then seeing how much is really stored there. It would only take 80MB to introduce a 10 second delay at an inopportune time (e.g., processing a request that included an uploaded image), assuming it was all loaded simultaneously.
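
The arithmetic behind that estimate, as a small sketch (`ThrottleEstimate` is just an illustrative name, and it assumes the full 8 MiB/s is available to the read and treats the 80MB as 80 MiB):

```elixir
defmodule ThrottleEstimate do
  @moduledoc "Sketch: time to read a number of bytes at a throttled disk rate."

  @mib 1024 * 1024

  # Seconds needed to read `bytes` at `mib_per_s` MiB/s (8 MiB/s by default).
  def read_seconds(bytes, mib_per_s \\ 8), do: bytes / (mib_per_s * @mib)
end

# 80 MiB at 8 MiB/s
ThrottleEstimate.read_seconds(80 * 1024 * 1024)
# => 10.0
```

Plugging in the real size of the cache directory gives a rough lower bound on how long a cold read from the root partition could take.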

Hope this helps a little!


Thank you! I made updates (dropped our memory to 2gb with no swap) and switched to volumes on our machines for the bumblebee cache. We're not seeing the errors anymore, and we're getting the performance I had hoped to see. :heart:
