auto_stop_machines & start: machines keep getting restarted

Hi everyone,

I’ve set up a Mastodon instance on fly.io using the tmm1/flyapp-mastodon template from GitHub.

I’ve added auto_stop_machines and auto_start_machines to the services section, so the config now looks like this:

app = "XXX"

kill_signal = "SIGINT"
kill_timeout = 5

[deploy]
  strategy = "bluegreen"

[env]
  LOCAL_DOMAIN = "XXX"
  WEB_CONCURRENCY = "0"
  OVERMIND_FORMATION = "sidekiq=1"
  MALLOC_ARENA_MAX = "2"
  MAX_THREADS = "15"
  RAILS_ENV = "production"
  RAILS_LOG_TO_STDOUT = "enabled"
  RAILS_SERVE_STATIC_FILES = "false"
  REDIS_HOST = "XXX-redis.internal"
  REDIS_PORT = "6379"
  S3_ENABLED = true
  S3_BUCKET = "XXX"
  S3_ALIAS_HOST = "XXX.XXX"
  S3_ENDPOINT = "https://XXX.r2.cloudflarestorage.com/"
  S3_PERMISSION = "private"
  S3_PROTOCOL = "https"

  SMTP_SERVER = "smtp.eu.mailgun.org"
  SMTP_PORT = "587"
  SMTP_ENABLE_STARTTLS = "always"
  SMTP_FROM_ADDRESS = "mastodon@XXX"

[[statics]]
  guest_path = "/opt/mastodon/public"
  url_prefix = "/"

[[services]]
  # processes = ["rails"]
  internal_port = 8080
  protocol = "tcp"

  auto_stop_machines = true
  auto_start_machines = true
  min_machines_running = 1

  [services.concurrency]
    type = "requests"
    hard_limit = 250
    soft_limit = 200

  [[services.ports]]
    handlers = ["http"]
    port = 80

  [[services.ports]]
    handlers = ["tls", "http"]
    port = 443

  [[services.tcp_checks]]
    grace_period = "1s"
    interval = "15s"
    restart_limit = 0
    timeout = "2s"

  [[services.http_checks]]
    path = "/health"
    grace_period = "1s"
    interval = "15s"
    restart_limit = 0
    timeout = "2s"

I saw some machines stop and then restart immediately. What could cause this? The checks have restart_limit = 0, so I don’t think they are the cause.

I then tried setting a stupidly high hard_limit and soft_limit to make sure the instances would shut down, but it did not work. Some machines shut down and restarted right after; others did not shut down at all.

Mastodon keeps a websocket open, and I saw a suggestion here that this could cause issues. Has that been fixed?

I also tried switching to http_service and http_service.concurrency with type = "requests", but I got the same result: even with very high fake limits, machines would either restart or not stop at all.
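For readers following along, here is a minimal sketch of what that http_service variant might look like. The poster's actual file was not shared, so this assumes the same internal port, limits, and min_machines_running as the [[services]] block above, and that it is used in place of that block rather than alongside it:

```toml
# Hypothetical http_service equivalent of the [[services]] block above.
# Values are copied from the original config, not from the poster's real file.
[http_service]
  internal_port = 8080
  auto_stop_machines = true
  auto_start_machines = true
  min_machines_running = 1

  [http_service.concurrency]
    type = "requests"
    hard_limit = 250
    soft_limit = 200
```

With type = "requests", the limits count in-flight HTTP requests rather than open connections, which is presumably why it was worth trying for an app that holds websockets open.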

I forgot to save the logs… If needed I can redeploy with these settings and add them to this thread.

Thanks for your help :bowing_man:

We haven’t looked into this issue yet. We have a hunch about why it is happening, but haven’t had time to reproduce and fix it. We should get to it in the next few weeks.

The issue with websockets is that they are opened by the client and are intended to stay open for as long as a tab containing your page is open. If the server or network goes down, the JavaScript client responds by waiting a short period (normally a few seconds) and then retrying, over and over. If that is what is happening here, those reconnect attempts would count as incoming traffic and could start a machine again shortly after it stops.

Thanks for your answer, senyo. Anything I could do in the meantime to have some kind of autoscaling? I don’t mind waiting, no worries :revolving_hearts: Thanks for working on that and for the quick answer.

@alyx we’ve fixed a bug that caused machines to be auto-stopped incorrectly, which led to the behavior you saw. Can you try re-enabling autostart and autostop to see if it works better?