I’m facing this error when deploying:
Deployment logs
> [1/2] Clearing lease for 3d8dd959b70068
✔ [2/2] Cleared lease for 7811099c940318
✔ [1/2] Cleared lease for 3d8dd959b70068
Error: failed to update machine 7811099c940318: Unrecoverable error: timeout reached waiting for health checks to pass for machine 7811099c940318: failed to get VM 7811099c940318: Get "https://api.machines.dev/v1/apps/riley-api/machines/7811099c940318": net/http: request canceled
Error: Process completed with exit code 1.
Machine logs
2024-08-21T20:29:53Z runner[7811099c940318] sjc [info]Pulling container image registry.fly.io/riley-api:deployment-01J5VA2S6TGY3CZV607CJTAFS1
2024-08-21T20:29:54Z runner[3d8dd959b70068] sjc [info]Pulling container image registry.fly.io/riley-api:deployment-01J5VA2S6TGY3CZV607CJTAFS1
2024-08-21T20:30:06Z runner[7811099c940318] sjc [info]Configuring firecracker
2024-08-21T20:30:06Z runner[3d8dd959b70068] sjc [info]Successfully prepared image registry.fly.io/riley-api:deployment-01J5VA2S6TGY3CZV607CJTAFS1 (12.169936357s)
2024-08-21T20:30:06Z app[7811099c940318] sjc [info] INFO Sending signal SIGINT to main child process w/ PID 318
2024-08-21T20:30:06Z app[7811099c940318] sjc [info][2024-08-21 20:30:06 +0000] [318] [INFO] Handling signal: int
2024-08-21T20:30:06Z app[7811099c940318] sjc [info][2024-08-21 20:30:06 +0000] [318] [WARNING] Worker with pid 324 was terminated due to signal 3
2024-08-21T20:30:06Z app[7811099c940318] sjc [info][2024-08-21 20:30:06 +0000] [318] [WARNING] Worker with pid 326 was terminated due to signal 3
2024-08-21T20:30:06Z app[7811099c940318] sjc [info][2024-08-21 20:30:06 +0000] [318] [WARNING] Worker with pid 325 was terminated due to signal 3
2024-08-21T20:30:06Z app[7811099c940318] sjc [info][2024-08-21 20:30:06 +0000] [318] [WARNING] Worker with pid 323 was terminated due to signal 3
2024-08-21T20:30:06Z app[7811099c940318] sjc [info][2024-08-21 20:30:06 +0000] [318] [INFO] Shutting down: Master
2024-08-21T20:30:06Z app[7811099c940318] sjc [info] INFO Main child exited normally with code: 0
2024-08-21T20:30:06Z app[7811099c940318] sjc [info] INFO Starting clean up.
2024-08-21T20:30:06Z app[7811099c940318] sjc [info] WARN could not unmount /rootfs: EINVAL: Invalid argument
2024-08-21T20:30:06Z app[7811099c940318] sjc [info][ 5731.812371] reboot: Restarting system
2024-08-21T20:30:10Z app[7811099c940318] sjc [info] INFO Starting init (commit: 20f21dc5f)...
2024-08-21T20:30:10Z app[7811099c940318] sjc [info] INFO Preparing to run: `gunicorn riley_backend.main:app -w 4 -k uvicorn.workers.UvicornWorker --bind 0.0.0.0:8000 --timeout 300` as root
2024-08-21T20:30:10Z app[7811099c940318] sjc [info] INFO [fly api proxy] listening at /.fly/api
2024-08-21T20:30:10Z runner[7811099c940318] sjc [info]Machine created and started in 17.087s
2024-08-21T20:30:10Z app[7811099c940318] sjc [info]2024/08/21 20:30:10 INFO SSH listening listen_address=[fdaa:9:8686:a7b:181:53a4:e1f2:2]:22 dns_server=[fdaa::3]:53
2024-08-21T20:30:11Z app[7811099c940318] sjc [info][2024-08-21 20:30:11 +0000] [318] [INFO] Starting gunicorn 20.1.0
2024-08-21T20:30:11Z app[7811099c940318] sjc [info][2024-08-21 20:30:11 +0000] [318] [INFO] Listening at: http://0.0.0.0:8000 (318)
2024-08-21T20:30:11Z app[7811099c940318] sjc [info][2024-08-21 20:30:11 +0000] [318] [INFO] Using worker: uvicorn.workers.UvicornWorker
2024-08-21T20:30:11Z app[7811099c940318] sjc [info][2024-08-21 20:30:11 +0000] [323] [INFO] Booting worker with pid: 323
2024-08-21T20:30:11Z app[7811099c940318] sjc [info][2024-08-21 20:30:11 +0000] [324] [INFO] Booting worker with pid: 324
2024-08-21T20:30:11Z app[7811099c940318] sjc [info][2024-08-21 20:30:11 +0000] [325] [INFO] Booting worker with pid: 325
2024-08-21T20:30:11Z app[7811099c940318] sjc [info][2024-08-21 20:30:11 +0000] [326] [INFO] Booting worker with pid: 326
2024-08-21T20:30:42Z app[7811099c940318] sjc [info]DEBUG:filelock:Attempting to acquire lock 140641993491232 on /root/.cache/huggingface/hub/.locks/models--gpt2/be4d21d94f3b4687e5a54d84bf6ab46ed0f8defd.lock
2024-08-21T20:30:42Z app[7811099c940318] sjc [info]DEBUG:filelock:Lock 140641993491232 acquired on /root/.cache/huggingface/hub/.locks/models--gpt2/be4d21d94f3b4687e5a54d84bf6ab46ed0f8defd.lock
2024-08-21T20:30:42Z app[7811099c940318] sjc [info]DEBUG:filelock:Attempting to acquire lock 140641993745824 on /root/.cache/huggingface/hub/.locks/models--
I’ve tried everything. None of the following suggested solutions worked:
- Assigning more memory
- Adding a timeout
- Implementing graceful shutdown to handle long-running tasks more effectively
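For context, the graceful-shutdown suggestion usually amounts to something like the following. This is a minimal sketch, not the app's actual code: `shutdown_requested`, `request_shutdown`, `install_handlers`, and `process_items` are hypothetical names, and it assumes the long-running work can be split into items so the loop can stop between them.

```python
import signal
import threading

# Set when a shutdown signal arrives; long-running loops poll it.
shutdown_requested = threading.Event()


def request_shutdown(signum, frame):
    """Signal handler: record the request instead of exiting immediately."""
    shutdown_requested.set()


def install_handlers():
    """Register the handler; must be called from the main thread.

    SIGINT is what the machine logs show being sent to the main process;
    SIGTERM is included here as an assumption, since it is a common default.
    """
    for sig in (signal.SIGINT, signal.SIGTERM):
        signal.signal(sig, request_shutdown)


def process_items(items, handle):
    """Process items one at a time, stopping between items once a
    shutdown has been requested. Returns the unprocessed items."""
    remaining = list(items)
    while remaining and not shutdown_requested.is_set():
        handle(remaining.pop(0))
    return remaining
```

`install_handlers()` would be called once at startup in a standalone worker process. Under gunicorn the master process installs its own signal handlers (the logs show it handling SIGINT itself), so a sketch like this would only apply to background work the app manages on its own.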
This is my Dockerfile and fly.toml:
# Build stage
FROM python:3.12-slim AS builder
WORKDIR /build
# Copy only the files needed for installing dependencies
COPY xxxx/pyproject.toml xxxx/poetry.lock* ./
RUN pip install --no-cache-dir poetry
# Export runtime dependencies
RUN poetry export -f requirements.txt --output requirements.txt --without-hashes
# Copy the entire xxxx directory
COPY xxxx ./xxxx
# Final stage
FROM python:3.12-slim
WORKDIR /app
ENV MODAL_TOKEN_ID=${MODAL_TOKEN_ID}
ENV MODAL_TOKEN_SECRET=${MODAL_TOKEN_SECRET}
# Install ffmpeg and other necessary packages
RUN apt-get update && apt-get install -y ffmpeg && rm -rf /var/lib/apt/lists/*
# Copy only runtime requirements
COPY --from=builder /build/requirements.txt .
# Install runtime dependencies
RUN pip install --no-cache-dir -r requirements.txt
# Copy application code from the builder stage
COPY --from=builder /build/xxxxx /app
# Set Python path
ENV PYTHONPATH=/app
EXPOSE 8000
CMD ["gunicorn", "xxxxx.main:app", "-w", "4", "-k", "uvicorn.workers.UvicornWorker", "--bind", "0.0.0.0:8000", "--timeout", "300"]
fly.toml
app = 'xxxxxx'
primary_region = 'sjc'
[build]
dockerfile = "./Dockerfile"
[http_service]
internal_port = 8000
force_https = true
auto_stop_machines = true
auto_start_machines = true
min_machines_running = 1
processes = ['app']
[[vm]]
memory = '2gb'
cpu_kind = 'shared'
cpus = 1
[scale]
min = 1
max = 2
[[services]]
http_checks = []
internal_port = 8000
processes = ["app"]
protocol = "tcp"
script_checks = []
[services.concurrency]
hard_limit = 25
soft_limit = 20
type = "connections"
[[services.ports]]
force_https = true
handlers = ["http"]
port = 80
[[services.ports]]
handlers = ["tls", "http"]
port = 443
[[services.tcp_checks]]
grace_period = "60s"
interval = "15s"
restart_limit = 0
timeout = "2s"
No success yet!