Getting an error when trying to run Llama2ChatModel on a GPU machine

I’m new to machine learning and to the Fly platform. I tried following this tutorial from the blog: Easy at-home AI with Bumblebee and Fly GPUs · The Phoenix Files.

I created the machine that should run the ML model, but when I inspect the logs I see this:

2024-04-05T01:21:29Z app[e286657df71918] ord [info] WARN Reaped child process with pid: 508 and signal: SIGUSR1, core dumped? false
2024-04-05T01:21:30Z app[e286657df71918] ord [info][    7.603735] NVRM kchannelConstruct_IMPL: Channel allocation not allowed when MIG is enabled without GPU instancing
2024-04-05T01:21:31Z app[e286657df71918] ord [info]WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
2024-04-05T01:21:31Z app[e286657df71918] ord [info]I0000 00:00:1712280091.037464     485 tfrt_cpu_pjrt_client.cc:349] TfrtCpuClient created.
2024-04-05T01:21:31Z app[e286657df71918] ord [info]01:21:31.037 [error] failed call to cuInit: CUDA_ERROR_NOT_INITIALIZED: initialization error
2024-04-05T01:21:31Z app[e286657df71918] ord [info]01:21:31.037 [info] retrieving CUDA diagnostic information for host: e286657df71918
2024-04-05T01:21:31Z app[e286657df71918] ord [info]01:21:31.037 [info] hostname: e286657df71918
2024-04-05T01:21:31Z app[e286657df71918] ord [info]01:21:31.037 [info] libcuda reported version is: 545.23.8
2024-04-05T01:21:31Z app[e286657df71918] ord [info]01:21:31.037 [info] kernel reported version is: NOT_FOUND: could not find kernel module information in driver version file contents: "NVRM version: NVIDIA UNIX Open Kernel Module for x86_64  545.23.08  Release Build  (dvs-builder@U16-I3-A16-1-1)  Mon Nov  6 23:37:57 UTC 2023
2024-04-05T01:21:31Z app[e286657df71918] ord [info]GCC version:  gcc version 11.4.0 (Ubuntu 11.4.0-1ubuntu1~22.04)
2024-04-05T01:21:31Z app[e286657df71918] ord [info]"
2024-04-05T01:21:31Z app[e286657df71918] ord [info]01:21:31.038 [warning] Elixir does not have GPU access. Serving will NOT be started.

For some reason I can’t figure out, the app is not using the GPU. Am I doing something wrong? Any ideas? Thank you!

Hi @danubiolima, your problem seems to start with this message:

2024-04-05T01:21:30Z app[e286657df71918] ord [info][    7.603735] NVRM kchannelConstruct_IMPL: Channel allocation not allowed when MIG is enabled without GPU instancing

The actual fix is to disable MIG before launching the Bumblebee app by running nvidia-smi -mig 0. That said, we are aware of the problem, and a fix for the app is in the works.

For others hitting this issue: while we don’t have a definitive fix yet, add an nvidia-smi -mig 0 call to your image’s entrypoint.
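As a rough sketch of what such an entrypoint could look like (the script name entrypoint.sh, the disable_mig helper, and the warning messages are hypothetical, and this assumes nvidia-smi is on the PATH inside the image):

```shell
#!/bin/sh
# Hypothetical entrypoint wrapper: turn MIG off, then hand off to the real command.

disable_mig() {
  if command -v nvidia-smi >/dev/null 2>&1; then
    # -mig 0 disables Multi-Instance GPU mode; warn instead of aborting on failure.
    nvidia-smi -mig 0 || echo "warning: could not disable MIG" >&2
  else
    echo "warning: nvidia-smi not found, skipping MIG disable" >&2
  fi
}

disable_mig

# Replace this shell with the container's command, if one was passed in.
if [ "$#" -gt 0 ]; then
  exec "$@"
fi
```

You would then point your image's ENTRYPOINT at this script and keep your existing start command as CMD, so the app only launches after the MIG call has run.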


Hey @danubiolima, we’ll fix this. For right now, you can add that nvidia-smi command to env.sh.eex to disable MIG.

Here’s what that looks like:
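A minimal sketch, assuming the release uses the rel/env.sh.eex file generated by mix release.init and that nvidia-smi is on the PATH inside the image. The guard and the warning message are illustrative additions on top of the single nvidia-smi -mig 0 call:

```shell
# rel/env.sh.eex
# This file is sourced by the release's start script before the VM boots,
# so it must not call exit.

# Assumption: nvidia-smi is available on PATH in the GPU Machine image.
if command -v nvidia-smi >/dev/null 2>&1; then
  # Disable MIG so CUDA can initialize; keep booting even if the call fails.
  nvidia-smi -mig 0 || echo "env.sh: could not disable MIG" >&2
fi
```

Because env.sh is sourced on every release command, the call runs each time the app starts, with no change to the Dockerfile.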
