Getting error when trying to run Llama2ChatModel in GPU machine

danubiolima · April 5, 2024, 1:40am

I’m new to the machine learning field and to the Fly platform. I tried to follow this tutorial from the blog: Easy at-home AI with Bumblebee and Fly GPUs · The Phoenix Files.

I created the machine that should run the ML model, but when I inspect the logs I see this:

2024-04-05T01:21:29Z app[e286657df71918] ord [info] WARN Reaped child process with pid: 508 and signal: SIGUSR1, core dumped? false
2024-04-05T01:21:30Z app[e286657df71918] ord [info][    7.603735] NVRM kchannelConstruct_IMPL: Channel allocation not allowed when MIG is enabled without GPU instancing
2024-04-05T01:21:31Z app[e286657df71918] ord [info]WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
2024-04-05T01:21:31Z app[e286657df71918] ord [info]I0000 00:00:1712280091.037464     485 tfrt_cpu_pjrt_client.cc:349] TfrtCpuClient created.
2024-04-05T01:21:31Z app[e286657df71918] ord [info]01:21:31.037 [error] failed call to cuInit: CUDA_ERROR_NOT_INITIALIZED: initialization error
2024-04-05T01:21:31Z app[e286657df71918] ord [info]01:21:31.037 [info] retrieving CUDA diagnostic information for host: e286657df71918
2024-04-05T01:21:31Z app[e286657df71918] ord [info]01:21:31.037 [info] hostname: e286657df71918
2024-04-05T01:21:31Z app[e286657df71918] ord [info]01:21:31.037 [info] libcuda reported version is: 545.23.8
2024-04-05T01:21:31Z app[e286657df71918] ord [info]01:21:31.037 [info] kernel reported version is: NOT_FOUND: could not find kernel module information in driver version file contents: "NVRM version: NVIDIA UNIX Open Kernel Module for x86_64  545.23.08  Release Build  (dvs-builder@U16-I3-A16-1-1)  Mon Nov  6 23:37:57 UTC 2023
2024-04-05T01:21:31Z app[e286657df71918] ord [info]GCC version:  gcc version 11.4.0 (Ubuntu 11.4.0-1ubuntu1~22.04)
2024-04-05T01:21:31Z app[e286657df71918] ord [info]"
2024-04-05T01:21:31Z app[e286657df71918] ord [info]01:21:31.038 [warning] Elixir does not have GPU access. Serving will NOT be started.

For some reason I don’t understand, the app is not using the GPU. Am I doing something wrong? Any ideas? Thank you!

dangra · April 16, 2024, 6:20pm

Hi @danubiolima, your problem seems to start with this message:

2024-04-05T01:21:30Z app[e286657df71918] ord [info][    7.603735] NVRM kchannelConstruct_IMPL: Channel allocation not allowed when MIG is enabled without GPU instancing

The actual fix is to disable MIG before launching the Bumblebee app by running nvidia-smi -mig 0. That said, we are aware of the problem and a fix for the app is in the works.

For others hitting this issue: until we have a definitive fix, add a nvidia-smi -mig 0 call to your image’s entrypoint.
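As a rough illustration of the entrypoint suggestion, here is a minimal sketch of a wrapper script. It assumes the image's entrypoint is a shell script that receives the real command as its arguments; the function name run_with_mig_disabled is hypothetical and not something from the tutorial or from Fly.

```shell
#!/bin/sh
# Hypothetical entrypoint wrapper (name and layout are assumptions).
# It tries to disable MIG, then runs whatever command it was given.

run_with_mig_disabled() {
  # Only call nvidia-smi if it is actually on PATH, so the same image
  # still boots on a machine without a GPU.
  if command -v nvidia-smi >/dev/null 2>&1; then
    # Warn instead of aborting if MIG cannot be disabled.
    nvidia-smi -mig 0 || echo "warning: nvidia-smi -mig 0 failed" >&2
  fi
  # Run the real command. A standalone entrypoint script would
  # usually end with: exec "$@"
  "$@"
}

run_with_mig_disabled "$@"
```

The wrapper only warns when nvidia-smi fails, so a problem with MIG shows up in the logs without preventing the app from starting.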

matthewlehner · April 16, 2024, 6:26pm

Hey @danubiolima, we’ll fix this. For right now, you can add that nvidia-smi command to env.sh.eex to disable MIG.

Here’s what that looks like:
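The snippet from this reply is not preserved in the thread as captured here. As a sketch only, and not necessarily what was originally posted, an addition to env.sh.eex along these lines would run the command each time the release starts, since that file is sourced by the release script. The helper name disable_mig is hypothetical.

```shell
# Sketch of lines that could be added to env.sh.eex (an assumption, not
# the original snippet). The file is sourced by the release start script,
# so this runs before the Elixir VM boots.

disable_mig() {
  if command -v nvidia-smi >/dev/null 2>&1; then
    # Disable MIG mode; log a warning rather than stopping the boot.
    nvidia-smi -mig 0 || echo "warning: could not disable MIG" >&2
  else
    echo "warning: nvidia-smi not found, skipping MIG disable" >&2
  fi
}

disable_mig
```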
