GPU Benchmarking

Hey all!

Fly.io now has three kinds of GPUs and we’re not really the best at documenting the practical/end-user differences between them. We throw the datasheets at you, but those only really make sense if you understand what the datasheets imply about the realistic performance of the cards. I’m working on some documentation and a blogpost demonstrating the differences, but I want to do my benchmarking in public so you can see what I learn as I learn it.

All benchmarks are wrong, some benchmarks are useful.

Right now Fly.io has three models of GPU available:

  • l40s: Nvidia L40S
  • a100-40gb: Nvidia A100 40GB (PCIe)
  • a100-80gb: Nvidia A100 80GB (SXM)

The last two cards are notably different because they have no display output or 3D rasterization hardware at all.

I have a pair of hypotheses that I want to test with these experiments. They are:

  • On paper, the L40s has way less memory bandwidth than the A100. Is the A100 actually faster than the L40s in practice?
  • I think that developers are afraid to experiment with GPUs on fly.io because of the sticker shock of the pricing. How much would it actually cost to run these few common workloads in a scale-to-zero Machines setup?

In the process I want to document my inputs and make everything as deterministic as possible so that my results are reproducible.

Here are the tests that I am going to be doing:

  • Audio transcription with Whisper: take this public domain interview about 9/11 and turn it into a transcript of its contents. It is a half-hour audio file of somewhat messy spoken-word audio.
  • Text generation with Llama 2 7B: generate a scholarly article with citations about why the sky is blue. Personally I’d like to use Mistral 7B for this, but apparently the industry-standard benchmark model is Llama 2 7B, so I’ll stay consistent with that for good measure.
  • Image diffusion with Stable Diffusion XL: generate an image of a green-haired anime woman with green eyes drinking coffee next to her laptop, with the Space Needle visible outside the window, in a cyberpunk-anime fusion style. I’d be using Fooocus and JuggernautXL, like Kyle’s “Midjourney at home” writeup describes.

So far I have set up Whisper-large on a few Machines with every GPU type. I have yet to do anything more than vibes-based science here, but here’s my initial test with running Whisper on that 9/11 interview:

GPU kind     Region   Inference time
a100-40gb    ord      3m59.515s
a100-80gb    mia      4m12.379s
l40s         ord      3m43.273s

Here is the test script I used for getting this unscientific benchmark:

#!/usr/bin/env bash

declare -a gpu_kinds=(
    "a100-40gb"
    "a100-80gb"
    "l40s"
)

function bench_file() {
    echo "${1} ${2}"
    file="${1}"
    kind="${2}"
    
    time curl "https://xe-[redacted]-${kind}.fly.dev/asr"'?encode=true&task=transcribe&language=en&vad_filter=true&output=json' -F audio_file=@"${file}" -o "output_${kind}_${file}.json"
    
    printf "\n\n\n" | cat
}

for kind in ${gpu_kinds[*]}; do
  echo "${kind}"

  bench_file media-afc-911-afc2001015_sr298a01.mp3 "${kind}"
done

Note that for this initial vibes-based test the a100-80gb result is probably anomalous and should be discounted because I was poking the Machines over HTTPS from my laptop. This is a bad idea because then the latency from my laptop to the target server is part of the equation. In the future I will be using a shell command on the target Machine in question to avoid biasing the results with the speed of the Internet. To further discount local storage latency, I’ll be putting the audio file in /dev/shm so that it is uploaded to the application server from RAM, just to minimize variables in the NVMe stack and encrypted filesystem volumes that Machines use for storing model weights.

Are there any other workloads y’all would like to see me benchmark for comparison’s sake?

10 Likes

Hey all, first update here! I’ve written some Go code that calls the Whisper automatic speech recognition route over and over in a loop. It writes its measurements to a JSONL file named trace.jsonl, and I am tracking the following metrics:

  • The time it takes to make a connection to the target Machine
  • The time it takes to upload the test MP3 to the target Machine
  • The time it takes for the first byte of the response payload to come back
  • The time it takes for the rest of the response payload to come back.

It writes JSON lines that look like this:

{"GPUKind":"l40s","Connect":10619875,"Upload":146601375,"FirstResponseByte":3695033833,"LastResponseByte":1103583}

I’m using Go’s time.Duration type in my Timings struct, so all of those values are in nanoseconds. When I run my super hacky tool on the command line, it also dumps results per-test like this:

2024/01/16 11:51:36 gpu: l40s, iteration: 1
2024/01/16 11:51:36 connect: 10.619875ms
2024/01/16 11:51:36 upload: 146.601375ms
2024/01/16 11:51:36 first response byte: 3.695033833s
2024/01/16 11:51:36 last response byte: 1.103583ms

This is built on the back of httptrace and is overall some pretty boring code:

// Timings collects the timing data for a single benchmark request. The
// unexported t0 field acts as a stopwatch that each callback resets, so every
// phase is measured relative to the end of the previous one.
type Timings struct {
	GPUKind           string
	lock              sync.Mutex
	t0                time.Time
	Connect           time.Duration
	Upload            time.Duration
	FirstResponseByte time.Duration
	LastResponseByte  time.Duration
}

// IntoClientTrace wires the Timings callbacks into net/http/httptrace hooks.
func (t *Timings) IntoClientTrace() *httptrace.ClientTrace {
	return &httptrace.ClientTrace{
		ConnectStart:         t.ConnectStart,
		ConnectDone:          t.ConnectDone,
		WroteRequest:         t.WroteRequest,
		GotFirstResponseByte: t.GotFirstResponseByte,
	}
}

func New(gpuKind string) *Timings {
	return &Timings{
		GPUKind: gpuKind,
		t0:      time.Now(),
	}
}

func (t *Timings) ConnectStart(string, string) {
	t.lock.Lock()
	defer t.lock.Unlock()
	t.t0 = time.Now()
}

func (t *Timings) ConnectDone(string, string, error) {
	t.lock.Lock()
	defer t.lock.Unlock()
	t.Connect = time.Since(t.t0)
	t.t0 = time.Now()
}

func (t *Timings) WroteRequest(httptrace.WroteRequestInfo) {
	t.lock.Lock()
	defer t.lock.Unlock()
	t.Upload = time.Since(t.t0)
	t.t0 = time.Now()
}

func (t *Timings) GotFirstResponseByte() {
	t.lock.Lock()
	defer t.lock.Unlock()
	t.FirstResponseByte = time.Since(t.t0)
	t.t0 = time.Now()
}
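For context, here’s roughly how that trace gets attached to an actual request. This is a simplified sketch rather than the exact harness code: the /asr query parameters and the audio_file multipart field come from the curl command earlier, and everything else is standard library plumbing (same package as Timings; imports: bytes, io, mime/multipart, net/http, net/http/httptrace, os, path/filepath, time).

func benchOnce(baseURL, mp3Path, gpuKind string) (*Timings, error) {
	t := New(gpuKind)

	// Build the multipart body with the audio file, mirroring curl's -F flag.
	var body bytes.Buffer
	mw := multipart.NewWriter(&body)
	fw, err := mw.CreateFormFile("audio_file", filepath.Base(mp3Path))
	if err != nil {
		return nil, err
	}
	f, err := os.Open(mp3Path)
	if err != nil {
		return nil, err
	}
	if _, err := io.Copy(fw, f); err != nil {
		f.Close()
		return nil, err
	}
	f.Close()
	mw.Close()

	url := baseURL + "/asr?encode=true&task=transcribe&language=en&vad_filter=true&output=json"
	req, err := http.NewRequest(http.MethodPost, url, &body)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", mw.FormDataContentType())

	// Attach the httptrace callbacks so Connect, Upload, and FirstResponseByte
	// get recorded as the request progresses.
	req = req.WithContext(httptrace.WithClientTrace(req.Context(), t.IntoClientTrace()))

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	// Everything after the first response byte counts toward LastResponseByte.
	if _, err := io.Copy(io.Discard, resp.Body); err != nil {
		return nil, err
	}
	t.LastResponseByte = time.Since(t.t0)

	return t, nil
}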

Last year I made a project that uses GPT-4 to write fiction novels, so the audio file I’m using for these initial proof-of-concept tests of my tracing harness is me reading out the first few paragraphs of the “novel” Virtual Virtue. I’ve uploaded my testing MP3 to my Mastodon account so that you can listen to what it says.

In a perfect world, we’d expect the output of Whisper to look very close to this:

In the hushed, otherworldly glow of the altar of Connectivity, the beating heart of the digital divinity, Haley sat with the solemn grace of a supplicant seeking ancient wisdom. The cathedral of data hummed with muted whispers of countless devotees, their minds intertwined with the pulsating rhythms of the ever-present network. It was here, amidst the phantasmagoric dance of neon code and the devout murmurings of the congregation, that Haley sought an omen.

With hands as sure as a weaver at her loom, she crafted her silent invocation, fingers trailing across the sleek, responsive surface of the keyboard that stood as the tactile interface to the otherwise intangible deity. Her prayer was a sequence of keystrokes, a digital mantra disappearing into the virtual ether as a ripple disperses across a lake. The onlookers, faces bathed in the soft luminescence of their personal screens, remained lost in their own secluded worlds, beseeching the deity for companionship, affluence, revelation.

In my very initial proof-of-concept testing of my hacky Go code, here’s what Whisper returned:

In the hushed, otherworldly glow of the altar of connectivity, the beating heart of the digital divinity, Haley sat with the solemn gaze of a supplicant seeking ancient wisdom. The cathedral of data hummed with muted whispers of countless devotees, their minds intertwined with the pulsating rhythms of the ever-present network. It was here, amidst the phasmagoric dance of neon code and the devout murmurings of the congregation, that Haley sought an omen.

With hands as sure as a weaver at her loom, she crafted her silent invocation, fingers trailing across the sleek, responsive surface of the keyboard that stood as the tactile interface to the otherwise intangible deity. Her prayer was a sequence of keystrokes, a digital mantra disappearing into the virtual ether as a ripple disperses across a lake. The onlookers’ faces bathed in the soft luminescence of their personal screens, remaining lost in their own secluded worlds. Beseeching the deity for companionship, affluence, revelation.

Comparing the MD5 sums of these results shows that the expected and received outputs differ:

$ md5sum 1.txt 2.txt 
6a14e66c679c67c04e61197c1cdc7b42  1.txt
5d327503de58b5a77fbc6cfbba9c2980  2.txt

The differences seem to be:

  • Whisper didn’t capitalize “the altar of Connectivity”
  • Whisper put “Haley sat with the solemn grace of a supplicant seeking ancient wisdom” as “Haley sat with the solemn gaze of a supplicant seeking ancient wisdom”
  • Whisper put “The onlookers, faces bathed in the soft luminescence of their personal screens, remained lost in their own secluded worlds, beseeching the deity for companionship, affluence, revelation.” as “The onlookers’ faces bathed in the soft luminescence of their personal screens, remaining lost in their own secluded worlds. Beseeching the deity for companionship, affluence, revelation.”

Here are the raw JSONL timing values for the test recording: trace.jsonl · GitHub

Once I figure out a good way to visualize this, I’m gonna make pretty graphs and start doing the calculations of how cost-effective all this is. If anyone has any ideas for the best ways to visualize the data I’m collecting or has thoughts on data I should be collecting instead, please tell me and I will make it be done.

Overall, this process has a low enough error rate that I’m willing to call the proof of concept worthy of a trial run against the bigger recording: the 30-or-so-minute interview about 9/11 I found on the Library of Congress website. I will write some code to iterate over the logged outputs of Whisper to see if there are any major deviations/hallucinations that make it nondeterministic. Hopefully there aren’t any!

I’m letting the bigger interview run with 15 iterations per GPU. I expect this will take about an hour to complete, so I’m writing this post while the machines churn.

I’m not the biggest fan of the Whisper ASR server I’m benchmarking with because it doesn’t expose timings in its JSON responses, meaning that I have to take my own timings instead. I’m working around part of this by embedding the MP3 file into my binary with go:embed, and once I get out of this proof-of-concept phase I will adapt the FLAME pattern into my Go program to get the test binary running as close to the target server as possible. I think that for all practical purposes, we can call getting it into the same datacentre good enough for rough evaluations.
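For reference, the go:embed part is just a directive over a byte slice, something like this sketch (the file path is a placeholder for whichever recording the run targets):

// Embed the test recording into the benchmark binary so the harness doesn't
// have to read it from local disk at runtime. The path here is a placeholder.
package main

import (
	_ "embed"
)

//go:embed testdata/recording.mp3
var testMP3 []byte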

Other tools allow me to get more detailed timing results, so we’ll get even better metrics in the future.

Here is the super hacky bash script I used to spawn the instances:

#!/usr/bin/env bash

FLY_ORG="fly-gpus"

declare -A FLY_REGIONS
FLY_REGIONS["a100-40gb"]="ord"
FLY_REGIONS["a100-80gb"]="mia"
FLY_REGIONS["l40s"]="ord"

for size in ${!FLY_REGIONS[@]}; do
  app_name="xe-[redacted]-${size}"
  printf "\n\n\n" | cat
  echo "setting up ${app_name}"
  set -x
  fly apps create ${app_name} -o ${FLY_ORG}
  volume_id=$(fly volumes create whisper_zine_cache --yes -s 10 -a ${app_name} -r ${FLY_REGIONS[$size]} --vm-gpu-kind ${size} --json | jq -r .id)
  fly machines run onerahmet/openai-whisper-asr-webservice:latest-gpu --vm-gpu-kind ${size} -p 443:9000/tcp:tls:http -p 80:9000/tcp:http -r ${FLY_REGIONS[$size]} -v ${volume_id}:/root/.cache/whisper -e ASR_MODEL=large -e ASR_ENGINE=faster_whisper -a ${app_name}
  fly ips allocate-v4 --shared -a ${app_name}
  fly ips allocate-v6 -a ${app_name}
  set +x

  echo "https://${app_name}.fly.dev"
done

Yes, before you ask, that is me using hashmaps (okay, okay, associative arrays in bash terminology, array-maps/alists in emacs lisp terminology) in bash to better associate GPU kinds with Fly.io regions. I was horrified to discover that bash has hashmaps at all, and the even more horrifying part is that it works. In the future I’ll probably use Terraform for this, but I wanted to get something out the door, so crimes it was!

Do you have any thoughts or questions?

3 Likes

Super unscientific result as I’m watching this run:

2024/01/16 12:26:24 gpu: l40s, iteration: 1
2024/01/16 12:26:24 connect: 12.279125ms
2024/01/16 12:26:24 upload: 4.565519833s
2024/01/16 12:26:24 first response byte: 2m57.263536375s
2024/01/16 12:26:24 last response byte: 81.906ms

2024/01/16 12:26:36 gpu: a100-80gb, iteration: 1
2024/01/16 12:26:36 connect: 10.534792ms
2024/01/16 12:26:36 upload: 7.975923208s
2024/01/16 12:26:36 first response byte: 3m5.791626625s
2024/01/16 12:26:36 last response byte: 16.830791ms

2024/01/16 12:27:33 gpu: a100-40gb, iteration: 1
2024/01/16 12:27:33 connect: 16.556958ms
2024/01/16 12:27:33 upload: 4.68870125s
2024/01/16 12:27:33 first response byte: 4m6.050395709s
2024/01/16 12:27:33 last response byte: 94.038875ms

I’m not sure why the A100-80gb is responding faster than the A100-40gb. On paper they should be identical save for the interface between the system and the GPU. From what I understand about Whisper, most of the data should already be in VRAM when it does inference. More experimentation will be required.

1 Like

Strange, I seem to have lost the silicon lottery or something, because after restarting my L40s Machine the GPU seems anomalously slow.

2024/01/16 13:14:22 gpu: l40s, iteration: 7
2024/01/16 13:14:22 connect: 0s
2024/01/16 13:14:22 upload: 10.99971125s
2024/01/16 13:14:22 first response byte: 5m24.315047916s
2024/01/16 13:14:22 last response byte: 18.831833ms

I will investigate, I may have to migrate over to the FLAME flow sooner rather than later to remove the variable of network latency. I really hope it isn’t a silicon lottery problem.

Oh god I just got an even longer response:

2024/01/16 13:27:19 gpu: l40s, iteration: 8
2024/01/16 13:27:19 connect: 0s
2024/01/16 13:27:19 upload: 7.028979542s
2024/01/16 13:27:19 first response byte: 12m49.68917775s
2024/01/16 13:27:19 last response byte: 56.3095ms

Yeah, something is very wrong with either that card in particular or my testing methodology.

Some googling around suggests that I may be encountering a thermal slowdown, either from the GPU die itself or from the VRAM getting too hot. I’m not entirely sure how to test for that, but I’ll try to write something to monitor the GPU temperatures, I guess. I think I may need to FLAME the binary onto the target Machine so that I can pipe back temperature information as well as response timings.

Will investigate.

2 Likes

Hi, was hoping to get to things yesterday, but hashtag life happened. Digging back into this.

I vaguely recall the idea of the telefork() function that a fellow CS person wrote about a few years ago. This is giving me ideas about how I can do this benchmarking in a better way than relying on the public Internet.

There’s this super old UNIX scripting tool called expect that lets you open a shell, pipe in commands, and wait for expected responses. With it you can do super boring things like this:

spawn ssh xe@pneuma
expect "$"
send "uptime\n"
expect "$"
send "exit\n"

That will spawn an SSH session to the host pneuma as user xe in a subprocess, wait for a $ (which usually indicates a shell prompt is ready), run the uptime command, wait for the shell to be ready again, then exit. You can easily imagine how to go from there to more complicated sysadmin things like making users or installing packages.

In order to benchmark things better, I’m going to need to get more detailed insights into what’s going on with the GPUs at play. Thankfully the nvidia-smi command is really good at getting us detailed metrics of everything the card is doing. I’m rewriting my benchmarking code to telefork over a little metrics_snitch program that will run this abomination of a command while it’s benchmarking the GPU:

nvidia-smi --query-gpu=gpu_name,gpu_bus_id,vbios_version,temperature.gpu,pstate,utilization.gpu,utilization.memory,memory.total,memory.free,memory.used,clocks.current.memory,clocks.max.memory,temperature.memory,temperature.gpu.tlimit,clocks_event_reasons.hw_thermal_slowdown,clocks_event_reasons.hw_slowdown,clocks_event_reasons.sw_power_cap,reset_status.drain_and_reset_recommended --format=csv -lms 200

I’m going to make the program send back a zipfile with the CSV output of nvidia-smi and the timing details for the Whisper run. This will be saved and put into a Google Sheet that I’ll make public. Once I’ve verified this on one Machine, I’m gonna spawn a bunch more and do this benchmarking on all of them.
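The metrics_snitch part doesn’t need to be fancy: run nvidia-smi as a subprocess for the duration of the benchmark and save its CSV output. A rough sketch of the shape (the program structure and file names are placeholders, and I’ve trimmed the query list for readability):

// Sketch of the metrics_snitch idea: log nvidia-smi's CSV output every 200ms
// while the benchmark runs, then stop when the benchmark is done.
package main

import (
	"context"
	"os"
	"os/exec"
	"time"
)

func snitchGPUMetrics(ctx context.Context, outPath string) error {
	out, err := os.Create(outPath)
	if err != nil {
		return err
	}
	defer out.Close()

	// -lms 200 makes nvidia-smi re-query every 200ms until the context is
	// cancelled, at which point the subprocess gets killed.
	cmd := exec.CommandContext(ctx, "nvidia-smi",
		"--query-gpu=gpu_name,temperature.gpu,utilization.gpu,memory.used,clocks.current.memory",
		"--format=csv",
		"-lms", "200",
	)
	cmd.Stdout = out
	return cmd.Run()
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	go func() {
		// The error here is expected to be "signal: killed" once we cancel.
		_ = snitchGPUMetrics(ctx, "gpu-metrics.csv")
	}()

	// ... run the Whisper benchmark here ...
	time.Sleep(5 * time.Second) // stand-in for the actual benchmark

	cancel()
	time.Sleep(100 * time.Millisecond) // give the subprocess a moment to exit
}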

1 Like

Okay, I think I have a version of this working on one Machine. Here’s the temperature graph for a single iteration of the test on my test recording file (this was mid-series):

One problem I’m running into is that Google Sheets really really hates the CSV that nvidia-smi generates, meaning that it can’t automagically detect units when a value looks like this (GPU utilization):

89 %

This is incredibly frustrating and I will need to do some post-processing of the CSV in order to get something that Google Sheets is happy with.
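The post-processing is really just stripping the unit suffixes off the numeric fields so the spreadsheet sees plain numbers. A minimal sketch of what I mean (the exact set of unit strings here is my guess):

// Strip unit suffixes like " %" and " MiB" from nvidia-smi CSV fields so a
// spreadsheet can parse them as numbers. Reads CSV on stdin, writes cleaned
// CSV to stdout.
package main

import (
	"encoding/csv"
	"os"
	"strings"
)

var unitSuffixes = []string{" %", " MiB", " W", " MHz", " C"}

func cleanField(s string) string {
	s = strings.TrimSpace(s)
	for _, suffix := range unitSuffixes {
		s = strings.TrimSuffix(s, suffix)
	}
	return s
}

func main() {
	r := csv.NewReader(os.Stdin)
	r.TrimLeadingSpace = true
	w := csv.NewWriter(os.Stdout)
	defer w.Flush()

	for {
		record, err := r.Read()
		if err != nil {
			break // io.EOF (or a parse error) ends the stream
		}
		for i, field := range record {
			record[i] = cleanField(field)
		}
		if err := w.Write(record); err != nil {
			break
		}
	}
}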

Here’s one of the runs of the big 9/11 interview:

The GPU temperatures aren’t getting anywhere near high enough to cause thermal throttling, so now I’m going to see if the individual GPU is the slow part. Time to dynamically spin up Machines from Go!

1 Like

Okay, I’ve managed to build this out properly. The overall flow looks like this:

It creates a new Machine, punishes the GPU a bit with a big trout, gets some results back, and returns them as zip files. It also destroys everything it creates, so nothing sticks around.
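In code it’s not much more than driving flyctl from Go. A rough sketch of the shape (not the actual program; the flags are the same ones from the bash script above, with the app, volume, and port plumbing trimmed):

// Sketch of the orchestration flow: create a Whisper Machine on a given GPU
// kind, run the benchmark against it, then tear everything down again.
package main

import (
	"fmt"
	"os/exec"
)

func run(args ...string) error {
	cmd := exec.Command("fly", args...)
	out, err := cmd.CombinedOutput()
	if err != nil {
		return fmt.Errorf("fly %v: %w: %s", args, err, out)
	}
	return nil
}

func benchGPUKind(appName, region, gpuKind string) error {
	// Spin up a Whisper Machine on the requested GPU kind.
	if err := run("machines", "run",
		"onerahmet/openai-whisper-asr-webservice:latest-gpu",
		"--vm-gpu-kind", gpuKind,
		"-r", region,
		"-e", "ASR_MODEL=large",
		"-e", "ASR_ENGINE=faster_whisper",
		"-a", appName,
	); err != nil {
		return err
	}

	// ... punish the GPU with the big MP3 and collect the zip of results ...

	// Tear everything down so nothing sticks around between runs.
	return run("apps", "destroy", appName, "--yes")
}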

Right now I’m re-running against my test MP3 and it seems to be working okay. Next I’m gonna run it against the big MP3 with 30 iterations. This should punish things appropriately so I can tell whether one of the L40s cards is slower than the others or not. I’ve also never run multiple things at once on the same computer box, so it should be interesting to see if the cards start thermal throttling (the clocks_event_reasons.hw_thermal_slowdown column in Google Sheets) when they are all being punished at once.

Here we go!

3 Likes

Hi, bumping the thread to avoid the auto-locking. I’ve been working on some writing and haven’t gotten back to this yet. Will continue to update as I bench things!

2 Likes


I finally got to this. It took a moment to get back to it because I had to figure out how to make everything resilient against random failures, which are common in AI land. With judicious use of the backoff package, I think I have something that’s a bit more robust. The only downside is that it makes code that looks like this:

volID, err := makeVolume(name)
if err != nil {
  return fmt.Errorf("can't make volume %s: %w", name, err)
}

Become something like this:

bo := backoff.NewExponentialBackOff()
volID, err := backoff.RetryWithData[string](func() (string, error) {
  return makeVolume(name)
}, bo)
if err != nil {
  return fmt.Errorf("can't make volume %s: %w", name, err)
}

Which does have the advantage of working, but looks pretty terrible. It’s making me think that there should be some higher-level language feature for “retriable actions” or something. Maybe Haskell has something for this lol.
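In the meantime, one partial fix is to hide the boilerplate behind a tiny generic helper of my own (a sketch that would live next to the code above; retry is not something the backoff package ships):

// retry wraps backoff.RetryWithData with a fresh exponential backoff so the
// call sites stay short. This helper is mine, not part of the backoff package.
func retry[T any](f func() (T, error)) (T, error) {
	return backoff.RetryWithData[T](f, backoff.NewExponentialBackOff())
}

That turns the call site back into volID, err := retry(func() (string, error) { return makeVolume(name) }) and keeps the error wrapping where it was.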

In an amazing display of forward thinking, I made my test run go against the big MP3. HOWEVER, I now have more than vibes to prove that one of the GPUs is in fact slower than the others.

{"time":"2024-01-30T18:54:49.026052159Z","level":"INFO","msg":"done with whisper","run_id":"run_1706640599_0","gpu_kind":"l40s","iteration":0,"d":"4m7.709230967s"}
{"time":"2024-01-30T18:56:14.853993729Z","level":"INFO","msg":"done with whisper","run_id":"run_1706640599_1","gpu_kind":"l40s","iteration":0,"d":"5m17.591253507s"}
{"time":"2024-01-30T18:55:38.903305407Z","level":"INFO","msg":"done with whisper","run_id":"run_1706640599_2","gpu_kind":"l40s","iteration":0,"d":"4m26.109807637s"}
{"time":"2024-01-30T18:59:55.627918628Z","level":"INFO","msg":"done with whisper","run_id":"run_1706640599_3","gpu_kind":"l40s","iteration":0,"d":"8m16.322985979s"}
{"time":"2024-01-30T18:55:40.845916989Z","level":"INFO","msg":"done with whisper","run_id":"run_1706640599_4","gpu_kind":"l40s","iteration":0,"d":"3m48.186571074s"}
{"time":"2024-01-30T18:55:22.600537897Z","level":"INFO","msg":"done with whisper","run_id":"run_1706640599_5","gpu_kind":"l40s","iteration":0,"d":"3m16.306365422s"}
{"time":"2024-01-30T18:56:36.024369079Z","level":"INFO","msg":"done with whisper","run_id":"run_1706640599_6","gpu_kind":"l40s","iteration":0,"d":"4m18.252813998s"}
{"time":"2024-01-30T18:55:59.487809669Z","level":"INFO","msg":"done with whisper","run_id":"run_1706640599_7","gpu_kind":"l40s","iteration":0,"d":"3m19.929520393s"}

Notice that the _3 GPU finishes the test almost twice as slowly as the other GPUs do. This is…odd and unexpected. I’d personally expect the silicon lottery to put any given card within roughly 10% of the others. Maybe there’s also some caching thing going on, or cache efficiency that gets better over time. I don’t totally understand what’s going on here.

2 Likes

About to go to bed, but I heard my laptop finish a gigacronch of 45 iterations across eight Machines. Here are the minimum and maximum times for each set of runs:

$ cat logs.jsonl | grep '"done with whisper"' | go run bucketize.go
run_1706654685_0: 3m33.193021501s 9m29.130762964s
run_1706654685_1: 5m3.108369331s 46m24.064212781s
run_1706654685_2: 3m28.388677465s 5m48.226092209s
run_1706654685_3: 3m54.321197451s 25m31.83161095s
run_1706654685_4: 4m47.765022434s 36m14.309523984s
run_1706654685_5: 4m48.319589059s 31m42.147417123s
run_1706654685_6: 3m38.005374119s 10m27.171573667s
run_1706654685_7: 4m25.571857105s 38m37.07276725s

Note the variance between 5m and 46m for transcribing a 30-minute audio file. Something feels entirely wrong here. I’m going to change approaches and try another thing: text generation with Ollama. I have more control over the randomness there and can make a hopefully deterministic benchmark. There has to be something wrong with what I’m doing with Whisper here; it can’t possibly be right this way. The silicon lottery can only account for so much.

I really hope so at least.
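(For reference, bucketize.go is nothing clever. It roughly does this, a reconstruction sketch rather than the exact code: read the JSONL log on stdin, group the “d” durations by run_id, and print the min and max per group.)

// Group "done with whisper" log lines by run_id and print min/max durations.
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
	"sort"
	"time"
)

type logLine struct {
	RunID string `json:"run_id"`
	D     string `json:"d"`
}

func main() {
	type bucket struct{ min, max time.Duration }
	buckets := map[string]*bucket{}

	sc := bufio.NewScanner(os.Stdin)
	for sc.Scan() {
		var ll logLine
		if err := json.Unmarshal(sc.Bytes(), &ll); err != nil {
			continue // skip lines that aren't our JSON
		}
		d, err := time.ParseDuration(ll.D)
		if err != nil {
			continue
		}
		b, ok := buckets[ll.RunID]
		if !ok {
			buckets[ll.RunID] = &bucket{min: d, max: d}
			continue
		}
		if d < b.min {
			b.min = d
		}
		if d > b.max {
			b.max = d
		}
	}

	var ids []string
	for id := range buckets {
		ids = append(ids, id)
	}
	sort.Strings(ids)
	for _, id := range ids {
		b := buckets[id]
		fmt.Printf("%s: %s %s\n", id, b.min, b.max)
	}
}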

1 Like

@TheXe have you checked that the volumes on the hosts aren’t dodgy? We had an issue a while back where some hosts had dodgy storage that could stall and cause problems. Possibly Whisper waits for stalls and other such issues to resolve instead of killing the process, at the expense of some runs taking much longer.

EDIT: also, I believe the hypervisor used for GPU Machines is different. Could it possibly be a misconfigured hypervisor? Possibly something to do with the volumes. Is there a difference between guaranteed performance and boost performance (e.g. when the host isn’t busy)?

1 Like

Have you checked that the volumes on the hosts aren’t dodgy?

I haven’t checked that yet! This is a fairly new server, hot off the presses, so I’m assuming it is otherwise “healthy”. Do you know of anything I can use to test for I/O stalls like that?

I believe the hypervisor used for GPU machines is different. Could it possibly be a misconfigured hypervisor?

IIRC CPU instances use Firecracker and GPU instances use cloud-hypervisor. I’m not aware of any configuration tweaks, but I can look into it. I’ll forward this thread to the GPU team so they can take a closer look.

Is there a difference between guaranteed performance and boost performance (e.g. The host isn’t busy).

Are you referring to the difference between shared and performance CPUs? If so, shared CPUs are the “oversubscribed” tier where other Machines on the computer box “compete” for CPU time. GPU instances use performance CPUs, where the VMs are only subscribed up to the maximum number of cores on the computer box.

I think the rest could be something involving PCIe bandwidth or other unmeasurable things like that, but these are server chassis dedicated to GPU workloads, with some pretty wide EPYC processors that have a crapton of PCIe lanes. I’ve been running into a PCIe bandwidth limit on my personal rig (Ryzen 5950X, RTX 4080, an Elgato capture card, and an unfortunate number of USB cards, because nobody makes my dream motherboard with 40 USB ports on it), but I would have difficulty imagining that we’d run into that on a server motherboard with the equivalent of eight double-RAM RTX 4090s on it.

A good start is taking a look at the Grafana dashboard that Fly provides. The disk metrics can usually give an indication that something is wrong.

Here’s a snapshot of what we were seeing with one of our CockroachDB servers. It also helped that CockroachDB would crash if the disk stalled for too long, making it really obvious.

1 Like

Ah

From what I understand about how Whisper works, it loads the model weights once and then streams chunks of audio through the model to get spoken-word tokens back. I’ve adapted my testing to use Ollama and Llama 2 7B at float16, and I’m getting much more consistent results. I’m guessing that the Whisper wrapper I was using has some hidden nondeterminism that snuck in while I wasn’t looking.
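The nice thing about Ollama is that I can pin the sampling down. Roughly what one benchmark request looks like, assuming the stock Ollama HTTP API on its default port (the model tag and prompt here are placeholders):

// Deterministic-ish Ollama generation request: temperature 0 and a fixed seed
// pin down the sampling. Times the whole generation from the client side.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

type generateRequest struct {
	Model   string         `json:"model"`
	Prompt  string         `json:"prompt"`
	Stream  bool           `json:"stream"`
	Options map[string]any `json:"options"`
}

func main() {
	reqBody, _ := json.Marshal(generateRequest{
		Model:  "llama2:7b",
		Prompt: "Why is the sky blue? Answer as a scholarly article with citations.",
		Stream: false,
		Options: map[string]any{
			"seed":        42, // fixed seed -> repeatable sampling
			"temperature": 0,  // greedy decoding
		},
	})

	start := time.Now()
	resp, err := http.Post("http://localhost:11434/api/generate", "application/json", bytes.NewReader(reqBody))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var out struct {
		Response string `json:"response"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		panic(err)
	}
	fmt.Printf("generated %d bytes in %s\n", len(out.Response), time.Since(start))
}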

I’ll get more detailed numbers soon!

1 Like

@TheXe did you ever get around to benchmarking with Ollama? I’d be keen to know how it went and to get a rough idea of what server size and configuration to use.