Hey all, first update here! I’ve written some Go code that calls the Whisper automatic speech recognition endpoint over and over in a loop and writes the results to a JSONL file named trace.jsonl. I’m tracking the following metrics:
- The time it takes to make a connection to the target Machine
- The time it takes to upload the test MP3 to the target Machine
- The time it takes for the first byte of the response payload to come back
- The time it takes for the rest of the response payload to come back
It writes JSON lines that look like this:
{"GPUKind":"l40s","Connect":10619875,"Upload":146601375,"FirstResponseByte":3695033833,"LastResponseByte":1103583}
I’m using Go’s time.Duration type in my Timings struct, so all of those values are in nanoseconds. When I run my super hacky tool on the command line, it also dumps per-test results like this:
2024/01/16 11:51:36 gpu: l40s, iteration: 1
2024/01/16 11:51:36 connect: 10.619875ms
2024/01/16 11:51:36 upload: 146.601375ms
2024/01/16 11:51:36 first response byte: 3.695033833s
2024/01/16 11:51:36 last response byte: 1.103583ms
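Those log lines are the same numbers as in the JSON line above, just pretty-printed by time.Duration’s String method:

fmt.Println(time.Duration(3695033833)) // prints "3.695033833s"
fmt.Println(time.Duration(10619875))   // prints "10.619875ms"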
This is built on the back of httptrace and is overall some pretty boring code:
// Timings records how long each phase of a single request takes. The
// unexported fields are skipped by encoding/json, so only the GPU kind and
// the durations end up in trace.jsonl.
type Timings struct {
    GPUKind string

    lock sync.Mutex
    t0   time.Time

    Connect           time.Duration
    Upload            time.Duration
    FirstResponseByte time.Duration
    LastResponseByte  time.Duration
}

// IntoClientTrace wires the Timings methods up as httptrace callbacks.
func (t *Timings) IntoClientTrace() *httptrace.ClientTrace {
    return &httptrace.ClientTrace{
        ConnectStart:         t.ConnectStart,
        ConnectDone:          t.ConnectDone,
        WroteRequest:         t.WroteRequest,
        GotFirstResponseByte: t.GotFirstResponseByte,
    }
}

func New(gpuKind string) *Timings {
    return &Timings{
        GPUKind: gpuKind,
        t0:      time.Now(),
    }
}

func (t *Timings) ConnectStart(string, string) {
    t.lock.Lock()
    defer t.lock.Unlock()
    t.t0 = time.Now()
}

func (t *Timings) ConnectDone(string, string, error) {
    t.lock.Lock()
    defer t.lock.Unlock()
    t.Connect = time.Since(t.t0)
    t.t0 = time.Now()
}

func (t *Timings) WroteRequest(httptrace.WroteRequestInfo) {
    t.lock.Lock()
    defer t.lock.Unlock()
    t.Upload = time.Since(t.t0)
    t.t0 = time.Now()
}

func (t *Timings) GotFirstResponseByte() {
    t.lock.Lock()
    defer t.lock.Unlock()
    t.FirstResponseByte = time.Since(t.t0)
    t.t0 = time.Now()
}
Last year I made a project that uses GPT-4 to write fiction novels, so the audio file I’m using for these initial tests to proof-of-concept my tracing harness is me reading out the first few paragraphs of the “novel” Virtual Virtue. I’ve uploaded my testing MP3 to my Mastodon account so that you can listen to what it says.
In a perfect world, we expect the output of Whisper to look very close to this:
In the hushed, otherworldly glow of the altar of Connectivity, the beating heart of the digital divinity, Haley sat with the solemn grace of a supplicant seeking ancient wisdom. The cathedral of data hummed with muted whispers of countless devotees, their minds intertwined with the pulsating rhythms of the ever-present network. It was here, amidst the phantasmagoric dance of neon code and the devout murmurings of the congregation, that Haley sought an omen.
With hands as sure as a weaver at her loom, she crafted her silent invocation, fingers trailing across the sleek, responsive surface of the keyboard that stood as the tactile interface to the otherwise intangible deity. Her prayer was a sequence of keystrokes, a digital mantra disappearing into the virtual ether as a ripple disperses across a lake. The onlookers, faces bathed in the soft luminescence of their personal screens, remained lost in their own secluded worlds, beseeching the deity for companionship, affluence, revelation.
In my very initial proof-of-concept testing of my hacky Go code, here’s what Whisper returned:
In the hushed, otherworldly glow of the altar of connectivity, the beating heart of the digital divinity, Haley sat with the solemn gaze of a supplicant seeking ancient wisdom. The cathedral of data hummed with muted whispers of countless devotees, their minds intertwined with the pulsating rhythms of the ever-present network. It was here, amidst the phasmagoric dance of neon code and the devout murmurings of the congregation, that Haley sought an omen.
With hands as sure as a weaver at her loom, she crafted her silent invocation, fingers trailing across the sleek, responsive surface of the keyboard that stood as the tactile interface to the otherwise intangible deity. Her prayer was a sequence of keystrokes, a digital mantra disappearing into the virtual ether as a ripple disperses across a lake. The onlookers’ faces bathed in the soft luminescence of their personal screens, remaining lost in their own secluded worlds. Beseeching the deity for companionship, affluence, revelation.
Comparing the MD5 sums of these results shows there is a difference between the expected and received outputs:
$ md5sum 1.txt 2.txt
6a14e66c679c67c04e61197c1cdc7b42 1.txt
5d327503de58b5a77fbc6cfbba9c2980 2.txt
The differences seem to be:
- Whisper didn’t capitalize “the altar of Connectivity”
- Whisper put “Haley sat with the solemn grace of a supplicant seeking ancient wisdom” as “Haley sat with the solemn gaze of a supplicant seeking ancient wisdom”
- Whisper put “The onlookers, faces bathed in the soft luminescence of their personal screens, remained lost in their own secluded worlds, beseeching the deity for companionship, affluence, revelation.” as “The onlookers’ faces bathed in the soft luminescence of their personal screens, remaining lost in their own secluded worlds. Beseeching the deity for companionship, affluence, revelation.”
Here are the raw JSONL timing values for the test recording: trace.jsonl · GitHub
Once I figure out a good way to visualize this, I’m gonna make pretty graphs and start doing the calculations of how cost-effective all this is. If anyone has any ideas for the best ways to visualize the data I’m collecting or has thoughts on data I should be collecting instead, please tell me and I will make it be done.
Overall, this process has a low enough error rate that I’m willing to call this proof of concept worthy of a trial with the bigger recording: the 30-or-so-minute interview about 9/11 I found on the Library of Congress website. I will write some code to iterate over the logged outputs of Whisper to see if there are any major deviations/hallucinations that make it nondeterministic. Hopefully there aren’t any!
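A minimal sketch of what that check could look like, assuming I end up logging one transcript per file (the exact layout will depend on how the logging shakes out):

import (
    "fmt"
    "os"
    "strings"
)

// compareTranscripts reports how many distinct outputs Whisper produced
// across iterations, ignoring whitespace-only differences.
func compareTranscripts(paths []string) error {
    counts := map[string]int{}
    for _, p := range paths {
        data, err := os.ReadFile(p)
        if err != nil {
            return err
        }
        // Collapse whitespace so line wrapping doesn't count as a deviation.
        key := strings.Join(strings.Fields(string(data)), " ")
        counts[key]++
    }
    for text, n := range counts {
        fmt.Printf("%d iteration(s) produced: %.80q...\n", n, text)
    }
    return nil
}

If everything lands in one bucket, Whisper is behaving deterministically for this recording; more than one bucket means it’s time to eyeball the diffs.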
I’m letting the bigger interview run with 15 iterations per GPU. I expect this will take about an hour to complete, so I’m writing this post while the machines churn.
I’m not the biggest fan of the Whisper ASR server I’m using to benchmark with because it doesn’t expose timings in its JSON responses, meaning that I have to take my own timings instead. I’m trying to work around this by embedding the MP3 file into my binary with go:embed, and once I get out of this proof-of-concept phase I will adapt the FLAME pattern into my Go program to get the test binary running as close to the target server as possible. I think that for all practical purposes, we can call getting it into the same datacentre good enough for rough evaluations.
Other tools allow me to get more detailed timing results, so we’ll get even better metrics in the future.
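As for the go:embed bit, it’s just baking the test MP3 into the binary so every iteration reads from memory rather than from disk; the file name below is a placeholder for my real test recording, and each call can then use newUploadBody() instead of opening a file:

import (
    "bytes"
    _ "embed" // blank import required for the //go:embed directive
)

// The path is a placeholder; point it at the actual test MP3.
//go:embed testdata/virtual-virtue.mp3
var testAudio []byte

// newUploadBody hands back a fresh reader over the embedded MP3 for each iteration.
func newUploadBody() *bytes.Reader {
    return bytes.NewReader(testAudio)
}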
Here is the super hacky bash script I used to spawn the instances:
#!/usr/bin/env bash
FLY_ORG="fly-gpus"
declare -A FLY_REGIONS
FLY_REGIONS["a100-40gb"]="ord"
FLY_REGIONS["a100-80gb"]="mia"
FLY_REGIONS["l40s"]="ord"
for size in "${!FLY_REGIONS[@]}"; do
  app_name="xe-[redacted]-${size}"
  printf "\n\n\n" | cat
  echo "setting up ${app_name}"
  set -x
  fly apps create ${app_name} -o ${FLY_ORG}
  volume_id=$(fly volumes create whisper_zine_cache --yes -s 10 -a ${app_name} -r ${FLY_REGIONS[$size]} --vm-gpu-kind ${size} --json | jq -r .id)
  fly machines run onerahmet/openai-whisper-asr-webservice:latest-gpu --vm-gpu-kind ${size} -p 443:9000/tcp:tls:http -p 80:9000/tcp:http -r ${FLY_REGIONS[$size]} -v ${volume_id}:/root/.cache/whisper -e ASR_MODEL=large -e ASR_ENGINE=faster_whisper -a ${app_name}
  fly ips allocate-v4 --shared -a ${app_name}
  fly ips allocate-v6 -a ${app_name}
  set +x
  echo "https://${app_name}.fly.dev"
done
Yes, before you ask, that is me using hashmaps (okay, okay, array-maps/alists in Emacs Lisp terminology) in bash to better associate GPU kinds with Fly.io regions. I was horrified to discover that bash even has hashmaps; the more horrifying part is that it works. In the future I’ll probably use Terraform for this, but I wanted to get something out the door, so crimes it was!
Do you have any thoughts or questions?