Sprites exec WebSocket: U+FFFD placeholders in replay buffer corrupt CJK/wide characters

Summary

When attaching to an existing Sprites terminal session via the exec WebSocket API, CJK (Chinese/Japanese/Korean) characters in the scrollback replay data are each followed by a U+FFFD (Replacement Character), causing garbled display in xterm.js. Live (real-time) output renders correctly.

Steps to Reproduce

  1. Create a Sprites exec session with tty: true
  2. Output text containing CJK characters (e.g., run a command that prints Japanese text) until the scrollback buffer is populated
  3. Attach to the same session from a new WebSocket connection (/v1/sprites/{name}/exec/{sessionId})
  4. Scroll up to view the replayed scrollback history
  5. Observe that each CJK character in the replayed history is followed by a U+FFFD

Root Cause

The screen buffer appears to store a U+FFFD placeholder in the second cell of each wide (fullwidth) character. When the buffer is serialized and sent as replay data, these placeholders are included in the byte stream.

Terminal emulators like xterm.js manage wide character column widths internally, so the second-cell placeholder is unnecessary and renders as a visible replacement character.
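
To illustrate the suspected cause, here is a minimal sketch of a row serializer that skips the second cell of a wide character instead of emitting its placeholder. The `cell` type, its `continuation` flag, and `serializeRow` are hypothetical names invented for this example; the actual Sprites screen buffer layout is not known to us.

```go
package main

import "strings"

// cell is a HYPOTHETICAL screen-buffer cell. The real Sprites buffer
// layout is not documented in this report.
type cell struct {
	r            rune // character stored in the cell (0 if the cell is empty)
	continuation bool // true for the second column of a wide character
}

// serializeRow emits one rune per character and skips continuation
// cells, so no placeholder ever reaches the replay stream.
func serializeRow(row []cell) string {
	var sb strings.Builder
	for _, c := range row {
		if c.continuation {
			continue // second column of a wide char: nothing to emit
		}
		if c.r == 0 {
			sb.WriteRune(' ') // render empty cells as spaces
			continue
		}
		sb.WriteRune(c.r)
	}
	return sb.String()
}
```

With a row holding こ, a continuation cell, ん, a continuation cell, and `a`, `serializeRow` returns `こんa`. The receiving terminal then assigns two columns to each wide character on its own.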

Hex dump evidence

# Each CJK char (3-byte UTF-8) is followed by ef bf bd (U+FFFD)
e38193 efbfbd e38293 efbfbd e381ab efbfbd e381a1 efbfbd
こ      FFFD   ん      FFFD   に      FFFD   ち      FFFD

# Same pattern with ANSI color escapes:
\e[38;5;231m 適 \e[0m FFFD \e[38;5;231;48;5;237m 当 \e[0m FFFD
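
For anyone who wants to check a captured frame themselves, the following sketch decodes a spaced hex dump like the one above and counts the U+FFFD sequences in it. The helper names `decodeDump` and `countReplacements` are ours and are not part of any Sprites tooling.

```go
package main

import (
	"bytes"
	"encoding/hex"
	"strings"
)

// replacementUTF8 is the UTF-8 encoding of U+FFFD.
var replacementUTF8 = []byte{0xef, 0xbf, 0xbd}

// decodeDump turns a whitespace-separated hex dump into raw bytes.
func decodeDump(dump string) ([]byte, error) {
	return hex.DecodeString(strings.Join(strings.Fields(dump), ""))
}

// countReplacements reports how many U+FFFD sequences a frame contains.
func countReplacements(frame []byte) int {
	return bytes.Count(frame, replacementUTF8)
}
```

Feeding the first hex line above to `decodeDump` and then to `countReplacements` gives 4, one per wide character. Running the same count on a live PTY frame is a quick way to confirm that only the replay frame is affected.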

Observed message sequence

  #    Type             Content
  1    Text             session_info (tty: true)
  2    Binary (71 KB)   Replay/scrollback data: contains U+FFFD after every wide char
  3+   Binary           Live PTY data: no U+FFFD, renders correctly

Current Workaround

We strip every U+FFFD sequence (bytes ef bf bd) from binary frames in our WebSocket proxy before forwarding them to the client:

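// UTF-8 encoding of U+FFFD (REPLACEMENT CHARACTER).
var utf8ReplacementChar = []byte{0xef, 0xbf, 0xbd}

// In the sprites→client forwarding loop (needs the "bytes" package;
// websocket.BinaryMessage comes from the WebSocket package in use):
if msgType == websocket.BinaryMessage {
    // Drop every U+FFFD sequence from the frame before forwarding it.
    data = bytes.ReplaceAll(data, utf8ReplacementChar, nil)
}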

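This is safe in practice because U+FFFD rarely appears in legitimate terminal output, although any genuine U+FFFD (for example, one shown for invalid input) is stripped as well. Once the underlying issue is fixed, the filter becomes a no-op.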
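
If stripping every U+FFFD is too blunt, a narrower filter is possible. The sketch below drops a U+FFFD only when it directly follows a wide character, allowing SGR color escapes in between, as in the second hex sample. It is not what we run in production. `isWide` is an assumption: it approximates "wide" with the Han, Hiragana, Katakana, and Hangul script tables from the standard library, so fullwidth forms and emoji are not covered.

```go
package main

import (
	"unicode"
	"unicode/utf8"
)

// isWide is an APPROXIMATION of East Asian wide characters based on
// script tables; it is not a full East Asian Width lookup.
func isWide(r rune) bool {
	return unicode.In(r, unicode.Han, unicode.Hiragana, unicode.Katakana, unicode.Hangul)
}

// sgrLen returns the length of an SGR escape (ESC [ digits/semicolons m)
// at the start of b, or 0 if b does not start with one.
func sgrLen(b []byte) int {
	if len(b) < 3 || b[0] != 0x1b || b[1] != '[' {
		return 0
	}
	for i := 2; i < len(b); i++ {
		switch c := b[i]; {
		case c == 'm':
			return i + 1
		case (c >= '0' && c <= '9') || c == ';':
			continue
		default:
			return 0
		}
	}
	return 0
}

// stripWidePlaceholders removes an encoded U+FFFD only when it follows a
// wide character, optionally separated by SGR escapes. Every other byte,
// including a standalone U+FFFD, is passed through unchanged.
func stripWidePlaceholders(data []byte) []byte {
	out := make([]byte, 0, len(data))
	afterWide := false
	for i := 0; i < len(data); {
		if n := sgrLen(data[i:]); n > 0 {
			out = append(out, data[i:i+n]...) // keep escape, keep state
			i += n
			continue
		}
		r, size := utf8.DecodeRune(data[i:])
		// size == 3 separates a real U+FFFD from an invalid byte (size 1).
		if r == utf8.RuneError && size == 3 && afterWide {
			i += size // drop the placeholder
			afterWide = false
			continue
		}
		out = append(out, data[i:i+size]...)
		afterWide = isWide(r)
		i += size
	}
	return out
}
```

In the forwarding loop this would replace the `bytes.ReplaceAll` call, for example `data = stripWidePlaceholders(data)`. The state is per call, so a placeholder that happens to land at the very start of the next frame would not be removed.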

Expected Behavior

Replay data should not contain U+FFFD wide-character placeholders. The serialized screen buffer should emit only the actual character for each wide character cell, omitting the second-column placeholder.

Environment

  • Sprites exec WebSocket API (wss://api.sprites.dev/v1/sprites/{name}/exec/{sessionId})
  • Client: xterm.js v5 with Unicode11 addon
  • Locale: ja_JP.UTF-8
