.NET Blazor Server websocket connection errors when load balancer switches machines

I have a .NET Blazor Server app that runs over 2 machines. When the concurrency limit gets hit, it appears to activate the second machine and immediately after I get this error when opening the app in a new tab:

Error: Failed to start the transport ‘WebSockets’: Error: WebSocket failed to connect. The connection could not be found on the server, either the endpoint may not be a SignalR endpoint, the connection ID is not present on the server, or there is a proxy blocking WebSockets. If you have multiple servers check that sticky sessions are enabled.

It appears the machine switching is not compatible with Blazor. Has anybody had experience with this, or have any tips to work around it? It doesn’t appear to be an issue when I remove a machine and run everything on 1 machine, but I want to be able to scale across multiple machines.

I don’t know anything about Blazor (but am willing to learn!), but typically when you have WebSockets across multiple servers you need something like Redis in place so the servers can share session data.

The error message you are seeing mentions SignalR, so perhaps this will help: Redis backplane for ASP.NET Core SignalR scale-out

Information on setting up Redis on Fly.io can be found here: Upstash for Redis®* · Fly Docs
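
For reference, the Redis backplane suggested above could be wired into a Blazor Server app roughly as follows. This is a minimal sketch, not a verified fix for this problem. It assumes the .NET 6/7 hosting model with implicit usings enabled, the Microsoft.AspNetCore.SignalR.StackExchangeRedis NuGet package, and a hypothetical REDIS_CONNECTION setting that holds a StackExchange.Redis-style configuration string (not necessarily the same format as the URL a Redis provider hands out).

```csharp
// Program.cs: sketch only.
// Requires the Microsoft.AspNetCore.SignalR.StackExchangeRedis NuGet package.
var builder = WebApplication.CreateBuilder(args);

// REDIS_CONNECTION is a hypothetical name chosen for this example.
// Environment variables are part of the default configuration sources.
var redisConnection = builder.Configuration["REDIS_CONNECTION"]
    ?? throw new InvalidOperationException("REDIS_CONNECTION is not set.");

builder.Services.AddRazorPages();
builder.Services.AddServerSideBlazor();

// Route SignalR hub messages through Redis so every machine sees them.
builder.Services.AddSignalR().AddStackExchangeRedis(redisConnection);

var app = builder.Build();

app.UseStaticFiles();
app.UseRouting();
app.MapBlazorHub();
app.MapFallbackToPage("/_Host");

app.Run();
```

Note that the backplane only distributes hub messages between servers. A Blazor Server circuit keeps its state in the memory of the machine that created it, and the SignalR scale-out docs still call for sticky sessions, so this would likely need to be combined with session affinity rather than replace it.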

It seems to be more an issue of sticky sessions, since to my knowledge Blazor needs session affinity. I found a doc about sticky sessions: Session Affinity (a.k.a. Sticky Sessions) · Fly Docs. I am attempting to follow its Fly-Replay option, and it seems to be working somewhat. Here is the code (ChatGPT helped with it, haha); let me explain what’s happening:

public class StickySessionMiddleware
{
    private readonly RequestDelegate _next;

    public StickySessionMiddleware(RequestDelegate next)
    {
        _next = next;
    }

    public async Task InvokeAsync(HttpContext context)
    {
        // Get the current machine ID from the environment
        var currentMachineId = Environment.GetEnvironmentVariable("FLY_MACHINE_ID");

        if (string.IsNullOrEmpty(currentMachineId))
        {
            // If no machine ID is available, proceed without sticky session logic
            await _next(context);
            return;
        }

        // Check for the "fly-machine-id" cookie
        var cookieMachineId = context.Request.Cookies["fly-machine-id"];

        if (string.IsNullOrEmpty(cookieMachineId))
        {
            // If no cookie, set it with the current machine ID
            context.Response.Cookies.Append("fly-machine-id", currentMachineId, new CookieOptions
            {
                MaxAge = TimeSpan.FromDays(6), // Six days expiration
                HttpOnly = true,
                Secure = true,
                SameSite = SameSiteMode.Lax
            });

            await _next(context);
        }
        else if (cookieMachineId != currentMachineId)
        {
            // If the cookie doesn't match, set the Fly-Replay header and return 307
            context.Response.Headers["Fly-Replay"] = $"instance={cookieMachineId}";
            context.Response.StatusCode = StatusCodes.Status307TemporaryRedirect;
        }
        else
        {
            // If everything matches, proceed with the request
            await _next(context);
        }
    }
}
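
The middleware above only takes effect once it is registered in the request pipeline, ahead of the Blazor hub endpoint, so that the SignalR negotiate and WebSocket requests also pass through it. A minimal sketch of that registration, assuming the .NET 6/7 Blazor Server template layout (the "/_Host" fallback page comes from that template and may differ in your app):

```csharp
// Program.cs: sketch only, assuming the .NET 6/7 Blazor Server template.
var builder = WebApplication.CreateBuilder(args);

builder.Services.AddRazorPages();
builder.Services.AddServerSideBlazor();

var app = builder.Build();

// Register first so every request, including the /_blazor negotiate and
// WebSocket upgrade requests, is checked against the fly-machine-id cookie
// before any other middleware or endpoint handles it.
app.UseMiddleware<StickySessionMiddleware>();

app.UseStaticFiles();
app.UseRouting();

app.MapBlazorHub();
app.MapFallbackToPage("/_Host");

app.Run();
```

Placing it before UseStaticFiles means static assets are replayed to the pinned machine as well. Moving it after UseStaticFiles would let any machine serve those files, which is a trade-off rather than a requirement.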

I have set a concurrency soft limit at 5 and a hard limit at 10 for testing purposes. Before implementing the code above, I would open 5 tabs and get the connection error. Now I can open all 10 and then still get the connection error. The message in logs looks like this:

[PR04] could not find a good candidate within 21 attempts at load balancing
Instance 185e717a475218 reached hard limit of 10 concurrent requests. This usually indicates your app is not responding fast enough for the traffic levels it is handling. Scaling resources, number of instances or increasing your hard limit might help.

After it hits the hard limit, shouldn’t it be able to move to the second instance I have? And when I monitor the machines in the dashboard it says they are both running, but all traffic is only getting routed to the one instance. I am a novice with this load balancing/multi instance stuff. Am I confused on what it should be doing? I mostly just want to be able to run this app across multiple machines and have sticky sessions. Currently the app breaks for all tabs when it hits the limit.

All tabs within the same browser will use the same cookie, so will all “stick” to the same server.

While sticky sessions will get you further, Redis + SignalR would provide a better solution.

That makes sense about the multiple tabs sticking to the same server. But even when I open the app in multiple browsers, once one machine hits its limit the app just throws that error instead of routing to the other machine, which has not hit its concurrency limit. I tried the Redis caching and SignalR options, but I still get the same symptoms.