I have a .NET Blazor Server app that runs over 2 machines. When the concurrency limit gets hit, it appears to activate the second machine and immediately after I get this error when opening the app in a new tab:
Error: Failed to start the transport ‘WebSockets’: Error: WebSocket failed to connect. The connection could not be found on the server, either the endpoint may not be a SignalR endpoint, the connection ID is not present on the server, or there is a proxy blocking WebSockets. If you have multiple servers check that sticky sessions are enabled.
It appears the machine switching is not compatible with Blazor. Has anybody had experience with this, or any tips to work around it? It doesn’t appear to be an issue when I remove a machine and run everything on 1 machine, but I want to be able to scale across multiple machines.
I don’t know anything about Blazor (but am willing to learn!), but typically when you have WebSockets across multiple servers you need something like Redis in place so the servers can share session data.
It seems to be more an issue of sticky sessions, as to my knowledge Blazor needs session affinity. I found the Fly docs page on sticky sessions, “Session Affinity (a.k.a. Sticky Sessions)”, and I am attempting to follow its Fly-Replay option, which seems to be working somewhat. Here is the code (ChatGPT helped, haha); let me explain what’s happening:
public class StickySessionMiddleware
{
    private readonly RequestDelegate _next;

    public StickySessionMiddleware(RequestDelegate next)
    {
        _next = next;
    }

    public async Task InvokeAsync(HttpContext context)
    {
        // Get the current machine ID from the environment
        var currentMachineId = Environment.GetEnvironmentVariable("FLY_MACHINE_ID");
        if (string.IsNullOrEmpty(currentMachineId))
        {
            // If no machine ID is available, proceed without sticky session logic
            await _next(context);
            return;
        }

        // Check for the "fly-machine-id" cookie
        var cookieMachineId = context.Request.Cookies["fly-machine-id"];
        if (string.IsNullOrEmpty(cookieMachineId))
        {
            // If no cookie, set it with the current machine ID
            context.Response.Cookies.Append("fly-machine-id", currentMachineId, new CookieOptions
            {
                MaxAge = TimeSpan.FromDays(6), // Six days expiration
                HttpOnly = true,
                Secure = true,
                SameSite = SameSiteMode.Lax
            });
            await _next(context);
        }
        else if (cookieMachineId != currentMachineId)
        {
            // If the cookie doesn't match, set the Fly-Replay header and return 307
            context.Response.Headers["Fly-Replay"] = $"instance={cookieMachineId}";
            context.Response.StatusCode = StatusCodes.Status307TemporaryRedirect;
        }
        else
        {
            // If everything matches, proceed with the request
            await _next(context);
        }
    }
}
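For anyone wiring this up, here is a minimal sketch of registering the middleware. It assumes the classic Blazor Server hosting setup (`AddServerSideBlazor` / `MapBlazorHub` with a `_Host` page) and implicit usings; on the newer Blazor Web App template the service and endpoint calls differ, but the middleware line should stay the same. The point is that it runs before the Blazor endpoints, so the `/_blazor` negotiate and WebSocket requests go through the sticky-session check as well:

```csharp
// Program.cs (sketch; assumes the classic Blazor Server template)
var builder = WebApplication.CreateBuilder(args);
builder.Services.AddRazorPages();
builder.Services.AddServerSideBlazor();

var app = builder.Build();

app.UseStaticFiles();
app.UseRouting();

// Run the sticky-session check before any Blazor endpoint is reached
app.UseMiddleware<StickySessionMiddleware>();

app.MapBlazorHub();
app.MapFallbackToPage("/_Host");

app.Run();
```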
I have set a concurrency soft limit at 5 and a hard limit at 10 for testing purposes. Before implementing the code above, I would open 5 tabs and get the connection error. Now I can open all 10 and then still get the connection error. The message in logs looks like this:
[PR04] could not find a good candidate within 21 attempts at load balancing
Instance 185e717a475218 reached hard limit of 10 concurrent requests. This usually indicates your app is not responding fast enough for the traffic levels it is handling. Scaling resources, number of instances or increasing your hard limit might help.
After it hits the hard limit, shouldn’t it be able to move to the second instance I have? And when I monitor the machines in the dashboard it says they are both running, but all traffic is only getting routed to the one instance. I am a novice with this load balancing/multi instance stuff. Am I confused on what it should be doing? I mostly just want to be able to run this app across multiple machines and have sticky sessions. Currently the app breaks for all tabs when it hits the limit.
That makes sense about the multiple tabs sticking to the same server, but even when I open the app in multiple browsers, when it hits the limit on one machine it just throws that error instead of going to the other machine that has not hit its concurrency limit. I tried the Redis caching and SignalR options but I still get the same symptoms.
This is for anyone else that tries to run Blazor across multiple machines:
- It seems that implementing sticky sessions, similar to the above, is necessary.
- Upgrade to .NET 9, as it handles circuit disconnects/reconnects much more gracefully and without errors.
- Since Blazor works over TCP with WebSockets, setting the concurrency type to “connections” seems more effective at showing who’s using the app, and the load balancing seems to work better (just from viewing the metrics and seeing how it responds to the soft and hard limits when there are multiple concurrent users).
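For reference, the concurrency type is set in fly.toml. A sketch, assuming the app uses an `[http_service]` section (with `[[services]]` the table is `[services.concurrency]` instead) and reusing the test limits from earlier in the thread:

```toml
# fly.toml (fragment): count open connections instead of in-flight requests
[http_service.concurrency]
  type = "connections"
  soft_limit = 5
  hard_limit = 10
```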
A SignalR + Redis backplane may be something to look into, but I know that the app should be able to handle thousands of connections, and when I added a backplane to my app I didn’t really see anything change.
But with the first 3 things, my app seems to be working very well with no errors and lots of concurrent users.
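For reference, adding the SignalR + Redis backplane mentioned above looks roughly like this. It is a sketch, assuming the Microsoft.AspNetCore.SignalR.StackExchangeRedis NuGet package is installed; `REDIS_CONNECTION` is a hypothetical environment variable name holding a StackExchange.Redis connection string. As far as I understand it, the backplane only relays hub messages between servers, while a Blazor circuit's state still lives in memory on one machine, which would explain why it does not replace sticky sessions:

```csharp
// Program.cs (sketch; REDIS_CONNECTION is a hypothetical variable name)
var builder = WebApplication.CreateBuilder(args);

var redisConnection = Environment.GetEnvironmentVariable("REDIS_CONNECTION")
    ?? "localhost:6379"; // fallback for local testing only

builder.Services.AddRazorPages();
builder.Services.AddServerSideBlazor();

// Relay SignalR hub messages between machines through Redis
builder.Services.AddSignalR().AddStackExchangeRedis(redisConnection);

var app = builder.Build();
app.MapBlazorHub();
app.MapFallbackToPage("/_Host");
app.Run();
```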