Challenge 3b: Inconsistency with Jepsen

For the same code, I got the following outcomes:

  1. Validation passed
  2. Validation failed
  3. It broke Jepsen (?)

The validation failed the first time, and when I ran it again, it passed.

spoilers for 3b: gist

(ignore the mutex part, I am trying out some stuff)

The following is the buggy code:

for _, id := range state.nodeIds {
	if id == state.nodeId {
		continue
	}
	// Bug here: this is supposed to ignore the Src, not the Dest
	if id == msg.Dest {
		continue
	}
	if err := n.Send(id, msg.Body); err != nil {
		panic(err)
	}
}
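For reference, the check the comment describes (skip the node the message came from, not the destination) would be:

	if id == msg.Src {
		continue
	}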

I also thought the buggy line might not be relevant, so I removed it, and then I did not get the error at all. I ran it about 10 times and it passed every time. But if I put the if condition back, I run into issues in roughly one out of three runs. In the failure case, I saw it was trying to send a message to a client (c1), but I'm not sure how that happened:

2023-02-25 21:15:31,337{GMT}	WARN	[n1 stdout] maelstrom.process: Error!
java.lang.AssertionError: Assert failed: Invalid dest for message #maelstrom.net.message.Message{:id 1248942, :src "n1", :dest "c10", :body {:in_reply_to 4, :type "broadcast_ok"}}

One time it broke Jepsen and it said sorry:

ERROR [2023-02-25 21:16:32,103] main - jepsen.cli Oh jeez, I'm sorry, Jepsen broke. Here's why:
clojure.lang.ExceptionInfo: Client read timeout
	at slingshot.support$stack_trace.invoke(support.clj:201)
	at maelstrom.client$recv_BANG_.invokeStatic(client.clj:98)
	at maelstrom.client$recv_BANG_.invoke(client.clj:81)
	at maelstrom.client$send_PLUS_recv_BANG_.invokeStatic(client.clj:123)

Here are the Jepsen analysis files: GossipGlommer:3b - Google Drive

I am happy to provide any further details. Should I open this as a bug in Maelstrom, or is it an issue in my code?

It's also not clear to me when your code would send to a client c0. Perhaps add some logging around what is in state.nodeIds before sending?

The Jepsen failure is clearer: it's a client timeout. The issue is that you are nesting broadcasts in your broadcast handler. You generally want to avoid long-running work in RPC handlers so that they can respond quickly. Imagine c1 broadcasts to n1, which then broadcasts to n2, and so on, before n1 can respond to c1. Depending on the topology, you could even have n1 broadcast to n2, which broadcasts back to n1, and you get into a loop.
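A rough sketch of that shape, assuming the usual maelstrom Go library setup (n := maelstrom.NewNode() plus the encoding/json and log imports) and leaving out the part that records the message locally:

n.Handle("broadcast", func(msg maelstrom.Message) error {
	var body map[string]any
	if err := json.Unmarshal(msg.Body, &body); err != nil {
		return err
	}

	// Reply to the caller right away so it never times out waiting on us.
	if err := n.Reply(msg, map[string]any{"type": "broadcast_ok"}); err != nil {
		return err
	}

	// Fan out to the other nodes asynchronously, skipping ourselves and the sender.
	go func() {
		for _, id := range n.NodeIDs() {
			if id == n.ID() || id == msg.Src {
				continue
			}
			if err := n.Send(id, body); err != nil {
				log.Printf("send to %s failed: %v", id, err)
			}
		}
	}()

	return nil
})

The handler returns almost immediately, and the gossip happens in the background.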

Another issue you have is the panic on send failure. If n1 panics, it never responds to c1 (or whoever the sender is), which also causes a client timeout. You should return the error rather than panic there.
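Concretely, in your send loop that would just be:

	if err := n.Send(id, msg.Body); err != nil {
		return err
	}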

Edit: I think I know what's going on with the invalid dest error. That is a reply from your node to the client that sent the broadcast RPC (we can see the broadcast_ok). The issue is that the client has a timeout. So it sent an RPC, your handler took a long time (probably due to the nested RPCs), the client timed out and died, your handler responded, and Jepsen saw no such client, so it threw that error.

Thank you for taking the time to check this out.

It's also not clear to me when your code would send to a client c0. Perhaps add some logging around what is in state.nodeIds before sending?

I did, but I'm unable to reproduce the issue now.

The Jepsen failure is clearer: it's a client timeout. The issue is that you are nesting broadcasts in your broadcast handler. You generally want to avoid long-running work in RPC handlers so that they can respond quickly. Imagine c1 broadcasts to n1, which then broadcasts to n2, and so on, before n1 can respond to c1. Depending on the topology, you could even have n1 broadcast to n2, which broadcasts back to n1, and you get into a loop.

This makes sense. I fixed this part of the code. Any idea what the timeout is set to in Jepsen?

Another issue you have is the panic on send failure. If n1 panics, it never responds to c1 (or whoever the sender is), which also causes a client timeout. You should return the error rather than panic there.

I added a print statement, and this panic never happens. I think Send is just writing to stdout, so it doesn't fail. I can't think of a reason it would panic.

However, I do agree that a panic in the code is not good anyway.

I ran into another issue. My code passed the 3c challenge with the workload specified on the page:

 --node-count 5 --time-limit 20 --rate 10 --nemesis partition

So I moved to the next challenge where we crank up the numbers, but I saw the following error:

java.lang.AssertionError: Assert failed: Invalid dest for message #maelstrom.net.message.Message{:id 71, :src "n0", :dest "c19", :body {:in_reply_to 1, :type "init_ok"}}

I went through the Go source code to understand why this happens; basically, I was sending reply messages on init. I fixed the issue by not responding there.

But I am wondering why this error happened only with 25 nodes and not with 5.


Edit: I think I know what's going on with the invalid dest error. That is a reply from your node to the client that sent the broadcast RPC (we can see the broadcast_ok). The issue is that the client has a timeout. So it sent an RPC, your handler took a long time (probably due to the nested RPCs), the client timed out and died, your handler responded, and Jepsen saw no such client, so it threw that error.

This makes sense, but I thought the clients also stay the same, because I do see messages from that client and it still being active.

I think this is also similar… I keep seeing the following errors in the 3d problem for the same code that passed 3c:

WARN [2023-02-26 22:42:58,044] n14 stdout - maelstrom.process Error!
java.lang.AssertionError: Assert failed: Invalid dest for message #maelstrom.net.message.Message{:id 2050397, :src "n14", :dest "c64", :body {:in_reply_to 5, :messages [0 3 2 4 1 7 5 6 8 10 9 11 12 13 14 19 16 17 15 18 21 24 22 20 23 26 30 28 25 32 29 27 31 33 34 35 36 37 41], :msg_id 5, :type "read_ok"}}

So… by the time I respond, c64 has died? Here is my read code; there are no locks etc., so I'm not sure why it would cause a timeout… One reason could be that my entire process has become slow due to high CPU usage.

n.Handle("read", func(msg maelstrom.Message) error {
		var body map[string]any
		if err := json.Unmarshal(msg.Body, &body); err != nil {
			return err
		}
		body["type"] = "read_ok"
		body["messages"] = state.messages
		return n.Reply(msg, body)
	})

Edit: yeah, my CPU usage is quite high, so I guess I am failing to do things fast enough and there are too many messages being passed.