Challenge 3b: Inconsistency with Jepsen

For the same code, I got the following outcomes:

  1. Validation passed
  2. Validation failed
  3. It broke Jepsen (?)

The validation failed the first time, and when I ran it again, it passed.

spoilers for 3b: gist

(ignore the mutex part, I am trying out some stuff)

The following is the buggy code:

for _, id := range state.nodeIds {
	if id == state.nodeId {
		continue
	}
	// Bug here: this is supposed to ignore the Src, not the Dest
	if id == msg.Dest {
		continue
	}
	if err := n.Send(id, msg.Body); err != nil {
		panic(err)
	}
}
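For reference, the check the comment describes (skip the node the message came from, not the destination) would be:

	if id == msg.Src {
		continue
	}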

I also thought the buggy line might not be relevant, so I removed it, and then I did not get the error at all. I ran it about 10 times and it passed every time. But if I put the if condition back, I run into issues in roughly one out of three runs. In the failure case, I saw it was trying to send a message to a client (c1), but I'm not sure how that happened:

2023-02-25 21:15:31,337{GMT}	WARN	[n1 stdout] maelstrom.process: Error!
java.lang.AssertionError: Assert failed: Invalid dest for message #maelstrom.net.message.Message{:id 1248942, :src "n1", :dest "c10", :body {:in_reply_to 4, :type "broadcast_ok"}}

One time it broke Jepsen and it said sorry:

ERROR [2023-02-25 21:16:32,103] main - jepsen.cli Oh jeez, I'm sorry, Jepsen broke. Here's why:
clojure.lang.ExceptionInfo: Client read timeout
	at slingshot.support$stack_trace.invoke(support.clj:201)
	at maelstrom.client$recv_BANG_.invokeStatic(client.clj:98)
	at maelstrom.client$recv_BANG_.invoke(client.clj:81)
	at maelstrom.client$send_PLUS_recv_BANG_.invokeStatic(client.clj:123)

Here are the Jepsen analysis files: GossipGlommer:3b - Google Drive

I am happy to provide any further details. Should I open this as a bug in Maelstrom, or is it an issue in my code?

It's also not clear to me when your code would send to a client c0. Perhaps add some logging around what is in state.nodeIds before sending?

The Jepsen failure is clearer: it's a client timeout. The issue is that you are nesting broadcasts in your broadcast handler. You generally want to avoid long-running work in RPC handlers so that they can respond quickly. Imagine c1 broadcasts to n1, which then broadcasts to n2, and so on, before n1 can respond to c1. Depending on the topology, you could even have n1 broadcast to n2, which broadcasts back to n1, and you get into a loop.
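A rough sketch of that shape, assuming the usual maelstrom Go library setup (n := maelstrom.NewNode() plus the encoding/json and log imports) and leaving out the part that records the message locally:

n.Handle("broadcast", func(msg maelstrom.Message) error {
	var body map[string]any
	if err := json.Unmarshal(msg.Body, &body); err != nil {
		return err
	}

	// Reply to the caller right away so it never times out waiting on us.
	if err := n.Reply(msg, map[string]any{"type": "broadcast_ok"}); err != nil {
		return err
	}

	// Fan out to the other nodes asynchronously, skipping ourselves and the sender.
	go func() {
		for _, id := range n.NodeIDs() {
			if id == n.ID() || id == msg.Src {
				continue
			}
			if err := n.Send(id, body); err != nil {
				log.Printf("send to %s failed: %v", id, err)
			}
		}
	}()

	return nil
})

The handler returns almost immediately, and the gossip happens in the background.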

Another issue you have is the panic on send failure. If n1 panics, it never responds to c1 (or whoever the sender is), which also causes a client timeout. You should return the error rather than panic there.
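Concretely, in your send loop that would just be:

	if err := n.Send(id, msg.Body); err != nil {
		return err
	}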

Edit: I think I know what's going on with the invalid dest error. That is a reply from your node to the client that sent the broadcast RPC (we can see the broadcast_ok). The issue is that the client has a timeout. So it sent an RPC, your handler took a long time (probably due to the nested RPCs), the client timed out and died, your handler responded, and Jepsen saw no such client, so it threw that error.

Thank you for taking the time to check this out.

It's also not clear to me when your code would send to a client c0. Perhaps add some logging around what is in state.nodeIds before sending?

I did, but I'm unable to reproduce the issue now.

The Jepsen failure is clearer: it's a client timeout. The issue is that you are nesting broadcasts in your broadcast handler. You generally want to avoid long-running work in RPC handlers so that they can respond quickly. Imagine c1 broadcasts to n1, which then broadcasts to n2, and so on, before n1 can respond to c1. Depending on the topology, you could even have n1 broadcast to n2, which broadcasts back to n1, and you get into a loop.

This makes sense. I fixed this part of the code. Any idea what the timeout is set to in Jepsen?

Another issue you have is the panic on send failure. If n1 panics, it never responds to c1 (or whoever the sender is), which also causes a client timeout. You should return the error rather than panic there.

I added a print statement, and this panic never happens. I think Send is just writing to stdout, so it doesn't fail. I can't think of a reason it would panic.

However, I do agree that a panic in the code is not good anyway.

I ran into another issue. My code passed the 3c challenge with the workload specified on the page:

 --node-count 5 --time-limit 20 --rate 10 --nemesis partition

So I moved to the next challenge where we crank up the numbers, but I saw the following error:

java.lang.AssertionError: Assert failed: Invalid dest for message #maelstrom.net.message.Message{:id 71, :src "n0", :dest "c19", :body {:in_reply_to 1, :type "init_ok"}}

I went through the Go source code to understand why this happens; basically, I was sending reply messages on init. I fixed the issue by not responding there.

But I am wondering why this error happened only with 25 nodes and not with 5.


Edit: I think I know what's going on with the invalid dest error. That is a reply from your node to the client that sent the broadcast RPC (we can see the broadcast_ok). The issue is that the client has a timeout. So it sent an RPC, your handler took a long time (probably due to the nested RPCs), the client timed out and died, your handler responded, and Jepsen saw no such client, so it threw that error.

This makes sense, but I thought the clients also stay the same, because I do see messages from that client and it still being active.

I think this is also similar… I keep seeing the following errors in the 3d problem for the same code that passed 3c:

WARN [2023-02-26 22:42:58,044] n14 stdout - maelstrom.process Error!
java.lang.AssertionError: Assert failed: Invalid dest for message #maelstrom.net.message.Message{:id 2050397, :src "n14", :dest "c64", :body {:in_reply_to 5, :messages [0 3 2 4 1 7 5 6 8 10 9 11 12 13 14 19 16 17 15 18 21 24 22 20 23 26 30 28 25 32 29 27 31 33 34 35 36 37 41], :msg_id 5, :type "read_ok"}}

So… by the time I respond, c64 has died? Here is my read code; there are no locks etc., so I'm not sure why it would cause a timeout… One reason could be that my entire process has become slow due to high CPU usage.

n.Handle("read", func(msg maelstrom.Message) error {
		var body map[string]any
		if err := json.Unmarshal(msg.Body, &body); err != nil {
			return err
		}
		body["type"] = "read_ok"
		body["messages"] = state.messages
		return n.Reply(msg, body)
	})

Edit: yeah, my CPU usage is quite high, so I guess I am failing to do things fast enough and there are too many messages being passed.