Help with Gossip Glomers: Fly.io Distributed Systems Challenges

I made it to 3b before heading over here! :slight_smile:

In my handler for broadcast I added:

    for _, node := range n.NodeIDs() {
      if node == n.ID() {
        continue
      }
      n.Send(node, body)
    }

Thinking this would send the message to all the other nodes but me… But I get:

Assert failed: Invalid dest for message
 #maelstrom.net.message.Message{:id 3755969, 
:src "n1", :dest "c11", 
:body {:in_reply_to 5, :type "broadcast_ok"}}

Congrats! :slight_smile:

Looks like that is sending back to the client (c11) instead of another node (which would start with n). And it's also a "broadcast_ok" message rather than the original "broadcast" type. Have you tried printing out what's in n.NodeIDs() and seeing what it contains?
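For example, something like the line below. Go's log package writes to STDERR by default, and Maelstrom keeps STDERR separate from the STDOUT protocol stream, so the output shows up in the node logs instead of getting mixed into the protocol messages:

log.Printf("node IDs: %v", n.NodeIDs())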

Wow, hard to see any output from fmt.Println so I wrote it out to a file:

"n0,n1,n2,n3,n4"

I must be missing something because I'm only sending the messages I get in the handler for "broadcast". And then after that I reply with "broadcast_ok" like I did in 3a.

If you can post the code to a gist then I can see if there's anything that jumps out.

I wasn't able to reproduce the "c11" destination issue, although I noticed the list variable needs a mutex since handlers can be called in concurrent goroutines. That could be causing a race condition when printing it to STDOUT, maybe?

Debugging with a log.Printf() will be much easier too if you fix two issues. First, you have a lot of message amplification for each message received from a client. And second, your message list doesn't deduplicate messages, so it gets quite long! :slight_smile:
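As a rough sketch of what I mean (the mu/seen/addIfNew names are just placeholders, not from your code, and it assumes "sync" is imported): keep a mutex-guarded set of values and only re-broadcast the ones that weren't already in it.

var (
    mu   sync.Mutex
    seen = map[int]struct{}{}
)

// addIfNew records a value and reports whether it was new. If the broadcast
// handler only forwards values for which this returns true, the stored list
// stays deduplicated and nodes stop re-broadcasting the same value forever.
func addIfNew(v int) bool {
    mu.Lock()
    defer mu.Unlock()
    if _, ok := seen[v]; ok {
        return false
    }
    seen[v] = struct{}{}
    return true
}

The read handler can take the same lock when it copies the values out.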

I'm also having some trouble with this one. No issues up until now.

With --node-count=5, it usually starts failing straight away with :net-timeout after an initial :ok :broadcast.

jepsen worker 0 - jepsen.util 0	:invoke	:broadcast	0
jepsen worker 0 - jepsen.util 0	:ok   	:broadcast	0
jepsen worker 1 - jepsen.util 1	:invoke	:broadcast	1
jepsen worker 2 - jepsen.util 2	:invoke	:read   	nil
...
jepsen worker 1 - jepsen.util 1	:info	:broadcast	1	:net-timeout
...
jepsen worker 4 - jepsen.util 4	:fail	:read   	nil	:net-timeout

Although while writing the above, a 5 node run made a lot more progress, failing only intermittently with :net-timeout for :read and :broadcast messages.

I would assume it was deadlocking somewhere, but I can't see where any deadlocks could happen, and I have also had one instance of a run with --node-count=1 which failed with :net-timeouts towards the end… The timeouts also occur with no mutexes.

The other thing I've noticed is that in the stderr output, the only message that is ever listed as 'Received' is {"type":"broadcast","message":0}, where the message value is always 0, never any other values. This is the case even for runs with successful responses to lots of broadcast and read messages.

Low-key hoping that the test command is broken on M1 Macs or something and that I'm not crazy.

Code below. Any ideas?

I am also having the same issue with 3b (3a was ok) where it seems like the messages are getting mixed up in transit. Seeing almost everything @james1 mentioned as well.

I've created my own message type just so that I don't accidentally mix it up with the incoming message:

type broadcastMsg struct {
	Message int    `json:"message,omitempty"`
	Type    string `json:"type"`
}

and I explicitly set msg.type="broadcast", yet somehow it arrives at the other node as broadcast_ok:

for _, neighbor := range neighbors {
    err := n.Send(neighbor, broadcastMsg{Type: "broadcast", Message: value})
    ...
}

At the end, the log sequence to STDERR ends like this:

2023/02/23 19:05:31 Received {c2 n3 {"type":"init","node_id":"n3","node_ids":["n0","n1","n2","n3","n4"],"msg_id":1}}
2023/02/23 19:05:31 Node n3 initialized
2023/02/23 19:05:31 Sent {"src":"n3","dest":"c2","body":{"in_reply_to":1,"type":"init_ok"}}
2023/02/23 19:05:31 Received {c8 n3 {"type":"topology","topology":{"n0":["n3","n1"],"n1":["n4","n2","n0"],"n2":["n1"],"n3":["n0","n4"],"n4":["n1","n3"]},"msg_id":1}}
2023/02/23 19:05:31 Sent {"src":"n3","dest":"c8","body":{"in_reply_to":1,"type":"topology_ok"}}
2023/02/23 19:05:31 Received {n0 n3 {"type":"broadcast"}}
2023/02/23 19:05:31 Sent {"src":"n3","dest":"n0","body":{"in_reply_to":0,"type":"broadcast_ok"}}
2023/02/23 19:05:31 Sent {"src":"n3","dest":"n4","body":{"type":"broadcast"}}
2023/02/23 19:05:31 Sent {"src":"n3","dest":"n0","body":{"type":"broadcast"}}
2023/02/23 19:05:31 Received {n4 n3 {"in_reply_to":0,"type":"broadcast_ok"}}
2023/02/23 19:05:31 No handler for {"id":31,"src":"n4","dest":"n3","body":{"in_reply_to":0,"type":"broadcast_ok"}}

Note that it ends with not finding a handler for "body":{"in_reply_to":0,"type":"broadcast_ok"}.

I'm inclined to think something's wrong with the underlying maelstrom wrapper library or the round-trip logic.

I managed to find a way around 3b. I wasn't checking for already broadcasted messages initially, and it seemed to have created an infinite loop.

The broadcast doc was a big help

Code below

@ahmet @james1 There are two methods in the Go library for sending messages. One is Send(), which just sends the message body and doesn't expect a response. The other is RPC(), which adds a message ID to the outgoing message and lets you add a handler for the response.

There are two ways of getting around this issue:

  1. Use Send() and register a handler on the node for "broadcast_ok" (see the sketch below).
  2. Use RPC() and pass the response handler for that specific call.
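For option 1, a minimal sketch would be registering a no-op handler when you set up the node, so the "broadcast_ok" replies coming back from Send() have somewhere to go:

n.Handle("broadcast_ok", func(msg maelstrom.Message) error {
    // Nothing to do with the ack; this just stops the "No handler for ..." errors.
    return nil
})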

@james1 I switched the Send() call to RPC() and it works:

n.RPC(node, msg_body, func(_ maelstrom.Message) error { return nil })

I'll clarify those two methods on the site. Thanks for the feedback!


Oh, but the wording on the challenge page says:

The value is always an integer and it is unique for each message from Maelstrom.

So are they not unique? I'm not understanding the above sentence, then.

Edit: I think I understand the duplication issue. But now I am wondering why I didn't run into the infinite-loop problem where I keep passing the same message around.

Hey, I got 3b working, so I can confirm that I did not run into any issues.

Whoa… could you elaborate on how you ran into an infinite loop? I'm not checking for already broadcasted messages either, but I didn't run into such an issue.

Edit: I now realise how one runs into this issue. My code is vulnerable to it, but I am not running into this issue at all.

While just replacing my Send with RPC worked, I still don't understand why I had to do that.

This was my code with Send:


if newAdded := messages.Add(body.Message); newAdded {
    for _, id := range neighbours {
        // I don't understand why n.Send() doesn't work.
        n.Send(id, map[string]any{
            "type":    "broadcast",
            "message": body.Message,
        })
    }
}

if body.MsgID == nil {
    return nil
}

return n.Reply(msg, map[string]any{"type": "broadcast_ok"})

Replacing the Send with the following fixes the issue:

n.RPC(id, map[string]any{
    "type":    "broadcast",
    "message": body.Message,
}, func(msg maelstrom.Message) error {
    return nil
})

I don't understand the difference in behaviour. Moreover, how is the following case possible, where a node sends another node a broadcast_ok, when I've explicitly set a return statement before the n.Reply()?

{n4 n3 {"in_reply_to":0,"type":"broadcast_ok"}}

Appreciate help and pointers. Thanks.

The issue with using Send() is that the node you're sending to returns a "broadcast_ok" message and the sending node has no handler for that message type. Maelstrom really just deals with individual messages, and the "RPC" is just a way for the client code to easily associate a request and response message together.

It looks like your example message isn't grabbing the message ID from msg when you call Reply(). Maybe try logging the msg value (e.g. log.Printf("%#v", msg)) to see what it contains.

This clears things up, thanks!