Help with Gossip Glomers: Fly.io Distributed Systems Challenges

I made it to 3b before heading over here! :slight_smile:

In my handler for broadcast I added:

    for _, node := range n.NodeIDs() {
      if node == n.ID() {
        continue
      }
      n.Send(node, body)
    }

Thinking this would send the message to all the other nodes but me… But I get:

Assert failed: Invalid dest for message
 #maelstrom.net.message.Message{:id 3755969, 
:src "n1", :dest "c11", 
:body {:in_reply_to 5, :type "broadcast_ok"}}

Congrats! :slight_smile:

Looks like that is sending back to the client (c11) instead of another node (which would start with n). And it’s also a "broadcast_ok" message rather than the original "broadcast" type. Have you tried printing out what’s in n.NodeIDs() and seeing what it contains?

Wow, hard to see any output from fmt.Println so I wrote it out to a file:

“n0,n1,n2,n3,n4”

I must be missing something because I’m only sending the messages I get in the handler for “broadcast”. And then after that I reply with “broadcast_ok” like I did in 3a.

If you can post the code to gist then I can see if there’s anything that jumps out.

I wasn’t able to reproduce the "c11" destination issue, although I noticed the list variable needs a mutex, since handlers can be called in concurrent goroutines. That could be causing a race condition when printing it to STDOUT, maybe?

Debugging with a log.Printf() will be much easier too if you fix two issues. First, you have a lot of message amplification for each message received from a client. And second, your message list doesn’t deduplicate messages so it gets quite long! :slight_smile:
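Both points (the mutex and the deduplication) can be covered by one small type. This is only a sketch under my own assumptions: messageSet, newMessageSet, Add and Values are hypothetical names, not anything from the Maelstrom library or from the code in the gist.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// messageSet is a hypothetical mutex-guarded set of broadcast values.
type messageSet struct {
	mu   sync.Mutex
	seen map[int]struct{}
}

func newMessageSet() *messageSet {
	return &messageSet{seen: make(map[int]struct{})}
}

// Add records v and reports whether it had not been seen before.
func (s *messageSet) Add(v int) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.seen[v]; ok {
		return false
	}
	s.seen[v] = struct{}{}
	return true
}

// Values returns a sorted copy, so callers never touch the map directly.
func (s *messageSet) Values() []int {
	s.mu.Lock()
	defer s.mu.Unlock()
	out := make([]int, 0, len(s.seen))
	for v := range s.seen {
		out = append(out, v)
	}
	sort.Ints(out)
	return out
}

func main() {
	s := newMessageSet()
	var wg sync.WaitGroup
	// 100 concurrent writers, but only 10 distinct values.
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func(v int) {
			defer wg.Done()
			s.Add(v % 10)
		}(i)
	}
	wg.Wait()
	fmt.Println(s.Values()) // [0 1 2 3 4 5 6 7 8 9]
}
```

The boolean from Add is the useful part: a handler can forward a value to its neighbours only when Add returns true, which also cuts the amplification.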

I’m also having some trouble with this one. No issues up until now.

With --node-count=5, it usually starts failing straight away with :net-timeout after an initial :ok :broadcast.

jepsen worker 0 - jepsen.util 0	:invoke	:broadcast	0
jepsen worker 0 - jepsen.util 0	:ok   	:broadcast	0
jepsen worker 1 - jepsen.util 1	:invoke	:broadcast	1
jepsen worker 2 - jepsen.util 2	:invoke	:read   	nil
...
jepsen worker 1 - jepsen.util 1	:info	:broadcast	1	:net-timeout
...
jepsen worker 4 - jepsen.util 4	:fail	:read   	nil	:net-timeout

Although while writing the above, a 5 node run made a lot more progress, failing only intermittently with :net-timeout for :read and :broadcast messages.

I would assume it was deadlocking somewhere, but I can’t see where any deadlocks could happen, and I have also had one run with --node-count=1 that failed with :net-timeout errors towards the end… The timeouts also occur with no mutexes.

The other thing I’ve noticed is that in the stderr output, the only message ever listed as ‘Received’ is {"type":"broadcast","message":0}: the message value is always 0, never anything else. This is the case even for runs with successful responses to lots of broadcast and read messages.

Low-key hoping that the test command is broken on M1 Macs or something and that I’m not crazy.

Code below. Any ideas?

I am also having the same issue with 3b (3a was ok) where it seems like the messages are getting mixed up in transit. Seeing almost everything @james1 mentioned as well.

I’ve created my own message type just so that I don’t mix it with the incoming message accidentally:

type broadcastMsg struct {
	Message int    `json:"message,omitempty"`
	Type    string `json:"type"`
}

and I explicitly set Type to "broadcast", yet somehow it arrives at the other node as broadcast_ok:

for _, neighbor := range neighbors {
	err := n.Send(neighbor, broadcastMsg{Type: "broadcast", Message: value})
	...
}

The log sequence on STDERR ends like this:

2023/02/23 19:05:31 Received {c2 n3 {"type":"init","node_id":"n3","node_ids":["n0","n1","n2","n3","n4"],"msg_id":1}}
2023/02/23 19:05:31 Node n3 initialized
2023/02/23 19:05:31 Sent {"src":"n3","dest":"c2","body":{"in_reply_to":1,"type":"init_ok"}}
2023/02/23 19:05:31 Received {c8 n3 {"type":"topology","topology":{"n0":["n3","n1"],"n1":["n4","n2","n0"],"n2":["n1"],"n3":["n0","n4"],"n4":["n1","n3"]},"msg_id":1}}
2023/02/23 19:05:31 Sent {"src":"n3","dest":"c8","body":{"in_reply_to":1,"type":"topology_ok"}}
2023/02/23 19:05:31 Received {n0 n3 {"type":"broadcast"}}
2023/02/23 19:05:31 Sent {"src":"n3","dest":"n0","body":{"in_reply_to":0,"type":"broadcast_ok"}}
2023/02/23 19:05:31 Sent {"src":"n3","dest":"n4","body":{"type":"broadcast"}}
2023/02/23 19:05:31 Sent {"src":"n3","dest":"n0","body":{"type":"broadcast"}}
2023/02/23 19:05:31 Received {n4 n3 {"in_reply_to":0,"type":"broadcast_ok"}}
2023/02/23 19:05:31 No handler for {"id":31,"src":"n4","dest":"n3","body":{"in_reply_to":0,"type":"broadcast_ok"}}

Note that it ends by failing to find a handler for "body":{"in_reply_to":0,"type":"broadcast_ok"}.
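A side observation on the log above: the forwarded broadcast bodies carry no message field at all. One likely cause is the omitempty tag on an int field, because encoding/json drops the field whenever the value is 0. The sketch below shows the effect; the withOmitEmpty and withoutOmitEmpty type names and the encode helper are made up for the comparison.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// withOmitEmpty mirrors the struct tags used in the post above.
type withOmitEmpty struct {
	Message int    `json:"message,omitempty"`
	Type    string `json:"type"`
}

// withoutOmitEmpty always encodes the message field, even when it is 0.
type withoutOmitEmpty struct {
	Message int    `json:"message"`
	Type    string `json:"type"`
}

// encode is a small helper that returns the JSON text for v.
func encode(v any) string {
	b, err := json.Marshal(v)
	if err != nil {
		return "error: " + err.Error()
	}
	return string(b)
}

func main() {
	fmt.Println(encode(withOmitEmpty{Message: 0, Type: "broadcast"}))    // {"type":"broadcast"}
	fmt.Println(encode(withOmitEmpty{Message: 7, Type: "broadcast"}))    // {"message":7,"type":"broadcast"}
	fmt.Println(encode(withoutOmitEmpty{Message: 0, Type: "broadcast"})) // {"message":0,"type":"broadcast"}
}
```

Since 0 is a valid broadcast value in this challenge, dropping omitempty from that field (or using a pointer) keeps the value on the wire.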

I’m inclined to think something’s wrong with the underlying maelstrom wrapper library or the round trip logic.

I managed to find a way around 3b. I wasn’t checking for already-broadcast messages initially, and that seems to have created an infinite loop.
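To illustrate the loop (this is a standalone simulation, not my solution), here is a sketch that floods one value through the five-node topology from the log earlier in the thread. The flood function and its limit parameter are hypothetical; the limit exists only so the run without deduplication can be stopped.

```go
package main

import "fmt"

// topology is copied from the "topology" message in the log earlier in the thread.
var topology = map[string][]string{
	"n0": {"n3", "n1"},
	"n1": {"n4", "n2", "n0"},
	"n2": {"n1"},
	"n3": {"n0", "n4"},
	"n4": {"n1", "n3"},
}

// flood delivers one value to start and lets every node forward it to all of
// its neighbours. With dedupe, a node forwards only the first time it sees the
// value. It returns the number of deliveries and whether the limit was hit.
func flood(topo map[string][]string, start string, dedupe bool, limit int) (deliveries int, hitLimit bool) {
	seen := map[string]bool{}
	queue := []string{start}
	for len(queue) > 0 {
		if deliveries == limit {
			return deliveries, true
		}
		node := queue[0]
		queue = queue[1:]
		deliveries++
		if dedupe && seen[node] {
			continue // already handled this value: do not forward again
		}
		seen[node] = true
		queue = append(queue, topo[node]...)
	}
	return deliveries, false
}

func main() {
	d, hit := flood(topology, "n0", true, 1000)
	fmt.Println("with dedupe:   ", d, hit)

	d, hit = flood(topology, "n0", false, 1000)
	fmt.Println("without dedupe:", d, hit)
}
```

With the check, every node forwards exactly once and the queue drains. Without it, neighbours keep bouncing the same value back and forth, so only the limit stops the run.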

The broadcast doc was a big help

Code below

@ahmet @james1 There are two methods in the Go library for sending messages. One is Send(), which just sends the message body and doesn’t expect a response. The other is RPC(), which adds a message ID to the outgoing message and lets you pass a handler for the response.

There are two ways of getting around this issue:

  1. Use Send() and register a handler on the node for "broadcast_ok".
  2. Use RPC() and pass the response handler for that specific call.

@james1 I switched the Send() call to RPC() and it works:

n.RPC(node, msg_body, func(_ maelstrom.Message) error { return nil })

I’ll clarify those two methods on the site. Thanks for the feedback!


oh, but the wording on the challenge page says:

The value is always an integer and it is unique for each message from Maelstrom.

So are they not unique? I am not understanding the above sentence then.

Edit: I think I understand the duplication issue. But now I am wondering why I didn’t run into the infinite loop problem, where I keep passing the same message around.

Hey, I got 3b working. So I can confirm that I did not run into any issues.

Whoa… could you elaborate on how you ran into the infinite loop? I am not checking for already-broadcast messages either, but I did not run into that issue.

Edit: I now realise how one runs into this issue. My code is vulnerable to it, but I am not actually hitting it.

Replacing my Send with RPC worked, but I still don’t understand why I had to do that.

This was my code with Send:


if newAdded := messages.Add(body.Message); newAdded {
    for _, id := range neighbours {
        // I don't understand why n.Send() doesn't work.
        n.Send(id, map[string]any{
            "type":    "broadcast",
            "message": body.Message,
        })
    }
}

if body.MsgID == nil {
    return nil
}

return n.Reply(msg, map[string]any{"type": "broadcast_ok"})

Replacing the Send with the following fixes the issue:

n.RPC(id, map[string]any{
    "type":    "broadcast",
    "message": body.Message,
}, func(msg maelstrom.Message) error {
    return nil
})

I don’t understand the difference in behaviour. Moreover, how is the following case possible, where a node sends another node a broadcast_ok, when I’ve explicitly put a return statement before the n.Reply()?

{n4 n3 {"in_reply_to":0,"type":"broadcast_ok"}}

Appreciate help and pointers. Thanks.

The issue with using Send() is that the node that you’re sending to returns a "broadcast_ok" message and the sending node has no handler for that message type. Maelstrom really just deals with individual messages, and the “RPC” is just a way for the client code to easily associate a request and a response message with each other.

It looks like your example message isn’t grabbing the message ID from msg when you call Reply(). Maybe try logging the msg value (e.g. log.Printf("%#v", msg)) to see what it contains.
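To make the routing concrete, here is a minimal, self-contained model of it. It is a sketch and not the library's actual implementation: the node, message, send, rpc, reply and dispatch names are all hypothetical. It also assumes that a body whose in_reply_to is 0 is routed by type like any other message, which is consistent with the "No handler for" line in the log above.

```go
package main

import "fmt"

// message keeps only the fields that matter for routing.
type message struct {
	Src, Dest string
	Type      string
	MsgID     int // 0 models "no msg_id", as with a fire-and-forget send
	InReplyTo int // 0 models "not a reply"
}

// node is a hypothetical stand-in for a Maelstrom node.
type node struct {
	id        string
	nextMsgID int
	handlers  map[string]func(message) error // keyed by message type
	callbacks map[int]func(message) error    // keyed by msg_id of pending RPCs
}

func newNode(id string) *node {
	return &node{
		id:        id,
		handlers:  map[string]func(message) error{},
		callbacks: map[int]func(message) error{},
	}
}

// send models a fire-and-forget send: no msg_id is attached.
func (n *node) send(dest, typ string) message {
	return message{Src: n.id, Dest: dest, Type: typ}
}

// rpc attaches a fresh msg_id and remembers the callback for the reply.
func (n *node) rpc(dest, typ string, cb func(message) error) message {
	n.nextMsgID++
	n.callbacks[n.nextMsgID] = cb
	return message{Src: n.id, Dest: dest, Type: typ, MsgID: n.nextMsgID}
}

// reply copies the request's msg_id into in_reply_to.
func reply(req message, typ string) message {
	return message{Src: req.Dest, Dest: req.Src, Type: typ, InReplyTo: req.MsgID}
}

// dispatch routes replies to their pending callback and everything else to
// the handler registered for the message type.
func (n *node) dispatch(m message) error {
	if m.InReplyTo != 0 {
		if cb, ok := n.callbacks[m.InReplyTo]; ok {
			delete(n.callbacks, m.InReplyTo)
			return cb(m)
		}
		return nil // reply without a pending callback: ignored in this model
	}
	h, ok := n.handlers[m.Type]
	if !ok {
		return fmt.Errorf("no handler for %q", m.Type)
	}
	return h(m)
}

func main() {
	n3 := newNode("n3")

	// Sent without a msg_id: the reply has in_reply_to 0 and is routed by type.
	req := n3.send("n4", "broadcast")
	fmt.Println(n3.dispatch(reply(req, "broadcast_ok"))) // no handler for "broadcast_ok"

	// Sent as an RPC: the reply matches the pending callback.
	req = n3.rpc("n4", "broadcast", func(message) error { return nil })
	fmt.Println(n3.dispatch(reply(req, "broadcast_ok"))) // <nil>
}
```

In this model, registering a "broadcast_ok" handler on the node would also make the first case succeed, which corresponds to option 1 from the earlier reply.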

This clears things up, thanks!