Challenge #5c: Efficient Kafka-Style Log

Want to get an idea of how much more I can optimize my 5b solution.

Could anyone else share their results.edn to get an idea of what’s possible?

Sharing mine:

results.edn
{:perf {:latency-graph {:valid? true},
        :rate-graph {:valid? true},
        :valid? true},
 :timeline {:valid? true},
 :exceptions {:valid? true},
 :stats {:valid? true,
         :count 16461,
         :ok-count 16455,
         :fail-count 0,
         :info-count 6,
         :by-f {:assign {:valid? true,
                         :count 2097,
                         :ok-count 2097,
                         :fail-count 0,
                         :info-count 0},
                :crash {:valid? false,
                        :count 6,
                        :ok-count 0,
                        :fail-count 0,
                        :info-count 6},
                :poll {:valid? true,
                       :count 7549,
                       :ok-count 7549,
                       :fail-count 0,
                       :info-count 0},
                :send {:valid? true,
                       :count 6809,
                       :ok-count 6809,
                       :fail-count 0,
                       :info-count 0}}},
 :availability {:valid? true, :ok-fraction 0.9996355},
 :net {:all {:send-count 125706,
             :recv-count 125706,
             :msg-count 125706,
             :msgs-per-op 7.6365957},
       :clients {:send-count 40764,
                 :recv-count 40764,
                 :msg-count 40764},
       :servers {:send-count 84942,
                 :recv-count 84942,
                 :msg-count 84942,
                 :msgs-per-op 5.160197},
       :valid? true},
 :workload {:valid? true,
            :worst-realtime-lag {:time 0.047601458,
                                 :process 4,
                                 :key "8",
                                 :lag 0.0},
            :bad-error-types (),
            :error-types (),
            :info-txn-causes ()},
 :valid? true}

In terms of absolute values, your msgs-per-op looks good. But rather than just looking at that number, it's worth thinking about it like other algorithmic-complexity questions, i.e. how does your performance grow with respect to the number of nodes? The message rate? For example, if there are 20 nodes, or the message rate is 10× higher, how does your msgs-per-op change?
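
To make that concrete, here is a back-of-the-envelope illustration (hypothetical numbers, not taken from any run in this thread):

replicate every send to all peers:           server msgs per send ≈ n − 1  (4 at n = 5, 19 at n = 20)
forward each send to a single per-key owner: server msgs per send stays roughly constant as n grows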

I also notice the challenges don't really care about message sizes, but if you are syncing the whole state all the time you can see why that would be bad (not that I know what you are doing).

Thanks for the results to compare. This is my result file for #5c.

results.edn
{:perf {:latency-graph {:valid? true},
        :rate-graph {:valid? true},
        :valid? true},
 :timeline {:valid? true},
 :exceptions {:valid? true},
 :stats {:valid? true,
         :count 16810,
         :ok-count 16799,
         :fail-count 0,
         :info-count 11,
         :by-f {:assign {:valid? true,
                         :count 2110,
                         :ok-count 2110,
                         :fail-count 0,
                         :info-count 0},
                :crash {:valid? false,
                        :count 7,
                        :ok-count 0,
                        :fail-count 0,
                        :info-count 7},
                :poll {:valid? true,
                       :count 7648,
                       :ok-count 7644,
                       :fail-count 0,
                       :info-count 4},
                :send {:valid? true,
                       :count 7045,
                       :ok-count 7045,
                       :fail-count 0,
                       :info-count 0}}},
 :availability {:valid? true, :ok-fraction 0.9993456},
 :net {:all {:send-count 74488,
             :recv-count 74488,
             :msg-count 74488,
             :msgs-per-op 4.431172},
       :clients {:send-count 41166,
                 :recv-count 41166,
                 :msg-count 41166},
       :servers {:send-count 33322,
                 :recv-count 33322,
                 :msg-count 33322,
                 :msgs-per-op 1.9822725},
       :valid? true},
 :workload {:valid? true,
            :worst-realtime-lag {:time 0.02322775,
                                 :process 4,
                                 :key "8",
                                 :lag 0.0},
            :bad-error-types (),
            :error-types (),
            :info-txn-causes ([:crash "context deadline exceeded"])},
 :valid? true}

With imaginary leader leases and lazy topic-index pushes I got to around ~3.1 msgs-per-op on average. But all of that is only possible because the test runs on a fairly reliable network; proper fallback and recovery for those algorithms would be tricky. Thankfully, we are not asked to balance latency against network failures. They do mention availability, but thankfully that's as far as it goes.
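
For anyone curious what the "only one node talks to storage for a given key" part of that idea can look like, here is a minimal Go sketch against the official maelstrom library. The keyOwner hashing, the in-memory appendLocally, and the forwarded message fields are my own assumptions for illustration, not the actual leases described above:

owner_forwarding.go
package main

import (
	"context"
	"encoding/json"
	"hash/fnv"
	"sync"
	"time"

	maelstrom "github.com/jepsen-io/maelstrom/demo/go"
)

var (
	mu   sync.Mutex
	logs = map[string][]any{} // stand-in for the owner's real storage path
)

// appendLocally is a placeholder for whatever the key's owner really does
// (local log plus occasional storage syncs, etc.); here it is in-memory only.
func appendLocally(key string, m any) int {
	mu.Lock()
	defer mu.Unlock()
	logs[key] = append(logs[key], m)
	return len(logs[key]) - 1
}

func main() {
	n := maelstrom.NewNode()

	// keyOwner deterministically maps every log key to one node, so only that
	// node ever writes the key -- a crude stand-in for a leader lease.
	keyOwner := func(key string) string {
		h := fnv.New32a()
		h.Write([]byte(key))
		ids := n.NodeIDs()
		return ids[int(h.Sum32()%uint32(len(ids)))]
	}

	n.Handle("send", func(msg maelstrom.Message) error {
		var body map[string]any
		if err := json.Unmarshal(msg.Body, &body); err != nil {
			return err
		}
		key := body["key"].(string)

		if owner := keyOwner(key); owner != n.ID() {
			// Not the owner: forward the request and relay the owner's reply.
			ctx, cancel := context.WithTimeout(context.Background(), time.Second)
			defer cancel()
			resp, err := n.SyncRPC(ctx, owner, body)
			if err != nil {
				return err
			}
			var respBody map[string]any
			if err := json.Unmarshal(resp.Body, &respBody); err != nil {
				return err
			}
			// Drop routing fields so Reply can set its own.
			delete(respBody, "msg_id")
			delete(respBody, "in_reply_to")
			return n.Reply(msg, respBody)
		}

		// Owner path: append and answer directly.
		offset := appendLocally(key, body["msg"])
		return n.Reply(msg, map[string]any{"type": "send_ok", "offset": offset})
	})

	if err := n.Run(); err != nil {
		panic(err)
	}
}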

It could be better if the storage supported batching. Then again, you could drop the lin-* services entirely and implement the storage yourself, but that ends up as a tradeoff between reliability and latency. For example, if 2 replicas are enough you can improve over 3 in the stable case, and if you can afford to lose the last 10 ms of messages you can improve even more.
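
A sketch of the "skip lin-* and accept losing the last ~10 ms" end of that tradeoff, assuming sends for a key have already been routed to that key's owner (as in the previous sketch): the owner acks immediately and ships batches to a single backup asynchronously. The replicate message type, backup choice, and flush interval are all made up for illustration, not anyone's actual solution:

async_replication.go
package main

import (
	"encoding/json"
	"sync"
	"time"

	maelstrom "github.com/jepsen-io/maelstrom/demo/go"
)

func main() {
	n := maelstrom.NewNode()

	var mu sync.Mutex
	logs := map[string][]any{}  // owner's authoritative copy
	dirty := map[string][]any{} // entries not yet shipped to the backup

	n.Handle("send", func(msg maelstrom.Message) error {
		var body map[string]any
		if err := json.Unmarshal(msg.Body, &body); err != nil {
			return err
		}
		key := body["key"].(string)

		mu.Lock()
		logs[key] = append(logs[key], body["msg"])
		dirty[key] = append(dirty[key], body["msg"])
		offset := len(logs[key]) - 1
		mu.Unlock()

		// Ack before replication: low latency, but a crash can lose the
		// entries buffered since the last flush.
		return n.Reply(msg, map[string]any{"type": "send_ok", "offset": offset})
	})

	n.Handle("replicate", func(msg maelstrom.Message) error {
		// The backup would merge the shipped entries here (elided).
		return nil
	})

	// Ship dirty entries to one backup node every 10 ms, fire-and-forget.
	go func() {
		for range time.Tick(10 * time.Millisecond) {
			mu.Lock()
			batch := dirty
			dirty = map[string][]any{}
			mu.Unlock()

			ids := n.NodeIDs()
			if len(batch) == 0 || len(ids) < 2 {
				continue
			}
			backup := ids[0]
			if backup == n.ID() {
				backup = ids[1] // two copies total: the "2 replicas" case above
			}
			_ = n.Send(backup, map[string]any{"type": "replicate", "entries": batch})
		}
	}()

	if err := n.Run(); err != nil {
		panic(err)
	}
}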

I really appreciate they don’t ask in the end to rework everything to manage membership changes. :lollipop:

Here are mine

results.edn
{:perf {:latency-graph {:valid? true},
        :rate-graph {:valid? true},
        :valid? true},
 :timeline {:valid? true},
 :exceptions {:valid? true},
 :stats {:valid? true,
         :count 16632,
         :ok-count 16625,
         :fail-count 0,
         :info-count 7,
         :by-f {:assign {:valid? true,
                         :count 2195,
                         :ok-count 2195,
                         :fail-count 0,
                         :info-count 0},
                :crash {:valid? false,
                        :count 7,
                        :ok-count 0,
                        :fail-count 0,
                        :info-count 7},
                :poll {:valid? true,
                       :count 7506,
                       :ok-count 7506,
                       :fail-count 0,
                       :info-count 0},
                :send {:valid? true,
                       :count 6924,
                       :ok-count 6924,
                       :fail-count 0,
                       :info-count 0}}},
 :availability {:valid? true, :ok-fraction 0.99957913},
 :net {:all {:send-count 119868,
             :recv-count 119868,
             :msg-count 119868,
             :msgs-per-op 7.207071},
       :clients {:send-count 41020,
                 :recv-count 41020,
                 :msg-count 41020},
       :servers {:send-count 78848,
                 :recv-count 78848,
                 :msg-count 78848,
                 :msgs-per-op 4.740741},
       :valid? true},
 :workload {:valid? true,
            :worst-realtime-lag {:time 0.040306292,
                                 :process 4,
                                 :key "8",
                                 :lag 0.0},
            :bad-error-types (),
            :error-types (),
            :info-txn-causes ()},
 :valid? true}

Thanks for sharing yours

As I mentioned earlier, this one is not the limit; it can be improved even further:

:availability {:valid? true, :ok-fraction 0.9996053},
 :net {:all {:send-count 101030, :recv-count 101030, :msg-count 101030,
             :msgs-per-op 5.6966453},
       :clients {:send-count 44144, :recv-count 44144, :msg-count 44144},
       :servers {:send-count 56886, :recv-count 56886, :msg-count 56886,
                 :msgs-per-op 3.2075558},
       :valid? true},
 :workload {:valid? true,
            :worst-realtime-lag {:time 0.019361178, :process 0, :key "9", :lag 0.0},
            :bad-error-types (), :error-types (), :info-txn-causes ()},
 :valid? true}

Once I converted my 5a solution to use LinKV and got it working, it worked for both 5b and 5c. It seems to do decently, but as others have called out, none of the 5* challenges really exercise behavior in the presence of a slow or partitioned network.

{:perf {:latency-graph {:valid? true},
        :rate-graph {:valid? true},
        :valid? true},
 :timeline {:valid? true},
 :exceptions {:valid? true},
 :stats {:valid? true,
         :count 16497,
         :ok-count 9869,
         :fail-count 6621,
         :info-count 7,
         :by-f {:assign {:valid? true,
                         :count 2030,
                         :ok-count 1913,
                         :fail-count 117,
                         :info-count 0},
                :crash {:valid? false,
                        :count 7,
                        :ok-count 0,
                        :fail-count 0,
                        :info-count 7},
                :poll {:valid? true,
                       :count 7517,
                       :ok-count 7517,
                       :fail-count 0,
                       :info-count 0},
                :send {:valid? true,
                       :count 6943,
                       :ok-count 439,
                       :fail-count 6504,
                       :info-count 0}}},
 :availability {:valid? true, :ok-fraction 0.59823},
 :net {:all {:send-count 89062,
             :recv-count 89062,
             :msg-count 89062,
             :msgs-per-op 5.398679},
       :clients {:send-count 35462,
                 :recv-count 35462,
                 :msg-count 35462},
       :servers {:send-count 53600,
                 :recv-count 53600,
                 :msg-count 53600,
                 :msgs-per-op 3.2490757},
       :valid? true},
 :workload {:valid? true,
            :worst-realtime-lag {:time 0.058028273,
                                 :process 5,
                                 :key "9",
                                 :lag 0.0},
            :bad-error-types (),
            :error-types (),
            :info-txn-causes ()},
 :valid? true}
{:perf {:latency-graph {:valid? true},
        :rate-graph {:valid? true},
        :valid? true},
 :timeline {:valid? true},
 :exceptions {:valid? true},
 :stats {:valid? true,
         :count 1778,
         :ok-count 1769,
         :fail-count 0,
         :info-count 9,
         :by-f {:assign {:valid? true,
                         :count 203,
                         :ok-count 203,
                         :fail-count 0,
                         :info-count 0},
                :crash {:valid? false,
                        :count 7,
                        :ok-count 0,
                        :fail-count 0,
                        :info-count 7},
                :poll {:valid? true,
                       :count 801,
                       :ok-count 801,
                       :fail-count 0,
                       :info-count 0},
                :send {:valid? true,
                       :count 767,
                       :ok-count 765,
                       :fail-count 0,
                       :info-count 2}}},
 :availability {:valid? true, :ok-fraction 0.99493814},
 :net {:all {:send-count 12882,
             :recv-count 12882,
             :msg-count 12882,
             :msgs-per-op 7.245219},
       :clients {:send-count 4434, :recv-count 4434, :msg-count 4434},
       :servers {:send-count 8448,
                 :recv-count 8448,
                 :msg-count 8448,
                 :msgs-per-op 4.751406},
       :valid? true},
 :workload {:valid? true,
            :worst-realtime-lag {:time 0.489093934,
                                 :process 3,
                                 :key "9",
                                 :lag 0.0},
            :bad-error-types (),
            :error-types (),
            :info-txn-causes ([:unknown nil])},
 :valid? true}

These are my results.

I am sure you saw this, but let me share this one line with everyone else.

It’s important to consider which components require linearizability versus sequential consistency.

The problem specifically asks you not to put everything into the slow linearizable storage, and to offload whatever can be offloaded to the sequentially consistent one.

This is driven by the goal of optimizing the latency CDF against specific targets: you can't hit the target median under a network partition.
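
As a concrete (and hedged) illustration of that split in Go with the maelstrom library: committed offsets only need sequential consistency, so they can live in seq-kv, while the per-key logs stay in lin-kv. The key prefixes and the plain-write commit are my own simplifications, not a definitive design:

offsets_split.go
package main

import (
	"context"
	"encoding/json"
	"time"

	maelstrom "github.com/jepsen-io/maelstrom/demo/go"
)

func main() {
	n := maelstrom.NewNode()
	linkv := maelstrom.NewLinKV(n) // per-key message logs (linearizable appends)
	seqkv := maelstrom.NewSeqKV(n) // committed offsets (sequential consistency is enough)

	n.Handle("commit_offsets", func(msg maelstrom.Message) error {
		var body struct {
			Offsets map[string]int `json:"offsets"`
		}
		if err := json.Unmarshal(msg.Body, &body); err != nil {
			return err
		}
		ctx, cancel := context.WithTimeout(context.Background(), time.Second)
		defer cancel()
		for key, off := range body.Offsets {
			// Plain write for brevity; a stricter version would read-and-max
			// (or CAS) to keep committed offsets monotonic.
			if err := seqkv.Write(ctx, "committed_"+key, off); err != nil {
				return err
			}
		}
		return n.Reply(msg, map[string]any{"type": "commit_offsets_ok"})
	})

	n.Handle("list_committed_offsets", func(msg maelstrom.Message) error {
		var body struct {
			Keys []string `json:"keys"`
		}
		if err := json.Unmarshal(msg.Body, &body); err != nil {
			return err
		}
		ctx, cancel := context.WithTimeout(context.Background(), time.Second)
		defer cancel()
		offsets := map[string]int{}
		for _, key := range body.Keys {
			if off, err := seqkv.ReadInt(ctx, "committed_"+key); err == nil {
				offsets[key] = off
			}
		}
		return n.Reply(msg, map[string]any{
			"type":    "list_committed_offsets_ok",
			"offsets": offsets,
		})
	})

	_ = linkv // send/poll handlers against lin-kv are elided for brevity

	if err := n.Run(); err != nil {
		panic(err)
	}
}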

Here are my results

{:perf {:latency-graph {:valid? true},
        :rate-graph {:valid? true},
        :valid? true},
 :timeline {:valid? true},
 :exceptions {:valid? true},
 :stats {:valid? true,
         :count 16601,
         :ok-count 16595,
         :fail-count 0,
         :info-count 6,
         :by-f {:assign {:valid? true,
                         :count 2066,
                         :ok-count 2066,
                         :fail-count 0,
                         :info-count 0},
                :crash {:valid? false,
                        :count 6,
                        :ok-count 0,
                        :fail-count 0,
                        :info-count 6},
                :poll {:valid? true,
                       :count 7569,
                       :ok-count 7569,
                       :fail-count 0,
                       :info-count 0},
                :send {:valid? true,
                       :count 6960,
                       :ok-count 6960,
                       :fail-count 0,
                       :info-count 0}}},
 :availability {:valid? true, :ok-fraction 0.99963856},
 :net {:all {:send-count 41512,
             :recv-count 41512,
             :msg-count 41512,
             :msgs-per-op 2.5005722},
       :clients {:send-count 34272,
                 :recv-count 34272,
                 :msg-count 34272},
       :servers {:send-count 7240,
                 :recv-count 7240,
                 :msg-count 7240,
                 :msgs-per-op 0.4361183},
       :valid? true},
 :workload {:valid? true,
            :worst-realtime-lag {:time 13.996002561,
                                 :process 7,
                                 :key "289",
                                 :lag 0.999179058},
            :bad-error-types (),
            :error-types (),
            :info-txn-causes ()},
 :valid? true}

Here are the results from my solution in Rust using @johnkoepi's library rust-maelstrom-node. This is right after passing 5b, without optimising any further. I'm logging how many write retries happen because another node wrote first to the same key, and I found ~200 of those out of a total of ~6200. I wonder if other people are seeing similar rates?

The code for my solution is here: eclbg/gossip_glomers
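
Not the poster's Rust code, but for comparison, here is roughly what that retry pattern looks like in Go with the official maelstrom library: append to a per-key log in lin-kv via read + CompareAndSwap, and count how often the CAS loses the race to another node. The key naming and the counter are assumptions for illustration:

cas_retries.go
package main

import (
	"context"
	"encoding/json"
	"log"
	"sync/atomic"
	"time"

	maelstrom "github.com/jepsen-io/maelstrom/demo/go"
)

var casRetries atomic.Int64 // appends that lost a race and had to retry

// appendWithCAS retries read + CAS until our message is appended,
// returning the offset it landed at.
func appendWithCAS(ctx context.Context, kv *maelstrom.KV, key string, m any) (int, error) {
	for {
		if err := ctx.Err(); err != nil {
			return 0, err
		}
		cur, err := kv.Read(ctx, "log_"+key)
		var entries []any
		if err == nil {
			entries, _ = cur.([]any)
		} // a read error is treated as "key not created yet"

		next := append(append([]any{}, entries...), m)
		if err := kv.CompareAndSwap(ctx, "log_"+key, entries, next, true); err == nil {
			return len(next) - 1, nil
		}
		// Usually a precondition failure: another node appended first.
		casRetries.Add(1)
	}
}

func main() {
	n := maelstrom.NewNode()
	kv := maelstrom.NewLinKV(n)

	n.Handle("send", func(msg maelstrom.Message) error {
		var body map[string]any
		if err := json.Unmarshal(msg.Body, &body); err != nil {
			return err
		}
		ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
		defer cancel()
		offset, err := appendWithCAS(ctx, kv, body["key"].(string), body["msg"])
		if err != nil {
			return err
		}
		log.Printf("cas retries so far: %d", casRetries.Load()) // goes to stderr
		return n.Reply(msg, map[string]any{"type": "send_ok", "offset": offset})
	})

	if err := n.Run(); err != nil {
		panic(err)
	}
}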

results.edn
{:perf {:latency-graph {:valid? true},
        :rate-graph {:valid? true},
        :valid? true},
 :timeline {:valid? true},
 :exceptions {:valid? true},
 :stats {:valid? true,
         :count 14865,
         :ok-count 14857,
         :fail-count 0,
         :info-count 8,
         :by-f {:assign {:valid? true,
                         :count 1891,
                         :ok-count 1891,
                         :fail-count 0,
                         :info-count 0},
                :crash {:valid? false,
                        :count 8,
                        :ok-count 0,
                        :fail-count 0,
                        :info-count 8},
                :poll {:valid? true,
                       :count 6697,
                       :ok-count 6697,
                       :fail-count 0,
                       :info-count 0},
                :send {:valid? true,
                       :count 6269,
                       :ok-count 6269,
                       :fail-count 0,
                       :info-count 0}}},
 :availability {:valid? true, :ok-fraction 0.9994618},
 :net {:all {:send-count 100824,
             :recv-count 100824,
             :msg-count 100824,
             :msgs-per-op 6.782644},
       :clients {:send-count 36904,
                 :recv-count 36904,
                 :msg-count 36904},
       :servers {:send-count 63920,
                 :recv-count 63920,
                 :msg-count 63920,
                 :msgs-per-op 4.3000336},
       :valid? true},
 :workload {:valid? true,
            :worst-realtime-lag {:time 0.782357875,
                                 :process 3,
                                 :key "9",
                                 :lag 0.0},
            :bad-error-types (),
            :error-types (),
            :info-txn-causes ()},
 :valid? true}

Also, I only moved to seq-kv the part of the state where it felt most obvious. I wonder what others have done?

seq-kv and lin-kv

I’ve used lin-kv for storing offset-message pairs and seq-kv for storing committed offsets

Thanks to everyone who's shared their answers in the thread!

Hi, I'm not exactly sure how to interpret these results.edn files.
I read the results doc, but still wasn't able to get all the info I needed.

stats.count: is this the total number of test messages sent to our system?
net.all: the total number of messages exchanged between all nodes in the system?
net.clients: the total number of messages sent to the servers by clients? net.clients.send-count is much higher than stats.count, so I assume that number takes both the client requests and the responses into account, plus the commit_offsets and list_committed_offsets requests.
msgs-per-op: I assume every request from a client counts as an op, and every message sent counts toward the message total? Can someone explain this?

I have tried to make some improvements for 5c, and my numbers for 5b and 5c do look different, but I'm very lost trying to read this results file.
Please help!

results.edn
{:perf {:latency-graph {:valid? true},
        :rate-graph {:valid? true},
        :valid? true},
 :timeline {:valid? true},
 :exceptions {:valid? true},
 :stats {:valid? true,
         :count 17504,
         :ok-count 17487,
         :fail-count 0,
         :info-count 17,
         :by-f {:assign {:valid? true,
                         :count 2101,
                         :ok-count 2101,
                         :fail-count 0,
                         :info-count 0},
                :crash {:valid? false,
                        :count 8,
                        :ok-count 0,
                        :fail-count 0,
                        :info-count 8},
                :poll {:valid? true,
                       :count 8044,
                       :ok-count 8044,
                       :fail-count 0,
                       :info-count 0},
                :send {:valid? true,
                       :count 7351,
                       :ok-count 7342,
                       :fail-count 0,
                       :info-count 9}}},
 :availability {:valid? true, :ok-fraction 0.9990288},
 :net {:all {:send-count 57023,
             :recv-count 57023,
             :msg-count 57023,
             :msgs-per-op 3.2577126},
       :clients {:send-count 41479,
                 :recv-count 41479,
                 :msg-count 41479},
       :servers {:send-count 15544,
                 :recv-count 15544,
                 :msg-count 15544,
                 :msgs-per-op 0.8880256},
       :valid? true},
 :workload {:valid? true,
            :worst-realtime-lag {:time 39.94545445,
                                 :process 23,
                                 :key "2",
                                 :lag 39.926421005},
            :bad-error-types (),
            :error-types (),
            :info-txn-causes (:net-timeout)},
 :valid? true}

The worst realtime lag looks very bad.
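
Not an authoritative answer, but the numbers in the file above are consistent with msgs-per-op simply being the network message count divided by the total operation count (:stats :count), computed once over all messages and once over server-to-server messages only:

:net :all      57023 / 17504 ≈ 3.2577   (reported :msgs-per-op 3.2577126)
:net :servers  15544 / 17504 ≈ 0.8880   (reported :msgs-per-op 0.8880256)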