I'm doing the Backend Dogfight 2nd edition using Phoenix and SQLite; let's see how that goes!

:brazil: “Rinha de Backend”, aka Backend Dogfight, is something that started on Twitter (I guess) last year and that I completely missed. Now I’m trying the second edition and I want to have fun with it.

tl;dr: it’s a simple web backend where you POST to charge a customer and GET to see their balance and latest transactions; every other CRUD operation is out of scope. You must have a load balancer and two nodes.
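
For reference, the whole API surface is two routes. A minimal Phoenix router sketch (the Portuguese path names come from the challenge spec; the controller names are the ones that show up later in this thread):

scope "/", BackendFightWeb do
  pipe_through :api

  # POST charges/credits a customer; GET returns balance + latest transactions
  post "/clientes/:id/transacoes", TransactionController, :create
  get "/clientes/:id/extrato", CustomerController, :show
end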

My Stack

  • Elixir Phoenix
  • SQLite (heck yea)
  • LiteFS (maybe I’ll try corrosion? Or someone else can take over)

Note: this is a write-heavy use case, and LiteFS shines in read-heavy ones. That’s the fun part!

Where am I at?

I coded the basics of the endpoints a few days ago, and today I ran the Gatling test for the first time, using just one node to see how that goes, and…

57% error 500s!

Looking at the logs, the major culprit (maybe not the only one, I don’t know yet):

08:25:17.750 request_id=F7IuHpb1FlDiO4IACc0h [error] GenServer #PID<0.30517.1> terminating
** (Exqlite.Error) Database busy
INSERT INTO "transactions" ("customer_id","description","type","value","inserted_at") VALUES (?,?,?,?,?) RETURNING "id"
    (ecto_sql 3.11.1) lib/ecto/adapters/sql.ex:1054: Ecto.Adapters.SQL.raise_sql_call_error/1
    (ecto 3.11.1) lib/ecto/repo/schema.ex:775: Ecto.Repo.Schema.apply/4
    (ecto 3.11.1) lib/ecto/repo/schema.ex:377: anonymous fn/15 in Ecto.Repo.Schema.do_insert/4
    (ecto_sql 3.11.1) lib/ecto/adapters/sql.ex:1358: anonymous fn/3 in Ecto.Adapters.SQL.checkout_or_transaction/4
    (db_connection 2.6.0) lib/db_connection.ex:1710: DBConnection.run_transaction/4
    (backend_fight 0.1.0) lib/backend_fight/bank.ex:208: BackendFight.Bank.create_transaction/2
    (backend_fight 0.1.0) lib/backend_fight_web/controllers/transaction_controller.ex:16: BackendFightWeb.TransactionController.create/2
    (backend_fight 0.1.0) lib/backend_fight_web/controllers/transaction_controller.ex:1: BackendFightWeb.TransactionController.action/2

Yay, write-heavy!

Well, this is not the end; there’s loads of stuff to do, so I’ll keep posting on this thread as I go. The last submission date is 2024-03-10T23:59:59-03:00, so I’m good for now.


You may want to try:

https://sqlite.org/hctree/doc/hctree/doc/hctree/index.html


Perfect timing, I just started reading some stuff!

--- a/lib/backend_fight/bank.ex
+++ b/lib/backend_fight/bank.ex
@@ -204,19 +204,20 @@ defmodule BackendFight.Bank do
 
   """
   def create_transaction(customer, attrs \\ %{}) do
-    res =
-      Repo.transaction(fn ->
+    # res =
+      # Repo.transaction(fn ->
         balance = get_customer_balance(customer.id)
 
         %Transaction{}
         |> Transaction.changeset(attrs, customer.id)
         |> Transaction.validate_balance(balance, customer.limit)
         |> Repo.insert()
-      end)
+      # end)
 
-    with {:ok, res} <- res do
-      res
-    end
+    # with {:ok, res} <- res do
+    #   res
+    # end
   end

Removed the transaction and stopped getting “Database busy”.

That leaves a few validation errors. Currently my get_customer_balance() function is just a SELECT sum(), so let me cache that in the customer itself and see how that goes.
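
For context, that sum-based query boils down to something like this (a sketch, not the exact code; assumes the usual import Ecto.Query in the context module):

  # Credits add, debits subtract; a customer with no transactions has balance 0
  def query_get_balance(customer_id) do
    from t in Transaction,
      where: t.customer_id == ^customer_id,
      select:
        coalesce(
          sum(fragment("CASE WHEN ? = 'c' THEN ? ELSE -? END", t.type, t.value, t.value)),
          0
        )
  end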


Down to 1 KO, which seems validation-related; I’ll look into that.

I’ve added a balance column to the customers table so there’s no need to sum() it on every GET, and I also went fancy and created a trigger to validate and update balances:

      CREATE TRIGGER validate_balance_before_insert_transaction
      BEFORE INSERT ON transactions
      BEGIN
        SELECT CASE WHEN (select balance from customers where id = NEW.customer_id) + (
          case when NEW.type = 'c' then +NEW.value else -NEW.value end
        ) < -(select "limit" from customers where id = NEW.customer_id) THEN
          RAISE (ABORT, 'Invalid value')
        END;

        UPDATE customers
        SET balance = customers.balance + (case when NEW.type = 'c' then +NEW.value else -NEW.value end)
        WHERE id = NEW.customer_id;
      END;
--- a/lib/backend_fight/bank.ex
+++ b/lib/backend_fight/bank.ex
@@ -203,24 +203,19 @@ defmodule BackendFight.Bank do
       {:error, %Ecto.Changeset{}}
 
   """
-  def create_transaction(customer, attrs \\ %{}) do
-    res =
-      Repo.transaction(fn ->
-        balance = get_customer_balance(customer.id)
-
-        %Transaction{}
-        |> Transaction.changeset(attrs, customer.id)
-        |> Transaction.validate_balance(balance, customer.limit)
-        |> Repo.insert()
-      end)
-
-    with {:ok, res} <- res do
-      res
-    end
+  def create_transaction(%{id: customer_id}, attrs \\ %{}) do
+    %Transaction{}
+    |> Transaction.changeset(attrs, customer_id)
+    |> Repo.insert()
+  rescue
+    _e in Exqlite.Error ->
+      {:error, %Transaction{}
+      |> Transaction.changeset(attrs, customer_id)
+      |> Ecto.Changeset.add_error(:customer_id, "Invalid balance")}
   end
 
   def get_customer_balance(customer_id) do
-    Repo.one!(query_get_balance(customer_id))
+    Repo.get(Customer, customer_id).balance
   end
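
For reference, the cached balance column and the trigger go in through a plain migration, roughly like this (a sketch; the module name and column type are my guesses, the real migration isn’t in this post):

defmodule BackendFight.Repo.Migrations.AddBalanceAndTrigger do
  use Ecto.Migration

  def up do
    # Cached balance, kept in sync by the trigger below
    alter table(:customers) do
      add :balance, :integer, null: false, default: 0
    end

    execute """
    CREATE TRIGGER validate_balance_before_insert_transaction
    BEFORE INSERT ON transactions
    BEGIN
      SELECT CASE WHEN (SELECT balance FROM customers WHERE id = NEW.customer_id) + (
        CASE WHEN NEW.type = 'c' THEN +NEW.value ELSE -NEW.value END
      ) < -(SELECT "limit" FROM customers WHERE id = NEW.customer_id) THEN
        RAISE (ABORT, 'Invalid value')
      END;

      UPDATE customers
      SET balance = customers.balance + (CASE WHEN NEW.type = 'c' THEN +NEW.value ELSE -NEW.value END)
      WHERE id = NEW.customer_id;
    END
    """
  end

  def down do
    execute "DROP TRIGGER validate_balance_before_insert_transaction"

    alter table(:customers) do
      remove :balance
    end
  end
end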

I know that rescue is very ugly, I promise I won’t sin a lot.

What if I told you that the 1 failing test was a translation issue? :clown_face:

--- a/lib/backend_fight/bank.ex
+++ b/lib/backend_fight/bank.ex
@@ -13,7 +13,7 @@ defmodule BackendFight.Bank do
     values_query = from t in Transaction,
       select: %{
         id: t.id,
-        value: t.value,
+        valor: t.value,
         tipo: t.type,
         descricao: t.description,
         realizada_em: t.inserted_at
@@ -26,7 +26,7 @@ defmodule BackendFight.Bank do
     customer_query = from c in Customer,
       select: %{
         id: c.id,
-        value: c.limit,
+        valor: c.limit,
         tipo: "tipo",
         descricao: subquery(balance_subquery),
         realizada_em: "now"
@@ -35,7 +35,7 @@ defmodule BackendFight.Bank do
       where: c.id == ^id
 
     case Repo.all(customer_query) do
-      [%{value: limite, descricao: saldo} | transactions] ->
+      [%{valor: limite, descricao: saldo} | transactions] ->

After some looking around, I’ve Dockerfile-ed it, clustered it, and Fly.RPC-ed it.

I don’t really like those percentiles. The gist is, the main node owns the DB (yay single point of failure):

version: "3.5"

services:
  rinha1: &api
    # image: .
    build:
      context: .
      dockerfile: Dockerfile
    hostname: rinha1
    environment:
      - DB_HOSTNAME=db
      - PORT=3000
      - SECRET_KEY_BASE=1FAGGKyEC9U+FODc+Jf3apLW+ntvEiT3uAxP4MTuAsZEgr4Z9GQ5X+60UeMXTQQl
      - RELEASE_COOKIE=PGL67P4KHY62MQBXIOTTJ3OTHQB52LEHKW57WRXUH5ZQDWIS42CQ====
      - DATABASE_PATH=/prod.db
      - RINHA_APP_NAME=rinha1
      - IP_V4_ADDRESS=192.0.1.11
      - PRIMARY_REGION=primary
      - MY_REGION=primary
    ports:
      - "3001:3000"
    deploy:
      resources:
        limits:
          cpus: "0.68"
          memory: "270MB"
    networks:
      erlcluster:
        ipv4_address: 192.0.1.11

  rinha2:
    <<: *api
    hostname: rinha2
    environment:
      - DB_HOSTNAME=db
      - PORT=3000
      - SECRET_KEY_BASE=1FAGGKyEC9U+FODc+Jf3apLW+ntvEiT3uAxP4MTuAsZEgr4Z9GQ5X+60UeMXTQQl
      - RELEASE_COOKIE=PGL67P4KHY62MQBXIOTTJ3OTHQB52LEHKW57WRXUH5ZQDWIS42CQ====
      - DATABASE_PATH=/prod.db
      - RINHA_APP_NAME=rinha2
      - IP_V4_ADDRESS=192.0.1.12
      - PRIMARY_REGION=primary
      - MY_REGION=replica
    ports:
      - "3002:3000"
    deploy:
      resources:
        limits:
          cpus: "0.67"
          memory: "270MB"
    networks:
      erlcluster:
        ipv4_address: 192.0.1.12
  
  nginx:
    image: nginx:latest
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
    depends_on:
      - rinha1
      - rinha2
    ports:
      - "9999:9999"
    deploy:
      resources:
        limits:
          cpus: "0.15"
          memory: "10MB"
    networks:
      erlcluster:

networks:
  erlcluster:
    ipam:
      driver: default
      config:
        - subnet: "192.0.1.0/24"

Huge props to @slouchpie for this:

I didn’t have to change a lot of code; Fly.RPC just works:

  def do_create(customer_id, transaction_params) do
    Fly.RPC.rpc_primary(fn ->
      with {:ok, %Transaction{} = _transaction} <-
             Bank.create_transaction(%{id: customer_id}, transaction_params) do
        customer = Bank.get_customer(customer_id)
        {:ok, customer}
      end
    end)
  end

Though I had to tune the DBConnection pool a bit:

  config :backend_fight, BackendFight.Repo,
    database: database_path,
    pool_size: String.to_integer(System.get_env("POOL_SIZE") || "25"),
    cache_size: 10_000

What I’m thinking of doing next is maybe using some caching and/or PubSub so the replica node can get fresh state. Let’s see how that goes.

Nice, looks like you already got Fly.RPC working.

~sheertj/elixir_litefs (on SourceHut) might be useful if you want to do it off of Ecto (it’s based on fly_postgres).

There’s also a Repo.update(await: false) option so you don’t wait for the replication to complete.

I’m using a GenServer to make a replicated cache. I have a lot of custom compression logic going on that I ripped out before pasting it here, so this code might not actually work, but it gives the gist of it.

  def start_link(_args) do
    GenServer.start_link(__MODULE__, [])
  end

  def init(init_arg) do
    Phoenix.PubSub.subscribe(Hora.PubSub, "cache")
    {:ok, init_arg}
  end


  def handle_info({:set_session_cache, key, data}, _state) do
    Cachex.put(:prefs_cache, key, data)
    {:noreply, nil}
  end

  def handle_info({:clear_cache}, _state) do
    Cachex.clear(:prefs_cache)
    {:noreply, nil}
  end

  def clear_all_caches() do
    Phoenix.PubSub.broadcast(Hora.PubSub, "cache", {:clear_cache})
  end


  # Ensure that the guid has the correct data
  defp set_session_cache(key, data) do
    Cachex.put(:prefs_cache, key, data})
  end

  defp get_session_cache(key) do
    case Cachex.get(:prefs_cache, key) do
      {:ok, data} -> data
      _ -> nil
    end
  end

  # To avoid local race conditions with the pubsub broadcast, we set locally first
  defp set_session_cache_and_broadcast(key, prefs) do
    set_session_cache(key, prefs)
    Phoenix.PubSub.broadcast(Hora.PubSub, "cache", {:set_session_cache, key, prefs})
  end



Hey I’m back and I bring good news:

@tj1 thanks for the code, I just had to remove a } and rename stuff to match my code, and it worked like a charm.

defmodule BackendFight.CustomerCache do
  use GenServer

  # Public API

  # Ensure that the guid has the correct data
  def set_customer_cache(key, data) do
    Cachex.put(:customer_cache, key, data)
  end

  def get_customer_cache(key) do
    case Cachex.get(:customer_cache, key) do
      {:ok, data} -> data
      {:ok, data} -> data
      _ -> nil
    end
  end

  def fetch_customer_cache(key, func) do
    Cachex.fetch(:customer_cache, key, func)
  end

  def get_and_update_customer_cache!(key, func) do
    cached = get_customer_cache(key)
    with {:commit, data} <- func.(cached) do
      set_customer_cache_and_broadcast(key, data)
      data
    end
  end

  # To avoid local race conditions with the pubsub broadcast, we set locally first
  def set_customer_cache_and_broadcast(key, prefs) do
    set_customer_cache(key, prefs)
    Phoenix.PubSub.broadcast(BackendFight.PubSub, "cache", {:set_customer_cache, key, prefs})
  end

  def clear_all_caches() do
    Cachex.clear(:customer_cache)
    Phoenix.PubSub.broadcast(BackendFight.PubSub, "cache", {:clear_cache})
  end

  # GenServer Spec

  def start_link(_args) do
    GenServer.start_link(__MODULE__, [])
  end

  def init(init_arg) do
    Phoenix.PubSub.subscribe(BackendFight.PubSub, "cache")
    {:ok, init_arg}
  end

  def handle_info({:set_customer_cache, key, data}, _state) do
    Cachex.put(:customer_cache, key, data)
    {:noreply, nil}
  end

  def handle_info({:clear_cache}, _state) do
    Cachex.clear(:customer_cache)
    {:noreply, nil}
  end
end
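
If you’re wiring this up yourself: the Cachex table, the PubSub, and the GenServer all need to live under the application supervisor. Roughly (a sketch; I spell out Cachex’s child spec as a plain map):

  # in BackendFight.Application.start/2
  children = [
    {Phoenix.PubSub, name: BackendFight.PubSub},
    # the :customer_cache table that CustomerCache reads and writes
    %{id: :customer_cache, start: {Cachex, :start_link, [:customer_cache, []]}},
    BackendFight.CustomerCache,
    BackendFightWeb.Endpoint
  ]

  Supervisor.start_link(children, strategy: :one_for_one, name: BackendFight.Supervisor)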

I’ve also switched from splitting customers between nodes to having the primary do all writes for now, since that seemed to show better results. Plus, with the PubSub and the cache, things get better. My customer controller is pretty much:

  def show(conn, %{"id" => id}) do
    case get_customer(id) do
      nil ->
        {:error, :not_found}

      customer_data ->
        render(conn, :show, customer_data: customer_data)
    end
  end

  def get_customer(id) do
    if data = CustomerCache.get_customer_cache(id) do
      data
    else
      Fly.RPC.rpc_region(Bank.region_for_customer(id), {Bank, :get_customer_data, [id]})
    end
  end

(I’ve temporarily made Bank.region_for_customer(id) always say “primary”.)

I was still getting some slow writes, mostly 100~200ms, sometimes more. So I first tried a weird thing: I switched from Repo.insert to Repo.query(“INSERT…”), and things became way faster: mostly under 50ms per write, sometimes going up to 100ms :astonished:. My percentiles dropped!
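
The raw version is basically this (a sketch, not my exact code; same column order as the Ecto-generated INSERT from the first post, with the datetime bound as ISO 8601 text to match ecto_sqlite3’s default storage):

  def fast_create_transaction(customer_id, description, type, value) do
    now =
      NaiveDateTime.utc_now()
      |> NaiveDateTime.truncate(:second)
      |> NaiveDateTime.to_iso8601()

    case Repo.query(
           "INSERT INTO transactions (customer_id, description, type, value, inserted_at) VALUES (?, ?, ?, ?, ?) RETURNING id",
           [customer_id, description, type, value, now]
         ) do
      {:ok, %{rows: [[id]]}} -> {:ok, id}
      # the balance trigger's RAISE(ABORT, ...) surfaces here as an error tuple
      {:error, _} = error -> error
    end
  end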

Then, lastly, I added a queue system to process things in the background, like @mpl and @jstiebs mentioned.
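
The queue is nothing fancy; the gist of it (a sketch with a made-up module name, using the GenServer mailbox as the queue):

defmodule BackendFight.WriteQueue do
  use GenServer

  def start_link(_opts), do: GenServer.start_link(__MODULE__, :ok, name: __MODULE__)

  # Fire-and-forget: the controller replies to the HTTP request immediately,
  # the durable write happens off the request path
  def enqueue(customer_id, attrs), do: GenServer.cast(__MODULE__, {:insert, customer_id, attrs})

  @impl true
  def init(state), do: {:ok, state}

  @impl true
  def handle_cast({:insert, customer_id, attrs}, state) do
    BackendFight.Bank.create_transaction(%{id: customer_id}, attrs)
    {:noreply, state}
  end
end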

Max dropped a bit! I’ll experiment here and there to see if I can min-max this.


That is a bit weird. Why do you need a queue? I’m using the elixir_litefs library with the flag await: false. I mean, the mailbox can get overwhelmed, etc. but that should be fast enough.

I’d be curious to see how much of the latency is due to LiteFS / SQLite. As an experiment, what would happen if you incremented a counter in ETS instead? That would give an idea of where to tune next.
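
Something along these lines, just as a toy upper bound (no SQLite, no LiteFS, nothing touching disk):

  # One counter per customer, atomically updated in place
  table = :ets.new(:balances, [:set, :public, :named_table, write_concurrency: true])
  :ets.insert(table, {42, 0})

  # a 500-unit debit for customer 42
  :ets.update_counter(table, 42, {2, -500})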

I was getting different kinds of errors like these, so I branched back to before I started using LiteFS.

The queue system works really well.

They added new validations that increase concurrency in the first seconds. Here’s where I’m at:

Implementation details:

  • Only primary writes to SQLite.
  • Writes go through a queue, but the transaction-creation reply is immediate; we cache responses so we always have the latest state even while writes haven’t completed.
  • Reads and writes go to the primary. Needed since the latest state is managed by the primary and these tests want correct state ASAP.
  • Disabled the PubSub cache; reading from the primary is faster and consistent.

Nginx

Originally I was giving most of the resources to my two Elixir services, but it turns out giving more to nginx yielded better results.

  • Primary: 0.85 CPUs with 256MB RAM
  • Replica: 0.30 CPUs with 147MB of RAM
  • Nginx: 0.35 CPUs with 147MB of RAM

That fits the challenge limits and works really well. After I tuned nginx, my max dropped from 200~300ms to sub-150ms.


In a real-world system I’d still use LiteFS anyway, for safety :slight_smile:

Built an ARM64 image with GitHub Actions and pushed it to Docker Hub.

Let’s see how that goes.

My repo:

I’m lost. :rofl:

  • Isn’t not writing to disk and just posting to a queue kind of cheating?
  • Why is there nginx?

  • The rules say you can use an in-memory store as long as there are writes to a persistent DB at some point (I looked it up just to be safe; there’s an issue about it :joy:)
  • They ask for a load-balancer service with a round-robin strategy. They also mention you can build one yourself.

Well, if cheating is allowed, I guess we can go full webscale.

I really wanted to know how the high-concurrency branch of SQLite (hctree) would have done, though.

Ha, this runs faster on arm64 I guess; my times are 2x as fast.

and fixed my cluster hack


From the SQLite docs:

rc = sqlite3_open("file:memdb1?mode=memory&cache=shared", &db);

Or:

ATTACH DATABASE 'file:memdb1?mode=memory&cache=shared' AS aux1;

It’s allowed. Love it.

@zeh, not sure I follow, but I’ve tried another approach: multi-tenant SQLite.

Highlights:

  • very very very fast writes
  • Exqlite directly
  • 1 customer = 1 DB = 1 GenServer
  • DynamicSupervisor to make things tight
  • Replica always reads/writes from primary
  • Same memory/CPU share settings as before
PRAGMAs:
synchronous = OFF
journal_mode = WAL
threads = 32
temp_store = MEMORY
mmap_size = 30000000000
page_size = 32768
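
The shape of it, roughly (a sketch with simplified names, exqlite’s ~0.19 API, and only some of the pragmas; assumes a Registry named BackendFight.CustomerRegistry and a DynamicSupervisor are already running):

defmodule BackendFight.CustomerDB do
  # One SQLite file per customer, owned by a single GenServer, so writes for a
  # customer are serialized with no cross-customer contention
  use GenServer
  alias Exqlite.Sqlite3

  def start_link(customer_id) do
    GenServer.start_link(__MODULE__, customer_id, name: via(customer_id))
  end

  def insert(customer_id, value, type, description) do
    GenServer.call(via(customer_id), {:insert, value, type, description})
  end

  defp via(customer_id), do: {:via, Registry, {BackendFight.CustomerRegistry, customer_id}}

  @impl true
  def init(customer_id) do
    {:ok, db} = Sqlite3.open("/data/customer_#{customer_id}.db")

    # synchronous = OFF trades durability for speed, which is fine for this challenge
    for p <- ["synchronous = OFF", "journal_mode = WAL", "temp_store = MEMORY"] do
      :ok = Sqlite3.execute(db, "PRAGMA #{p}")
    end

    :ok =
      Sqlite3.execute(db, """
      CREATE TABLE IF NOT EXISTS transactions (
        id INTEGER PRIMARY KEY,
        value INTEGER NOT NULL,
        type TEXT NOT NULL,
        description TEXT NOT NULL,
        inserted_at TEXT NOT NULL
      )
      """)

    {:ok, db}
  end

  @impl true
  def handle_call({:insert, value, type, description}, _from, db) do
    now = NaiveDateTime.to_iso8601(NaiveDateTime.utc_now())

    {:ok, stmt} =
      Sqlite3.prepare(db, "INSERT INTO transactions (value, type, description, inserted_at) VALUES (?, ?, ?, ?)")

    :ok = Sqlite3.bind(db, stmt, [value, type, description, now])
    :done = Sqlite3.step(db, stmt)
    :ok = Sqlite3.release(db, stmt)
    {:reply, :ok, db}
  end
end

Processes get started on demand, e.g. DynamicSupervisor.start_child(BackendFight.CustomerSupervisor, {BackendFight.CustomerDB, customer_id}).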

Code is messy, and I’ve branched off from the previous solution, so there’s dead code; I’ll clean it up another time, and honestly this will end up as another repo.


new repo

