git.limo: A Git source code management tool powered by Elixir.

Hi, I’ve heard of fly.io quite recently and instantly wanted to give it a try.

With the help of the “Elixir Getting Started Guide” I’ve managed to deploy my project without much hassle :sweat_smile:. The experience has been delightful so far and I’m excited about the distributed possibilities fly.io offers and how naturally it seems to fit with the BEAM world.

My project is a GitHub clone mostly written in Elixir (with NIFs written in C for working with libgit2). It is very self-contained and does not require many dependencies (solely libgit2 and openssh-ssh-keygen).

You can check the GitHub project or the demo running on fly.io here:

https://git.limo/redrabbit/git-limo

It implements (in Elixir) the Git transfer protocol and provides support for both HTTP and SSH (:ssh.daemon/3) transport protocols.
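
For readers unfamiliar with the Git transfer protocol: both transports carry the same "pkt-line" framing, where each line is prefixed with its total length as four hex digits and `0000` acts as a flush marker. The module below is only a minimal sketch of that framing. `PktLineSketch` and its functions are hypothetical names for illustration and are not taken from the git.limo code base.

```elixir
defmodule PktLineSketch do
  # Hypothetical helper illustrating pkt-line framing; not git.limo's actual code.

  # A flush packet marks the end of a section of the stream.
  def flush, do: "0000"

  # Prefix the payload with its total length (payload plus the 4 header bytes),
  # written as four lowercase hex digits.
  def encode(payload) when is_binary(payload) do
    len = byte_size(payload) + 4

    hex =
      len
      |> Integer.to_string(16)
      |> String.downcase()
      |> String.pad_leading(4, "0")

    hex <> payload
  end

  # Split one pkt-line off the front of a buffer and return {payload, rest}.
  def decode("0000" <> rest), do: {:flush, rest}

  def decode(<<hex::binary-size(4), rest::binary>>) do
    len = String.to_integer(hex, 16) - 4
    <<payload::binary-size(len), tail::binary>> = rest
    {payload, tail}
  end
end
```

A real implementation would also have to enforce the maximum packet length and handle partial reads from the socket, which this sketch leaves out.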

Now I’ve seen that only a handful of ports are allowed to be exposed. In my case, I’d like to use port 22 for Git client commands over SSH. I’ve tried to use the experimental allow_public_port without success (IPv6 only?).

Currently, I’m using port 10022, but it’s somewhat cumbersome to use on the client side:

git clone ssh://redrabbit@git.limo:10022/redrabbit/git-limo.git

instead of:

git clone redrabbit@git.limo:redrabbit/git-limo.git

I’d like to start experimenting with the distributed aspect of fly.io soon. Currently the setup is quite simple (see commit Mess around with fly.io) and has

  • a single node running in Europe (fra)
  • a postgres instance
  • 40GB volume for storing Git data

but it would be nice to provide a multi-node environment running in different regions. :star_struck:

Well this is amazing. We’ll look at getting port 22 open for you as well, it might be a little harder than the others we’ve enabled recently but this is a pretty good reason to make it work.

Have you figured out how you’d make clustering work with git repos on different disks in different regions? It would be very cool to have my git repositories somewhere close to me.

:+1:

I also really enjoy the tooling you provide in order to open a secure shell without having to install anything. Having my own restricted (only Git commands) SSH server running in parallel would be awesome :hugs:

All in all, I really enjoy all the stuff you guys made recently (6PN, DNS resolution, WireGuard VPN, volumes, scaling and autoscaling). After two decades of old-school, boring DevOps deployment and maintenance, this all sounds very exciting to me.

Since most of my customers’ installations run on OVH and Hetzner, the only thing I’m a bit scared about with fly.io is resource pricing and billing (pay as you go). So far I’ve bought some credits in order to test the waters before binding my credit card :sweat_smile:.

My Git implementation is quite flexible. The core component is the GitAgent module, a dedicated Erlang process allowing multiple processes to manipulate a Git repository simultaneously:

alias GitRekt.GitAgent

{:ok, agent} = GitAgent.start_link("/data/my-user/my-repo.git")

{:ok, branch} = GitAgent.branch(agent, "master")

{:ok, commit} = GitAgent.peel(agent, branch)

{:ok, author} = GitAgent.commit_author(agent, commit)
{:ok, message} = GitAgent.commit_message(agent, commit)

IO.puts "Last commit by #{author.name} <#{author.email}>:"
IO.puts message

From an API standpoint, interacting with a Git repository running on a remote node is exactly the same as working with a repository stored locally.

If I understand how persistent storage works, a volume can only be bound to a single instance. So I cannot have a shared volume accessible from multiple nodes.

This is no big deal because I could simply have a multi-node aware “coordinator” for assigning Git repositories to specific nodes/regions (see libcluster, Horde). Fun times ahead :grinning_face_with_smiling_eyes:

Now I’m not sure how a company like GitHub handles this kind of stuff. In a basic setup running a set of full nodes (storage, web-interface) in different regions, I would assume that when the repository is first created we assign it to the right region/node (basically letting fly.io choose which region to use in the first place). Any further access to the repository (pull, push, browse files over web, etc.) would use the internal “coordinator” to find the right node independently from the end-user’s location. In the long term, maybe some kind of mechanism to replicate data across volumes in different regions for “popular” repositories…

First of all, I will have to refactor my git-receive-pack implementation.

Currently, when pushing fat repositories (hundreds of thousands of commits), all operations are done in an eager way: reading incoming objects, deserialising them, aggregating stats and meta-info (contributors, issue references, etc.). This fills up the RAM until all the data has been pushed…

I’d also like to have a better data processing pipeline with back-pressure, rate-limit, max concurrent jobs, etc.
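
As a rough illustration of the "max concurrent jobs" part, the standard library's `Task.async_stream/3` can cap how many objects are processed at once while the source is consumed lazily. This is only a sketch under assumptions: `PushPipelineSketch` and `process_object/1` are made-up stand-ins for the real per-object work, and it does not cover rate limiting (a demand-driven library such as GenStage would be the usual choice for full back-pressure).

```elixir
defmodule PushPipelineSketch do
  # Hypothetical stand-in for per-object work (parsing, stats, issue references).
  def process_object(obj), do: {obj, byte_size(obj)}

  # Process the incoming objects with at most `max_jobs` tasks running at once
  # and fold the results into a single running total instead of keeping
  # every processed object in memory.
  def run(objects, max_jobs \\ 4) do
    objects
    |> Task.async_stream(&process_object/1, max_concurrency: max_jobs, ordered: false)
    |> Stream.map(fn {:ok, result} -> result end)
    |> Enum.reduce(0, fn {_obj, size}, total -> total + size end)
  end
end

# Usage: the source is a lazy Stream, so objects are generated on demand.
objects = Stream.map(1..1_000, fn i -> String.duplicate("x", i) end)
total_bytes = PushPipelineSketch.run(objects)
```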

But hey, this is a hobby project of mine and each day only has 24 hours to offer :sweat_smile:

Small semester update :nerd_face:.

I’ve updated and refactored a few things:

  • Refactor the Git storage backend and git-receive-pack in order to write PACK files directly to disk (see #8).
  • Enhance GitRekt.GitAgent with caching and transaction support, better Telemetry integration and more.
  • Rewrite a few template-based views into live views and improve the overall user experience.
  • Add new repository pool (routing-pool) with support for shared cache across Git agents.
  • Add AppSignal integration for Ecto, Phoenix and LiveViews, with special attention to Git-related events.

I also experimented with a distributed setup, which has been surprisingly easy to implement so far :sweat_smile::

Using Horde, I was able to run GitGud.RepoPool on multiple nodes with only a few lines of code.

On a one-node setup, each repository has a dedicated pool of agents (implemented with DynamicSupervisor). All the agents within a pool share the same ETS cache.
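
To make the shared-cache idea concrete, here is a minimal sketch of several processes reading and writing one public ETS table. `RepoCacheSketch` and the cache key shape are assumptions for illustration; the actual GitGud.RepoPool cache may be organised quite differently.

```elixir
defmodule RepoCacheSketch do
  # Create a public table so every agent process in the pool can use it.
  # The table belongs to the calling process and disappears when that process exits.
  def new, do: :ets.new(:repo_cache, [:set, :public, read_concurrency: true])

  # Return the cached value for `key`, or compute it with `fun` and store it.
  def fetch(table, key, fun) when is_function(fun, 0) do
    case :ets.lookup(table, key) do
      [{^key, value}] ->
        value

      [] ->
        value = fun.()
        :ets.insert(table, {key, value})
        value
    end
  end
end

# Usage: a second process hits the entry cached by the first call.
table = RepoCacheSketch.new()
key = {:commit_message, "abc123"}
RepoCacheSketch.fetch(table, key, fn -> "Initial commit" end)
Task.await(Task.async(fn -> RepoCacheSketch.fetch(table, key, fn -> "not used" end) end))
```

Two processes missing on the same key at the same time would both compute the value here, which is harmless for read-only Git data but worth knowing.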

On a multi-node setup, things work in a similar fashion but pools are distributed uniformly across the cluster using a hash ring.
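
For anyone who has not met hash rings before, the general technique can be sketched in a few lines. This is not Horde's implementation: `HashRingSketch`, the number of points per node and the use of `:erlang.phash2/1` are assumptions chosen to keep the example small.

```elixir
defmodule HashRingSketch do
  # Number of virtual points each node gets on the ring (arbitrary choice).
  @points_per_node 64

  # Build a sorted list of {hash, node} points, several per node.
  def new(nodes) do
    points =
      for node <- nodes, i <- 1..@points_per_node do
        {:erlang.phash2({node, i}), node}
      end

    Enum.sort(points)
  end

  # Walk "clockwise": take the first point whose hash is >= the key's hash,
  # wrapping around to the first point of the ring if there is none.
  def node_for(ring, key) do
    hash = :erlang.phash2(key)

    case Enum.find(ring, fn {point, _node} -> point >= hash end) do
      {_point, node} -> node
      nil -> ring |> hd() |> elem(1)
    end
  end
end

# Usage: the same key always maps to the same node while membership is stable.
ring = HashRingSketch.new([:node_a, :node_b, :node_c])
owner = HashRingSketch.node_for(ring, "redrabbit/git-limo")
```

When a node leaves, only the keys it owned move to other nodes. When a node joins, it takes over a share of keys from the existing ones, and for Git repositories each such move means copying data between volumes.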

The API is still the same:

repo = GitGud.RepoQuery.user_repo("redrabbit", "git-limo")
{:ok, agent} = GitRekt.GitAgent.unwrap(repo) # returns agent from pool on the right node
{:ok, head} = GitRekt.GitAgent.head(agent)
{:ok, commit} = GitRekt.GitAgent.peel(agent, head)
{:ok, commit_msg} = GitRekt.GitAgent.commit_message(agent, commit)
IO.puts "Commit #{commit.oid}"
IO.puts commit_msg

So any Git-related command within the cluster is routed to the right node :raised_hands:. This includes the web frontend, GraphQL API and Git transfer protocol implementations (HTTP, SSH).


In order to deploy this setup on Fly, I’m still missing a few things: taking volumes and regions into account :flushed:.

I think having a hash ring for distributing repositories across nodes is not the right strategy here. I don’t want to deal with handing off repositories and cloning their respective data to other nodes when my cluster changes (auto-scaling etc.).

Instead, my idea is to assign a specific region to a repository on creation (an additional :region field on the Ecto schema) and use that region to retrieve the right node in the cluster.

  1. So let’s say a user creates a new repository via the Web frontend. We use the Fly-Region HTTP header to assign the region and init the repository there.

  2. With a different distribution strategy Git commands are routed to the right node based on the repository region.

  3. Profit :champagne:
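
The first two steps could look roughly like the sketch below. It works on plain data so it runs on its own. `RegionRoutingSketch`, the `nodes_by_region` map and the `"fra"` fallback are assumptions for illustration; in a Phoenix app the headers would come from the connection's request headers, which Plug stores as `{name, value}` pairs with lowercase names.

```elixir
defmodule RegionRoutingSketch do
  # Fallback used when the header is absent (assumption: the existing fra node).
  @default_region "fra"

  # Step 1: pick the region for a new repository from the request headers.
  def region_from_headers(headers) do
    case List.keyfind(headers, "fly-region", 0) do
      {_name, region} -> region
      nil -> @default_region
    end
  end

  # Step 2: find the node serving a repository's region.
  # `nodes_by_region` is a hypothetical map kept up to date from cluster membership.
  # Returns {:ok, node} or :error when no node runs in that region.
  def node_for(%{region: region}, nodes_by_region) do
    Map.fetch(nodes_by_region, region)
  end
end

# Usage with made-up node names:
region = RegionRoutingSketch.region_from_headers([{"fly-region", "ams"}])
repo = %{name: "git-limo", region: region}
RegionRoutingSketch.node_for(repo, %{"fra" => :"git@fra", "ams" => :"git@ams"})
```

Because the map holds a single node per region, the sketch shares the limitation of the region-based approach itself: it cannot tell two instances in the same region apart.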

Now this solution is only partially pleasing because it does not support multiple instances in the same region :sweat:.

We’d have to store the Fly volume ID instead of the region and keep track (CRDT) of volume assignments during deployments and cluster changes.

I’m not at all familiar with distributed systems so my approach might seem foolish. I’m very open to criticism and would really like some feedback here :hugs:.