git.limo: A Git source code management tool powered by Elixir.

redrabbit · May 9, 2021, 11:45pm

Hi, I’ve heard of fly.io quite recently and instantly wanted to give it a try.

With the help of the “Elixir Getting Started Guide” I’ve managed to deploy my project without much hassle. The experience has been delightful so far and I’m excited about the distributed possibilities fly.io offers and how naturally it seems to fit with the BEAM world.

My project is a GitHub clone mostly written in Elixir (with NIFs written in C for working with libgit2). It is very self-contained and does not require many dependencies (only libgit2 and openssh-ssh-keygen).

You can check the GitHub project or the demo running on fly.io here:

https://git.limo/redrabbit/git-limo

It implements (in Elixir) the Git transfer protocol and provides support for both HTTP and SSH (:ssh.daemon/3) transport protocols.

Now I’ve seen that only a handful of ports are allowed to be exposed. In my case, I’d like to use port 22 for Git client commands over SSH. I’ve tried to use the experimental allow_public_port without success (IPv6 only?).

Currently, I’m using port 10022, but it’s somewhat cumbersome to use on the client side:

git clone ssh://redrabbit@git.limo:10022/redrabbit/git-limo.git

compared to:

git clone redrabbit@git.limo:redrabbit/git-limo.git

I’d like to start experimenting with the distributed aspect of fly.io soon. Currently the setup is quite simple (see the commit “Mess around with fly.io”) and consists of:

a single node running in Europe (fra)
a postgres instance
40GB volume for storing Git data

but it would be nice to provide a multi-node environment running in different regions.

kurt · May 10, 2021, 1:34pm

Well this is amazing. We’ll look at getting port 22 open for you as well, it might be a little harder than the others we’ve enabled recently but this is a pretty good reason to make it work.

Have you figured out how you’d make clustering work with git repos on different disks in different regions? It would be very cool to have my git repositories somewhere close to me.

redrabbit · May 10, 2021, 5:10pm

I also really enjoy the tooling you provide for opening a secure shell without having to install anything. Having my own restricted (Git commands only) SSH server running in parallel would be awesome.

All in all, I really enjoy all the stuff you guys made recently (6PN, DNS resolution, WireGuard VPN, volumes, scaling and autoscaling). After two decades of old-school, boring DevOps deployment and maintenance, this all sounds very exciting to me.

Since most of my customers’ installations run on OVH and Hetzner, the only thing that scares me a bit about fly.io so far is resource pricing and billing (pay as you go). For now, I’ve bought some credits to test the waters before adding my credit card.

My Git implementation is quite flexible. The core component is the GitAgent module, a dedicated Erlang process allowing multiple processes to manipulate a Git repository simultaneously:

alias GitRekt.GitAgent

{:ok, agent} = GitAgent.start_link("/data/my-user/my-repo.git")

{:ok, branch} = GitAgent.branch(agent, "master")

{:ok, commit} = GitAgent.peel(agent, branch)

{:ok, author} = GitAgent.commit_author(agent, commit)
{:ok, message} = GitAgent.commit_message(agent, commit)

IO.puts "Last commit by #{author.name} <#{author.email}>:"
IO.puts message

From an API standpoint, interacting with a Git repository running on a remote node is exactly the same as working with a repository stored locally.
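The location transparency described above comes from the BEAM itself: a pid can be called the same way whether the process lives on the local node or on another connected node. Here is a minimal sketch, assuming a hypothetical `RepoAgentSketch` module that stands in for a per-repository agent (it is not `GitRekt.GitAgent` and does not touch Git):

```elixir
defmodule RepoAgentSketch do
  # Hypothetical stand-in for a per-repository agent process.
  use GenServer

  def start_link(path), do: GenServer.start_link(__MODULE__, path)

  # Callers only need a pid; the process may run on any connected node.
  def path(agent), do: GenServer.call(agent, :path)
  def node_name(agent), do: GenServer.call(agent, :node)

  @impl true
  def init(path), do: {:ok, path}

  @impl true
  def handle_call(:path, _from, path), do: {:reply, {:ok, path}, path}
  def handle_call(:node, _from, path), do: {:reply, {:ok, node()}, path}
end

{:ok, agent} = RepoAgentSketch.start_link("/data/my-user/my-repo.git")
{:ok, path} = RepoAgentSketch.path(agent)
{:ok, where} = RepoAgentSketch.node_name(agent)
IO.puts("#{path} is served by #{where}")
```

The calling code stays identical if the agent was started on a remote node; only the value returned by `node_name/1` changes.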

If I understand how persistent storage works, a volume can only be bound to a single instance. So I cannot have a shared volume accessible from multiple nodes.

This is no big deal because I could simply have a multi-node-aware “coordinator” for assigning Git repositories to specific nodes/regions (see libcluster, Horde). Fun times ahead.

Now I’m not sure how a company like GitHub handles this kind of stuff. In a basic setup running a set of full nodes (storage, web-interface) in different regions, I would assume that when the repository is first created we assign it to the right region/node (basically letting fly.io choose which region to use in the first place). Any further access to the repository (pull, push, browse files over web, etc.) would use the internal “coordinator” to find the right node independently from the end-user’s location. In the long term, maybe some kind of mechanism to replicate data across volumes in different regions for “popular” repositories…

First of all, I will have to refactor my git-receive-pack implementation.

Currently, when pushing fat repositories (hundreds of thousands of commits), all operations are done eagerly: reading incoming objects, deserialising them, aggregating stats and metadata (contributors, issue references, etc.). This fills up the RAM until all data has been pushed…

I’d also like to have a better data processing pipeline with back-pressure, rate-limit, max concurrent jobs, etc.
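One way to get bounded concurrency with a simple form of back-pressure from the standard library alone is `Task.async_stream/3` over a lazy stream. This is only a sketch of the idea, not the git-receive-pack code: `LazyPipelineSketch` and its `process/1` function are hypothetical, and the "objects" are fake binaries.

```elixir
defmodule LazyPipelineSketch do
  # Hypothetical per-object work; stands in for decoding a Git object
  # and extracting metadata from it.
  def process(object), do: byte_size(object)

  # Pulls objects lazily and folds results into a running total, so at most
  # `max_concurrency` objects are being processed at any time.
  def total_size(objects, max_concurrency \\ 4) do
    objects
    |> Task.async_stream(&process/1, max_concurrency: max_concurrency, ordered: false)
    |> Enum.reduce(0, fn {:ok, size}, acc -> acc + size end)
  end
end

# A lazy stream of fake objects; nothing is materialised up front.
objects = Stream.map(1..1_000, fn i -> String.duplicate("x", rem(i, 10)) end)
IO.inspect(LazyPipelineSketch.total_size(objects))
```

Because the results are reduced as they arrive instead of being collected into a list, memory use stays roughly proportional to `max_concurrency` rather than to the size of the push.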

But hey, this is a hobby project of mine and each day only has 24 hours to offer

redrabbit · September 13, 2021, 8:49pm

Small semester update.

I’ve updated and refactored a few things:

Refactor the Git storage backend and git-receive-pack in order to write PACK files directly to disk (see #8).
Enhance GitRekt.GitAgent with caching and transaction support, better Telemetry integration and more.
Rewrite a few template-based views into live views and improve the overall user experience.
Add new repository pool (routing-pool) with support for a shared cache across Git agents.
Add integration for AppSignal: Ecto, Phoenix and LiveView, with special attention to Git-related events.

I also experimented with a distributed setup, which has been surprisingly easy to implement so far:

Using Horde, I was able to run GitGud.RepoPool on multiple nodes with only a few lines of code.

On a one-node setup, each repository has a dedicated pool of agents (implemented with DynamicSupervisor). All the agents within a pool share the same ETS cache.

On a multi-node setup, things work in a similar fashion but pools are distributed uniformly across the cluster using a hash ring.

The API is still the same:

repo = GitGud.RepoQuery.user_repo("redrabbit", "git-limo")
{:ok, agent} = GitRekt.GitAgent.unwrap(repo) # returns agent from pool on the right node
{:ok, head} = GitRekt.GitAgent.head(agent)
{:ok, commit} = GitRekt.GitAgent.peel(agent, head)
{:ok, commit_msg} = GitRekt.GitAgent.commit_message(agent, commit)
IO.puts "Commit #{commit.oid}"
IO.puts commit_msg

So any Git-related command within the cluster is routed to the right node. This includes the web frontend, the GraphQL API and the Git transfer protocol implementations (HTTP, SSH).
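To illustrate how a hash ring maps a repository to a node, here is a minimal consistent-hashing sketch. It is not how Horde implements its distribution; `HashRingSketch`, the node names and the number of points per node are all assumptions made for the example.

```elixir
defmodule HashRingSketch do
  # Number of virtual points each node gets on the ring (arbitrary choice).
  @points_per_node 64
  # 2^32, the largest range accepted by :erlang.phash2/2.
  @ring_size 4_294_967_296

  # Builds the ring as a sorted list of {point, node} pairs.
  def new(nodes) do
    points =
      for node <- nodes, i <- 1..@points_per_node do
        {:erlang.phash2({node, i}, @ring_size), node}
      end

    Enum.sort(points)
  end

  # Returns the node owning `key`: the first point at or after the key's hash,
  # wrapping around to the first point of the ring. Linear scan for brevity.
  def node_for([{_point, first} | _] = ring, key) do
    hash = :erlang.phash2(key, @ring_size)

    case Enum.find(ring, fn {point, _node} -> point >= hash end) do
      {_point, node} -> node
      nil -> first
    end
  end
end

ring = HashRingSketch.new([:"app@fra", :"app@lax"])
IO.inspect(HashRingSketch.node_for(ring, "redrabbit/git-limo"))
```

With this scheme, adding or removing a node only moves the keys that fall next to that node's points, but those repositories still change owner, which matters once the data lives on a node-local volume.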

In order to deploy this setup on Fly, I’m still missing a few things: taking volumes and regions into account.

I think having a hash ring for distributing repositories across nodes is not the right strategy here. I don’t want to deal with handing off repositories and cloning their respective data to other nodes when my cluster changes (auto-scaling etc.).

Instead my idea is to assign a specific region to a repository (additional :region field on Ecto schema) on creation and use the region to retrieve the right node in the cluster.

So let’s say a user creates a new repository via the Web frontend. We use the Fly-Region HTTP header to assign the region and init the repository there.
With a different distribution strategy Git commands are routed to the right node based on the repository region.
Profit
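The steps above can be sketched without Ecto or Plug. Everything here is hypothetical: `RegionRouterSketch`, the `"fra"` fallback and the `nodes_by_region` map are made up for illustration, and headers are assumed to arrive as a list of `{name, value}` pairs with lower-cased names.

```elixir
defmodule RegionRouterSketch do
  # Hypothetical repository record carrying the proposed :region field.
  defstruct [:name, :region]

  # Reads the region from the request headers, falling back to a default.
  def region_from_headers(headers, default \\ "fra") do
    case List.keyfind(headers, "fly-region", 0) do
      {_name, region} -> region
      nil -> default
    end
  end

  # Step 1: assign the region when the repository is created.
  def create_repo(name, headers) do
    %__MODULE__{name: name, region: region_from_headers(headers)}
  end

  # Step 2: route by region. `nodes_by_region` maps a region to a node name.
  def node_for(%__MODULE__{region: region}, nodes_by_region) do
    Map.fetch(nodes_by_region, region)
  end
end

repo = RegionRouterSketch.create_repo("redrabbit/phoenix", [{"fly-region", "lax"}])
nodes_by_region = %{"fra" => :"git@fra", "lax" => :"git@lax"}
IO.inspect(RegionRouterSketch.node_for(repo, nodes_by_region))
```

Returning `:error` for an unknown region leaves it to the caller to decide what to do when no node currently serves that region.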

Now this solution is only partially pleasing because it does not support multiple instances in the same region.

We’d have to store the Fly volume ID instead of the region and keep track (CRDT) of volume assignment during deployment and cluster changes.

I’m not at all familiar with distributed systems, so my approach might seem foolish. I’m very open to criticism and would really like some feedback here.

redrabbit · September 29, 2021, 4:34pm

Small update

I’ve refactored GitGud.RepoPool and GitGud.RepoStorage in order to take Fly volumes into account:

Introduce the concept of volumes. Basically, a volume is a 32-bit random string associated with a node’s disk.
Each GitGud.Repo schema has a :volume field. Creating a new repository will assign the node’s volume by default.
Replace the :horde Mix dependency with Erlang’s built-in :global registry, which is simpler and faster for my use case.
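For readers unfamiliar with `:global`, here is a minimal sketch of registering and looking up a process by volume. The `VolumeRegistrySketch` module, the `{:repo_volume, volume}` naming scheme and the volume identifier are assumptions for illustration; the actual GitGud.RepoPool code may look quite different.

```elixir
defmodule VolumeRegistrySketch do
  # Cluster-wide name for the process serving a given volume (hypothetical scheme).
  def name(volume), do: {:repo_volume, volume}

  # Returns :yes on success, :no if the name is already taken.
  def register(volume, pid \\ self()) do
    :global.register_name(name(volume), pid)
  end

  # Resolves the pid from any node in the cluster.
  def whereis(volume) do
    case :global.whereis_name(name(volume)) do
      :undefined -> {:error, :not_found}
      pid when is_pid(pid) -> {:ok, pid}
    end
  end
end

# "a1b2c3d4" is a made-up volume identifier.
:yes = VolumeRegistrySketch.register("a1b2c3d4")
{:ok, pid} = VolumeRegistrySketch.whereis("a1b2c3d4")
IO.inspect(pid == self())
```

A repository's :volume field can then be used as the lookup key, so a request on any node finds the process that sits next to the data.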

For testing purposes, I’ve updated my Fly setup to work in two regions (fra and lax).

While it was working with repositories stored on different volumes, I had to deal with huge latency because of the communication between instances. So a lot of refactoring had to be done:

Batch all Git commands into transactions in order to keep the number of round trips between instances low.
Refactor GitGud.RepoPool to manage repository pools more efficiently. Also try to minimise the number of agents used for each request.
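The batching idea can be shown with a plain GenServer: instead of one call per command, the caller sends the whole list once and the agent runs it locally. `BatchAgentSketch` is hypothetical and uses map lookups in place of real Git commands.

```elixir
defmodule BatchAgentSketch do
  use GenServer

  def start_link(state), do: GenServer.start_link(__MODULE__, state)

  # One round trip per lookup.
  def get(agent, key), do: GenServer.call(agent, {:get, key})

  # One round trip for the whole batch; the lookups run inside the agent.
  def transaction(agent, keys), do: GenServer.call(agent, {:batch, keys})

  @impl true
  def init(state), do: {:ok, state}

  @impl true
  def handle_call({:get, key}, _from, state) do
    {:reply, Map.fetch(state, key), state}
  end

  def handle_call({:batch, keys}, _from, state) do
    {:reply, Enum.map(keys, &Map.fetch(state, &1)), state}
  end
end

{:ok, agent} = BatchAgentSketch.start_link(%{head: "master", author: "redrabbit"})
IO.inspect(BatchAgentSketch.transaction(agent, [:head, :author]))
```

When the agent runs in another region, each `GenServer.call/2` pays the cross-region latency, so collapsing N calls into one is where most of the gain comes from.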

After some work, I’ve got my latest code running on https://git.limo in a two-region setup.

I’ve set up two repositories in different regions for testing purposes:

fra - https://git.limo/redrabbit/elixir
lax - https://git.limo/redrabbit/phoenix

So far I’m quite happy with the latency and the optimisations I’ve done. But try for yourself and let me know.

kurt · September 29, 2021, 5:12pm

This is still amazing, I’m glad you bumped it. Are you interested in co-authoring a blog post about it with us? Or interested in us writing one?

redrabbit · September 29, 2021, 7:57pm

Thank you for the supportive reply and for the platform you guys offer. I’ve had nothing but a great experience so far!

I’d be very happy to see a story about my project on your blog! If there’s anything I can help with, just ping me.

ahruygt · October 6, 2021, 5:15pm

Hey hey! This is Annie from Fly.io. I’d love to make an illustration for your project. Can you email me at annie@fly.io when you have a moment?

catflydotio · March 21, 2022, 4:16pm

@redrabbit Finally tweeted about this today.

redrabbit · March 21, 2022, 4:49pm

@catflydotio Thanks for sharing this.

I started at a new company earlier this year and haven’t had much time to continue my journey with git.limo. I have a couple of ideas in the pipeline though; hopefully I can find some time later on.

Anyway thank you for the kind words.

catflydotio · March 21, 2022, 8:56pm

@redrabbit Congratulations!
