Getting Honeycomb/OpenTelemetry working with distributed Elixir/Phoenix/Postgres

I started working off of this document, which was incredibly helpful… until I actually deployed to fly.io.

I have a distributed Elixir/Phoenix cluster running against a distributed Postgres cluster, both on fly.io using fly_postgres and libcluster. The app itself does nothing other than implement magic-link auth and a way to change a single user setting.

It works locally and sends traces to Honeycomb. When I deploy it to fly.io, the app works fine, but nothing ever shows up in Honeycomb. I have simplified the config as much as possible (to the point where it is a bad example; if you’re looking for a way to configure this and stumble upon this post, don’t do it like this). The config is identical for dev and prod. Here is the diff with sensitive values withheld:

diff --git a/config/runtime.exs b/config/runtime.exs
index c1e3ff4..acc6118 100644
--- a/config/runtime.exs
+++ b/config/runtime.exs
@@ -20,6 +20,17 @@ if System.get_env("PHX_SERVER") do
   config :my_app, MyAppWeb.Endpoint, server: true
 end
 
+config :opentelemetry,
+  span_processor: :batch,
+  exporter: :otlp
+
+config :opentelemetry_exporter,
+  otlp_endpoint: "https://api.honeycomb.io",
+  otlp_headers: [
+    {"x-honeycomb-team", "xxxxxxxxxxxxxxxxx"}
+  ]
+
 if config_env() == :prod do
   database_url =
     System.get_env("DATABASE_URL") ||
diff --git a/lib/my_app/application.ex b/lib/my_app/application.ex
index 8b0269e..35668b4 100644
--- a/lib/my_app/application.ex
+++ b/lib/my_app/application.ex
@@ -5,6 +5,8 @@ defmodule MyApp.Application do
 
   @impl true
   def start(_type, _args) do
+    OpentelemetryPhoenix.setup()
+    OpentelemetryLiveView.setup()
     topologies = Application.get_env(:libcluster, :topologies) || []
 
     children = [
diff --git a/mix.exs b/mix.exs
index 7a5d270..79c4ee8 100644
--- a/mix.exs
+++ b/mix.exs
@@ -42,6 +42,11 @@ defmodule MyApp.MixProject do
       {:hackney, "~> 1.18"},
       {:jason, "~> 1.2"},
       {:libcluster, "~> 3.3"},
+      {:opentelemetry_exporter, "~> 1.0"},
+      {:opentelemetry, "~> 1.0"},
+      {:opentelemetry_api, "~> 1.0"},
+      {:opentelemetry_liveview, "~> 1.0.0-rc.4"},
+      {:opentelemetry_phoenix, "~> 1.0"},
       {:phoenix_ecto, "~> 4.4"},
       {:phoenix_html, "~> 3.0"},
       {:phoenix_live_dashboard, "~> 0.6"},
@@ -80,10 +85,16 @@ defmodule MyApp.MixProject do
     ]
   end
 
-  # [Sets up a static clustering cookie for using in containers.](https://fly.io/docs/app-guides/elixir-static-cookie/)
+  # cookie field [Sets up a static clustering cookie for using in
+  # containers.](https://fly.io/docs/app-guides/elixir-static-cookie/)
   defp releases() do
     [
       my_app: [
+        applications: [
+          my_app: :permanent,
+          opentelemetry_exporter: :permanent,
+          opentelemetry: :temporary
+        ],
         include_executables_for: [:unix],
         cookie: "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
       ]
diff --git a/rel/overlays/bin/server b/rel/overlays/bin/server
index 465e970..a608cd8 100755
--- a/rel/overlays/bin/server
+++ b/rel/overlays/bin/server
@@ -1,3 +1,9 @@
 #!/bin/sh
 cd -P -- "$(dirname -- "$0")"
-PHX_SERVER=true exec ./my_app start
+
+# This adds annotations so all of my otel spans have information about
+# the environment in which they execute and then starts the phoenix
+# server instance.
+OTEL_RESOURCE_ATTRIBUTES="service.name=${FLY_APP_NAME},service.instance.id=${FLY_ALLOC_ID},cloud.provider=fly_io,cloud.region.id=${FLY_REGION}" \
+  PHX_SERVER=true \
+  exec ./my_app start

At the risk of sending people who are trying to help me chasing red herrings: I suspect I have something screwed up or not set in my release. The other thing I could think of, but couldn’t figure out how to troubleshoot, is that OpenTelemetry data is getting swallowed in the fly.io network or DNS or something.

I really have no idea what to do at this point. There are no log messages indicating that anything is failing. I have no idea how to turn on debug logging for the OpenTelemetry libs. I’m honestly a little shocked that I am struggling this much with something that should be trivial. Any help or ideas would be profoundly appreciated.
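(If anyone knows the proper way, please share. One blunt option I may try: the Erlang/Elixir SDK logs through the standard OTP logger, so raising the primary log level to :debug from a remote shell should surface anything it logs, assuming it logs at all.)

```elixir
# Sketch: raise the OTP logger's primary level so any debug-level
# messages from the OpenTelemetry applications reach the console.
:logger.set_primary_config(:level, :debug)
```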

I just tried switching the opentelemetry exporter to stdout instead of sending to Honeycomb, to see if it was even running. I used this config directly in runtime.exs, above the if config_env() == :prod do statement.

config :opentelemetry, :processors,
  otel_batch_processor: %{
    exporter: {:otel_exporter_stdout, []}
  }

Here are the logs from the deploy and subsequent call that should have logged telemetry data.

2022-09-10T19:22:06Z runner[ca900611] dfw [info]Starting instance
2022-09-10T19:22:06Z runner[ca900611] dfw [info]Configuring virtual machine
2022-09-10T19:22:06Z runner[ca900611] dfw [info]Pulling container image
2022-09-10T19:22:07Z runner[ca900611] dfw [info]Unpacking image
2022-09-10T19:22:08Z runner[ca900611] dfw [info]Preparing kernel init
2022-09-10T19:22:08Z runner[ca900611] dfw [info]Configuring firecracker
2022-09-10T19:22:08Z runner[ca900611] dfw [info]Starting virtual machine
2022-09-10T19:22:08Z app[ca900611] dfw [info]Starting init (commit: 249766e)...
2022-09-10T19:22:08Z app[ca900611] dfw [info]Setting up swapspace version 1, size = 512 MiB (536866816 bytes)
2022-09-10T19:22:08Z app[ca900611] dfw [info]no label, UUID=c073199d-abba-47a1-80cb-4d0794229080
2022-09-10T19:22:08Z app[ca900611] dfw [info]Preparing to run: `/app/bin/migrate` as nobody
2022-09-10T19:22:08Z app[ca900611] dfw [info]2022/09/10 19:22:08 listening on [fdaa:0:869c:a7b:cf99:ca90:611:2]:22 (DNS: [fdaa::3]:53)
2022-09-10T19:22:10Z app[ca900611] dfw [info]19:22:10.966 [info] Migrations already up
2022-09-10T19:22:11Z app[ca900611] dfw [info]Starting clean up.
2022-09-10T19:22:32Z runner[a05fbf8c] sjc [info]Shutting down virtual machine
2022-09-10T19:22:32Z app[a05fbf8c] sjc [info]Sending signal SIGTERM to main child process w/ PID 515
2022-09-10T19:22:32Z app[a05fbf8c] sjc [info]19:22:32.125 [notice] SIGTERM received - shutting down
2022-09-10T19:22:32Z app[b65b9934] dfw [info]19:22:32.567 [info] [libcluster:fly6pn] disconnected from :"reel-prod@fdaa:0:869c:a7b:b389:a05f:bf8c:2"
2022-09-10T19:22:32Z app[b65b9934] dfw [info]19:22:32.567 [info] Dropping node :"reel-prod@fdaa:0:869c:a7b:b389:a05f:bf8c:2" for region sjc
2022-09-10T19:22:34Z app[a05fbf8c] sjc [info]Starting clean up.
2022-09-10T19:22:35Z runner[b09fb331] sjc [info]Starting instance
2022-09-10T19:22:36Z runner[b09fb331] sjc [info]Configuring virtual machine
2022-09-10T19:22:36Z runner[b09fb331] sjc [info]Pulling container image
2022-09-10T19:22:38Z runner[b09fb331] sjc [info]Unpacking image
2022-09-10T19:22:39Z runner[b09fb331] sjc [info]Preparing kernel init
2022-09-10T19:22:39Z runner[b09fb331] sjc [info]Configuring firecracker
2022-09-10T19:22:39Z runner[b09fb331] sjc [info]Starting virtual machine
2022-09-10T19:22:39Z app[b09fb331] sjc [info]Starting init (commit: 249766e)...
2022-09-10T19:22:39Z app[b09fb331] sjc [info]Preparing to run: `/app/bin/server` as nobody
2022-09-10T19:22:39Z app[b09fb331] sjc [info]2022/09/10 19:22:39 listening on [fdaa:0:869c:a7b:b389:b09f:b331:2]:22 (DNS: [fdaa::3]:53)
2022-09-10T19:22:42Z app[b65b9934] dfw [info]19:22:42.684 [warning] [libcluster:fly6pn] unable to connect to :"reel-prod@fdaa:0:869c:a7b:b389:b09f:b331:2"
2022-09-10T19:22:42Z app[b09fb331] sjc [info]19:22:42.837 [info] Running ReelWeb.Endpoint with cowboy 2.9.0 at :::8080 (http)
2022-09-10T19:22:42Z app[b09fb331] sjc [info]19:22:42.840 [info] Access ReelWeb.Endpoint at https://reel-prod.fly.dev
2022-09-10T19:22:43Z app[b09fb331] sjc [info]19:22:43.231 [info] [libcluster:fly6pn] connected to :"reel-prod@fdaa:0:869c:a7b:cf99:b65b:9934:2"
2022-09-10T19:22:43Z app[b65b9934] dfw [info]19:22:43.303 [info] Discovered node :"reel-prod@fdaa:0:869c:a7b:b389:b09f:b331:2" in region sjc
2022-09-10T19:22:43Z app[b09fb331] sjc [info]19:22:43.328 [info] Discovered node :"reel-prod@fdaa:0:869c:a7b:cf99:b65b:9934:2" in region dfw
2022-09-10T19:22:43Z app[b09fb331] sjc [info]19:22:43.529 [notice] global: Name conflict terminating {Swoosh.Adapters.Local.Storage.Memory, #PID<47718.2251.0>}
2022-09-10T19:23:23Z runner[b65b9934] dfw [info]Shutting down virtual machine
2022-09-10T19:23:23Z app[b65b9934] dfw [info]Sending signal SIGTERM to main child process w/ PID 515
2022-09-10T19:23:23Z app[b65b9934] dfw [info]19:23:23.616 [notice] SIGTERM received - shutting down
2022-09-10T19:23:24Z app[b09fb331] sjc [info]19:23:24.657 [info] Dropping node :"reel-prod@fdaa:0:869c:a7b:cf99:b65b:9934:2" for region dfw
2022-09-10T19:23:25Z app[b65b9934] dfw [info]Starting clean up.
2022-09-10T19:23:28Z runner[0de1a07d] dfw [info]Starting instance
2022-09-10T19:23:28Z runner[0de1a07d] dfw [info]Configuring virtual machine
2022-09-10T19:23:28Z runner[0de1a07d] dfw [info]Pulling container image
2022-09-10T19:23:29Z runner[0de1a07d] dfw [info]Unpacking image
2022-09-10T19:23:29Z runner[0de1a07d] dfw [info]Preparing kernel init
2022-09-10T19:23:29Z runner[0de1a07d] dfw [info]Configuring firecracker
2022-09-10T19:23:29Z runner[0de1a07d] dfw [info]Starting virtual machine
2022-09-10T19:23:29Z app[0de1a07d] dfw [info]Starting init (commit: 249766e)...
2022-09-10T19:23:29Z app[0de1a07d] dfw [info]Preparing to run: `/app/bin/server` as nobody
2022-09-10T19:23:29Z app[0de1a07d] dfw [info]2022/09/10 19:23:29 listening on [fdaa:0:869c:a7b:cf99:de1:a07d:2]:22 (DNS: [fdaa::3]:53)
2022-09-10T19:23:32Z app[0de1a07d] dfw [info]19:23:32.466 [info] Running ReelWeb.Endpoint with cowboy 2.9.0 at :::8080 (http)
2022-09-10T19:23:32Z app[0de1a07d] dfw [info]19:23:32.468 [info] Access ReelWeb.Endpoint at https://reel-prod.fly.dev
2022-09-10T19:23:32Z app[0de1a07d] dfw [info]19:23:32.846 [info] [libcluster:fly6pn] connected to :"reel-prod@fdaa:0:869c:a7b:b389:b09f:b331:2"
2022-09-10T19:23:32Z app[b09fb331] sjc [info]19:23:32.919 [info] Discovered node :"reel-prod@fdaa:0:869c:a7b:cf99:de1:a07d:2" in region dfw
2022-09-10T19:23:32Z app[0de1a07d] dfw [info]19:23:32.942 [info] Discovered node :"reel-prod@fdaa:0:869c:a7b:b389:b09f:b331:2" in region sjc
2022-09-10T19:23:33Z app[b09fb331] sjc [info]19:23:33.175 [notice] global: Name conflict terminating {Swoosh.Adapters.Local.Storage.Memory, #PID<47722.2082.0>}
2022-09-10T19:24:44Z app[0de1a07d] dfw [info]19:24:44.807 request_id=FxOWPwMCzccFs78AAAHh [info] GET /
2022-09-10T19:24:44Z app[0de1a07d] dfw [info]19:24:44.818 request_id=FxOWPwMCzccFs78AAAHh [info] Sent 200 in 10ms
2022-09-10T19:24:49Z app[0de1a07d] dfw [info]19:24:49.437 request_id=FxOWQBbxV8uDvG0AAAIR [info] GET /users/settings
2022-09-10T19:24:49Z app[0de1a07d] dfw [info]19:24:49.443 request_id=FxOWQBbxV8uDvG0AAAIR [info] Sent 200 in 5ms
2022-09-10T19:24:52Z app[0de1a07d] dfw [info]19:24:52.036 request_id=FxOWQLHqO3JUUOoAAAIx [info] POST /users/settings
2022-09-10T19:24:52Z app[0de1a07d] dfw [info]19:24:52.046 request_id=FxOWQLHqO3JUUOoAAAIx [info] Sent 302 in 9ms
2022-09-10T19:24:52Z app[0de1a07d] dfw [info]19:24:52.098 request_id=FxOWQLWRaKLQMx0AAAJR [info] GET /
2022-09-10T19:24:52Z app[0de1a07d] dfw [info]19:24:52.102 request_id=FxOWQLWRaKLQMx0AAAJR [info] Sent 200 in 4ms

Here is what that config generates locally:

➜  reel git:(main) ✗ iex -S mix phx.server
Erlang/OTP 25 [erts-13.0.2] [source] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:1] [jit]

[info] Exporter :otel_exporter_stdout successfully initialized
[debug] QUERY OK db=0.1ms decode=1.5ms queue=39.1ms idle=0.0ms
select CAST(pg_last_wal_replay_lsn() AS TEXT) []
[info] Running ReelWeb.Endpoint with cowboy 2.9.0 at 127.0.0.1:4000 (http)
[info] Access ReelWeb.Endpoint at http://localhost:4000
Interactive Elixir (1.13.4) - press Ctrl+C to exit (type h() ENTER for help)
iex(1)> [watch] build finished, watching for changes...

Rebuilding...
Done in 115ms.
[info] GET /
[debug] Processing with ReelWeb.PageController.index/2
  Parameters: %{}
  Pipelines: [:browser]
[debug] QUERY OK source="users_tokens" db=2.7ms queue=0.6ms idle=1118.6ms
SELECT u1."id", u1."email", u1."name", u1."display_name", u1."confirmed_at", u1."inserted_at", u1."updated_at" FROM "users_tokens" AS u0 INNER JOIN "users" AS u1 ON u1."id" = u0."user_id" WHERE ((u0."token" = $1) AND (u0."context" = $2)) AND (u0."inserted_at" > $3::timestamp + (-(129600)::numeric * interval '1 minute')) [<<171, 105, 223, 252, 200, 247, 200, 61, 159, 94, 185, 73, 149, 188, 220, 25, 213, 196, 244, 237, 22, 189, 181, 19, 107, 185, 117, 162, 110, 175, 168, 144>>, "session", ~U[2022-09-10 19:26:16.734217Z]]
↳ ReelWeb.UserAuth.fetch_current_user/2, at: lib/reel_web/controllers/user_auth.ex:92
[info] Sent 200 in 88ms
*SPANS FOR DEBUG*
{span,159932783622264806624766945821061902995,16343692459422432525,[],
      undefined,<<"/">>,server,-576460745652062625,-576460745570740208,
      {attributes,128,infinity,0,
                  #{'http.client_ip' => <<"127.0.0.1">>,
                    'http.flavor' => '1.1','http.host' => <<"localhost">>,
                    'http.method' => <<"GET">>,'http.route' => <<"/">>,
                    'http.scheme' => <<"http">>,'http.status_code' => 200,
                    'http.target' => <<"/">>,
                    'http.user_agent' =>
                        <<"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.25.584.290 Safari/537.36">>,
                    'net.host.ip' => <<"127.0.0.1">>,'net.host.port' => 4000,
                    'net.peer.ip' => <<"127.0.0.1">>,'net.peer.port' => 61437,
                    'net.transport' => 'IP.TCP','phoenix.action' => index,
                    'phoenix.plug' => 'Elixir.ReelWeb.PageController'}},
      {events,128,128,infinity,0,[]},
      {links,128,128,infinity,0,[]},
      undefined,1,false,
      {instrumentation_scope,<<"opentelemetry_phoenix">>,<<"1.0.0">>,
                             undefined}}

I am reasonably sure that OpenTelemetry is not running in my released code. I have no idea why. Any insights anyone has would be appreciated.

I created a new Phoenix app from scratch: no clustering, no db, no ecto. It has the exact same problem. I am at a complete loss. I suspect it is something to do with how I am doing the release… but who knows?

I just mangled the code so I could get a release to run locally… it works fine. The exact same code, running on fly, sends no telemetry data if I configure it to send, and logs nothing if I configure it to log. I have no idea why opentelemetry does nothing on fly.

Just spun up the docker container locally with IPv4 (Docker on my Mac doesn’t seem to let me turn on IPv6?) and it works fine in the container with IPv4 networking.

I spun the docker container up on Linux with IPv6 and it works fine and logs OpenTelemetry data.

I ported this into the default Phoenix app scaffold and it exhibits the same behavior. It can be found here. I was so excited to use Elixir on fly, and this whole experience has completely taken the wind out of my sails.

I don’t think it has anything to do with the distribution, Postgres, or the release at this point. The docker container behaves without modification:

  1. Run docker build -t please . to build the release in a container.
  2. Run docker run -it -p 4000:4000 please /bin/bash to enter the container.
  3. Start the app with OTEL_RESOURCE_ATTRIBUTES="service.name=fake-fly-app,service.instance.id=fake-allocation-id,cloud.provider=fly_io,cloud.region.id=dfw" SECRET_KEY_BASE=kNgqiStPbOiIsNfNzflbiuULjSxHpOJPCaepCveU5raoF0xPu1Dl8LYpkC9ZHYzE MIX_ENV=prod PHX_SERVER=true bin/ugh start
  4. Hit http://localhost:4000.

You will see Phoenix in your browser, and in your docker container’s logs you will see telemetry data.

If you do a fly deploy, fly logs, and then hit your site, you’re not going to see OpenTelemetry data. Again, if anyone has any ideas as to what is happening, I am all ears.

Here are the debug logs on startup, with opentelemetry (incorrectly) set to :permanent in mix.exs to try to surface any failures:

2022-09-11T19:55:31Z runner[68b8cf47] dfw [info]Starting instance
2022-09-11T19:55:31Z runner[68b8cf47] dfw [info]Configuring virtual machine
2022-09-11T19:55:31Z runner[68b8cf47] dfw [info]Pulling container image
2022-09-11T19:55:32Z runner[68b8cf47] dfw [info]Unpacking image
2022-09-11T19:55:33Z runner[68b8cf47] dfw [info]Preparing kernel init
2022-09-11T19:55:33Z runner[68b8cf47] dfw [info]Configuring firecracker
2022-09-11T19:55:33Z runner[68b8cf47] dfw [info]Starting virtual machine
2022-09-11T19:55:33Z app[68b8cf47] dfw [info]Starting init (commit: 249766e)...
2022-09-11T19:55:33Z app[68b8cf47] dfw [info]Preparing to run: `/app/bin/server` as nobody
2022-09-11T19:55:33Z app[68b8cf47] dfw [info]2022/09/11 19:55:33 listening on [fdaa:0:869c:a7b:cf99:68b8:cf47:2]:22 (DNS: [fdaa::3]:53)
2022-09-11T19:55:35Z app[68b8cf47] dfw [info]19:55:35.930 [info] Running UghWeb.Endpoint with cowboy 2.9.0 at :::8080 (http)
2022-09-11T19:55:35Z app[68b8cf47] dfw [info]19:55:35.932 [info] Access UghWeb.Endpoint at https://crimson-smoke-6983.fly.dev
2022-09-11T19:56:21Z app[68b8cf47] dfw [info]19:56:21.266 request_id=FxPmjSITXxx5z5EAAACx [info] GET /
2022-09-11T19:56:21Z app[68b8cf47] dfw [info]19:56:21.267 request_id=FxPmjSITXxx5z5EAAACx [debug] Processing with UghWeb.PageController.index/2
2022-09-11T19:56:21Z app[68b8cf47] dfw [info]  Parameters: %{}
2022-09-11T19:56:21Z app[68b8cf47] dfw [info]  Pipelines: [:browser]
2022-09-11T19:56:21Z app[68b8cf47] dfw [info]19:56:21.271 request_id=FxPmjSITXxx5z5EAAACx [info] Sent 200 in 4ms

It’s a weekend, so all the Phoenix folks are experiencing the “life” part of work-life balance, but this seems like it might be an issue with release vs. prod config, which is something I never feel like I fully understand.

I don’t think there’s anything getting swallowed by DNS or networking (our DNS and networking setup is not very complicated).

Here are random things I’d try to debug:

  1. Make sure OTEL_RESOURCE_ATTRIBUTES is actually getting set properly. Maybe the /rel/overlays/bin/server is not actually being used?
  2. Connect with iex and see if OpenTelemetry is actually running like you’d expect.
  3. Write to stdout from the start(_type, _args) def in your application.ex.
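For number 2, something like this from a remote IEx session would at least show whether the OTel apps started at all (a sketch; the app names assume the deps listed in your mix.exs):

```elixir
# In a remote IEx session on the deployed node, list which of the
# OpenTelemetry applications are actually running.
Application.started_applications()
|> Enum.filter(fn {app, _desc, _vsn} ->
  app in [:opentelemetry, :opentelemetry_api, :opentelemetry_exporter]
end)
```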

Docker was a good debugging step. It makes me think there’s some really-easy-to-overlook difference in env vars between the VM when you deploy and local Docker.

That totally makes two of us. At this point I am basically just pulling every lever I can find and noting the changes I can see. It ceased being science early yesterday and has devolved into button smashing.

I’ve been chatting with some of the opentelemetry folks, who are also failing at experiencing the “life” part of work-life balance. To address the suggestions you made:

  1. They actually don’t seem to matter. If I don’t set them, I just end up with some nulls in my telemetry data. I can remove that whole section and it will still “work” locally.
  2. Courtesy of your sick ability to VPN in, I present you with this baller screenshot:
  3. I added some logs in this branch before and after the opentelemetry code and also matched on the :ok atom. Here are the results:
    2022-09-11T21:27:12Z app[6f9a7996] dfw [info]Starting init (commit: 249766e)...
    2022-09-11T21:27:12Z app[6f9a7996] dfw [info]Preparing to run: `/app/bin/server` as nobody
    2022-09-11T21:27:12Z app[6f9a7996] dfw [info]2022/09/11 21:27:12 listening on [fdaa:0:869c:a7b:9f67:6f9a:7996:2]:22 (DNS: [fdaa::3]:53)
    2022-09-11T21:27:15Z app[6f9a7996] dfw [info]Starting up Opentelemetry Phoenix
    2022-09-11T21:27:15Z app[6f9a7996] dfw [info]Started up Opentelemetry Phoenix
    2022-09-11T21:27:15Z app[6f9a7996] dfw [info]21:27:15.325 [info] Running UghWeb.Endpoint with cowboy 2.9.0 at :::8080 (http)
    

It could be some env vars, although I am setting nothing when I run docker locally, and it behaves. It seems pretty unlikely to me that fly is passing it something that could disable it.

Hi @smashedtoatoms!

I don’t have any personal experience with Honeycomb, but I’ll see what I can do.

I haven’t used opentelemetry either. From what I saw, the opentelemetry package on hex.pm points to the open-telemetry/opentelemetry-erlang repository on GitHub. I have used telemetry, which Plug and Ecto are already plumbed for. I see from your OTP application that you also have telemetry running.

Perhaps that is all as it should be and intentional; I just wanted to make it clear before going further. I will also assume that you have gone through the OpenTelemetry getting-started guide.

A question I have about your runtime.exs config is the “xxx” key being hard-coded. It is generally preferred to keep those keys out of the code by using System.get_env("HONEYCOMB_KEY"), then setting that value via fly secrets (or a fly.toml ENV entry if it isn’t secret). That way you don’t mix dev/prod keys, etc.
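For example (HONEYCOMB_KEY is just a name I made up; set it with fly secrets set):

```elixir
# In config/runtime.exs: read the key from the environment instead of
# hard-coding it, failing loudly if it is missing.
honeycomb_key =
  System.get_env("HONEYCOMB_KEY") ||
    raise "environment variable HONEYCOMB_KEY is missing"

config :opentelemetry_exporter,
  otlp_endpoint: "https://api.honeycomb.io",
  otlp_headers: [{"x-honeycomb-team", honeycomb_key}]
```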

Another debug idea is to see if you can hit the "https://api.honeycomb.io" endpoint from your Elixir code, for example with an HTTP request to the API endpoint from IEx. There is nothing on the Fly.io side that would prevent that.

In short, I’m unaware of anything on the Fly side that prevents this from working. At the same time, I don’t have experience with either opentelemetry exporting or Honeycomb.

Hey Mark!

I am very far beyond doing the right thing on this example project. I have hard-coded values in the places they are to eliminate ambiguity and the possibility of failure. So, for example, instead of using an environment variable, I hard-coded it, lest people helping me ask me to verify that my environment variable is getting set correctly. I will rotate these, nuke the repo, and write up the actual solution for posterity once I get to the bottom of this.

To be clear, my understanding of OpenTelemetry is shaky. Telemetry is for internal Elixir/Erlang instrumentation events. OpenTelemetry is for sending telemetry data to vendors that implement telemetry gathering per the OpenTelemetry spec. They’re not necessarily for the same thing. Let me know if that is your understanding as well.
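For example, a sketch using the opentelemetry_api Tracer macros (the span name and attribute are made up):

```elixir
# :telemetry emits in-VM events that handlers subscribe to, while
# OpenTelemetry records spans destined for an external backend.
require OpenTelemetry.Tracer, as: Tracer

Tracer.with_span "settings.update" do
  Tracer.set_attribute(:"user.id", 123)
  # ... the work being traced ...
end
```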

The mechanism I found that seemed the most reasonable for getting data from an Elixir app into Honeycomb was to use opentelemetry per this blog post. All the other Honeycomb solutions haven’t been updated in a long time and/or had requirements out of step with a newly generated Phoenix application.

I have an ongoing discussion with the folks at opentelemetry-erlang, but nothing that has uncovered any surprises yet.

I haven’t spent a lot of time troubleshooting the networking because turning on opentelemetry and having it write to stdout isn’t even working. I think it’s failing before the network call. That said, here are the results from remoting into the instance running in fly.io:

iex(crimson-smoke-6983@fdaa:0:869c:a7b:9f67:ff:3777:2)1> HTTPoison.get("https://api.honeycomb.io")
{:ok,
 %HTTPoison.Response{
   body: "{\"message\": \"Oh, hello there.\"}",
   headers: [
     {"Date", "Mon, 12 Sep 2022 19:17:17 GMT"},
     {"Content-Type", "application/json"},
     {"Content-Length", "31"},
     {"Connection", "keep-alive"},
     {"Access-Control-Allow-Origin", "*"},
     {"Vary", "Accept-Encoding"}
   ],
   request: %HTTPoison.Request{
     body: "",
     headers: [],
     method: :get,
     options: [],
     params: %{},
     url: "https://api.honeycomb.io"
   },
   request_url: "https://api.honeycomb.io",
   status_code: 200
 }}

Obviously that is working fine. @kurt had indicated that he believed DNS and networking to not be a factor, and that appears to be accurate.

At this point, I am thinking something in my docker config is causing opentelemetry-erlang to not behave. I would blame the packaging, except for the fact that the container runs fine locally. This is very weird.

Hi @smashedtoatoms,

Yes, indeed. It does seem weird that it works in the Docker container but not when deployed via Fly. Typically, problems like this come down to a bad release config, because people can’t easily test releases locally.

I read through your issue with the opentelemetry-erlang project as well. I’m stumped too! A lot of that comes from me not knowing how the library works or what is “normal”.

I just pushed a branch with the networking changes to allow it to run locally here. I then used the following to build it:

docker build -t please .

and the following to run it:

docker run -e SECRET_KEY_BASE='8+0M2NIsyUvHK+8z1NhtE++YfUAKFPJxoYxPXbmIR0WHb0vDxumlM5nCf39rVHNX' -e FLY_APP_NAME='local-fly-app' -it -p 4000:4000 please /app/bin/server

I can hit http://localhost:4000 and see it output the span to stdout.

If I deploy that identical code with fly deploy and then watch the logs with fly logs, I see no spans. (I do, however, see a bunch of libcluster connection failures, because I have to mess up the cluster config to allow the local build to work with my Mac’s IPv4 restrictions.)

Ok, I tweaked the setup a little more to make it as easy as I can to reproduce the problem. If you pull the latest from here, you will get a version that builds a docker image with a release in it. It can be run locally or deployed to fly with no modifications. If you make a request against the local one, you will get a span printed to stdout. If you make a request against the one running on fly.io, no span will be printed to stdout. If you’re on Linux, you can use fly deploy --local-only to guarantee that your local build and the one running in fly are identical. It works locally; it logs no spans on Fly.

Docker Commands

docker build -t please .
docker run -e SECRET_KEY_BASE='8+0M2NIsyUvHK+8z1NhtE++YfUAKFPJxoYxPXbmIR0WHb0vDxumlM5nCf39rVHNX' -e FLY_APP_NAME='local-fly-app' -it -p 4000:4000 please /app/bin/server

Fly.io Commands

Note: to run it on fly you’ll need to change the app name on line 3 of the fly.toml file.

fly auth login
fly deploy 
fly deploy --local-only # if you're on linux and want to deploy the exact image that you're running locally

I’m surprised that it’s working locally, since you don’t seem to have the x-honeycomb-dataset header set in your config - I assumed it would be required by Honeycomb.

It isn’t getting that far. I actually removed the entire exporter and am just writing the spans to stdout at this point. I wanted to remove networking as a possible thing to troubleshoot.

So Jon Anderson in the Honeycomb Pollinators Slack channel suggested setting the opentelemetry config to this:

config :opentelemetry,
  sampler: :always_on,
  span_processor: :batch,
  traces_exporter: {:otel_exporter_stdout, []}

I did that, and it works both locally and in fly. sampler: :always_on is the magic code. Does anyone have any idea why the default might not work in fly?

I do!

As of ~a month ago, fly-proxy injects the following headers into requests to the backend (your app): traceparent and tracestate.

These are standard - I’m assuming your HTTP lib decides to specifically look for those headers, extract them, and set the current otel context to that “remote otel context” it just extracted from the headers.

I have ~feelings~ about HTTP libs doing that implicitly/by default, but regardless, it’s probably a bad idea for fly-proxy to inject those by default, since default samplers will tend to do “parent-based sampling”, and in your case, unless you fall in the <0.02% of requests we’re sampling by default, it’ll decide to drop the span.
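Concretely (the IDs below are made up; the format is the W3C Trace Context traceparent, version-traceid-spanid-flags, where a flags byte of 00 means the parent was not sampled):

```elixir
# fly-proxy injects something like this. A parent-based sampler (the
# default) inherits the parent's sampled flag, so "00" means your span
# gets dropped too.
traceparent = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-00"
[_version, _trace_id, _span_id, flags] = String.split(traceparent, "-")
flags == "00"
```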

I’ll work on making that behavior opt-in (we rely on it internally).


Wild!

Ok, so my HTTP server is receiving traceparent and tracestate headers injected by fly, and then my OpenTelemetry lib is doing exactly what it is designed to do. That explains it. I will write this up and update the source blog post I was working from. Seems useful to capture this.
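For the write-up, my current understanding of the sampler options (config shapes per opentelemetry-erlang’s application-env format; verify against your SDK version before copying):

```elixir
# Option 1 (what worked for me): sample everything, ignoring any
# sampling decision carried in an incoming traceparent.
config :opentelemetry, sampler: :always_on

# Option 2: stay parent-based for genuine upstream traces but sample
# every locally-rooted span. Note this would still drop spans behind
# fly-proxy today, since the injected parent is marked not-sampled.
config :opentelemetry, sampler: {:parent_based, %{root: :always_on}}
```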


@amos Out of curiosity, if fly-proxy is injecting these headers, would they overwrite the same values if one of my own services called another deployed in fly.io and I wanted them to share a continuous distributed trace?