…retired the hex package I put up, with a notice to refer to the official Fly ones.
Well @kurt I think I've got the `fly-replay` approach to work in Laravel. To compare, I tried various configurations:
1. reads and writes handled by the primary
2. reads handled by a read-replica, writes handled by the primary
3. reads handled by a read-replica, writes handled by the primary, using the `fly-replay` header to replay a write that hits a read-replica over in the primary's region
… using a suitably distributed vm (in `lhr`) and database (in `scl`) to be sure of a large amount of latency due to the enormous distance. And it appears to work: option 3, the `fly-replay` approach, reduces the time for a write (from `lhr`) and is the fastest. Based on the guide, it uses a read-replica (port `5433`) unless the request is coming from the primary region. So when it isn't (e.g. from `lhr`), an exception is thrown (due to the write to a read-replica), and that exception is caught, triggering the replay in the primary region.
I can write about it if you like?
And/or I can have a look at Fastify if nobody has done that yet.
Wow that's amazing. We'd love to have you write about it. I think the first thing to do is create an example repository with a README. We're happy to pay for it, too.
@kurt Awesome. Ok, great, I’ll put something together over the weekend.
Hi @kurt
As discussed I have written a guide for how I deployed a Laravel application to Fly.
In the end I divided it into two parts: one repo to explain how to get a demo Laravel application deployed, and another repo which builds upon that to describe the changes I made to use the fly-replay header to improve database performance. I figured there will likely be people who will only need one or the other. And a single one became way too long!
They are:
Let me know what you think whenever you get a chance (email, or here, wherever!) and I can add/edit/delete whatever you want.
Hey guys,
I got multi-region working for TypeORM. I followed this: https://typeorm.io/multiple-data-sources#replication
Here is a gist of my code:
```ts
import { ConnectionOptions, createConnection } from 'typeorm';

// Illustrative wrapper: in my app this lives in the database bootstrap code.
export function connectToDatabase() {
  const databaseUrl = process.env.DATABASE_URL as string;

  let options: ConnectionOptions = {
    type: 'postgres',
    name: 'default',
    logging: false,
    synchronize: false,
    entities: [__dirname + '/../modules/**/*.js'],
    migrations: [__dirname + '/../migration/*.js'],
  };

  if (process.env.PRIMARY_REGION !== process.env.FLY_REGION) {
    // Not in the primary region: write to the primary, read from the
    // nearest replica (port 5433).
    options = {
      ...options,
      replication: {
        master: {
          url: databaseUrl,
        },
        slaves: [
          {
            url: databaseUrl.replace('5432', '5433'),
          },
        ],
      },
    };
  } else {
    // In the primary region: a single connection to the primary is enough.
    options = {
      ...options,
      url: databaseUrl,
    };
  }

  return createConnection(options);
}
```
I only do this for the production environment.
I think I just hit a variation of this in Rails when doing OmniAuth authorization. The request is replayed at the primary, but then fails with `bad_verification_code` (Troubleshooting OAuth App access token request errors - GitHub Docs).
I will try the `Fly-Prefer-Region` header and report back!
I’m starting to look into this again.
So if I use port `5433` for the connection URL, this will always automatically connect the client to the nearest replica, and `5432` to the primary node? What if there are no replicas available at a particular moment?
BTW has anyone implemented the request replay in Node? The docs mention adding an HTTP header:
> Once caught, just send a `fly-replay` header specifying the primary region. For `chaos-postgres`, send `fly-replay: region=scl`, and we'll take care of the rest.
I’m not sure I understand where this header needs to be added. The example seems like a very specific Rails implementation.
@pier That’s correct. I’m not sure what would happen if a region was down (and so a replica you’ve created is not available). I’d assume the same as if it didn’t exist at all - the connection would be routed to the primary.
As for Node, yep, I had a go a while back at using the technique with Fastify, with Prisma as the ORM. Check out:
Hopefully the readme explains how it works, but let me know if not.
That replay approach sends all queries to the nearest database. If that results in a write being sent to a read-only replica, Postgres will of course fail with an error. That exception/error is caught, and since you know the reason, you replay the whole HTTP request in the region the primary database is in. That region does allow writes, and so it works.
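In Node, that catch-and-replay looks roughly like this. A minimal sketch, assuming Fastify and the `pg` client (the route, table and status code are just illustrative; Postgres reports a write against a read-only replica with SQLSTATE `25006`):

```ts
import Fastify from 'fastify';
import { Pool } from 'pg';

const app = Fastify();

// All queries go to the nearest database. Outside the primary region that is
// the read replica on port 5433, so any write against it will throw.
const pool = new Pool({ connectionString: process.env.DATABASE_URL });

app.post('/todos', async (request, reply) => {
  try {
    const { title } = request.body as { title: string };
    await pool.query('INSERT INTO todos (title) VALUES ($1)', [title]);
    return { ok: true };
  } catch (err: any) {
    // 25006 = read_only_sql_transaction: we tried to write to a replica.
    if (err.code === '25006') {
      // Ask Fly's proxy to replay this whole request in the primary region,
      // where the writable database lives.
      reply.header('fly-replay', `region=${process.env.PRIMARY_REGION}`);
      return reply.code(409).send();
    }
    throw err;
  }
});

app.listen({ port: 8080, host: '0.0.0.0' });
```

Fly's proxy sees that response header and re-sends the original request to a vm in the primary region - which is also why you need at least one app vm running there.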
Or you could avoid replays and instead use a separate read and write connection, where the ORM decides whether to use 5433 or 5432.
The only issue I recall having (with that replay approach) was ensuring there was an app vm in the same region as the primary database vm. With auto-scaling it wasn't possible to enforce which regions the app's vms were in, only to suggest a region pool, and if no vm is in the primary region, writes will fail. I'm not sure if that has been resolved since.
Thanks @greg I will check this out in detail!
I’m not using an ORM (Prisma in particular is pretty slow) but I was planning on just having two PG client instances.
But if the replicas are down for some reason, would read queries made to `5433` be sent to the primary instance?
@pier Hmm … that I don’t know. I would assume/hope their proxy would be smart enough to do that, but that is a total guess. It would need someone from Fly to confirm.
As for the ORM, yep, makes sense. I'd think you could still do it, e.g. with a `readClient` and a `writeClient` (set with the respective port) rather than simply one `client`. The replay bit is independent of that anyway - that's just catching an exception/error, which `pg` or whatever would also throw if you try and write to a read-replica.
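For example, a rough sketch of that split with the `postgres` client (assuming `DATABASE_URL` points at the primary on port 5432; the client names and the `todos` table are just for illustration):

```ts
import postgres from 'postgres';

const databaseUrl = process.env.DATABASE_URL as string;

// Writes always go to the primary on port 5432 (accepting the cross-region
// latency when the app vm is elsewhere).
const writeClient = postgres(databaseUrl);

// Reads target port 5433, which connects to the nearest Postgres member
// (when no replica exists, that ends up being the primary anyway).
const readClient = postgres(databaseUrl.replace(':5432', ':5433'));

// Pick the client per query:
export async function listTodos() {
  return readClient`SELECT * FROM todos`;
}

export async function addTodo(title: string) {
  return writeClient`INSERT INTO todos (title) VALUES (${title})`;
}
```

With this split you avoid the replay entirely: reads stay local and writes accept the hop to the primary region.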
Thanks. I hope someone from Fly can confirm.
The docs only mention:
> Port `5433` is direct to the PostgreSQL member, and used to connect to read replicas directly.
I did a little test in my dev environment using a primary client and a replica client on port `5433`.
Something like this:
```js
import postgres from 'postgres';

// primary client (DATABASE_URL uses the default port 5432)
const sql = postgres(process.env.DATABASE_URL);

// replica client (DATABASE_URL_REPLICA points at port 5433)
const sqlReplica = postgres(process.env.DATABASE_URL_REPLICA);
```
When using the replica client, the queries were sent to the primary (and only) instance. So I guess my question has been answered.
So my PG instances are all in AMS. I cloned a v2 app into LAX and also cloned a PG replica into LAX.
I’ve also created a new PG client instance using the port 5433 to use the replicas.
It made no difference in performance in a page that has multiple queries reading from the replica. Actually, some requests randomly take 3-5x longer now compared to when the LAX app was reading from AMS.
I restarted the LAX app machine in case “have you tried turning it off and on” or something… Same result.
This is the status of the new replica so I guess it should be working:
`started replica lax 3 total, 3 passing`
My DB is very small. I doubt it’s still copying the data to the volume.
Not sure if I’m missing something to make the whole thing work. Is there anything else I could check?
For the time being it makes more sense to just run everything in AMS.
Hopefully someone from Fly can chime in and let me know if I’m doing something obviously wrong or maybe something else is failing.
@pier That's strange. Assuming there is nothing else affecting it (like the ongoing issue, but that should be unrelated), yep, I'm not sure why a request would take longer from an app in LAX to a database in LAX compared to a database in AMS. Internet routing can be weird, but given that's thousands of miles away, that's certainly unexpected. I recall in my experiments I'd get the response to include the FLY_REGION to see where the request was being handled, and also check the logs (where I'd log when a vm was handling or replaying a request). That would show which vms were getting involved.
I actually just noticed they have marked the docs as legacy (Multi-region Postgres (Legacy) · Fly Docs). That may just be because the old database used Nomad, however I wonder if that also means the `fly-replay` trick is legacy too? It would still be supported by Fly's proxy, however I wonder if there is a different approach when using the new database. If it makes the request take longer, there is indeed no point in adding the additional complexity and consistency issues of the replica at all.
I think that my app in LAX probably was not connecting to the replica in LAX for some reason.
Maybe something failed when cloning the replica or maybe some network shenanigans?
The HTTP request includes this header:
`fly-request-id: 01GWQF...KHY-lax`
AFAIK this is the entry point into Fly’s network although yeah it’s possible the request wasn’t actually going to the LAX app for some reason.
Do you have any idea how I could check which replica is being used by a PG client?
@pier Hmm … yes, if your request was going to the AMS vm, or the LAX vm was actually connecting to the AMS database, that would explain the slowness. As for how to see which database vm the app's vm is actually using for a read (for writes, you know, of course!), I'm not sure. I don't know if the logs for the database show that (after all, a database on Fly is just another app behind the scenes - you can even clone their repo and make your own version of it … at least you could with v1, not sure if that's the case with v2).
I was just looking at how I revealed debugging info when trying out Planetscale's read replicas (this time with an Express app, also Node, and since they are MySQL-only, with the `mysql2` client). The same idea, connecting to the closest one, except instead of using Fly's internal magic to decide which is the closest, here it does some logic itself. Ignore that though - you can see I made a read and a write route, and after doing a query they return the region they were served from:
The Fly region comes from an environment variable that Fly apps provide you with at run-time:
The `PRIMARY_REGION` is manually provided in the `fly.toml`, as of course you know the region your primary database is in.
Here I returned JSON, however you could return those region values in headers instead.
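For completeness, here's roughly what such debug routes can look like on Fly - a minimal Fastify sketch (the route path is made up; `FLY_REGION` is set by Fly at run-time and `PRIMARY_REGION` comes from your fly.toml as above):

```ts
import Fastify from 'fastify';

const app = Fastify();

// Report which Fly region served this request and where the primary
// database lives. Comparing the two makes it obvious when a request was
// replayed or routed somewhere unexpected.
app.get('/debug/region', async () => ({
  servedFrom: process.env.FLY_REGION,
  primaryRegion: process.env.PRIMARY_REGION,
}));

app.listen({ port: 8080, host: '0.0.0.0' });
```

You could equally set those two values as response headers on every route if you'd rather check them in the browser's network tab.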
Well, not sure if Fly changed something on their end… but I tried to connect to a nearby replica in a Node test app and it worked on the first try.
If anyone is curious, here's the Node project using Fastify and the `postgres` client:
Initially I used the atdatabases client/ORM, which has lots of cool features. Unfortunately it does not support LISTEN/NOTIFY, which I can't live without (and I really don't want to implement this myself).
@pier were you able to resolve this in your application? I’m having the exact same issue.
Apps deployed to `lhr`, for example, will go to the Postgres instance in `sjc` instead, and completely ignore the Postgres instance in `lhr`.
I never implemented it in my application because of data residency and GDPR headaches.
But in my previous comment I made it work with a dumb Node project.