Hey folks,
We’re having a tough time using Tigris storage due to odd issues popping up far more frequently than we’d like.
The various issues are documented below in detail. They center on Tigris’ S3 API reliability, consistency of responses, replication lag across geographic locations, and frequently undocumented, inconsistent developer ergonomics.
Preface and Context
We’re big fans of Tigris and want to keep using them if possible. The team is cool and hardworking, and they’ve been friendly and generally quick to respond.
In fact, we had an issue 6-ish months back with shadow buckets, where the feature unexpectedly duplicated all our stored data into Tigris. That caused massive egress fees and an equally egregious S3 bill due to the unexpected behavior. I personally had to be in constant conversations with AWS, since we couldn’t afford that egress when the bill came and we were heavily AWS-dependent at the time, through no fault of our own. Luckily, Tigris responded super quickly and even comped the bill. That’s more than reasonable and I was super grateful at the time.
But I’m writing this because odd quirks and major issues like that keep happening, and they constantly shift our focus to “dealing with Tigris and its edges” rather than relying on Tigris as a storage backend that handles this for us.
1 - Massive replication lag (>4-5 hours?) that won’t cut it for production environments:
We’ve faced several cases where an uploaded file is requested for download (from a distant geographic location, and therefore a different PoP) up to 4 hours later and still can’t be found. Ideally this lag would be significantly smaller, or Tigris would let the requester read through and fetch the object from the original PoP.
Luckily, Tigris’ CTO has been responsive and proactive, telling me that a deployment of QoS protocols to mitigate noisy neighbors is on its way. Still, this issue had our whole team solely focused on it for 4 full workdays, and the replication lag did not go away (and I have no idea whether it ever did).
This was hard enough for us to debug on top of dealing with the SEV1 issue itself, and it was a little upsetting that it wasn’t already on their radar. To prove there was major replication lag, we sent the same S3 List requests from VPSes set up in different locations, plus a hodgepodge of VPNs to manipulate our apparent geographic location. The same request returned very different results depending on where it was sent from; we also attached the creation times of those files to make the size of the lag obvious.
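For anyone trying to reproduce this, the probe we ran from each VPS was essentially just a List of the newest keys with their timestamps, so the outputs from different locations could be diffed. A minimal sketch (bucket, prefix, profile name, and endpoint are illustrative, not our real setup):

```python
# replication_probe.py - minimal sketch of the List probe run from each VPS.
# Bucket, prefix, profile name, and endpoint are placeholders.
from datetime import datetime, timezone
import boto3

session = boto3.Session(profile_name="tigris")  # hypothetical profile
s3 = session.client("s3", endpoint_url="https://fly.storage.tigris.dev")  # substitute your endpoint

resp = s3.list_objects_v2(Bucket="example-bucket", Prefix="uploads/", MaxKeys=1000)
now = datetime.now(timezone.utc)

# Print the newest keys with their age; diff this output across VPSes in different regions.
newest = sorted(resp.get("Contents", []), key=lambda o: o["LastModified"], reverse=True)[:100]
for obj in newest:
    age_min = (now - obj["LastModified"]).total_seconds() / 60
    print(f"{obj['Key']}\t{obj['LastModified'].isoformat()}\t{age_min:.0f} min old")
```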
“Dealing with Tigris” by migrating our whole app to a different geographical region:
The replication lag, especially at >4 hour lengths, is considered SEV1 for our application. We immediately get flooded with customer support tickets, because our service primarily deals with the transform and transfer of files; that’s 95% of what we do. We determined the replication issue was worst in the US regions (you could view Tigris in the browser or CLI and not see the last 100 files uploaded, because extreme lag on folders meant any objects within the subfolders were missing). So, once again to accommodate Tigris, we migrated our whole production infrastructure to another geographic location to work around replication lag on Tigris’ end.
This kept the issues from snowballing, but only partially. Today, most of the support tickets are winding down, but the problem is still only partially solved and we’re still dealing with the tail end of the consequences.
I understand that distributed object stores are going to be eventually consistent, but that lag is far too large a window to reach a consistent state. It should also be monitored.
1.1 Eager Caching + Geographic Restrictions?
Even when using some eager caching solutions or restricting storage zones, the guarantees are simply too weak.
Example 1: We’ve seen odd scenarios where files uploaded to a bucket restricted to a single storage zone download significantly faster when the downloader is close to that zone. This is expected and good! However, when we then upload files to a bucket with the same configuration but with eager caching enabled, explicitly requesting the same storage zone as above, the download crawls at 1/10th to 1/100th the speed. In case it helps, the download was against the Singapore AZ. This wasn’t a replication issue, since we waited around 4 hours after the cache request was sent before attempting the download. I’ve seen similar issues a lot on the developer side (see issue 3).
Finally, we’d love to roll out this caching mechanism to help with the lag, but the docs are super sparse on how the various configurations interact with each other.
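For what it’s worth, the comparison itself was nothing fancy: time the same GET against both bucket configurations from the same machine. A rough sketch (bucket names, key, and endpoint are illustrative):

```python
# download_timing.py - rough sketch of comparing download throughput across the two
# bucket configurations. Bucket names, key, and endpoint are placeholders.
import time
import boto3

s3 = boto3.client("s3", endpoint_url="https://fly.storage.tigris.dev")

def throughput_mb_s(bucket: str, key: str) -> float:
    """Download one object and return approximate throughput in MB/s."""
    start = time.monotonic()
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    return (len(body) / 1_000_000) / (time.monotonic() - start)

print("zone-restricted bucket:", throughput_mb_s("zone-restricted-bucket", "sample.bin"), "MB/s")
print("eager-cached bucket:   ", throughput_mb_s("eager-cached-bucket", "sample.bin"), "MB/s")
```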
Our Questions and Requests:
- Can we get better guidance on handling replication lag scenarios? More broadly, can we get more guidance on handling both uploads and downloads in a stable manner?
- If configuring the caching behavior is the solution, can we get more documentation on implementing eager caching or caching to a specified location?
- What exactly is the expected behavior for storage regional restriction requests? The intended use case for regional restriction doesn’t seem to be ensuring fast download speeds for nearby users. However, to cache for a customer near, say, Singapore, the only way to cache to only that location would be through a single-region restriction or a global bucket with eager caching (but even then, you don’t know if the cache is ready!). Shouldn’t we be able to cache to a designated region without using geo-restricted buckets? Using geo-restrictions also has broader implications: your infrastructure needs to live in one of those geographic locations, and users who proxy requests via presigned URLs or similar need to be in that location too.
- Inside the Grafana dashboard, or via a reliable API request or CLI interaction, is there a way to know whether an object has been cached in the region we want? Alternatively, is there a way to see whether a cache-write request has been fully processed? If not, it’s impossible to know anything about the state of object transfer speeds, and checking whether this is even an issue would require VPNs (unreliable) and/or setting up temporary nodes near every PoP, which is unrealistic. It seems necessary, then, to be able to query the status of a caching request or of an individual object.
- Do we know, or can we know, the expected wait time for replication to a region, so that the next GET request will not 404 and/or the download speed will not crawl? (See the workaround sketch after this list.)
- There are a lot of edge cases with caching: what happens when you cache on List, the sender is in a particular AZ, but the request is received in a location where those objects aren’t visible yet? Which objects get cached by a List request? None of that is documented.
- Monitoring of replication lag would be great.
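In the meantime, the kind of workaround we’re left with looks roughly like the sketch below: poll HEAD from the reader’s side until the object is visible before handing out the download. This is illustrative, not our production code; the endpoint and timeouts are placeholders.

```python
# wait_until_visible.py - illustrative HEAD-polling workaround for replication lag.
# Endpoint and timeouts are placeholders, not production values.
import time
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3", endpoint_url="https://fly.storage.tigris.dev")

def wait_until_visible(bucket: str, key: str, timeout_s: int = 600, interval_s: float = 5.0) -> bool:
    """Poll HEAD until the object is visible from this location, or give up."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            s3.head_object(Bucket=bucket, Key=key)
            return True
        except ClientError as err:
            code = err.response.get("Error", {}).get("Code", "")
            if code not in ("404", "NoSuchKey", "NotFound"):
                raise  # a real error, not just "not replicated here yet"
        time.sleep(interval_s)
    return False
```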
2 - Intermittent Upload Issues
For context, our uploads are on the larger side and come in large quantities, and we know for a fact that uploads are not working consistently enough. Even with a few retries when using the S3 CLI, we’ll see that uploads 1-100 work fine but 101-103 drop out.
The errors returned manifest as follows:
A) Invalid Argument
We know this not to be true. Recreating the upload with no adjustments succeeds, and because our workload uploads very similar files in large sequences (e.g. 0001.png to 0300.png), we can be fairly confident this is not an issue on our end. Additionally, these errors almost always come from sequential uploads, or uploads close together in time, pointing to a possible dropped or mishandled request on Tigris’ end.
An error occurred (InvalidArgument) when calling the UploadPart operation: Invalid argument
B) 500 - Internal Error
Today, we’ve had multiple of these random upload errors. We can definitely do better in handling transient network errors, but this is another dropout that persisted after retries as shown below.
An error occurred (InternalError) when calling the UploadPart operation (reached max retries: 2): We encountered an internal errors, please try again.
Finally, we’ve received both of these errors from the same codebase (no changes) on the same day, so the error messages are misleading.
Our Questions & Requests:
- Better error codes would be hugely appreciated. Even between the two above, the 500 is massively preferable to the InvalidArgument response, since we at least know it’s not an issue with our infra. Taking it one step further, it’d be great to get a more descriptive response about what actually happened.
- Improved reliability of PUTs (and GETs). This seems like a must, especially for workloads where the data is ephemeral and gone if the PUT fails.
- Guidance on handling such cases, if there is any. We are doing retries, and we’re implementing a forced fallback to single-part upload as a final attempt in all failure cases (roughly as sketched below). If there are other tips, do let us know; it’d be massively appreciated.
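For concreteness, the general shape of that fallback looks something like the sketch below. It’s a simplified illustration, not our real code; retry counts, names, and the endpoint are placeholders.

```python
# upload_with_fallback.py - simplified sketch of "retry multipart, then fall back to a
# single PUT". Retry counts, names, and endpoint are placeholders, not our real values.
import boto3
from boto3.exceptions import S3UploadFailedError
from boto3.s3.transfer import TransferConfig
from botocore.config import Config
from botocore.exceptions import ClientError

s3 = boto3.client(
    "s3",
    endpoint_url="https://fly.storage.tigris.dev",
    config=Config(retries={"max_attempts": 5, "mode": "adaptive"}),
)

def upload_with_fallback(path: str, bucket: str, key: str, attempts: int = 3) -> None:
    # Normal path: multipart upload for anything over ~8 MB, retried a few times.
    multipart_cfg = TransferConfig(multipart_threshold=8 * 1024 * 1024)
    for _ in range(attempts):
        try:
            s3.upload_file(path, bucket, key, Config=multipart_cfg)
            return
        except (ClientError, S3UploadFailedError):
            continue
    # Final attempt: raise the multipart threshold above the file size so the transfer
    # becomes a single PUT (no UploadPart calls at all). Only works for files under 5 GB.
    single_cfg = TransferConfig(multipart_threshold=5 * 1024 * 1024 * 1024)
    s3.upload_file(path, bucket, key, Config=single_cfg)
```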
3 - Sparse Docs, Incompatible Features, and Generally Undocumented Behavior
Features move fast, but they are often undocumented and then don’t work with each other. This has caused a lot of frustration for us devs when working with Tigris and trying to use features against the current docs, and it gets especially confusing when combining one feature with another.
We’ve found numerous cases where one feature is not compatible with another, even though both are individually documented. To make it worse, those features may not be reflected in the UI (see Experience 4), so there’s no way for us to know whether things are working as expected, whether we did something wrong, or whether it’s a bug on Tigris’ side. Debugging becomes a mad chase when this happens.
Experience 1 - CNAME domain assignment combined with any bucket setting change:
At some point, we could assign a CNAME to a bucket via the UI, but if we then tried to add a team member to the bucket for permissions, the web app would error. While debugging, I noticed it was erroring for any setting change on the bucket and was stuck in a loop.
Unfortunately, as far as I knew at the time, you couldn’t add a team member via the CLI, so we had to remove the CNAME completely to get out of the error loop and re-add it after adding my team member.
This has huge implications when operating a production environment: if your app in production has customer requests going to that CNAME and you want to add a header or change any setting, you will have downtime in most scenarios unless you’ve explicitly architected around this.
Experience 2 - aws s3 sync
This has largely been fixed, but a few months back Tigris simply would not accept s3 sync commands. The same commands worked against AWS S3. Additionally, aws s3 cp worked with Tigris, but s3 sync did not, even though the Tigris S3-API compatibility docs had all the required requests checked, so it was expected to work. The docs are super sparse on this, and the whole page is written from the Tigris team’s perspective rather than that of a developer using Tigris. Though to be fair, the docs on IAM here have since been updated to be more detailed.
Experience 3 - Odd AWS profiles and inconsistent configs
The configuration of credentials and profiles is neither well-documented nor consistent. ~/.aws/credentials is not read and used the same way as with AWS S3, and the same goes for ~/.aws/config. We’ve resorted to brute-forcing many options until we find the one that works best (though sometimes that is also the less ideal option in terms of security).
Currently, we’ve mostly settled on always using the --profile flag for any request that hits Tigris as a storage backend. This isn’t needed with other S3-compatible providers (we use multiple), but we’ve found it has the best compatibility with the config needed to stay compatible with the other storage backends. Unfortunately, this requires us to pull those creds down into a single, uniformly defined spot as a team, and to do the same anywhere a temporary auth token via federation won’t work. And it seems everyone on the team has tripped up on this and spent a few hours trying to work out the proper auth configuration for Tigris.
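For reference, the profile setup we’ve converged on looks roughly like this. The profile name is arbitrary, the endpoint is the one we point at (check the Tigris docs for the current value), and endpoint_url in the config file requires a reasonably recent AWS CLI/SDK; otherwise you have to pass --endpoint-url on every call.

```ini
# ~/.aws/config  (sketch; profile name and region value are illustrative)
[profile tigris]
region = auto
endpoint_url = https://fly.storage.tigris.dev

# ~/.aws/credentials
[tigris]
aws_access_key_id = <tigris access key>
aws_secret_access_key = <tigris secret key>
```

Every aws s3 ... invocation then gets --profile tigris tacked on.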
Experience 4 - UI, CLI, API “Uniformity”:
Broadly, the different clients (UI/CLI/Web/SDK/API) are not consistent. The web app UI has a lot of bugs where a config set via the CLI will not be reflected properly, and vice versa. It is hard to make infrastructure automatable (even manually, with a setup guide) when the UI is involved, and it’s even more difficult to follow IaC as a practice because of this.
As of the time of this writing, you can switch networks or toggle your VPN in the middle of uploading a large file to Tigris (to force an upload failure). Tigris will then reflect the full file size inside the web app UI, yet no file is visible there. I expect this is due to the successfully uploaded parts of the multipart upload not getting properly cleaned up and Tigris counting the full size too early. However, on our end as devs there is no simple way, AFAIK, to clean up that partial upload, and that file size metric remains incorrect. That’s one example of many others we’ve seen.
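In case it helps others, the standard S3 way to clean up those leftovers is to list and abort incomplete multipart uploads. A minimal sketch is below; it assumes Tigris honors these S3 calls, and we haven’t verified whether the UI’s size metric recovers afterwards.

```python
# abort_stale_multipart.py - sketch of cleaning up incomplete multipart uploads via the
# standard S3 API. Assumes Tigris supports these calls; endpoint is a placeholder.
import boto3

s3 = boto3.client("s3", endpoint_url="https://fly.storage.tigris.dev")

def abort_incomplete_uploads(bucket: str) -> None:
    """List all in-progress multipart uploads in the bucket and abort them."""
    paginator = s3.get_paginator("list_multipart_uploads")
    for page in paginator.paginate(Bucket=bucket):
        for upload in page.get("Uploads", []):
            print(f"aborting {upload['Key']} (upload id {upload['UploadId']})")
            s3.abort_multipart_upload(
                Bucket=bucket, Key=upload["Key"], UploadId=upload["UploadId"]
            )
```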
Our Questions & Requests:
- Just generally more descriptive and more detailed docs.
- UI/CLI/API should be closer to each other in what they present. If that’s not possible, it’d be good to have relevant APIs to see the “true” state of any stored object, bucket config, CNAME existence or non-existence, etc. Without this, it just becomes a guessing game on the developer’s side.
- More love into docs on configuring AWS profiles, authorizing users, etc.
Final Thoughts
I totally understand that Tigris is a startup, working on a ton of improvements and shipping fast, maybe too fast. But due to all of the above, we’re now generally paranoid about Tigris’ reliability when it comes to storage interactions. We’re redesigning to be more fault-tolerant and building fallbacks with other backup providers, so that we can flip a switch on storage during urgent issues and auto-fallback in less urgent cases.
We’re still rooting for Tigris as a service, because we need more AWS S3 alternatives as a whole and it’s a great fit for our needs; the intelligent caching is a great promise and the speeds are usually excellent. When it works, it’s fantastic! It’s also evident that the price/performance ratio is top-notch.
However, we really need more focus on reliability, DX, and docs. When rolling out features, it’d be highly appreciated if the docs were thorough about what is and isn’t supported. Supplementing that with guidance on implementing each feature “the right way” (so it’s tolerant of the issues above) would be great. Finally, thorough stability testing of each feature seems par for the course.
I know distributed storage systems are insanely hard to design, make reliable, and test across the board, but I don’t think my request is unique, and it seems likely that other current customers and prospective new ones feel the same.
Feel free to DM/text/etc as needed. We really want this improved.