Hey folks,
We’re having a tough time using Tigris storage due to odd issues popping up far more frequently than we’d like.
The various issues are documented below in detail. They center on Tigris’ S3 API reliability, consistency of responses, replication lag across geographic locations, and frequently undocumented, inconsistent developer ergonomics.
Preface and Context
We’re big fans of Tigris and want to keep using them if possible. The team is cool and hardworking, and they’ve been friendly and generally quick to respond.
In fact, we had an issue 6-ish months back with shadow buckets, where the feature unexpectedly duplicated all our stored data into Tigris. That caused massive egress fees and an equally egregious S3 bill due to the unexpected behavior. I personally had to be in constant conversations with AWS, since we couldn’t afford that egress when the bill came and we were heavily AWS-dependent at the time, through no fault of our own. Luckily, Tigris responded super quickly and even comped the bill. That’s more than reasonable and I was super grateful at the time.
But I’m writing this because odd quirks and major issues like that keep happening, and they constantly shift our focus to “dealing with Tigris and its edges” rather than relying on Tigris as a storage backend that handles this for us.
1 - Massive replication lag (>4-5 hours?) that won’t cut it for production environments:
We’ve faced several cases where an uploaded file is requested for download (from a distant geographic location, and therefore a different PoP) up to 4 hours later and still can’t be found. Ideally this lag would be significantly smaller, or Tigris would let the requester read through and fetch the object from the original PoP.
Luckily, Tigris’ CTO has been responsive and proactive, telling me that a deployment of QoS protocols to mitigate noisy neighbors is on its way. Still, this issue had our whole team solely focused on it for 4 full workdays, and the replication lag did not go away (and I have no idea whether it ever did).
This was hard enough for us to debug on top of dealing with the SEV1 issue itself, and it was a little upsetting that it wasn’t already on their radar. To prove there was major replication lag, we sent the same S3 List requests from VPSes set up in different locations, plus a hodgepodge of VPNs to manipulate our apparent geographic location. The same request returned very different results depending on where it was sent from; we also attached the creation times of those files to make the size of the lag obvious.
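For anyone trying to reproduce this, the probe we ran from each VPS was essentially just a List of the newest keys with their timestamps, so the outputs from different locations could be diffed. A minimal sketch (bucket, prefix, profile name, and endpoint are illustrative, not our real setup):

```python
# replication_probe.py - minimal sketch of the List probe run from each VPS.
# Bucket, prefix, profile name, and endpoint are placeholders.
from datetime import datetime, timezone
import boto3

session = boto3.Session(profile_name="tigris")  # hypothetical profile
s3 = session.client("s3", endpoint_url="https://fly.storage.tigris.dev")  # substitute your endpoint

resp = s3.list_objects_v2(Bucket="example-bucket", Prefix="uploads/", MaxKeys=1000)
now = datetime.now(timezone.utc)

# Print the newest keys with their age; diff this output across VPSes in different regions.
newest = sorted(resp.get("Contents", []), key=lambda o: o["LastModified"], reverse=True)[:100]
for obj in newest:
    age_min = (now - obj["LastModified"]).total_seconds() / 60
    print(f"{obj['Key']}\t{obj['LastModified'].isoformat()}\t{age_min:.0f} min old")
```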
“Dealing with Tigris” by migrating our whole app to a different geographical region:
The replication lag, especially at >4 hour lengths, is considered SEV1 for our application. We immediately get flooded with customer support tickets, because our service primarily deals with the transform and transfer of files; that’s 95% of what we do. We determined the replication issue was worst in the US regions (you could view Tigris in the browser or CLI and not see the last 100 files uploaded, because extreme lag on folders meant any objects within the subfolders were missing). So, once again to accommodate Tigris, we migrated our whole production infrastructure to another geographic location to work around replication lag on Tigris’ end.
This kept the issues from snowballing, but only partially. Today, most of the support tickets are winding down, but the problem is still only partially solved and we’re still dealing with the tail end of the consequences.
I understand that distributed object stores are going to be eventually consistent, but that lag is far too large a window to reach a consistent state. It should also be monitored.
1.1 Eager Caching + Geographic Restrictions?
Even when using some eager caching solutions or restricting storage zones, the guarantees are simply too weak.
Example 1: We’ve seen odd scenarios where files uploaded to a bucket restricted to a single storage zone download significantly faster when the downloader is close to that zone. This is expected and good! However, when we then upload files to a bucket with the same configuration but with eager caching enabled, explicitly requesting the same storage zone as above, the download crawls at 1/10th to 1/100th the speed. In case it helps, the download was against the Singapore AZ. This wasn’t a replication issue, since we waited around 4 hours after the cache request was sent before attempting the download. I’ve seen similar issues a lot on the developer side (see issue 3).
Finally, we’d love to roll out this caching mechanism to help with the lag, but the docs are super sparse on how the various configurations interact with each other.
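For what it’s worth, the comparison itself was nothing fancy: time the same GET against both bucket configurations from the same machine. A rough sketch (bucket names, key, and endpoint are illustrative):

```python
# download_timing.py - rough sketch of comparing download throughput across the two
# bucket configurations. Bucket names, key, and endpoint are placeholders.
import time
import boto3

s3 = boto3.client("s3", endpoint_url="https://fly.storage.tigris.dev")

def throughput_mb_s(bucket: str, key: str) -> float:
    """Download one object and return approximate throughput in MB/s."""
    start = time.monotonic()
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    return (len(body) / 1_000_000) / (time.monotonic() - start)

print("zone-restricted bucket:", throughput_mb_s("zone-restricted-bucket", "sample.bin"), "MB/s")
print("eager-cached bucket:   ", throughput_mb_s("eager-cached-bucket", "sample.bin"), "MB/s")
```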
Our Questions and Requests:
- Can we get better guidance on handling replication lag scenarios? More broadly, can we get more guidance on handling both uploads and downloads in a stable manner?
- If configuring the caching behavior is the solution, can we get more documentation on implementing eager caching or caching to a specified location?
- What exactly is the expected behavior for storage regional restriction requests? The intended use case for regional restriction doesn’t seem to be ensuring fast download speeds for nearby users. However, to cache for a customer near, say, Singapore, the only way to cache to only that location would be through a single-region restriction or a global bucket with eager caching (but even then, you don’t know if the cache is ready!). Shouldn’t we be able to cache to a designated region without using geo-restricted buckets? Using geo-restrictions also has broader implications: your infrastructure needs to live in one of those geographic locations, and users who proxy requests via presigned URLs or similar need to be in that location too.
- Inside the Grafana dashboard, or via a reliable API request or CLI interaction, is there a way to know whether an object has been cached in the region we want? Alternatively, is there a way to see whether a cache-write request has been fully processed? If not, it’s impossible to know anything about the state of object transfer speeds, and checking whether this is even an issue would require VPNs (unreliable) and/or setting up temporary nodes near every PoP, which is unrealistic. It seems necessary, then, to be able to query the status of a caching request or of an individual object.
- Do we know, or can we know, the expected wait time for replication to a region, so that the next GET request will not 404 and/or the download speed will not crawl? (See the workaround sketch after this list.)
- There are a lot of edge cases with caching: what happens when you cache on List, the sender is in a particular AZ, but the request is received in a location where those objects aren’t visible yet? Which objects get cached by a List request? None of that is documented.
- Monitoring of replication lag would be great.
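In the meantime, the kind of workaround we’re left with looks roughly like the sketch below: poll HEAD from the reader’s side until the object is visible before handing out the download. This is illustrative, not our production code; the endpoint and timeouts are placeholders.

```python
# wait_until_visible.py - illustrative HEAD-polling workaround for replication lag.
# Endpoint and timeouts are placeholders, not production values.
import time
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3", endpoint_url="https://fly.storage.tigris.dev")

def wait_until_visible(bucket: str, key: str, timeout_s: int = 600, interval_s: float = 5.0) -> bool:
    """Poll HEAD until the object is visible from this location, or give up."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            s3.head_object(Bucket=bucket, Key=key)
            return True
        except ClientError as err:
            code = err.response.get("Error", {}).get("Code", "")
            if code not in ("404", "NoSuchKey", "NotFound"):
                raise  # a real error, not just "not replicated here yet"
        time.sleep(interval_s)
    return False
```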
2 - Intermittent Upload Issues
For context, our uploads are on the larger side and come in large quantities, and we know for a fact that uploads are not working consistently enough. Even with a few retries when using the S3 CLI, we’ll see that uploads 1-100 work fine but 101-103 drop out.
The errors returned manifest as follows:
A) Invalid Argument
We know this not to be true. Recreating the upload with no adjustments succeeds, and because our workload uploads very similar files in large sequences (e.g. 0001.png to 0300.png), we can be fairly confident this is not an issue on our end. Additionally, these errors almost always come from sequential uploads, or uploads close together in time, pointing to a possible dropped or mishandled request on Tigris’ end.
An error occurred (InvalidArgument) when calling the UploadPart operation: Invalid argument
B) 500 - Internal Error
Today, we’ve had multiple of these random upload errors. We can definitely do better in handling transient network errors, but this is another dropout that persisted after retries as shown below.
An error occurred (InternalError) when calling the UploadPart operation (reached max retries: 2): We encountered an internal errors, please try again.
Finally, we’ve received both of these errors from the same codebase (no changes) on the same day, so the error messages are misleading.
Our Questions & Requests:
- Better error codes would be hugely appreciated. Even between the two above, the 500 is massively preferable to the InvalidArgument response, since we at least know it’s not an issue with our infra. Taking it one step further, it’d be great to get a more descriptive response about what actually happened.
- Improved reliability of PUTs (and GETs). This seems like a must, especially for workloads where the data is ephemeral and gone if the PUT fails.
- Guidance on handling such cases, if there is any. We are doing retries, and we’re implementing a forced fallback to single-part upload as a final attempt in all failure cases (roughly as sketched below). If there are other tips, do let us know; it’d be massively appreciated.
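For concreteness, the general shape of that fallback looks something like the sketch below. It’s a simplified illustration, not our real code; retry counts, names, and the endpoint are placeholders.

```python
# upload_with_fallback.py - simplified sketch of "retry multipart, then fall back to a
# single PUT". Retry counts, names, and endpoint are placeholders, not our real values.
import boto3
from boto3.exceptions import S3UploadFailedError
from boto3.s3.transfer import TransferConfig
from botocore.config import Config
from botocore.exceptions import ClientError

s3 = boto3.client(
    "s3",
    endpoint_url="https://fly.storage.tigris.dev",
    config=Config(retries={"max_attempts": 5, "mode": "adaptive"}),
)

def upload_with_fallback(path: str, bucket: str, key: str, attempts: int = 3) -> None:
    # Normal path: multipart upload for anything over ~8 MB, retried a few times.
    multipart_cfg = TransferConfig(multipart_threshold=8 * 1024 * 1024)
    for _ in range(attempts):
        try:
            s3.upload_file(path, bucket, key, Config=multipart_cfg)
            return
        except (ClientError, S3UploadFailedError):
            continue
    # Final attempt: raise the multipart threshold above the file size so the transfer
    # becomes a single PUT (no UploadPart calls at all). Only works for files under 5 GB.
    single_cfg = TransferConfig(multipart_threshold=5 * 1024 * 1024 * 1024)
    s3.upload_file(path, bucket, key, Config=single_cfg)
```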
3 - Sparse Docs, Incompatible Features, and Generally Undocumented Behavior
Features move fast, but they are often undocumented and then don’t work with each other. This has caused a lot of frustration for us devs when working with Tigris and trying to use features against the current docs, and it gets especially confusing when combining one feature with another.
We’ve found numerous cases where one feature is not compatible with another, even though both are individually documented. To make it worse, those features may not be reflected in the UI (see Experience 4), so there’s no way for us to know whether things are working as expected, whether we did something wrong, or whether it’s a bug on Tigris’ side. Debugging becomes a mad chase when this happens.
Experience 1 - CNAME domain assignment combined with any bucket setting change:
At some point, we could assign a CNAME to a bucket via the UI, but if we then tried to add a team member to the bucket for permissions, the web app would error. While debugging, I noticed it was erroring for any setting change on the bucket and was stuck in a loop.
Unfortunately, as far as I knew at the time, you couldn’t add a team member via the CLI, so we had to remove the CNAME completely to get out of the error loop and re-add it after adding my team member.
This has huge implications when operating a production environment: if your app in production has customer requests going to that CNAME and you want to add a header or change any setting, you will have downtime in most scenarios unless you’ve explicitly architected around this.
Experience 2 - aws s3 sync
This has largely been fixed, but a few months back Tigris simply would not accept s3 sync commands. The same commands worked against AWS S3. Additionally, aws s3 cp worked with Tigris, but s3 sync did not, even though the Tigris S3-API compatibility docs had all the required requests checked, so it was expected to work. The docs are super sparse on this, and the whole page is written from the Tigris team’s perspective rather than that of a developer using Tigris. Though to be fair, the docs on IAM here have since been updated to be more detailed.
Experience 3 - Odd AWS profiles and inconsistent configs
The configuration of credentials and profiles is neither well-documented nor consistent. ~/.aws/credentials is not read and used the same way as with AWS S3, and the same goes for ~/.aws/config. We’ve resorted to brute-forcing many options until we find the one that works best (though sometimes that is also the less ideal option in terms of security).
Currently, we’ve mostly settled on always using the --profile flag for any request that hits Tigris as a storage backend. This isn’t needed with other S3-compatible providers (we use multiple), but we’ve found it has the best compatibility with the config needed to stay compatible with the other storage backends. Unfortunately, this requires us to pull those creds down into a single, uniformly defined spot as a team, and to do the same anywhere a temporary auth token via federation won’t work. And it seems everyone on the team has tripped up on this and spent a few hours trying to work out the proper auth configuration for Tigris.
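For reference, the profile setup we’ve converged on looks roughly like this. The profile name is arbitrary, the endpoint is the one we point at (check the Tigris docs for the current value), and endpoint_url in the config file requires a reasonably recent AWS CLI/SDK; otherwise you have to pass --endpoint-url on every call.

```ini
# ~/.aws/config  (sketch; profile name and region value are illustrative)
[profile tigris]
region = auto
endpoint_url = https://fly.storage.tigris.dev

# ~/.aws/credentials
[tigris]
aws_access_key_id = <tigris access key>
aws_secret_access_key = <tigris secret key>
```

Every aws s3 ... invocation then gets --profile tigris tacked on.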
Experience 4 - UI, CLI, API “Uniformity”:
Broadly, the different clients (UI/CLI/Web/SDK/API) are not consistent. The web app UI has a lot of bugs where a config set via the CLI will not be reflected properly, and vice versa. It is hard to make infrastructure automatable (even manually, with a setup guide) when the UI is involved, and it’s even more difficult to follow IaC as a practice because of this.
As of the time of this writing, you can switch networks or toggle your VPN in the middle of uploading a large file to Tigris (to force an upload failure). Tigris will then reflect the full file size inside the web app UI, yet no file is visible there. I expect this is due to the successfully uploaded parts of the multipart upload not getting properly cleaned up and Tigris counting the full size too early. However, on our end as devs there is no simple way, AFAIK, to clean up that partial upload, and that file size metric remains incorrect. That’s one example of many others we’ve seen.
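In case it helps others, the standard S3 way to clean up those leftovers is to list and abort incomplete multipart uploads. A minimal sketch is below; it assumes Tigris honors these S3 calls, and we haven’t verified whether the UI’s size metric recovers afterwards.

```python
# abort_stale_multipart.py - sketch of cleaning up incomplete multipart uploads via the
# standard S3 API. Assumes Tigris supports these calls; endpoint is a placeholder.
import boto3

s3 = boto3.client("s3", endpoint_url="https://fly.storage.tigris.dev")

def abort_incomplete_uploads(bucket: str) -> None:
    """List all in-progress multipart uploads in the bucket and abort them."""
    paginator = s3.get_paginator("list_multipart_uploads")
    for page in paginator.paginate(Bucket=bucket):
        for upload in page.get("Uploads", []):
            print(f"aborting {upload['Key']} (upload id {upload['UploadId']})")
            s3.abort_multipart_upload(
                Bucket=bucket, Key=upload["Key"], UploadId=upload["UploadId"]
            )
```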
Our Questions & Requests:
- Just generally more descriptive and more detailed docs.
- UI/CLI/API should be closer to each other in what they present. If that’s not possible, it’d be good to have relevant APIs to see the “true” state of any stored object, bucket config, CNAME existence or non-existence, etc. Without this, it just becomes a guessing game on the developer’s side.
- More love into docs on configuring AWS profiles, authorizing users, etc.
Final Thoughts
I totally understand that Tigris is a startup, working on a ton of improvements and shipping fast, maybe too fast. But due to all of the above, we’re now generally paranoid about Tigris’ reliability when it comes to storage interactions. We’re redesigning to be more fault-tolerant and building fallbacks with other backup providers, so that we can flip a switch on storage during urgent issues and auto-fallback in less urgent cases.
We’re still rooting for Tigris as a service, because we need more AWS S3 alternatives as a whole and it’s a great fit for our needs; the intelligent caching is a great promise and the speeds are usually excellent. When it works, it’s fantastic! It’s also evident that the price/performance ratio is top-notch.
However, we really need more focus on reliability, DX, and docs. When rolling out features, it’d be highly appreciated if the docs were thorough about what is and isn’t supported. Supplementing that with guidance on implementing each feature “the right way” (so it’s tolerant of the issues above) would be great. Finally, thorough stability testing of each feature seems par for the course.
I know distributed storage systems are insanely hard to design, make reliable, and test across the board, but I don’t think my request is unique, and it seems likely that other current customers and prospective new ones feel the same.
Feel free to DM/text/etc as needed. We really want this improved.