Context compression automatically reduces token usage before requests are sent to LLM providers. It runs transparently inside Gateway — your application sends requests normally, and Gateway compresses the context when needed.
Compression helps you:
Compression is configured per organization or per project. Project-level settings override the org default, and a Disabled setting at any level turns compression off entirely.
When a request arrives, Gateway checks the token count against the configured compression mode and target. If compression is needed, it applies two techniques in order: lossless compression first, then message trimming.
Compression stops as soon as the request is under the target. If the model is unrecognized, compression is skipped and the request proceeds as-is.
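The control flow described above can be sketched in a few lines. This is an illustrative sketch only, not Gateway's implementation: `compress_request`, `count_tokens`, and the `steps` list are hypothetical names, and an unrecognized model is represented here by a target of `None`.

```python
from typing import Callable, Optional

Message = dict  # assumed shape: {"role": "...", "content": "..."}
Step = Callable[[list[Message], int], list[Message]]


def compress_request(
    messages: list[Message],
    target_tokens: Optional[int],
    count_tokens: Callable[[list[Message]], int],
    steps: list[Step],
) -> list[Message]:
    """Apply compression steps in order, stopping once the request fits."""
    # Unrecognized model (no known target): skip compression, send as-is.
    if target_tokens is None:
        return messages
    for step in steps:
        # Stop as soon as the request is at or under the target.
        if count_tokens(messages) <= target_tokens:
            break
        messages = step(messages, target_tokens)
    # The result may still exceed the target; the request is sent anyway.
    return messages
```

In this sketch, `steps` would hold a lossless step followed by a trimming step, so trimming only runs if the lossless step alone was not enough.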
Gateway reserves 15% of the context window for output tokens. Both modes apply their thresholds against this effective input limit.
Compression is configured in the Gateway dashboard at two levels: organization and project.
A Disabled setting at any level turns compression off entirely. If no setting exists at either level, compression is skipped.
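The resolution rules for the two levels can be expressed as a small function. This is a sketch under assumptions: the `Mode` enum and `resolve_mode` are hypothetical names, and a missing setting is modeled as `None`. It follows the rules as stated here literally, so Disabled at either level wins.

```python
from enum import Enum
from typing import Optional


class Mode(Enum):
    DISABLED = "disabled"
    CONTEXT_WINDOW_ONLY = "context_window_only"
    COST_OPTIMIZATION = "cost_optimization"


def resolve_mode(org: Optional[Mode], project: Optional[Mode]) -> Optional[Mode]:
    """Return the effective mode, or None when compression is skipped."""
    # A Disabled setting at either level turns compression off.
    if Mode.DISABLED in (org, project):
        return None
    # Otherwise the project setting overrides the org default.
    # If neither level has a setting, this returns None (skip).
    return project if project is not None else org
```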
Lossless compression has zero impact on quality — it encodes the same information more efficiently and is always applied first. If the request is still over the target, message trimming removes messages from the middle of the conversation while preserving all system messages and the most recent messages. For many requests, lossless compression alone is enough and no messages are removed.
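To make the trimming behavior concrete, here is a minimal sketch. It is not Gateway's algorithm: `trim_middle`, `count_tokens`, and `keep_recent` are hypothetical, and removing the oldest eligible message first is an assumption, since the exact removal order is not documented here.

```python
from typing import Callable

Message = dict  # assumed shape: {"role": "...", "content": "..."}


def trim_middle(
    messages: list[Message],
    target_tokens: int,
    count_tokens: Callable[[list[Message]], int],
    keep_recent: int = 2,  # hypothetical: how many recent messages to protect
) -> list[Message]:
    """Drop middle messages until the request fits or nothing is left to drop."""
    non_system = [i for i, m in enumerate(messages) if m["role"] != "system"]
    # Eligible for removal: non-system messages except the most recent ones.
    removable = non_system[:-keep_recent] if keep_recent > 0 else non_system
    dropped: set[int] = set()
    kept = list(messages)
    for i in removable:  # assumption: oldest eligible message goes first
        if count_tokens(kept) <= target_tokens:
            break
        dropped.add(i)
        kept = [m for j, m in enumerate(messages) if j not in dropped]
    # System messages and the protected recent messages are never dropped,
    # so the result can still be over the target.
    return kept
```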
Compression runs only when the token count exceeds the threshold for the configured mode. In Context Window Only mode, compression triggers when tokens exceed the model’s effective input limit. In Cost Optimization mode, it triggers once tokens exceed the target ratio (default 70%) of the effective input limit.
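The trigger rule for both modes can be sketched as follows. The function names and mode strings are hypothetical, chosen for illustration. The 15% output reserve and 70% default come from the text above. Integer percentages are used to avoid floating-point rounding.

```python
OUTPUT_RESERVE_PCT = 15  # share of the context window reserved for output


def effective_input_limit(context_window: int) -> int:
    """Context window minus the share reserved for output tokens."""
    return context_window * (100 - OUTPUT_RESERVE_PCT) // 100


def trigger_threshold(context_window: int, mode: str, target_pct: int = 70) -> int:
    """Token count above which compression triggers for the given mode."""
    limit = effective_input_limit(context_window)
    if mode == "context_window_only":
        return limit
    if mode == "cost_optimization":
        return limit * target_pct // 100
    raise ValueError(f"unknown mode: {mode}")


def should_compress(token_count: int, context_window: int, mode: str,
                    target_pct: int = 70) -> bool:
    # Compression triggers only when the count exceeds the threshold.
    return token_count > trigger_threshold(context_window, mode, target_pct)
```

For a 128,000-token context window, `effective_input_limit` gives 108,800 and the default Cost Optimization threshold is 76,160.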
If the model is unrecognized, Gateway skips compression and forwards the request to the provider as-is. Requests never fail due to an unrecognized model.
If the request is still over the target after compression, Gateway sends it anyway. This can happen when system messages alone use most of the context window. System messages are never dropped.
To turn compression off for a single project, open the project’s compression settings in the dashboard and set the mode to Disabled. This overrides the organization default for that project.
The target ratio controls when Cost Optimization mode triggers compression. With the default of 70%, compression activates when the request exceeds 70% of the effective input limit. The effective input limit is 85% of the model’s context window (reserving 15% for output tokens). For example, with a 128K context window, the effective limit is about 108.8K tokens (85% of 128,000), and compression triggers above roughly 76K tokens (70% of 108,800). You can adjust the target ratio from 10% to 95% in the dashboard.