Context compression

Optimize context usage with compression in Merge Gateway

What is context compression?

Context compression automatically reduces token usage before requests are sent to LLM providers. It runs transparently inside Gateway — your application sends requests normally, and Gateway compresses the context when needed.

Compression helps you:

Avoid context window errors by keeping requests within a model’s input limit
Reduce costs by proactively trimming tokens before they’re billed
Handle long conversations without manually managing message history

Compression is configured per organization or per project. Project-level settings override the org default, and a Disabled setting at any level turns compression off entirely.

How it works

When a request arrives, Gateway checks its token count against the target set by the configured compression mode. If compression is needed, it applies two techniques in order:

Lossless compression — minifies JSON in tool schemas, tool call arguments, tool results, and JSON message content. This alone can reduce tokens by 30–60% on JSON-heavy requests, with no change to content or response quality.
Message trimming — if still over the target after lossless compression, Gateway removes messages from the middle of the conversation. System messages and the most recent messages are always preserved.

Compression stops as soon as the request is under the target. If the model is unrecognized, compression is skipped and the request proceeds as-is.
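
The two-step order described above can be sketched in Python. This is an illustrative sketch only, not Gateway's implementation. The helper names, the 4-characters-per-token estimate, the `keep_recent` parameter, and the choice to drop the oldest unprotected message first are all assumptions made for the example.

```python
import json


def count_tokens(messages):
    # Hypothetical stand-in tokenizer: about one token per 4 characters.
    return sum(len(m["content"]) for m in messages) // 4


def minify_json(messages):
    """Lossless step: re-serialize JSON content without extra whitespace."""
    result = []
    for m in messages:
        content = m["content"]
        try:
            parsed = json.loads(content)
        except ValueError:
            parsed = None  # plain text, leave unchanged
        if isinstance(parsed, (dict, list)):
            content = json.dumps(parsed, separators=(",", ":"), ensure_ascii=False)
        result.append({**m, "content": content})
    return result


def trim_middle(messages, target_tokens, keep_recent=2):
    """Trimming step: drop non-system messages that sit before the recent tail."""
    kept = list(messages)
    while count_tokens(kept) > target_tokens:
        cutoff = max(len(kept) - keep_recent, 0)
        removable = [i for i in range(cutoff) if kept[i]["role"] != "system"]
        if not removable:
            break  # only protected messages remain; the request is sent anyway
        del kept[removable[0]]  # assumption: oldest unprotected message goes first
    return kept


def compress(messages, target_tokens, keep_recent=2):
    if count_tokens(messages) <= target_tokens:
        return messages  # already under the target, nothing to do
    messages = minify_json(messages)
    if count_tokens(messages) <= target_tokens:
        return messages  # lossless compression alone was enough
    return trim_middle(messages, target_tokens, keep_recent)
```

In this sketch, a request that consists only of a large system message comes back unchanged even when it is over the target, which mirrors the rule that system messages are never dropped.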

Compression modes

Mode	When it triggers	Use case
Context Window Only	Token count exceeds the model’s effective input limit	Safety net — prevent context window errors
Cost Optimization	Token count exceeds target ratio (default 70%) of the input limit	Proactively reduce token usage and costs
Disabled	Never	Turn off compression for an org or project

Gateway reserves 15% of the context window for output tokens, so the effective input limit is 85% of the model’s context window. Both modes apply their thresholds against this effective input limit.
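
As a rough illustration of how the three modes map to a trigger decision, the following Python sketch applies the 15% output reserve and the target ratio from the table. The enum, function names, and integer-percent arithmetic are assumptions for the example and do not represent a Gateway API.

```python
from enum import Enum

OUTPUT_RESERVE_PERCENT = 15  # share of the context window reserved for output


class Mode(Enum):
    CONTEXT_WINDOW_ONLY = "context_window_only"
    COST_OPTIMIZATION = "cost_optimization"
    DISABLED = "disabled"


def effective_input_limit(context_window):
    # Integer math avoids floating-point rounding surprises.
    return context_window * (100 - OUTPUT_RESERVE_PERCENT) // 100


def should_compress(mode, token_count, context_window, target_percent=70):
    if mode is Mode.DISABLED:
        return False  # never triggers
    threshold = effective_input_limit(context_window)
    if mode is Mode.COST_OPTIMIZATION:
        # Trigger earlier, at the target ratio of the effective input limit.
        threshold = threshold * target_percent // 100
    return token_count > threshold
```

With a hypothetical 128,000-token window, an 80,000-token request would trigger compression in Cost Optimization mode at the default ratio but not in Context Window Only mode.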

Configuration

Compression is configured in the Gateway dashboard at two levels:

Organization default — applies to all projects unless overridden. Set in the organization settings page.
Project override — set per-project during project creation or editing to override the org default. Choose a specific mode or use the org default.

A Disabled setting at any level turns compression off entirely. If no setting exists at either level, compression is skipped.
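
The precedence rules above can be read as a small resolution function. The sketch below is one literal reading of those rules, written in Python for illustration. The mode strings and the use of `None` for "no setting" are assumptions, and the dashboard's actual behavior may differ in edge cases.

```python
from typing import Optional

DISABLED = "disabled"


def resolve_mode(org_mode: Optional[str], project_mode: Optional[str]) -> Optional[str]:
    """Return the mode that applies to a project, or None if compression is skipped.

    None as an argument means "no setting at this level" (for a project,
    "use the org default").
    """
    # Rule from the text: Disabled at any level turns compression off.
    if DISABLED in (org_mode, project_mode):
        return DISABLED
    # A project-level setting overrides the organization default.
    if project_mode is not None:
        return project_mode
    # Fall back to the org default; None here means compression is skipped.
    return org_mode
```

For example, `resolve_mode("cost_optimization", None)` falls back to the org default, while `resolve_mode(None, None)` returns `None` because no setting exists at either level.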

FAQ

Does compression affect response quality?

Lossless compression has zero impact on quality — it encodes the same information more efficiently and is always applied first. If the request is still over the target, message trimming removes messages from the middle of the conversation while preserving all system messages and the most recent messages. For many requests, lossless compression alone is enough and no messages are removed.

When does compression run?

Only when the token count exceeds the threshold for the configured mode. In Context Window Only mode, compression triggers when tokens exceed the model’s effective input limit. In Cost Optimization mode, it triggers once tokens exceed the target ratio (default 70%) of the effective input limit.

What happens if the model isn't recognized?

Gateway skips compression and forwards the request to the provider as-is. Compression never causes a request to fail because the model is unrecognized.

What if compression still can't get under the target?

Gateway sends the request anyway. This can happen when system messages alone use most of the context window. System messages are never dropped.

Can I disable compression for a specific project?

Yes. In the dashboard, open the project’s compression settings and set the mode to Disabled. This overrides the organization default for that project.

How does the target ratio in Cost Optimization mode work?

The target ratio controls when Cost Optimization mode triggers compression. With the default of 70%, compression activates when the request exceeds 70% of the effective input limit. The effective input limit is 85% of the model’s context window (15% is reserved for output tokens). For example, with a 128K context window, the effective limit is about 109K tokens (85% of 128K), and compression triggers above roughly 76K tokens (70% of that limit). You can adjust the target ratio from 10% to 95% in the dashboard.
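
To see how the trigger point moves with the target ratio, the arithmetic above can be worked through in a few lines of Python. This is a sketch under stated assumptions: "128K" is taken as 128,000 tokens, the function name is hypothetical, and the range check simply mirrors the 10% to 95% bounds mentioned for the dashboard.

```python
OUTPUT_RESERVE_PERCENT = 15            # reserved for output tokens
MIN_TARGET_PERCENT, MAX_TARGET_PERCENT = 10, 95


def cost_optimization_trigger(context_window, target_percent=70):
    """Token count above which Cost Optimization mode would start compressing."""
    if not MIN_TARGET_PERCENT <= target_percent <= MAX_TARGET_PERCENT:
        raise ValueError("target ratio must be between 10% and 95%")
    effective_limit = context_window * (100 - OUTPUT_RESERVE_PERCENT) // 100
    return effective_limit * target_percent // 100


for percent in (10, 70, 95):
    print(percent, cost_optimization_trigger(128_000, percent))
# 10 10880
# 70 76160
# 95 103360
```

Lower ratios trade more aggressive compression for lower token usage, while a ratio near 95% behaves almost like Context Window Only mode.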