# Token Management
Estimate and validate token usage to avoid exceeding model limits.
## Overview
Token management helps you:
- Estimate token counts before making API calls
- Validate prompts don’t exceed model context limits
- Budget token usage for cost optimization
- Fail early to save API costs
## What Are Tokens?
Tokens are the basic units LLMs use to process text:
- 1 token ≈ 4 characters in English
- 1 token ≈ ¾ of a word
- Tokens include punctuation and spaces
Examples:
- “Hello, world!” ≈ 4 tokens
- “Explain Rust ownership” ≈ 3 tokens
- A 1000-word essay ≈ 1333 tokens
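As a rough illustration of the ~4-characters-per-token rule (the same heuristic the fallback estimator described later uses), here is a minimal Rust sketch; `estimate_tokens` is a hypothetical helper, not part of the library API:

```rust
/// Rough character-based token estimate (~4 characters per token).
/// Real tokenizers are model-specific and will differ.
fn estimate_tokens(text: &str) -> usize {
    // Round up so the estimate stays conservative.
    (text.chars().count() + 3) / 4
}

fn main() {
    // "Hello, world!" is 13 characters -> ~4 tokens, matching the example above.
    println!("{}", estimate_tokens("Hello, world!"));
}
```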
## Token Estimation
Fortified LLM Client estimates tokens using model-specific tokenizers.
### Enable Token Validation
CLI:
```bash
fortified-llm-client \
  --api-url http://localhost:11434/v1/chat/completions \
  --model llama3 \
  --user-text "Your prompt" \
  --validate-tokens
```
Config file:
```toml
api_url = "http://localhost:11434/v1/chat/completions"
model = "llama3"
validate_tokens = true
```
Library:
```rust
use fortified_llm_client::{evaluate, EvaluationConfig};

let config = EvaluationConfig {
    api_url: "http://localhost:11434/v1/chat/completions".to_string(),
    model: "llama3".to_string(),
    user_prompt: "Your prompt".to_string(),
    validate_tokens: true,
    ..Default::default()
};

let result = evaluate(config).await?;
println!("Tokens used: {}", result.metadata.tokens_estimated);
```
### What Gets Estimated

Token estimation includes:

- System prompt (if provided)
- User prompt (or PDF-extracted text)
- Response buffer (tokens reserved for the LLM output, based on `max_tokens`)
Formula:

```text
total_tokens = system_tokens + user_tokens + response_buffer
```
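A minimal sketch of the pre-flight check this formula feeds into; the function name and signature are illustrative, not the library's internals:

```rust
/// Illustrative pre-flight check: does the prompt plus the reserved
/// response buffer fit within the model's context window?
fn fits_in_context(
    system_tokens: usize,
    user_tokens: usize,
    response_buffer: usize,
    context_limit: usize,
) -> bool {
    let total_tokens = system_tokens + user_tokens + response_buffer;
    total_tokens <= context_limit
}
```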
## Context Limits

### Auto-Detection
Context limits are auto-detected from model names:
| Model Pattern | Context Limit |
|---|---|
| `gpt-4-turbo` | 128,000 tokens |
| `gpt-4-32k` | 32,768 tokens |
| `gpt-4` | 8,192 tokens |
| `gpt-3.5-turbo-16k` | 16,385 tokens |
| `gpt-3.5-turbo` | 4,096 tokens |
| `claude-3-opus` | 200,000 tokens |
| `claude-3-sonnet` | 200,000 tokens |
| `llama3` | 8,192 tokens |
| `llama3-70b` | 8,192 tokens |
| `mistral` | 8,192 tokens |
Fallback: 4,096 tokens if the model is not recognized.
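A sketch of how such name-based detection could be implemented; this is illustrative, not the library's actual code. Note that longer patterns must be checked before their prefixes (`gpt-4-32k` before `gpt-4`), otherwise the shorter prefix would match first:

```rust
/// Illustrative name-based context-limit lookup, mirroring the table above.
/// "llama3-70b" is covered here by the "llama3" prefix.
fn detect_context_limit(model: &str) -> usize {
    const PATTERNS: &[(&str, usize)] = &[
        ("gpt-4-turbo", 128_000),
        ("gpt-4-32k", 32_768),
        ("gpt-4", 8_192),
        ("gpt-3.5-turbo-16k", 16_385),
        ("gpt-3.5-turbo", 4_096),
        ("claude-3-opus", 200_000),
        ("claude-3-sonnet", 200_000),
        ("llama3", 8_192),
        ("mistral", 8_192),
    ];
    PATTERNS
        .iter()
        .find(|(pattern, _)| model.starts_with(pattern))
        .map(|&(_, limit)| limit)
        .unwrap_or(4_096) // fallback for unrecognized models
}
```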
### Override Context Limit
Force a specific context limit:
CLI:
```bash
fortified-llm-client \
  --api-url http://localhost:11434/v1/chat/completions \
  --model custom-model \
  --user-text "Your prompt" \
  --validate-tokens \
  --context-limit 16384
```
Config file:
```toml
api_url = "http://localhost:11434/v1/chat/completions"
model = "custom-model"
validate_tokens = true
context_limit = 16384
```
Library:
```rust
let config = EvaluationConfig {
    api_url: "http://localhost:11434/v1/chat/completions".to_string(),
    model: "custom-model".to_string(),
    user_prompt: "Your prompt".to_string(),
    validate_tokens: true,
    context_limit: Some(16384),
    ..Default::default()
};
```
## Validation Behavior

### Success Case

If tokens ≤ context limit, the request proceeds:
```json
{
  "status": "success",
  "content": "LLM response...",
  "metadata": {
    "tokens_estimated": 1234,
    ...
  }
}
```
### Failure Case

If tokens > context limit, the request fails before the API is called:
```json
{
  "status": "error",
  "error": {
    "code": "ValidationError",
    "message": "Token count (10000) exceeds model context limit (8192)"
  }
}
```
Benefits:
- No API charges for invalid requests
- Immediate feedback
- Faster failure (no network round-trip)
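From library code, the same pre-flight failure surfaces as an `Err` from `evaluate`. A minimal sketch of handling it; the concrete error type isn't specified in this guide, so the sketch only inspects the rendered message and should be adapted to the real type:

```rust
// `config` built as in the earlier examples, with validate_tokens: true.
match evaluate(config).await {
    Ok(result) => println!("{}", result.content),
    // Assumption: matching on the error's Display output, since the
    // error type isn't documented here.
    Err(e) if e.to_string().contains("exceeds model context limit") => {
        eprintln!("Prompt too long; nothing was sent to the API: {e}");
        // e.g. shorten the prompt or split it, then retry
    }
    Err(e) => eprintln!("Other error: {e}"),
}
```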
## Use Cases

### Use Case 1: Validate Large PDFs
Prevent exceeding limits when processing PDFs:
```bash
fortified-llm-client \
  --api-url http://localhost:11434/v1/chat/completions \
  --model llama3 \
  --pdf-file large-document.pdf \
  --system-text "Summarize" \
  --validate-tokens \
  --context-limit 8192
```
If the extracted PDF text + system prompt + response buffer exceed 8,192 tokens, the request fails before the API is called.
### Use Case 2: Cost Optimization
Estimate cost before expensive API calls:
```bash
fortified-llm-client \
  --api-url https://api.openai.com/v1/chat/completions \
  --model gpt-4 \
  --api-key-name OPENAI_API_KEY \
  --user-file long-prompt.txt \
  --validate-tokens
```
Check `metadata.tokens_estimated` to calculate the cost:

- GPT-4: $0.03 per 1K input tokens
- Estimated tokens: 5,000
- Cost: 5 × $0.03 = $0.15
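The same arithmetic as a tiny helper; `input_cost_usd` is illustrative, and the $0.03/1K rate is the GPT-4 input price quoted above, so check your provider's current pricing:

```rust
/// Cost of the estimated input tokens at a given per-1K-token rate.
fn input_cost_usd(tokens_estimated: usize, usd_per_1k: f64) -> f64 {
    tokens_estimated as f64 / 1_000.0 * usd_per_1k
}

fn main() {
    println!("${:.2}", input_cost_usd(5_000, 0.03)); // $0.15
}
```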
### Use Case 3: Batch Processing with Token Budgets

Process multiple prompts within a total token budget:
```rust
use fortified_llm_client::{evaluate, EvaluationConfig};

async fn batch_with_budget(
    prompts: Vec<String>,
    max_total_tokens: usize,
) -> Result<(), Box<dyn std::error::Error>> {
    let mut total_tokens = 0;

    for prompt in prompts {
        let config = EvaluationConfig {
            api_url: "http://localhost:11434/v1/chat/completions".to_string(),
            model: "llama3".to_string(),
            user_prompt: prompt,
            validate_tokens: true,
            ..Default::default()
        };

        match evaluate(config).await {
            Ok(result) => {
                // Accumulate estimated usage and stop once the budget is spent.
                total_tokens += result.metadata.tokens_estimated;
                if total_tokens > max_total_tokens {
                    println!("Budget exceeded at {} tokens", total_tokens);
                    break;
                }
                println!("Processed: {}", result.content);
            }
            Err(e) => {
                // A failed prompt (e.g. over the context limit) consumes no
                // budget; continue with the next one.
                eprintln!("Error: {}", e);
            }
        }
    }

    Ok(())
}
```
## Token Estimation Accuracy

### Estimation Methods
Fortified LLM Client uses:
- Model-specific tokenizers - when the model is recognized (GPT-4, Llama 3, etc.)
- Character-based estimation - fallback: `char_count / 4`

Accuracy: ±5% for recognized models, ±15% for fallback estimation.
### Factors Affecting Accuracy
- Language: Non-English text may use more tokens
- Special characters: Code, emojis, symbols increase token count
- Tokenizer version: Provider may update tokenizers
Estimation is conservative (it tends to overestimate) so that prompts do not unexpectedly exceed limits.
## Advanced Configuration

### Custom Response Buffer

By default, the response buffer equals `max_tokens` (if set) or the provider default.
Override:
```rust
// Not directly exposed, but max_tokens controls it
let config = EvaluationConfig {
    // ...
    max_tokens: Some(1000), // Reserves 1000 tokens for the response
    validate_tokens: true,
    ..Default::default()
};
```
Calculation:

- If `max_tokens = 1000`: total = system + user + 1000
- If `max_tokens = None`: total = system + user + default (often 2048)
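The buffer choice reduces to a one-liner; 2048 here is the "often" default quoted above, not a guaranteed constant of the library:

```rust
/// Response buffer used in the validation total: `max_tokens` if set,
/// otherwise a provider default (2048 is the typical figure quoted above).
fn response_buffer(max_tokens: Option<usize>) -> usize {
    max_tokens.unwrap_or(2048)
}
```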
### Disable Validation for Specific Requests
```rust
let config = EvaluationConfig {
    // ...
    validate_tokens: false, // Skip validation
    ..Default::default()
};
```
Use when:
- You know the prompt is safe
- You want to let the API handle limits
- Debugging estimation issues
## Troubleshooting

### Error: “Token count exceeds limit”

Cause: The prompt is too long for the model.
Fixes:

- Shorten prompt: Reduce system/user prompt length
- Use larger model: Switch to a model with a bigger context (e.g., gpt-4-turbo's 128K)
- Reduce max_tokens: Lower the response buffer
- Split prompts: Break into multiple smaller requests (see the sketch below)
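For the last fix, a rough character-based splitter, assuming the ~4 characters/token heuristic from earlier; `split_by_token_budget` is a hypothetical helper, and real chunking would preferably cut at paragraph or sentence boundaries:

```rust
/// Split text into chunks that each stay under a per-chunk token budget,
/// using the crude ~4 chars/token rule.
fn split_by_token_budget(text: &str, max_tokens_per_chunk: usize) -> Vec<String> {
    let max_chars = max_tokens_per_chunk * 4;
    text.chars()
        .collect::<Vec<_>>()
        .chunks(max_chars)
        .map(|chunk| chunk.iter().collect())
        .collect()
}
```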
### Estimation Seems Wrong

Cause: The model is not recognized, or there is a tokenizer mismatch.
Debug:
```bash
fortified-llm-client \
  --api-url http://localhost:11434/v1/chat/completions \
  --model unknown-model \
  --user-text "test" \
  --validate-tokens \
  --verbose
```
Check the logs to see which tokenizer is being used.

Fix: Override the context limit if the estimate is incorrect:
```bash
--context-limit 32768
```
### Validation Passes but API Rejects

Cause: The API's actual limit differs from the estimate.

Fix: Set a lower context limit to add a safety margin:
```bash
--context-limit 7000  # Instead of 8192
```
## Best Practices

### 1. Always Validate for Unknown Input
When processing user input or PDFs:
```bash
--validate-tokens --context-limit 8192
```
### 2. Add Safety Margin
Reserve 10-20% headroom:
```bash
# Model limit: 8192
--context-limit 7000  # ~15% margin
```
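The arithmetic behind that margin, as a tiny illustrative helper (not part of the library):

```rust
/// Apply a fractional headroom to a model's context limit.
/// with_margin(8192, 0.15) == 6963, i.e. roughly the 7000 used above.
fn with_margin(model_limit: usize, headroom: f64) -> usize {
    (model_limit as f64 * (1.0 - headroom)) as usize
}
```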
### 3. Monitor Token Usage
Track tokens in responses:
```rust
let result = evaluate(config).await?;
println!("Tokens used: {}", result.metadata.tokens_estimated);

// Log to metrics system
metrics::gauge!("llm.tokens.used", result.metadata.tokens_estimated as f64);
```
### 4. Use Appropriate Models

Match the model to the task:

- Short prompts: `gpt-3.5-turbo` (4K context)
- Long prompts: `gpt-4-turbo` (128K context)
- Huge documents: `claude-3-opus` (200K context)
## Next Steps
- PDF Extraction - Handle large PDFs
- Response Formats - Control output structure
- Configuration - Set defaults