TokenRateGate.Extensions 0.9.1

.NET CLI:
dotnet add package TokenRateGate.Extensions --version 0.9.1

Package Manager:
NuGet\Install-Package TokenRateGate.Extensions -Version 0.9.1

PackageReference:
<PackageReference Include="TokenRateGate.Extensions" Version="0.9.1" />

Central Package Management:
<PackageVersion Include="TokenRateGate.Extensions" Version="0.9.1" />
<PackageReference Include="TokenRateGate.Extensions" />

Paket CLI:
paket add TokenRateGate.Extensions --version 0.9.1

Script & Interactive:
#r "nuget: TokenRateGate.Extensions, 0.9.1"

File-based apps:
#:package TokenRateGate.Extensions@0.9.1

Cake:
#addin nuget:?package=TokenRateGate.Extensions&version=0.9.1
#tool nuget:?package=TokenRateGate.Extensions&version=0.9.1
TokenRateGate
Stop getting HTTP 429 "Rate limit exceeded" errors from Azure OpenAI, OpenAI, and Anthropic Claude APIs.
TokenRateGate is a .NET library that prevents rate limit errors by intelligently managing your token and request budgets. It tracks both TPM (Tokens-Per-Minute) and RPM (Requests-Per-Minute) limits, automatically queues requests when capacity is full, and ensures you never hit the dreaded 429 error again.
The Problem
Getting this error when calling LLM APIs?
HTTP 429: Too Many Requests
Rate limit is exceeded. Try again in X seconds.
This happens when you exceed your API provider's limits:
- Azure OpenAI / Azure AI Foundry: TPM (Tokens-Per-Minute) limits based on your deployment tier
- OpenAI API: Both TPM and RPM (Requests-Per-Minute) limits
- Anthropic Claude: Rate limits based on usage tier
TokenRateGate prevents these errors by managing your token budget and queueing requests before they hit the API.
Features
- Prevents HTTP 429 Errors: Enforces TPM and RPM limits before making API calls
- Dual Limiting: Tracks both token usage (TPM) and request count (RPM) - whichever is more restrictive applies
- Smart Queuing: Automatically queues requests when capacity is full, with configurable timeouts
- Safety Buffer: Configurable buffer to avoid hitting exact limits (prevents edge-case 429s)
- Accurate Tracking: Records actual token usage from API responses for precise capacity management
- Multiple Providers: Built-in support for OpenAI and Azure OpenAI SDKs
- Multi-Tenant: Factory pattern for managing different rate limits per tenant/model/tier
- Real-Time Monitoring: Detailed usage statistics and capacity tracking
- Dependency Injection: First-class DI support with configuration binding
- High Performance: Optimized for throughput with minimal overhead
- Thread-Safe: Concurrent request handling with proper synchronization
Installation
Option 1: Complete Solution (Recommended)
For OpenAI or Azure OpenAI users:
dotnet add package TokenRateGate
Includes: Everything - base engine, DI, OpenAI integration, Azure integration.
Option 2: Custom LLM Providers
For Anthropic Claude, Google Gemini, or custom APIs:
dotnet add package TokenRateGate.Base
Includes: Core engine, DI support, character-based token estimation. Use when: Building custom integrations without OpenAI/Azure SDKs.
Option 3: Add Integrations Individually
# Base + OpenAI only
dotnet add package TokenRateGate.Base
dotnet add package TokenRateGate.OpenAI
# Base + Azure only
dotnet add package TokenRateGate.Base
dotnet add package TokenRateGate.Azure
Quick Start
1. Dependency Injection Setup (Recommended)
The easiest way to use TokenRateGate is with dependency injection:
using TokenRateGate.Extensions.DependencyInjection;
// In your Program.cs or Startup.cs
var builder = WebApplication.CreateBuilder(args);
// Register TokenRateGate with configuration
builder.Services.AddTokenRateGate(options =>
{
    options.TokenLimit = 150000;           // 150K tokens per minute (Azure Standard tier)
    options.WindowSeconds = 60;            // 60-second sliding window
    options.SafetyBufferPercentage = 0.05; // 5% safety buffer (avoids hitting exact limit)
    options.MaxConcurrentRequests = 10;    // Limit concurrent API calls
    options.MaxRequestsPerMinute = 100;    // Optional: Also enforce RPM limit
});
var app = builder.Build();
Or bind from configuration:
// appsettings.json
{
  "TokenRateGate": {
    "TokenLimit": 500000,
    "WindowSeconds": 60,
    "MaxConcurrentRequests": 10
  }
}
builder.Services.AddTokenRateGate(
    builder.Configuration.GetSection("TokenRateGate"));
2. Using with Custom LLM Providers
For APIs without built-in token estimation (Anthropic Claude, Google Gemini, custom LLMs):
using Microsoft.Extensions.Logging;
using TokenRateGate.Core;
using TokenRateGate.Core.TokenEstimation;
using TokenRateGate.Abstractions;

public class CustomLlmService
{
    private readonly ITokenRateGate _rateGate;
    private readonly ITokenEstimator _tokenEstimator;
    private readonly ILogger<CustomLlmService> _logger;

    public CustomLlmService(ITokenRateGate rateGate, ILogger<CustomLlmService> logger)
    {
        _rateGate = rateGate;
        _logger = logger;
        // Use character-based estimation (4 chars ≈ 1 token for most LLMs)
        _tokenEstimator = new CharacterBasedTokenEstimator();
    }

    public async Task<string> CallCustomLlmAsync(string prompt)
    {
        // Estimate tokens using character-based estimator
        int estimatedInputTokens = _tokenEstimator.EstimateTokens(prompt);
        int estimatedOutputTokens = 1000; // Your estimated response size

        // Reserve capacity before calling the LLM
        await using var reservation = await _rateGate.ReserveTokensAsync(
            estimatedInputTokens,
            estimatedOutputTokens);

        _logger.LogInformation("Reserved {Tokens} tokens", reservation.ReservedTokens);

        // Make your custom LLM API call
        var response = await CallYourCustomApiAsync(prompt);

        // Record actual usage from the response (IMPORTANT for accurate tracking)
        var actualTotalTokens = response.Usage.TotalTokens;
        reservation.RecordActualUsage(actualTotalTokens);

        _logger.LogInformation("Actual usage: {Tokens} tokens", actualTotalTokens);

        return response.Content;
    }
}
Using CharacterBasedTokenEstimator:
// Default: 4 characters per token
var estimator = new CharacterBasedTokenEstimator();
int tokens = estimator.EstimateTokens("Hello world!"); // ≈ 3 tokens
// Custom ratio for different languages
var chineseEstimator = new CharacterBasedTokenEstimator(charactersPerToken: 2.0);
int chineseTokens = chineseEstimator.EstimateTokens("你好世界"); // Better for non-Latin scripts
// Estimate multiple texts
var messages = new[] { "System prompt", "User message", "Assistant response" };
int totalTokens = estimator.EstimateTokens(messages);
3. Using with OpenAI SDK (Recommended for OpenAI Users)
TokenRateGate integrates seamlessly with the OpenAI SDK.
Note: Logging is optional. The WithRateLimit() extension method accepts an optional ILoggerFactory parameter for diagnostics. If not provided, logging is disabled.
using System.Text;
using Microsoft.Extensions.Configuration;
using OpenAI.Chat;
using TokenRateGate.OpenAI;
using TokenRateGate.Abstractions;

public class ChatService
{
    private readonly ITokenRateGate _rateGate;
    private readonly string _apiKey;

    public ChatService(
        ITokenRateGate rateGate,
        IConfiguration configuration)
    {
        _rateGate = rateGate;
        _apiKey = configuration["OpenAI:ApiKey"];
    }

    public async Task<string> AskQuestionAsync(string question)
    {
        // Create OpenAI client and wrap with rate limiting
        var client = new ChatClient("gpt-4", _apiKey);
        var rateLimitedClient = client.WithRateLimit(_rateGate, "gpt-4");

        // Make rate-limited API call - automatic token tracking!
        var messages = new[] { new UserChatMessage(question) };
        var response = await rateLimitedClient.CompleteChatAsync(messages);

        return response.Content[0].Text;
    }

    public async Task<string> AskQuestionStreamingAsync(string question)
    {
        var client = new ChatClient("gpt-4", _apiKey);
        var rateLimitedClient = client.WithRateLimit(_rateGate, "gpt-4");

        var messages = new[] { new UserChatMessage(question) };
        var result = new StringBuilder();

        // Streaming support with automatic token tracking
        await foreach (var chunk in rateLimitedClient.CompleteChatStreamingAsync(messages))
        {
            if (chunk.ContentUpdate.Count > 0)
            {
                var text = chunk.ContentUpdate[0].Text;
                result.Append(text);
                Console.Write(text);
            }
        }

        return result.ToString();
    }
}
4. Using with Azure OpenAI (Recommended for Azure Users)
Note: Logging is optional. The WithRateLimit() extension method accepts an optional ILoggerFactory parameter for diagnostics. If not provided, logging is disabled.
using Azure;
using Azure.AI.OpenAI;
using OpenAI.Chat;
using TokenRateGate.Azure;
using TokenRateGate.Abstractions;

public class AzureChatService
{
    private readonly ITokenRateGate _rateGate;

    public AzureChatService(ITokenRateGate rateGate)
    {
        _rateGate = rateGate;
    }

    public async Task<string> AskQuestionAsync(string question)
    {
        var azureClient = new AzureOpenAIClient(
            new Uri("https://your-resource.openai.azure.com/"),
            new AzureKeyCredential("your-api-key"));

        // Wrap with rate limiting (deployment name + model name for token counting)
        var rateLimitedClient = azureClient.WithRateLimit(
            _rateGate,
            deploymentName: "my-gpt4-deployment",
            modelName: "gpt-4");

        var messages = new[] { new UserChatMessage(question) };
        var response = await rateLimitedClient.CompleteChatAsync(messages);

        return response.Content[0].Text;
    }
}
Multi-Tenant Configuration
Support different rate limits for different users, models, or tenants:
// Registration in Program.cs
builder.Services.AddTokenRateGateFactory();

builder.Services.AddNamedTokenRateGate("basic-tier", options =>
{
    options.TokenLimit = 100000; // 100K tokens/min for basic users
    options.WindowSeconds = 60;
});

builder.Services.AddNamedTokenRateGate("premium-tier", options =>
{
    options.TokenLimit = 1000000; // 1M tokens/min for premium users
    options.WindowSeconds = 60;
});

// Usage in your service
public class MultiTenantChatService
{
    private readonly ITokenRateGateFactory _factory;

    public MultiTenantChatService(ITokenRateGateFactory factory)
    {
        _factory = factory;
    }

    public async Task<string> AskQuestionAsync(string question, string tier)
    {
        // Get rate gate for the tenant's tier
        var rateGate = _factory.GetOrCreate(tier);

        var client = new ChatClient("gpt-4", "your-api-key");
        var rateLimitedClient = client.WithRateLimit(rateGate, "gpt-4");

        var messages = new[] { new UserChatMessage(question) };
        var response = await rateLimitedClient.CompleteChatAsync(messages);

        return response.Content[0].Text;
    }
}
Standalone Usage (Without DI)
You can also use TokenRateGate without dependency injection:
using OpenAI.Chat;
using TokenRateGate.Core;
using TokenRateGate.Core.Options;
using TokenRateGate.OpenAI;
using Microsoft.Extensions.Logging;

// Create rate gate manually
var options = new TokenRateGateOptions
{
    TokenLimit = 500000,
    WindowSeconds = 60,
    MaxConcurrentRequests = 10
};

using var loggerFactory = LoggerFactory.Create(builder =>
{
    builder.AddConsole();
});

var rateGate = new TokenRateGate(options, loggerFactory);

// Use with OpenAI
var client = new ChatClient("gpt-4", "your-api-key");
var rateLimitedClient = client.WithRateLimit(rateGate, "gpt-4", loggerFactory);

var messages = new[] { new UserChatMessage("Hello!") };
var response = await rateLimitedClient.CompleteChatAsync(messages);

Console.WriteLine(response.Content[0].Text);
Monitoring Usage
public class MonitoringService
{
    private readonly ITokenRateGate _rateGate;

    public MonitoringService(ITokenRateGate rateGate)
    {
        _rateGate = rateGate;
    }

    public void LogCurrentUsage()
    {
        var stats = _rateGate.GetUsageStats();

        Console.WriteLine($"Current Usage: {stats.CurrentUsage}/{stats.EffectiveCapacity} tokens");
        Console.WriteLine($"Reserved: {stats.ReservedTokens} tokens");
        Console.WriteLine($"Available: {stats.AvailableTokens} tokens");
        Console.WriteLine($"Usage: {stats.UsagePercentage:F1}%");
        Console.WriteLine($"Near Capacity: {stats.IsNearCapacity}");
    }
}
Configuration Options
| Option | Default | Description |
|---|---|---|
| `TokenLimit` | 500000 | Maximum tokens per window (TPM limit) |
| `WindowSeconds` | 60 | Time window in seconds for token tracking |
| `SafetyBufferPercentage` | 0.05 (5%) | Percentage of `TokenLimit` reserved as a safety buffer<br>Effective limit = TokenLimit * (1 - SafetyBufferPercentage) |
| `MaxConcurrentRequests` | 1000 | Maximum concurrent active reservations |
| `MaxRequestsPerMinute` | null | Optional RPM limit (enforced in addition to the token limit)<br>If both are configured, whichever is more restrictive applies |
| `RequestWindowSeconds` | 120 | Time window for RPM tracking (default: max(120s, 2×WindowSeconds)) |
| `MaxWaitTime` | null (unlimited) | Maximum time to wait for capacity in the queue before timing out<br>Note: Only applies to capacity queue waiting, NOT semaphore waiting<br>Set to null for unlimited waiting (recommended for most use cases) |
| `OutputEstimationStrategy` | FixedMultiplier | How output tokens are estimated when not provided explicitly |
| `OutputMultiplier` | 0.5 | Multiplier for the FixedMultiplier strategy |
| `DefaultOutputTokens` | 1000 | Fixed output token count for the FixedAmount strategy |
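For illustration, a sketch wiring up the queue-related options from the table above (this assumes MaxWaitTime is a nullable TimeSpan, based on its null-for-unlimited default):

builder.Services.AddTokenRateGate(options =>
{
    options.TokenLimit = 150000;
    options.MaxRequestsPerMinute = 100;             // RPM limit enforced alongside the TPM limit
    options.RequestWindowSeconds = 120;             // separate sliding window for RPM tracking
    options.MaxWaitTime = TimeSpan.FromSeconds(30); // assumed TimeSpan?; queued requests time out after 30s
});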
Output Estimation Strategies
- `FixedMultiplier`: Multiply input tokens by `OutputMultiplier` (default 0.5)
- `FixedAmount`: Add a fixed `DefaultOutputTokens` (default 1000)
- `Conservative`: Assume output = input (reserve 2x input tokens)
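For example, with a 2,000-token input, FixedMultiplier (0.5) reserves 2,000 + 1,000 = 3,000 tokens, FixedAmount (default 1000) also reserves 3,000, and Conservative reserves 4,000. A sketch selecting the FixedAmount strategy (assuming a FixedAmount enum member, mirroring the Conservative example under Advanced Topics):

builder.Services.AddTokenRateGate(options =>
{
    options.OutputEstimationStrategy = OutputEstimationStrategy.FixedAmount;
    options.DefaultOutputTokens = 1000; // fixed output estimate added to every reservation
});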
How It Works
TokenRateGate uses a dual-component capacity system:
Current Capacity = Historical Usage + Active Reservations
1. Token Estimation: Before making an API call, estimate input and output tokens
2. Capacity Check: Checks both limits:
   - TPM check: `(Historical Usage + Active Reservations + Requested Tokens) <= (TokenLimit - SafetyBuffer)`
   - RPM check: `Current Request Count < MaxRequestsPerMinute`
   - Both must pass - whichever is more restrictive applies
3. Reservation: If capacity is available, tokens are reserved immediately. Otherwise, the request is queued with a timeout.
4. API Call: Make your LLM API call
5. Record Actual Usage (optional but recommended): Call `RecordActualUsage()` with the actual token count from the response
   - If recorded: actual usage is tracked in the sliding window for `WindowSeconds`
   - If not recorded: reserved capacity is freed immediately on disposal
6. Disposal: When the `using` block ends, the reservation is released and queued requests are processed
7. Sliding Window Cleanup:
   - Token timeline cleaned up every `WindowSeconds`
   - Request timeline cleaned up every `RequestWindowSeconds` (separate window for RPM)
   - Stale reservations removed after 10x `WindowSeconds`
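The lifecycle above maps directly onto the reservation API shown in the Quick Start. A condensed sketch (CallYourCustomApiAsync and its response type are placeholders for your own LLM call, as in the earlier example):

using TokenRateGate.Abstractions;
using TokenRateGate.Core.TokenEstimation;

public static class RateGateLifecycleSketch
{
    public static async Task<string> CallWithRateLimitAsync(ITokenRateGate rateGate, string prompt)
    {
        var estimator = new CharacterBasedTokenEstimator();
        int inputTokens = estimator.EstimateTokens(prompt);        // 1. estimate input tokens
        await using var reservation = await rateGate.ReserveTokensAsync(
            inputTokens, 1000);                                     // 2-3. capacity check + reserve (or queue)
        var response = await CallYourCustomApiAsync(prompt);        // 4. make the API call
        reservation.RecordActualUsage(response.Usage.TotalTokens);  // 5. record actual usage
        return response.Content;                                    // 6. disposal releases the reservation
    }
}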
Advanced Topics
Health Checks
builder.Services.AddHealthChecks()
    .AddTokenRateGate(name: "tokenrategate", tags: ["rate-limiting"]);
Custom Token Estimation
// Configure estimation strategy
builder.Services.AddTokenRateGate(options =>
{
    options.OutputEstimationStrategy = OutputEstimationStrategy.Conservative;
    // Now reserves 2x input tokens (assumes output = input)
});
Logging
TokenRateGate provides detailed structured logging:
builder.Services.AddLogging(logging =>
{
    logging.AddConsole();
    logging.SetMinimumLevel(LogLevel.Debug); // See detailed token tracking
});
Samples
Check the samples/ directory for complete examples:
- OpenAIIntegration: Basic OpenAI usage, streaming, monitoring
- AzureOpenAI.BasicUsage: Azure OpenAI integration
- More samples available in the repository
Performance
- Minimal Overhead: Token estimation uses the efficient tiktoken library
- Optimized Queuing: Fast capacity checks with double-check locking
- High Throughput: Achieves >95% capacity utilization under load
- Concurrent Requests: Supports high concurrency with proper synchronization
See tests/TokenRateGate.PerformanceTests for benchmarks.
Testing
# Run all tests
dotnet test
# Run specific test categories
dotnet test --filter "Category=Integration"
dotnet test --filter "Category=Performance"
Requirements
- .NET 6.0, 8.0, or 9.0
- OpenAI SDK (for OpenAI integration): `OpenAI` NuGet package
- Azure OpenAI SDK (for Azure integration): `Azure.AI.OpenAI` NuGet package
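The SDK packages themselves are ordinary NuGet installs if you need to add them alongside TokenRateGate:

dotnet add package OpenAI
dotnet add package Azure.AI.OpenAI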
Packages
User-Facing Packages
- TokenRateGate ⭐: Complete solution - includes Base + OpenAI + Azure (recommended for most users)
- TokenRateGate.Base: Core engine + DI + Extensions (for custom LLM providers)
- TokenRateGate.OpenAI: OpenAI SDK integration (can be used with Base)
- TokenRateGate.Azure: Azure OpenAI SDK integration (can be used with Base)
Internal Packages (Included in Base)
You don't need to install these individually - they're included in TokenRateGate.Base:
- TokenRateGate.Core: Core rate limiting engine
- TokenRateGate.Abstractions: Interfaces and abstractions
- TokenRateGate.Extensions: Base implementations
- TokenRateGate.Extensions.DependencyInjection: DI support
Contributing
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Make your changes with tests
- Submit a pull request
License
MIT License - Copyright © 2025 Marko Mrdja
See LICENSE for details.
Acknowledgments
- Uses tiktoken for accurate token counting
- Built for the OpenAI SDK for .NET
- Supports Azure OpenAI SDK
| Product | Compatible and additional computed target framework versions |
|---|---|
| .NET | net6.0 is compatible. net6.0-android was computed. net6.0-ios was computed. net6.0-maccatalyst was computed. net6.0-macos was computed. net6.0-tvos was computed. net6.0-windows was computed. net7.0 was computed. net7.0-android was computed. net7.0-ios was computed. net7.0-maccatalyst was computed. net7.0-macos was computed. net7.0-tvos was computed. net7.0-windows was computed. net8.0 is compatible. net8.0-android was computed. net8.0-browser was computed. net8.0-ios was computed. net8.0-maccatalyst was computed. net8.0-macos was computed. net8.0-tvos was computed. net8.0-windows was computed. net9.0 is compatible. net9.0-android was computed. net9.0-browser was computed. net9.0-ios was computed. net9.0-maccatalyst was computed. net9.0-macos was computed. net9.0-tvos was computed. net9.0-windows was computed. net10.0 was computed. net10.0-android was computed. net10.0-browser was computed. net10.0-ios was computed. net10.0-maccatalyst was computed. net10.0-macos was computed. net10.0-tvos was computed. net10.0-windows was computed. |
Dependencies

net6.0
- Microsoft.Extensions.Logging.Abstractions (>= 9.0.7)
- TokenRateGate.Abstractions (>= 0.9.1)
- TokenRateGate.Core (>= 0.9.1)

net8.0
- Microsoft.Extensions.Logging.Abstractions (>= 9.0.7)
- TokenRateGate.Abstractions (>= 0.9.1)
- TokenRateGate.Core (>= 0.9.1)

net9.0
- Microsoft.Extensions.Logging.Abstractions (>= 9.0.7)
- TokenRateGate.Abstractions (>= 0.9.1)
- TokenRateGate.Core (>= 0.9.1)
NuGet packages (1)
Showing the top 1 NuGet package that depends on TokenRateGate.Extensions:

| Package | Downloads |
|---|---|
| TokenRateGate.Base<br>Core rate limiting solution for custom LLM providers. Includes engine, DI support, and character-based token estimation. Use this for Anthropic Claude, Google Gemini, or custom APIs. For OpenAI/Azure, install TokenRateGate instead. | |
GitHub repositories
This package is not used by any popular GitHub repositories.
v0.9.0 (Public Beta)
- Initial public release
- Solves HTTP 429 "Rate limit exceeded" errors for Azure OpenAI, OpenAI, and Anthropic APIs
- Token-based rate limiting (TPM - Tokens Per Minute)
- Request-based rate limiting (RPM - Requests Per Minute)
- Smart queuing with timeout support
- Safety buffer to prevent exact limit hits
- Comprehensive integration with OpenAI and Azure SDKs
- Multi-framework support (.NET 6.0, 8.0, 9.0)