Microsoft.ML.Tokenizers
2.0.0
- .NET CLI: `dotnet add package Microsoft.ML.Tokenizers --version 2.0.0`
- Package Manager: `NuGet\Install-Package Microsoft.ML.Tokenizers -Version 2.0.0`
- PackageReference: `<PackageReference Include="Microsoft.ML.Tokenizers" Version="2.0.0" />`
- Central Package Management, in Directory.Packages.props: `<PackageVersion Include="Microsoft.ML.Tokenizers" Version="2.0.0" />`
- Central Package Management, in the project file: `<PackageReference Include="Microsoft.ML.Tokenizers" />`
- Paket CLI: `paket add Microsoft.ML.Tokenizers --version 2.0.0`
- Script & Interactive: `#r "nuget: Microsoft.ML.Tokenizers, 2.0.0"`
- File-based apps: `#:package Microsoft.ML.Tokenizers@2.0.0`
- Cake addin: `#addin nuget:?package=Microsoft.ML.Tokenizers&version=2.0.0`
- Cake tool: `#tool nuget:?package=Microsoft.ML.Tokenizers&version=2.0.0`
About
Microsoft.ML.Tokenizers provides an abstraction for tokenizers as well as implementations of common tokenization algorithms.
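Because the library exposes an abstract `Tokenizer` type, application code can be written once against that base type and reused with any concrete tokenizer. The following is a minimal sketch under that assumption; `TokenBudget` and `Fits` are hypothetical names introduced here for illustration and are not part of the library.

```csharp
using Microsoft.ML.Tokenizers;

// Hypothetical helper (not part of Microsoft.ML.Tokenizers).
// It depends only on the abstract Tokenizer type, so any implementation can be passed in.
static class TokenBudget
{
    // Returns true when the text encodes to at most maxTokens tokens.
    public static bool Fits(Tokenizer tokenizer, string text, int maxTokens)
        => tokenizer.CountTokens(text) <= maxTokens;
}
```

A caller could use `TokenBudget.Fits(tokenizer, prompt, 4096)` as a guard before sending a prompt to a model; the limit of 4096 is only an example value.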
Key Features
- Extensible tokenizer architecture that allows for specialization of the Normalizer, PreTokenizer, Model/Encoder, and Decoder
- BPE - Byte pair encoding model
- English Roberta model
- Tiktoken model
- Llama model
- Phi2 model
How to Use
using Microsoft.ML.Tokenizers;
using System;
using System.Collections.Generic;
using System.IO;
using System.Net.Http;
//
// Using Tiktoken Tokenizer
//
// Initialize the tokenizer for the `gpt-4o` model. This instance should be cached for all subsequent use.
Tokenizer tokenizer = TiktokenTokenizer.CreateForModel("gpt-4o");
string source = "Text tokenization is the process of splitting a string into a list of tokens.";
Console.WriteLine($"Tokens: {tokenizer.CountTokens(source)}");
// prints: Tokens: 16
var trimIndex = tokenizer.GetIndexByTokenCountFromEnd(source, 5, out string processedText, out _);
Console.WriteLine($"5 tokens from end: {processedText.Substring(trimIndex)}");
// prints: 5 tokens from end: a list of tokens.
trimIndex = tokenizer.GetIndexByTokenCount(source, 5, out processedText, out _);
Console.WriteLine($"5 tokens from start: {processedText.Substring(0, trimIndex)}");
// prints: 5 tokens from start: Text tokenization is the
IReadOnlyList<int> ids = tokenizer.EncodeToIds(source);
Console.WriteLine(string.Join(", ", ids));
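// prints: 1199, 4037, 2065, 374, 279, 1920, 315, 45473, 264, 925, 1139, 264, 1160, 315, 11460, 13
// Note: these IDs appear to match the cl100k_base vocabulary (e.g. gpt-4); gpt-4o uses o200k_base, so the actual output may differ.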
//
// Using Llama Tokenizer
//
// Open a stream to the remote Llama tokenizer model data file.
using HttpClient httpClient = new();
const string modelUrl = @"https://huggingface.co/hf-internal-testing/llama-tokenizer/resolve/main/tokenizer.model";
using Stream remoteStream = await httpClient.GetStreamAsync(modelUrl);
// Create the Llama tokenizer using the remote stream. This should be cached for all subsequent use.
Tokenizer llamaTokenizer = LlamaTokenizer.Create(remoteStream);
string input = "Hello, world!";
ids = llamaTokenizer.EncodeToIds(input);
Console.WriteLine(string.Join(", ", ids));
// prints: 1, 15043, 29892, 3186, 29991
Console.WriteLine($"Tokens: {llamaTokenizer.CountTokens(input)}");
// prints: Tokens: 5
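The comments in the sample recommend caching the tokenizer instance, and the sample trims text with `GetIndexByTokenCount`. A minimal sketch of both ideas follows. `TokenizerCache` and `PromptTrimmer` are hypothetical names, not library types, and the sketch assumes that the vocabulary data for `gpt-4o` is available to the application (for instance through the Microsoft.ML.Tokenizers.Data.O200kBase package). It also assumes the normalized-text out parameter may be null when no normalization is applied, so it falls back to the original text.

```csharp
using System;
using Microsoft.ML.Tokenizers;

// Hypothetical helper types; the names are illustrative and not part of the library.
static class TokenizerCache
{
    // Lazy<T> defers creation until first use and is thread-safe by default,
    // so the tokenizer is constructed at most once per process.
    private static readonly Lazy<Tokenizer> s_gpt4o =
        new Lazy<Tokenizer>(() => TiktokenTokenizer.CreateForModel("gpt-4o"));

    public static Tokenizer Gpt4o => s_gpt4o.Value;
}

static class PromptTrimmer
{
    // Returns the prefix of the text covered by the first maxTokens tokens.
    public static string TruncateToTokens(Tokenizer tokenizer, string text, int maxTokens)
    {
        int index = tokenizer.GetIndexByTokenCount(text, maxTokens, out string? normalizedText, out _);
        // Assumption: fall back to the original text if no normalized form was produced.
        string processed = normalizedText ?? text;
        return processed.Substring(0, index);
    }
}
```

With these helpers, `PromptTrimmer.TruncateToTokens(TokenizerCache.Gpt4o, source, 5)` would reuse the single cached tokenizer on every call instead of constructing a new one.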
Main Types
The main types provided by this library are:
- Microsoft.ML.Tokenizers.Tokenizer
- Microsoft.ML.Tokenizers.BpeTokenizer
- Microsoft.ML.Tokenizers.EnglishRobertaTokenizer
- Microsoft.ML.Tokenizers.TiktokenTokenizer
- Microsoft.ML.Tokenizers.Normalizer
- Microsoft.ML.Tokenizers.PreTokenizer
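To illustrate how these types relate, the sketch below encodes text to IDs and decodes the IDs back to text through the `Tokenizer` base type, so a `TiktokenTokenizer` or another concrete tokenizer can be passed in. `RoundTrip` and `EncodeThenDecode` are hypothetical names used only for this example. Whether the decoded string equals the input exactly depends on the tokenizer and its normalization, so treat equality as an assumption to verify rather than a guarantee.

```csharp
using System.Collections.Generic;
using Microsoft.ML.Tokenizers;

// Hypothetical helper (not part of Microsoft.ML.Tokenizers).
static class RoundTrip
{
    // Encodes the text to token IDs, then decodes those IDs back into a string.
    public static string EncodeThenDecode(Tokenizer tokenizer, string text)
    {
        IReadOnlyList<int> ids = tokenizer.EncodeToIds(text);
        return tokenizer.Decode(ids);
    }
}
```

Comparing the result with the original input is a quick way to check how a given tokenizer treats whitespace, casing, or special characters.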
Feedback & Contributing
Microsoft.ML.Tokenizers is released as open source under the MIT license. Bug reports and contributions are welcome at the GitHub repository.
| Product | Versions (compatible and additional computed target frameworks) |
|---|---|
| .NET | net5.0 was computed. net5.0-windows was computed. net6.0 was computed. net6.0-android was computed. net6.0-ios was computed. net6.0-maccatalyst was computed. net6.0-macos was computed. net6.0-tvos was computed. net6.0-windows was computed. net7.0 was computed. net7.0-android was computed. net7.0-ios was computed. net7.0-maccatalyst was computed. net7.0-macos was computed. net7.0-tvos was computed. net7.0-windows was computed. net8.0 is compatible. net8.0-android was computed. net8.0-browser was computed. net8.0-ios was computed. net8.0-maccatalyst was computed. net8.0-macos was computed. net8.0-tvos was computed. net8.0-windows was computed. net9.0 was computed. net9.0-android was computed. net9.0-browser was computed. net9.0-ios was computed. net9.0-maccatalyst was computed. net9.0-macos was computed. net9.0-tvos was computed. net9.0-windows was computed. net10.0 was computed. net10.0-android was computed. net10.0-browser was computed. net10.0-ios was computed. net10.0-maccatalyst was computed. net10.0-macos was computed. net10.0-tvos was computed. net10.0-windows was computed. |
| .NET Core | netcoreapp2.0 was computed. netcoreapp2.1 was computed. netcoreapp2.2 was computed. netcoreapp3.0 was computed. netcoreapp3.1 was computed. |
| .NET Standard | netstandard2.0 is compatible. netstandard2.1 was computed. |
| .NET Framework | net461 was computed. net462 was computed. net463 was computed. net47 was computed. net471 was computed. net472 was computed. net48 was computed. net481 was computed. |
| MonoAndroid | monoandroid was computed. |
| MonoMac | monomac was computed. |
| MonoTouch | monotouch was computed. |
| Tizen | tizen40 was computed. tizen60 was computed. |
| Xamarin.iOS | xamarinios was computed. |
| Xamarin.Mac | xamarinmac was computed. |
| Xamarin.TVOS | xamarintvos was computed. |
| Xamarin.WatchOS | xamarinwatchos was computed. |
- .NETStandard 2.0
  - Google.Protobuf (>= 3.30.2)
  - Microsoft.Bcl.AsyncInterfaces (>= 9.0.4)
  - Microsoft.Bcl.HashCode (>= 6.0.0)
  - Microsoft.Bcl.Memory (>= 9.0.4)
  - System.Buffers (>= 4.6.1)
  - System.IO.Pipelines (>= 9.0.4)
  - System.Memory (>= 4.6.3)
  - System.Runtime.CompilerServices.Unsafe (>= 6.1.2)
  - System.Text.Encodings.Web (>= 9.0.4)
  - System.Text.Json (>= 9.0.4)
- net8.0
  - Google.Protobuf (>= 3.30.2)
NuGet packages (87)
Showing the top 5 NuGet packages that depend on Microsoft.ML.Tokenizers:
| Package | Description |
|---|---|
| Microsoft.ML.Tokenizers.Data.O200kBase | The Microsoft.ML.Tokenizers.Data.O200kBase package includes the Tiktoken tokenizer data file o200k_base.tiktoken, which is used by models such as gpt-4o. |
| Microsoft.Agents.AI | Provides Microsoft Agent Framework core functionality. |
| Microsoft.ML.Tokenizers.Data.Cl100kBase | The Microsoft.ML.Tokenizers.Data.Cl100kBase package includes the Tiktoken tokenizer data file cl100k_base.tiktoken, which is used by models such as GPT-4. |
| Microsoft.Agents.AI.OpenAI | Provides Microsoft Agent Framework support for OpenAI. |
| Microsoft.Agents.AI.Workflows | Provides Microsoft Agent Framework support for workflows. |
GitHub repositories (1)
Showing the top 1 popular GitHub repositories that depend on Microsoft.ML.Tokenizers:
| Repository | Description |
|---|---|
| microsoft/semantic-kernel | Integrate cutting-edge LLM technology quickly and easily into your apps |
| Version | Downloads | Last Updated |
|---|---|---|
| 3.0.0-preview.26160.2 | 5,046 | 3/12/2026 |
| 2.0.0 | 940,645 | 11/11/2025 |
| 2.0.0-preview.25527.5 | 29,150 | 10/29/2025 |
| 2.0.0-preview.25503.2 | 11,808 | 10/3/2025 |
| 2.0.0-preview.25373.1 | 25,577 | 7/28/2025 |
| 2.0.0-preview.1.25127.4 | 141,223 | 2/28/2025 |
| 2.0.0-preview.1.25125.4 | 2,530 | 2/25/2025 |
| 1.0.3 | 83,594 | 10/28/2025 |
| 1.0.2 | 1,276,223 | 2/26/2025 |
| 1.0.1 | 445,936 | 1/15/2025 |
| 1.0.0 | 919,688 | 11/14/2024 |
| 0.22.0 | 64,972 | 11/13/2024 |
| 0.22.0-preview.24526.1 | 3,631 | 10/27/2024 |
| 0.22.0-preview.24522.7 | 3,773 | 10/23/2024 |
| 0.22.0-preview.24378.1 | 367,501 | 7/29/2024 |
| 0.22.0-preview.24271.1 | 223,288 | 5/21/2024 |
| 0.22.0-preview.24179.1 | 223,391 | 4/2/2024 |
| 0.22.0-preview.24162.2 | 23,423 | 3/13/2024 |
| 0.21.1 | 240,045 | 1/18/2024 |
| 0.21.0 | 87,672 | 11/27/2023 |