Darcara.TextAnalysis
0.1.2
dotnet add package Darcara.TextAnalysis --version 0.1.2
Sentence splitting, named entity recognition, translation and more
Sentence splitting with SaT / WtP
Segment Any Text (SaT, June 2024) is the successor to Where's the Point (WtP, July 2023). The code for both papers is available on GitHub.
SaT supports 85 languages. The detailed list is available in their GitHub readme.
Models for SaT come in 3 flavors:
- Base models with 1, 3, 6, 9 or 12 layers, available on HuggingFace. More layers mean higher accuracy, but longer inference time.
- Low-Rank Adaptation (LoRA) modules, available for the 3- and 12-layer base models in their respective repositories. The LoRA modules enable the base models to be adapted to specific domains and styles.
- Supervised Mixture (sm) models with 1, 3, 6, 9 or 12 layers, available on HuggingFace. SM models have been trained on a "supervised mixture" of diverse styles and corruptions and score higher on both English and multilingual text.
This project supports the *-sm model family in ONNX format.
Configuration
The SaT models benefit greatly from the GPU.
For running on GPU, set SessionConfiguration.Batching to batch=4.
For running on CPU, set SessionConfiguration.Batching to batch=1 with InterOperationThreads=1 and IntraOperationThreads=2. Higher values for IntraOperationThreads will slightly decrease computing time, but use a lot more processing power. It is preferable to sentencize multiple texts in parallel instead.
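As an illustration of the settings above, a CPU-oriented setup might look like the following sketch. Only SessionConfiguration, Batching, InterOperationThreads and IntraOperationThreads are taken from this section; the SatSentencizer type, its Split method and the exact value types are assumptions for illustration, not the verified API.

```csharp
// Sketch only: the property names come from the notes above; everything else
// (SatSentencizer, Split, the value types) is a hypothetical placeholder.
var cpuConfig = new SessionConfiguration
{
    Batching = 1,              // use 4 when running on GPU
    InterOperationThreads = 1,
    IntraOperationThreads = 2, // higher values cost much more CPU for little gain
};

using var sentencizer = new SatSentencizer(cpuConfig);

// Prefer parallelizing over texts instead of raising IntraOperationThreads
// (assuming the sentencizer is thread-safe; otherwise create one per thread).
Parallel.ForEach(texts, text =>
{
    foreach (var sentence in sentencizer.Split(text))
        Console.WriteLine(sentence);
});
```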
A consuming project must reference a proper ONNX runtime. For Windows deployments, Microsoft.ML.OnnxRuntime.DirectML together with Microsoft.AI.DirectML will yield the best performance. Setting the RuntimeIdentifier in the project csproj to win-x64 is required.
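Putting this together, the consuming project's csproj might contain something like the following sketch. The package versions shown are illustrative placeholders, not pinned recommendations.

```xml
<!-- Sketch: versions are placeholders; pick current ones from NuGet -->
<PropertyGroup>
  <RuntimeIdentifier>win-x64</RuntimeIdentifier>
</PropertyGroup>
<ItemGroup>
  <PackageReference Include="Darcara.TextAnalysis" Version="0.1.2" />
  <PackageReference Include="Microsoft.ML.OnnxRuntime.DirectML" Version="1.20.1" />
  <PackageReference Include="Microsoft.AI.DirectML" Version="1.15.4" />
</ItemGroup>
```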
Evaluation
The corpora scores are from the original SaT GitHub repository.
The speed benchmark used the novel *The Adventures of Tom Sawyer, Complete* by Mark Twain from Project Gutenberg.
The -model columns give the speed of the model runtime alone, whereas -complete includes all pre- and post-processing, including word tokenization.
| Model | English Score¹ | Multilingual Score¹ | CPU-model | GPU-model |
|---|---|---|---|---|
| 1L | 88.5 | 84.3 | ||
| 1L‑sm | 88.2 | 87.9 | ||
| 3L | 93.7 | 89.2 | ||
| 3L‑sm | 96.5 | 93.5 | ||
| 6L | 94.1 | 89.7 | ||
| 6L‑sm | 96.9 | 95.1 | ||
| 9L | 94.3 | 90.3 | ||
| 12L | 94.0 | 90.4 | ||
| 12L‑sm | 97.4 | 96.0 |
¹ From the original SaT GitHub repository
Implementation notes
Word tokenization is done by SentencePiece using the xlm-roberta-base (Alt1, Alt2) model, consumed from C# via the SentencePieceTokenizer library.
See:
- https://www.kaggle.com/code/samuellongenbach/xlm-roberta-tokenizers-issue/notebook
- https://github.com/google/sentencepiece/issues/1042#issuecomment-2295028056
- There appears to be no resolution other than rewriting the model, since it is unclear how to "modify the indexing scheme to start from 1".
Language Identification
With this library you can identify languages via `ILanguageDetector detector = new LinguaLanguageDetector(useLowAccuracy: false);`
The LinguaLanguageDetector is based on Panlingo.Lingua, but optimized to use less memory and to provide additional detection methods.
The overloads for `ReadOnlySpan<Char>` (instead of `String.Substring`) reduce memory traffic, and the overloads for `ReadOnlySpan<Byte>` mean that already UTF-8-encoded strings can be used directly.
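A minimal usage sketch follows. Only ILanguageDetector and LinguaLanguageDetector appear in this document; the Detect method and its span overloads are assumed from the description above, not the verified API.

```csharp
// Sketch only: the Detect overloads are assumptions based on the text above.
ILanguageDetector detector = new LinguaLanguageDetector(useLowAccuracy: false);

string document = "Das ist ein kurzer deutscher Satz.";

// ReadOnlySpan<char> overload: classify a slice without allocating a substring
var language = detector.Detect(document.AsSpan(0, 12));

// ReadOnlySpan<byte> overload: data that is already UTF-8 can be passed directly
ReadOnlySpan<byte> utf8 = "Ceci est une phrase française."u8;
var language2 = detector.Detect(utf8);
```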
The following table contains the evaluated libraries:
| Library | Model | Accuracy | Reliability¹ | Time per prediction² | Memory³ | Unsupported Languages |
|---|---|---|---|---|---|---|
| Panlingo.CLD2 | CLD2 | word: 32.14%<br/>pairs: 65.14%<br/>sent: 91.18% | word: 93.97%<br/>pairs: 91.58%<br/>sent: 94.91% | 0.005ms | 15 KiB | Hebrew, Norwegian_Bokmal |
| Panlingo.CLD3 | CLD3 | word: 43.22%<br/>pairs: 60.53%<br/>sent: 84.17% | word: 48.07%<br/>pairs: 64.70%<br/>sent: 87.23% | 0.038ms | 15 KiB | Hebrew, Ganda, Norwegian_Bokmal, Norwegian_Nynorsk, Tagalog, Tswana, Tsonga |
| Panlingo.FastText | FastText - 176 compressed | word: 45.36%<br/>pairs: 58.92%<br/>sent: 78.43% | word: 52.34%<br/>pairs: 60.10%<br/>sent: 78.43% | 0.086ms | 10 MiB | Ganda, Maori, Norwegian_Bokmal, Shona, Sotho_Southern, Tswana, Tsonga, Xhosa, Zulu |
| Panlingo.FastText | FastText - 176 | word: 50.70%<br/>pairs: 64.55%<br/>sent: 80.71% | - | 0.104ms | 142 MiB | Ganda, Maori, Norwegian_Bokmal, Shona, Sotho_Southern, Tswana, Tsonga, Xhosa, Zulu |
| Panlingo.FastText | FastText - 217 | word: 52.01%<br/>pairs: 69.98%<br/>sent: 84.87% | - | 0.940ms | 1.18 GiB | Arabic, Azerbaijani, Persian, Latin, Latvian, Mongolian, Malay_macrolanguage, Albanian, Swahili_macrolanguage |
| FastText.NetWrapper | FastText - 176 | word: 50.70%<br/>pairs: 64.55%<br/>sent: 80.71% | - | 0.009ms | 135 MiB | Ganda, Maori, Norwegian_Bokmal, Shona, Sotho_Southern, Tswana, Tsonga, Xhosa, Zulu |
| FastText.NetWrapper | FastText - 217 | word: 52.01%<br/>pairs: 69.98%<br/>sent: 84.87% | - | 0.081ms | 1.13 GiB | Arabic, Azerbaijani, Persian, Latin, Latvian, Mongolian, Malay_macrolanguage, Albanian, Swahili_macrolanguage |
| Panlingo.Whatlang | Whatlang | word: 40.32%<br/>pairs: 51.31%<br/>sent: 68.23% | - | 0.042ms | 75 KiB | Bosnian, Welsh, Basque, Persian, Irish, Icelandic, Kazakh, Ganda, Maori, Mongolian, Malay_macrolanguage, Norwegian_Nynorsk, Somali, Albanian, Sotho_Southern, Swahili_macrolanguage, Tswana, Tsonga, Xhosa, Yoruba, Chinese |
| Panlingo.Lingua<br/>Wrapper around original Rust library | Lingua - Low accuracy | word: 60.16%<br/>pairs: 78.35%<br/>sent: 93.38% | word: 62.38%<br/>pairs: 80.20%<br/>sent: 93.99% | 0.089ms | 100 MiB | - |
| Panlingo.Lingua<br/>Wrapper around original Rust library | Lingua - High accuracy | word: 73.94%<br/>pairs: 89.06%<br/>sent: 96.01% | - | 0.264ms | 1 GiB | - |
| SearchPioneer.Lingua<br/>pure .NET port | Lingua - Low accuracy | word: 59.89%<br/>pairs: 78.23%<br/>sent: 93.28% | - | 0.452ms | 100 MiB | - |
| SearchPioneer.Lingua<br/>pure .NET port | Lingua - High accuracy | word: 73.64%<br/>pairs: 88.98%<br/>sent: 95.83% | - | 0.565ms | 1 GiB | - |
¹ Reliability is the sum of accurate predictions and 'unknown' predictions. It is usually better for a library to report that it cannot determine the language than to return a random wrong one.
² Average over 1000 single-word, word-pair and sentence predictions for each language. Actual timings will depend on your CPU.
³ Approximate memory requirement for one predictor, including native memory for libraries, rules and ML models.
Compiling onnx runtime on Windows
Prerequisites
Reference https://onnxruntime.ai/docs/build/inferencing.html
- Python 3.12
- CMake
- Visual Studio 2022 (with MSVC v143 C++ x64/x86 BuildTools(v14.41-17.11))
- Make sure the build folder is empty or missing before starting
git clone https://github.com/microsoft/onnxruntime
-- or --
git fetch
git checkout v1.20.1
onnxruntime> PATH=%PATH%;C:\Program Files\Python312
onnxruntime> build.bat --cmake_path "C:\Program Files\CMake\bin\cmake.exe" --ctest_path "C:\Program Files\CMake\bin\ctest.exe" --config Release --build_shared_lib --parallel --compile_no_warning_as_error --skip_tests --use_mimalloc --use_dml
The --build_nuget and --use_extensions flags currently cause problems.
The result will be in build\Windows\Release\Release
| Product | Compatible frameworks | Computed frameworks |
|---|---|---|
| .NET | net9.0 | net10.0; net9.0/net10.0 for android, browser, ios, maccatalyst, macos, tvos and windows |
Dependencies (net9.0)
- Hardware.Info (>= 101.1.0)
- IsoEnums (>= 1.0.0)
- Microsoft.Extensions.Logging.Abstractions (>= 9.0.10)
- Microsoft.ML.OnnxRuntime.Managed (>= 1.23.2)
- Neco.Common (>= 0.2.1)
- protobuf-net (>= 3.2.56)
- SentencePieceTokenizer (>= 0.1.4)
- System.Numerics.Tensors (>= 9.0.10)