Ivilson.AI.VllmChatClient 2.0.20

This package has been deprecated.

Suggested Alternatives

VllmChatClient

Additional Details

This package has moved to VllmChatClient. Please install VllmChatClient for future updates. The API remains compatible.

dotnet add package Ivilson.AI.VllmChatClient --version 2.0.20

NuGet\Install-Package Ivilson.AI.VllmChatClient -Version 2.0.20

This command is intended to be used within the Package Manager Console in Visual Studio, as it uses the NuGet module's version of Install-Package.

<PackageReference Include="Ivilson.AI.VllmChatClient" Version="2.0.20" />

For projects that support PackageReference, copy this XML node into the project file to reference the package.

Directory.Packages.props:

<PackageVersion Include="Ivilson.AI.VllmChatClient" Version="2.0.20" />

Project file:

<PackageReference Include="Ivilson.AI.VllmChatClient" />

For projects that support Central Package Management (CPM), copy this XML node into the solution Directory.Packages.props file to version the package.

paket add Ivilson.AI.VllmChatClient --version 2.0.20

The NuGet Team does not provide support for this client. Please contact its maintainers for support.

#r "nuget: Ivilson.AI.VllmChatClient, 2.0.20"

#r directive can be used in F# Interactive and Polyglot Notebooks. Copy this into the interactive tool or source code of the script to reference the package.

#:package Ivilson.AI.VllmChatClient@2.0.20

#:package directive can be used in C# file-based apps starting in .NET 10 preview 4. Copy this into a .cs file before any lines of code to reference the package.

Install as a Cake Addin:

#addin nuget:?package=Ivilson.AI.VllmChatClient&version=2.0.20

Install as a Cake Tool:

#tool nuget:?package=Ivilson.AI.VllmChatClient&version=2.0.20

The NuGet Team does not provide support for this client. Please contact its maintainers for support.

vllmchatclient

C# vLLM Chat Client

A comprehensive .NET 10 chat client library that supports a range of LLMs with advanced reasoning capabilities, including the OpenAI GPT series, Claude 4.6 / 4.5, GPT-OSS-120B, Nemotron-3 Super 120B, Qwen3, Qwen3-Next, Qwen 3.5, Qwen 3.6, QwQ-32B, Gemma3, Gemma4, DeepSeek-R1, DeepSeek-V3.2, Kimi K2 / Kimi 2.5, GLM-5 / GLM 4.6 / 4.7 / 4.7 Flash / 4.5, Gemini 3, MiniMax-M2.5, and MiMo v2 Pro / MiMo v2 Flash.

✅ This project has been upgraded to .NET 10. All main libraries and test projects now target net10.0.

📢 Latest Update

Added VllmApiMode to VllmBaseChatClient, allowing callers to choose between OpenAI-compatible /chat/completions, vLLM / OpenAI /responses, and Anthropic Messages API formats.
Added full request / response / streaming conversion for vLLM Responses API and Anthropic Messages API, including reasoning output, tool calls, usage parsing, and streaming tool-call argument deltas.
Added Anthropic-format integration tests for GLM-5.1 and Qwen 3.6 Plus via DashScope Anthropic endpoints.
Upgraded Microsoft.Extensions.AI to 10.5.2.
VllmBaseChatClient now supports vLLM structured JSON output via both:
- response_format = { type: "json_schema", json_schema: { name, description, schema, strict } }
- extra_body = { structured_outputs: { json: schema } }
Google native endpoints now use Google official structured output field generationConfig.responseJsonSchema, while non-Google endpoints continue to use the vLLM / OpenAI-compatible response_format + extra_body.structured_outputs path.
Added live json_schema tests for all chat clients based on VllmBaseChatClient.
Serial verification passed on: Claude, DeepSeek-R1, DeepSeek-V3.2, Gemma3, Gemma4 Google native, GLM-4.5, GPT-OSS, Kimi 2.5, MiMo v2 Flash, MiniMax-M2.7, Nemotron, OpenAI GPT, Qwen3-Next, Qwen3-VL.
Note for reasoning models: json_schema tests may require a larger MaxOutputTokens budget. For MiniMax-M2.7, increasing it from 300 to 3000 was necessary because reasoning tokens consumed most of the smaller limit.
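
As a rough illustration of the two structured-output shapes listed above, the following sketch builds both with System.Text.Json. The schema, the "greeting" name, and the "qwen3" model name are hypothetical, and the sketch only shows the JSON layout as described here; how the library actually merges these fields into the request is not confirmed by this README.

```csharp
using System;
using System.Text.Json.Nodes;

// Hypothetical schema, used only for illustration.
var schema = new JsonObject
{
    ["type"] = "object",
    ["properties"] = new JsonObject
    {
        ["greeting"] = new JsonObject { ["type"] = "string" }
    },
    ["required"] = new JsonArray("greeting")
};

// Shape 1: OpenAI-compatible response_format with a json_schema entry.
var responseFormat = new JsonObject
{
    ["type"] = "json_schema",
    ["json_schema"] = new JsonObject
    {
        ["name"] = "greeting",
        ["description"] = "A single greeting object",
        ["schema"] = schema.DeepClone(), // a JsonNode can only have one parent
        ["strict"] = true
    }
};

// Shape 2: vLLM structured_outputs carried under extra_body.
var extraBody = new JsonObject
{
    ["structured_outputs"] = new JsonObject { ["json"] = schema.DeepClone() }
};

// Assumed request layout that carries both shapes side by side.
var request = new JsonObject
{
    ["model"] = "qwen3",
    ["response_format"] = responseFormat,
    ["extra_body"] = extraBody
};

Console.WriteLine(request.ToJsonString());
```

Sending both shapes lets a server that understands either one pick it up, which matches the "via both" wording above.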

🚀 Features

✅ Multi-model Support: OpenAI GPT series, Claude 4.6 / 4.5, Nemotron-3 Super 120B, Qwen3, Qwen3-Next, Qwen 3.5 (supports multiple modelIds, including Qwen3-VL), QwQ, Gemma3, Gemma4, DeepSeek-R1, DeepSeek-V3.2, GLM-5 / GLM-4 / glm-4.6 / glm-4.7 / glm-4.7-flash / glm-4.5, GPT-OSS-120B/20B, Kimi K2 / Kimi 2.5, Gemini 3, MiniMax-M2.5, MiMo v2 Pro / MiMo v2 Flash
✅ Reasoning Chain Support: Built-in thinking/reasoning capabilities for supported models (GLM supports Zhipu official thinking parameter via GlmChatOptions.ThinkingEnabled)
✅ Stream Function Calls: Real-time function calling with streaming responses
✅ Multiple Deployment Options: Local vLLM deployment and cloud API support
✅ Performance Optimized: Efficient streaming and memory management
✅ .NET 10 Ready: Full compatibility with the latest .NET platform

📦 Project Repository

GitHub: https://github.com/iwaitu/vllmchatclient

License: MPL-2.0 (changed from MIT to Mozilla Public License 2.0)

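This Update

🆕 vLLM Responses API and Anthropic Messages API Support

VllmApiMode added: the VllmBaseChatClient constructor takes a new apiMode parameter with these options:
- VllmApiMode.ChatCompletions: the default OpenAI-compatible /chat/completions endpoint.
- VllmApiMode.Responses: the vLLM / OpenAI Responses API format.
- VllmApiMode.AnthropicMessages: the Anthropic Messages API format.
Responses API compatibility: supports regular responses, streaming responses, reasoning output, function calls, and usage parsing.
Anthropic API compatibility: supports the x-api-key / anthropic-version auth headers, the /v1/messages request format, system prompt separation, tool_use / tool_result, thinking blocks, and assembling tool arguments from streamed input_json_delta fragments.
Client adaptation: the constructors of the main clients derived from VllmBaseChatClient all accept a VllmApiMode, including GLM, Qwen3Next, Claude, DeepSeek, Gemma4, GPT-OSS, Kimi, MiMo, MiniMax, Nemotron, and OpenAI GPT.
Test coverage: added ResponsesApiModeTests, AnthropicApiModeTests, Glm5AnthropicTests, and Qwen36PlusAnthropicTests, covering the two new protocols (Responses / Anthropic) as well as GLM-5.1 and Qwen 3.6 Plus on the DashScope Anthropic endpoint.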
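
To make the streamed input_json_delta handling more concrete, here is a minimal sketch of the general technique: the partial_json fragments of one tool_use block are appended to a buffer and parsed only once the block is complete. The fragments and the city / unit argument names are hypothetical, and this is not the library's actual parser.

```csharp
using System;
using System.Text;
using System.Text.Json;

// Hypothetical fragments, shaped like the partial_json values carried by
// successive input_json_delta events for a single tool_use block.
string[] partialJsonFragments =
{
    "{\"city\":", " \"Tok", "yo\", \"unit\": \"c", "elsius\"}"
};

var buffer = new StringBuilder();
foreach (var fragment in partialJsonFragments)
{
    // A single fragment is usually not valid JSON, so only append here.
    buffer.Append(fragment);
}

// Parse once the content block has ended and the buffer holds a full object.
using var arguments = JsonDocument.Parse(buffer.ToString());
var city = arguments.RootElement.GetProperty("city").GetString();
var unit = arguments.RootElement.GetProperty("unit").GetString();

Console.WriteLine($"city={city}, unit={unit}");
```

Deferring the parse until the block ends avoids having to repair half-written JSON while the stream is still in flight.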

Demo: Using the Responses API

using Microsoft.Extensions.AI;

IChatClient client = new VllmOpenAiGptClient(
    endpoint: "http://localhost:8000/v1",
    token: "EMPTY",
    modelId: "openai/gpt-oss-20b",
    apiMode: VllmApiMode.Responses);

var response = await client.GetResponseAsync(
[
    new ChatMessage(ChatRole.User, "用一句话介绍 vLLM Responses API")
]);

Console.WriteLine(response.Text);

Demo: Using the Anthropic Messages API

using Microsoft.Extensions.AI;

var apiKey = Environment.GetEnvironmentVariable("VLLM_ALIYUN_API_KEY");

IChatClient client = new VllmQwen3NextChatClient(
    endpoint: "https://dashscope.aliyuncs.com/apps/anthropic",
    token: apiKey,
    modelId: "qwen3.6-plus",
    apiMode: VllmApiMode.AnthropicMessages);

var options = new VllmChatOptions
{
    ThinkingEnabled = true,
    MaxOutputTokens = 3000,
};

var response = await client.GetResponseAsync(
[
    new ChatMessage(ChatRole.System, "你是一个智能助手，名字叫菲菲"),
    new ChatMessage(ChatRole.User, "你是谁？")
],
options);

Console.WriteLine(response.Text);

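🆕 Gemma 4 Native API / vLLM Dual Support

Gemma 4 reasoning format note: on the vLLM / OpenAI-compatible path, the format of Gemma 4 thinking / reasoning output may change across vLLM versions. The library already handles some non-standard thought content, but if you run into format differences, treat the behavior of your deployed vLLM version as authoritative. See: https://github.com/vllm-project/vllm/pull/39027
VllmGemma4ChatClient added: a single client supports both the Google native API and vLLM / OpenAI-compatible endpoints for Gemma 4.
Protocol switches automatically by endpoint:
- Google official URLs (such as generativelanguage.googleapis.com) use the native generateContent / streamGenerateContent request format.
- Other URLs use the vLLM / OpenAI-compatible /chat/completions request format.
Auth header switches automatically:
- Google native API uses x-goog-api-key
- vLLM / OpenAI-compatible endpoints use Authorization: Bearer ...
Google native API capabilities: text chat, streaming output, image input, thinking control, and automatic / manual tool calling.
Google native structured output: when the endpoint is a Google official URL, JSON Schema output goes through Google's official generationConfig.responseJsonSchema; non-Google URLs still use response_format=json_schema and extra_body.structured_outputs.json.
vLLM compatibility: supports Gemma 4 chat, streaming, JSON output, image input, and tool calling.
Reasoning handling fix: thought / thinking content in Google native responses no longer leaks into the final answer text and is exposed separately through ReasoningChatResponse / ReasoningChatResponseUpdate.
Test coverage: added Gemma4Tests, Gemma4ProviderCompatibilityTests, and Gemma4NativeToolCallingTests, covering the Google native and vLLM paths respectively.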

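🆕 MiMo and Namespace Changes

License change: the project license file has changed from MIT to MPL-2.0 (Mozilla Public License 2.0).
VllmMimoChatClient added: adapts the Xiaomi MiMo cloud API and supports mimo-v2-pro and mimo-v2-flash.
MiMo request compatibility: uses the api-key request header and, following the official OpenAI-compatible API, sends extra_body: { "thinking": { "type": "enabled" | "disabled" } }.
Namespace change: VllmGlmChatClient moved from Microsoft.Extensions.AI.VllmChatClient.Glm4 to Microsoft.Extensions.AI.
Namespace change: VllmKimiK2ChatClient moved from Microsoft.Extensions.AI.VllmChatClient.Kimi to Microsoft.Extensions.AI.

🆕 Nemotron-3 Super 120B Reasoning Toggle Support

VllmNemotronChatClient request adaptation added: targets the OpenRouter nvidia/nemotron-3-super-120b-a12b:free model.
Reasoning toggle: VllmChatOptions.ThinkingEnabled controls the reasoning: { enabled: true|false } value sent in the request.
OpenRouter endpoint compatibility: when https://openrouter.ai/api/v1 is passed in, /chat/completions is appended automatically.
Compatibility tests added: verify that reasoning.enabled is sent correctly in both the enabled and disabled states.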

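🆕 Claude 4.6 / 4.5 Reasoning Support

VllmClaudeChatClient added: adapts Claude models served by platforms such as OpenRouter.
Reasoning parameter adaptation: supports the reasoning: { effort: "high"|"medium"|"low" } parameter introduced with Claude 4.6 (enabled via VllmChatOptions.ThinkingEnabled = true; defaults to high).
Response parsing: extracts reasoning content from the reasoning string or the reasoning_details array returned by the model and wraps it in ReasoningChatResponse.
Token optimization: applies a protective limit to Claude's large default token limit to avoid OpenRouter credit errors.

🆕 OpenAI GPT Series Support

VllmOpenAiGptClient added: adapts GPT series models from OpenAI directly or via OpenRouter (such as gpt-4o and gpt-5.2-codex).
Reasoning segment support: supports GPT series models that emit reasoning, with the reasoning level (ReasoningLevel) controlled through OpenAiGptChatOptions.
Flexible configuration: a built-in ExcludeReasoning option controls whether the reasoning process is included in the output.

🆕 DeepSeek V3.2 Reasoning Support

VllmDeepseekV3ChatClient reasoning fixes:
- Corrected request format: the DashScope API uses enable_thinking: true (a top-level boolean), not the Kimi-style thinking: {type: "enabled"}.
- The reasoning_content field returned by the model is now parsed and output correctly.
- Non-streaming responses expose reasoning content through ReasoningChatResponse.Reason.
- Streaming responses use ReasoningChatResponseUpdate.Thinking to distinguish the thinking phase from the final answer.
- Enable reasoning with VllmChatOptions.ThinkingEnabled = true.
- Compatible with the deepseek-v3.2 model on the DashScope platform.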

🐛 Bug Fixes

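VllmGptOssChatClient streaming function call bug fixed:
- Fixed an issue in streaming manual function calls where the first stream ended after the model returned tool_calls, so the final text reply could not be obtained.
- Added a GetStreamingResponseAsync override: it detects that the caller has appended tool results to messages and automatically starts a second streaming request, giving a seamless tool call → final reply flow.
- StreamChatManualFunctionCallTest can now complete the full tool-call flow in a single await foreach loop, without hand-written "Second turn" logic.
- Simplified the default system prompt by removing the hard constraint that content must be empty when tool_calls is present.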

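🔄 `VllmQwen3NextChatClient` Refactoring: Unified Multi-Model Adaptation

VllmQwen3NextChatClient now covers several model families, selected via the constructor modelId or ChatOptions.ModelId, so separate client classes are no longer needed:
- qwen3.5-397b-a17b (Qwen 3.5, latest)
- qwen3-next-80b-a3b-thinking / qwen3-next-80b-a3b-instruct
- qwen3-vl-30b-a3b-thinking / qwen3-vl-30b-a3b-instruct (multimodal, image input)
- qwen3-vl-32b-thinking / qwen3-vl-32b-instruct (multimodal)
- qwen3-vl-235b-a22b-thinking / qwen3-vl-235b-a22b-instruct (multimodal, manually verified)
Removed consolidated model classes (their functionality is now covered by VllmQwen3NextChatClient or the base class):
- VllmQwen2507ChatClient (qwen3-235b-a22b-instruct-2507): removed
- VllmQwen2507ReasoningChatClient (qwen3-235b-a22b-thinking-2507): removed
- The corresponding tests Qwen2507ChatTests.cs, Qwen2507ReasoningChatTests.cs, and Qwen3coderNextTests.cs were removed as well
Removed the VllmChatClientNuget.Test test project (no longer needed).

🧩 Base Class Refactoring and Adapter Enhancements

VllmBaseChatClient enhanced: common logic (request building, streaming parsing, reasoning content handling) moved into the base class; subclasses override only what differs.
VllmDeepseekR1ChatClient refactored: inherits VllmBaseChatClient, with slimmer code that keeps only the DeepSeek R1-specific ReasoningContent streaming logic.
VllmGptOssChatClient refactored: inherits VllmBaseChatClient, removing a large amount of duplicate code and improving reasoning stream handling.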

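🛠️ Local Skill Auto-Loading

VllmChatOptions adds automatic skill loading: by default, local skills are read from ./skills/*.md under the run directory and injected into the system prompt.
EnableSkills (default true) and SkillDirectoryPath control the switch and the path.
The built-in tools ListSkillFiles and ReadSkillFile let the model query and read skill files on demand during a conversation.
Added the SimpleSkillSmokeTests test class to verify the skill feature.

📝 Other Updates

Added Qwen 3.5 support (qwen3.5-397b-a17b) through VllmQwen3NextChatClient.
Added MiMo support: VllmMimoChatClient supports mimo-v2-pro and mimo-v2-flash.
VllmQwen3NextChatClient adds Qwen3.5 provider compatibility logic:
- When the API URL host ends with aliyuncs.com, a top-level enable_thinking is sent, following the official Alibaba Cloud API.
- Other endpoints (such as self-hosted vLLM / OpenAI-compatible gateways) receive a top-level chat_template_kwargs.enable_thinking.
VllmQwen3NextChatClient enables the legacy text tool-call fallback by default, for compatibility with the <tool_call>...</tool_call> format returned by Qwen3 / Qwen3.5.
Fixed Qwen3Next wrongly cleaning plain JSON text in non-streaming responses: tags are stripped only when real tool-call residue is present, so the outer JSON braces are no longer removed by mistake.
Models with the qwen3.5 prefix now accept multimodal input (images).
Added MiniMax-M2.5 support; VllmMiniMaxChatClient is compatible with M2.5 / M2.1.
Added GLM 4.7 Flash support.
Added GLM 4.6 / 4.7 / 5 reasoning support: VllmGlmChatClient supports segmented reasoning streaming (thinking / answer) and function calls.
Added GlmChatOptions: the ThinkingEnabled switch controls whether the request body includes thinking: { type: "enabled" }, as required by the Zhipu official platform (off by default).
Added KimiChatOptions: the ThinkingEnabled switch controls thinking: { type: "enabled" | "disabled" }, as required by Moonshot / Kimi 2.5.
Fixed and improved reasoning parsing in VllmKimiK2ChatClient.
Added tag extraction examples (based on JSON parsing and regex matching).
Added Gemini 3 support (VllmGemini3ChatClient); see the docs/Gemini3* documents.
Gemini 3 dual-provider compatibility: one VllmGemini3ChatClient works with both the Google native API and OpenRouter (the auth header switches automatically by endpoint).
OpenRouter compatibility improvements: the request body maps reasoning.enabled, and the tool-result message fields (tool_call_id / tool_calls) were fixed to support multi-turn function calls.
thoughtSignature may be missing from some OpenRouter models / responses; tests now validate it when present and skip strict assertions when absent.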

🔥 Latest Updates

🆕 Gemma 4 Support

Gemma 4 reasoning format note: on the vLLM / OpenAI-compatible path, Gemma 4 thinking / reasoning output format may change across vLLM versions. This library includes compatibility handling for some non-standard embedded thought content, but actual behavior still depends on the deployed vLLM version. See: https://github.com/vllm-project/vllm/pull/39027
VllmGemma4ChatClient added: one client now supports both Google native Gemma API and vLLM / OpenAI-compatible endpoints.
Endpoint-based protocol switching:
- Google native URLs → generateContent / streamGenerateContent
- Other URLs → /chat/completions
Structured JSON output routing:
- Google native URLs → generationConfig.responseJsonSchema
- Other URLs → response_format=json_schema + extra_body.structured_outputs
Auth header auto-switch:
- Google native → x-goog-api-key
- vLLM/OpenAI-compatible → Authorization: Bearer ...
Supported capabilities:
- chat / streaming chat
- thinking toggle
- JSON output
- image input
- automatic and manual tool calling
Reasoning separation: Google native thought parts are exposed as reasoning updates and no longer leak into final answer text.
Tests added: provider compatibility, native tool calling, and external integration coverage for both Google native and vLLM paths.
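
The auth header auto-switch described above can be pictured with a small sketch like the one below. BuildRequest is a hypothetical helper and the "googleapis.com" host check is an assumption about how a Google native URL might be detected; the library's real detection logic may differ.

```csharp
using System;
using System.Net.Http;
using System.Net.Http.Headers;

// Hypothetical helper: chooses the auth header from the endpoint host.
static HttpRequestMessage BuildRequest(string endpoint, string apiKey)
{
    var uri = new Uri(endpoint);
    var request = new HttpRequestMessage(HttpMethod.Post, uri);

    // Assumption: any host ending in googleapis.com counts as Google native.
    bool isGoogleNative = uri.Host.EndsWith("googleapis.com", StringComparison.OrdinalIgnoreCase);
    if (isGoogleNative)
        request.Headers.Add("x-goog-api-key", apiKey);
    else
        request.Headers.Authorization = new AuthenticationHeaderValue("Bearer", apiKey);

    return request;
}

var google = BuildRequest(
    "https://generativelanguage.googleapis.com/v1beta/models/gemma-4-31b-it:generateContent",
    "google-key");
var local = BuildRequest("http://localhost:8000/v1/chat/completions", "vllm-key");

Console.WriteLine($"Google uses x-goog-api-key: {google.Headers.Contains("x-goog-api-key")}");
Console.WriteLine($"Local auth scheme: {local.Headers.Authorization?.Scheme}");
```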

🆕 MiMo Support and Namespace Changes

License updated: the project license file has changed from MIT to MPL-2.0 (Mozilla Public License 2.0).
VllmMimoChatClient added: supports Xiaomi MiMo cloud models mimo-v2-pro and mimo-v2-flash.
MiMo request compatibility: uses the api-key header and sends extra_body: { "thinking": { "type": "enabled" | "disabled" } } for thinking control.
Namespace change: VllmGlmChatClient moved from Microsoft.Extensions.AI.VllmChatClient.Glm4 to Microsoft.Extensions.AI.
Namespace change: VllmKimiK2ChatClient moved from Microsoft.Extensions.AI.VllmChatClient.Kimi to Microsoft.Extensions.AI.

🆕 Nemotron-3 Super 120B Reasoning Toggle Support

VllmNemotronChatClient updated: tailored for the OpenRouter nvidia/nemotron-3-super-120b-a12b:free model.
Reasoning toggle: use VllmChatOptions.ThinkingEnabled to send reasoning: { enabled: true|false }.
OpenRouter endpoint normalization: https://openrouter.ai/api/v1 is automatically normalized to /chat/completions.
Compatibility tests added: verifies the reasoning.enabled payload in both enabled and disabled modes.
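
For illustration, endpoint normalization of this kind could look like the sketch below. NormalizeChatEndpoint is a hypothetical helper, not the client's actual method, and it assumes the only rule is "append /chat/completions unless it is already there".

```csharp
using System;

// Hypothetical helper: appends /chat/completions unless the endpoint already ends with it.
static string NormalizeChatEndpoint(string endpoint)
{
    const string suffix = "/chat/completions";
    var trimmed = endpoint.TrimEnd('/'); // tolerate a trailing slash
    return trimmed.EndsWith(suffix, StringComparison.OrdinalIgnoreCase)
        ? trimmed
        : trimmed + suffix;
}

var fromBase = NormalizeChatEndpoint("https://openrouter.ai/api/v1");
var fromFull = NormalizeChatEndpoint("https://openrouter.ai/api/v1/chat/completions");

Console.WriteLine(fromBase);
Console.WriteLine(fromFull);
```

Making the helper idempotent means callers can pass either the base URL or the full path without producing a doubled suffix.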

🆕 Claude 4.6 / 4.5 Thinking Chain Support

VllmClaudeChatClient added: Specifically designed for Claude models via platforms like OpenRouter.
Thinking Parameter Adaptation: Supports the new reasoning: { effort: "high" } format introduced in Claude 4.6.
Reasoning Extraction: Efficiently extracts reasoning content from both reasoning (string) and reasoning_details (array) response fields.
Token Optimization: Includes default MaxTokens limits to prevent credit-related errors on cloud providers.

🆕 OpenAI GPT Series Support

VllmOpenAiGptClient added: Specifically designed for OpenAI official or OpenRouter GPT models.
Reasoning Level Control: Fine-tune model reasoning depth via OpenAiGptChatOptions.ReasoningLevel.
Reasoning Toggle: Use ExcludeReasoning to easily include or omit the thinking process from the output.

🆕 DeepSeek V3.2 Thinking Chain Support

VllmDeepseekV3ChatClient thinking chain fixed:
- Corrected request format: DashScope API uses enable_thinking: true (top-level boolean) instead of thinking: {type: "enabled"}.
- reasoning_content field in model responses is now correctly parsed and output.
- Non-streaming: access thinking via ReasoningChatResponse.Reason.
- Streaming: use ReasoningChatResponseUpdate.Thinking to distinguish thinking vs final answer.
- Enable via VllmChatOptions.ThinkingEnabled = true.
- Compatible with DashScope platform deepseek-v3.2 model.

🐛 Bug Fixes

VllmGptOssChatClient Streaming Function Call Bug Fixed:
- Fixed an issue where the stream ended after model returned tool_calls, leaving the final text response empty.
- Added GetStreamingResponseAsync override: automatically detects when the caller has appended tool results to messages and initiates a follow-up streaming request seamlessly.
- StreamChatManualFunctionCallTest now works in a single await foreach loop without needing manual "Second turn" logic.
- Simplified the default system prompt by removing the strict "content must be empty when tool_calls present" constraint.

🆕 GLM 4.6 / 4.7 / 5 Thinking Model Support

VllmGlmChatClient added with full reasoning (thinking) stream separation.
Supports glm-5, glm-4.7, glm-4.7-flash, glm-4.6, glm-4.5.
Compatible with existing tool/function invocation pipeline.
Supports Zhipu official platform thinking parameter via GlmChatOptions.ThinkingEnabled.

🆕 New GPT-OSS-20B/120B Support

VllmGptOssChatClient - Support for OpenAI's GPT-OSS-120B model with full reasoning capabilities
Advanced reasoning chain processing with ReasoningChatResponseUpdate
Compatible with OpenRouter and other GPT-OSS providers
Enhanced debugging and performance optimizations

🆕 GLM-4 Support

VllmGlmZ1ChatClient - Support for GLM-4 models with reasoning capabilities
VllmGlm4ChatClient - Standard GLM-4 chat functionality

🔄 Base Class Refactoring & Model Consolidation

VllmBaseChatClient enhanced: common logic (request building, streaming parsing, reasoning content handling) extracted to base class; subclasses only override specific differences.
VllmDeepseekR1ChatClient refactored: inherits VllmBaseChatClient, retains only DeepSeek R1-specific ReasoningContent streaming logic.
VllmGptOssChatClient refactored: inherits VllmBaseChatClient, significantly reduced duplicate code, enhanced reasoning streaming.
Removed VllmQwen2507ChatClient and VllmQwen2507ReasoningChatClient (consolidated into VllmQwen3NextChatClient).
Removed VllmChatClientNuget.Test project.

🛠️ Local Skill Auto-Loading

VllmChatOptions now supports automatic skill loading from ./skills/*.md files, injected into system prompts.
Controlled via EnableSkills (default true) / SkillDirectoryPath.
Built-in tools ListSkillFiles and ReadSkillFile allow models to query and read skill files during conversation.
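
As a rough picture of what loading skills from a directory of *.md files and injecting them into the system prompt can involve, consider the sketch below. BuildSystemPrompt, the "## Skill:" heading format, and the sample file are all hypothetical; the library's own injection format is not described in this README.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;

// Hypothetical helper, not the library's implementation: appends every *.md file
// found in skillDirectory to the base system prompt.
static string BuildSystemPrompt(string basePrompt, string skillDirectory)
{
    if (!Directory.Exists(skillDirectory))
        return basePrompt; // nothing to inject

    var prompt = new StringBuilder(basePrompt);
    var files = Directory.EnumerateFiles(skillDirectory, "*.md")
        .OrderBy(path => path, StringComparer.Ordinal); // stable ordering

    foreach (var file in files)
    {
        prompt.AppendLine();
        prompt.AppendLine($"## Skill: {Path.GetFileNameWithoutExtension(file)}");
        prompt.AppendLine(File.ReadAllText(file));
    }
    return prompt.ToString();
}

// Usage with a throwaway directory standing in for ./skills
var skillDir = Path.Combine(Path.GetTempPath(), "skills-" + Guid.NewGuid().ToString("N"));
Directory.CreateDirectory(skillDir);
File.WriteAllText(Path.Combine(skillDir, "greeting.md"), "Always greet the user by name.");

var systemPrompt = BuildSystemPrompt("You are a helpful assistant.", skillDir);
Console.WriteLine(systemPrompt);

Directory.Delete(skillDir, recursive: true);
```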

🆕 Qwen3-Next / Qwen 3.5 Multi-Model Adaptation

VllmQwen3NextChatClient now supports multiple model families via modelId:
- qwen3.5-397b-a17b (Qwen 3.5, latest)
- qwen3-next-80b-a3b-thinking / qwen3-next-80b-a3b-instruct
- qwen3-vl-30b-a3b-thinking / qwen3-vl-30b-a3b-instruct (multimodal, image input)
- qwen3-vl-32b-thinking / qwen3-vl-32b-instruct (multimodal)
- qwen3-vl-235b-a22b-thinking / qwen3-vl-235b-a22b-instruct (multimodal, manually verified)
Unified API: switch model by passing the desired modelId in constructor or per-request via ChatOptions.ModelId.
Thinking models expose ReasoningChatResponse / streaming ReasoningChatResponseUpdate; instruct models output standard responses.
New examples: Serial/Parallel tool calls, manual tool orchestration in streaming, JSON-only output formatting.

🆕 Kimi K2 Support

VllmKimiK2ChatClient added.
Supports Kimi models including kimi-k2-thinking and kimi-k2.5.
Seamless reasoning streaming via ReasoningChatResponseUpdate (thinking vs final answer segments).
Full function invocation support (automatic or manual tool call handling).

🆕 Kimi 2.5 Thinking Toggle (Moonshot)

New KimiChatOptions.ThinkingEnabled to control request payload:
- ThinkingEnabled = true → thinking: { "type": "enabled" }
- ThinkingEnabled = false → thinking: { "type": "disabled" }
Kimi reasoning text is taken from reasoningContent / streaming delta.reasoning_content (not </think> markers).

🆕 Gemini 3 Support & Tool Calling

VllmGemini3ChatClient added (Google Gemini API).
Features: text & streaming, ReasoningLevel (Normal/Low), full tool calling (single / parallel / automatic / streaming).
Tests: all Gemini3Test cases pass (including multi-turn and parallel tool calls); GeminiDebugTest covers debugging of native API thought signatures and multi-turn function calls.
Docs: see the docs/Gemini3* document set.

🆕 Gemini 3 OpenRouter Compatibility

VllmGemini3ChatClient now supports both Google native Gemini API and OpenRouter in one client.
Auth header auto-switch by endpoint:
- Google native: x-goog-api-key
- OpenRouter/OpenAI-compatible: Authorization: Bearer ...
OpenRouter reasoning mapping: sends top-level reasoning.enabled to match provider requirements.
Tool-calling protocol compatibility: fixed tool_call_id / tool_calls request field names, and improved multi-turn tool-result roundtrip compatibility.
In OpenRouter tests, thoughtSignature may be absent depending on model/provider behavior; assertions are now provider-tolerant.

🆕 MiniMax-M2.5 Support

VllmMiniMaxChatClient added for MiniMax-M2.5 / M2.1 model support.
Full streaming chat and function calling (parallel tool calls supported).
Compatible with DashScope API endpoint.
Tests: MiniMaxTests covering chat, streaming, function calls (serial/parallel/manual), and JSON output.

🆕 Qwen 3.5 Support

VllmQwen3NextChatClient now supports Qwen 3.5 (qwen3.5-397b-a17b) via DashScope API.
Full reasoning chain and function calling support.
Use the same VllmQwen3NextChatClient with modelId = "qwen3.5-397b-a17b".

🏗️ Supported Clients

Client	Deployment	Model Support	Reasoning	Function Calls
`VllmOpenAiGptClient`	OpenRouter/Cloud	OpenAI GPT Series	✅ Full	✅ Stream
`VllmClaudeChatClient`	OpenRouter/Cloud	Claude 4.6 / 4.5	✅ Full	✅ Stream
`VllmNemotronChatClient`	OpenRouter/Cloud	Nemotron-3 Super 120B (`nvidia/nemotron-3-super-120b-a12b:free`)	✅ Toggle (`reasoning.enabled`)	❌
`VllmGptOssChatClient`	OpenRouter/Cloud	GPT-OSS-120B/20B	✅ Full	✅ Stream
`VllmQwen3ChatClient`	Local vLLM	Qwen3-32B/235B	✅ Toggle	✅ Stream
`VllmQwen3NextChatClient`	Cloud API (DashScope compatible)	Multiple modelIds (e.g. qwen3-next-80b-a3b-thinking / qwen3-next-80b-a3b-instruct)	✅ (thinking model)	✅ Stream
`VllmQwen3NextChatClient`	Cloud API (DashScope compatible)	qwen3-vl-30b-a3b-thinking / qwen3-vl-30b-a3b-instruct	✅ (thinking model)	✅ Stream
`VllmQwen3NextChatClient`	Cloud API (DashScope compatible)	qwen3-vl-32b-thinking / qwen3-vl-32b-instruct	✅ (thinking model)	✅ Stream
`VllmQwen3NextChatClient`	Cloud API (DashScope compatible)	qwen3-vl-235b-a22b-thinking / qwen3-vl-235b-a22b-instruct (manually verified)	✅ (thinking model)	✅ Stream
`VllmQwqChatClient`	Local vLLM	QwQ-32B	✅ Full	✅ Stream
`VllmGemmaChatClient`	Local vLLM	Gemma3-27B	❌	✅ Stream
`VllmGemma4ChatClient`	Google native API / Local vLLM / OpenAI-compatible	gemma-4-31b-it	✅ Toggle	✅ Stream
`VllmGemini3ChatClient`	Cloud API (Google Gemini / OpenRouter)	gemini-3-pro-preview / google/gemini-3.1-*	Signature (hidden, provider-dependent)	✅ Stream
`VllmDeepseekR1ChatClient`	Cloud API	DeepSeek-R1	✅ Full	❌
`VllmDeepseekV3ChatClient`	Cloud API (DashScope / DeepSeek official / DeepSeek Anthropic)	DeepSeek-V3.2 / deepseek-v4 / deepseek-v4-flash	✅ (via `VllmChatOptions`)	✅ Stream
`VllmGlmChatClient`	Cloud API (Zhipu official) / OpenAI compatible	glm-5 / glm-4.6 / glm-4.7 / glm-4.7-flash / glm-4.5	✅ Full (via `GlmChatOptions`)	✅ Stream
`VllmKimiK2ChatClient`	Cloud API (DashScope)	kimi-k2-(thinking/instruct) / kimi-k2.5	✅ (thinking model)	✅ Stream
`VllmMimoChatClient`	Cloud API (Xiaomi MiMo)	mimo-v2-pro / mimo-v2-flash	✅ Toggle (via `extra_body.thinking.type`)	✅ Stream
`VllmMiniMaxChatClient`	Cloud API (DashScope)	MiniMax-M2.5 / M2.1	✅	✅ Stream
`VllmQwen3NextChatClient`	Cloud API (DashScope compatible)	qwen3.5-397b-a17b	✅ (thinking model)	✅ Stream

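Note: Gemini 3 reasoning uses an encrypted thought signature and does not output readable reasoning text. With OpenRouter, thoughtSignature may be missing; in the current implementation, multi-turn function calls complete without explicitly sending the signature back.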

🐳 Docker Deployment Examples

Qwen3 vLLM Deployment:

docker run -it --gpus all -p 8000:8000 \
  -v /models/Qwen3-32B-FP8:/models/Qwen3-32B-FP8 \
  --restart always \
  -e VLLM_USE_V1=1 \
  vllm/vllm-openai:v0.8.5 \
  --model /models/Qwen3-32B-FP8 \
  --enable-auto-tool-choice \
  --tool-call-parser hermes \
  --trust-remote-code \
  --max-model-len 131072 \
  --tensor-parallel-size 2 \
  --gpu_memory_utilization 0.8 \
  --served-model-name "qwen3"

🆕 Qwen3.5 Thinking Toggle Compatibility

VllmQwen3NextChatClient now switches the no-thinking request format based on the API URL:

Alibaba Cloud / DashScope (*.aliyuncs.com) → top-level enable_thinking
Self-hosted vLLM / OpenAI-compatible gateways → top-level chat_template_kwargs.enable_thinking
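
The routing rule above can be sketched as follows. BuildBody is a hypothetical helper and the endpoints are examples; only the two field placements (top-level enable_thinking versus chat_template_kwargs.enable_thinking) come from the description above.

```csharp
using System;
using System.Text.Json.Nodes;

// Hypothetical helper: places the thinking flag according to the endpoint host.
static JsonObject BuildBody(string endpoint, string modelId, bool thinkingEnabled)
{
    var body = new JsonObject { ["model"] = modelId };
    var host = new Uri(endpoint).Host;

    if (host.EndsWith("aliyuncs.com", StringComparison.OrdinalIgnoreCase))
    {
        // Alibaba Cloud / DashScope: top-level enable_thinking
        body["enable_thinking"] = thinkingEnabled;
    }
    else
    {
        // Self-hosted vLLM / OpenAI-compatible gateway: chat_template_kwargs.enable_thinking
        body["chat_template_kwargs"] = new JsonObject { ["enable_thinking"] = thinkingEnabled };
    }
    return body;
}

var dashScopeBody = BuildBody("https://dashscope.aliyuncs.com/compatible-mode/v1", "qwen3.5-397b-a17b", false);
var selfHostedBody = BuildBody("http://localhost:8000/v1", "qwen3.5-397b-a17b", false);

Console.WriteLine(dashScopeBody.ToJsonString());
Console.WriteLine(selfHostedBody.ToJsonString());
```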

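using Microsoft.Extensions.AI;

// API key for the endpoint, read from an environment variable (the variable name is an example)
var apiKey = Environment.GetEnvironmentVariable("VLLM_API_KEY");

var client = new VllmQwen3NextChatClient(
    "https://your-vllm-endpoint/v1/{1}",
    apiKey,
    "qwen3.5-122b-a10b");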

var messages = new List<ChatMessage>
{
    new(ChatRole.System, "你是一个智能助手，名字叫菲菲"),
    new(ChatRole.User, "请仅输出单个json对象格式的问候语，不要输出任何解释、前后缀文本、markdown或 codeblock。")
};

var options = new VllmChatOptions
{
    ThinkingEnabled = false,
    MaxOutputTokens = 1024,
};

var response = await client.GetResponseAsync(messages, options);
Console.WriteLine(response.Text);

For Qwen3.5 models, VllmQwen3NextChatClient also supports:

multimodal image input for qwen3.5* model IDs
legacy text tool-call fallback for <tool_call>...</tool_call> outputs

QwQ vLLM Deployment:

docker run -it --gpus all -p 8000:8000 \
  -v /models/Qwen3-32B-FP8:/models/Qwen3-32B-FP8 \
  --restart always \
  -e VLLM_USE_V1=1 \
  vllm/vllm-openai:v0.8.5 \
  --model /models/Qwen3-32B-FP8 \
  --enable-auto-tool-choice \
  --tool-call-parser llama3_json \
  --trust-remote-code \
  --max-model-len 131072 \
  --tensor-parallel-size 2 \
  --gpu_memory_utilization 0.8 \
  --served-model-name "qwen3"

Gemma3 vLLM Deployment:

docker run -it --gpus all -p 8000:8000 \
  -v /models/gemma-3-27b-it-FP8-Dynamic:/models/gemma-3-27b-it-FP8-Dynamic \
  -v /home/lc/work/gemma3.jinja:/home/lc/work/gemma3.jinja \
  -e TZ=Asia/Shanghai \
  -e VLLM_USE_V1=1 \
  --restart always \
  vllm/vllm-openai:v0.8.2 \
  --model /models/gemma-3-27b-it-FP8-Dynamic \
  --enable-auto-tool-choice \
  --tool-call-parser pythonic \
  --chat-template /home/lc/work/gemma3.jinja \
  --trust-remote-code \
  --max-model-len 128000 \
  --tensor-parallel-size 2 \
  --gpu_memory_utilization 0.8 \
  --served-model-name "gemma3"

💻 Usage Examples

🆕 Gemma 4 Example

using Microsoft.Extensions.AI;

// Google native API
IChatClient gemma4Native = new VllmGemma4ChatClient(
    "https://generativelanguage.googleapis.com/v1beta",
    Environment.GetEnvironmentVariable("GEMINI_API_KEY"),
    "gemma-4-31b-it");

// Local vLLM / OpenAI-compatible API
IChatClient gemma4Vllm = new VllmGemma4ChatClient(
    "http://localhost:8000/v1/{1}",
    Environment.GetEnvironmentVariable("VLLM_API_KEY"),
    "gemma4");

var messages = new List<ChatMessage>
{
    new(ChatRole.System, "你是一个智能助手，名字叫菲菲"),
    new(ChatRole.User, "请介绍一下你自己")
};

var options = new VllmChatOptions
{
    ThinkingEnabled = true,
    MaxOutputTokens = 1024,
};

var response = await gemma4Native.GetResponseAsync(messages, options);
Console.WriteLine(response.Text);

🆕 GLM 4.6/4.7/4.7-Flash Thinking Example

using Microsoft.Extensions.AI;

IChatClient glm46 = new VllmGlmChatClient(
    "http://localhost:8000/{0}/{1}", // or your OpenAI-compatible endpoint
    null,
    "glm-4.6");

// Enable Zhipu official platform thinking chain parameter:
// thinking: { "type": "enabled" }
var opts = new GlmChatOptions { ThinkingEnabled = true };

var messages = new List<ChatMessage>
{
    new(ChatRole.System, "你是一个智能助手，名字叫菲菲"),
    new(ChatRole.User, "解释一下快速排序的思想并举一个简单例子。")
};

string reasoning = string.Empty;
string answer = string.Empty;
await foreach (var update in glm46.GetStreamingResponseAsync(messages, opts))
{
    if (update is ReasoningChatResponseUpdate r)
    {
        if (r.Thinking)
            reasoning += r.Text; // reasoning phase
        else
            answer += r.Text;    // final answer phase
    }
    else
    {
        answer += update.Text;
    }
}
Console.WriteLine($"Reasoning: {reasoning}\nAnswer: {answer}");

🆕 Claude 4.6 / 4.5 with Reasoning (OpenRouter)

using Microsoft.Extensions.AI;

// Initialize Claude client (OpenRouter)
IChatClient claude = new VllmClaudeChatClient(
    "https://openrouter.ai/api/v1",
    "your-api-key",
    "anthropic/claude-4.6-sonnet");

var messages = new List<ChatMessage>
{
    new(ChatRole.System, "你是一个拥有强大逻辑推理能力的智能助手。"),
    new(ChatRole.User, "解释一下为什么天空是蓝色的？请详细思考。")
};

// Enable high-effort reasoning
var options = new VllmChatOptions { ThinkingEnabled = true };

// Non-streaming example:
var response = await claude.GetResponseAsync(messages, options);
if (response is ReasoningChatResponse r)
{
    Console.WriteLine($"🧠 Thinking:\n{r.Reason}");
    Console.WriteLine($"💬 Answer:\n{r.Text}");
}

// Streaming example:
await foreach (var update in claude.GetStreamingResponseAsync(messages, options))
{
    if (update is ReasoningChatResponseUpdate ru)
    {
        if (ru.Thinking)
            Console.Write(ru.Text); // Reasoning phase
        else
            Console.Write(ru.Text); // Answer phase
    }
}

🆕 Nemotron-3 Super 120B with Reasoning Toggle (OpenRouter)

using Microsoft.Extensions.AI;

IChatClient nemotron = new VllmNemotronChatClient(
    "https://openrouter.ai/api/v1",
    "your-openrouter-api-key",
    "nvidia/nemotron-3-super-120b-a12b:free");

var messages = new List<ChatMessage>
{
    new(ChatRole.User, "How many r's are in the word strawberry?")
};

var options = new VllmChatOptions
{
    ThinkingEnabled = true
};

var response = await nemotron.GetResponseAsync(messages, options);

if (response is ReasoningChatResponse reasoningResponse)
{
    Console.WriteLine($"Thinking: {reasoningResponse.Reason}");
}

Console.WriteLine($"Answer: {response.Text}");

🆕 OpenAI GPT Series with Reasoning (OpenRouter)

using Microsoft.Extensions.AI;

// Initialize OpenAI GPT client (OpenRouter)
IChatClient gptClient = new VllmOpenAiGptClient(
    "https://openrouter.ai/api/v1",
    "your-api-key",
    "openai/gpt-5.2-codex");

var messages = new List<ChatMessage>
{
    new(ChatRole.System, "You are a coding expert."),
    new(ChatRole.User, "Write a complex regex for email validation and explain it.")
};

// Set reasoning level and other options
var options = new OpenAiGptChatOptions 
{ 
    ReasoningLevel = OpenAiGptReasoningLevel.High,
    Temperature = 0.5f 
};

// Streaming with reasoning
await foreach (var update in gptClient.GetStreamingResponseAsync(messages, options))
{
    if (update is ReasoningChatResponseUpdate r)
    {
        if (r.Thinking)
            Console.Write(r.Text); // Reasoning phase
        else
            Console.Write(r.Text); // Answer phase
    }
}

🆕 GPT-OSS-120B with Reasoning (OpenRouter)

using System.ComponentModel; // for [Description]
using Microsoft.Extensions.AI;
using Microsoft.Extensions.AI.VllmChatClient.GptOss;

[Description("Gets weather information")]
static string GetWeather(string city) => $"Weather in {city}: Sunny, 25°C";

// Initialize GPT-OSS client
IChatClient gptOssClient = new VllmGptOssChatClient(
    "https://openrouter.ai/api/v1", 
    "your-api-token", 
    "openai/gpt-oss-120b");

var messages = new List<ChatMessage>
{
    new ChatMessage(ChatRole.System, "You are a helpful assistant with reasoning capabilities."),
    new ChatMessage(ChatRole.User, "What's the weather like in Tokyo? Please think through this step by step.")
};

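// GptOssChatOptions is assumed here by analogy with OpenAiGptChatOptions;
// the base ChatOptions type has no ReasoningLevel property.
var chatOptions = new GptOssChatOptions
{
    Temperature = 0.7f,
    ReasoningLevel = GptOssReasoningLevel.Medium,    // Controls the depth of reasoning
    Tools = [AIFunctionFactory.Create(GetWeather)]
};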

// Stream response with reasoning
string reasoning = string.Empty;
string answer = string.Empty;

await foreach (var update in gptOssClient.GetStreamingResponseAsync(messages, chatOptions))
{
    if (update is ReasoningChatResponseUpdate reasoningUpdate)
    {
        if (reasoningUpdate.Thinking)
        {
            // Capture the model's reasoning process
            reasoning += reasoningUpdate.Reasoning;
            Console.WriteLine($"🧠 Thinking: {reasoningUpdate.Reasoning}");
        }
        else
        {
            // Capture the final answer
            answer += reasoningUpdate.Text;
            Console.WriteLine($"💬 Response: {reasoningUpdate.Text}");
        }
    }
}

Console.WriteLine($"\n📝 Full Reasoning: {reasoning}");
Console.WriteLine($"✅ Final Answer: {answer}");

🆕 Qwen3-Next 80B (Thinking vs Instruct)

using Microsoft.Extensions.AI;

// Choose model: reasoning variant or instruct variant
var apiKey = "your-dashscope-api-key";
// Reasoning (with thinking chain)
IChatClient thinkingClient = new VllmQwen3NextChatClient(
    "https://dashscope.aliyuncs.com/compatible-mode/v1/{1}",
    apiKey,
    "qwen3-next-80b-a3b-thinking");

// Instruct (no reasoning chain)
IChatClient instructClient = new VllmQwen3NextChatClient(
    "https://dashscope.aliyuncs.com/compatible-mode/v1/{1}",
    apiKey,
    "qwen3-next-80b-a3b-instruct");

var messages = new List<ChatMessage>
{
    new(ChatRole.System, "You are an intelligent assistant named Feifei."),
    new(ChatRole.User,   "Give a brief introduction to quantum computing.")
};

// Reasoning streaming example
await foreach (var update in thinkingClient.GetStreamingResponseAsync(messages))
{
    if (update is ReasoningChatResponseUpdate r)
    {
        if (r.Thinking)
            Console.Write(r.Text);   // reasoning / thinking phase
        else
            Console.Write(r.Text);   // final answer phase
    }
    else
    {
        Console.Write(update.Text);
    }
}

// Instruct (single response)
var resp = await instructClient.GetResponseAsync(messages);
Console.WriteLine(resp.Text);

🆕 Qwen3-Next Advanced Function Calls (Serial / Parallel / Manual Streaming)

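using System.ComponentModel;
using Microsoft.Extensions.AI;

[Description("Gets the current weather in Nanning")]
static string GetWeather() => "It is raining right now.";

[Description("Search")]
static string Search([Description("The question to search for")] string question) => "No. 1 Zhanqian Road, north of Fangyuan Plaza, Qingxiu District, Nanning.";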

IChatClient baseClient = new VllmQwen3NextChatClient(
    "https://dashscope.aliyuncs.com/compatible-mode/v1/{1}",
    Environment.GetEnvironmentVariable("VLLM_ALIYUN_API_KEY"),
    "qwen3-next-80b-a3b-thinking");

IChatClient client = new ChatClientBuilder(baseClient)
    .UseFunctionInvocation()
    .Build();

var messages = new List<ChatMessage>
{
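    new(ChatRole.System, "You are an intelligent assistant named Feifei. When calling a tool, output only the tool call and no other text."),
    new(ChatRole.User, "Where is Nanning Railway Station? Do I need to bring an umbrella when I go out?")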
};

ChatOptions opts = new()
{
    Tools = [AIFunctionFactory.Create(GetWeather), AIFunctionFactory.Create(Search)]
};

// Parallel tool calls example (also supports serial depending on prompt)
await foreach (var update in client.GetStreamingResponseAsync(messages, opts))
{
    if (update is ReasoningChatResponseUpdate r)
    {
        Console.Write(r.Text);
    }
    else
    {
        Console.Write(update.Text);
    }
}

// Manual streaming tool orchestration
messages = new()
{
    new(ChatRole.System, "You are an intelligent assistant named Feifei."),
    new(ChatRole.User, "Where is Nanning Railway Station? Do I need to bring an umbrella when I go out?")
};
string answer = string.Empty;
await foreach (var update in client.GetStreamingResponseAsync(messages, opts))
{
    if (update.FinishReason == ChatFinishReason.ToolCalls)
    {
        foreach (var fc in update.Contents.OfType<FunctionCallContent>())
        {
            messages.Add(new ChatMessage(ChatRole.Assistant, [fc]));
            if (fc.Name == "GetWeather")
            {
                messages.Add(new ChatMessage(ChatRole.Tool, [new FunctionResultContent(fc.CallId, GetWeather())]));
            }
            else if (fc.Name == "Search")
            {
                messages.Add(new ChatMessage(ChatRole.Tool, [new FunctionResultContent(fc.CallId, Search("Nanning Railway Station"))]));
            }
        }
    }
    else
    {
        answer += update.Text;
    }
}
Console.WriteLine(answer);

🆕 JSON-only Output (No Code Block)

using Microsoft.Extensions.AI;

var messages = new List<ChatMessage>
{
    new(ChatRole.System, "You are an intelligent assistant named Feifei."),
    new(ChatRole.User, "Please output a greeting in JSON format, without using a code block.")
};
var options = new ChatOptions { MaxOutputTokens = 100 };
var resp = await baseClient.GetResponseAsync(messages, options);
var text = resp.Text; // Ensure no ``` code blocks and extract JSON via regex if needed
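
Models do not always honor a "no code block" instruction, so the comment above suggests extracting the JSON if needed. The following is a minimal sketch of one way to do that using only System.Text.Json and System.Text.RegularExpressions. ExtractJson and the sample string are hypothetical and not part of the package; the helper assumes the payload is a single JSON object.

```csharp
using System;
using System.Text.Json;
using System.Text.RegularExpressions;

// Hypothetical model output that wrapped the JSON in a fence despite the prompt.
string marker = new string('`', 3);
string sample = $"Here you go:\n{marker}json\n{{\"greeting\":\"hello\"}}\n{marker}";

string? json = ExtractJson(sample);
Console.WriteLine(json ?? "<no valid JSON found>");

// Hypothetical helper (not a package API): returns the first JSON object
// found in the text, or null if nothing parses as JSON.
static string? ExtractJson(string text)
{
    string fence = new string('`', 3);

    // Prefer the contents of a fenced block when one is present.
    Match fenced = Regex.Match(
        text,
        fence + @"(?:json)?\s*(.*?)\s*" + fence,
        RegexOptions.Singleline);
    string candidate = fenced.Success ? fenced.Groups[1].Value : text.Trim();

    // Narrow to the outermost braces to drop any surrounding chatter.
    int start = candidate.IndexOf('{');
    int end = candidate.LastIndexOf('}');
    if (start < 0 || end <= start)
    {
        return null;
    }
    candidate = candidate[start..(end + 1)];

    // Validate by parsing; reject anything that is not well-formed JSON.
    try
    {
        using JsonDocument doc = JsonDocument.Parse(candidate);
        return candidate;
    }
    catch (JsonException)
    {
        return null;
    }
}
```

Validating with JsonDocument.Parse rather than trusting the regex alone means malformed output yields null instead of a string that fails later during deserialization.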

Qwen3 with Reasoning Toggle

using System.ComponentModel;
using Microsoft.Extensions.AI;

[Description("Gets the weather")]
static string GetWeather() => Random.Shared.NextDouble() > 0.1 ? "It's sunny" : "It's raining";

IChatClient vllmclient = new VllmQwen3ChatClient("http://localhost:8000/{0}/{1}", null, "qwen3");
IChatClient client2 = new ChatClientBuilder(vllmclient)
    .UseFunctionInvocation()
    .Build();

var messages2 = new List<ChatMessage>
{
    new ChatMessage(ChatRole.System, "You are an intelligent assistant named Feifei."),
    new ChatMessage(ChatRole.User, "What's the weather like today?")
};

Qwen3ChatOptions chatOptions = new()
{
    Tools = [AIFunctionFactory.Create(GetWeather)],
    NoThinking = true  // Toggle reasoning on/off
};

string res = string.Empty;
await foreach (var update in client2.GetStreamingResponseAsync(messages2, chatOptions))
{
    res += update.Text;
}

QwQ with Full Reasoning Support

using System.ComponentModel;
using Microsoft.Extensions.AI;

[Description("Gets the weather")]
static string GetWeather() => Random.Shared.NextDouble() > 0.5 ? "It's sunny" : "It's raining";

IChatClient vllmclient2 = new VllmQwqChatClient("http://localhost:8000/{0}/{1}", null, "qwq");

var messages3 = new List<ChatMessage>
{
    new ChatMessage(ChatRole.System, "You are an intelligent assistant named Feifei."),
    new ChatMessage(ChatRole.User, "What's the weather like today?")
};

ChatOptions chatOptions2 = new()
{
    Tools = [AIFunctionFactory.Create(GetWeather)]
};

// Stream with reasoning separation
async Task<(string answer, string reasoning)> StreamChatResponseAsync(
    List<ChatMessage> messages, ChatOptions chatOptions)
{
    string answer = string.Empty;
    string reasoning = string.Empty;
    
    await foreach (var update in vllmclient2.GetStreamingResponseAsync(messages, chatOptions))
    {
        if (update is ReasoningChatResponseUpdate reasoningUpdate)
        {
            if (!reasoningUpdate.Thinking)
            {
                answer += reasoningUpdate.Text;
            }
            else
            {
                reasoning += reasoningUpdate.Text;
            }
        }
        else
        {
            answer += update.Text;
        }
    }
    return (answer, reasoning);
}

var (answer3, reasoning3) = await StreamChatResponseAsync(messages3, chatOptions2);

DeepSeek-R1 with Reasoning

using Microsoft.Extensions.AI;

IChatClient client3 = new VllmDeepseekR1ChatClient(
    "https://dashscope.aliyuncs.com/compatible-mode/v1/{1}", 
    "your-api-key", 
    "deepseek-r1");

var messages4 = new List<ChatMessage>
{
    new ChatMessage(ChatRole.System, "You are an intelligent assistant named Feifei."),
    new ChatMessage(ChatRole.User, "Who are you?")
};

string res4 = string.Empty;
string think = string.Empty;

await foreach (ReasoningChatResponseUpdate update in client3.GetStreamingResponseAsync(messages4))
{
    if (update.Thinking)
    {
        think += update.Text;
    }
    else
    {
        res4 += update.Text;
    }
}

🆕 DeepSeek-V3.2 / DeepSeek-V4 with Thinking Chain

using Microsoft.Extensions.AI;

// Initialize DeepSeek client (DashScope API / DeepSeek official API)
IChatClient dsV3 = new VllmDeepseekV3ChatClient(
    "https://dashscope.aliyuncs.com/compatible-mode/v1/{1}",
    "your-api-key",
    "deepseek-v3.2");

// VllmDeepseekV3ChatClient now also supports DeepSeek-V4 series models:
// - deepseek-v4
// - deepseek-v4-flash

var messages = new List<ChatMessage>
{
    new(ChatRole.System, "You are an intelligent assistant named Feifei."),
    new(ChatRole.User, "Please explain the theory of relativity.")
};

// Enable thinking chain via VllmChatOptions
var options = new VllmChatOptions { ThinkingEnabled = true };

// Non-streaming: access reasoning via ReasoningChatResponse.Reason
var response = await dsV3.GetResponseAsync(messages, options);
if (response is ReasoningChatResponse reasoningResponse)
{
    Console.WriteLine($"🧠 Thinking: {reasoningResponse.Reason}");
    Console.WriteLine($"💬 Answer: {reasoningResponse.Text}");
}

// Streaming: distinguish thinking vs answer phases
string thinking = string.Empty;
string answer = string.Empty;
await foreach (var update in dsV3.GetStreamingResponseAsync(messages, options))
{
    if (update is ReasoningChatResponseUpdate r)
    {
        if (r.Thinking)
            thinking += r.Text;  // reasoning phase
        else
            answer += r.Text;    // final answer phase
    }
    else
    {
        answer += update.Text;
    }
}
Console.WriteLine($"🧠 Thinking: {thinking}");
Console.WriteLine($"💬 Answer: {answer}");

🔧 Advanced Features

Reasoning Chain Processing

All reasoning-capable clients emit ReasoningChatResponseUpdate instances while streaming; check Thinking to tell the reasoning phase from the final answer:

await foreach (var update in client.GetStreamingResponseAsync(messages, options))
{
    if (update is ReasoningChatResponseUpdate reasoningUpdate)
    {
        if (reasoningUpdate.Thinking)
        {
            // Process thinking/reasoning content
            Console.WriteLine($"🤔 Reasoning: {reasoningUpdate.Reasoning}");
        }
        else
        {
            // Process final response
            Console.WriteLine($"💬 Answer: {reasoningUpdate.Text}");
        }
    }
}

Function Calling with Streaming

All clients support real-time function calling:

[Description("Search for location information")]
static string Search([Description("Search query")] string query)
{
    return "Location found: Beijing, China";
}

ChatOptions options2 = new()
{
    Tools = [AIFunctionFactory.Create(Search)],
    Temperature = 0.7f
};

await foreach (var update in client.GetStreamingResponseAsync(messages, options2))
{
    // Handle function calls and responses in real-time
    foreach (var content in update.Contents)
    {
        if (content is FunctionCallContent functionCall)
        {
            Console.WriteLine($"🔧 Calling: {functionCall.Name}");
        }
    }
}

🏆 Performance & Optimizations

Stream Processing: Efficient real-time response handling
Memory Management: Optimized for long conversations
Error Handling: Robust error recovery and debugging support
JSON Parsing: High-performance serialization with System.Text.Json
Connection Pooling: Shared HttpClient for optimal resource usage

📋 Requirements

.NET 10.0 or higher
Microsoft.Extensions.AI framework
System.Text.Json with source generation for JSON processing

NativeAOT Support

Ivilson.AI.VllmChatClient is built with AOT compatibility analyzers enabled. The core package does not depend on Newtonsoft.Json and uses System.Text.Json source-generation metadata for internal request and response DTOs.

API mode	Non-streaming	Streaming	Tool calls	Notes
Chat Completions	Supported	Supported	Supported	Tool arguments are surfaced as `JsonElement`-backed dictionary values.
Responses	Supported	Supported	Supported	Covered by the NativeAOT smoke publish project.
Anthropic Messages	Supported	Supported	Supported	Content blocks and tool input use explicit DTOs/`JsonElement`.

Dynamic user payloads are intentionally kept at the boundary. For custom tool argument/result models, prefer passing JsonSerializerOptions backed by a source-generated JsonSerializerContext; otherwise values may be returned as JsonElement instead of arbitrary runtime objects.
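
As a minimal sketch of that recommendation, the snippet below builds JsonSerializerOptions backed by a source-generated JsonSerializerContext and uses them to turn a JsonElement into a typed model. WeatherQuery and ToolJsonContext are hypothetical names used only for illustration, and the JSON literal stands in for a tool argument payload. How the options are handed to a particular client or tool factory is not shown here and depends on the API you call.

```csharp
using System;
using System.Text.Json;
using System.Text.Json.Serialization;

// Options whose type metadata comes from the source-generated context,
// so no reflection-based serialization is needed under NativeAOT.
var options = new JsonSerializerOptions(JsonSerializerDefaults.Web)
{
    TypeInfoResolver = ToolJsonContext.Default
};

// Stand-in for a tool argument that arrives as a JsonElement.
using JsonDocument doc = JsonDocument.Parse("""{"city":"Tokyo","unit":"celsius"}""");
JsonElement raw = doc.RootElement;

// Convert the JsonElement into the typed model using the generated metadata.
WeatherQuery? query = raw.Deserialize<WeatherQuery>(options);
Console.WriteLine($"{query?.City} / {query?.Unit}");

// Hypothetical tool argument model.
public sealed record WeatherQuery(string City, string Unit);

// Hypothetical source-generation context; list every custom model type here.
[JsonSerializable(typeof(WeatherQuery))]
internal partial class ToolJsonContext : JsonSerializerContext
{
}
```

Types that are not registered on the context have no metadata available, which is the situation where values stay as JsonElement rather than becoming typed objects.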

🤝 Contributing

Contributions are welcome! Please feel free to submit issues, feature requests, or pull requests.

📄 License

This project is licensed under the MPL-2.0 License. See the LICENSE file for details.

Product	Compatible and additional computed target framework versions.
.NET	net10.0 is compatible. net10.0-android was computed. net10.0-browser was computed. net10.0-ios was computed. net10.0-maccatalyst was computed. net10.0-macos was computed. net10.0-tvos was computed. net10.0-windows was computed.

net10.0
- Microsoft.Extensions.AI (>= 10.5.2)

NuGet packages

This package is not used by any NuGet packages.

GitHub repositories

This package is not used by any popular GitHub repositories.

Version	Downloads	Last Updated
2.0.20	136	6/2/2026
2.0.18	137	5/18/2026
2.0.16	120	5/12/2026
2.0.12	112	5/11/2026
2.0.11	128	4/27/2026
2.0.9	111	4/27/2026
2.0.5	128	4/16/2026
2.0.3	133	4/5/2026
2.0.2	143	4/3/2026
2.0.1	118	4/3/2026
2.0.0	134	3/21/2026
1.9.7	119	3/19/2026
1.9.6	147	3/15/2026
1.9.5	129	3/11/2026
1.9.4	127	3/11/2026
1.9.3	118	3/11/2026
1.9.2	128	3/5/2026
1.9.1	126	3/3/2026
1.9.0	127	2/26/2026
1.8.9	122	2/26/2026

🔄 `VllmQwen3NextChatClient` Refactoring: Unified Multi-Model Adaptation