TensorRT-LLM 支持的模型格式和配置方法

支持的模型格式

TensorRT-LLM 主要支持从以下格式的模型进行转换：

1. Hugging Face 格式（最常用）

支持绝大多数流行的 Transformer 模型
通过 from_hugging_face() API 直接加载

2. Meta 原始格式

例如 LLaMA 的原始 checkpoint 格式
使用 from_meta_ckpt() 接口

3. JAX/Keras 格式

如 Gemma 模型支持 JAX 和 Keras 格式

4. NVIDIA NeMo 格式

可通过 NeMo Export-Deploy 工具链转换

5. DeepSpeed 格式

支持 DeepSpeed checkpoint 转换

支持的模型架构

TensorRT-LLM 支持超过 50 种主流模型

大语言模型：

LLaMA 系列：LLaMA、LLaMA 2、LLaMA 3/3.1、Code LLaMA
Mistral 系列：Mistral、Mixtral、Mistral NeMo
Qwen 系列：Qwen、Qwen1.5、Qwen2、Qwen3
其他：GPT、GPT-J、GPT-NeoX、Falcon、Baichuan、ChatGLM、Phi、Gemma、DeepSeek 等

多模态模型：

LLaVA、CogVLM、BLIP2、Phi-3-vision 等

配置流程

TensorRT-LLM 的工作流程分为三个阶段

原始模型 → [转换] → TensorRT-LLM Checkpoint → [构建] → TensorRT Engine → [加载] → 推理

第一步：模型转换（Convert）

方式 1：使用 convert_checkpoint.py 脚本

# 以 LLaMA 为例
python examples/models/core/llama/convert_checkpoint.py \
    --model_dir /path/to/hf/model \
    --output_dir ./tllm_checkpoint \
    --dtype float16

方式 2：使用 Python API

from tensorrt_llm.models.llama import LLaMAForCausalLM

# 从 Hugging Face 格式转换
llama = LLaMAForCausalLM.from_hugging_face(
    hf_model_dir="/path/to/hf/model",
    dtype="float16",
    mapping=mapping  # 并行配置
)

# 保存 checkpoint
llama.save_checkpoint(output_dir)

第二步：构建引擎（Build）

使用 trtllm-build CLI 工具

trtllm-build \
  --checkpoint_dir ./tllm_checkpoint \
  --output_dir ./trt_engines \
  --gemm_plugin float16 \
  --max_batch_size 64 \
  --max_input_len 512 \
  --max_output_len 256

关键参数说明：

--checkpoint_dir: 转换后的 checkpoint 路径
--output_dir: 输出的引擎文件路径
--gemm_plugin: 精度格式（float16/fp8等）
--max_batch_size: 最大批处理大小
--max_input_len/--max_output_len: 最大输入/输出长度
--tp_size: 张量并行大小（多 GPU 时使用）

第三步：推理部署（Serve）

方式 1：Python LLM API

from tensorrt_llm import BuildConfig, SamplingParams
from tensorrt_llm._tensorrt_engine import LLM

config = BuildConfig(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    max_input_len=4096,
    max_output_len=256,
)

llm = LLM(config)

sampling_params = SamplingParams(temperature=0.7, top_p=0.9)
outputs = llm.generate(["Your prompt here"], sampling_params=sampling_params)

方式 2：OpenAI 兼容服务

trtllm-serve "meta-llama/Meta-Llama-3-8B-Instruct" \
  --max_batch_size 64 \
  --port 8000

方式 3：Triton Inference Server

使用 TensorRT-LLM backend 进行生产级部署

量化配置

TensorRT-LLM 支持多种量化格式以优化性能

量化类型	适用场景	配置方法
FP16	默认推荐，精度损失最小	`--dtype float16`
FP8	Hopper 架构 GPU（H100等）	需硬件支持
INT8 SmoothQuant	平衡精度和性能	使用 `quantize.py`
INT4 AWQ	极致压缩，大模型部署	使用 Modelopt 工具
W4A16/W4A8	权重量化	通过量化 API 配置

量化示例：

# 使用 Modelopt 进行量化
from tensorrt_llm.models import PretrainedModel

PretrainedModel.quantize(
    hf_model_dir="/path/to/model",
    output_dir="./quantized_checkpoint",
    quant_config=quant_config,
    mapping=mapping
)

硬件和软件要求

支持的 GPU 架构：

Ampere（A100 等）
Hopper（H100 等）
Ada Lovelace（RTX 4090 等）
Blackwell（B200 等）

关键依赖：

CUDA Toolkit
TensorRT
PyTorch（TensorRT-LLM 基于 PyTorch 架构）

总结建议

入门路径：从 Hugging Face 模型开始，使用 convert_checkpoint.py → trtllm-build → trtllm-serve
精度选择：先用 FP16 建立基线，再根据需求尝试 FP8 或 INT4 量化
多 GPU 部署：通过 --tp_size 配置张量并行
生产环境：建议使用 Triton Inference Server 进行服务化部署

#技术支持 #技术前沿 #本地模型

#TensorRT-LLM #本地部署

TensorRT-LLM 支持的模型格式和配置方法

http://localhost:8090//archives/1770953113910

作者

昊昱天合

发布于

2026年02月13日

更新于

2026年02月13日

许可协议

云上OpenClaw(Clawdbot)快速接入企业微信指南上一篇

昊昱天合，给您拜年了！下一篇