Fine-Tuning API Reference
The following APIs are built on the open-source Tinker project from Thinking Machines Lab (Apache License 2.0).
We thank the open-source community for its contributions; on top of that foundation, the APIs have been deeply customized and enhanced for the cloud training and inference infrastructure of 潞晨云 (Luchen Cloud).
Service Client API
get_server_capabilities()
Description
Queries the models and capabilities currently supported by the API server and returns the list of models available for training or inference. Typically used as a first check of server availability and model information.
Input
None
Output
A GetServerCapabilitiesResponse object containing:
- supported_models (List[SupportedModel]): list of model info objects; each SupportedModel contains:
  - model_name (Optional[str]): model name / identifier
Example
import hpcai
BASE_URL = "https://cloud.luchentech.com/finetunesdk"
API_KEY = "your-api-key-here"
service_client = hpcai.ServiceClient(base_url=BASE_URL, api_key=API_KEY)
capabilities = service_client.get_server_capabilities()
print(f"Server supports {len(capabilities.supported_models)} models")
for model in capabilities.supported_models:
if model.model_name:
print(f" - {model.model_name}")
Notes
An async variant get_server_capabilities_async() is also available.
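A minimal async sketch, assuming get_server_capabilities_async() returns an awaitable that can be driven with asyncio (this calling convention is an assumption, not confirmed by this page):
import asyncio
import hpcai
BASE_URL = "https://cloud.luchentech.com/finetunesdk"
API_KEY = "your-api-key-here"
async def main():
    service_client = hpcai.ServiceClient(base_url=BASE_URL, api_key=API_KEY)
    # Assumption: the async variant is awaitable and returns the same response type
    capabilities = await service_client.get_server_capabilities_async()
    for model in capabilities.supported_models:
        if model.model_name:
            print(model.model_name)
asyncio.run(main())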
create_lora_training_client(base_model, rank=32, seed=None, train_mlp=True, train_attn=True, train_unembed=True)
Description
Creates a LoRA-based training model instance and returns a TrainingClient for training it.
Input
- base_model (str): base model name; must be one of the models supported by the fine-tuning service
- rank (int, default 32): LoRA rank, which affects model capacity and the number of trainable parameters
- seed (optional int): random seed for reproducible LoRA weight initialization
- train_mlp (bool, default True): whether to apply LoRA to the MLP layers
- train_attn (bool, default True): whether to apply LoRA to the attention layers
- train_unembed (bool, default True): whether to apply LoRA to the output (unembed) layer
Output
TrainingClient: a training client bound to the newly created LoRA model
Example
import hpcai
BASE_URL = "https://cloud.luchentech.com/finetunesdk"
API_KEY = "your-api-key-here"
service_client = hpcai.ServiceClient(base_url=BASE_URL, api_key=API_KEY)
training_client = service_client.create_lora_training_client(
base_model="Qwen/Qwen3-8B"
)
print(f"Created model ID {training_client.model_id}")
Custom LoRA configuration:
training_client = service_client.create_lora_training_client(
base_model="Qwen/Qwen2.5-7B-Instruct",
rank=64,
seed=42,
train_mlp=False,
train_attn=True, # only fine tune on attention layers
train_unembed=False
)
Notes
- The LoRA configuration is customizable, e.g. fine-tuning only the attention layers
- An async variant create_lora_training_client_async() is also available
create_training_client_from_state(path)
Description
Restores a training session from a saved checkpoint so that training can continue.
Input
path (str): checkpoint URI, for example:
- training checkpoint: hpcai://{training_run_id}/weights/{checkpoint_id}
- sampler (inference) checkpoint: hpcai://{training_run_id}/sampler_weights/{checkpoint_id}
Output
TrainingClient: a training client bound to the restored model
Example
Resume training from a saved checkpoint:
import hpcai
BASE_URL = "https://cloud.luchentech.com/finetunesdk"
API_KEY = "your-api-key-here"
CHECKPOINT_PATH = "hpcai://{model_id}/weights/step_0010"
service_client = hpcai.ServiceClient(base_url=BASE_URL, api_key=API_KEY)
training_client = service_client.create_training_client_from_state(CHECKPOINT_PATH)
print(f"Restored model ID: {training_client.model_id}")
# Continue training
training_data = [...] # Your training data
fwd_bwd = training_client.forward_backward(training_data, loss_fn="cross_entropy")
optim = training_client.optim_step(hpcai.AdamParams(learning_rate=1e-4))
fwd_bwd.result()
optim.result()
# Clean up
training_client.unload_model().result()
Use a RestClient to look up a checkpoint path and load the model:
import hpcai
BASE_URL = "https://cloud.luchentech.com/finetunesdk"
API_KEY = "your-api-key-here"
service_client = hpcai.ServiceClient(base_url=BASE_URL, api_key=API_KEY)
rest_client = service_client.create_rest_client()
# List checkpoints for a training run
training_run_id = "eb78693b-380d-40f8-a709-ffe0da185718"
checkpoints_response = rest_client.list_checkpoints(training_run_id).result()
# Find the latest checkpoint
if checkpoints_response.checkpoints:
latest_checkpoint = checkpoints_response.checkpoints[-1]
checkpoint_path = latest_checkpoint.checkpoint_path
print(f"Restoring from: {checkpoint_path}")
# Restore training client
training_client = service_client.create_training_client_from_state(checkpoint_path)
print(f"Restored model ID: {training_client.model_id}")
create_training_client(model_id: types.ModelID | None = None)
Description
Creates a TrainingClient instance for working with an existing model or a new one.
If model_id is provided, the client is bound to that model; if omitted, the client is not bound to any model and a checkpoint must be loaded (or a model created) before it can be used.
Parameters
- model_id (types.ModelID | None, optional): unique identifier of a model that already exists on the server.
  - If provided, the client is bound to that model.
  - If omitted (default None), the client is unbound; load a checkpoint or create a model before using it.
Output
- TrainingClient: a client instance for performing training operations.
Examples
1. Create a client for an existing model
import hpcai
BASE_URL = "https://cloud.luchentech.com/finetunesdk"
API_KEY = "your-api-key-here"
# ID of an existing model
existing_model_id = "eb78693b-380d-40f8-a709-ffe0da185718"
service_client = hpcai.ServiceClient(base_url=BASE_URL, api_key=API_KEY)
training_client = service_client.create_training_client(model_id=existing_model_id)
print(f"Client bound to model: {training_client.model_id}")
# Use the client for training
training_data = [...]  # your training data
fwd_bwd = training_client.forward_backward(training_data, loss_fn="cross_entropy")
optim = training_client.optim_step(hpcai.AdamParams(learning_rate=1e-4))
fwd_bwd.result()
optim.result()
2. Without model_id: load a checkpoint first, then train
import hpcai
BASE_URL = "https://cloud.luchentech.com/finetunesdk"
API_KEY = "your-api-key-here"
CHECKPOINT_PATH = "hpcai://eb78693b-380d-40f8-a709-ffe0da185718/weights/step_0010"
service_client = hpcai.ServiceClient(base_url=BASE_URL, api_key=API_KEY)
# Create a client that is not yet bound to a model
training_client = service_client.create_training_client(model_id=None)
# Load a checkpoint to bind the model
load_future = training_client.load_state(CHECKPOINT_PATH)
load_future.result()
print(f"Model loaded: {training_client.model_id}")
# The client can now be used for training
training_data = [...]  # your training data
result = training_client.forward_backward(training_data, loss_fn="cross_entropy")
result.result()
create_rest_client()
Description
Creates a RestClient instance for accessing the REST API, which supports querying training run information, checkpoint lists, and other metadata.
Parameters
None
Output
- RestClient: a client instance for REST API operations such as querying checkpoints and training run information.
Example
import hpcai
BASE_URL = "https://cloud.luchentech.com/finetunesdk"
API_KEY = "your-api-key-here"
TRAINING_RUN_ID = "eb78693b-380d-40f8-a709-ffe0da185718"
service_client = hpcai.ServiceClient(base_url=BASE_URL, api_key=API_KEY)
rest_client = service_client.create_rest_client()
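The RestClient can then call the REST endpoints documented below, for example listing training runs (a short sketch based on the list_training_runs() API described later on this page):
# List the first few training runs with the RestClient created above
response = rest_client.list_training_runs(limit=5, offset=0).result()
for run in response.training_runs:
    print(run.training_run_id, run.base_model)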
Training Client API
forward(data, loss_fn)
Description
Runs a forward pass through the model (no gradients are computed).
Input
- data (List[Datum]): input data
- loss_fn (str): loss function name (currently only "cross_entropy" is supported)
Output
APIFuture[ForwardBackwardOutput]: an asynchronous result handle. Example result:
ForwardBackwardOutput(
loss_fn_output_type='cross_entropy',
loss_fn_outputs=[
{
'logprobs': TensorData(...),
'elementwise_loss': TensorData(...)
}
],
metrics={
'loss:mean': 4.447656154632568,
'num_examples:sum': 1.0,
'step:max': 0.0
}
)
Example
from hpcai import types
data = [
types.Datum(
model_input=types.ModelInput.from_ints([1, 2, 3, 4, 5]),
loss_fn_inputs={
"target_tokens": types.TensorData(
data=[2, 3, 4, 5, 6],
dtype="int64",
shape=[5]
),
"weights": types.TensorData(
data=[1.0, 1.0, 1.0, 1.0, 1.0],
dtype="float32",
shape=[5]
)
}
)
]
out = training_client.forward(data, loss_fn="cross_entropy")
res = out.result()
print(res)
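The aggregated metrics shown in the example output above can be read directly from the result; a small sketch:
# Read the mean cross-entropy loss from the metrics of the result above
mean_loss = res.metrics["loss:mean"]
print(f"mean loss: {mean_loss:.4f}")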
forward_backward(data, loss_fn)
Description
Runs a forward and backward pass (computing gradients); gradients accumulate until optim_step is called.
Input
Same as forward
Output
Same as forward
Example
from hpcai import types
data = [
types.Datum(
model_input=types.ModelInput.from_ints([1, 2, 3, 4, 5]),
loss_fn_inputs={
"target_tokens": types.TensorData(
data=[2, 3, 4, 5, 6],
dtype="int64",
shape=[5]
),
"weights": types.TensorData(
data=[1.0, 1.0, 1.0, 1.0, 1.0],
dtype="float32",
shape=[5]
)
}
)
]
out = training_client.forward_backward(data, loss_fn="cross_entropy")
res = out.result()
print(res)
optim_step(adam_params)
Description
Performs one Adam optimizer step, applying the accumulated gradients to the model parameters.
Input
- adam_params: Adam parameter configuration
Output
OptimStepResponse (contains optimizer metrics)
Example
res = training_client.optim_step(
adam_params=types.AdamParams(learning_rate=1e-4)
).result()
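A typical training step pairs forward_backward with optim_step, mirroring the pattern used in the other examples on this page (training_data is a placeholder):
# One complete training step: forward/backward accumulates gradients, optim_step applies them
training_data = [...]  # your training data (see forward_backward above)
fwd_bwd = training_client.forward_backward(training_data, loss_fn="cross_entropy")
optim = training_client.optim_step(types.AdamParams(learning_rate=1e-4))
fwd_bwd.result()
optim.result()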
get_info()
Description
Returns metadata and configuration for the current training session, including model_id (the training run ID), model architecture information, and the LoRA configuration.
Input
None
Output
Returns a GetInfoResponse object with the following fields:
- type: response type (usually get_info)
- model_data: ModelData object containing the model architecture and model name
- model_id: unique ID of the model / training run
- is_lora: whether LoRA is enabled
- lora_rank: LoRA rank (if LoRA is enabled)
- model_name: base model name
Example
info = training_client.get_info()
print(info)
Example output
GetInfoResponse(
type='get_info',
model_data=ModelData(arch='', model_name='Qwen/Qwen3-8B'),
model_id='e5c88495-46e9-43df-9bf8-3185aceaa222',
is_lora=True,
lora_rank=16,
model_name='Qwen/Qwen3-8B'
)
get_tokenizer()
Description
Returns the pretrained transformers tokenizer for the model used by the current training run.
It can be used for encoding, decoding, and other operations, just like a native Hugging Face tokenizer.
Input
None
Output
Returns a Hugging Face PreTrainedTokenizer object.
Example
tokenizer = training_client.get_tokenizer()
print(tokenizer.encode("Hello, world!"))
Example output
[9707, 11, 1879, 0]
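The tokenizer pairs naturally with types.ModelInput.from_ints() when building training data. A minimal sketch (the next-token targets and uniform weights below are illustrative assumptions, not a prescribed recipe):
from hpcai import types
text = "Hello, world!"
tokens = tokenizer.encode(text)
# Predict each next token from the previous ones, with uniform per-token weights (illustrative)
datum = types.Datum(
    model_input=types.ModelInput.from_ints(tokens[:-1]),
    loss_fn_inputs={
        "target_tokens": types.TensorData(data=tokens[1:], dtype="int64", shape=[len(tokens) - 1]),
        "weights": types.TensorData(data=[1.0] * (len(tokens) - 1), dtype="float32", shape=[len(tokens) - 1]),
    },
)
out = training_client.forward_backward([datum], loss_fn="cross_entropy")
print(out.result().metrics)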
save_state(name: str)
Description
Saves the trainable weights of the loaded model (for example, only the LoRA adapter weights when LoRA is enabled) so that training can be resumed later. Returns an async handle (APIFuture) that resolves to a SaveWeightsResponse.
Parameters
- name (str): a user-defined checkpoint ID identifying the saved weights.
Output
APIFuture[SaveWeightsResponse]: an async object resolving to a response that contains the URL of the saved weights, in the form:
hpcai://{model_id}/weights/{checkpoint_id}_training
Example
# Save the training weights
res = training_client.save_state("initial").result()
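The saved weights can later be restored with load_state(). A minimal sketch that constructs the path from the format shown above (assuming model_id is the current training run ID):
# Resume later from the checkpoint saved above; the path follows the documented format
checkpoint_path = f"hpcai://{training_client.model_id}/weights/initial_training"
training_client.load_state(checkpoint_path).result()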
Async variant
save_state_async(name: str)
load_state(path: str)
Description
Loads the trainable weights of a user-specified checkpoint (for example, only the LoRA adapter weights when LoRA is enabled). Returns an async handle (APIFuture) that resolves to a LoadWeightsResponse.
Parameters
- path (str): URL of a saved checkpoint, in one of the following forms:
  - sampler checkpoint (for inference): "hpcai://{model_id}/weights/{checkpoint_id}_sampler"
  - training checkpoint (for resuming training): "hpcai://{model_id}/weights/{checkpoint_id}_training"
Output
- APIFuture[LoadWeightsResponse]: an async object resolving to a response that contains the URL of the loaded weights.
Example
# Load a training checkpoint
res = training_client.load_state(
'hpcai://e8b29733-9efa-476a-b62b-40b8f9c6d999/weights/initial_training'
).result()
Async variant
load_state_async(path: str)
save_weights_for_sampler(name: str)
Description
Saves the trainable weights of the loaded model (for example, only the LoRA adapter weights when LoRA is enabled) for inference. Returns an async handle (APIFuture) that resolves to a SaveWeightsForSamplerResponse.
Parameters
- name (str): a user-defined checkpoint ID identifying the saved weights.
Output
- APIFuture[SaveWeightsForSamplerResponse]: an async object resolving to a response that contains the URL of the saved weights, in the form:
hpcai://{model_id}/weights/{checkpoint_id}_sampler
Example
# Save model weights for inference
res = training_client.save_weights_for_sampler("initial").result()
Async variant
save_weights_for_sampler_async(name: str)
unload_model()
Description
Stops the current training session and releases resources (e.g. GPUs).
Note: this API does not save model weights automatically; previously saved checkpoints are unaffected and remain available for download.
Parameters
- None
Output
- APIFuture[UnloadModelResponse]: an async object resolving to a response that contains the ID of the unloaded model
Example
# Stop training and release resources
res = training_client.unload_model().result()
# Saved checkpoints can still be listed after unloading
checkpoints = rest_client.list_checkpoints(training_client.model_id).result()
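Because unload_model() does not save weights automatically, a common pattern is to save a checkpoint right before unloading (a minimal sketch using the save_state() API documented above):
# Save trainable weights first, since unload_model() does not save them automatically
training_client.save_state("before_unload").result()
# Then stop the session and release GPU resources
training_client.unload_model().result()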
Async variant
unload_model_async()
RestClient API
list_training_runs(limit=20, offset=0)
Description
Returns a paginated list of training runs.
Input
- limit: maximum number of entries to return
- offset: pagination offset
Output
TrainingRunsResponse: a response object containing:
- training_runs (list[TrainingRun]): list of training run objects; each contains:
  - training_run_id (str): unique identifier of the training run
  - base_model (str): base model name
  - model_owner (str): creator/owner of the model
  - is_lora (bool): whether this is a LoRA model
  - corrupted (bool): whether the model is in a corrupted state
  - lora_rank (int | None): LoRA rank if applicable, otherwise None
  - last_request_time (datetime): time of the most recent request for this training run
  - last_checkpoint (Checkpoint | None): latest training checkpoint
  - last_sampler_checkpoint (Checkpoint | None): latest sampler (inference) checkpoint
- cursor (Cursor): pagination metadata, containing:
  - offset (int): offset used for this request
  - limit (int): maximum number of entries for this request
  - total_count (int): total number of training runs available
Example
import hpcai
BASE_URL = "https://cloud.luchentech.com/finetunesdk"
API_KEY = "your-api-key-here"
service_client = hpcai.ServiceClient(base_url=BASE_URL, api_key=API_KEY)
rest_client = service_client.create_rest_client()
# List first 10 training runs
response = rest_client.list_training_runs(limit=10, offset=0).result()
print(f"Found {len(response.training_runs)} training runs")
print(f"Total available: {response.cursor.total_count}")
for run in response.training_runs:
print(f" - {run.training_run_id}: {run.base_model}")
if run.is_lora:
print(f" LoRA rank: {run.lora_rank}")
Filter training runs by a specific base model:
import hpcai
BASE_URL = "https://cloud.luchentech.com/finetunesdk"
API_KEY = "your-api-key-here"
TARGET_MODEL = "Qwen/Qwen3-8B"
service_client = hpcai.ServiceClient(base_url=BASE_URL, api_key=API_KEY)
rest_client = service_client.create_rest_client()
# List training runs and filter
response = rest_client.list_training_runs(limit=100, offset=0).result()
matching_runs = [
run for run in response.training_runs
if run.base_model == TARGET_MODEL
]
print(f"Found {len(matching_runs)} training runs for {TARGET_MODEL}:")
for run in matching_runs:
print(f" - {run.training_run_id} (LoRA rank: {run.lora_rank})")
Notes
An async variant list_training_runs_async() is also available.
get_training_run(training_run_id: types.ModelID)
Description
Retrieves detailed information about a specific training run by training_run_id, returning its full metadata.
Parameters
- training_run_id (types.ModelID): unique identifier of the training run; must correspond to an existing training run on the server.
Output
- TrainingRun object with the following fields:
  - training_run_id (str): unique identifier of the training run
  - base_model (str): base model name
  - model_owner (str): owner/creator identifier
  - is_lora (bool): whether this is a LoRA model
  - corrupted (bool): whether the model is in a corrupted state
  - lora_rank (int | None): LoRA rank if applicable, otherwise None
  - last_request_time (datetime): time of the most recent request
  - last_checkpoint (Checkpoint | None): latest training checkpoint
  - last_sampler_checkpoint (Checkpoint | None): latest sampler checkpoint
Example
import hpcai
BASE_URL = "https://cloud.luchentech.com/finetunesdk"
API_KEY = "your-api-key-here"
TRAINING_RUN_ID = "eb78693b-380d-40f8-a709-ffe0da185718"
service_client = hpcai.ServiceClient(base_url=BASE_URL, api_key=API_KEY)
rest_client = service_client.create_rest_client()
# Fetch training run information
training_run = rest_client.get_training_run(TRAINING_RUN_ID).result()
print(f"Training Run ID: {training_run.training_run_id}")
print(f"Base Model: {training_run.base_model}")
print(f"Is LoRA: {training_run.is_lora}")
if training_run.is_lora:
print(f"LoRA Rank: {training_run.lora_rank}")
print(f"Owner: {training_run.model_owner}")
print(f"Last Request: {training_run.last_request_time}")
print(f"Corrupted: {training_run.corrupted}")
Async variant
get_training_run_async(training_run_id: types.ModelID)
get_training_run_by_checkpoint_path(checkpoint_path: str)
Description
Retrieves detailed information about a specific training run by checkpoint_path, returning its full metadata.
Parameters
- checkpoint_path (str): checkpoint path using the hpcai:// scheme, in one of the following forms:
  - training checkpoint: "hpcai://{training_run_id}/weights/{checkpoint_id}"
  - sampler (inference) checkpoint: "hpcai://{training_run_id}/sampler_weights/{checkpoint_id}"
Output
- TrainingRun object with the following fields:
  - training_run_id (str): unique identifier of the training run
  - base_model (str): base model name
  - model_owner (str): owner/creator identifier
  - is_lora (bool): whether this is a LoRA model
  - corrupted (bool): whether the model is in a corrupted state
  - lora_rank (int | None): LoRA rank if applicable, otherwise None
  - last_request_time (datetime): time of the most recent request
  - last_checkpoint (Checkpoint | None): latest training checkpoint
  - last_sampler_checkpoint (Checkpoint | None): latest sampler checkpoint
Example
import hpcai
BASE_URL = "https://cloud.luchentech.com/finetunesdk"
API_KEY = "your-api-key-here"
CHECKPOINT_PATH = "hpcai://eb78693b-380d-40f8-a709-ffe0da185718/weights/step_0010"
service_client = hpcai.ServiceClient(base_url=BASE_URL, api_key=API_KEY)
rest_client = service_client.create_rest_client()
# Fetch training run information
training_run = rest_client.get_training_run_by_checkpoint_path(CHECKPOINT_PATH).result()
print(f"Training Run ID: {training_run.training_run_id}")
print(f"Base Model: {training_run.base_model}")
print(f"Is LoRA: {training_run.is_lora}")
if training_run.is_lora:
print(f"LoRA Rank: {training_run.lora_rank}")
Async variant
get_training_run_by_checkpoint_path_async(checkpoint_path: str)
list_checkpoints(training_run_id: types.ModelID)
Description
Lists all available checkpoints for the given training run. Two kinds of checkpoints are returned:
- training: the full trainable model weights saved during training; can be used to resume training.
- sampler: checkpoints optimized for inference; smaller and faster to load.
Parameters
- training_run_id (types.ModelID): unique identifier of the training run; must correspond to an existing training run on the server.
Output
- CheckpointsListResponse: contains the list of checkpoints, with the following fields:
  - checkpoints (list[Checkpoint]): list of checkpoint objects; each contains:
    - checkpoint_id (str): unique checkpoint identifier
    - checkpoint_type (CheckpointType): "training" or "sampler"
    - time (datetime): checkpoint creation time (timezone-aware, usually UTC)
    - checkpoint_path (str): full checkpoint path, in the form "hpcai://{training_run_id}/{type}/{checkpoint_id}"
Examples
- List all checkpoints
import hpcai
BASE_URL = "https://cloud.luchentech.com/finetunesdk"
API_KEY = "your-api-key-here"
TRAINING_RUN_ID = "eb78693b-380d-40f8-a709-ffe0da185718"
service_client = hpcai.ServiceClient(base_url=BASE_URL, api_key=API_KEY)
rest_client = service_client.create_rest_client()
# List all checkpoints
checkpoints_response = rest_client.list_checkpoints(TRAINING_RUN_ID).result()
for checkpoint in checkpoints_response.checkpoints:
print(f"{checkpoint.checkpoint_type}: {checkpoint.checkpoint_id}")
- Get the latest checkpoint and resume training
import hpcai
BASE_URL = "https://cloud.luchentech.com/finetunesdk"
API_KEY = "your-api-key-here"
TRAINING_RUN_ID = "eb78693b-380d-40f8-a709-ffe0da185718"
service_client = hpcai.ServiceClient(base_url=BASE_URL, api_key=API_KEY)
rest_client = service_client.create_rest_client()
checkpoints_response = rest_client.list_checkpoints(TRAINING_RUN_ID).result()
if checkpoints_response.checkpoints:
    # Sort by time and take the latest checkpoint
latest_checkpoint = sorted(
checkpoints_response.checkpoints,
key=lambda cp: cp.time,
reverse=True
)[0]
    # Resume training from the latest checkpoint
training_client = service_client.create_training_client_from_state(
latest_checkpoint.checkpoint_path
)
Async variant
list_checkpoints_async(training_run_id: types.ModelID)
Note: currently only LoRA adapter weights are saved; support for saving optimizer state will be added later.
download_checkpoint_archive(training_run_id: types.ModelID, checkpoint_id: str)
Description
Downloads the full checkpoint as a compressed tar.gz archive, identified by training_run_id and checkpoint_id.
The archive contains all model weights, configuration, and metadata, and can be used to run the trained model locally. The raw bytes are returned and can be saved or processed directly.
Parameters
- training_run_id (types.ModelID): unique identifier of the training run; must correspond to an existing training run on the server.
- checkpoint_id (str): identifier of the checkpoint to download, matching the checkpoint_id field returned by list_checkpoints().
Output
- bytes: raw bytes of the checkpoint archive. The archive is a tar.gz file containing:
  - model weights (for LoRA: adapter_model.bin or adapter_model.safetensors)
  - configuration files (adapter_config.json, tokenizer files)
  - other checkpoint metadata
Examples
- Download and save a checkpoint
import hpcai
from pathlib import Path
BASE_URL = "https://cloud.luchentech.com/finetunesdk"
API_KEY = "your-api-key-here"
TRAINING_RUN_ID = "eb78693b-380d-40f8-a709-ffe0da185718"
CHECKPOINT_ID = "step_0010"
service_client = hpcai.ServiceClient(base_url=BASE_URL, api_key=API_KEY)
rest_client = service_client.create_rest_client()
# Download the checkpoint archive
archive_data = rest_client.download_checkpoint_archive(
training_run_id=TRAINING_RUN_ID,
checkpoint_id=CHECKPOINT_ID
).result()
# Save to a local file
output_path = f"./checkpoints/checkpoint_{CHECKPOINT_ID}.tar.gz"
Path("./checkpoints").mkdir(parents=True, exist_ok=True)
with open(output_path, "wb") as f:
f.write(archive_data)
- Download and extract a checkpoint
import hpcai
import tarfile
from pathlib import Path
BASE_URL = "https://cloud.luchentech.com/finetunesdk"
API_KEY = "your-api-key-here"
TRAINING_RUN_ID = "eb78693b-380d-40f8-a709-ffe0da185718"
CHECKPOINT_ID = "final"
service_client = hpcai.ServiceClient(base_url=BASE_URL, api_key=API_KEY)
rest_client = service_client.create_rest_client()
# Download the checkpoint
archive_data = rest_client.download_checkpoint_archive(
training_run_id=TRAINING_RUN_ID,
checkpoint_id=CHECKPOINT_ID
).result()
# Save the archive
archive_path = Path(f"./checkpoints/checkpoint_{CHECKPOINT_ID}.tar.gz")
archive_path.parent.mkdir(parents=True, exist_ok=True)
archive_path.write_bytes(archive_data)
# Extract the archive
extract_dir = Path(f"./checkpoints/extracted_{CHECKPOINT_ID}")
extract_dir.mkdir(parents=True, exist_ok=True)
with tarfile.open(archive_path, "r:gz") as tar:
tar.extractall(extract_dir)
Async variant
download_checkpoint_archive_async(training_run_id: types.ModelID, checkpoint_id: str)
Note: downloading checkpoints consumes bandwidth and storage; use it only for local analysis or backup.
download_checkpoint_archive_by_checkpoint_path(checkpoint_path: str)
Description
Downloads the full checkpoint as a compressed tar.gz archive, identified by checkpoint_path.
The archive contains all model weights, configuration, and metadata, and can be used to run the trained model locally. The raw bytes are returned and can be saved or processed directly.
Parameters
- checkpoint_path (str): checkpoint path using the hpcai:// scheme, in one of the following forms:
  - training checkpoint: "hpcai://{training_run_id}/weights/{checkpoint_id}"
  - sampler (inference) checkpoint: "hpcai://{training_run_id}/sampler_weights/{checkpoint_id}"
Output
- bytes: raw bytes of the checkpoint archive. The archive is a tar.gz file containing:
  - model weights (for LoRA: adapter_model.bin or adapter_model.safetensors)
  - configuration files (adapter_config.json, tokenizer files)
  - other checkpoint metadata
Examples
- Download and save a checkpoint
import hpcai
from pathlib import Path
BASE_URL = "https://cloud.luchentech.com/finetunesdk"
API_KEY = "your-api-key-here"
CHECKPOINT_PATH = "hpcai://eb78693b-380d-40f8-a709-ffe0da185718/weights/step_0010"
service_client = hpcai.ServiceClient(base_url=BASE_URL, api_key=API_KEY)
rest_client = service_client.create_rest_client()
# Download the checkpoint archive
archive_data = rest_client.download_checkpoint_archive_by_checkpoint_path(
checkpoint_path=CHECKPOINT_PATH
).result()
# Save to a local file (use only the checkpoint name, not the full hpcai:// path, in the filename)
checkpoint_name = CHECKPOINT_PATH.split("/")[-1]  # e.g. "step_0010"
output_path = f"./checkpoints/checkpoint_{checkpoint_name}.tar.gz"
Path("./checkpoints").mkdir(parents=True, exist_ok=True)
with open(output_path, "wb") as f:
f.write(archive_data)
- Download and extract a checkpoint
import hpcai
import tarfile
from pathlib import Path
BASE_URL = "https://cloud.luchentech.com/finetunesdk"
API_KEY = "your-api-key-here"
CHECKPOINT_PATH = "hpcai://eb78693b-380d-40f8-a709-ffe0da185718/weights/step_0010"
service_client = hpcai.ServiceClient(base_url=BASE_URL, api_key=API_KEY)
rest_client = service_client.create_rest_client()
# Download the checkpoint
archive_data = rest_client.download_checkpoint_archive_by_checkpoint_path(
checkpoint_path=CHECKPOINT_PATH
).result()
# Save the archive (use only the checkpoint name, not the full hpcai:// path, in the filename)
checkpoint_name = CHECKPOINT_PATH.split("/")[-1]  # e.g. "step_0010"
archive_path = Path(f"./checkpoints/checkpoint_{checkpoint_name}.tar.gz")
archive_path.parent.mkdir(parents=True, exist_ok=True)
archive_path.write_bytes(archive_data)
# Extract the archive
extract_dir = Path(f"./checkpoints/extracted_{checkpoint_name}")
extract_dir.mkdir(parents=True, exist_ok=True)
with tarfile.open(archive_path, "r:gz") as tar:
tar.extractall(extract_dir)
Async variant
download_checkpoint_archive_by_checkpoint_path_async(checkpoint_path: str)
Note: downloading checkpoints consumes bandwidth and storage; use it only for local analysis or backup.
delete_checkpoint(training_run_id: types.ModelID, checkpoint_id: str)
Description
Permanently deletes a checkpoint of the given training run. The checkpoint is removed from server storage and cannot be recovered. Use this to free storage space or remove checkpoints that are no longer needed.
Parameters
- training_run_id (types.ModelID): unique identifier of the training run; must correspond to an existing training run on the server.
- checkpoint_id (str): identifier of the checkpoint to delete, matching the checkpoint_id field returned by list_checkpoints().
Output
None (the asynchronous operation is handled through an APIFuture)
Examples
- Basic deletion
import hpcai
BASE_URL = "https://cloud.luchentech.com/finetunesdk"
API_KEY = "your-api-key-here"
TRAINING_RUN_ID = "eb78693b-380d-40f8-a709-ffe0da185718"
CHECKPOINT_ID = "step_0010"
service_client = hpcai.ServiceClient(base_url=BASE_URL, api_key=API_KEY)
rest_client = service_client.create_rest_client()
# WARNING: this permanently deletes the checkpoint
try:
rest_client.delete_checkpoint(
training_run_id=TRAINING_RUN_ID,
checkpoint_id=CHECKPOINT_ID
).result()
except Exception as e:
print(f"Failed to delete checkpoint: {e}")
- Safe deletion (check that the checkpoint exists first)
import hpcai
BASE_URL = "https://cloud.luchentech.com/finetunesdk"
API_KEY = "your-api-key-here"
TRAINING_RUN_ID = "eb78693b-380d-40f8-a709-ffe0da185718"
CHECKPOINT_ID = "step_0010"
service_client = hpcai.ServiceClient(base_url=BASE_URL, api_key=API_KEY)
rest_client = service_client.create_rest_client()
# Check whether the checkpoint exists first
checkpoints_response = rest_client.list_checkpoints(TRAINING_RUN_ID).result()
checkpoint_exists = any(
cp.checkpoint_id == CHECKPOINT_ID
for cp in checkpoints_response.checkpoints
)
if checkpoint_exists:
print(f"Found checkpoint {CHECKPOINT_ID}")
print("WARNING: This will permanently delete the checkpoint!")
    # Optional: download a backup before deleting
# archive_data = rest_client.download_checkpoint_archive(
# training_run_id=TRAINING_RUN_ID,
# checkpoint_id=CHECKPOINT_ID
# ).result()
# with open(f"backup_{CHECKPOINT_ID}.tar.gz", "wb") as f:
# f.write(archive_data)
    # Delete the checkpoint
rest_client.delete_checkpoint(
training_run_id=TRAINING_RUN_ID,
checkpoint_id=CHECKPOINT_ID
).result()
else:
print(f"Checkpoint {CHECKPOINT_ID} does not exist")
Async variant
delete_checkpoint_async(training_run_id: types.ModelID, checkpoint_id: str)
delete_checkpoint_by_checkpoint_path(checkpoint_path: str)
Description
Permanently deletes the specified checkpoint. The checkpoint is removed from server storage and cannot be recovered. Use this to free storage space or remove checkpoints that are no longer needed.
Parameters
- checkpoint_path (str): path of the checkpoint to delete, using the hpcai:// scheme, in one of the following forms:
  - training checkpoint: "hpcai://{training_run_id}/weights/{checkpoint_id}"
  - sampler (inference) checkpoint: "hpcai://{training_run_id}/sampler_weights/{checkpoint_id}"
Output
None (the asynchronous operation is handled through an APIFuture)
Examples
- Basic usage (delete directly)
import hpcai
BASE_URL = "https://cloud.luchentech.com/finetunesdk"
API_KEY = "your-api-key-here"
CHECKPOINT_PATH = "hpcai://eb78693b-380d-40f8-a709-ffe0da185718/weights/step_0010"
service_client = hpcai.ServiceClient(base_url=BASE_URL, api_key=API_KEY)
rest_client = service_client.create_rest_client()
# WARNING: this permanently deletes the checkpoint
try:
rest_client.delete_checkpoint_by_checkpoint_path(
checkpoint_path=CHECKPOINT_PATH
).result()
except Exception as e:
print(f"Failed to delete checkpoint: {e}")
- Safe deletion (check that the checkpoint exists first)
import hpcai
BASE_URL = "https://cloud.luchentech.com/finetunesdk"
API_KEY = "your-api-key-here"
TRAINING_RUN_ID = "eb78693b-380d-40f8-a709-ffe0da185718"
CHECKPOINT_PATH = "hpcai://eb78693b-380d-40f8-a709-ffe0da185718/weights/step_0010"
service_client = hpcai.ServiceClient(base_url=BASE_URL, api_key=API_KEY)
rest_client = service_client.create_rest_client()
# Check whether the checkpoint exists first
checkpoints_response = rest_client.list_checkpoints(TRAINING_RUN_ID).result()
checkpoint_exists = any(
cp.checkpoint_path == CHECKPOINT_PATH
for cp in checkpoints_response.checkpoints
)
if checkpoint_exists:
print(f"Found checkpoint {CHECKPOINT_PATH}")
print("WARNING: This will permanently delete the checkpoint!")
rest_client.delete_checkpoint_by_checkpoint_path(
checkpoint_path=CHECKPOINT_PATH
).result()
else:
print(f"Checkpoint {CHECKPOINT_PATH} does not exist")
Async variant
delete_checkpoint_by_checkpoint_path_async(checkpoint_path: str)