Fine-Tuning API Reference
The following APIs are built on the open-source Tinker project from Thinking Machines Lab (Apache License 2.0).
We thank the open-source community for its contributions; on top of that foundation, the APIs have been deeply customized and enhanced for the cloud training and inference infrastructure of 潞晨云 (Luchen Cloud).
Service Client API
get_server_capabilities()
Description
Queries the models and capabilities currently supported by the API server and returns the list of models available for training or inference. Typically used as a first check of server availability and model information.
Input
None
Output
A GetServerCapabilitiesResponse object containing:
- supported_models (List[SupportedModel]): list of model info objects; each SupportedModel contains:
  - model_name (Optional[str]): model name / identifier
Example
import hpcai
BASE_URL = "https://cloud.luchentech.com/finetunesdk"
API_KEY = "your-api-key-here"
service_client = hpcai.ServiceClient(base_url=BASE_URL, api_key=API_KEY)
capabilities = service_client.get_server_capabilities()
print(f"Server supports {len(capabilities.supported_models)} models")
for model in capabilities.supported_models:
if model.model_name:
print(f" - {model.model_name}")
Notes
An async variant get_server_capabilities_async() is also available.
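A minimal async sketch, assuming get_server_capabilities_async() returns an awaitable that can be driven with asyncio (this calling convention is an assumption, not confirmed by this page):
import asyncio
import hpcai
BASE_URL = "https://cloud.luchentech.com/finetunesdk"
API_KEY = "your-api-key-here"
async def main():
    service_client = hpcai.ServiceClient(base_url=BASE_URL, api_key=API_KEY)
    # Assumption: the async variant is awaitable and returns the same response type
    capabilities = await service_client.get_server_capabilities_async()
    for model in capabilities.supported_models:
        if model.model_name:
            print(model.model_name)
asyncio.run(main())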
create_lora_training_client(base_model, rank=32, seed=None, train_mlp=True, train_attn=True, train_unembed=True)
Description
Creates a LoRA-based training model instance and returns a TrainingClient for training it.
Input
- base_model (str): base model name; must be one of the models supported by the fine-tuning service
- rank (int, default 32): LoRA rank, which affects model capacity and the number of trainable parameters
- seed (optional int): random seed for reproducible LoRA weight initialization
- train_mlp (bool, default True): whether to apply LoRA to the MLP layers
- train_attn (bool, default True): whether to apply LoRA to the attention layers
- train_unembed (bool, default True): whether to apply LoRA to the output (unembed) layer
Output
TrainingClient: a training client bound to the newly created LoRA model
Example
import hpcai
BASE_URL = "https://cloud.luchentech.com/finetunesdk"
API_KEY = "your-api-key-here"
service_client = hpcai.ServiceClient(base_url=BASE_URL, api_key=API_KEY)
training_client = service_client.create_lora_training_client(
base_model="Qwen/Qwen3-8B"
)
print(f"Created model ID {training_client.model_id}")
Custom LoRA configuration:
training_client = service_client.create_lora_training_client(
base_model="Qwen/Qwen2.5-7B-Instruct",
rank=64,
seed=42,
train_mlp=False,
train_attn=True, # only fine tune on attention layers
train_unembed=False
)
Notes
- The LoRA configuration is customizable, e.g. fine-tuning only the attention layers
- An async variant create_lora_training_client_async() is also available
create_training_client_from_state(path)
Description
Restores a training session from a saved checkpoint so that training can continue.
Input
path (str): checkpoint URI, for example:
- training checkpoint: hpcai://{training_run_id}/weights/{checkpoint_id}
- sampler (inference) checkpoint: hpcai://{training_run_id}/sampler_weights/{checkpoint_id}
Output
TrainingClient: a training client bound to the restored model
Example
Resume training from a saved checkpoint:
import hpcai
BASE_URL = "https://cloud.luchentech.com/finetunesdk"
API_KEY = "your-api-key-here"
CHECKPOINT_PATH = "hpcai://{model_id}/weights/step_0010"
service_client = hpcai.ServiceClient(base_url=BASE_URL, api_key=API_KEY)
training_client = service_client.create_training_client_from_state(CHECKPOINT_PATH)
print(f"Restored model ID: {training_client.model_id}")
# Continue training
training_data = [...] # Your training data
fwd_bwd = training_client.forward_backward(training_data, loss_fn="cross_entropy")
optim = training_client.optim_step(hpcai.AdamParams(learning_rate=1e-4))
fwd_bwd.result()
optim.result()
# Clean up
training_client.unload_model().result()
Use a RestClient to look up a checkpoint path and load the model:
import hpcai
BASE_URL = "https://cloud.luchentech.com/finetunesdk"
API_KEY = "your-api-key-here"
service_client = hpcai.ServiceClient(base_url=BASE_URL, api_key=API_KEY)
rest_client = service_client.create_rest_client()
# List checkpoints for a training run
training_run_id = "eb78693b-380d-40f8-a709-ffe0da185718"
checkpoints_response = rest_client.list_checkpoints(training_run_id).result()
# Find the latest checkpoint
if checkpoints_response.checkpoints:
latest_checkpoint = checkpoints_response.checkpoints[-1]
checkpoint_path = latest_checkpoint.checkpoint_path
print(f"Restoring from: {checkpoint_path}")
# Restore training client
training_client = service_client.create_training_client_from_state(checkpoint_path)
print(f"Restored model ID: {training_client.model_id}")
create_training_client(model_id: types.ModelID | None = None)
Description
Creates a TrainingClient instance for working with an existing model or a new one.
If model_id is provided, the client is bound to that model; if omitted, the client is not bound to any model and a checkpoint must be loaded (or a model created) before it can be used.
Parameters
- model_id (types.ModelID | None, optional): unique identifier of a model that already exists on the server.
  - If provided, the client is bound to that model.
  - If omitted (default None), the client is unbound; load a checkpoint or create a model before using it.
Output
- TrainingClient: a client instance for performing training operations.
Examples
1. Create a client for an existing model
import hpcai
BASE_URL = "https://cloud.luchentech.com/finetunesdk"
API_KEY = "your-api-key-here"
# ID of an existing model
existing_model_id = "eb78693b-380d-40f8-a709-ffe0da185718"
service_client = hpcai.ServiceClient(base_url=BASE_URL, api_key=API_KEY)
training_client = service_client.create_training_client(model_id=existing_model_id)
print(f"Client bound to model: {training_client.model_id}")
# Use the client for training
training_data = [...]  # your training data
fwd_bwd = training_client.forward_backward(training_data, loss_fn="cross_entropy")
optim = training_client.optim_step(hpcai.AdamParams(learning_rate=1e-4))
fwd_bwd.result()
optim.result()
2. Without model_id: load a checkpoint first, then train
import hpcai
BASE_URL = "https://cloud.luchentech.com/finetunesdk"
API_KEY = "your-api-key-here"
CHECKPOINT_PATH = "hpcai://eb78693b-380d-40f8-a709-ffe0da185718/weights/step_0010"
service_client = hpcai.ServiceClient(base_url=BASE_URL, api_key=API_KEY)
# Create a client that is not yet bound to a model
training_client = service_client.create_training_client(model_id=None)
# Load a checkpoint to bind the model
load_future = training_client.load_state(CHECKPOINT_PATH)
load_future.result()
print(f"Model loaded: {training_client.model_id}")
# The client can now be used for training
training_data = [...]  # your training data
result = training_client.forward_backward(training_data, loss_fn="cross_entropy")
result.result()
create_rest_client()
Description
Creates a RestClient instance for accessing the REST API, which supports querying training run information, checkpoint lists, and other metadata.
Parameters
None
Output
- RestClient: a client instance for REST API operations such as querying checkpoints and training run information.
Example
import hpcai
BASE_URL = "https://cloud.luchentech.com/finetunesdk"
API_KEY = "your-api-key-here"
TRAINING_RUN_ID = "eb78693b-380d-40f8-a709-ffe0da185718"
service_client = hpcai.ServiceClient(base_url=BASE_URL, api_key=API_KEY)
rest_client = service_client.create_rest_client()
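The RestClient can then call the REST endpoints documented below, for example listing training runs (a short sketch based on the list_training_runs() API described later on this page):
# List the first few training runs with the RestClient created above
response = rest_client.list_training_runs(limit=5, offset=0).result()
for run in response.training_runs:
    print(run.training_run_id, run.base_model)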
Training Client API
forward(data, loss_fn)
Description
Runs a forward pass through the model (no gradients are computed).
Input
- data (List[Datum]): input data
- loss_fn (str): loss function name (currently only "cross_entropy" is supported)
Output
APIFuture[ForwardBackwardOutput]: an asynchronous result handle. Example result:
ForwardBackwardOutput(
loss_fn_output_type='cross_entropy',
loss_fn_outputs=[
{
'logprobs': TensorData(...),
'elementwise_loss': TensorData(...)
}
],
metrics={
'loss:mean': 4.447656154632568,
'num_examples:sum': 1.0,
'step:max': 0.0
}
)
Example
from hpcai import types
data = [
types.Datum(
model_input=types.ModelInput.from_ints([1, 2, 3, 4, 5]),
loss_fn_inputs={
"target_tokens": types.TensorData(
data=[2, 3, 4, 5, 6],
dtype="int64",
shape=[5]
),
"weights": types.TensorData(
data=[1.0, 1.0, 1.0, 1.0, 1.0],
dtype="float32",
shape=[5]
)
}
)
]
out = training_client.forward(data, loss_fn="cross_entropy")
res = out.result()
print(res)
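The aggregated metrics shown in the example output above can be read directly from the result; a small sketch:
# Read the mean cross-entropy loss from the metrics of the result above
mean_loss = res.metrics["loss:mean"]
print(f"mean loss: {mean_loss:.4f}")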
forward_backward(data, loss_fn)
Description
Runs a forward and backward pass (computing gradients); gradients accumulate until optim_step is called.
Input
Same as forward
Output
Same as forward
Example
from hpcai import types
data = [
types.Datum(
model_input=types.ModelInput.from_ints([1, 2, 3, 4, 5]),
loss_fn_inputs={
"target_tokens": types.TensorData(
data=[2, 3, 4, 5, 6],
dtype="int64",
shape=[5]
),
"weights": types.TensorData(
data=[1.0, 1.0, 1.0, 1.0, 1.0],
dtype="float32",
shape=[5]
)
}
)
]
out = training_client.forward_backward(data, loss_fn="cross_entropy")
res = out.result()
print(res)
optim_step(adam_params)
Description
Performs one Adam optimizer step, applying the accumulated gradients to the model parameters.
Input
- adam_params: Adam parameter configuration
Output
OptimStepResponse (contains optimizer metrics)
Example
res = training_client.optim_step(
adam_params=types.AdamParams(learning_rate=1e-4)
).result()
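A typical training step pairs forward_backward with optim_step, mirroring the pattern used in the other examples on this page (training_data is a placeholder):
# One complete training step: forward/backward accumulates gradients, optim_step applies them
training_data = [...]  # your training data (see forward_backward above)
fwd_bwd = training_client.forward_backward(training_data, loss_fn="cross_entropy")
optim = training_client.optim_step(types.AdamParams(learning_rate=1e-4))
fwd_bwd.result()
optim.result()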
get_info()
Description
Returns metadata and configuration for the current training session, including model_id (the training run ID), model architecture information, and the LoRA configuration.
Input
None
Output
Returns a GetInfoResponse object with the following fields:
- type: response type (usually get_info)
- model_data: ModelData object containing the model architecture and model name
- model_id: unique ID of the model / training run
- is_lora: whether LoRA is enabled
- lora_rank: LoRA rank (if LoRA is enabled)
- model_name: base model name
Example
info = training_client.get_info()
print(info)
Example output
GetInfoResponse(
type='get_info',
model_data=ModelData(arch='', model_name='Qwen/Qwen3-8B'),
model_id='e5c88495-46e9-43df-9bf8-3185aceaa222',
is_lora=True,
lora_rank=16,
model_name='Qwen/Qwen3-8B'
)
get_tokenizer()
Description
Returns the pretrained transformers tokenizer for the model used by the current training run.
It can be used for encoding, decoding, and other operations, just like a native Hugging Face tokenizer.
Input
None
Output
Returns a Hugging Face PreTrainedTokenizer object.
Example
tokenizer = training_client.get_tokenizer()
print(tokenizer.encode("Hello, world!"))
Example output
[9707, 11, 1879, 0]
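The tokenizer pairs naturally with types.ModelInput.from_ints() when building training data. A minimal sketch (the next-token targets and uniform weights below are illustrative assumptions, not a prescribed recipe):
from hpcai import types
text = "Hello, world!"
tokens = tokenizer.encode(text)
# Predict each next token from the previous ones, with uniform per-token weights (illustrative)
datum = types.Datum(
    model_input=types.ModelInput.from_ints(tokens[:-1]),
    loss_fn_inputs={
        "target_tokens": types.TensorData(data=tokens[1:], dtype="int64", shape=[len(tokens) - 1]),
        "weights": types.TensorData(data=[1.0] * (len(tokens) - 1), dtype="float32", shape=[len(tokens) - 1]),
    },
)
out = training_client.forward_backward([datum], loss_fn="cross_entropy")
print(out.result().metrics)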
save_state(name: str)
Description
Saves the trainable weights of the loaded model (for example, only the LoRA adapter weights when LoRA is enabled) so that training can be resumed later. Returns an async handle (APIFuture) that resolves to a SaveWeightsResponse.
Parameters
- name (str): a user-defined checkpoint ID identifying the saved weights.
Output
APIFuture[SaveWeightsResponse]: an async object resolving to a response that contains the URL of the saved weights, in the form:
hpcai://{model_id}/weights/{checkpoint_id}_training
Example
# Save the training weights
res = training_client.save_state("initial").result()
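The saved weights can later be restored with load_state(). A minimal sketch that constructs the path from the format shown above (assuming model_id is the current training run ID):
# Resume later from the checkpoint saved above; the path follows the documented format
checkpoint_path = f"hpcai://{training_client.model_id}/weights/initial_training"
training_client.load_state(checkpoint_path).result()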
Async variant
save_state_async(name: str)
load_state(path: str)
Description
Loads the trainable weights of a user-specified checkpoint (for example, only the LoRA adapter weights when LoRA is enabled). Returns an async handle (APIFuture) that resolves to a LoadWeightsResponse.
Parameters
- path (str): URL of a saved checkpoint, in one of the following forms:
  - sampler checkpoint (for inference): "hpcai://{model_id}/weights/{checkpoint_id}_sampler"
  - training checkpoint (for resuming training): "hpcai://{model_id}/weights/{checkpoint_id}_training"
Output
- APIFuture[LoadWeightsResponse]: an async object resolving to a response that contains the URL of the loaded weights.
Example
# Load a training checkpoint
res = training_client.load_state(
'hpcai://e8b29733-9efa-476a-b62b-40b8f9c6d999/weights/initial_training'
).result()
Async variant
load_state_async(path: str)
save_weights_for_sampler(name: str)
Description
Saves the trainable weights of the loaded model (for example, only the LoRA adapter weights when LoRA is enabled) for inference. Returns an async handle (APIFuture) that resolves to a SaveWeightsForSamplerResponse.
Parameters
- name (str): a user-defined checkpoint ID identifying the saved weights.
Output
- APIFuture[SaveWeightsForSamplerResponse]: an async object resolving to a response that contains the URL of the saved weights, in the form:
hpcai://{model_id}/weights/{checkpoint_id}_sampler
Example
# Save model weights for inference
res = training_client.save_weights_for_sampler("initial").result()
Async variant
save_weights_for_sampler_async(name: str)
unload_model()
Description
Stops the current training session and releases resources (e.g. GPUs).
Note: this API does not save model weights automatically; previously saved checkpoints are unaffected and remain available for download.
Parameters
- None
Output
- APIFuture[UnloadModelResponse]: an async object resolving to a response that contains the ID of the unloaded model
Example
# Stop training and release resources
res = training_client.unload_model().result()
# Saved checkpoints can still be listed after unloading
checkpoints = rest_client.list_checkpoints(training_client.model_id).result()
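Because unload_model() does not save weights automatically, a common pattern is to save a checkpoint right before unloading (a minimal sketch using the save_state() API documented above):
# Save trainable weights first, since unload_model() does not save them automatically
training_client.save_state("before_unload").result()
# Then stop the session and release GPU resources
training_client.unload_model().result()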
Async variant
unload_model_async()
RestClient API
list_training_runs(limit=20, offset=0)
Description
Returns a paginated list of training runs.
Input
- limit: maximum number of entries to return
- offset: pagination offset
Output
TrainingRunsResponse: a response object containing:
- training_runs (list[TrainingRun]): list of training run objects; each contains:
  - training_run_id (str): unique identifier of the training run
  - base_model (str): base model name
  - model_owner (str): creator/owner of the model
  - is_lora (bool): whether this is a LoRA model
  - corrupted (bool): whether the model is in a corrupted state
  - lora_rank (int | None): LoRA rank if applicable, otherwise None
  - last_request_time (datetime): time of the most recent request for this training run
  - last_checkpoint (Checkpoint | None): latest training checkpoint
  - last_sampler_checkpoint (Checkpoint | None): latest sampler (inference) checkpoint
- cursor (Cursor): pagination metadata, containing:
  - offset (int): offset used for this request
  - limit (int): maximum number of entries for this request
  - total_count (int): total number of training runs available
Example
import hpcai
BASE_URL = "https://cloud.luchentech.com/finetunesdk"
API_KEY = "your-api-key-here"
service_client = hpcai.ServiceClient(base_url=BASE_URL, api_key=API_KEY)
rest_client = service_client.create_rest_client()
# List first 10 training runs
response = rest_client.list_training_runs(limit=10, offset=0).result()
print(f"Found {len(response.training_runs)} training runs")
print(f"Total available: {response.cursor.total_count}")
for run in response.training_runs:
print(f" - {run.training_run_id}: {run.base_model}")
if run.is_lora:
print(f" LoRA rank: {run.lora_rank}")
Filter training runs by a specific base model:
import hpcai
BASE_URL = "https://cloud.luchentech.com/finetunesdk"
API_KEY = "your-api-key-here"
TARGET_MODEL = "Qwen/Qwen3-8B"
service_client = hpcai.ServiceClient(base_url=BASE_URL, api_key=API_KEY)
rest_client = service_client.create_rest_client()
# List training runs and filter
response = rest_client.list_training_runs(limit=100, offset=0).result()
matching_runs = [
run for run in response.training_runs
if run.base_model == TARGET_MODEL
]
print(f"Found {len(matching_runs)} training runs for {TARGET_MODEL}:")
for run in matching_runs:
print(f" - {run.training_run_id} (LoRA rank: {run.lora_rank})")
Notes
An async variant list_training_runs_async() is also available.
get_training_run(training_run_id: types.ModelID)
Description
Retrieves detailed information about a specific training run by training_run_id, returning its full metadata.
Parameters
- training_run_id (types.ModelID): unique identifier of the training run; must correspond to an existing training run on the server.
Output
- TrainingRun object with the following fields:
  - training_run_id (str): unique identifier of the training run
  - base_model (str): base model name
  - model_owner (str): owner/creator identifier
  - is_lora (bool): whether this is a LoRA model
  - corrupted (bool): whether the model is in a corrupted state
  - lora_rank (int | None): LoRA rank if applicable, otherwise None
  - last_request_time (datetime): time of the most recent request
  - last_checkpoint (Checkpoint | None): latest training checkpoint
  - last_sampler_checkpoint (Checkpoint | None): latest sampler checkpoint
Example
import hpcai
BASE_URL = "https://cloud.luchentech.com/finetunesdk"
API_KEY = "your-api-key-here"
TRAINING_RUN_ID = "eb78693b-380d-40f8-a709-ffe0da185718"
service_client = hpcai.ServiceClient(base_url=BASE_URL, api_key=API_KEY)
rest_client = service_client.create_rest_client()
# Fetch training run information
training_run = rest_client.get_training_run(TRAINING_RUN_ID).result()
print(f"Training Run ID: {training_run.training_run_id}")
print(f"Base Model: {training_run.base_model}")
print(f"Is LoRA: {training_run.is_lora}")
if training_run.is_lora:
print(f"LoRA Rank: {training_run.lora_rank}")
print(f"Owner: {training_run.model_owner}")
print(f"Last Request: {training_run.last_request_time}")
print(f"Corrupted: {training_run.corrupted}")
Async variant
get_training_run_async(training_run_id: types.ModelID)
get_training_run_by_checkpoint_path(checkpoint_path: str)
Description
Retrieves detailed information about a specific training run by checkpoint_path, returning its full metadata.
Parameters
- checkpoint_path (str): checkpoint path using the hpcai:// scheme, in one of the following forms:
  - training checkpoint: "hpcai://{training_run_id}/weights/{checkpoint_id}"
  - sampler (inference) checkpoint: "hpcai://{training_run_id}/sampler_weights/{checkpoint_id}"
Output
- TrainingRun object with the following fields:
  - training_run_id (str): unique identifier of the training run
  - base_model (str): base model name
  - model_owner (str): owner/creator identifier
  - is_lora (bool): whether this is a LoRA model
  - corrupted (bool): whether the model is in a corrupted state
  - lora_rank (int | None): LoRA rank if applicable, otherwise None
  - last_request_time (datetime): time of the most recent request
  - last_checkpoint (Checkpoint | None): latest training checkpoint
  - last_sampler_checkpoint (Checkpoint | None): latest sampler checkpoint
Example
import hpcai
BASE_URL = "https://cloud.luchentech.com/finetunesdk"
API_KEY = "your-api-key-here"
CHECKPOINT_PATH = "hpcai://eb78693b-380d-40f8-a709-ffe0da185718/weights/step_0010"
service_client = hpcai.ServiceClient(base_url=BASE_URL, api_key=API_KEY)
rest_client = service_client.create_rest_client()
# Fetch training run information
training_run = rest_client.get_training_run_by_checkpoint_path(CHECKPOINT_PATH).result()
print(f"Training Run ID: {training_run.training_run_id}")
print(f"Base Model: {training_run.base_model}")
print(f"Is LoRA: {training_run.is_lora}")
if training_run.is_lora:
print(f"LoRA Rank: {training_run.lora_rank}")
Async variant
get_training_run_by_checkpoint_path_async(checkpoint_path: str)
list_checkpoints(training_run_id: types.ModelID)
Description
Lists all available checkpoints for the given training run. Two kinds of checkpoints are returned:
- training: the full trainable model weights saved during training; can be used to resume training.
- sampler: checkpoints optimized for inference; smaller and faster to load.
Parameters
- training_run_id (types.ModelID): unique identifier of the training run; must correspond to an existing training run on the server.
Output
- CheckpointsListResponse: contains the list of checkpoints, with the following fields:
  - checkpoints (list[Checkpoint]): list of checkpoint objects; each contains:
    - checkpoint_id (str): unique checkpoint identifier
    - checkpoint_type (CheckpointType): "training" or "sampler"
    - time (datetime): checkpoint creation time (timezone-aware, usually UTC)
    - checkpoint_path (str): full checkpoint path, in the form "hpcai://{training_run_id}/{type}/{checkpoint_id}"
Examples
- List all checkpoints
import hpcai
BASE_URL = "https://cloud.luchentech.com/finetunesdk"
API_KEY = "your-api-key-here"
TRAINING_RUN_ID = "eb78693b-380d-40f8-a709-ffe0da185718"
service_client = hpcai.ServiceClient(base_url=BASE_URL, api_key=API_KEY)
rest_client = service_client.create_rest_client()
# List all checkpoints
checkpoints_response = rest_client.list_checkpoints(TRAINING_RUN_ID).result()
for checkpoint in checkpoints_response.checkpoints:
print(f"{checkpoint.checkpoint_type}: {checkpoint.checkpoint_id}")
- Get the latest checkpoint and resume training
import hpcai
BASE_URL = "https://cloud.luchentech.com/finetunesdk"
API_KEY = "your-api-key-here"
TRAINING_RUN_ID = "eb78693b-380d-40f8-a709-ffe0da185718"
service_client = hpcai.ServiceClient(base_url=BASE_URL, api_key=API_KEY)
rest_client = service_client.create_rest_client()
checkpoints_response = rest_client.list_checkpoints(TRAINING_RUN_ID).result()
if checkpoints_response.checkpoints:
    # Sort by time and take the latest checkpoint
latest_checkpoint = sorted(
checkpoints_response.checkpoints,
key=lambda cp: cp.time,
reverse=True
)[0]
    # Resume training from the latest checkpoint
training_client = service_client.create_training_client_from_state(
latest_checkpoint.checkpoint_path
)
Async variant
list_checkpoints_async(training_run_id: types.ModelID)
Note: currently only LoRA adapter weights are saved; support for saving optimizer state will be added later.
download_checkpoint_archive(training_run_id: types.ModelID, checkpoint_id: str)
Description
Downloads the full checkpoint as a compressed tar.gz archive, identified by training_run_id and checkpoint_id.
The archive contains all model weights, configuration, and metadata, and can be used to run the trained model locally. The raw bytes are returned and can be saved or processed directly.
Parameters
- training_run_id (types.ModelID): unique identifier of the training run; must correspond to an existing training run on the server.
- checkpoint_id (str): identifier of the checkpoint to download, matching the checkpoint_id field returned by list_checkpoints().
Output
- bytes: raw bytes of the checkpoint archive. The archive is a tar.gz file containing:
  - model weights (for LoRA: adapter_model.bin or adapter_model.safetensors)
  - configuration files (adapter_config.json, tokenizer files)
  - other checkpoint metadata
Examples
- Download and save a checkpoint
import hpcai
from pathlib import Path
BASE_URL = "https://cloud.luchentech.com/finetunesdk"
API_KEY = "your-api-key-here"
TRAINING_RUN_ID = "eb78693b-380d-40f8-a709-ffe0da185718"
CHECKPOINT_ID = "step_0010"
service_client = hpcai.ServiceClient(base_url=BASE_URL, api_key=API_KEY)
rest_client = service_client.create_rest_client()
# Download the checkpoint archive
archive_data = rest_client.download_checkpoint_archive(
training_run_id=TRAINING_RUN_ID,
checkpoint_id=CHECKPOINT_ID
).result()
# Save to a local file
output_path = f"./checkpoints/checkpoint_{CHECKPOINT_ID}.tar.gz"
Path("./checkpoints").mkdir(parents=True, exist_ok=True)
with open(output_path, "wb") as f:
f.write(archive_data)
- Download and extract a checkpoint
import hpcai
import tarfile
from pathlib import Path
BASE_URL = "https://cloud.luchentech.com/finetunesdk"
API_KEY = "your-api-key-here"
TRAINING_RUN_ID = "eb78693b-380d-40f8-a709-ffe0da185718"
CHECKPOINT_ID = "final"
service_client = hpcai.ServiceClient(base_url=BASE_URL, api_key=API_KEY)
rest_client = service_client.create_rest_client()
# Download the checkpoint
archive_data = rest_client.download_checkpoint_archive(
training_run_id=TRAINING_RUN_ID,
checkpoint_id=CHECKPOINT_ID
).result()
# Save the archive
archive_path = Path(f"./checkpoints/checkpoint_{CHECKPOINT_ID}.tar.gz")
archive_path.parent.mkdir(parents=True, exist_ok=True)
archive_path.write_bytes(archive_data)
# Extract the archive
extract_dir = Path(f"./checkpoints/extracted_{CHECKPOINT_ID}")
extract_dir.mkdir(parents=True, exist_ok=True)
with tarfile.open(archive_path, "r:gz") as tar:
tar.extractall(extract_dir)
Async variant
download_checkpoint_archive_async(training_run_id: types.ModelID, checkpoint_id: str)
Note: downloading checkpoints consumes bandwidth and storage; use it only for local analysis or backup.
download_checkpoint_archive_by_checkpoint_path(checkpoint_path: str)
Description
Downloads the full checkpoint as a compressed tar.gz archive, identified by checkpoint_path.
The archive contains all model weights, configuration, and metadata, and can be used to run the trained model locally. The raw bytes are returned and can be saved or processed directly.
Parameters
- checkpoint_path (str): checkpoint path using the hpcai:// scheme, in one of the following forms:
  - training checkpoint: "hpcai://{training_run_id}/weights/{checkpoint_id}"
  - sampler (inference) checkpoint: "hpcai://{training_run_id}/sampler_weights/{checkpoint_id}"
Output
- bytes: raw bytes of the checkpoint archive. The archive is a tar.gz file containing:
  - model weights (for LoRA: adapter_model.bin or adapter_model.safetensors)
  - configuration files (adapter_config.json, tokenizer files)
  - other checkpoint metadata
Examples
- Download and save a checkpoint
import hpcai
from pathlib import Path
BASE_URL = "https://cloud.luchentech.com/finetunesdk"
API_KEY = "your-api-key-here"
CHECKPOINT_PATH = "hpcai://eb78693b-380d-40f8-a709-ffe0da185718/weights/step_0010"
service_client = hpcai.ServiceClient(base_url=BASE_URL, api_key=API_KEY)
rest_client = service_client.create_rest_client()
# Download the checkpoint archive
archive_data = rest_client.download_checkpoint_archive_by_checkpoint_path(
checkpoint_path=CHECKPOINT_PATH
).result()
# Save to a local file (use only the checkpoint name, not the full hpcai:// path, in the filename)
checkpoint_name = CHECKPOINT_PATH.split("/")[-1]  # e.g. "step_0010"
output_path = f"./checkpoints/checkpoint_{checkpoint_name}.tar.gz"
Path("./checkpoints").mkdir(parents=True, exist_ok=True)
with open(output_path, "wb") as f:
f.write(archive_data)
- Download and extract a checkpoint
import hpcai
import tarfile
from pathlib import Path
BASE_URL = "https://cloud.luchentech.com/finetunesdk"
API_KEY = "your-api-key-here"
CHECKPOINT_PATH = "hpcai://eb78693b-380d-40f8-a709-ffe0da185718/weights/step_0010"
service_client = hpcai.ServiceClient(base_url=BASE_URL, api_key=API_KEY)
rest_client = service_client.create_rest_client()
# Download the checkpoint
archive_data = rest_client.download_checkpoint_archive_by_checkpoint_path(
checkpoint_path=CHECKPOINT_PATH
).result()
# Save the archive (use only the checkpoint name, not the full hpcai:// path, in the filename)
checkpoint_name = CHECKPOINT_PATH.split("/")[-1]  # e.g. "step_0010"
archive_path = Path(f"./checkpoints/checkpoint_{checkpoint_name}.tar.gz")
archive_path.parent.mkdir(parents=True, exist_ok=True)
archive_path.write_bytes(archive_data)
# Extract the archive
extract_dir = Path(f"./checkpoints/extracted_{checkpoint_name}")
extract_dir.mkdir(parents=True, exist_ok=True)
with tarfile.open(archive_path, "r:gz") as tar:
tar.extractall(extract_dir)
Async variant
download_checkpoint_archive_by_checkpoint_path_async(checkpoint_path: str)
Note: downloading checkpoints consumes bandwidth and storage; use it only for local analysis or backup.
delete_checkpoint(training_run_id: types.ModelID, checkpoint_id: str)
Description
Permanently deletes a checkpoint of the given training run. The checkpoint is removed from server storage and cannot be recovered. Use this to free storage space or remove checkpoints that are no longer needed.
Parameters
- training_run_id (types.ModelID): unique identifier of the training run; must correspond to an existing training run on the server.
- checkpoint_id (str): identifier of the checkpoint to delete, matching the checkpoint_id field returned by list_checkpoints().
Output
None (the asynchronous operation is handled through an APIFuture)
Examples
- Basic deletion
import hpcai
BASE_URL = "https://cloud.luchentech.com/finetunesdk"
API_KEY = "your-api-key-here"
TRAINING_RUN_ID = "eb78693b-380d-40f8-a709-ffe0da185718"
CHECKPOINT_ID = "step_0010"
service_client = hpcai.ServiceClient(base_url=BASE_URL, api_key=API_KEY)
rest_client = service_client.create_rest_client()
# WARNING: this permanently deletes the checkpoint
try:
rest_client.delete_checkpoint(
training_run_id=TRAINING_RUN_ID,
checkpoint_id=CHECKPOINT_ID
).result()
except Exception as e:
print(f"Failed to delete checkpoint: {e}")
- Safe deletion (check that the checkpoint exists first)
import hpcai
BASE_URL = "https://cloud.luchentech.com/finetunesdk"
API_KEY = "your-api-key-here"
TRAINING_RUN_ID = "eb78693b-380d-40f8-a709-ffe0da185718"
CHECKPOINT_ID = "step_0010"
service_client = hpcai.ServiceClient(base_url=BASE_URL, api_key=API_KEY)
rest_client = service_client.create_rest_client()
# Check whether the checkpoint exists first
checkpoints_response = rest_client.list_checkpoints(TRAINING_RUN_ID).result()
checkpoint_exists = any(
cp.checkpoint_id == CHECKPOINT_ID
for cp in checkpoints_response.checkpoints
)
if checkpoint_exists:
print(f"Found checkpoint {CHECKPOINT_ID}")
print("WARNING: This will permanently delete the checkpoint!")
    # Optional: download a backup before deleting
# archive_data = rest_client.download_checkpoint_archive(
# training_run_id=TRAINING_RUN_ID,
# checkpoint_id=CHECKPOINT_ID
# ).result()
# with open(f"backup_{CHECKPOINT_ID}.tar.gz", "wb") as f:
# f.write(archive_data)
    # Delete the checkpoint
rest_client.delete_checkpoint(
training_run_id=TRAINING_RUN_ID,
checkpoint_id=CHECKPOINT_ID
).result()
else:
print(f"Checkpoint {CHECKPOINT_ID} does not exist")
Async variant
delete_checkpoint_async(training_run_id: types.ModelID, checkpoint_id: str)
delete_checkpoint_by_checkpoint_path(checkpoint_path: str)
Description
Permanently deletes the specified checkpoint. The checkpoint is removed from server storage and cannot be recovered. Use this to free storage space or remove checkpoints that are no longer needed.
Parameters
- checkpoint_path (str): path of the checkpoint to delete, using the hpcai:// scheme, in one of the following forms:
  - training checkpoint: "hpcai://{training_run_id}/weights/{checkpoint_id}"
  - sampler (inference) checkpoint: "hpcai://{training_run_id}/sampler_weights/{checkpoint_id}"
Output
None (the asynchronous operation is handled through an APIFuture)
Examples
- Basic usage (delete directly)
import hpcai
BASE_URL = "https://cloud.luchentech.com/finetunesdk"
API_KEY = "your-api-key-here"
CHECKPOINT_PATH = "hpcai://eb78693b-380d-40f8-a709-ffe0da185718/weights/step_0010"
service_client = hpcai.ServiceClient(base_url=BASE_URL, api_key=API_KEY)
rest_client = service_client.create_rest_client()
# WARNING: this permanently deletes the checkpoint
try:
rest_client.delete_checkpoint_by_checkpoint_path(
checkpoint_path=CHECKPOINT_PATH
).result()
except Exception as e:
print(f"Failed to delete checkpoint: {e}")
- Safe deletion (check that the checkpoint exists first)
import hpcai
BASE_URL = "https://cloud.luchentech.com/finetunesdk"
API_KEY = "your-api-key-here"
TRAINING_RUN_ID = "eb78693b-380d-40f8-a709-ffe0da185718"
CHECKPOINT_PATH = "hpcai://eb78693b-380d-40f8-a709-ffe0da185718/weights/step_0010"
service_client = hpcai.ServiceClient(base_url=BASE_URL, api_key=API_KEY)
rest_client = service_client.create_rest_client()
# Check whether the checkpoint exists first
checkpoints_response = rest_client.list_checkpoints(TRAINING_RUN_ID).result()
checkpoint_exists = any(
cp.checkpoint_path == CHECKPOINT_PATH
for cp in checkpoints_response.checkpoints
)
if checkpoint_exists:
print(f"Found checkpoint {CHECKPOINT_PATH}")
print("WARNING: This will permanently delete the checkpoint!")
rest_client.delete_checkpoint_by_checkpoint_path(
checkpoint_path=CHECKPOINT_PATH
).result()
else:
print(f"Checkpoint {CHECKPOINT_PATH} does not exist")
Async variant
delete_checkpoint_by_checkpoint_path_async(checkpoint_path: str)