Colossal-Eval
ColossalEval provides a unified pipeline for evaluating language models on public datasets or your own datasets, using both traditional metrics and GPT-assisted scoring. On 潞晨云 (Luchen Cloud), a ColossalEval image comes preinstalled, giving you an out-of-the-box experience.
Usage
1. Cloud platform machine setup
❗️ The machine must have the public data mounted. Add an SSH public key to the machine:
cd /root/.ssh
vi authorized_keys
- Save the file after adding your public key
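If you do not yet have a key pair, generate one on your local machine first; a minimal sketch (the file name is just ssh-keygen's default for an ed25519 key):
# on your local machine
ssh-keygen -t ed25519
cat ~/.ssh/id_ed25519.pub    # paste this single line into /root/.ssh/authorized_keys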
2. Prepare the Colossal-Eval test
Step 1: Create the eval result folders
# under /root
mkdir -p eval_results/inference eval_results/evaluation
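A quick optional check that the layout is in place:
ls -R /root/eval_results
# expected: the (still empty) inference/ and evaluation/ subfolders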
❗️ Complete the configuration described in 3. Testing different datasets before running the Step 2 and Step 3 commands; run inference first, then evaluation.
Step 2: Inference launch command
- Configuration
/root/ColossalAI/applications/ColossalEval/examples/dataset_evaluation/inference.sh
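# --nproc_per_node=4 spawns four worker processes; adjust it to the number of GPUs on your machine.
# --tp_size 1 disables tensor parallelism, so each worker holds a full model replica (pure data parallelism).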
torchrun --nproc_per_node=4 inference.py \
--config "/root/ColossalAI/applications/ColossalEval/examples/dataset_evaluation/config/inference/config.json" \
--load_dataset \
--tp_size 1 \
--inference_save_path "/root/eval_results/inference"
- Start inference
cd /root/ColossalAI/applications/ColossalEval/examples/dataset_evaluation/
sh inference.sh
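Few-shot inference over MMLU and CMMLU can take a while. If you want the run to survive an SSH disconnect, a minimal sketch (the log path is illustrative):
cd /root/ColossalAI/applications/ColossalEval/examples/dataset_evaluation/
nohup sh inference.sh > /root/eval_results/inference.log 2>&1 &
tail -f /root/eval_results/inference.log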
Step 3: Evaluation launch command
- Configuration
/root/ColossalAI/applications/ColossalEval/examples/dataset_evaluation/eval_dataset.sh
python eval_dataset.py \
--config "/root/ColossalAI/applications/ColossalEval/examples/dataset_evaluation/config/evaluation/config.json" \
--inference_results_path "/root/eval_results/inference" \
--evaluation_results_save_path "/root/eval_results/evaluation"
- Start evaluation
cd /root/ColossalAI/applications/ColossalEval/examples/dataset_evaluation/
sh eval_dataset.sh
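Unlike inference, evaluation runs as a single Python process (no torchrun). To keep a log next to the console output, a sketch with an illustrative log path:
cd /root/ColossalAI/applications/ColossalEval/examples/dataset_evaluation/
sh eval_dataset.sh 2>&1 | tee /root/eval_results/evaluation.log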
3. Testing different datasets
Different datasets are tested mainly by modifying the following two config files:
/root/ColossalAI/applications/ColossalEval/examples/dataset_evaluation/config/inference/config.json
/root/ColossalAI/applications/ColossalEval/examples/dataset_evaluation/config/evaluation/config.json
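After editing, it is worth sanity-checking that both files are still valid JSON before launching anything; python -m json.tool exits non-zero on a syntax error:
python -m json.tool /root/ColossalAI/applications/ColossalEval/examples/dataset_evaluation/config/inference/config.json > /dev/null && echo "inference config OK"
python -m json.tool /root/ColossalAI/applications/ColossalEval/examples/dataset_evaluation/config/evaluation/config.json > /dev/null && echo "evaluation config OK"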
Test 1
Configure inference/config.json
A800 / H800
{
    "model": [
        {
            "name": "Colossal-LLaMA-2-7b-base",
            "model_class": "HuggingFaceCausalLM",
            "parameters": {
                "path": "/root/commonData/Colossal-LLaMA-2-7b-base",
                "model_max_length": 4096,
                "tokenizer_path": "",
                "tokenizer_kwargs": {
                    "trust_remote_code": true
                },
                "peft_path": null,
                "model_kwargs": {
                    "torch_dtype": "torch.bfloat16",
                    "trust_remote_code": true
                },
                "prompt_template": "plain",
                "batch_size": 8
            }
        }
    ],
    "dataset": [
        {
            "name": "mmlu",
            "dataset_class": "MMLUDataset",
            "debug": false,
            "few_shot": true,
            "path": "/root/commonData/colossaleval_data/mmlu",
            "save_path": "inference_data/mmlu.json"
        },
        {
            "name": "cmmlu",
            "dataset_class": "CMMLUDataset",
            "debug": false,
            "few_shot": true,
            "path": "/root/commonData/colossaleval_data/cmmlu",
            "save_path": "inference_data/cmmlu.json"
        }
    ]
}
4090
{
    "model": [
        {
            "name": "Colossal-LLaMA-2-7b-base",
            "model_class": "HuggingFaceCausalLM",
            "parameters": {
                "path": "/root/commonData/Colossal-LLaMA-2-7b-base",
                "model_max_length": 4096,
                "tokenizer_path": "",
                "tokenizer_kwargs": {
                    "trust_remote_code": true
                },
                "peft_path": null,
                "model_kwargs": {
                    "torch_dtype": "torch.bfloat16",
                    "trust_remote_code": true
                },
                "prompt_template": "plain",
                "batch_size": 2
            }
        }
    ],
    "dataset": [
        {
            "name": "mmlu",
            "dataset_class": "MMLUDataset",
            "debug": false,
            "few_shot": true,
            "path": "/root/commonData/colossaleval_data/mmlu",
            "save_path": "inference_data/mmlu.json"
        },
        {
            "name": "cmmlu",
            "dataset_class": "CMMLUDataset",
            "debug": false,
            "few_shot": true,
            "path": "/root/commonData/colossaleval_data/cmmlu",
            "save_path": "inference_data/cmmlu.json"
        }
    ]
}
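The two inference configs above differ only in batch_size (8 on A800/H800, 2 on a 4090) to fit the smaller GPU memory. To evaluate a different checkpoint, only the model path (and, if needed, tokenizer_path) has to change; a minimal sketch that rewrites the config in place, where /root/my-model is a hypothetical checkpoint path:
python - <<'EOF'
import json

cfg_path = "/root/ColossalAI/applications/ColossalEval/examples/dataset_evaluation/config/inference/config.json"
with open(cfg_path) as f:
    cfg = json.load(f)

# Point the model entry at a different checkpoint (hypothetical path).
cfg["model"][0]["parameters"]["path"] = "/root/my-model"

with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=4)
EOF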
Configure evaluation/config.json
{
    "model": [
        {
            "name": "Colossal-LLaMA-2-7b-base"
        }
    ],
    "dataset": [
        {
            "name": "cmmlu",
            "metrics": [
                "first_token_accuracy",
                "single_choice_accuracy",
                "perplexity",
                "ppl_score",
                "ppl_score_over_choices"
            ]
        },
        {
            "name": "mmlu",
            "metrics": [
                "first_token_accuracy",
                "single_choice_accuracy",
                "perplexity",
                "ppl_score",
                "ppl_score_over_choices"
            ]
        }
    ]
}
Inference finished: (screenshot)
Evaluation finished: (screenshot)
4. Test complete
- Raw inference results are stored at:
/root/eval_results/inference/Colossal-LLaMA-2-7b-base
- Latest evaluation results are stored at:
/root/eval_results/evaluation
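To inspect the outputs, list the result folders and pretty-print one of the JSON files; exact file names depend on the datasets you ran, so mmlu.json below is illustrative:
ls /root/eval_results/inference/Colossal-LLaMA-2-7b-base
ls /root/eval_results/evaluation
python -m json.tool /root/eval_results/evaluation/mmlu.json | head -n 40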