
Colossal-Eval

ColossalEval provides a unified pipeline for evaluating language models on public datasets or your own datasets, using both traditional metrics and GPT-assisted scoring. On 潞晨云 (Luchen Cloud), a prebuilt ColossalEval image is provided for an out-of-the-box experience.

Usage steps

1. Cloud platform machine setup

❗️ The instance must mount the public data volume. Add an SSH public key to the machine:

  1. cd /root/.ssh
  2. vi authorized_keys
  3. Paste your public key, then save
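Equivalently, the same steps can be done non-interactively; a minimal sketch (the key string below is a placeholder for your own public key):

# Run on the cloud machine; replace the example key with your actual public key
mkdir -p /root/.ssh && chmod 700 /root/.ssh
echo "ssh-ed25519 AAAA...your-key... user@host" >> /root/.ssh/authorized_keys
chmod 600 /root/.ssh/authorized_keys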

2. Prepare the Colossal-Eval test

Step 1: Create the eval results directories

# in the root directory
mkdir eval_results eval_results/inference eval_results/evaluation
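Equivalently, mkdir -p creates the parent and both subdirectories in one call:

mkdir -p /root/eval_results/inference /root/eval_results/evaluation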

❗️ Complete the configuration described in section 3 (Testing different datasets) before running the Step 2 and Step 3 commands: run inference first, then evaluation.

Step 2: Launch inference

  1. Configure /root/ColossalAI/applications/ColossalEval/examples/dataset_evaluation/inference.sh
torchrun --nproc_per_node=4 inference.py \
--config "/root/ColossalAI/applications/ColossalEval/examples/dataset_evaluation/config/inference/config.json" \
--load_dataset \
--tp_size 1 \
--inference_save_path "/root/eval_results/inference"
  2. Launch inference
cd /root/ColossalAI/applications/ColossalEval/examples/dataset_evaluation/
sh inference.sh
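Once inference completes, you can sanity-check that results were written; per section 4, they land in a subdirectory named after the model:

ls /root/eval_results/inference/Colossal-LLaMA-2-7b-base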

Step 3: Launch evaluation

  1. Configure /root/ColossalAI/applications/ColossalEval/examples/dataset_evaluation/eval_dataset.sh
python eval_dataset.py \
--config "/root/ColossalAI/applications/ColossalEval/examples/dataset_evaluation/config/evaluation/config.json" \
--inference_results_path "/root/eval_results/inference" \
--evaluation_results_save_path "/root/eval_results/evaluation"
  2. Launch evaluation
cd /root/ColossalAI/applications/ColossalEval/examples/dataset_evaluation/
sh eval_dataset.sh
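Likewise, confirm that the evaluation results were saved:

ls /root/eval_results/evaluation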

3. Testing different datasets

To test different datasets, modify the following two config files:

/root/ColossalAI/applications/ColossalEval/examples/dataset_evaluation/config/inference/config.json
/root/ColossalAI/applications/ColossalEval/examples/dataset_evaluation/config/evaluation/config.json
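Before editing the configs, you can confirm that the public datasets they reference are visible (assuming the public data volume from section 1 is mounted):

ls /root/commonData/colossaleval_data/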

Test 1

Configure inference/config.json

A800 / H800

{
    "model": [
        {
            "name": "Colossal-LLaMA-2-7b-base",
            "model_class": "HuggingFaceCausalLM",
            "parameters": {
                "path": "/root/commonData/Colossal-LLaMA-2-7b-base",
                "model_max_length": 4096,
                "tokenizer_path": "",
                "tokenizer_kwargs": {
                    "trust_remote_code": true
                },
                "peft_path": null,
                "model_kwargs": {
                    "torch_dtype": "torch.bfloat16",
                    "trust_remote_code": true
                },
                "prompt_template": "plain",
                "batch_size": 8
            }
        }
    ],
    "dataset": [
        {
            "name": "mmlu",
            "dataset_class": "MMLUDataset",
            "debug": false,
            "few_shot": true,
            "path": "/root/commonData/colossaleval_data/mmlu",
            "save_path": "inference_data/mmlu.json"
        },
        {
            "name": "cmmlu",
            "dataset_class": "CMMLUDataset",
            "debug": false,
            "few_shot": true,
            "path": "/root/commonData/colossaleval_data/cmmlu",
            "save_path": "inference_data/cmmlu.json"
        }
    ]
}

4090 (only batch_size differs from the A800 / H800 config, lowered to 2 for the smaller GPU memory)

{
    "model": [
        {
            "name": "Colossal-LLaMA-2-7b-base",
            "model_class": "HuggingFaceCausalLM",
            "parameters": {
                "path": "/root/commonData/Colossal-LLaMA-2-7b-base",
                "model_max_length": 4096,
                "tokenizer_path": "",
                "tokenizer_kwargs": {
                    "trust_remote_code": true
                },
                "peft_path": null,
                "model_kwargs": {
                    "torch_dtype": "torch.bfloat16",
                    "trust_remote_code": true
                },
                "prompt_template": "plain",
                "batch_size": 2
            }
        }
    ],
    "dataset": [
        {
            "name": "mmlu",
            "dataset_class": "MMLUDataset",
            "debug": false,
            "few_shot": true,
            "path": "/root/commonData/colossaleval_data/mmlu",
            "save_path": "inference_data/mmlu.json"
        },
        {
            "name": "cmmlu",
            "dataset_class": "CMMLUDataset",
            "debug": false,
            "few_shot": true,
            "path": "/root/commonData/colossaleval_data/cmmlu",
            "save_path": "inference_data/cmmlu.json"
        }
    ]
}

Configure evaluation/config.json

{
    "model": [
        {
            "name": "Colossal-LLaMA-2-7b-base"
        }
    ],
    "dataset": [
        {
            "name": "cmmlu",
            "metrics": [
                "first_token_accuracy",
                "single_choice_accuracy",
                "perplexity",
                "ppl_score",
                "ppl_score_over_choices"
            ]
        },
        {
            "name": "mmlu",
            "metrics": [
                "first_token_accuracy",
                "single_choice_accuracy",
                "perplexity",
                "ppl_score",
                "ppl_score_over_choices"
            ]
        }
    ]
}
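Since both files are hand-edited JSON, a quick syntax check before launching can save a failed run. python -m json.tool is a generic JSON validator from the Python standard library, not a ColossalEval feature:

cd /root/ColossalAI/applications/ColossalEval/examples/dataset_evaluation/config
python -m json.tool inference/config.json > /dev/null && echo "inference config OK"
python -m json.tool evaluation/config.json > /dev/null && echo "evaluation config OK"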

Inference finished: (screenshot)

Evaluation finished: (screenshot)

4. After the test

  • Raw inference results are stored at: /root/eval_results/inference/Colossal-LLaMA-2-7b-base
  • Latest evaluation results: /root/eval_results/evaluation