learn_about_humanEval

These are my notes on HumanEval, a benchmark commonly used to evaluate the code generation ability of models.

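HumanEval is a dataset created by OpenAI to evaluate code generation models, designed specifically to test a model's programming ability. It contains 164 problems, all written in Python.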

Data format

One example record:

{
  "task_id": "HumanEval/163", 
  "prompt": "\ndef generate_integers(a, b):\n    \"\"\"\n    Given two positive integers a and b, return the even digits between a\n    and b, in ascending order.\n\n    For example:\n    generate_integers(2, 8) => [2, 4, 6, 8]\n    generate_integers(8, 2) => [2, 4, 6, 8]\n    generate_integers(10, 14) => []\n    \"\"\"\n",
  "entry_point": "generate_integers", 
  "canonical_solution": "    lower = max(2, min(a, b))\n    upper = min(8, max(a, b))\n\n    return [i for i in range(lower, upper+1) if i % 2 == 0]\n", 
  "test": "def check(candidate):\n\n    # Check some simple cases\n    assert candidate(2, 10) == [2, 4, 6, 8], \"Test 1\"\n    assert candidate(10, 2) == [2, 4, 6, 8], \"Test 2\"\n    assert candidate(132, 2) == [2, 4, 6, 8], \"Test 3\"\n    assert candidate(17,89) == [], \"Test 4\"\n\n    # Check some edge cases that are easy to work out by hand.\n    assert True, \"This prints if this assert fails 2 (also good for debugging!)\"\n\n"
}

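The prompt gives the model the function signature and a docstring that describes the task and shows the expected outputs. entry_point is the function name, canonical_solution is a reference implementation, and test holds the test cases.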

How does HumanEval check code correctness?

In HumanEval, the evaluation code lives in execution.py. The program to run is assembled as follows:

# Construct the check program and run it.
check_program = (
    problem["prompt"] + completion + "\n" +
    problem["test"] + "\n" +
    f"check({problem['entry_point']})"
    )

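This concatenated program is then executed. test contains a number of test cases, and the harness checks whether the generated code passes all of them; a sample counts as passed only if every test passes. However, the model's output does not always follow the expected format of containing only the function body: in practice it often repeats the function header as well. On inspection, this turned out not to affect the evaluation result. Why? If the function header is repeated, the concatenated code looks like this: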

from typing import List

# This function header appears twice
def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """ Check if in given list of numbers, are any two numbers closer to each other than
    given threshold.
    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
    False
    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
    True
    """
from typing import List

def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """ Check if in given list of numbers, are any two numbers closer to each other than
    given threshold.
    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
    False
    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
    True
    """
    for idx, elem in enumerate(numbers):
        for idx2, elem2 in enumerate(numbers):
            if idx != idx2:
                distance = abs(elem - elem2)
                if distance < threshold:
                    return True

    return False


METADATA = {
    'author': 'jt',
    'dataset': 'test'
}


def check(candidate):
    assert candidate([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.3) == True
    assert candidate([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.05) == False
    assert candidate([1.0, 2.0, 5.9, 4.0, 5.0], 0.95) == True
    assert candidate([1.0, 2.0, 5.9, 4.0, 5.0], 0.8) == False
    assert candidate([1.0, 2.0, 3.0, 4.0, 5.0, 2.0], 0.1) == True
    assert candidate([1.1, 2.2, 3.1, 4.1, 5.1], 1.0) == True
    assert candidate([1.1, 2.2, 3.1, 4.1, 5.1], 0.5) == False

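This is an interesting behavior: even when the generated code includes the function header, the concatenated program can still pass. The def from the prompt, whose body is only the docstring, is already a valid (if useless) function, and Python executes definitions in order, so a later def with the same name simply rebinds it. The definition that ends up in effect is the one from the completion, which replaces the incomplete one from the prompt, so a completion that contains the whole function still runs correctly.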
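To convince myself of this, here is a minimal sketch that rebuilds the same concatenation on a toy problem and runs it with exec. The problem record and the helper names (build_check_program, passes) are made up for illustration and are not part of HumanEval; the real harness also runs the program in a separate process with a timeout, which this sketch leaves out.

```python
# Toy record in the HumanEval layout (hypothetical, not taken from the dataset).
problem = {
    "prompt": 'def add(a, b):\n    """Return a + b."""\n',
    "entry_point": "add",
    "test": "def check(candidate):\n    assert candidate(1, 2) == 3\n",
}


def build_check_program(problem: dict, completion: str) -> str:
    # Same concatenation order as the snippet from execution.py.
    return (
        problem["prompt"] + completion + "\n"
        + problem["test"] + "\n"
        + f"check({problem['entry_point']})"
    )


def passes(problem: dict, completion: str) -> bool:
    # Unsandboxed exec: fine for a toy example, unsafe for real model output.
    try:
        exec(build_check_program(problem, completion), {})
        return True
    except Exception:
        return False


# A completion with only the function body, and one that repeats the header.
body_only = "    return a + b\n"
with_header = 'def add(a, b):\n    """Return a + b."""\n    return a + b\n'

print(passes(problem, body_only), passes(problem, with_header))  # True True
```

Note that this relies on the prompt ending with a docstring. If the prompt were a bare def line with no body, a repeated header would make the program fail to compile.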

Evaluation metric 'Pass@k'

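HumanEval uses pass@k as its main metric: the probability that at least one of k generated samples for a problem passes all of its test cases, averaged over all problems.
- k = 1: the model generates a single solution, and that solution must pass all test cases.
- k > 1: the model generates k candidate solutions, and the problem counts as solved if at least one of them passes all test cases.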
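In practice pass@k is usually not measured by literally drawing k samples once. The HumanEval paper draws n ≥ k samples per problem, counts the c samples that pass, and uses the unbiased estimate 1 - C(n-c, k) / C(n, k), averaged over problems. A small sketch of the per-problem estimate (the function name pass_at_k is my own):

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed all tests, k: budget.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples failed,
    # so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


# With 10 samples of which 3 pass, pass@1 is the plain pass rate (about 0.3).
print(round(pass_at_k(10, 3, 1), 3))
```

For k = 1 this reduces to c / n, which is why pass@1 with several samples per problem is just the average pass rate.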

[Record] Reproducing `DeepSeek on HumanEval-python` in two days 🤩

I am recording this here because a project required me to reproduce DeepSeek's evaluation on HumanEval.

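The model used for the reproduction is DeepSeek-V2.5. I did not download the open-source weights from Hugging Face; the evaluation only calls the model through the API.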

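I started by cloning the DeepSeek-Coder repository. Reading the code shows that when they evaluate on HumanEval they do not feed the prompt in as-is; they wrap it in an instruction that constrains the output: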

def build_deepseekcoder_instruction(language: str, question: str):
    # The template has to start on the same line as `return`; a bare `return`
    # followed by the string on the next line would return None.
    return '''
Please continue to complete the function. You are not allowed to modify the given code and do the completion only. Please return all completed function in a codeblock. Here is the given code to do completion:
```{}
{}
```
'''.strip().format(language.lower(), question.strip())

The next question is how to set the two key parameters, temperature and top_p, when calling the API.

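temperature controls the randomness, or creativity, of the model's output.
- Range: usually between 0 and 1 (values above 1 are allowed but less common).
- Effect:
  - As temperature approaches 0, the output becomes more deterministic: the model tends to pick the highest-probability token, so the result is more conservative and closer to common patterns in the training data.
  - At higher temperature, token selection is more random and the output is more creative, but it can also become less coherent or less logical.
top_p (nucleus sampling) sets a cumulative-probability threshold that determines which candidate tokens can be sampled; it mainly controls diversity.
- Range: between 0 and 1.
- Effect:
  - top_p determines the candidate set at each step. With top_p=1 the model samples from all possible tokens; the lower top_p is, the smaller the set becomes: only the highest-probability tokens whose cumulative probability reaches top_p are kept.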
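To make the two parameters concrete, here is a rough sketch of how temperature and top_p are commonly combined when picking the next token. It is a toy version over a three-token vocabulary, not the sampling code of any particular API; the function names nucleus and sample_token are made up for illustration.

```python
import math
import random


def nucleus(logits: dict, temperature: float, top_p: float) -> list:
    """Return the (token, probability) pairs that survive top_p filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    # Temperature: divide logits before the softmax; small values sharpen it.
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    weights = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(weights.values())
    ranked = sorted(((w / total, tok) for tok, w in weights.items()), reverse=True)

    # top_p: keep the smallest high-probability prefix whose mass reaches top_p.
    kept, cumulative = [], 0.0
    for prob, tok in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break
    return kept


def sample_token(logits: dict, temperature: float, top_p: float,
                 rng: random.Random) -> str:
    kept = nucleus(logits, temperature, top_p)
    tokens = [tok for tok, _ in kept]
    probs = [prob for _, prob in kept]
    # random.choices renormalizes the weights of the surviving tokens.
    return rng.choices(tokens, weights=probs, k=1)[0]


logits = {"a": 2.0, "b": 1.0, "c": 0.0}
print([tok for tok, _ in nucleus(logits, temperature=1.0, top_p=0.9)])
print(sample_token(logits, temperature=0.2, top_p=0.95, rng=random.Random(0)))
```

With a low temperature such as 0.2, almost all of the probability mass sits on the top token, so the nucleus often shrinks to that single token and the output is close to greedy decoding.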

Since I could not find the parameters in DeepSeek's official code, I chose to align with the bigcode-models-leaderboard settings: temperature=0.2 and top_p=0.95.

The calling code is as follows:

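import os

from openai import OpenAI

# Assumed setup (not shown in the original notes): the OpenAI SDK pointed at
# DeepSeek's OpenAI-compatible endpoint, with the API key read from an
# environment variable whose name is chosen here for illustration.
client = OpenAI(api_key=os.environ["DEEPSEEK_API_KEY"], base_url="https://api.deepseek.com")

def generate_response(prompt):
    # Wrap the raw HumanEval prompt in the DeepSeek-Coder instruction template.
    prompt = build_deepseekcoder_instruction('Python', prompt)
    response = client.chat.completions.create(
        model="deepseek-coder",
        messages=[
            {"role": "user", "content": prompt}
        ],
        temperature=0.2,
        top_p=0.95,
        max_tokens=1024,
        stream=False
    )
    return response.choices[0].message.content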

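With both questions settled, I ran the first round. The official site reports a score as high as 89, but I only reproduced about 66, so something in my reproduction was still off. Looking through the results, the cause of the failures was highly concentrated: they were all indentation errors... 🙄🙄 So I read DeepSeek's logic for handling code indentation and modified it slightly (oddly, their logic prepends the part of the prompt that comes before the def, which in turn causes indentation problems; simply dropping that part is enough). After the change I scored again, and this time I finally got a satisfying result! 😄😄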
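For reference, the general shape of this kind of post-processing can be sketched as below. This is not DeepSeek's extraction logic and not my exact patch; it is a simplified, hypothetical helper (extract_code) that pulls the first fenced code block out of a chat reply and leaves the model's own indentation untouched, without prepending anything from the prompt.

```python
import re

# Built with string multiplication so this example can itself sit in a fenced block.
FENCE = "`" * 3

# Opening fence with an optional language tag, then the shortest span up to the
# next closing fence.
CODE_BLOCK = re.compile(FENCE + r"[^\n]*\n(.*?)" + FENCE, re.DOTALL)


def extract_code(reply: str) -> str:
    """Return the first fenced code block in a reply, or the reply itself."""
    match = CODE_BLOCK.search(reply)
    return match.group(1) if match else reply


reply = (
    "Here is the completed function:\n"
    + FENCE + "python\n"
    + "def add(a, b):\n    return a + b\n"
    + FENCE + "\nHope this helps."
)
print(extract_code(reply))
```

Because the instruction asks the model to return the whole completed function, the extracted block normally starts with its own def at column 0, and as shown earlier a repeated definition after the prompt still evaluates correctly.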