Cracking the Coding Evaluation
Tabby offers an open-source alternative to GitHub Copilot with easy setup and self-hosting options. We embrace an open ecosystem to support major open source coding LLMs (e.g. StarCoder, CodeLlama, WizardCoder, etc.), and enable easy integration of proprietary models. In addition, Tabby performs retrieval-augmented code completion to suggest code from your private codebase. We firmly believe in the continuous advancement of open source coding LLMs, yet we need quantitative measurements to guide product improvement and to help developers choose their model.
Evaluating coding LLMs has also been a hot topic in academia. Many metrics targeting different coding tasks have been proposed over the past year. At Tabby, we prioritize metrics that best resemble real-world development workflows, and of course, the metrics should be built from unbiased data sources. In this blogpost, we will discuss our thoughts on what a desirable code completion benchmark looks like, and also review the latest academic progress in this area.
Existing Paradigms
Existing coding LLM benchmarks mostly focus on the pass@k metric: generating k code samples per problem and measuring how often at least one of them passes the given unit tests. OpenAI introduced this metric in Evaluating Large Language Models Trained on Code in July 2021, along with the release of the HumanEval benchmark dataset.
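In practice, pass@k is computed with the unbiased estimator described in the OpenAI paper rather than by literally drawing k samples over and over. A minimal sketch of that estimator, assuming n generated samples per problem of which c pass the unit tests:

```python
import numpy as np


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    # Equals 1 - C(n-c, k) / C(n, k), computed in a numerically stable way.
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))


# Example: with 200 samples and 37 passing, pass@1 equals 37/200 = 0.185,
# and pass@10 is considerably higher.
print(pass_at_k(n=200, c=37, k=1))
print(pass_at_k(n=200, c=37, k=10))
```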
HumanEval
HumanEval is a hand-crafted dataset, consisting of 164 Python programming problems with unit tests. An example task looks like:
```python
from typing import List


def below_zero(operations: List[int]) -> bool:
    """
    You're given a list of deposit and withdrawal operations on a bank
    account that starts with zero balance. Your task is to detect if at any
    point the balance of account falls below zero, and at that point function
    should return True. Otherwise it should return False.
    >>> below_zero([1, 2, 3])
    False
    >>> below_zero([1, 2, -4, 5])
    True
    """
```
HumanEval was a pioneering research effort, but it now suffers from some unfortunate drawbacks:
- Data is likely contaminated. The HumanEval dataset has been around for over two years and has been discussed and documented widely online. The latest coding LLMs have likely included its test data in their training crawls, which would render the evaluation invalid.
- Trivial coding questions that don't mimic real engineering setups. HumanEval consists mostly of LeetCode-style interview questions, each asking the LLM to fill in the body of a single function. In a more realistic corporate setup, developers often add code across multiple files in a single PR and constantly refer to functions implemented in other files. These are more interesting yet challenging tasks for LLMs, and they are critical scenarios for AI coding assistants to land in enterprises.
- Unit tests are too weak. Researchers noticed that the test cases in HumanEval (on average 7.7 per problem) aren't enough to guarantee the correctness of the generated code, since a wrong implementation can still pass all existing tests (see the sketch after this list), and thus augmented the test cases by 80x in HumanEvalPlus.
- Limited coverage in programming languages. This one is obvious, as HumanEval only includes Python code. We ❤️ all programming languages!
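To illustrate the weak-test concern with the below_zero task from earlier: the deliberately wrong implementation below (a hypothetical example) still satisfies the two cases shown in the prompt's docstring, which is exactly how sparse test suites let incorrect code slip through.

```python
def below_zero(operations):
    # Deliberately wrong: never tracks a running balance, just looks for -4.
    return -4 in operations


assert below_zero([1, 2, 3]) is False     # passes
assert below_zero([1, 2, -4, 5]) is True  # passes

print(below_zero([5, -10]))  # prints False, but the correct answer is True
```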
Mostly Basic Programming Problems (MBPP)
MBPP is another popular benchmark for code generation. Researchers from Google introduced it in the paper Program Synthesis with Large Language Models in August 2021, one month after the release of HumanEval. It contains 974 entry-level Python programming tasks (as the name clearly suggests). An example looks like:
"""
Write a python function to remove first and last occurrence of a given character from the string.
"assert remove_Occ(\"hello\",\"l\") == \"heo\""
"assert remove_Occ(\"abcda\",\"a\") == \"bcd\""
"assert remove_Occ(\"PHP\",\"P\") == \"H\""
"""
Unlike HumanEval, MBPP targets basic tasks commonly encountered by engineers, such as string manipulation, simple arithmetic, and basic data structure operations. However, it still suffers from drawbacks similar to those of HumanEval described above.