Solely relying on test passing to evaluate Large Language Models (LLMs) for code synthesis may result in unfair assessment or promote models with data leakage. As an alternative, we introduce CodeMind, a framework designed to gauge the code reasoning abilities of LLMs. CodeMind currently supports three inductive code reasoning tasks: (1) Independent Execution Reasoning (IER), (2) Dynamic Semantics Reasoning (DSR), and (3) Specification Reasoning (SR). Please follow the instructions below to reproduce our results with existing models, tasks, and datasets. We also support adding new models, tasks, and datasets.
To install all the dependencies, run the following command:
pip install -r requirements.txt
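If you prefer an isolated environment, you can install the dependencies into a virtual environment first (a standard Python workflow, not a CodeMind requirement):
python -m venv codemind-env
source codemind-env/bin/activate
pip install -r requirements.txt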
CodeMind is designed to read the API keys required for API-access models from environment variables. Please modify and run setup.sh {OPENAI_KEY} {GEMINI_KEY}
to automatically add these variables on your local machine.
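For reference, below is a minimal sketch of what such a setup script looks like; the variable names OPENAI_API_KEY and GEMINI_API_KEY are assumptions here and may differ from the ones CodeMind actually reads:
#!/bin/bash
## setup.sh -- append API keys to the shell profile so they persist across sessions.
## Usage: bash setup.sh {OPENAI_KEY} {GEMINI_KEY}
echo "export OPENAI_API_KEY=$1" >> ~/.bashrc
echo "export GEMINI_API_KEY=$2" >> ~/.bashrc
## Open a new shell or run `source ~/.bashrc` afterward to load the keys.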
cd scripts
bash run_IER.sh {MODEL_ID} {CACHE_DIR} {DATASET} {LANGUAGE}
## Below is the command to run Magicoder on HumanEval:
bash run_IER.sh ise-uiuc/Magicoder-S-DS-6.7B ${path_to_store_checkpoints} humaneval Python
MODEL_ID: our framework currently supports the following OpenAI and Hugging Face models:
- gpt-3.5-turbo
- gpt-4-turbo
- codellama/CodeLlama-13b-Instruct-hf
- codellama/CodeLlama-34b-Instruct-hf
- codellama/CodeLlama-13b-hf
- deepseek-ai/deepseek-coder-6.7b-instruct
- deepseek-ai/deepseek-coder-6.7b-base
- deepseek-ai/deepseek-coder-33b-instruct
- ise-uiuc/Magicoder-S-DS-6.7B
- bigcode/starcoder
- bigcode/starcoder2-15b
- semcoder/semcoder_s
CACHE_DIR: path to store the downloaded pretrained Hugging Face model checkpoints.
DATASET: choose one from [Avatar, cruxeval, humaneval, classeval]
LANGUAGE: the programming language of the chosen dataset (e.g., Python)
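The same script also works with the API-access models. For example, the following invocation (illustrative, composed from the model and dataset lists above) evaluates gpt-3.5-turbo on cruxeval; note that the cache directory is still passed positionally, even though API models download no checkpoints:
bash run_IER.sh gpt-3.5-turbo ${path_to_store_checkpoints} cruxeval Python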
cd scripts
bash run_DSR.sh {MODEL_ID} {CACHE_DIR} {DATASET}
## Below is the command to run DSR for CodeLlama-13b-Instruct on HumanEval:
bash run_DSR.sh codellama/CodeLlama-13b-Instruct-hf ${path_to_store_checkpoints} humaneval
DATASET: choose one from [Avatar, cruxeval, humaneval, classeval]
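As with IER, any supported model/dataset pair from the lists above can be substituted, for example:
bash run_DSR.sh deepseek-ai/deepseek-coder-33b-instruct ${path_to_store_checkpoints} cruxeval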
cd scripts
bash run_SR.sh {MODEL_ID} {DATASET} {CACHE_DIR} {SR_TYPE}
## Below is the command to run SR for DeepSeek-Coder on classeval under the 'no_test' setting:
bash run_SR.sh deepseek-ai/deepseek-coder-6.7b-instruct classeval ${path_to_store_checkpoints} no_test
## Below is the command to run SR for DeepSeek-Coder on humaneval under the 'use_test' setting:
bash run_SR.sh deepseek-ai/deepseek-coder-6.7b-instruct humaneval ${path_to_store_checkpoints} use_test
SR_TYPE: either 'no_test' or 'use_test'
DATASET: choose one from [humaneval, classeval]
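To run both SR settings back to back, a simple shell loop over the commands above works:
for setting in no_test use_test; do
  bash run_SR.sh deepseek-ai/deepseek-coder-6.7b-instruct humaneval ${path_to_store_checkpoints} $setting
done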
Interested in reading more about CodeMind, the code reasoning tasks, and a grounded-theory study evaluating LLMs for code reasoning across five benchmarks and two programming languages? Please read the preprint on arXiv: https://arxiv.org/pdf/2402.09664.pdf
Citation:
@article{liu2024codemind,
  title={CodeMind: A Framework to Challenge Large Language Models for Code Reasoning},
  author={Liu, Changshu and Zhang, Shizhuo Dylan and Ibrahimzada, Ali Reza and Jabbarvand, Reyhaneh},
  journal={arXiv preprint arXiv:2402.09664},
  year={2024}
}
CodeMind is an open-source project to promote the proper evaluation of LLMs for code-related tasks. If you are interested in building on top of CodeMind and adding more code reasoning tasks, please send an email to {cl144,reyhaneh}@illinois.edu.