CodeMind is a generic framework for evaluating inductive code reasoning of LLMs. It is equipped with a static analysis component that enables in-depth analysis of the results.

Intelligent-CAT-Lab/CodeMind

CodeMind Framework

Relying solely on test passing to evaluate Large Language Models (LLMs) for code synthesis may result in unfair assessment or in promoting models with data leakage. As an alternative, we introduce CodeMind, a framework designed to gauge the code reasoning abilities of LLMs. CodeMind currently supports three inductive code reasoning tasks: (1) Independent Execution Reasoning (IER), (2) Dynamic Semantics Reasoning (DSR), and (3) Specification Reasoning (SR). Please follow the instructions below to reproduce the results using existing models, tasks, and datasets. We also support adding new models, tasks, and datasets.

Dependencies

To install all the dependencies, run the following command:

pip install -r requirements.txt

CodeMind is designed to read the API keys required for API-access models from local environment variables. Please modify and run setup.sh {OPENAIKEY} {GEMINIAPI} to automatically add the variables to your local machine.
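
Before launching an API-access model, it can be useful to confirm that the keys are visible to the current process. The following is a minimal sketch, assuming the keys end up as environment variables; the names OPENAI_API_KEY and GEMINI_API_KEY are hypothetical placeholders, so check setup.sh for the names it actually writes.

```python
import os

# Hypothetical variable names; replace them with the ones setup.sh actually exports.
REQUIRED_KEYS = ("OPENAI_API_KEY", "GEMINI_API_KEY")


def missing_api_keys(required=REQUIRED_KEYS, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_api_keys()
    if missing:
        print("Missing API keys:", ", ".join(missing))
    else:
        print("All API keys are set.")
```

If the check reports missing names in a fresh terminal, the shell session may simply need to be restarted so that the newly added variables are loaded.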

How to reproduce the results

IER Reasoning Task

cd scripts
bash run_IER.sh {MODEL_ID} {CACHE_DIR} {DATASET}

## Below is the command to run Magicoder on humaneval:
bash run_IER.sh ise-uiuc/Magicoder-S-DS-6.7B ${path_to_store_checkpoints} humaneval Python

MODEL_ID: Our framework currently supports the following OpenAI and Hugging Face models: gpt-3.5-turbo, gpt-4-turbo, codellama/CodeLlama-13b-Instruct-hf, codellama/CodeLlama-34b-Instruct-hf, codellama/CodeLlama-13b-hf, deepseek-ai/deepseek-coder-6.7b-instruct, deepseek-ai/deepseek-coder-6.7b-base, deepseek-ai/deepseek-coder-33b-instruct, ise-uiuc/Magicoder-S-DS-6.7B, bigcode/starcoder, bigcode/starcoder2-15b, semcoder/semcoder_s

CACHE_DIR: path to store the downloaded pretrained Hugging Face model checkpoints.
DATASET: choose one from the following list [Avatar, cruxeval, humaneval, classeval]
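
A mistyped dataset name is easier to catch before a large checkpoint is downloaded. The sketch below assembles the run_IER.sh invocation from the three documented arguments and rejects datasets not listed above. The helper name build_ier_command and the ./checkpoints path are hypothetical. Note that the Magicoder example above passes an additional trailing argument (Python), which this sketch does not model.

```python
import shlex

# Dataset names as listed in this README for the IER task.
IER_DATASETS = ("Avatar", "cruxeval", "humaneval", "classeval")


def build_ier_command(model_id, cache_dir, dataset):
    """Build the argument list for the documented usage:
    bash run_IER.sh {MODEL_ID} {CACHE_DIR} {DATASET}
    """
    if dataset not in IER_DATASETS:
        raise ValueError(
            f"unknown dataset {dataset!r}; expected one of {IER_DATASETS}"
        )
    return ["bash", "run_IER.sh", model_id, cache_dir, dataset]


if __name__ == "__main__":
    # "./checkpoints" is a placeholder cache directory.
    cmd = build_ier_command(
        "ise-uiuc/Magicoder-S-DS-6.7B", "./checkpoints", "humaneval"
    )
    # Print the command for inspection. To execute it, run it from the
    # scripts directory, e.g. subprocess.run(cmd, cwd="scripts", check=True).
    print(shlex.join(cmd))
```

Returning a list rather than a single string keeps each argument intact if a path contains spaces, and shlex.join is used only to display the command.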

DSR Reasoning Task

cd scripts
bash run_DSR.sh {MODEL_ID} {CACHE_DIR} {DATASET}

## Below is the command to run DSR for CodeLlama-13b-Instruct on humaneval:
bash run_DSR.sh codellama/CodeLlama-13b-Instruct-hf ${path_to_store_checkpoints} humaneval

DATASET: choose one from the following list [Avatar, cruxeval, humaneval, classeval]

SR Reasoning Task

cd scripts
bash run_SR.sh {MODEL_ID} {DATASET} {CACHE_DIR} {SR_TYPE}

## Below is the command to run SR for Deepseek-coder on classeval under the 'no_test' setting:
bash run_SR.sh deepseek-ai/deepseek-coder-6.7b-instruct classeval ${path_to_store_checkpoints} no_test

## Below is the command to run SR for Deepseek-coder on humaneval under the 'use_test' setting:
bash run_SR.sh deepseek-ai/deepseek-coder-6.7b-instruct humaneval ${path_to_store_checkpoints} use_test

SR_TYPE: can be 'no_test' or 'use_test'

DATASET: choose one from [humaneval, classeval]
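
Unlike run_IER.sh and run_DSR.sh, run_SR.sh takes the dataset before the cache directory, which is easy to get wrong when scripting several runs. The sketch below enumerates every documented (DATASET, SR_TYPE) combination for one model in the documented argument order. The helper name sr_commands and the ./checkpoints path are hypothetical.

```python
import itertools
import shlex

# Values as listed in this README for the SR task.
SR_DATASETS = ("humaneval", "classeval")
SR_TYPES = ("no_test", "use_test")


def sr_commands(model_id, cache_dir):
    """Return one run_SR.sh argument list per (dataset, SR_TYPE) pair.

    Argument order follows the documented usage:
    bash run_SR.sh {MODEL_ID} {DATASET} {CACHE_DIR} {SR_TYPE}
    """
    return [
        ["bash", "run_SR.sh", model_id, dataset, cache_dir, sr_type]
        for dataset, sr_type in itertools.product(SR_DATASETS, SR_TYPES)
    ]


if __name__ == "__main__":
    # "./checkpoints" is a placeholder cache directory.
    for cmd in sr_commands("deepseek-ai/deepseek-coder-6.7b-instruct", "./checkpoints"):
        print(shlex.join(cmd))
```

The printed lines can be pasted into a terminal from the scripts directory, or each list can be passed to subprocess.run with cwd="scripts".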

Paper

Interested in reading more about CodeMind, the code reasoning tasks, and a grounded-theory study evaluating LLMs for code reasoning across five benchmarks and two programming languages? Please read the pre-print on arXiv: https://arxiv.org/pdf/2402.09664.pdf

Citation:

@article{liu2024codemind,
  title={CodeMind: A Framework to Challenge Large Language Models for Code Reasoning},
  author={Liu, Changshu and Zhang, Shizhuo Dylan and Ibrahimzada, Ali Reza and Jabbarvand, Reyhaneh},
  journal={arXiv preprint arXiv:2402.09664},
  year={2024}
}

Contributing to CodeMind

CodeMind is an open-source project to promote the proper evaluation of LLMs for code-related tasks. If you are interested in building on top of CodeMind and adding more code reasoning tasks, please send an email to {cl144,reyhaneh}@illinois.edu.
