OpenAI's human-eval sampling benchmark

This repo uses OpenAI's human-eval dataset to determine optimal values of the temperature and top_p sampling parameters when generating solutions with the gpt-3.5-turbo model on an instruction-guided code generation task.
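
For context, sampling a candidate solution for a single human-eval problem with a given temperature and top_p might look roughly like the sketch below. This is illustrative only and assumes the openai Python client (v1+); it is not the repo's actual code, and the prompt handling is simplified.

```python
# Illustrative sketch only -- not the code in this repo.
# Assumes the openai Python client (>= 1.0) and OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

def sample_solution(problem_prompt: str, temperature: float, top_p: float) -> str:
    """Sample one candidate completion for a human-eval problem."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Complete the following Python function."},
            {"role": "user", "content": problem_prompt},
        ],
        temperature=temperature,
        top_p=top_p,
    )
    return response.choices[0].message.content
```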

By multi-threading across parameter combinations, human-eval solution generation, and solution evaluation, this benchmark runs in under 30 seconds.
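
As a rough illustration of that parallelization, the parameter grid can be fanned out with a thread pool. The sketch below uses Python's concurrent.futures and a hypothetical evaluate_combination helper; it is not the repo's actual implementation.

```python
# Illustrative sketch only -- evaluate_combination is a hypothetical helper,
# not a function from this repo.
from concurrent.futures import ThreadPoolExecutor
from itertools import product

temperatures = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
top_ps = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]

def evaluate_combination(temperature: float, top_p: float) -> float:
    """Placeholder: generate solutions for all problems and return pass@1."""
    ...

# One worker per (temperature, top_p) combination; each worker can in turn
# parallelize over problem generation and solution evaluation.
with ThreadPoolExecutor(max_workers=16) as pool:
    futures = {
        (t, p): pool.submit(evaluate_combination, t, p)
        for t, p in product(temperatures, top_ps)
    }
    results = {combo: future.result() for combo, future in futures.items()}
```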

Example run

We include an example run sweeping temperature and top_p from 0.0 to 1.0 with a step size of 0.2. See the CONFIG section in 1_run_eval.py to run the benchmark over different ranges. Note that DEBUG is enabled by default, which reduces the number of combinations and problems evaluated (to avoid accidentally consuming a lot of OpenAI credits).
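
The exact contents of the CONFIG section may differ from this sketch, but a configuration of this kind typically looks something like the following; all names and values here are illustrative, not copied from 1_run_eval.py.

```python
# Illustrative only -- check the actual CONFIG section in 1_run_eval.py.
DEBUG = True  # enabled by default to limit OpenAI API usage

STEP = 0.2
TEMPERATURES = [round(i * STEP, 1) for i in range(6)]  # 0.0 .. 1.0
TOP_PS = [round(i * STEP, 1) for i in range(6)]        # 0.0 .. 1.0

if DEBUG:
    # Evaluate only a small slice of the grid and a handful of problems.
    TEMPERATURES = TEMPERATURES[:2]
    TOP_PS = TOP_PS[:2]
    NUM_PROBLEMS = 5
```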

[Plot: pass@1 performance for various values of temperature and top_p]

Instructions

Make sure you have make and Python 3 installed. It's recommended to create and activate a virtual environment before running the make commands.

Make sure an OPENAI_API_KEY environment variable is set in your shell so the benchmark can call the OpenAI API for generation.

python -m venv .venv
source .venv/bin/activate

make install

make run-bench

make plot

Or simply:

make

Caveat

Note that the OpenAI documentation generally advises varying either temperature or top_p, but not both. However, it's not entirely clear why only one of the two should be varied. More details in this thread.