
Small improvements & fixes to SWE-Bench #1874

Merged 16 commits into OpenDevin:main on May 20, 2024

Conversation

@li-boxuan (Collaborator) commented May 18, 2024

I was able to run a few SWE-Bench benchmark instances myself by following the documentation - it was great! In general the experience was smooth, thanks to @xingyaoww, @libowen2121, and the team! I made a few small enhancements and fixes to further improve the developer experience.

  1. Always use `poetry run python` (i.e., the Python interpreter from Poetry's virtual environment) instead of `python` or `python3` in scripts, so the behavior is consistent.
  2. Make AGENT configurable. An argument now controls which agent to benchmark. To facilitate this, I removed the hardcoded CodeActAgent from run_infer.sh and added a VERSION attribute to all agents, since the benchmark needs to record the agent version (see the sketch after this list).
  3. Make EVAL_LIMIT configurable. An argument now controls how many instances to benchmark, which is useful for debugging and development.
  4. Fix the 'eval_output_dir' not defined error in run_infer.py.
  5. Other enhancements to the README file and the logs.
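
To make items 2 and 3 concrete, here is a minimal, self-contained sketch of the idea - not the actual OpenDevin code; the class layout, flag names (`--agent-cls`, `--eval-n-limit`), and version strings are illustrative assumptions. Each agent class carries a VERSION attribute that the benchmark can record, and the driver resolves the agent class and an optional instance limit from command-line arguments instead of hardcoding them:

```python
# Hypothetical sketch only; names and defaults are placeholders, not the real run_infer.py.
import argparse


class Agent:
    """Minimal stand-in for the framework's agent base class."""

    VERSION = "0.0"  # concrete agents override this so the benchmark can record it


class CodeActAgent(Agent):
    VERSION = "1.0"  # illustrative value, not the real agent version


# Registry used to resolve the --agent-cls argument to a class.
AGENT_CLASSES = {cls.__name__: cls for cls in (CodeActAgent,)}


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="SWE-Bench inference driver (sketch)")
    parser.add_argument(
        "--agent-cls",
        default="CodeActAgent",
        choices=sorted(AGENT_CLASSES),
        help="which agent to benchmark (replaces the hardcoded choice)",
    )
    parser.add_argument(
        "--eval-n-limit",
        type=int,
        default=None,
        help="run at most this many benchmark instances (useful for debugging)",
    )
    return parser.parse_args()


if __name__ == "__main__":
    args = parse_args()
    agent_cls = AGENT_CLASSES[args.agent_cls]
    print(
        f"agent={agent_cls.__name__} version={agent_cls.VERSION} "
        f"limit={args.eval_n_limit}"
    )
```

With run_infer.sh forwarding its AGENT and EVAL_LIMIT arguments to a driver along these lines, switching agents or capping a debug run requires no code edits.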

I also noticed that a lot of code in run_infer.py could be shared with other benchmarks, but since we only have one benchmark now, I think we should avoid over-engineering. A refactor and code dedup would be useful in the future once we have more benchmarks, though.

@li-boxuan li-boxuan changed the title Small improvements to SWE-Bench doc and scripts Small improvements & fixes to SWE-Bench doc and scripts May 19, 2024
@li-boxuan li-boxuan changed the title Small improvements & fixes to SWE-Bench doc and scripts Small improvements & fixes to SWE-Bench May 19, 2024
@li-boxuan li-boxuan marked this pull request as ready for review May 19, 2024 20:52
@li-boxuan li-boxuan added the evaluation and agent framework (strategies for prompting, agent, etc.) labels May 19, 2024
@xingyaoww (Collaborator) left a comment

Looks awesome to me!! Thanks for polishing it so well! :)

Review thread on evaluation/swe_bench/README.md (resolved)
@li-boxuan li-boxuan enabled auto-merge (squash) May 20, 2024 07:56
@li-boxuan li-boxuan merged commit b845a38 into OpenDevin:main May 20, 2024
25 checks passed
li-boxuan added a commit to li-boxuan/OpenDevin that referenced this pull request May 21, 2024
@libowen2121 (Contributor) commented:

Hey @li-boxuan! Thanks for the excellent work!

Labels: agent framework (strategies for prompting, agent, etc.), evaluation
Projects: none yet
Linked issues: none
3 participants