{"payload":{"feedbackUrl":"https://github.com/orgs/community/discussions/53140","repo":{"id":704162485,"defaultBranch":"main","name":"evals","ownerLogin":"johny-b","currentUserCanPush":false,"isFork":true,"isEmpty":false,"createdAt":"2023-10-12T17:03:45.000Z","ownerAvatar":"https://avatars.githubusercontent.com/u/33967107?v=4","public":true,"private":false,"isOrgOwned":false},"refInfo":{"name":"","listCacheKey":"v0:1702027504.0","currentOid":""},"activityList":{"items":[{"before":"3761fa093a86a4c5ac918899a95dcb8c410d4c72","after":"0108dd7e76d5f8e07f333d24ad268530eba4b315","ref":"refs/heads/main","pushedAt":"2023-12-15T08:05:48.000Z","pushType":"push","commitsCount":7,"pusher":{"login":"johny-b","name":null,"path":"/johny-b","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/33967107?s=80&v=4"},"commit":{"message":"Ballots v2 (#1390)\n\nThis is an update to the Ballots eval which includes\r\n\r\n- A better, cleaned, dataset\r\n- Improved prompting\r\n- Clearer README\r\n\r\n---------\r\n\r\nCo-authored-by: ojaffe <oliver.jaffe@hotmail.co.uk>","shortMessageHtmlLink":"Ballots v2 (<a class=\"issue-link js-issue-link\" data-error-text=\"Failed to load title\" data-id=\"1965817933\" data-permission-text=\"Title is private\" data-url=\"https://github.com/openai/evals/issues/1390\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/openai/evals/pull/1390/hovercard\" href=\"https://github.com/openai/evals/pull/1390\">openai#1390</a>)"}},{"before":"b6ac243ead431398a77fdea1726acd91de69e88b","after":"c092f26945a5744912efea8308e80952e46e73fa","ref":"refs/heads/bluff-ttt","pushedAt":"2023-12-08T09:26:02.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"johny-b","name":null,"path":"/johny-b","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/33967107?s=80&v=4"},"commit":{"message":"run tests","shortMessageHtmlLink":"run tests"}},{"before":null,"after":"b6ac243ead431398a77fdea1726acd91de69e88b","ref":"refs/heads/bluff-ttt","pushedAt":"2023-12-08T09:25:04.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"johny-b","name":null,"path":"/johny-b","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/33967107?s=80&v=4"},"commit":{"message":"test","shortMessageHtmlLink":"test"}},{"before":"23261b6d5c0a26f24c467f4544f87ca4a6991e17","after":"e22f69b4e2ff96e8dd0ef5d91a2ff4244ca348ea","ref":"refs/heads/bluff-1.0.0-fix","pushedAt":"2023-12-08T09:14:32.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"johny-b","name":null,"path":"/johny-b","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/33967107?s=80&v=4"},"commit":{"message":"Bluff now works with openai >= 1.0.0","shortMessageHtmlLink":"Bluff now works with openai &gt;= 1.0.0"}},{"before":"e3bb0c901e171ae00dc73c02669163e4566478be","after":"3761fa093a86a4c5ac918899a95dcb8c410d4c72","ref":"refs/heads/main","pushedAt":"2023-12-08T09:13:56.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"johny-b","name":null,"path":"/johny-b","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/33967107?s=80&v=4"},"commit":{"message":"[ci] Fix referencing API key for unit tests (#1425)\n\nUnblocks #1423 by fixing the unit test","shortMessageHtmlLink":"[ci] Fix referencing API key for unit tests (<a class=\"issue-link js-issue-link\" data-error-text=\"Failed to load title\" data-id=\"2031737288\" data-permission-text=\"Title is private\" data-url=\"https://github.com/openai/evals/issues/1425\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/openai/evals/pull/1425/hovercard\" href=\"https://github.com/openai/evals/pull/1425\">openai#1425</a>)"}},{"before":"ef98bcf44ad0560cda63ed6578a6ec3a610949b9","after":"23261b6d5c0a26f24c467f4544f87ca4a6991e17","ref":"refs/heads/bluff-1.0.0-fix","pushedAt":"2023-12-08T09:12:29.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"johny-b","name":null,"path":"/johny-b","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/33967107?s=80&v=4"},"commit":{"message":"Bluff now works with openai >= 1.0.0","shortMessageHtmlLink":"Bluff now works with openai &gt;= 1.0.0"}},{"before":"e90c60fccc48af12f5e75917b7c7e0bccf9fdf26","after":"ef98bcf44ad0560cda63ed6578a6ec3a610949b9","ref":"refs/heads/bluff-1.0.0-fix","pushedAt":"2023-12-08T09:09:50.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"johny-b","name":null,"path":"/johny-b","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/33967107?s=80&v=4"},"commit":{"message":"Secrets -> secrets in github workflow","shortMessageHtmlLink":"Secrets -&gt; secrets in github workflow"}},{"before":"23261b6d5c0a26f24c467f4544f87ca4a6991e17","after":"e90c60fccc48af12f5e75917b7c7e0bccf9fdf26","ref":"refs/heads/bluff-1.0.0-fix","pushedAt":"2023-12-07T19:41:16.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"johny-b","name":null,"path":"/johny-b","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/33967107?s=80&v=4"},"commit":{"message":"rerun tests","shortMessageHtmlLink":"rerun tests"}},{"before":null,"after":"5d0f5ba4ea3d19f2484b6a92a2db5cfca42bf1c7","ref":"refs/heads/silence-httpx-logger","pushedAt":"2023-12-06T14:41:02.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"johny-b","name":null,"path":"/johny-b","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/33967107?s=80&v=4"},"commit":{"message":"silence httpx logger","shortMessageHtmlLink":"silence httpx logger"}},{"before":null,"after":"23261b6d5c0a26f24c467f4544f87ca4a6991e17","ref":"refs/heads/bluff-1.0.0-fix","pushedAt":"2023-12-06T13:52:14.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"johny-b","name":null,"path":"/johny-b","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/33967107?s=80&v=4"},"commit":{"message":"Bluff now works with openai >= 1.0.0","shortMessageHtmlLink":"Bluff now works with openai &gt;= 1.0.0"}},{"before":"e96b4d35502125e354391044512d899268ade99d","after":"e3bb0c901e171ae00dc73c02669163e4566478be","ref":"refs/heads/main","pushedAt":"2023-12-06T13:24:29.000Z","pushType":"push","commitsCount":5,"pusher":{"login":"johny-b","name":null,"path":"/johny-b","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/33967107?s=80&v=4"},"commit":{"message":"Fix commandline --help exception (#1381)\n\ncurrently when running `oaieval --help`, it will throw an exception:\r\n```\r\nTraceback (most recent call last):\r\n  File \"/Users/yanlin/miniconda3/envs/modelenv/bin/oaieval\", line 8, in <module>\r\n    sys.exit(main())\r\n  File \"/Users/yanlin/workspace/github/evals/evals/cli/oaieval.py\", line 264, in main\r\n    args = cast(OaiEvalArguments, parser.parse_args(sys.argv[1:]))\r\n  File \"/Users/yanlin/miniconda3/envs/modelenv/lib/python3.10/argparse.py\", line 1833, in parse_args\r\n    args, argv = self.parse_known_args(args, namespace)\r\n  File \"/Users/yanlin/miniconda3/envs/modelenv/lib/python3.10/argparse.py\", line 1866, in parse_known_args\r\n    namespace, args = self._parse_known_args(args, namespace)\r\n  File \"/Users/yanlin/miniconda3/envs/modelenv/lib/python3.10/argparse.py\", line 2079, in _parse_known_args\r\n    start_index = consume_optional(start_index)\r\n  File \"/Users/yanlin/miniconda3/envs/modelenv/lib/python3.10/argparse.py\", line 2019, in consume_optional\r\n    take_action(action, args, option_string)\r\n  File \"/Users/yanlin/miniconda3/envs/modelenv/lib/python3.10/argparse.py\", line 1943, in take_action\r\n    action(self, namespace, argument_values, option_string)\r\n  File \"/Users/yanlin/miniconda3/envs/modelenv/lib/python3.10/argparse.py\", line 1106, in __call__\r\n    parser.print_help()\r\n  File \"/Users/yanlin/miniconda3/envs/modelenv/lib/python3.10/argparse.py\", line 2567, in print_help\r\n    self._print_message(self.format_help(), file)\r\n  File \"/Users/yanlin/miniconda3/envs/modelenv/lib/python3.10/argparse.py\", line 2551, in format_help\r\n    return formatter.format_help()\r\n  File \"/Users/yanlin/miniconda3/envs/modelenv/lib/python3.10/argparse.py\", line 283, in format_help\r\n    help = self._root_section.format_help()\r\n  File \"/Users/yanlin/miniconda3/envs/modelenv/lib/python3.10/argparse.py\", line 214, in format_help\r\n    item_help = join([func(*args) for func, args in self.items])\r\n  File \"/Users/yanlin/miniconda3/envs/modelenv/lib/python3.10/argparse.py\", line 214, in <listcomp>\r\n    item_help = join([func(*args) for func, args in self.items])\r\n  File \"/Users/yanlin/miniconda3/envs/modelenv/lib/python3.10/argparse.py\", line 214, in format_help\r\n    item_help = join([func(*args) for func, args in self.items])\r\n  File \"/Users/yanlin/miniconda3/envs/modelenv/lib/python3.10/argparse.py\", line 214, in <listcomp>\r\n    item_help = join([func(*args) for func, args in self.items])\r\n  File \"/Users/yanlin/miniconda3/envs/modelenv/lib/python3.10/argparse.py\", line 540, in _format_action\r\n    help_text = self._expand_help(action)\r\n  File \"/Users/yanlin/miniconda3/envs/modelenv/lib/python3.10/argparse.py\", line 637, in _expand_help\r\n    return self._get_help_string(action) % params\r\nTypeError: %o format: an integer is required, not dict\r\n```\r\n\r\nthe reason is just a '%' symbol in help string, use `%%` instead.","shortMessageHtmlLink":"Fix commandline --help exception (<a class=\"issue-link js-issue-link\" data-error-text=\"Failed to load title\" data-id=\"1946642064\" data-permission-text=\"Title is private\" data-url=\"https://github.com/openai/evals/issues/1381\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/openai/evals/pull/1381/hovercard\" href=\"https://github.com/openai/evals/pull/1381\">openai#1381</a>)"}},{"before":"aefb1c3c06bae844ca8bea2201932e129ba72921","after":"4bca3dfb68206c0b3626ac47027a9d524dc5c1af","ref":"refs/heads/bluff","pushedAt":"2023-11-15T12:10:05.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"johny-b","name":null,"path":"/johny-b","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/33967107?s=80&v=4"},"commit":{"message":"Bluff eval","shortMessageHtmlLink":"Bluff eval"}},{"before":"c6d5cfebe0a527a884da1fa5ac42c7ecc8a930eb","after":"aefb1c3c06bae844ca8bea2201932e129ba72921","ref":"refs/heads/bluff","pushedAt":"2023-11-15T07:55:24.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"johny-b","name":null,"path":"/johny-b","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/33967107?s=80&v=4"},"commit":{"message":"Bluff eval","shortMessageHtmlLink":"Bluff eval"}},{"before":"42552a128107f44c365a347915f6757aa0b0c64a","after":"c6d5cfebe0a527a884da1fa5ac42c7ecc8a930eb","ref":"refs/heads/bluff","pushedAt":"2023-11-15T07:52:40.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"johny-b","name":null,"path":"/johny-b","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/33967107?s=80&v=4"},"commit":{"message":"Revert \"openai.error.InvalidRequestError -> openai.BadRequestError in bluff\"\n\nThis reverts commit f224fa31626a7cb93d38af4574aa30c40a7b8cd6.","shortMessageHtmlLink":"Revert \"openai.error.InvalidRequestError -&gt; openai.BadRequestError in…"}},{"before":"9efa222d1caeb5ba69d6e675d72cd645451cba4a","after":"42552a128107f44c365a347915f6757aa0b0c64a","ref":"refs/heads/bluff","pushedAt":"2023-11-15T07:47:22.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"johny-b","name":null,"path":"/johny-b","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/33967107?s=80&v=4"},"commit":{"message":"Revert \"openai.error.InvalidRequestError -> openai.BadRequestError in bluff\"\n\nThis reverts commit f224fa31626a7cb93d38af4574aa30c40a7b8cd6.","shortMessageHtmlLink":"Revert \"openai.error.InvalidRequestError -&gt; openai.BadRequestError in…"}},{"before":"a06a07bbfe21d6ff13d62f98ef464021225a4987","after":"e96b4d35502125e354391044512d899268ade99d","ref":"refs/heads/main","pushedAt":"2023-11-15T07:45:15.000Z","pushType":"push","commitsCount":6,"pusher":{"login":"johny-b","name":null,"path":"/johny-b","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/33967107?s=80&v=4"},"commit":{"message":"Fix the OpenAI Version to <=0.28.1  (#1410)\n\n# Thank you for contributing an eval! ♥️\r\n\r\n🚨 Please make sure your PR follows these guidelines, **failure to follow\r\nthe guidelines below will result in the PR being closed automatically**.\r\nNote that even if the criteria are met, that does not guarantee the PR\r\nwill be merged nor GPT-4 access be granted. 🚨\r\n\r\n**PLEASE READ THIS**:\r\n\r\nIn order for a PR to be merged, it must fail on GPT-4. We are aware that\r\nright now, users do not have access, so you will not be able to tell if\r\nthe eval fails or not. Please run your eval with GPT-3.5-Turbo, but keep\r\nin mind as we run the eval, if GPT-4 gets higher than 90% on the eval,\r\nwe will likely reject it since GPT-4 is already capable of completing\r\nthe task.\r\n\r\nWe plan to roll out a way for users submitting evals to see the eval\r\nperformance on GPT-4 soon. Stay tuned! Until then, you will not be able\r\nto see the eval performance on GPT-4. **Starting April 10, the minimum\r\neval count is 15 samples, we hope this makes it easier to create and\r\ncontribute evals.**\r\n\r\nAlso, please note that we're using **Git LFS** for storing the JSON\r\nfiles, so please make sure that you move the JSON file to Git LFS before\r\nsubmitting a PR. Details on how to use Git LFS are available\r\n[here](https://git-lfs.com).\r\n\r\n## Eval details 📑\r\n\r\n### Eval name\r\n\r\n[Insert Eval name here]\r\n\r\n### Eval description\r\n\r\n[Insert a short description of what your eval does here]\r\n\r\n### What makes this a useful eval?\r\n\r\n[Insert why this eval is worth including and any additional context]\r\n\r\n## Criteria for a good eval ✅\r\n\r\nBelow are some of the criteria we look for in a good eval. In general,\r\nwe are seeking cases where the model does not do a good job despite\r\nbeing capable of generating a good response (note that there are some\r\nthings large language models cannot do, so those would not make good\r\nevals).\r\n\r\nYour eval should be:\r\n\r\n- [ ] Thematically consistent: The eval should be thematically\r\nconsistent. We'd like to see a number of prompts all demonstrating some\r\nparticular failure mode. For example, we can create an eval on cases\r\nwhere the model fails to reason about the physical world.\r\n- [ ] Contains failures where a human can do the task, but either GPT-4\r\nor GPT-3.5-Turbo could not.\r\n- [ ] Includes good signal around what is the right behavior. This means\r\neither a correct answer for `Basic` evals or the `Fact` Model-graded\r\neval, or an exhaustive rubric for evaluating answers for the `Criteria`\r\nModel-graded eval.\r\n- [ ] **Include at least 15 high-quality examples.**\r\n\r\nIf there is anything else that makes your eval worth including, please\r\ndocument it below.\r\n\r\n### Unique eval value\r\n\r\n> Insert what makes your eval high quality that was not mentioned above.\r\n(Not required)\r\n\r\n## Eval structure 🏗️\r\n\r\nYour eval should\r\n\r\n- [ ] Check that your data is in `evals/registry/data/{name}`\r\n- [ ] Check that your YAML is registered at\r\n`evals/registry/evals/{name}.yaml`\r\n- [ ] Ensure you have the right to use the data you submit via this eval\r\n\r\n(For now, we will only be approving evals that use one of the existing\r\neval classes. You may still write custom eval classes for your own\r\ncases, and we may consider merging them in the future.)\r\n\r\n## Final checklist 👀\r\n\r\n### Submission agreement\r\n\r\nBy contributing to Evals, you are agreeing to make your evaluation logic\r\nand data under the same MIT license as this repository. You must have\r\nadequate rights to upload any data used in an Eval. OpenAI reserves the\r\nright to use this data in future service improvements to our product.\r\nContributions to OpenAI Evals will be subject to our usual Usage\r\nPolicies (<https://platform.openai.com/docs/usage-policies>).\r\n\r\n- [ ] I agree that my submission will be made available under an MIT\r\nlicense and complies with OpenAI's usage policies.\r\n\r\n### Email address validation\r\n\r\nIf your submission is accepted, we will be granting GPT-4 access to a\r\nlimited number of contributors. Access will be given to the email\r\naddress associated with the commits on the merged pull request.\r\n\r\n- [ ] I acknowledge that GPT-4 access will only be granted, if\r\napplicable, to the email address used for my merged pull request.\r\n\r\n### Limited availability acknowledgment\r\n\r\nWe know that you might be excited to contribute to OpenAI's mission,\r\nhelp improve our models, and gain access to GPT-4. However, due to the\r\nrequirements mentioned above and the high volume of submissions, we will\r\nnot be able to accept all submissions and thus not grant everyone who\r\nopens a PR GPT-4 access. We know this is disappointing, but we hope to\r\nset the right expectation before you open this PR.\r\n\r\n- [ ] I understand that opening a PR, even if it meets the requirements\r\nabove, does not guarantee the PR will be merged nor GPT-4 access be\r\ngranted.\r\n\r\n### Submit eval\r\n\r\n- [ ] I have filled out all required fields of this form\r\n- [ ] I have used **Git LFS** for the Eval JSON data\r\n- [ ] (Ignore if not submitting code) I have run `pip install\r\npre-commit; pre-commit install` and have verified that `mypy`, `black`,\r\n`isort`, `autoflake` and `ruff` are running when I commit and push\r\n\r\nFailure to fill out all required fields will result in the PR being\r\nclosed.\r\n\r\n### Eval JSON data\r\n\r\nSince we are using Git LFS, we are asking eval submitters to add in as\r\nmany Eval Samples (at least 5) from their contribution here:\r\n\r\n<details>\r\n  <summary>View evals in JSON</summary>\r\n\r\n  ### Eval\r\n  ```jsonl\r\n  INSERT_EVAL_HERE\r\n  ```\r\n</details>","shortMessageHtmlLink":"Fix the OpenAI Version to &lt;=0.28.1 (<a class=\"issue-link js-issue-link\" data-error-text=\"Failed to load title\" data-id=\"1993938985\" data-permission-text=\"Title is private\" data-url=\"https://github.com/openai/evals/issues/1410\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/openai/evals/pull/1410/hovercard\" href=\"https://github.com/openai/evals/pull/1410\">openai#1410</a>)"}},{"before":"f224fa31626a7cb93d38af4574aa30c40a7b8cd6","after":"9efa222d1caeb5ba69d6e675d72cd645451cba4a","ref":"refs/heads/bluff","pushedAt":"2023-11-15T02:56:04.000Z","pushType":"push","commitsCount":4,"pusher":{"login":"andrew-openai","name":"Andrew","path":"/andrew-openai","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/120423412?s=80&v=4"},"commit":{"message":"Merge branch 'main' into bluff","shortMessageHtmlLink":"Merge branch 'main' into bluff"}},{"before":null,"after":"354e55f3fcd44a07ff2f7e919a73ba0123aac73a","ref":"refs/heads/migrate","pushedAt":"2023-11-14T09:51:09.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"johny-b","name":null,"path":"/johny-b","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/33967107?s=80&v=4"},"commit":{"message":"Migrate from openai==0.28.1 to openai==1.2.4\n\nI've tested a few evals and they seem to work.\n\nThings I did:\n\n* run openai migrate\n* add \"import openai\" where it was necessary and removed by migration\n  script\n* change [\"id\"] to .id in registry.py\n* remove \"api_base\" and \"api_key\" arguments where they are no longer\n  accepted\n* add model_dump() in both get_completion functions\n\nNOTE: THIS IS NOT READY YET. E.g. I don't know why we had \"api_base\" and\n\"api_key\" where I removed - perhaps they should be now passed somewhere\nelse?","shortMessageHtmlLink":"Migrate from openai==0.28.1 to openai==1.2.4"}},{"before":"b52b7a8850761b4f222b5eacb81f71b47f86f9bc","after":"f224fa31626a7cb93d38af4574aa30c40a7b8cd6","ref":"refs/heads/bluff","pushedAt":"2023-11-14T08:54:34.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"johny-b","name":null,"path":"/johny-b","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/33967107?s=80&v=4"},"commit":{"message":"openai.error.InvalidRequestError -> openai.BadRequestError in bluff\n\nNOTE: I didn't check if this works as intended - still working on making\noaieval work with the new openai version.","shortMessageHtmlLink":"openai.error.InvalidRequestError -&gt; openai.BadRequestError in bluff"}},{"before":"198ccaad148f339746c8f69c8e05cbe48b586349","after":"b52b7a8850761b4f222b5eacb81f71b47f86f9bc","ref":"refs/heads/bluff","pushedAt":"2023-11-14T08:22:06.000Z","pushType":"push","commitsCount":2,"pusher":{"login":"johny-b","name":null,"path":"/johny-b","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/33967107?s=80&v=4"},"commit":{"message":"Merge remote-tracking branch 'origin/main' into bluff","shortMessageHtmlLink":"Merge remote-tracking branch 'origin/main' into bluff"}},{"before":"a53f52b3c104542ec1953d26a4bd551685da8bb4","after":"a06a07bbfe21d6ff13d62f98ef464021225a4987","ref":"refs/heads/main","pushedAt":"2023-11-14T08:20:52.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"johny-b","name":null,"path":"/johny-b","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/33967107?s=80&v=4"},"commit":{"message":"[Evals] Update the errors we except for retries (#1406)\n\nResolve https://github.com/openai/evals/issues/1399","shortMessageHtmlLink":"[Evals] Update the errors we except for retries (<a class=\"issue-link js-issue-link\" data-error-text=\"Failed to load title\" data-id=\"1991082210\" data-permission-text=\"Title is private\" data-url=\"https://github.com/openai/evals/issues/1406\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/openai/evals/pull/1406/hovercard\" href=\"https://github.com/openai/evals/pull/1406\">openai#1406</a>)"}},{"before":"462a5e1139be8c863a4b49965a59672b2e3f4199","after":"198ccaad148f339746c8f69c8e05cbe48b586349","ref":"refs/heads/bluff","pushedAt":"2023-11-10T12:43:10.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"johny-b","name":null,"path":"/johny-b","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/33967107?s=80&v=4"},"commit":{"message":"Bluff eval","shortMessageHtmlLink":"Bluff eval"}},{"before":"21e877d7ef24e1d66cab909912dfcca6e2816649","after":"462a5e1139be8c863a4b49965a59672b2e3f4199","ref":"refs/heads/bluff","pushedAt":"2023-11-10T12:40:42.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"johny-b","name":null,"path":"/johny-b","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/33967107?s=80&v=4"},"commit":{"message":"Bluff eval","shortMessageHtmlLink":"Bluff eval"}},{"before":null,"after":"21e877d7ef24e1d66cab909912dfcca6e2816649","ref":"refs/heads/bluff","pushedAt":"2023-11-10T12:28:26.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"johny-b","name":null,"path":"/johny-b","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/33967107?s=80&v=4"},"commit":{"message":"Bluff eval","shortMessageHtmlLink":"Bluff eval"}},{"before":"dd96814dd96bd64f3098afca8dc873aa8d8ce4c8","after":"a53f52b3c104542ec1953d26a4bd551685da8bb4","ref":"refs/heads/main","pushedAt":"2023-11-10T12:01:27.000Z","pushType":"push","commitsCount":5,"pusher":{"login":"johny-b","name":null,"path":"/johny-b","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/33967107?s=80&v=4"},"commit":{"message":"Add new Solvers framework (#1397)\n\n# Solvers\r\n\r\nIn this PR, we introduce a new abstraction called \"Solvers\" as an\r\nintermediary interface between an Eval and a CompletionFn.\r\n\r\n## Motivation\r\nThis addresses some difficulties we previously had:\r\n- We want to be able to easily run and compare different kinds of model\r\nscaffolding approaches against a given Eval.\r\n- The current interface for CompletionFns requires users to pass a\r\n**prompt** to the CompletionFn, which encourages the eval designer to\r\nwrite a prompt that often privileges a particular kind of model over\r\nothers and often locks-in the scaffolding approach. e.g. If developing\r\nwith ChatCompletion models, the resulting prompt will usually work best\r\nfor ChatCompletion models.\r\n- It’s technically possible for eval designers to write solver-agnostic\r\nprompts, but the string format is hard to parse and reshape into new\r\nprompts. To enable flexibility, you want to provide instructions,\r\ninputs, previous interactions, and other task data separately rather\r\nthan just a single string.\r\n\r\n## Solution\r\n- In our proposed approach, we clearly separate the responsibilities of\r\ndefining the rules, inputs, and metrics for a task (the \"Eval\") from the\r\nresponsibility of solving the task (the \"Solver\").\r\n- An Eval's responsibility is to construct a structured TaskState object\r\ncontaining all the necessary information for the eval, but the Eval\r\nitself is unopinionated about how that information should be used. In\r\nother words, the Eval should be agnostic to the Solver that attempts it.\r\n- A Solver receives the TaskState object and decides how to use that\r\ninformation -- e.g. concatenating it into a prompt and passing that\r\nprompt into a CompletionFn. Note that a Solver can generate its response\r\nin any way, and may call any number of CompletionFn's, wait for human\r\ninput, or generate a response from a programmatic bot without any models\r\ninvolved.\r\n- When the Solver is done, it returns a SolverResult to be judged by the\r\nEval.\r\n\r\n## What's new\r\n- We introduce a `Solver` class that inherits from `CompletionFn`. This\r\nlooks largely the same as a CompletionFn except that its input is a\r\nstructured TaskState object instead of a plain string prompt.\r\n- Along with the Solver base class, we also introduce a variety of\r\nSolvers that are useful for various models including a HumanCLISolver,\r\nOpenAIChatCompletionSolver, OpenAICompletionSolver, and more!\r\n- We introduce a `SolverEval` class that inherits from `Eval`, which\r\nshould be used by any eval that wants to use solvers. Key features:\r\n- Allows us to be explicit about what kind of eval we're building, and\r\nenforces checks on the input completion_fn to see if it is a compatible\r\nSolver.\r\n- Creates a new copy of the solver for each run of `eval_sample`, to\r\nallow for stateful solvers (e.g. agents with memory) without interfering\r\nwith other sample runs.\r\n- Add new generic `MatchWithSolvers` class which is similar to a `Match`\r\nEval class but uses SolverEval instead.\r\n\r\n## Usage and Compatibility\r\nAs before, once a new SolverEval and Solver have been registered to\r\n`evals/registry/evals` and `evals/registry/completion_fns` respectively,\r\none can run an eval with:\r\n```bash\r\noaieval <solver> <solvereval>\r\n```\r\nwhere `<solver>` is a Solver and `<solvereval>` is a SolverEval.\r\n\r\nIn general, Solvers are not compatible with plain Evals, and SolverEvals\r\nare not compatible with plain CompletionFns (since the passing of the\r\nTaskState object is a breaking change on the interface). That said, we\r\nprovide wrappers for the common `OpenAICompletionFn` and\r\n`OpenAIChatCompletionFn` so that users can use these simple model-based\r\ncompletion_fns with SolverEvals out-of-the-box:\r\n```bash\r\noaieval gpt-4 <solvereval>\r\n```","shortMessageHtmlLink":"Add new Solvers framework (<a class=\"issue-link js-issue-link\" data-error-text=\"Failed to load title\" data-id=\"1977828035\" data-permission-text=\"Title is private\" data-url=\"https://github.com/openai/evals/issues/1397\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/openai/evals/pull/1397/hovercard\" href=\"https://github.com/openai/evals/pull/1397\">openai#1397</a>)"}},{"before":null,"after":"0e8dd2c0f8855cac254e0cbd41a90a9e06ea9282","ref":"refs/heads/deepcopy_in_recorder","pushedAt":"2023-10-12T17:10:28.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"johny-b","name":null,"path":"/johny-b","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/33967107?s=80&v=4"},"commit":{"message":"Deepcopy in recorder\n\nRecorder doesn't write the record file immediately when record_sampling\nis called, but does this once there are enough entries to be written.\nTo be able to write the logs later, recorder saves the prompt object\npassed to record_sampling.\nIf someone alters the prompt object after a call to record_sampling,\nbut before data is dumped to a file, recorder writes incorrect data.\n\nThis is fixed with deepcopy. Note: common copy() would help for\ncases when we add/remove stuff from the prompt, but not if we modify\nelements of it.\n\nExample. There should be no difference between this:\n\n```\nmessages = [{\"role\": \"system\", \"content\": \"hello there!\"}]\nresponse = self.completion_fns[0](prompt=messages).get_completions()[0]\nmessages = messages + [{\"role\": \"assistant\", \"content\": response}]\nmessages += [{\"role\": \"system\", \"content\": \"tell me a joke\"}]\nresponse = self.completion_fns[0](prompt=messages).get_completions()[0]\n```\n\nand\n\n```\nmessages = [{\"role\": \"system\", \"content\": \"hello there!\"}]\nresponse = self.completion_fns[0](prompt=messages).get_completions()[0]\nmessages.append({\"role\": \"assistant\", \"content\": response})\nmessages += [{\"role\": \"system\", \"content\": \"tell me a joke\"}]\nresponse = self.completion_fns[0](prompt=messages).get_completions()[0]\n```\n\nbut now the second code snipped creates weird logs.","shortMessageHtmlLink":"Deepcopy in recorder"}}],"hasNextPage":false,"hasPreviousPage":false,"activityType":"all","actor":null,"timePeriod":"all","sort":"DESC","perPage":30,"cursor":"djE6ks8AAAADywZwrQA","startCursor":null,"endCursor":null}},"title":"Activity · johny-b/evals"}