Add COMET to the evaluation #587
Conversation
Force-pushed from 1eb872d to f883cf7 (compare)
This looks fine to me with the requirements regenerated. Does it need @eu9ene's review as well?
pipeline/eval/requirements/eval.txt (outdated)
colorama==0.4.6 \
    --hash=sha256:08695f5cb7ed6e0531a20572697297273c47b8cae5a63ffc6d6ed5c201be6e44 \
    --hash=sha256:4f1d9991f5acc0ca119f9d443620b77f9d6b33703e51011c16baf57afb285fc6
aiohttp==3.9.5
It's difficult to argue that there's any real security risk in not pinning hashes, but could you please regenerate this with --generate-hashes anyway, to be safe? (I suppose the worst thing that could happen is pulling in cryptomining code or some such...)
This wasn't an intentional change on my part. To guard against this in the future, I added a task to update the requirements with --generate-hashes:
echo "task update-requirements -- pipeline/eval/requirements/eval.in"; | ||
exit 1 | ||
fi | ||
# Make sure the command is being run for docker |
Python 3.12 breaks pip-compile on my machine for our repo, which is why I am enforcing Docker.
Yup, that makes sense. (We do this in a bunch of other places as well for the same reason.)
@@ -207,3 +207,30 @@ tasks:
    cmds:
      - >-
        PYTHONPATH=$(pwd) poetry run python -W ignore utils/run_model.py {{.CLI_ARGS}}
This is an additional piece of code to address @bhearsum's feedback.
Nice! I see we've been running evals on a GPU machine, so we should double-check that COMET actually uses it. It outputs some warnings: https://firefox-ci-tc.services.mozilla.com/tasks/BoXF3JpuQeOfjfp485c-2Q/runs/0/logs/public/logs/live.log. The scoring with COMET took 15 minutes, and it's only about 1,000 examples, I think.
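A minimal sketch of how one might verify GPU usage locally, assuming the standard unbabel-comet API; the checkpoint name and sample data here are illustrative, not what the pipeline necessarily uses:

```python
# Sketch: confirm COMET actually runs on the GPU. Assumes the
# unbabel-comet package; checkpoint name and data are illustrative.
import torch
from comet import download_model, load_from_checkpoint

checkpoint_path = download_model("Unbabel/wmt22-comet-da")
comet_model = load_from_checkpoint(checkpoint_path)

data = [{"src": "Hallo Welt", "mt": "Hello world", "ref": "Hello, world"}]

# predict() takes a GPU count; 0 falls back to the CPU.
gpus = 1 if torch.cuda.is_available() else 0
results = comet_model.predict(data, gpus=gpus)
print(results.system_score)
```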
if [[ -z "{{.CLI_ARGS}}" ]]; then
  echo "Provide a path to the .in file";
  echo "For example:"
  echo "task update-requirements -- pipeline/eval/requirements/eval.in";
I assume the path should not be hardcoded here.
The path is just an example path.
pipeline/eval/eval.py (outdated)
@@ -158,11 +164,11 @@ def main(args_list: Optional[list[str]] = None) -> None:
     elif args.model_variant == "gpu":
         if not args.workspace:
             raise Exception("The workspace size was not provided")
-        if not args.gpus:
-            raise Exception("The number of GPUs was not provided")
+        # if not args.gpus:
Uncomment or remove?
Thanks, that's old code. I'll remove it.
pipeline/eval/eval.py (outdated)
    comet_data.append({"src": source, "mt": target, "ref": target_ref})
comet_results = comet_model.predict(comet_data, gpus=args.gpus)
# Reduce the precision.
comet_score = float(round(comet_results.system_score * 1e4)) / 1e4
Suggested change:
- comet_score = float(round(comet_results.system_score * 1e4)) / 1e4
+ comet_score = round(comet_results.system_score, 4)
That's much nicer!
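For the record, the two expressions should agree on typical scores; a quick check with an illustrative value:

```python
# Sanity check: the scale-round-rescale expression and round(x, 4)
# agree on a typical system score. (Exact .5 ties could in principle
# differ, but binary floats rarely land on them exactly.)
score = 0.8412345
assert float(round(score * 1e4)) / 1e4 == round(score, 4) == 0.8412
```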
] # fmt:skip

@pytest.mark.parametrize("params", test_data, ids=[d[0] for d in test_data])
def test_evaluate(params) -> None:
-    (task_name, model_type, model_name) = params
+    (task_name, model_type, model_name, comet) = params
Should we separate running COMET into another test so that we can at least easily run the fast ones?
If we do, maybe we should follow up. I'm a bit annoyed by how long this takes to run. I spent a bit of time trying to train my own, much smaller COMET model so that it could potentially run faster.
Splitting up the test would be hard, since it's a run_task test and not a unit test where it's easy to run just parts of it. I'd have to add a bunch of environment variables to artificially skip work, and then the assertions would get a bit messy.
Of course I figured it out after typing this. I split up the test.
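A rough sketch of what such a split could look like; the task-name tuples and the "slow" marker here are hypothetical (the marker would need to be registered in the pytest config):

```python
import pytest

# Hypothetical fast cases; the real test_data tuples live in the repo.
fast_test_data = [
    ("evaluate-teacher", "base", "teacher"),
] # fmt:skip

@pytest.mark.parametrize("params", fast_test_data, ids=[d[0] for d in fast_test_data])
def test_evaluate(params) -> None:
    (task_name, model_type, model_name) = params
    ...  # run the task and assert on the BLEU/chrF artifacts

@pytest.mark.slow  # assumes a "slow" marker registered in the pytest config
def test_evaluate_comet() -> None:
    ...  # run the task with COMET enabled and assert on the COMET score
```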
pipeline/eval/eval.py (outdated)
comet_data = []
for source, target, target_ref in zip(source_lines, target_lines, target_ref_lines):
    comet_data.append({"src": source, "mt": target, "ref": target_ref})
comet_results = comet_model.predict(comet_data, gpus=args.gpus)
You can also control batch_size here. The default might be fine (16, I think), but ideally we should check how it utilizes GPU memory and bump it if the GPU is underutilized.
I removed the parameter since I figured the defaults would be fine.

> ideally we should check how it utilizes GPU memory

Ah, I haven't used the dashboards to do any tuning like this yet.
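If tuning does turn out to be worthwhile, predict() accepts a batch_size alongside gpus in the unbabel-comet API; a one-line sketch, where 64 is only an illustrative guess:

```python
# Sketch: raise batch_size if nvidia-smi shows GPU memory sitting idle.
# The default is believed to be 16; 64 here is an illustrative guess.
comet_results = comet_model.predict(comet_data, batch_size=64, gpus=args.gpus)
```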
You can use an interactive task and run stuff manually to troubleshoot GPU issues. You would need to use tmux and nvidia-smi to monitor GPU usage.
Re: the test failure, I'm very confused by it. It's looking for …
Force-pushed from 0d67e16 to 212f1a0 (compare)
  pip install pip-tools
fi
# Finally generate the hashes.
- pip-compile --generate-hashes {{.CLI_ARGS}} --allow-unsafe
--allow-unsafe fixed my issue for installing torch, and it's the recommended config:

--allow-unsafe / --no-allow-unsafe
    Pin packages considered unsafe: distribute, pip, setuptools.
    WARNING: Future versions of pip-tools will enable this behavior by
    default. Use --no-allow-unsafe to keep the old behavior. It is
    recommended to pass the --allow-unsafe now to adapt to the upcoming
    change.
* Update bicleaner
* Lock requirements
* Use larger multilingual models
* Fix requirements
* Fix typo
* Add libs for hunspell
* Fix model downloading
* Fix model downloading
* Use a toolchain for cyhunspell
* Remove a hard github dependency
* Fix test
* Add COMET to the evaluation (#587)
* Custom cleaning (#547)
* Update default config
* Pre-download fast text model
* Add custom filters
* Add unit tests for config generation
* Make using custom filtering configs configurable
* Fix substitution
* Parse tasks with label finetune-student (#609)
* Add test case
* Update regex
* Add support for automatically continuing training from earlier runs of a Task (fixes #270) (#580)
* Add support for automatically continuing training from earlier runs of a Task.
* Automatically retry training tasks on exit code 17. This is the exit code when train_taskcluster.py fails to download an artifact when attempting to resume.
* Support news-crawl importer (#608)
* Ensure all of the task labels can be parsed in the task graph (#612)
* Show the stack trace when there is an error in tracking
* Rename parse_tag to parse_task_label
* Document the regexes
* Turn the parsed label into a NamedTuple for better typing hints
* Get the tag parsing working with the full task graph
* Allow for 3 letter language codes
* Temporarily disable an evaluate task that is failing
* Update code docs a bit
* Fix tests for finetune-student
* Use shorter names for URLs (#611)
* Change the caching strategy for teachers (#606)
* Fix comment

---------

Co-authored-by: Greg Tatum <gregtatum@users.noreply.github.com>
Co-authored-by: Valentin Rigal <rigal@teklia.com>
Co-authored-by: Ben Hearsum (he/him) <ben@mozilla.com>
Add COMET to the evaluation task. I would like to follow up with pre-downloading the model via some kind of fetch task rather than downloading it from the source every time. Downloading it every time is probably less reliable, but this at least unblocks things.
Resolves #503