Add COMET to the evaluation #587

Merged: gregtatum merged 10 commits into mozilla:main on May 16, 2024
Conversation

@gregtatum (Member) commented May 10, 2024

Add COMET to the evaluation task. I would like to follow up with pre-downloading the model using some kind of fetch task, rather than downloading it from the source every time. Downloading it every time is probably less reliable, but it at least unblocks things.

Resolves #503
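For context, a minimal sketch of how scoring with unbabel-comet works (the model name "Unbabel/wmt22-comet-da" is an assumption; the predict() call and system_score attribute match the code added in this PR):

from comet import download_model, load_from_checkpoint

# Assumed model name; the PR downloads the model from the source at runtime.
model_path = download_model("Unbabel/wmt22-comet-da")
comet_model = load_from_checkpoint(model_path)

comet_data = [
    {"src": "Bonjour le monde", "mt": "Hello world", "ref": "Hello, world"},
]
comet_results = comet_model.predict(comet_data, gpus=1)
print(comet_results.system_score)  # corpus-level score, roughly in [0, 1]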

@gregtatum gregtatum force-pushed the comet branch 5 times, most recently from 1eb872d to f883cf7 on May 13, 2024 17:47
@gregtatum gregtatum changed the title WIP - Add COMET to the evaluation Add COMET to the evaluation May 13, 2024
@gregtatum gregtatum marked this pull request as ready for review May 13, 2024 17:49
@gregtatum gregtatum requested a review from a team as a code owner May 13, 2024 17:49
@bhearsum (Collaborator) left a comment:
This looks fine to me with the requirements regenerated. Does it need @eu9ene's review as well?

colorama==0.4.6 \
--hash=sha256:08695f5cb7ed6e0531a20572697297273c47b8cae5a63ffc6d6ed5c201be6e44 \
--hash=sha256:4f1d9991f5acc0ca119f9d443620b77f9d6b33703e51011c16baf57afb285fc6
aiohttp==3.9.5
Collaborator:

It's difficult to argue that there's any real security risks to not pinning hashes, but could you please regenerate this with --generate-hashes anyways, to be safe? (I suppose the worst thing that could happen is pulling in cryptomining code or some such...)

@gregtatum (Member, Author) replied May 13, 2024:

This wasn't an intentional change on my part. To guard against this in the future, I added a task to update the requirements with --generate-hashes:

27d130a

@gregtatum gregtatum requested review from eu9ene and bhearsum and removed request for gabrielBusta May 13, 2024 22:09
echo "task update-requirements -- pipeline/eval/requirements/eval.in";
exit 1
fi
# Make sure the command is being run for docker
@gregtatum (Member, Author) commented:

Python 3.12 breaks pip-compile on my machine for our repo, which is why I am enforcing Docker.

Collaborator:

Yup, that makes sense. (We do this in a bunch of other places as well for the same reason.)

@@ -207,3 +207,30 @@ tasks:
cmds:
- >-
PYTHONPATH=$(pwd) poetry run python -W ignore utils/run_model.py {{.CLI_ARGS}}

@gregtatum (Member, Author) commented:

This is an additional piece of code to address @bhearsum's feedback.

@eu9ene (Collaborator) left a comment:

Nice! I see we've been running evals on a GPU machine, so we should double-check that COMET actually uses it. It outputs some warnings: https://firefox-ci-tc.services.mozilla.com/tasks/BoXF3JpuQeOfjfp485c-2Q/runs/0/logs/public/logs/live.log. The scoring with COMET took 15 minutes, and it's just 1000 examples, I think.
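One hedged way to double-check this (a sketch; assumes torch is importable in the eval environment, since it comes in transitively via unbabel-comet):

import torch

# If this prints False, predict(..., gpus=1) may fall back to the CPU,
# which could explain ~15 minutes for ~1000 examples.
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))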

if [[ -z "{{.CLI_ARGS}}" ]]; then
echo "Provide a path to the .in file";
echo "For example:"
echo "task update-requirements -- pipeline/eval/requirements/eval.in";
Collaborator:

I assume the path should not be hardcoded here

@gregtatum (Member, Author) replied:

The path is just an example path.

@@ -158,11 +164,11 @@ def main(args_list: Optional[list[str]] = None) -> None:
elif args.model_variant == "gpu":
if not args.workspace:
raise Exception("The workspace size was not provided")
if not args.gpus:
raise Exception("The number of GPUs was not provided")
# if not args.gpus:
Collaborator:

Uncomment or remove?

@gregtatum (Member, Author) replied:

Thanks, that's old code. I'll remove it.

comet_data.append({"src": source, "mt": target, "ref": target_ref})
comet_results = comet_model.predict(comet_data, gpus=args.gpus)
# Reduce the precision.
comet_score = float(round(comet_results.system_score * 1e4)) / 1e4
Collaborator:

Suggested change
comet_score = float(round(comet_results.system_score * 1e4)) / 1e4
comet_score = round(comet_results.system_score, 4)

@gregtatum (Member, Author) replied:

That's much nicer!

] # fmt:skip


@pytest.mark.parametrize("params", test_data, ids=[d[0] for d in test_data])
def test_evaluate(params) -> None:
(task_name, model_type, model_name) = params
(task_name, model_type, model_name, comet) = params
Collaborator:

Should we separate running COMET into another test so that we can at least easily run the fast ones?

@gregtatum (Member, Author) replied:

If we do, maybe we should follow up. I'm a bit annoyed by how long this takes to run. I spent a bit of time trying to train my own COMET model that was much smaller, so that it could potentially run faster.

Splitting up the test would be hard, since it's a run_task and not a unit test where it's easy to run just parts of it. I'd have to add a bunch of environment variables to artificially skip work, and then the assertions would get a bit messy.

@gregtatum (Member, Author) commented:

Of course I figured it out after typing this. I split up the test.
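A minimal sketch of what the split might look like (fast_test_data, comet_test_data, and test_evaluate_comet are hypothetical names; assumes the fourth element of each test_data entry is the comet flag, as in the parametrized test above, with the bodies elided):

import pytest

# Hypothetical split: keep the fast cases in one parametrized test and give
# the slow COMET cases their own test, so the fast ones can run on their own.
fast_test_data = [d for d in test_data if not d[3]]
comet_test_data = [d for d in test_data if d[3]]

@pytest.mark.parametrize("params", fast_test_data, ids=[d[0] for d in fast_test_data])
def test_evaluate(params) -> None:
    (task_name, model_type, model_name, comet) = params
    ...

@pytest.mark.parametrize("params", comet_test_data, ids=[d[0] for d in comet_test_data])
def test_evaluate_comet(params) -> None:
    (task_name, model_type, model_name, comet) = params
    ...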

comet_data = []
for source, target, target_ref in zip(source_lines, target_lines, target_ref_lines):
comet_data.append({"src": source, "mt": target, "ref": target_ref})
comet_results = comet_model.predict(comet_data, gpus=args.gpus)
Collaborator:

You can also control batch_size here. The default one might be fine (16 I think), but ideally we should check how it utilizes GPU memory and bump it if it's underutilized.
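For illustration, a sketch of passing an explicit batch_size (the value 32 is an arbitrary assumption, only worth trying if monitoring shows GPU memory is underutilized):

# batch_size defaults to 16 if omitted; bump it if the GPU has headroom.
comet_results = comet_model.predict(comet_data, batch_size=32, gpus=args.gpus)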

@gregtatum (Member, Author) replied:

I removed the parameter since I figured the defaults would be fine.

"ideally we should check how it utilizes GPU memory"

Ah, I haven't used the dashboards to do any tuning like this yet.

@eu9ene (Collaborator) replied May 14, 2024:

You can use an interactive task and run stuff manually to troubleshoot GPU issues. You would need to use tmux and nvidia-smi to monitor GPU usage.

@bhearsum (Collaborator) commented:

re: the test failure, I'm very confused by it. It's looking for nvidia-cuda-nvrtc-cu12, which apparently comes in via torch (which comes in via unbabel-comet), but that package doesn't explicitly (at least obviously...) depend on it. pytorch/pytorch#100974 (comment) has some discussion of a similar issue that suggests that bringing torch in as a direct dependency may make pip-compile do the right thing.
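If that workaround is taken, a hedged sketch of what pipeline/eval/requirements/eval.in might look like (the file's exact contents are an assumption; only the idea of listing torch explicitly comes from the linked discussion):

# eval.in (hypothetical excerpt)
unbabel-comet
# Listed explicitly so pip-compile resolves torch's nvidia-* CUDA wheels.
torch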

@gregtatum gregtatum force-pushed the comet branch 2 times, most recently from 0d67e16 to 212f1a0 on May 15, 2024 21:38
pip install pip-tools
fi
# Finally generate the hashes.
- pip-compile --generate-hashes {{.CLI_ARGS}} --allow-unsafe
@gregtatum (Member, Author) commented:

--allow-unsafe fixed my issue for installing torch, and it's the recommended config.

  --allow-unsafe / --no-allow-unsafe
                                  Pin packages considered unsafe: distribute,
                                  pip, setuptools.

                                  WARNING: Future versions of pip-tools will
                                  enable this behavior by default. Use --no-
                                  allow-unsafe to keep the old behavior. It is
                                  recommended to pass the --allow-unsafe now
                                  to adapt to the upcoming change.

@gregtatum gregtatum requested a review from eu9ene May 15, 2024 21:41
@gregtatum gregtatum merged commit a6ead91 into mozilla:main May 16, 2024
36 checks passed
eu9ene pushed a commit that referenced this pull request May 21, 2024
eu9ene added a commit that referenced this pull request May 22, 2024
* Update bicleaner

* Lock requirements

* Use larger multilingual models

* Fix requirements

* Fix typo

* Add libs for hunspell

* Fix model downloading

* Fix model downloading

* Use a toolchain for cyhunspell

* Remove a hard github dependency

* Fix test

* Add COMET to the evaluation (#587)

* Custom cleaning (#547)

* Update default config

* Pre-download fast text model

* Add custom filters

* Add unit tests for config generation

* Make using custom filtering configs configurable

* Fix substitution

* Parse tasks with label finetune-student (#609)

* Add test case

* Update regex

* Add support for automatically continuing training from earlier runs of a Task (fixes #270) (#580)

* Add support for automatically continuing training from earlier runs of a Task.

* Automatically retry training tasks on exit code 17

This is the exit code when train_taskcluster.py fails to download an artifact when attempting to resume.

* Support news-crawl importer (#608)

* Ensure all of the task labels can be parsed in the task graph (#612)

* Show the stack trace when there is an error in tracking

* Rename parse_tag to parse_task_label

* Document the regexes

* Turn the parsed label into a NamedTuple for better typing hints

* Get the tag parsing working with the full task graph

* Allow for 3 letter language codes

* Temporarily disable an evaluate task that is failing

* Update code docs a bit

* Fix tests for finetune-student

* Use shorter names for URLs (#611)

* Change the caching strategy for teachers (#606)

* Fix comment

---------

Co-authored-by: Greg Tatum <gregtatum@users.noreply.github.com>
Co-authored-by: Valentin Rigal <rigal@teklia.com>
Co-authored-by: Ben Hearsum (he/him) <ben@mozilla.com>
Successfully merging this pull request may close these issues.

Add COMET to the evaluation steps
3 participants