Add COMET to the evaluation #587
Conversation
Force-pushed from 1eb872d to f883cf7 (compare)
This looks fine to me with the requirements regenerated. Does it need @eu9ene's review as well?
pipeline/eval/requirements/eval.txt (outdated)
colorama==0.4.6 \
    --hash=sha256:08695f5cb7ed6e0531a20572697297273c47b8cae5a63ffc6d6ed5c201be6e44 \
    --hash=sha256:4f1d9991f5acc0ca119f9d443620b77f9d6b33703e51011c16baf57afb285fc6
aiohttp==3.9.5
It's difficult to argue that there's any real security risk in not pinning hashes, but could you please regenerate this with --generate-hashes anyway, to be safe? (I suppose the worst thing that could happen is pulling in cryptomining code or some such...)
This wasn't an intentional change on my part. To guard against this in the future, I added a task to update the requirements with --generate-hashes:
echo "task update-requirements -- pipeline/eval/requirements/eval.in"; | ||
exit 1 | ||
fi | ||
# Make sure the command is being run for docker |
Python 3.12 breaks pip-compile on my machine for our repo, which is why I am enforcing Docker.
Yup, that makes sense. (We do this in a bunch of other places as well for the same reason.)
@@ -207,3 +207,30 @@ tasks:
    cmds:
      - >-
        PYTHONPATH=$(pwd) poetry run python -W ignore utils/run_model.py {{.CLI_ARGS}}
This is an additional piece of code to address @bhearsum's feedback.
Nice! I see we've been running evals on a GPU machine, so we should double-check that COMET actually uses it. It outputs some warnings: https://firefox-ci-tc.services.mozilla.com/tasks/BoXF3JpuQeOfjfp485c-2Q/runs/0/logs/public/logs/live.log. The scoring with COMET took 15 minutes, and it's only about 1,000 examples, I think.
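A minimal sketch of how one might verify GPU usage locally, assuming the standard unbabel-comet API; the checkpoint name and sample data here are illustrative, not what the pipeline necessarily uses:

```python
# Sketch: confirm COMET actually runs on the GPU. Assumes the
# unbabel-comet package; checkpoint name and data are illustrative.
import torch
from comet import download_model, load_from_checkpoint

checkpoint_path = download_model("Unbabel/wmt22-comet-da")
comet_model = load_from_checkpoint(checkpoint_path)

data = [{"src": "Hallo Welt", "mt": "Hello world", "ref": "Hello, world"}]

# predict() takes a GPU count; 0 falls back to the CPU.
gpus = 1 if torch.cuda.is_available() else 0
results = comet_model.predict(data, gpus=gpus)
print(results.system_score)
```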
if [[ -z "{{.CLI_ARGS}}" ]]; then
  echo "Provide a path to the .in file";
  echo "For example:"
  echo "task update-requirements -- pipeline/eval/requirements/eval.in";
I assume the path should not be hardcoded here.
The path is just an example path.
pipeline/eval/eval.py (outdated)
@@ -158,11 +164,11 @@ def main(args_list: Optional[list[str]] = None) -> None:
     elif args.model_variant == "gpu":
         if not args.workspace:
             raise Exception("The workspace size was not provided")
-        if not args.gpus:
-            raise Exception("The number of GPUs was not provided")
+        # if not args.gpus:
Uncomment or remove?
Thanks, that's old code. I'll remove it.
pipeline/eval/eval.py (outdated)
    comet_data.append({"src": source, "mt": target, "ref": target_ref})
comet_results = comet_model.predict(comet_data, gpus=args.gpus)
# Reduce the precision.
comet_score = float(round(comet_results.system_score * 1e4)) / 1e4
Suggested change:
- comet_score = float(round(comet_results.system_score * 1e4)) / 1e4
+ comet_score = round(comet_results.system_score, 4)
That's much nicer!
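For the record, the two expressions should agree on typical scores; a quick check with an illustrative value:

```python
# Sanity check: the scale-round-rescale expression and round(x, 4)
# agree on a typical system score. (Exact .5 ties could in principle
# differ, but binary floats rarely land on them exactly.)
score = 0.8412345
assert float(round(score * 1e4)) / 1e4 == round(score, 4) == 0.8412
```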
] # fmt:skip

@pytest.mark.parametrize("params", test_data, ids=[d[0] for d in test_data])
def test_evaluate(params) -> None:
-    (task_name, model_type, model_name) = params
+    (task_name, model_type, model_name, comet) = params
Should we separate running COMET into another test so that we can at least easily run the fast ones?
If we do, maybe we should follow up. I'm a bit annoyed by how long this takes to run. I spent a bit of time trying to train my own, much smaller COMET model so that it could potentially run faster.
Splitting up the test would be hard, since it's a run_task test and not a unit test where it's easy to run just parts of it. I'd have to add a bunch of environment variables to artificially skip work, and then the assertions would get a bit messy.
Of course I figured it out after typing this. I split up the test.
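A rough sketch of what such a split could look like; the task-name tuples and the "slow" marker here are hypothetical (the marker would need to be registered in the pytest config):

```python
import pytest

# Hypothetical fast cases; the real test_data tuples live in the repo.
fast_test_data = [
    ("evaluate-teacher", "base", "teacher"),
] # fmt:skip

@pytest.mark.parametrize("params", fast_test_data, ids=[d[0] for d in fast_test_data])
def test_evaluate(params) -> None:
    (task_name, model_type, model_name) = params
    ...  # run the task and assert on the BLEU/chrF artifacts

@pytest.mark.slow  # assumes a "slow" marker registered in the pytest config
def test_evaluate_comet() -> None:
    ...  # run the task with COMET enabled and assert on the COMET score
```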
pipeline/eval/eval.py (outdated)
comet_data = []
for source, target, target_ref in zip(source_lines, target_lines, target_ref_lines):
    comet_data.append({"src": source, "mt": target, "ref": target_ref})
comet_results = comet_model.predict(comet_data, gpus=args.gpus)
You can also control batch_size here. The default might be fine (16, I think), but ideally we should check how it utilizes GPU memory and bump it if the GPU is underutilized.
I removed the parameter since I figured the defaults would be fine.

> ideally we should check how it utilizes GPU memory

Ah, I haven't used the dashboards to do any tuning like this yet.
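If tuning does turn out to be worthwhile, predict() accepts a batch_size alongside gpus in the unbabel-comet API; a one-line sketch, where 64 is only an illustrative guess:

```python
# Sketch: raise batch_size if nvidia-smi shows GPU memory sitting idle.
# The default is believed to be 16; 64 here is an illustrative guess.
comet_results = comet_model.predict(comet_data, batch_size=64, gpus=args.gpus)
```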
You can use an interactive task and run stuff manually to troubleshoot GPU issues. You would need to use tmux and nvidia-smi to monitor GPU usage.
Re: the test failure, I'm very confused by it. It's looking for …
Force-pushed from 0d67e16 to 212f1a0 (compare)
  pip install pip-tools
fi
# Finally generate the hashes.
- pip-compile --generate-hashes {{.CLI_ARGS}} --allow-unsafe
--allow-unsafe fixed my issue for installing torch, and it's the recommended config:

--allow-unsafe / --no-allow-unsafe
    Pin packages considered unsafe: distribute, pip, setuptools.
    WARNING: Future versions of pip-tools will enable this behavior by
    default. Use --no-allow-unsafe to keep the old behavior. It is
    recommended to pass the --allow-unsafe now to adapt to the upcoming
    change.
* Update bicleaner
* Lock requirements
* Use larger multilingual models
* Fix requirements
* Fix typo
* Add libs for hunspell
* Fix model downloading
* Fix model downloading
* Use a toolchain for cyhunspell
* Remove a hard github dependency
* Fix test
* Add COMET to the evaluation (#587)
* Custom cleaning (#547)
* Update default config
* Pre-download fast text model
* Add custom filters
* Add unit tests for config generation
* Make using custom filtering configs configurable
* Fix substitution
* Parse tasks with label finetune-student (#609)
* Add test case
* Update regex
* Add support for automatically continuing training from earlier runs of a Task (fixes #270) (#580)
* Add support for automatically continuing training from earlier runs of a Task.
* Automatically retry training tasks on exit code 17. This is the exit code when train_taskcluster.py fails to download an artifact when attempting to resume.
* Support news-crawl importer (#608)
* Ensure all of the task labels can be parsed in the task graph (#612)
* Show the stack trace when there is an error in tracking
* Rename parse_tag to parse_task_label
* Document the regexes
* Turn the parsed label into a NamedTuple for better typing hints
* Get the tag parsing working with the full task graph
* Allow for 3 letter language codes
* Temporarily disable an evaluate task that is failing
* Update code docs a bit
* Fix tests for finetune-student
* Use shorter names for URLs (#611)
* Change the caching strategy for teachers (#606)
* Fix comment

---------

Co-authored-by: Greg Tatum <gregtatum@users.noreply.github.com>
Co-authored-by: Valentin Rigal <rigal@teklia.com>
Co-authored-by: Ben Hearsum (he/him) <ben@mozilla.com>
Add COMET to the evaluation task. I would like to follow up with pre-downloading the model via some kind of fetch task rather than downloading it from the source every time. Downloading it every time is probably less reliable, but this at least unblocks things.
Resolves #503