Getting ORC-WER breakdown #6

Open
desh2608 opened this issue Feb 28, 2023 · 4 comments

Comments

@desh2608

Is there a simple way to get the WER break-down into ins/del/sub when computing the ORC-WER?

@boeddeker
Member

It is not implemented at the moment, but we had planned to add it. Since you are interested, I will take care of implementing it.

Do you have a suggestion for how it should be obtained? The ins/del/sub counts aren't unique: different alignments with the same edit distance can yield different breakdowns.
Do people expect to get the same values as Kaldi?
I thought about adding my Kaldi wrapper, so that the optimization is done in Python/Cython while the final WER is obtained with Kaldi (the edit distance is the same, but Kaldi additionally reports ins/del/sub).
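
To illustrate the non-uniqueness, here is a small sketch that counts the operations in two hand-written alignments of the same reference/hypothesis pair. The `EPS` gap symbol and the `count_ops` helper are hypothetical names for this illustration and are not part of MeetEval or Kaldi.

```python
EPS = "*"  # gap placeholder; an arbitrary choice for this sketch


def count_ops(alignment):
    """Count ins/del/sub in a list of (ref_word, hyp_word) pairs."""
    counts = {"ins": 0, "del": 0, "sub": 0}
    for ref_word, hyp_word in alignment:
        if ref_word == EPS:
            counts["ins"] += 1  # hypothesis word without a reference word
        elif hyp_word == EPS:
            counts["del"] += 1  # reference word without a hypothesis word
        elif ref_word != hyp_word:
            counts["sub"] += 1  # both present but different
    return counts


# Reference "a b" vs. hypothesis "b a": two alignments with the same cost.
ali_sub = [("a", "b"), ("b", "a")]
ali_indel = [(EPS, "b"), ("a", "a"), ("b", EPS)]

print(count_ops(ali_sub))    # two substitutions
print(count_ops(ali_indel))  # one insertion and one deletion
```

Both alignments have an edit distance of 2, so which breakdown gets reported depends on the tie-breaking of the tool.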

@desh2608
Author

One way would be to get the reference assignment and then use the kaldialign package to compute the ins/del/sub counts. It is the WER package used in icefall, for example, and is Kaldi-compatible.
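
A rough sketch of that idea: group the reference utterances by their assigned hypothesis stream, then compute the breakdown per stream and sum it. To keep the example dependency-free, `edit_breakdown` is a plain-Python stand-in for what kaldialign would provide; its tie-breaking (prefer match/substitution, then deletion, then insertion) is an assumption and may not match Kaldi in every case. The function names and the toy data are hypothetical, and this is not MeetEval's implementation.

```python
from collections import defaultdict


def edit_breakdown(ref, hyp):
    """Edit distance with ins/del/sub counts for one optimal alignment."""
    # Each cell is (total, ins, del, sub) for ref[:i] vs. hyp[:j].
    prev = [(j, j, 0, 0) for j in range(len(hyp) + 1)]
    for i, r in enumerate(ref, start=1):
        cur = [(i, 0, i, 0)]
        for j, h in enumerate(hyp, start=1):
            t, n_ins, n_del, n_sub = prev[j - 1]
            step = 0 if r == h else 1
            diag = (t + step, n_ins, n_del, n_sub + step)
            t, n_ins, n_del, n_sub = prev[j]
            deletion = (t + 1, n_ins, n_del + 1, n_sub)
            t, n_ins, n_del, n_sub = cur[j - 1]
            insertion = (t + 1, n_ins + 1, n_del, n_sub)
            # min() keeps the first candidate on ties (assumed tie-breaking).
            cur.append(min((diag, deletion, insertion), key=lambda c: c[0]))
        prev = cur
    total, n_ins, n_del, n_sub = prev[-1]
    return {"ins": n_ins, "del": n_del, "sub": n_sub, "total": total}


def orc_breakdown(ref_utterances, assignment, hyp_streams):
    """Sum the per-stream breakdowns for a given utterance-to-stream assignment."""
    ref_streams = defaultdict(list)
    for words, stream in zip(ref_utterances, assignment):
        ref_streams[stream].extend(words)  # keep the utterance order
    totals = {"ins": 0, "del": 0, "sub": 0, "total": 0}
    for stream in set(ref_streams) | set(hyp_streams):
        counts = edit_breakdown(ref_streams.get(stream, []),
                                hyp_streams.get(stream, []))
        for key in totals:
            totals[key] += counts[key]
    return totals


# Toy data (made up for illustration).
ref_utterances = [["hello", "world"], ["how", "are", "you"], ["fine"]]
assignment = ["Alice", "Bob", "Alice"]
hyp_streams = {"Alice": ["hello", "word", "fine"], "Bob": ["how", "you"]}

print(orc_breakdown(ref_utterances, assignment, hyp_streams))
```

With kaldialign installed, `edit_breakdown` could presumably be swapped for the package's own edit-distance function, which is the route suggested here.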

@boeddeker
Member

Thanks for the pointer, I forgot that kaldialign exists. The code will be much cleaner with kaldialign than a wrapper around Kaldi could be, since adding a pip dependency is simple.

@boeddeker
Member

The code now uses kaldialign to calculate insertions, deletions and substitutions. Hence, the generated files now contain this information:

.../meeteval/example_files$ python -m meeteval.wer orcwer -h 'hyp*.stm' -r 'ref*.stm'
Wrote: .../hyp_orcwer_per_reco.json
Wrote: .../hyp_orcwer.json
.../meeteval/example_files$ cat hyp_orcwer.json
{
  "errors": 18,
  "length": 184,
  "insertions": 0,
  "deletions": 14,
  "substitutions": 4,
  "error_rate": 0.09782608695652174
}
.../meeteval/example_files$ cat hyp_orcwer_per_reco.json
{
  "recordingA": {
    "errors": 4,
    "length": 124,
    "insertions": 0,
    "deletions": 0,
    "substitutions": 4,
    "error_rate": 0.03225806451612903,
    "assignment": [
      "Alice",
      "Alice",
      "Bob",
      "Bob",
      "Alice",
      "Alice",
      "Alice",
      "Alice"
    ]
  },
  "recordingB": {
    "errors": 14,
    "length": 60,
    "insertions": 0,
    "deletions": 14,
    "substitutions": 0,
    "error_rate": 0.23333333333333334,
    "assignment": [
      "Bob",
      "Bob",
      "Alice",
      "Alice"
    ]
  }
}
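
As a quick consistency check, the overall numbers in hyp_orcwer.json can be reproduced by summing the per-recording entries shown above. The sketch below copies the values from that output; the `aggregate` helper is a hypothetical name and not part of MeetEval.

```python
# Per-recording values copied from hyp_orcwer_per_reco.json above
# (the "assignment" lists are omitted because they are not summed).
per_reco = {
    "recordingA": {"errors": 4, "length": 124, "insertions": 0,
                   "deletions": 0, "substitutions": 4},
    "recordingB": {"errors": 14, "length": 60, "insertions": 0,
                   "deletions": 14, "substitutions": 0},
}


def aggregate(per_reco):
    """Sum the counts over all recordings and recompute the error rate."""
    keys = ("errors", "length", "insertions", "deletions", "substitutions")
    total = {k: sum(reco[k] for reco in per_reco.values()) for k in keys}
    total["error_rate"] = total["errors"] / total["length"]
    return total


overall = aggregate(per_reco)
print(overall)
```

The error rate is computed from the summed counts (18 / 184), not as the mean of the per-recording rates.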

Let me know if you need more features or if a modification would simplify integration into another framework.


2 participants