
Computing time-constrained WER #17

Open
desh2608 opened this issue May 25, 2023 · 7 comments

@desh2608

I am thinking of a metric for long-form ASR and segmentation. Consider the following scenario:

  • The input is a long recording (either single speaker or multi speaker).
  • References may be with word-level timestamps (CTM file) or segment-level (STM).
  • Hypothesis may be word-level or segment-level (CTM or STM).

If the reference is an STM and the hypothesis is a CTM, this may correspond to computing the asclite aWER metric, but we also want to support (i) other kinds of systems that may not provide word-level timestamps, and (ii) a tighter penalty on segmentation by providing a reference CTM.

Additionally, we also want to be able to include multiple possible references (e.g., references may be orthographic or normalized in some way), although I understand that this may be beyond the scope of this toolkit.

I am looking for suggestions about what would be a good metric (if one exists) for this scenario.

(cc @MartinKocour since we were having related discussions.)

@thequilo
Member

There is no straightforward answer to your problem, but the following might help to start a discussion.

We classify the WER algorithms with these three properties:

  • With or without considering diarization labels 1
  • Assignment on word or segment/utterance level (from MIMO/ORC WER: whether an utterance has to stay consistent in the output or not)
  • Considering the time stamps or not
    • Word level or segment level
    • Collar (like in Diarization Error Rate, i.e. the estimated position of a word might be wrong by up to this value)
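The collar property can be illustrated with a minimal sketch. This is a hypothetical helper, not this package's API; one plausible definition is that a reference and a hypothesis word may only be aligned if both their start and end times agree up to the collar:

```python
def words_may_align(ref_word, hyp_word, collar):
    """Hypothetical collar check: a hypothesis word may only be aligned
    with a reference word (scored as correct or substitution) if its
    estimated start and end times deviate from the reference by at most
    `collar` seconds. Words are dicts with "start" and "end" times."""
    return (abs(ref_word["start"] - hyp_word["start"]) <= collar
            and abs(ref_word["end"] - hyp_word["end"]) <= collar)
```

A collar of 0 then requires exact timings, while a very large collar recovers the unconstrained Levenshtein alignment.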

Could you elaborate further on your requirements regarding the properties defined above? Especially whether you have/want to use diarization labels.

> If reference is STM and hypothesis is CTM, this may correspond to computing the asclite aWER metric,

Can you clarify what exactly you mean by asclite aWER? Is it the WER that is used in the libri-CSS publication?

From our understanding, the asclite WER from the libri-CSS publication does the following:

  • Without diarization
  • Assignment on word level
  • Considering time stamps
    • Reference: utterance-level timing
    • Hypothesis: word-level timing
      • Our observation: Hyp has no overlap (asclite crashes when a speaker overlaps with itself and the -spkrautooverlap option is not set)
      • Our guess at what they do: sort by word start time and reduce word lengths to eliminate overlapping words. (Might be different.)
    • time-pruning with 400 ms (asclite options: -time-prune and -word-time-align; should be similar to a collar, but it is unclear how exactly it is defined)

Currently we are working on a WER that considers diarization and time stamps (word or segment level). You can find it as tcpWER in this package, but we haven't decided yet which hyperparameters we want to suggest. We plan to publish it for the CHiME workshop.

> Additionally, we also want to be able to include multiple possible references (e.g., references may be orthographic or normalized in some way), although I understand that this may be beyond the scope of this toolkit.

This is not "beyond the scope of this toolkit", but it is beyond our know-how. We think normalization is somewhat orthogonal to the actual WER calculation, so it might be better to use external tools from people who have more experience in this topic (e.g., language model people). One idea would be to use Kaldi, but we haven't thought about this until now. We are open to suggestions.

We have some more plans, but they are at an early stage and we don't want to talk about those in public yet.
If you want, we could schedule a meeting or write in a Slack channel to figure out your desired WER.

Footnotes

  1. With diarization we mean that segments of the same speaker get the same label assigned by the system. The WER should then find the best assignment, as is done in cpWER. Without diarization, the estimated label is ignored and the assignment is determined independently between segments/words.
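The cpWER-style assignment described in the footnote can be sketched as a brute-force toy version that enumerates all speaker permutations (the function names are made up for illustration, not the toolkit's API):

```python
from itertools import permutations

def levenshtein(a, b):
    # Classic dynamic-programming edit distance over word lists.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def cp_errors(reference, hypothesis):
    """Total word errors under the best speaker permutation (the cpWER
    numerator). `reference` and `hypothesis` map speaker labels to word
    lists; the estimated labels only serve to group words into streams."""
    ref_streams = list(reference.values())
    hyp_streams = list(hypothesis.values())
    # Pad with empty streams so both sides have the same stream count.
    while len(hyp_streams) < len(ref_streams):
        hyp_streams.append([])
    while len(ref_streams) < len(hyp_streams):
        ref_streams.append([])
    return min(
        sum(levenshtein(r, h) for r, h in zip(ref_streams, perm))
        for perm in permutations(hyp_streams)
    )
```

A real implementation would solve the assignment with something like the Hungarian algorithm on the pairwise distance matrix instead of enumerating all permutations, which is factorial in the number of speakers.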

@desh2608
Author

desh2608 commented May 26, 2023

Some clarifications:

By asclite WER, I meant exactly what you described. One problem with this metric seems to be that references are "loose" (i.e. STM files).

By "normalization" I meant providing multiple possible references (similar to what is done to compute BLEU scores in MT).

@boeddeker
Member

> By "normalization" I meant providing multiple possible references (similar to what is done to compute BLEU scores in MT).

Could you give an example, where multiple possible references are useful?
If I remember correctly, the asclite documentation mentions that it supports this (I don't know how), but all the examples I can imagine could be achieved via normalization. But I lack experience in this field.

In MT this is different, because translations have more degrees of freedom.

There are a few issues if we allowed a "graph" instead of a sequence of words for the reference:

  • The computational complexity would grow
    • With ORC and MIMO-WER we already have some limitations: they cannot be applied to all systems, e.g., a system on LibriCSS that yields 8 "segmented speaker streams"/"output channels"
  • The necessary speedup algorithms are based on the Levenshtein distance; I don't know if they would be applicable to "graphs". Without the speedup algorithms, the complexity explodes too quickly.

@desh2608
Author

desh2608 commented May 26, 2023

In ASR, SCLITE and ASCLITE handle this through "GLM files". Basically, you provide rules for alternative references of words or phrases, such as I'm --> I am. Within the scoring tool, the reference is created as an acyclic directed graph (ADG) with multiple paths for the alternate references. The multi-dimensional Levenshtein distance is then computed over the ADGs of reference and hypothesis instead of over linear chains. I guess this is feasible in their case since we have time-marked segments. Without them, as you mention, the complexity would be very large.
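For a handful of GLM-style rules, the effect of the ADG matching can be emulated by expanding all alternatives and taking the minimum distance over the expanded references. This is exponential in the number of rule applications, so it is only a toy sketch (the names are made up, not any tool's API):

```python
from itertools import product

def edit_distance(a, b):
    # Dynamic-programming Levenshtein distance over word lists.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def min_errors(reference, hypothesis, rules):
    """Minimum word errors over all GLM-style expansions of `reference`.
    `rules` maps a word to its allowed alternatives, e.g.
    "i'm" -> ["i'm", "i am"]. Alternatives may contain several words."""
    options = [rules.get(word, [word]) for word in reference]
    return min(
        edit_distance([w for alt in combo for w in alt.split()], hypothesis)
        for combo in product(*options)
    )
```

The proper ADG approach avoids this expansion by running the dynamic program directly over the graph nodes, so shared prefixes and suffixes are only processed once.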

In any case, this is a "desirable" but not "necessary" property to have. My main purpose in creating this issue was basically to get your insights on what kind of metrics would work for the task of long-form ASR and segmentation.

Edit: If you are planning to attend ICASSP, we can have more discussions then :)

@boeddeker
Member

Thanks for the explanation. Yes, with the timing information the complexity can be significantly reduced.
We will keep this in mind, but I don't know if we will find the time to figure out whether a solution with reasonable complexity exists and then implement it. Part of it can be solved by preprocessing.

> My main purpose in creating this issue was basically to get your insights on what kind of metrics would work for the task of long-form ASR and segmentation.

There are different long-form ASR and segmentation systems, and they are evaluated differently.

Let's say you build a "CSS pipeline" [1]. Some people stop before the diarization and want to evaluate "separation + ASR".
In this case, you don't know the speaker labels for the segments.
In such a situation, you could use asclite or ORC-WER [2], where asclite considers the temporal information while ORC-WER doesn't.

When you build a system that yields a "speaker-attributed transcription" with temporal information, the asclite tool ignores the "speaker-attributed" part of your estimate. For this situation, we implemented a time-constrained Levenshtein distance and replaced the classical Levenshtein distance in cpWER: the Time-Constrained minimum Permutation Word Error Rate (tcpWER).

We provide several options to account for different timestamp accuracies between reference and hypothesis. With "ctm" estimates, equidistant_intervals, and no collar, a word can only be scored as "correct" or as a substitution when the words overlap in time.
This might be what you want.
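With word-level (CTM) timings and no collar, the constraint described above reduces to a plain interval-overlap test. A minimal sketch, using a hypothetical helper for illustration only:

```python
def may_match(ref_word, hyp_word):
    """With exact word timings and no collar, a hypothesis word can only
    be scored as correct or as a substitution against a reference word
    whose time interval it overlaps; otherwise it can only contribute
    an insertion/deletion. Words are dicts with "start" and "end"."""
    return (hyp_word["start"] < ref_word["end"]
            and ref_word["start"] < hyp_word["end"])
```

Note that with this strict-inequality definition, merely touching intervals (hypothesis starting exactly when the reference ends) do not count as overlapping.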

> Edit: If you are planning to attend ICASSP, we can have more discussions then :)

I am not there, but Thilo will attend the conference.

[1] https://arxiv.org/pdf/2011.02014.pdf
[2] Actually, MIMO-WER, and not ORC-WER, is what you want to calculate, but ORC-WER is faster.

@desh2608
Author

Thanks. For the models we are using now, we don't have speaker attribution. I am actually using the asclite WER at the moment, so it seems we are on the same page about that.

@thequilo
Member

I'll add a few more comments:

  • The matching on directed acyclic graphs (I believe it's called a DAG in the community, isn't it?) can be integrated into the theory behind MIMO WER, similar to what was done for asclite. You can break the DAG matching down into an algorithm similar to the Levenshtein distance, based on dynamic programming with a matrix (2D) as the storage. The theoretical complexity is only increased by the DAG matching (the increase is zero for a line graph but can become large for complex graphs). It would, in theory, be possible to implement. But I'm not sure it is worth the time at the moment.
  • Keep in mind that asclite matches on a word level, so it doesn't penalize splitting an utterance into multiple parts and assigning different speaker labels to each part / putting them on different outputs. This may or may not be what you want.
  • In my experience, asclite has many issues, e.g., give it negative time stamps and it produces a WER of 0 without crashing. So you have to be very careful to use it correctly.
  • I'm attending ICASSP, we can have a chat there!
