Handle homophones (similar sounding words) #1493

Open
pixeye33 opened this issue Aug 6, 2023 · 14 comments

Comments

@pixeye33

pixeye33 commented Aug 6, 2023

I have a custom sentence in French like this:
[mets] [le] volume [à|a] {volume} [pourcent|%]
(it means "put the volume to x percent")

Often, Nabu Casa STT understands mais instead of mets and the sentence does not match. To make it work, I changed it to:
[mets|mais] [le] volume [à|a] {volume} [pourcent|%]

In [mets|mais], the first word means "put" and the second means "but".
The two words sound alike; confusing them is a dyslexia symptom.

We (probably) can't fix what Nabu Casa STT says it understands, but maybe we can change the displayed sentence in the prompt?

I thought of something like this:
[mets!|mais] [le] volume [à|a] {volume} [pourcent|%]
with a ! after the word that is the correct one.

That way, even if Nabu Casa STT understands mais volume à 10 %, it will be displayed as mets volume à 10 %.
This could also work in expansion rules.

To go even further, we could imagine displaying mets le volume à 10 % even if only mais volume 10 was understood/said,
with a sentence written like this:
[mets!|mais]! [le]! volume [à|a]! {volume} [pourcent|%!]!
Notice the ! after the ]: it allows the word not to be said while still keeping it in the displayed sentence.
When using the []! syntax, if there is no ! inside the [], the first option is kept.
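
A minimal Python sketch of how the proposed `!` markers could produce the displayed sentence. This is only an illustration of the proposal above: the `!` syntax is not something hassil supports, and `preferred` and `display_sentence` are hypothetical names. As a simplification, groups without a trailing `!` are left out of the display, whereas the proposal would show them only when they were actually said.

```python
import re

# "[a|b]" optionally followed by "!" (hypothetical syntax from the proposal).
GROUP = re.compile(r"\[([^\]]*)\](!?)")
SLOT = re.compile(r"\{(\w+)\}")


def preferred(alternatives: str) -> str:
    """Return the alternative marked with "!", or the first one if none is marked."""
    options = alternatives.split("|")
    for option in options:
        if option.endswith("!"):
            return option[:-1]
    return options[0]


def display_sentence(template: str, slots: dict) -> str:
    """Build the sentence to display from a template and the matched slot values."""

    def expand_group(match: re.Match) -> str:
        # Groups followed by "!" are always displayed; others are dropped here.
        return preferred(match.group(1)) if match.group(2) else ""

    text = GROUP.sub(expand_group, template)
    text = SLOT.sub(lambda m: str(slots[m.group(1)]), text)
    return " ".join(text.split())  # collapse doubled spaces


template = "[mets!|mais]! [le]! volume [à|a]! {volume} [pourcent|%!]!"
print(display_sentence(template, {"volume": 10}))  # mets le volume à 10 %
```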

Side note: I feel this is more of a https://github.com/home-assistant/hassil issue, but so are most of the other issues in this project.

@tetele
Contributor

tetele commented Aug 10, 2023

> We (probably) can't fix what Nabu Casa STT says it understands, but maybe we can change the displayed sentence in the prompt?

You can't do that. Whatever the STT engine understands (whichever STT engine you use), that's what gets passed on in the pipeline to the intent recognition engine, not the other way around.

That's only fixable in one of two ways:

  • better STT model for your language (this is outside HA's scope)
  • if the STT engine supports it, multiple potential STT results with corresponding confidence scores, all of which should be parsed, in order of confidence, until one of them matches a possible sentence (this is in HA core's scope, not the intents repo)
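
The second option can be sketched in Python as follows. Everything here is an assumption made for illustration: `SttCandidate`, `first_matching_candidate` and the regular expression standing in for the sentence template are hypothetical, and nothing in this thread confirms that any STT engine exposes results in this shape.

```python
import re
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class SttCandidate:
    """One possible transcription with its confidence (hypothetical structure)."""
    text: str
    confidence: float


def first_matching_candidate(
    candidates: Iterable[SttCandidate],
    matches: Callable[[str], bool],
) -> Optional[SttCandidate]:
    """Try candidates from most to least confident; return the first that matches."""
    ranked = sorted(candidates, key=lambda c: c.confidence, reverse=True)
    for candidate in ranked:
        if matches(candidate.text):
            return candidate
    return None


# Rough regex stand-in for: [mets] [le] volume [à|a] {volume} [pourcent|%]
VOLUME = re.compile(r"(?:mets )?(?:le )?volume (?:[àa] )?(\d+)(?: ?(?:pourcent|%))?")


def matches_volume_sentence(text: str) -> bool:
    return VOLUME.fullmatch(text) is not None


candidates = [
    SttCandidate("mais le volume à 10 %", 0.9),
    SttCandidate("mets le volume à 10 %", 0.7),
]
result = first_matching_candidate(candidates, matches_volume_sentence)
print(result.text if result else None)  # mets le volume à 10 %
```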

I am not sure the second option is available in the underpinnings of the Nabu Casa STT engine, but maybe it's something @synesthesiam wants to take a look at.

> notice the ! after the ] as a way to allow the word not to be said, but still keeping it in the displayed sentence.

What is the use case here? I mean... if you don't say the words and the recognized words get passed on to the intent recognition service (e.g. volume à 60 %), which needs to match to your sentence ([mets!|mais]! [le]! volume [à|a]! {volume} [pourcent|%!]!), who cares what gets displayed? The intent recognition has already taken place at this point, right?

@pixeye33
Author

pixeye33 commented Aug 10, 2023

> You can't do that. Whatever the STT engine understands (whichever STT engine you use), that's what gets passed on in the pipeline to the intent recognition engine, not the other way around.

I'm aware; I'm not suggesting an intent -> STT engine flow.

Here it is put another way:
I say mets
STT understands mais
Intent recognition engine matches [mets!|mais]
What I'm suggesting is to rewrite the end result, using the intent string that matched as a "regexp": mets instead of mais, only for display purposes.
I don't care if STT understood mais, as long as the action is triggered, but it makes me cry to see mais written, as if I did not know how to spell.

Confidence scores, if they exist, are probably the better way, I agree:
I say mets
STT understands mais (90%) or mets (70%)
Intent matches [mets]

But that solution is probably more complex to implement, further in the future, and more STT-engine dependent than a simple rewrite of the displayed text.

> What is the use case here? Who cares what gets displayed? The intent recognition has already taken place at this point, right?

Yes, the intent matching has already happened, and the action too.
But reading a full sentence is more appealing than a bunch of keywords.

Note: I'm convinced that in the future we will say fewer words (laziness) and yet have a full sentence written (more satisfying); this was a way to have that without much additional work.

@tetele
Contributor

tetele commented Aug 10, 2023

> What I'm suggesting is to rewrite the end result, using the intent string that matched as a "regexp": mets instead of mais, only for display purposes.

Written where? In the Assist dialog box? That gets displayed before any intent matching is done.

Also, the plan is to only recognize grammatically correct sentences in order to train a recognition model that can "catch" more sentences than just those which were manually defined, so recognizing mais le volume a 90% is just a band-aid on a broken bone which will do more harm than good in the long run.

@pixeye33 pixeye33 closed this as not planned Aug 10, 2023
@thdg
Contributor

thdg commented Aug 10, 2023

My two cents on this.
Handling common errors made by the STT system makes the system more robust and should be highly recommended where needed.
Trying to rewrite the sentence might be useful someday, but it seems to be overcomplicating things for now.

I recommend doing something like this:

[mais|<stt_error_mais>] [le] volume [à|a] {volume} [pourcent|%]
with an expansion rule:
stt_error_mais: "mets" or stt_mais: "(mets|...|...)" if there are other common errors (replace the dots with the other errors)

This accomplishes two things:

  • The added robustness adds minimal clutter to the sentences
  • If the time comes, it is easy to add a fallback word for the error: stt_error_mais -> "mais"

@synesthesiam
Contributor

I handled this kind of issue in Rhasspy with a ":" operator in sentence templates, so "mais:mets" would match "mais" but output "mets". There were two output sentences too, one with the literally recognized text (mais) and one with the transformed text (mets).
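
A small Python sketch of the behaviour described above, with both the literal and the transformed sentence returned. It only imitates the described effect of the ":" operator on a flat, whitespace-separated template; it is not Rhasspy's implementation, and `match_with_substitution` is a hypothetical name.

```python
from typing import Optional, Tuple


def match_with_substitution(template: str, recognized: str) -> Optional[Tuple[str, str]]:
    """Match word by word; a token "heard:output" matches "heard" but emits "output".

    Returns (literal text, transformed text), or None when the sentence does not match.
    """
    template_tokens = template.split()
    words = recognized.split()
    if len(template_tokens) != len(words):
        return None

    transformed = []
    for token, word in zip(template_tokens, words):
        expected, _, output = token.partition(":")
        if word != expected:
            return None
        # Tokens without ":" are emitted unchanged.
        transformed.append(output or expected)
    return recognized, " ".join(transformed)


print(match_with_substitution("mais:mets le volume", "mais le volume"))
# ('mais le volume', 'mets le volume')
```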

I could see adding this to hassil, but we need to make a clear case for it. I don't want to mask STT errors, but we also want to be robust to them.

@pixeye33 pixeye33 reopened this Aug 10, 2023
@tetele
Contributor

tetele commented Aug 10, 2023

Another example of the same issue, this time in German #1373 (comment)

@Kelesis

Kelesis commented Nov 12, 2023

For information, another similar issue concerns numbers (1 or one),
especially for ranges of numbers.
I have this issue in French, but it might be the same for English:
I say "Chronomètre 1 minute" ("Time 1 minute")
and the returned text is "Chronomètre une minute" ("Time one minute"), so the range doesn't match.
The workaround is to create a specific sentence, but if I want a time like x hours x minutes x seconds, I have to define all possible combinations (1xx, x1x, xx1, xxx, ...), which is not very nice 😅
For all other numbers, the returned text is made of digits (0123456789) and works as expected.
Maybe when expecting a range, the template matcher could accept written-out numbers? A lot of work :'(

Edit: I finally found a better workaround using only one sentence, based on the default value defined by the slot.

[Screenshot: Screenshot_20231112-154629]

@tetele
Copy link
Contributor

tetele commented Nov 13, 2023

@Kelesis that specific problem regarding numbers has been addressed and will be included in the following releases

@X-Ryl669

X-Ryl669 commented Nov 13, 2023

How hard would it be to match intents not by their text but by their SOUNDEX code or an equivalent algorithm?
The idea would be to:

  1. While building the intent possibility tree: convert each intent to SOUNDEX or a sequence of phonemes
  2. Convert the STT output to SOUNDEX (or a sequence of phonemes) too
  3. Compute the similarity between the latter and each node of the tree, ranking the best matches first (maybe even doing a Levenshtein search on the tree, so we can stop searching all intents after a given number of insertion/substitution errors)
  4. Drop any match that is below a given matching threshold
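
The four steps can be sketched in Python roughly as below. Several things are assumptions made for illustration: the intents are a flat list of sentences rather than a tree, `difflib.SequenceMatcher` stands in for the Levenshtein search, and `phonetic_key` and `best_intent` are hypothetical names. The `soundex` function follows the classic American Soundex rules.

```python
from difflib import SequenceMatcher
from typing import Iterable, Optional

_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word: str) -> str:
    """Classic American Soundex: first letter plus three digits."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    previous = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = _CODES.get(ch, "")
        if code and code != previous:
            result += code
        previous = code  # vowels reset the previous code
    return (result + "000")[:4]


def phonetic_key(sentence: str) -> str:
    """Steps 1 and 2: convert a sentence to a sequence of Soundex codes."""
    return " ".join(soundex(word) for word in sentence.split())


def best_intent(stt_text: str, sentences: Iterable[str], threshold: float = 0.8) -> Optional[str]:
    """Steps 3 and 4: rank sentences by similarity, drop matches below the threshold."""
    heard = phonetic_key(stt_text)
    scored = [(SequenceMatcher(None, heard, phonetic_key(s)).ratio(), s) for s in sentences]
    score, sentence = max(scored, default=(0.0, None))
    return sentence if score >= threshold else None


intents = ["turn on the lights", "open the blinds"]
print(best_intent("turn on the lites", intents))  # turn on the lights
print(best_intent("play some music", intents))    # None
```

Classic Soundex is tuned for English: with it, mets and mais get different codes (M320 and M200), so a French deployment would need a language-aware phonetic key instead.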

What do you think?

@X-Ryl669

X-Ryl669 commented Nov 13, 2023

> stt_error_mais: "mets" or stt_mais: "(mets|...|...)" if there are other common errors (replace the dots with the other errors)

This is not desirable, since it creates exponential growth in the number of potential sentences. In French (and probably other languages) there are multiple spellings for the same sound (like "Ouvrer / Ouvrez / Ouvré / Ouvrés / Ouvrée / Ouvrait / Ouvraient / ..."), so a simple sentence like "Ouvrez les volets roulants" could be written as "Ouvrer les volets roulants" (which is perfectly correct grammatically and semantically) or even "Ouvre haie lait veau lé roue lent" (incorrect grammatically and semantically). The STT engine can't decide between the former and the latter since there's no context, so it's perfectly right to choose either one (and it's 100% correct in doing so). So it's wrong to blame STT here.

In YAML, you can't list all the possibilities, and it would be impossible to match against them even if you did.

@tetele
Contributor

tetele commented Nov 13, 2023

> How hard would it be to match intent not by their text but by their SOUNDEX or equivalent algorithm?

That sounds a lot like this suggestion, doesn't it?

@X-Ryl669

Exactly like this. Thanks for linking it!

@synesthesiam
Contributor

The discussion @tetele linked has more info, but in short the plan is to have HA attempt matches first without and then with fuzzy recognition enabled in hassil. This is happening in text, though, so it's not as ideal as using something like SOUNDEX. However, we need to support many more languages than just English.
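
As a rough illustration of the "strict first, then fuzzy" order, here is a Python sketch working purely on text. It is not hassil's API: `match_sentence` is a hypothetical helper, the intents are a plain list of sentences, and the standard library's `difflib.get_close_matches` stands in for whatever fuzzy recognition hassil ends up using.

```python
from difflib import get_close_matches
from typing import Optional, Sequence, Tuple


def match_sentence(
    text: str, sentences: Sequence[str], cutoff: float = 0.8
) -> Optional[Tuple[str, bool]]:
    """Return (matched sentence, was_fuzzy), or None if nothing is close enough."""
    # First pass: strict matching only.
    if text in sentences:
        return text, False
    # Second pass: fuzzy matching on the text, used only when strict matching fails.
    close = get_close_matches(text, sentences, n=1, cutoff=cutoff)
    if close:
        return close[0], True
    return None


sentences = ["turn on the lights", "open the blinds"]
print(match_sentence("turn on the lights", sentences))  # ('turn on the lights', False)
print(match_sentence("turn on the lites", sentences))   # ('turn on the lights', True)
```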

@X-Ryl669

X-Ryl669 commented Jan 12, 2024

Hey @synesthesiam, please have a look at my tinkering here, and more specifically at the tests (run with hatch run test:pytest -s) for example usage and for what it's able to match.

I've used Epitran to support many languages (close to a hundred) for G2P and implemented fuzzy intent matching on top, based on an IPA mapping. The intents are built using a tree where each node is either a Basic node (simple text that must be present), an Optional node (text that can be missing), a greedy Parametric node (a value, like forty two) or an Alternative node (some text or some other text).

I think it should more or less cover the HA intent types that exist. Yet it's able to match sentences like:
Fermée le veau les for an intent expecting Fermez les volets (which a Levenshtein algorithm can't easily match in textual space), because it converts both to IPA strings first and then does a kind of Levenshtein match in IPA space.
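
To make the idea concrete, here is a small Python sketch comparing the edit distance in text space and in IPA space for that example. It does not use Epitran or the linked project's code: the `IPA` dictionary is a tiny hand-written table standing in for a real G2P step, and `to_ipa` and `levenshtein` are hypothetical helper names.

```python
# Hand-written pronunciations for the example only (stand-in for a G2P library).
IPA = {
    "fermez": "fɛʁme",
    "fermée": "fɛʁme",
    "le": "lə",
    "les": "le",
    "veau": "vo",
    "volets": "vɔlɛ",
}


def to_ipa(sentence: str) -> str:
    """Convert a sentence to one IPA string; word boundaries are dropped."""
    return "".join(IPA[word] for word in sentence.lower().split())


def levenshtein(a: str, b: str) -> int:
    """Number of insertions, deletions and substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


heard = "fermée le veau les"
expected = "fermez les volets"

text_distance = levenshtein(heard, expected)
ipa_distance = levenshtein(to_ipa(heard), to_ipa(expected))
print(text_distance, ipa_distance)
```

With this table the IPA distance is 3 (three vowel substitutions), which is smaller than the distance between the two written sentences.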

6 participants