Searching for names with accented characters. #1941
I don't think we can afford to curate some ad-hoc list of translations... and even if we added it to the current "default" search, other searches would not make use of it. We could add/use "fuzzy search", which would allow misspecification, but in my prior experience that often leads to spurious results. So we had better separate our exact matches from "fuzzy" ones. Added a note on that. For the sake of potentially addressing this particular issue: is there a way to harmonize the searched metadata upon query? I.e., if we detect that the query does not include any of the symbols from |
Wouldn't it make sense to go the other way around? For every word in the metadata that contains accented letters, we could also add the accent-free version of that word to the metadata. |
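A minimal sketch of that index-time approach, using only Python's stdlib `unicodedata` (the function names here are illustrative, not any existing archive code):

```python
import unicodedata

def strip_accents(word: str) -> str:
    """Remove combining accent marks via Unicode NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def expand_with_unaccented(words):
    """For each word containing accents, also index its unaccented form."""
    expanded = set(words)
    for word in words:
        plain = strip_accents(word)
        if plain != word:
            expanded.add(plain)
    return expanded

print(sorted(expand_with_unaccented(["Buzsáki", "data"])))
# ['Buzsaki', 'Buzsáki', 'data']
```

Note that NFD decomposition only strips combining marks, so characters without a decomposition (e.g. ß) would still need a custom rule.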
This problem comes up in a variety of languages: French, for example, uses several accented characters like é, à, ç, and ô. Programmatically mapping languages that use the Latin alphabet with accents (French, Spanish, Portuguese, etc.) to plain English characters involves a process called "transliteration" or "romanization". Here are a few common approaches:

1. Using standard libraries. Most programming languages have libraries that can help with transliteration. In Python, for example, you can use unidecode:

```python
from unidecode import unidecode

text = "Café Münchner Küche"
transliterated_text = unidecode(text)
print(transliterated_text)  # Output: 'Cafe Munchner Kuche'
```

2. Custom mapping. You can create a custom mapping dictionary for specific languages if you need more control over how the transliteration is done. This method allows handling specific cases or languages:

```python
accents_map = {
    'é': 'e', 'ë': 'e', 'è': 'e', 'ê': 'e',
    'á': 'a', 'à': 'a', 'â': 'a', 'ä': 'a', 'ã': 'a', 'å': 'a',
    'í': 'i', 'ì': 'i', 'î': 'i', 'ï': 'i',
    'ó': 'o', 'ò': 'o', 'ô': 'o', 'ö': 'o', 'õ': 'o',
    'ú': 'u', 'ù': 'u', 'û': 'u', 'ü': 'u',
    'ñ': 'n', 'ç': 'c', 'ß': 'ss'
}

def transliterate(text, accent_map):
    return ''.join(accent_map.get(char, char) for char in text)

text = "Café Münchner Küche"
transliterated_text = transliterate(text, accents_map)
print(transliterated_text)  # Output: 'Cafe Munchner Kuche'
```

For most practical purposes, libraries like unidecode are the simpler choice. (source: GPT) |
The "romanization" mapping is often fairly straightforward, although there are cases where it is ambiguous. For example, in the code above ü is transliterated as "u", whereas @oruebel transliterates it as "ue". |
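The ambiguity can be made concrete with two hand-written maps (a sketch only; these maps are illustrative, not the behavior of any particular library):

```python
# Generic mapping: drop the umlaut entirely.
generic_map = {"ü": "u", "ö": "o", "ä": "a"}
# German convention: umlauts expand to a trailing "e".
german_map = {"ü": "ue", "ö": "oe", "ä": "ae", "ß": "ss"}

def transliterate(text, mapping):
    return "".join(mapping.get(ch, ch) for ch in text)

name = "Müller"
print(transliterate(name, generic_map))  # Muller
print(transliterate(name, german_map))   # Mueller
```

Because both "Muller" and "Mueller" are plausible search strings, indexing both variants sidesteps having to pick one convention.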
Whatever way works ;) In "your case", though, it might then miss "phrase" searches where one word remains accented and another does not... IMHO, harmonizing both metadata and query uniformly would address this. Overall it is a question of what we can do at the DB level for search queries: it is unlikely we would want to maintain a copy of all metadata records with harmonization applied. @candleindark do you know of ways to do data harmonization for search queries at the PostgreSQL level, or maybe fuzzy search engines? (Then we would just complement the exact results with fuzzy search results, I guess.) |
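The "harmonize both sides uniformly" idea amounts to running the same normalization over the stored text and the incoming query. (At the PostgreSQL level, the contrib `unaccent` extension provides an analogous `unaccent()` function that can be wired into full-text search dictionaries.) A stdlib-only Python sketch of the idea:

```python
import unicodedata

def normalize(text: str) -> str:
    """Lowercase and strip accents so metadata and query compare equal."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

metadata = "György Buzsáki"
# Accented, unaccented, and mixed-case queries all match the same record.
for query in ("Buzsaki", "Buzsáki", "györgy"):
    assert normalize(query) in normalize(metadata)
print(normalize(metadata))  # gyorgy buzsaki
```

Because the identical function is applied to both sides, mixed phrases like "Gyorgy Buzsáki" (one word accented, one not) still match.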
I think this really only comes up for author names, or maybe more broadly Contributor names, if funding agencies also use non-ASCII characters. I'm imagining an additional optional field on the Contributor metadata that is not revealed in the DLP but is exposed to the search bar and contains romanized names. E.g., the Contributor György Buzsáki would have a "Romanized" field auto-populated with "Gyorgy Buzsaki". |
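That auto-populated field could be sketched as follows (a hypothetical shape: the `Contributor` class and `romanized` field name are assumptions, and accents are stripped with the stdlib rather than a real romanization library):

```python
import unicodedata
from dataclasses import dataclass

def romanize(name: str) -> str:
    # Strip combining accent marks; note this maps ü -> u, not ue.
    decomposed = unicodedata.normalize("NFD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

@dataclass
class Contributor:
    name: str
    romanized: str = ""  # hidden from the DLP, exposed to the search index

    def __post_init__(self):
        if not self.romanized:
            self.romanized = romanize(self.name)

c = Contributor("György Buzsáki")
print(c.romanized)  # Gyorgy Buzsaki
```

Leaving the field overridable lets a curator supply "ue"-style spellings where the automatic mapping is wrong.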
Just some Google evidence on this matter ;-) |
I think you can look into the |
György Buzsáki's lab has an incredibly impressive set of open data on the DANDI Archive:
With currently 12 datasets, including raw data from high-profile publications, Buzsáki could easily be considered one of the most prolific contributors on the archive.
The problem is that when most users search for him, they search with the string "Buzsaki", not the Hungarian spelling "Buzsáki". (You will often see his name spelled without the accent in English.) When they do so, they only see 3 of the impressive 12 datasets.
Would it be possible to make it so that the search responds as expected for the English version of names?