
Searching for names with accented characters. #1941

Open

bendichter opened this issue May 18, 2024 · 8 comments

Comments
@bendichter
Contributor

bendichter commented May 18, 2024

György Buzsáki's lab has an incredibly impressive set of open data on the DANDI Archive:

  1. DANDI:000003/0.230629.1955, data from “Physiological Properties and Behavioral Correlates of Hippocampal Granule Cells and Mossy Cells,” Neuron, 2017
  2. DANDI:000041/0.210812.1515, data from “Network Homeostasis and State Dynamics of Neocortical Sleep,” Neuron, 2016
  3. DANDI:000044/0.210812.1516, data from “Diversity in neural firing dynamics supports both rigid and learned hippocampal sequences,” Science, 2016
  4. DANDI:000056/0.210812.1518, data from “Internally organized mechanisms of the head direction sense,” Nature Neuroscience, 2015
  5. DANDI:000059/0.230907.2101, data from “Cooling of Medial Septum Reveals Theta Phase Lag Coordination of Hippocampal Cell Assemblies,” Neuron, 2020
  6. DANDI:000061/0.210812.1517, data from “Reactivations of emotional memory in the hippocampus–amygdala system during sleep,” Nature Neuroscience, 2017
  7. DANDI:000067/0.210812.1457, data from “Behavior-dependent short-term assembly dynamics in the medial prefrontal cortex,” Neuron, 2008
  8. DANDI:000114/0.230602.1643, data from “Oxytocin neurons enable social transmission of maternal behaviour,” Nature, 2021
  9. DANDI:000166/0.220116.2037, data from “Layer-Specific Physiological Features and Interlaminar Interactions in the Primary Visual Cortex of the Mouse,” Neuron, 2019
  10. DANDI:000233/0.230223.0815, data from “A metabolic function of the hippocampal sharp wave-ripple,” Nature, 2021
  11. DANDI:000552/0.230630.2304, data from “Preconfigured dynamics in the hippocampus are guided by embryonic birthdate and rate of neurogenesis,” Nature Neuroscience, 2022
  12. DANDI:000568/0.230705.1633, data from “Probing subthreshold dynamics of hippocampal neurons by pulsed optogenetics,” Science, 2022

With 12 datasets from high-profile publications currently on the archive, including raw data, Buzsáki could easily be considered one of its most prolific contributors.

The problem is that when most users search for him, they use the string "Buzsaki", not the Hungarian spelling "Buzsáki" (his name is often written without the accent in English). When they do, they see only 3 of his datasets instead of the impressive 12.

Would it be possible to make the search respond as expected for the English (unaccented) version of names?

@bendichter bendichter changed the title Searching for names with non-English characters. Searching for names with accented characters. May 18, 2024
@yarikoptic
Member

I don't think we can afford to curate an ad-hoc list of transliterations... and even if we added one to the current "default" search, other searches would not make use of it. We could add a "fuzzy search", which would tolerate misspellings, but in my prior experience that often leads to spurious results. So we had better keep our exact matches separate from "fuzzy" ones. Added a note to

on that.

For the sake of potentially addressing this particular case: is there a way to harmonize the searched metadata at query time? I.e., if we detect that the query does not include any symbols from the áä-etc. set, we map all such characters in the metadata to versions without diacritics and then search?
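A minimal sketch of that detection step (standard-library Python only; the metadata-mapping side is left abstract):

import unicodedata

def has_accented_chars(query: str) -> bool:
    # NFD decomposition splits e.g. á into a + combining acute accent,
    # so any combining mark in the decomposed form signals a diacritic.
    decomposed = unicodedata.normalize("NFD", query)
    return any(unicodedata.combining(ch) for ch in decomposed)

print(has_accented_chars("Buzsaki"))  # False -> search harmonized metadata
print(has_accented_chars("Buzsáki"))  # True  -> search metadata as-is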

@bendichter
Contributor Author

For the sake of potentially addressing this particular case: is there a way to harmonize the searched metadata at query time? I.e., if we detect that the query does not include any symbols from the áä-etc. set, we map all such characters in the metadata to versions without diacritics and then search?

Wouldn't it make sense to go the other way around? For every word in the metadata that contains accented letters, we could add the unaccented version of that word to the metadata.
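A rough sketch of that idea (index_terms is a hypothetical helper; the actual indexing step would depend on the search backend):

import unicodedata

def strip_accents(word):
    # Decompose, then drop the combining marks (é -> e + ́ -> e).
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def index_terms(metadata_text):
    # Keep every original word, and add an unaccented variant for any
    # word that actually contains accents.
    terms = set()
    for word in metadata_text.split():
        terms.add(word)
        stripped = strip_accents(word)
        if stripped != word:
            terms.add(stripped)
    return terms

print(sorted(index_terms("György Buzsáki")))
# ['Buzsaki', 'Buzsáki', 'Gyorgy', 'György']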

@bendichter
Contributor Author

This problem comes up in a variety of languages:

French - Uses several accented characters like é, à, ç, and ô.
Spanish - Includes characters like ñ, á, é, í, ó, ú, and ü.
Portuguese - Contains characters such as ã, á, â, à, ç, é, ê, í, ó, ô, õ, and ú.
German - Primarily uses umlauts like ä, ö, ü, and the sharp s (ß), although the last is not strictly an accent but a distinct character.
Vietnamese - Extensively uses diacritics; for example, ả, ạ, ấ, ầ, ẩ, ẫ, ậ, and so on.
Czech - Includes characters like á, č, ď, é, ě, í, ň, ó, ř, š, ť, ú, ů, ý, ž.
Icelandic - Uses characters such as á, é, í, ó, ú, ý, þ, æ, ö.

Programmatically mapping languages that use the Latin alphabet with accents (like French, Spanish, Portuguese, etc.) to plain English characters involves a process called "transliteration" or "romanization".

Here are a few common approaches for this transliteration:

1. Using Standard Libraries

Most programming languages have libraries that can help with transliteration. For example:

Python

In Python, you can use the unidecode library, which transliterates Unicode text containing accented characters into the closest possible ASCII representation.

from unidecode import unidecode

text = "Café Münchner Küche"
transliterated_text = unidecode(text)
print(transliterated_text)  # Output: 'Cafe Munchner Kuche'

2. Custom Mapping

You can create a custom mapping dictionary for specific languages if you need more control over how transliteration is done. This method allows handling specific cases or languages that unidecode might not cover accurately.

accents_map = {
    'é': 'e', 'ë': 'e', 'è': 'e', 'ê': 'e',
    'á': 'a', 'à': 'a', 'â': 'a', 'ä': 'a', 'ã': 'a', 'å': 'a',
    'í': 'i', 'ì': 'i', 'î': 'i', 'ï': 'i',
    'ó': 'o', 'ò': 'o', 'ô': 'o', 'ö': 'o', 'õ': 'o',
    'ú': 'u', 'ù': 'u', 'û': 'u', 'ü': 'u',
    'ñ': 'n', 'ç': 'c', 'ß': 'ss'
}

def transliterate(text, accent_map):
    # Replace each accented character with its mapped ASCII form;
    # characters not in the map pass through unchanged.
    return ''.join(accent_map.get(char, char) for char in text)

text = "Café Münchner Küche"
transliterated_text = transliterate(text, accents_map)
print(transliterated_text)  # Output: 'Cafe Munchner Kuche'

For most practical purposes, libraries like unidecode offer a good balance between simplicity and performance, making them a popular choice for many developers.

(source: GPT)

@bendichter
Contributor Author

The "romanization" mapping is often pretty straightforward, although there are some cases where it is ambiguous. For example, in the example above ü is decoded as "u", whereas @oruebel decodes it as "ue".
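A minimal illustration of why the mapping is language-dependent (both tables below are just conventions, not standards):

# The German convention expands umlauts to digraphs; a generic Latin
# convention simply drops the diacritic. Both are defensible.
german_map = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}
generic_map = {"ä": "a", "ö": "o", "ü": "u", "ß": "ss"}

name = "München"
print("".join(german_map.get(c, c) for c in name))   # 'Muenchen'
print("".join(generic_map.get(c, c) for c in name))  # 'Munchen'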

@yarikoptic
Member

yarikoptic commented May 20, 2024

Wouldn't it make sense to go the other way around? For every word in the metadata that contains accented letters, we could add the unaccented version of that word to the metadata.

Whichever way works ;) In "your case", though, it might then miss "phrase" searches where one word remains accented and another does not... IMHO harmonizing both metadata and query uniformly would address this.
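A sketch of what harmonizing both sides uniformly could look like, assuming the same normalization is applied to metadata at index time and to queries at search time:

from unidecode import unidecode

def harmonize(text):
    # One shared normalization for both sides: fold to ASCII, lowercase.
    return unidecode(text).lower()

metadata = harmonize("György Buzsáki")
# Mixed-accent phrase queries now match, which per-word variants can miss.
for query in ("Gyorgy Buzsáki", "györgy buzsaki"):
    assert harmonize(query) in metadata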

Overall it is a question of what we can do at the DB level for search queries: it is unlikely we would want to maintain a harmonized copy of all metadata records. @candleindark, do you know of ways to do data harmonization for search queries at the PostgreSQL level, or perhaps fuzzy search engines? (Then we would just complement exact matches with fuzzy search results, I guess.)

@bendichter
Contributor Author

I think this really only comes up for author names, or perhaps more broadly for Contributor names if funding agencies also use non-ASCII characters. I'm imagining an additional optional field on the Contributor metadata that is not shown on the DLP but is exposed to the search bar and contains romanized names; e.g., the Contributor György Buzsáki would have a "Romanized" field auto-populated with "Gyorgy Buzsaki".
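A sketch of how that field could be auto-populated (the field name and contributor structure here are hypothetical, not the actual DANDI schema):

from unidecode import unidecode

contributor = {"name": "György Buzsáki"}

# Populate the hypothetical search-only field, skipping names that are
# already pure ASCII so nothing redundant is stored.
romanized = unidecode(contributor["name"])
if romanized != contributor["name"]:
    contributor["romanized"] = romanized

print(contributor)  # {'name': 'György Buzsáki', 'romanized': 'Gyorgy Buzsaki'}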

@oruebel

oruebel commented May 20, 2024

ü is decoded as "u", whereas @oruebel decodes it as "ue"

Just some Google evidence on this matter ;-)

[Screenshot: Google search results illustrating the "ü" → "ue" transliteration]

@candleindark
Collaborator

@candleindark do you know ways at postgresql level to do data harmonization for the search queries or may be fuzzy search engines?

I think you can look into the unaccent extension in Postgres.
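A minimal sketch of how that could work (the contributors table and connection string are hypothetical; the unaccent extension ships with Postgres contrib but must be created by a sufficiently privileged role):

import psycopg2

conn = psycopg2.connect("dbname=dandi")  # hypothetical connection
with conn, conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS unaccent;")  # one-time setup
    # unaccent() on both sides lets "Buzsaki" match "Buzsáki" and vice versa.
    cur.execute(
        "SELECT name FROM contributors"
        " WHERE unaccent(name) ILIKE unaccent(%s)",
        ("%Buzsaki%",),
    )
    print(cur.fetchall())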
