
Searching for names with accented characters. #1941

Open

bendichter opened this issue May 18, 2024 · 8 comments

Comments
@bendichter
Contributor

bendichter commented May 18, 2024

György Buzsáki's lab has an incredibly impressive set of open data on the DANDI Archive:

  1. DANDI:000003/0.230629.1955, data from “Physiological Properties and Behavioral Correlates of Hippocampal Granule Cells and Mossy Cells,” Neuron, 2017
  2. DANDI:000041/0.210812.1515, data from “Network Homeostasis and State Dynamics of Neocortical Sleep,” Neuron, 2016
  3. DANDI:000044/0.210812.1516, data from “Diversity in neural firing dynamics supports both rigid and learned hippocampal sequences,” Science, 2016
  4. DANDI:000056/0.210812.1518, data from “Internally organized mechanisms of the head direction sense,” Nature Neuroscience, 2015
  5. DANDI:000059/0.230907.2101, data from “Cooling of Medial Septum Reveals Theta Phase Lag Coordination of Hippocampal Cell Assemblies,” Neuron, 2020
  6. DANDI:000061/0.210812.1517, data from “Reactivations of emotional memory in the hippocampus–amygdala system during sleep,” Nature Neuroscience, 2017
  7. DANDI:000067/0.210812.1457, data from “Behavior-dependent short-term assembly dynamics in the medial prefrontal cortex,” Neuron, 2008
  8. DANDI:000114/0.230602.1643, data from “Oxytocin neurons enable social transmission of maternal behaviour,” Nature, 2021
  9. DANDI:000166/0.220116.2037, data from “Layer-Specific Physiological Features and Interlaminar Interactions in the Primary Visual Cortex of the Mouse,” Neuron, 2019
  10. DANDI:000233/0.230223.0815, data from “A metabolic function of the hippocampal sharp wave-ripple,” Nature, 2021
  11. DANDI:000552/0.230630.2304, data from “Preconfigured dynamics in the hippocampus are guided by embryonic birthdate and rate of neurogenesis,” Nature Neuroscience, 2022
  12. DANDI:000568/0.230705.1633, data from “Probing subthreshold dynamics of hippocampal neurons by pulsed optogenetics,” Science, 2022

With 12 datasets from high-profile publications currently on the archive, including raw data, Buzsáki could easily be considered one of its most prolific contributors.

The problem is that when most users search for him, they use the string "Buzsaki", not the Hungarian spelling "Buzsáki" (his name is often written without the accent in English). When they do, they see only 3 of his datasets instead of the impressive 12.

Would it be possible to make the search respond as expected for the English (unaccented) version of names?

@bendichter bendichter changed the title Searching for names with non-English characters. Searching for names with accented characters. May 18, 2024
@yarikoptic
Member

I don't think we can afford to curate an ad-hoc list of transliterations... and even if we added one to the current "default" search, other searches would not make use of it. We could add a "fuzzy search", which would tolerate misspellings, but in my prior experience that often leads to spurious results. So we had better keep our exact matches separate from "fuzzy" ones. Added a note to

on that.

For the sake of potentially addressing this particular case: is there a way to harmonize the searched metadata at query time? I.e., if we detect that the query does not include any symbols from the áä-etc. set, we map all such characters in the metadata to versions without diacritics and then search?
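A minimal sketch of that detection step (standard-library Python only; the metadata-mapping side is left abstract):

import unicodedata

def has_accented_chars(query: str) -> bool:
    # NFD decomposition splits e.g. á into a + combining acute accent,
    # so any combining mark in the decomposed form signals a diacritic.
    decomposed = unicodedata.normalize("NFD", query)
    return any(unicodedata.combining(ch) for ch in decomposed)

print(has_accented_chars("Buzsaki"))  # False -> search harmonized metadata
print(has_accented_chars("Buzsáki"))  # True  -> search metadata as-is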

@bendichter
Contributor Author

For the sake of potentially addressing this particular case: is there a way to harmonize the searched metadata at query time? I.e., if we detect that the query does not include any symbols from the áä-etc. set, we map all such characters in the metadata to versions without diacritics and then search?

Wouldn't it make sense to go the other way around? For every word in the metadata that contains accented letters, we could add the unaccented version of that word to the metadata.
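A rough sketch of that idea (index_terms is a hypothetical helper; the actual indexing step would depend on the search backend):

import unicodedata

def strip_accents(word):
    # Decompose, then drop the combining marks (é -> e + ́ -> e).
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def index_terms(metadata_text):
    # Keep every original word, and add an unaccented variant for any
    # word that actually contains accents.
    terms = set()
    for word in metadata_text.split():
        terms.add(word)
        stripped = strip_accents(word)
        if stripped != word:
            terms.add(stripped)
    return terms

print(sorted(index_terms("György Buzsáki")))
# ['Buzsaki', 'Buzsáki', 'Gyorgy', 'György']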

@bendichter
Contributor Author

This problem comes up in a variety of languages:

French - Uses several accented characters like é, à, ç, and ô.
Spanish - Includes characters like ñ, á, é, í, ó, ú, and ü.
Portuguese - Contains characters such as ã, á, â, à, ç, é, ê, í, ó, ô, õ, and ú.
German - Primarily uses umlauts like ä, ö, ü, and the sharp s (ß), although the last is not strictly an accent but a distinct character.
Vietnamese - Extensively uses diacritics; for example, ả, ạ, ấ, ầ, ẩ, ẫ, ậ, and so on.
Czech - Includes characters like á, č, ď, é, ě, í, ň, ó, ř, š, ť, ú, ů, ý, ž.
Icelandic - Uses characters such as á, é, í, ó, ú, ý, þ, æ, ö.

Programmatically mapping languages that use the Latin alphabet with accents (like French, Spanish, Portuguese, etc.) to plain English characters involves a process called "transliteration" or "romanization".

Here are a few common approaches for this transliteration:

1. Using Standard Libraries

Most programming languages have libraries that can help with transliteration. For example:

Python

In Python, you can use the unidecode library, which transliterates Unicode text containing accented characters into the closest possible ASCII representation.

from unidecode import unidecode

text = "Café Münchner Küche"
transliterated_text = unidecode(text)
print(transliterated_text)  # Output: 'Cafe Munchner Kuche'

2. Custom Mapping

You can create a custom mapping dictionary for specific languages if you need more control over how transliteration is done. This method allows handling specific cases or languages that unidecode might not cover accurately.

accents_map = {
    'é': 'e', 'ë': 'e', 'è': 'e', 'ê': 'e',
    'á': 'a', 'à': 'a', 'â': 'a', 'ä': 'a', 'ã': 'a', 'å': 'a',
    'í': 'i', 'ì': 'i', 'î': 'i', 'ï': 'i',
    'ó': 'o', 'ò': 'o', 'ô': 'o', 'ö': 'o', 'õ': 'o',
    'ú': 'u', 'ù': 'u', 'û': 'u', 'ü': 'u',
    'ñ': 'n', 'ç': 'c', 'ß': 'ss'
}

def transliterate(text, accent_map):
    # Replace each accented character with its mapped ASCII form;
    # characters not in the map pass through unchanged.
    return ''.join(accent_map.get(char, char) for char in text)

text = "Café Münchner Küche"
transliterated_text = transliterate(text, accents_map)
print(transliterated_text)  # Output: 'Cafe Munchner Kuche'

For most practical purposes, libraries like unidecode offer a good balance between simplicity and performance, making them a popular choice for many developers.

(source: GPT)

@bendichter
Contributor Author

The "romanization" mapping is often pretty straightforward, although there are some cases where it is ambiguous. For example, in the example above ü is decoded as "u", whereas @oruebel decodes it as "ue".
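A minimal illustration of why the mapping is language-dependent (both tables below are just conventions, not standards):

# The German convention expands umlauts to digraphs; a generic Latin
# convention simply drops the diacritic. Both are defensible.
german_map = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}
generic_map = {"ä": "a", "ö": "o", "ü": "u", "ß": "ss"}

name = "München"
print("".join(german_map.get(c, c) for c in name))   # 'Muenchen'
print("".join(generic_map.get(c, c) for c in name))  # 'Munchen'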

@yarikoptic
Member

yarikoptic commented May 20, 2024

Wouldn't it make sense to go the other way around? For every word in the metadata that contains accented letters, we could add the unaccented version of that word to the metadata.

Whichever way works ;) In "your case", though, it might then miss "phrase" searches where one word remains accented and another does not... IMHO harmonizing both metadata and query uniformly would address this.
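A sketch of what harmonizing both sides uniformly could look like, assuming the same normalization is applied to metadata at index time and to queries at search time:

from unidecode import unidecode

def harmonize(text):
    # One shared normalization for both sides: fold to ASCII, lowercase.
    return unidecode(text).lower()

metadata = harmonize("György Buzsáki")
# Mixed-accent phrase queries now match, which per-word variants can miss.
for query in ("Gyorgy Buzsáki", "györgy buzsaki"):
    assert harmonize(query) in metadata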

Overall it is a question of what we can do at the DB level for search queries: it is unlikely we would want to maintain a harmonized copy of all metadata records. @candleindark, do you know of ways to do data harmonization for search queries at the PostgreSQL level, or perhaps fuzzy search engines? (Then we would just complement exact matches with fuzzy search results, I guess.)

@bendichter
Contributor Author

I think this really only comes up for author names, or perhaps more broadly for Contributor names if funding agencies also use non-ASCII characters. I'm imagining an additional optional field on the Contributor metadata that is not shown on the DLP but is exposed to the search bar and contains romanized names; e.g., the Contributor György Buzsáki would have a "Romanized" field auto-populated with "Gyorgy Buzsaki".
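A sketch of how that field could be auto-populated (the field name and contributor structure here are hypothetical, not the actual DANDI schema):

from unidecode import unidecode

contributor = {"name": "György Buzsáki"}

# Populate the hypothetical search-only field, skipping names that are
# already pure ASCII so nothing redundant is stored.
romanized = unidecode(contributor["name"])
if romanized != contributor["name"]:
    contributor["romanized"] = romanized

print(contributor)  # {'name': 'György Buzsáki', 'romanized': 'Gyorgy Buzsaki'}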

@oruebel

oruebel commented May 20, 2024

ü is decoded as "u", whereas @oruebel decodes it as "ue"

Just some Google evidence on this matter ;-)

[Screenshot: Google search results illustrating the "ü" → "ue" transliteration]

@candleindark
Collaborator

@candleindark do you know ways at postgresql level to do data harmonization for the search queries or may be fuzzy search engines?

I think you can look into the unaccent extension in Postgres.
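A minimal sketch of how that could work (the contributors table and connection string are hypothetical; the unaccent extension ships with Postgres contrib but must be created by a sufficiently privileged role):

import psycopg2

conn = psycopg2.connect("dbname=dandi")  # hypothetical connection
with conn, conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS unaccent;")  # one-time setup
    # unaccent() on both sides lets "Buzsaki" match "Buzsáki" and vice versa.
    cur.execute(
        "SELECT name FROM contributors"
        " WHERE unaccent(name) ILIKE unaccent(%s)",
        ("%Buzsaki%",),
    )
    print(cur.fetchall())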
