
Source filenames from linguist language.yml #931

Open
silverwind opened this issue Feb 19, 2024 · 4 comments

Comments

@silverwind
Contributor

silverwind commented Feb 19, 2024

Is there an existing issue for this?

  • I have searched the existing issues

What problem does this feature solve?

Chroma's language metadata is somewhat lacking.

What feature do you propose?

Linguist's languages.yml is probably the most up-to-date dataset on programming-language filenames on the internet.

One solution could be to embed either the whole file, or just the relevant parts of it, into the module via go:embed during a build step, and then use it as the single authoritative source for Chroma's lexer-from-filename detection.
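To make the idea concrete, here is a minimal sketch of that lookup. It assumes the build step has already reduced languages.yml to a simple `name:glob,glob,...` text format; a string constant stands in for the embedded data (in a real build it would come from a `//go:embed` directive), and the lexer names and patterns are illustrative, not Chroma's actual set:

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// In a real build this would be embedded, e.g.:
//
//	//go:embed languages_reduced.txt
//	var reduced string
//
// The data below is a hypothetical stand-in for illustration.
const reduced = `JSON:*.json,*.avsc,.babelrc
YAML:*.yml,*.yaml`

// buildIndex parses "name:glob,glob,..." lines into a lookup table
// from lexer name to its filename globs.
func buildIndex(data string) map[string][]string {
	idx := make(map[string][]string)
	for _, line := range strings.Split(strings.TrimSpace(data), "\n") {
		name, globs, ok := strings.Cut(line, ":")
		if !ok {
			continue
		}
		idx[name] = strings.Split(globs, ",")
	}
	return idx
}

// matchLexer returns the name of the first lexer whose glob matches
// the base filename, or "" if none matches.
func matchLexer(idx map[string][]string, filename string) string {
	base := path.Base(filename)
	for name, globs := range idx {
		for _, g := range globs {
			if ok, _ := path.Match(g, base); ok {
				return name
			}
		}
	}
	return ""
}

func main() {
	idx := buildIndex(reduced)
	fmt.Println(matchLexer(idx, "schema.avsc"))  // JSON
	fmt.Println(matchLexer(idx, ".babelrc"))     // JSON
}
```

Generating this reduced file at release time (rather than vendoring the full YAML) keeps the binary small and avoids a runtime YAML dependency.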

@alecthomas
Owner

Sorry, I'm not sure what this issue is asking for.

@alecthomas
Owner

Are you suggesting using it for file extension detection, or something else?

I think the issue is that the mapping from Linguist file types to Chroma file types won't be 1:1.

@silverwind
Contributor Author

silverwind commented Feb 19, 2024

Take the json lexer, for example: Chroma knows only two filename patterns for it:

```xml
<filename>*.json</filename>
<filename>*.avsc</filename>
```

Linguist knows 17 extensions and 15 filenames:

https://github.com/github-linguist/linguist/blob/559a6426942abcae16b6d6b328147476432bf6cb/lib/linguist/languages.yml#L3159-L3193

This richer data could, for example, be extracted from the linguist file and written into Chroma's lexer .xml and .go files, so Chroma ends up with much better filename-based language detection in lexers.Match. But overall I think it might be better to reduce and embed the linguist file in a build script that runs before each release.

Yes, it won't be a 1:1 mapping, but the extraction could be best-effort, mapping each linguist name to a Chroma lexer name.
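A best-effort name mapping could look roughly like this: normalize the linguist name, then consult a small override table for the cases where the two projects disagree. The override pairs below are illustrative assumptions, not a verified list:

```go
package main

import (
	"fmt"
	"strings"
)

// overrides handles linguist names whose Chroma lexer name differs
// beyond simple normalization. These entries are hypothetical examples.
var overrides = map[string]string{
	"shell": "bash",
}

// chromaName maps a linguist language name to a candidate Chroma lexer
// name: lowercase, drop spaces, then apply any override.
func chromaName(linguist string) string {
	n := strings.ToLower(strings.ReplaceAll(linguist, " ", ""))
	if o, ok := overrides[n]; ok {
		return o
	}
	return n
}

func main() {
	fmt.Println(chromaName("JSON"))  // json
	fmt.Println(chromaName("Shell")) // bash
}
```

Names that survive normalization but still match no Chroma lexer would simply be skipped, which is what makes the extraction best-effort rather than exhaustive.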

@silverwind
Contributor Author

Not sure how you would feel about an additional dependency, but there is also https://github.com/go-enry/go-enry, which offers various ways to detect a language based on filename or content (think bash scripts with no file extension). As far as I know, it is based on a somewhat regularly updated dataset sourced from linguist. All that would be left for Chroma is to map the detected language to a lexer.
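The integration would be a two-step pipeline: detect, then map. Below is a dependency-free sketch of that shape; the detection stub stands in for go-enry (in real use it would be a call like enry.GetLanguage(filename, contents)), and the mapping table is an illustrative assumption:

```go
package main

import (
	"fmt"
	"strings"
)

// detectLanguage stands in for go-enry's detection step, which inspects
// both the filename and the content (e.g. a shebang line). This stub
// only recognizes a bash shebang, to keep the sketch dependency-free.
func detectLanguage(filename string, contents []byte) string {
	if strings.HasPrefix(string(contents), "#!/bin/bash") {
		return "Shell"
	}
	return ""
}

// lexerFor maps a detected language name to a Chroma lexer name.
// The table here is a hypothetical example, not Chroma's actual mapping.
func lexerFor(language string) string {
	m := map[string]string{"Shell": "Bash"}
	return m[language]
}

func main() {
	script := []byte("#!/bin/bash\necho hi\n")
	// An extensionless script is resolved via its content.
	fmt.Println(lexerFor(detectLanguage("deploy", script))) // Bash
}
```

Content-based detection is exactly what the current glob-on-filename approach cannot do, which is the main argument for taking on the dependency.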
