Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inconsistent syntax highlighting for C funcitons in the same stdio.h header #2973

Closed
drankinatty opened this issue Jan 24, 2021 · 21 comments
Closed
Labels
autoclose Flag things to future autoclose. bug help welcome Could use help from community language

Comments

@drankinatty
Copy link

Inconsistent Syntax Highlight for C functions from stdio.h

On StackOverflow when answering questions, highlightjs fails to apply consistent syntax highlighting to many functions, but this is particularly acute when it applies (or fails to apply) consistent syntax highlighting to functions from the same standard header stdio.h For example:

The functions fputs() and puts() are highlighted, but fopen(), fgetc() and fclose() all have no highlight applied at all. The is confusing to new programmers wondering why some functions are one color and some are not.

This applies to C, but is applicable to C++ as well

Here is the link to the answer: Strncat causes exception

Are you using highlight or highlightAuto?

Sample Code to Reproduce

#include <stdio.h>

#define MAXC 100        /* if you need a constant, #defien one (or more) */

int main (int argc, char* argv[]) {
    
    if (argc < 2) { /* validate one argument give for filename */
        fputs ("error: too few arguments.\n", stderr);
        return 1;
    }
    char buf[MAXC] = "", c;
    size_t n = 0;
    FILE* fp = fopen (argv[1], "r");

    if (fp == NULL) {                   /* validate file open for reading */
        perror ("fopen-argv[1]");       /* on failure, perror() tells why */
        return 1;
    }

    /* while buf not full (saving 1-char for \0), read char */
    while (n + 1 < MAXC && (c = fgetc(fp)) != EOF)
        buf[n++] = c;                   /* assign char to next element in buf */

    buf[n] = 0;                         /* nul-terminate buf */
    
    puts (buf);                         /* output result */
    
    fclose(fp);
}

Expected behavior

All functions, especially from the same standard header stdio.h should have consistent highlight applied.

@drankinatty drankinatty added bug help welcome Could use help from community language labels Jan 24, 2021
@drankinatty
Copy link
Author

Why does the code posted here have consistent highlight of stdio.h functions while that on StackOverflow does not?

@drankinatty
Copy link
Author

drankinatty commented Jan 24, 2021

See also bug filed on Meta-StackOverflow - Why is our highlightjs syntax highlight inconsistent while Code filed with bugs to highlightjs on GitHub is consistent?

Also, I don't know how StackOverflow is using highlight or highlightAuto, the developers there know, but they refer us here for problems with highlightjs.

@joshgoebel
Copy link
Member

Ok, there does seem to be a problem here, but the solution isn't 100% clear. If we're going to highlight functions in the stdlib then we should do it consistency, but I'm also not certain what that surface area would be to do this completely. I'm sure the current list was just someones idea of what was "common" vs being at all exhaustive.

If we did try for an exhaustive list... should we only cover stdio.h? Is that adding a few keywords or adding hundreds? What about other standard functions? Do we need to add those also? How many to cover the entire standard library?

It may be that we instead need a rule to match general function dispatch (rather than a huge list of all "built-ins") and then a shorter list of "common" keywords just to help our auto-detect decide what is and isn't C code.

@klmr

@drankinatty
Copy link
Author

Well, this is where we have had lots of problems on StackOverflow since switching from prettify to highlightjs. I don't know how each do the highlight, but I do understand highlightjs was attempting to be relatively general rather than syntax for everything. But that brings in problems like this. In C there are really only a handful of standard header files. By far the most commonly included is stdio.h followed by stdlib.h, and string.h.

Those alone would cover 90% of the new-user questions I run into on StackOverflow. In each case there are no more than 20 or so function in each. If we could add those a key words (or whatever the correct terminology is for "Highlight This"), that would be a big improvement and cover most to the new user-questions (where consistent highlight is most important -- eliminating the lingering question in their mind of "why is this function this color and the other function from the same header not?")

I can use the C-standard an pull out all the function names from each header if that would help. I'm just not sure what to do with them then? Let me know if that would be helpful.

@joshgoebel
Copy link
Member

joshgoebel commented Jan 24, 2021

Well, this is where we have had lots of problems on StackOverflow since switching from prettify to highlightjs.

What was the prior behavior?

I do understand highlightjs was attempting to be relatively general rather than syntax for everything.

I'm not sure what this is saying... one possibly problem here is that C/C++ are a single grammar for us, but that's diverging now... otherwise I don't know how to understand this comment.

there are no more than 20 or so function in each. If we could add those a key words ... would be a big improvement and cover most to the new user-questions

A possibility, but please first answer my original question. Why not just highlight all the function invocations the same? (as seen here in VS Code):

Screen Shot 2021-01-24 at 12 06 46 PM

Doing so we cover all those libraries in one fell swoop.

@joshgoebel
Copy link
Member

Ping.

@joshgoebel joshgoebel added the autoclose Flag things to future autoclose. label Feb 14, 2021
@joshgoebel
Copy link
Member

Ping. Also tagging to auto-close if there is no future engagement.

@klmr
Copy link
Contributor

klmr commented Feb 14, 2021

A possibility, but please first answer my original question. Why not just highlight all the function invocations the same? (as seen here in VS Code):

Agreed. Hard-coding standard library functions in the highlighter always seems excessive to me (they’re, in my mind, not quite the same as built-ins), and there are hundreds of them.

I’m also fundamentally not convinced of the value of specifically highlighting these functions: semantically they are regular functions, even if the compiler treated them specially (but it generally doesn’t).

@joshgoebel
Copy link
Member

Full list:

    built_in: 'std string wstring cin cout cerr clog stdin stdout stderr stringstream istringstream ostringstream ' +
      'auto_ptr deque list queue stack vector map set pair bitset multiset multimap unordered_set ' +
      'unordered_map unordered_multiset unordered_multimap priority_queue make_pair array shared_ptr abort terminate abs acos ' +
      'asin atan2 atan calloc ceil cosh cos exit exp fabs floor fmod fprintf fputs free frexp ' +
      'fscanf future isalnum isalpha iscntrl isdigit isgraph islower isprint ispunct isspace isupper ' +
      'isxdigit tolower toupper labs ldexp log10 log malloc realloc memchr memcmp memcpy memset modf pow ' +
      'printf putchar puts scanf sinh sin snprintf sprintf sqrt sscanf strcat strchr strcmp ' +
      'strcpy strcspn strlen strncat strncmp strncpy strpbrk strrchr strspn strstr tanh tan ' +
      'vfprintf vprintf vsprintf endl initializer_list unique_ptr _Bool complex _Complex imaginary _Imaginary',

@klmr Do any of these stand out to you as deserving of special treatment... what about std, cin and cout or the _Bool, _Complex, etc. variants?

It's sounding like we should move many of these from built_in to _ (ie, avoid highlighting while still using the most common ones only for their auto-detection value).

@joshgoebel
Copy link
Member

joshgoebel commented Feb 14, 2021

#3005 steps towards just highlighting all dispatch so we only need a decision on which built_ins might truly be special/unique (for differentiation in the future)... and which are only good for relevance.

And if a built_in is neither, we should remove it from the list.

@klmr
Copy link
Contributor

klmr commented Feb 15, 2021

Do any of these stand out to you as deserving of special treatment

I’m quite opinionated, but for me the answer is a hard no — except, potentially, for reserved identifiers (i.e. everything with double underscore, or starting with underscore + capital letter etc.; i.e. stuff like _Bool); and for these I’d use a regex, not a hard-coded list.

But I’d like to make the point that this “full list” is currently completely arbitrary. It contains a wild mix of C and C++ names, of functions, macros and types, it only contains a tiny subset of all standard library names, and I can’t discern a pattern of what is and isn’t included. In particular, these aren’t all defined in one particular header file.

I would scrap this list.

@joshgoebel
Copy link
Member

joshgoebel commented Feb 15, 2021

and for these I’d use a regex, not a hard-coded list. ... this “full list” is currently completely arbitrary. It contains a wild mix of C and C++ names

I think you forget we must also auto-detect languages. Lists like this are critical to doing so - even if we don't use them for highlighting individual identifiers. I just glanced at that list and to me (with my limiting experience with C/C++) I see a lot of stuff that looks much like C/C++ code, which is exactly the goal. cin, cout, memcmp, sprintf, etc... Things to get our relevancy count up if we see them in code.

I suppose one could argue keywords might be "enough" to serve this purpose but I always tend to be conservative with changes like this... so to me scraping the list means preserving it but simplifying and using it for relevancy vs highlighting.

IE: f0073f2

Now if someone had suggestions for trimming it to ONLY strong C/C++ signals I'd be happy removing 70% of the cruft... I have a feeling some of these (like the ones I mentioned earlier) are much more signal than others like say sin, etc.

@joshgoebel joshgoebel removed the autoclose Flag things to future autoclose. label Feb 15, 2021
@ztane
Copy link

ztane commented Feb 15, 2021

well... C standard for example says that the standard library functions are pretty much builtins; even in absense of <string.h> a call to memcpy is still a memcpy and compiler is allowed to optimize it as such. But _Bool should be coloured in the exact same manner as int naturally, as it is a builtin type always defined by that name.

@ztane
Copy link

ztane commented Feb 15, 2021

The full list of keywords for C11/C18 is

                auto                          if                             unsigned
                break                         inline                         void
                case                          int                            volatile
                char                          long                           while
                const                         register                       _Alignas
                continue                      restrict                       _Alignof
                default                       return                         _Atomic
                do                            short                          _Bool
                double                        signed                         _Complex
                else                          sizeof                         _Generic
                enum                          static                         _Imaginary
                extern                        struct                         _Noreturn
                float                         switch                         _Static_assert
                for                           typedef                        _Thread_local
                goto                          union
                ```

@joshgoebel
Copy link
Member

But _Bool should be coloured in the exact same manner as int naturally, as it is a builtin type always defined by that name.

These (types) should probably all be type, not built_in or keyword but that's out of scope for this issue. In that list you just provided for example for would be a keyword to us while int and long should be types.

@klmr
Copy link
Contributor

klmr commented Feb 15, 2021

@josh True, I hadn’t considered auto-detection. Can we use identifiers for auto-detection without applying highlighting? And then we obviously need to get this right: neither cin nor cout are C. And of the remaining ones we probably want to curate a list that’s (a) specific, and (b) still common enough that a lot of real-world C code will contain them; for instance, although _Bool is highly specific to C (@ztane correctly points out that it’s a keyword), a lot of real-world C code uses its alias bool instead1. And that isn’t very specific at all.


1 The only reason that the (intentionally ugly) _Bool instead of bool is a keyword is compatibility: since, at the time of its introduction, a lot of code used the name bool for custom type definitions. But there’s no reason for new code to use _Bool. Most (?) projects include stdbool.h and then use bool instead of the _Bool keyword.

@joshgoebel
Copy link
Member

@ztane Pulled that out into #3010

@joshgoebel
Copy link
Member

Can we use identifiers for auto-detection without applying highlighting?

That's what the PR does now.

And then we obviously need to get this right: neither cin nor cout are C.

Yeah, see #3010. Cleaning up C is a whole sep task.

And of the remaining ones we probably want to curate a list that’s (a) specific, and (b) still common enough that a lot of real-world C code will contain them

I'm happy for your (or others) help with this. :-)

for instance, although _Bool is highly specific to C (@ztane correctly points out that it’s a keyword), a lot of real-world C code uses its alias bool instead1. And that isn’t very specific at all.

Re: _Bool If it's on a short list of actual keywords (say for C), then we'll likely keep it (and highlight) just because it's an actual keyword, etc... But we'd also have bool as a type if that's a super-common alias... but yes bool is not a strong signal and many languages will "fight" over it during auto-detect.

@klmr
Copy link
Contributor

klmr commented Feb 15, 2021

That's what the PR does now.

That’s #3005?

I'm happy for your (or others) help with this. :-)

I started (by writing some markup tests) yesterday, starting off from the C++ test cases and making them relevant for C. Unfortunately I didn’t get very far yet …

@joshgoebel
Copy link
Member

Yes. The _ keyword key now implies "relevant, but do not highlight" and of course you've always been able to do this with rules just be changing their relevance value.

I started (by writing some markup tests) yesterday

That's a great start. :)

@joshgoebel
Copy link
Member

Is this not resolve by #3005?

@joshgoebel joshgoebel added the autoclose Flag things to future autoclose. label Mar 17, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
autoclose Flag things to future autoclose. bug help welcome Could use help from community language
Projects
None yet
Development

No branches or pull requests

4 participants