Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support ignoring multi-line blocks (e.g. code snippets in markdown files) #9

Open
LukeStorry opened this issue Feb 22, 2021 · 16 comments
Labels
enhancement New feature or request
Milestone

Comments

@LukeStorry
Copy link

I'm trying to add in exclusion for backticked code snippets so the contents don't get spell checked.

I've tried both \x60{1,3}[\s\S]+?\x60{1,3} and \[\n.]+?`` and variations thereof, but can't seem to get the action to ignore the snippets.

For example:


Section 4 - Encryption

Here is a sample encrypton function:

def encrypt(message: str):
    # Do the encryption...
    return encrypted_message

The encrypt function takes one parameter...

@jsoref
Copy link
Member

jsoref commented Feb 22, 2021

Unfortunately, there's currently no provision to handle multi-line things. I'm still thinking about how to do it :-(.

My concern is that people don't necessarily ignore binary files, which means there's a risk of the tool having to pull the entire file into memory.

I think to address this, I'll probably have to switch to a mode where I build my own state machine :-(.

@jsoref jsoref changed the title Document how to exclude code snippets in markdown files Document how to exclude multi-line code snippets in markdown files Feb 22, 2021
@LukeStorry
Copy link
Author

Ah, ok that makes sense!

Thanks for your speedy response too.

I can workaround by just adding all problematic words to expect.txt, but multi-line excludes would be handy!

@jsoref
Copy link
Member

jsoref commented Feb 22, 2021

Yeah... agreed.

You can cheat by adding a regexp for ^.*ignore-line.*$ and then adding # ignore-line to the line or similar.

So far, I've cheated by manually checking such files and then excluding the files. But that isn't great.

@LukeStorry
Copy link
Author

Thanks, I'll use that in some places.

This seems to work for single-line backtick escaping: \x60.+?\x60

In some of the examples you put a / at the start of the regex, what difference does that make?

@jsoref
Copy link
Member

jsoref commented Feb 22, 2021

If you mean things like:

/docs\.google\.com/[a-z]+/d/(?:e/|)[0-9a-zA-Z_-]+/

It assumes that a url is of the form:

It would not catch:

  • docs.google.com/hello/d/hello/

It's as opposed to:

\bdocs\.google\.com/[a-z]+/d/(?:e/|)[0-9a-zA-Z_-]+/

To avoid tripping over:

  • mucdocs.google.com/hello/d/hello/

@LukeStorry
Copy link
Author

Ah yes ok, was wondering why that didn't need escaping, or if it was signifying the beginning of the regex.
Cheers!

@jsoref
Copy link
Member

jsoref commented Feb 22, 2021

I've added some extra comments to the entries in the wiki. Let me know if they're helpful.

I suppose I could actually include some examples of how they break (as I did above)....

@pudgereyem
Copy link

Hey @LukeStorry and @jsoref,
this is something that I would love to see as well. I added the Github action to my blog but since there's quite a lot of code blocks in there it gets quite painful to "allow" all those weird words in there.

Thanks for the great work. This is not really my domain, otherwise I'd offer to help.

@jsoref jsoref changed the title Document how to exclude multi-line code snippets in markdown files Support ignoring multi-line blocks (e.g. code snippets in markdown files) Apr 11, 2021
@jsoref
Copy link
Member

jsoref commented Apr 11, 2021

I'm still very much thinking about how I want to implement this feature since it's really the ability to parse arbitrary file content, and I also want to make sure I don't mess up line numbering (the current version doesn't worry about line numbering much, but 0.0.18-alpha which I hope to release shortly will be reporting each thing it finds w/ lines and columns, so not messing up line numbering will be even more important then).

One approach would be for me to just import someone else's parser. I haven't actively looked, but I'm fairly pessimistic about the odds that I'd find one I liked.

Handling markdown snippets separately from other files today

@LukeStorry: fwiw, there is one approach that you could try today...
split the action into two jobs, one for markdown and one for code (you can use only.txt to focus on markdown for the one side).

Note that what I'm writing below is entirely untested, there's a version in the wiki which actually worked.

Use:

spelling workflow w/ matrix

.github/workflows/spelling.yml from https://github.com/check-spelling/spell-check-this/blob/prerelease/.github/workflows/spelling.yml

name: Spell checking
on:
  pull_request_target:
  push:
  issue_comment:
  pull_request_review_comment:

jobs:
  spell-check:
    name: "Spell Checker"
    runs-on: ubuntu-latest
    continue-on-error: true
    strategy:
      matrix:
        area: ["code", "markdown"]
    steps:
    - name: checkout-merge
      if: "contains(github.event_name, 'pull_request')"
      uses: actions/checkout@v2.0.0
      with:
        ref: refs/pull/${{github.event.pull_request.number}}/merge
    - name: checkout
      if: "!contains(github.event_name, 'pull_request')"
      uses: actions/checkout@v2.0.0
    - uses: check-spelling/check-spelling@prerelease
      with:
        config: ".github/actions/spelling/${{ matrix.area }}"
        experimental_apply_changes_via_bot: 1

Follow the advice in the wiki for how to set up symlinks (note that with: config: ... will generate something similar to env: bucket: ... + env: project: ...).

Markdown patterns

In the .github/actions/spelling/markdown/patterns/markdown.txt:

^\s{4,}.*$

You'll want to actively use common/allow.txt (allow.txt wasn't available when the wiki page I'm linking to was written which is why it suggests dictionary.txt) to add words that are consistently used across your project -- symlinking it into .github/actions/spelling/markdown/allow/common.txt or similar.

General thoughts on improving adoption

@pudgereyem: thanks for trying. I'm currently in the process of making the adoption easier.

looking at your blog repo, you could try using:

https://github.com/check-spelling/spell-check-this/blob/prerelease/.github/workflows/spelling.yml

talk to the bot to update expect.txt

Specifically, if you make a PR to the default branch which has the above workflow, it'll let you automatically accept its suggestions for expect.txt.

automatic updates for no newline at eof

Looking at your repository, I think I want it to automatically fix no-newline-at-eof -- the code already exists in the repository, so doing it would be pretty harmless. Unix tools (including git) really hate files that are missing the newline-at-eof, but Windows tools tend to make it really hard to understand how to fix this.

automatic updates for excludes

I also want to set it up to be able to automatically update excludes.txt (the prerelease version offers suggestions for excludes). -- I'm not sure if I'll manage to get this into 0.0.18-alpha -- I kinda want to ship it (the feature to update expect has been in development since this past summer...).

Building a list of corrections

The other thing that people would probably benefit from is my little Google Sheet that I use to calculate replacements... Google Sheets is really good at identifying real typos and suggesting real replacements.

-- Eventually, I intend to offer something like that via check-spelling: Suggest corrections, but that's really a long ways away (long after I handle multi-line blocks).

@pudgereyem
Copy link

@jsoref thanks for the quick response.

talk to the bot to update expect.txt

Cool, I'll give it a try later! Btw, can you give me an example of a word that you'd add to allow.txt and expect.txt respectively? Maybe this is completely clear in the wiki, but I couldn't completely wrap my head around it.

automatic updates for no newline at eof

Yeah, agreed. I missed that.

Building a list of corrections

What's the intended workflow for the Google Sheet? I pasted the misspelled words into the Current-column. How would I make so that Google Spreadsheet comes up with suggestions?

All the best,
Victor

@jsoref
Copy link
Member

jsoref commented Apr 12, 2021

n.b. I've copied most of the content from this comment into https://github.com/check-spelling/check-spelling/wiki/Configuration#allow and https://github.com/check-spelling/check-spelling/wiki/Configuration#expect

Allow

(see area dictionaries for other examples):

  • every html tag & attribute,
  • the name of everyone in your company (and the name of your company)
  • names of css attributes/properties
  • C or JavaScript reserved words
  • The names of all types/ functions from the C standard library

... They're really words, just not in the ancient base dictionary. They might not be used today in your project, but there's no reason for the spell checker to complain to a contributor tomorrow because it's foreseeable that they might be. – I'll try to update the wiki to clarify this at some point.

Fwiw, this month I've finally started working on a way to collect things like this in a vaguely systematic way. (I made a different draft last summer but didn't like where it went.)

Expect

Some arbitrary strings that are in test files that aren't really words. They should be removed if the test are changed/removed.

Allow vs Expect

Roughly if it's a proper noun of some sort of exists in the real world outside the project, it's a good candidate for allow. If it's just something you're temporally using that isn't really a word, it probably belongs in expect.

Note

The bot doesn't really care. You could put everything into allow or everything into expect. The difference is that the bot will help you remove unused items from expect and add new items to it. You could also periodically migrate items from expect to allow (think about it like Java object lifespan promotions).


The second tab of the sheet tries to describe the workflow.

@jsoref jsoref added the enhancement New feature or request label May 4, 2021
@ricardo-dematos
Copy link

I have a block of text with a python doctring and in the middle there's the word dBSPL. One of the patterns is catching BSPL and flagging it... I tried adding dBSPL to allow.txt, but with no avail.

The same happens with deltadB, with deltad being flagged.

Suggestions?

@jsoref
Copy link
Member

jsoref commented Nov 30, 2022

You could add BSPL to allow.txt / expect.txt if you were so inclined, or dBSPL to patterns.txt

Similarly deltad to allow.txt / expect.txt or deltadB to patterns.txt.

check-spelling splits on certain case transitions, so it sees d+BSPL or deltad+B.
By default (for the past couple of versions), check-spelling's minimum word length is 3, so it doesn't pay any attention to d or B, and that's what leaves it w/ the other two tokens to complain about.

@ricardo-dematos
Copy link

I went with the option of adding them to patterns.txt :

# ignore 'dBSPL' word
^.*dBSPL.*$

# ignore 'deltadB' word
^.*deltadB.*$

🙈

@jsoref
Copy link
Member

jsoref commented Nov 30, 2022

That doesn't ignore the word, it ignores the line containing the word. If you only want to ignore the word, you'd use \b dBSPL\b / \bdeltadB\b

@jsoref jsoref added this to the v0.0.23 milestone Dec 29, 2023
@jsoref
Copy link
Member

jsoref commented Dec 29, 2023

Fwiw, I'm starting to play with Feature: Block Ignore.

I ran into some pain involving partial line error reporting, so my initial draft will basically ignore the entire contents of a begin line through the entire contents of an end line. I'll play with it for a while. I'm not committing to shipping with this feature in the next version, but at least I am starting to iterate on this problem...

My initial version is a pure string marker as opposed to a regular expression... I'm not wed to that design choice (and I don't like some of my initial choices -- the file format will almost certainly change....)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants