Support ignoring multi-line blocks (e.g. code snippets in markdown files) #9

LukeStorry · 2021-02-22T16:54:19Z

I'm trying to add in exclusion for backticked code snippets so the contents don't get spell checked.

I've tried both \x60{1,3}[\s\S]+?\x60{1,3} and \[\n.]+?`` and variations thereof, but can't seem to get the action to ignore the snippets.

For example:

Section 4 - Encryption

Here is a sample encrypton function:

def encrypt(message: str):
    # Do the encryption...
    return encrypted_message

The encrypt function takes one parameter...

The text was updated successfully, but these errors were encountered:

jsoref · 2021-02-22T17:02:50Z

Unfortunately, there's currently no provision to handle multi-line things. I'm still thinking about how to do it :-(.

My concern is that people don't necessarily ignore binary files, which means there's a risk of the tool having to pull the entire file into memory.

I think to address this, I'll probably have to switch to a mode where I build my own state machine :-(.

LukeStorry · 2021-02-22T17:08:49Z

Ah, ok that makes sense!

Thanks for your speedy response too.

I can workaround by just adding all problematic words to expect.txt, but multi-line excludes would be handy!

jsoref · 2021-02-22T17:24:51Z

Yeah... agreed.

You can cheat by adding a regexp for ^.*ignore-line.*$ and then adding # ignore-line to the line or similar.

So far, I've cheated by manually checking such files and then excluding the files. But that isn't great.

LukeStorry · 2021-02-22T19:38:57Z

Thanks, I'll use that in some places.

This seems to work for single-line backtick escaping: \x60.+?\x60

In some of the examples you put a / at the start of the regex, what difference does that make?

jsoref · 2021-02-22T19:44:46Z

If you mean things like:

/docs\.google\.com/[a-z]+/d/(?:e/|)[0-9a-zA-Z_-]+/

It assumes that a url is of the form:

https://docs.google.com/hello/d/hello/

It would not catch:

docs.google.com/hello/d/hello/

It's as opposed to:

\bdocs\.google\.com/[a-z]+/d/(?:e/|)[0-9a-zA-Z_-]+/

To avoid tripping over:

mucdocs.google.com/hello/d/hello/

LukeStorry · 2021-02-22T19:55:01Z

Ah yes ok, was wondering why that didn't need escaping, or if it was signifying the beginning of the regex.
Cheers!

jsoref · 2021-02-22T20:12:14Z

I've added some extra comments to the entries in the wiki. Let me know if they're helpful.

I suppose I could actually include some examples of how they break (as I did above)....

pudgereyem · 2021-04-11T20:41:31Z

Hey @LukeStorry and @jsoref,
this is something that I would love to see as well. I added the Github action to my blog but since there's quite a lot of code blocks in there it gets quite painful to "allow" all those weird words in there.

Thanks for the great work. This is not really my domain, otherwise I'd offer to help.

jsoref · 2021-04-11T21:50:47Z

I'm still very much thinking about how I want to implement this feature since it's really the ability to parse arbitrary file content, and I also want to make sure I don't mess up line numbering (the current version doesn't worry about line numbering much, but 0.0.18-alpha which I hope to release shortly will be reporting each thing it finds w/ lines and columns, so not messing up line numbering will be even more important then).

One approach would be for me to just import someone else's parser. I haven't actively looked, but I'm fairly pessimistic about the odds that I'd find one I liked.

Handling markdown snippets separately from other files today

@LukeStorry: fwiw, there is one approach that you could try today...
split the action into two jobs, one for markdown and one for code (you can use only.txt to focus on markdown for the one side).

Note that what I'm writing below is entirely untested, there's a version in the wiki which actually worked.

Use:

spelling workflow w/ matrix

.github/workflows/spelling.yml from https://github.com/check-spelling/spell-check-this/blob/prerelease/.github/workflows/spelling.yml

name: Spell checking
on:
  pull_request_target:
  push:
  issue_comment:
  pull_request_review_comment:

jobs:
  spell-check:
    name: "Spell Checker"
    runs-on: ubuntu-latest
    continue-on-error: true
    strategy:
      matrix:
        area: ["code", "markdown"]
    steps:
    - name: checkout-merge
      if: "contains(github.event_name, 'pull_request')"
      uses: actions/checkout@v2.0.0
      with:
        ref: refs/pull/${{github.event.pull_request.number}}/merge
    - name: checkout
      if: "!contains(github.event_name, 'pull_request')"
      uses: actions/checkout@v2.0.0
    - uses: check-spelling/check-spelling@prerelease
      with:
        config: ".github/actions/spelling/${{ matrix.area }}"
        experimental_apply_changes_via_bot: 1

Follow the advice in the wiki for how to set up symlinks (note that with: config: ... will generate something similar to env: bucket: ... + env: project: ...).

Markdown patterns

In the .github/actions/spelling/markdown/patterns/markdown.txt:

^\s{4,}.*$

You'll want to actively use common/allow.txt (allow.txt wasn't available when the wiki page I'm linking to was written which is why it suggests dictionary.txt) to add words that are consistently used across your project -- symlinking it into .github/actions/spelling/markdown/allow/common.txt or similar.

General thoughts on improving adoption

@pudgereyem: thanks for trying. I'm currently in the process of making the adoption easier.

looking at your blog repo, you could try using:

https://github.com/check-spelling/spell-check-this/blob/prerelease/.github/workflows/spelling.yml

talk to the bot to update expect.txt

Specifically, if you make a PR to the default branch which has the above workflow, it'll let you automatically accept its suggestions for expect.txt.

automatic updates for no newline at eof

Looking at your repository, I think I want it to automatically fix no-newline-at-eof -- the code already exists in the repository, so doing it would be pretty harmless. Unix tools (including git) really hate files that are missing the newline-at-eof, but Windows tools tend to make it really hard to understand how to fix this.

automatic updates for excludes

I also want to set it up to be able to automatically update excludes.txt (the prerelease version offers suggestions for excludes). -- I'm not sure if I'll manage to get this into 0.0.18-alpha -- I kinda want to ship it (the feature to update expect has been in development since this past summer...).

Building a list of corrections

The other thing that people would probably benefit from is my little Google Sheet that I use to calculate replacements... Google Sheets is really good at identifying real typos and suggesting real replacements.

-- Eventually, I intend to offer something like that via check-spelling: Suggest corrections, but that's really a long ways away (long after I handle multi-line blocks).

pudgereyem · 2021-04-12T09:05:54Z

@jsoref thanks for the quick response.

talk to the bot to update expect.txt

Cool, I'll give it a try later! Btw, can you give me an example of a word that you'd add to allow.txt and expect.txt respectively? Maybe this is completely clear in the wiki, but I couldn't completely wrap my head around it.

automatic updates for no newline at eof

Yeah, agreed. I missed that.

Building a list of corrections

What's the intended workflow for the Google Sheet? I pasted the misspelled words into the Current-column. How would I make so that Google Spreadsheet comes up with suggestions?

All the best,
Victor

jsoref · 2021-04-12T14:16:46Z

n.b. I've copied most of the content from this comment into https://github.com/check-spelling/check-spelling/wiki/Configuration#allow and https://github.com/check-spelling/check-spelling/wiki/Configuration#expect

Allow

(see area dictionaries for other examples):

every html tag & attribute,
the name of everyone in your company (and the name of your company)
names of css attributes/properties
C or JavaScript reserved words
The names of all types/ functions from the C standard library

... They're really words, just not in the ancient base dictionary. They might not be used today in your project, but there's no reason for the spell checker to complain to a contributor tomorrow because it's foreseeable that they might be. – I'll try to update the wiki to clarify this at some point.

Fwiw, this month I've finally started working on a way to collect things like this in a vaguely systematic way. (I made a different draft last summer but didn't like where it went.)

Expect

Some arbitrary strings that are in test files that aren't really words. They should be removed if the test are changed/removed.

Allow vs Expect

Roughly if it's a proper noun of some sort of exists in the real world outside the project, it's a good candidate for allow. If it's just something you're temporally using that isn't really a word, it probably belongs in expect.

Note

The bot doesn't really care. You could put everything into allow or everything into expect. The difference is that the bot will help you remove unused items from expect and add new items to it. You could also periodically migrate items from expect to allow (think about it like Java object lifespan promotions).

The second tab of the sheet tries to describe the workflow.

ricardo-dematos · 2022-11-30T01:08:54Z

I have a block of text with a python doctring and in the middle there's the word dBSPL. One of the patterns is catching BSPL and flagging it... I tried adding dBSPL to allow.txt, but with no avail.

The same happens with deltadB, with deltad being flagged.

Suggestions?

jsoref · 2022-11-30T02:47:48Z

You could add BSPL to allow.txt / expect.txt if you were so inclined, or dBSPL to patterns.txt

Similarly deltad to allow.txt / expect.txt or deltadB to patterns.txt.

check-spelling splits on certain case transitions, so it sees d+BSPL or deltad+B.
By default (for the past couple of versions), check-spelling's minimum word length is 3, so it doesn't pay any attention to d or B, and that's what leaves it w/ the other two tokens to complain about.

ricardo-dematos · 2022-11-30T11:00:51Z

I went with the option of adding them to patterns.txt :

# ignore 'dBSPL' word
^.*dBSPL.*$

# ignore 'deltadB' word
^.*deltadB.*$

🙈

jsoref · 2022-11-30T13:17:08Z

That doesn't ignore the word, it ignores the line containing the word. If you only want to ignore the word, you'd use \b dBSPL\b / \bdeltadB\b

jsoref · 2023-12-29T00:10:50Z

Fwiw, I'm starting to play with Feature: Block Ignore.

I ran into some pain involving partial line error reporting, so my initial draft will basically ignore the entire contents of a begin line through the entire contents of an end line. I'll play with it for a while. I'm not committing to shipping with this feature in the next version, but at least I am starting to iterate on this problem...

My initial version is a pure string marker as opposed to a regular expression... I'm not wed to that design choice (and I don't like some of my initial choices -- the file format will almost certainly change....)

jsoref changed the title ~~Document how to exclude code snippets in markdown files~~ Document how to exclude multi-line code snippets in markdown files Feb 22, 2021

jsoref changed the title ~~Document how to exclude multi-line code snippets in markdown files~~ Support ignoring multi-line blocks (e.g. code snippets in markdown files) Apr 11, 2021

pudgereyem mentioned this issue Apr 11, 2021

chore: add spell check action pudgereyem/blog#1

Merged

jsoref added the enhancement New feature or request label May 4, 2021

Piedone mentioned this issue Dec 5, 2022

Inline Document Settings #39

Open

jsoref added this to the v0.0.23 milestone Dec 29, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support ignoring multi-line blocks (e.g. code snippets in markdown files) #9

Support ignoring multi-line blocks (e.g. code snippets in markdown files) #9

LukeStorry commented Feb 22, 2021

jsoref commented Feb 22, 2021

LukeStorry commented Feb 22, 2021

jsoref commented Feb 22, 2021 •

edited

LukeStorry commented Feb 22, 2021

jsoref commented Feb 22, 2021

LukeStorry commented Feb 22, 2021

jsoref commented Feb 22, 2021

pudgereyem commented Apr 11, 2021

jsoref commented Apr 11, 2021

pudgereyem commented Apr 12, 2021

jsoref commented Apr 12, 2021 •

edited

ricardo-dematos commented Nov 30, 2022

jsoref commented Nov 30, 2022

ricardo-dematos commented Nov 30, 2022

jsoref commented Nov 30, 2022

jsoref commented Dec 29, 2023

Support ignoring multi-line blocks (e.g. code snippets in markdown files) #9

Support ignoring multi-line blocks (e.g. code snippets in markdown files) #9

Comments

LukeStorry commented Feb 22, 2021

Section 4 - Encryption

jsoref commented Feb 22, 2021

LukeStorry commented Feb 22, 2021

jsoref commented Feb 22, 2021 • edited

LukeStorry commented Feb 22, 2021

jsoref commented Feb 22, 2021

LukeStorry commented Feb 22, 2021

jsoref commented Feb 22, 2021

pudgereyem commented Apr 11, 2021

jsoref commented Apr 11, 2021

Handling markdown snippets separately from other files today

General thoughts on improving adoption

talk to the bot to update expect.txt

automatic updates for no newline at eof

automatic updates for excludes

Building a list of corrections

pudgereyem commented Apr 12, 2021

jsoref commented Apr 12, 2021 • edited

Allow

Expect

Allow vs Expect

Note

ricardo-dematos commented Nov 30, 2022

jsoref commented Nov 30, 2022

ricardo-dematos commented Nov 30, 2022

jsoref commented Nov 30, 2022

jsoref commented Dec 29, 2023

jsoref commented Feb 22, 2021 •

edited

jsoref commented Apr 12, 2021 •

edited