Bugfix for scrubber sample code which fails when scrubbing "two" #232

Open
0dB opened this issue Aug 6, 2023 · 2 comments

@0dB
Contributor

0dB commented Aug 6, 2023

The code in the "Scrubber" section of https://derwen.ai/docs/ptr/sample/ has a small bug. When you add a token to the prefix list that also occurs in the text as a single-token phrase, like "two", the while loop consumes the whole span, and span[0] then fails because the span is empty. Easy fix:

In (using my tokens instead of the ones on the page):

def prefix_scrubber():
    def scrubber_func(span: Span) -> str:
        while span[0].text in ("every", "other", "the", "two"): # ATTN: tokens differ from the docs page; this unguarded loop fails on a lone "two"
            span = span[1:]
        return span.text
    return scrubber_func

just add the guard len(span) > 1 to the loop condition, i.e. replace

while span[0].text in ("every", "other", "the", "two"):
by
while len(span) > 1 and span[0].text in ("every", "other", "the", "two"):

to get

from spacy.tokens import Span

def prefix_scrubber():
    def scrubber_func(span: Span) -> str:
        # Strip leading filler tokens, but always keep at least one token.
        while len(span) > 1 and span[0].text in ("every", "other", "the", "two"):
            span = span[1:]
        return span.text
    return scrubber_func

Now, for the sample used on that page, I get

0.13134098, 05, sentences, [sentences, the two sentences, sentences, two sentences, the sentences]
0.07117996, 02, sentence, [every sentence, every other sentence]

and the line for "two" is still fine

0.00000000, 02, two, [two, two]

You are welcome to use the token list I used, ("every", "other", "the", "two"). It merges even more results than the example on the page.
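
For anyone who wants to see the failure and the fix without installing spaCy, here is a minimal sketch. It assumes a plain list of token strings as a stand-in for a spaCy Span; `make_scrubber`, `PREFIXES`, and the `guarded` flag are hypothetical names used only for this illustration and are not part of the docs sample or any library.

```python
from typing import Callable, List, Tuple

# Same prefix tokens as in the issue text above.
PREFIXES: Tuple[str, ...] = ("every", "other", "the", "two")


def make_scrubber(guarded: bool) -> Callable[[List[str]], str]:
    """Build a scrubber over a list of token strings (stand-in for a Span)."""

    def scrubber_func(tokens: List[str]) -> str:
        # With the guard, stop stripping once only one token remains.
        while (not guarded or len(tokens) > 1) and tokens[0] in PREFIXES:
            tokens = tokens[1:]
        return " ".join(tokens)

    return scrubber_func


unguarded = make_scrubber(guarded=False)
guarded = make_scrubber(guarded=True)

print(guarded(["every", "other", "sentence"]))
print(guarded(["two"]))

try:
    unguarded(["two"])
except IndexError as err:
    # The loop stripped the only token, so tokens[0] has nothing to index.
    print("unguarded loop failed:", err)
```

The guard trades a little aggressiveness for safety: a phrase made up only of prefix tokens keeps its last token instead of being scrubbed down to nothing.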

@0dB
Contributor Author

0dB commented Aug 6, 2023

I have created a PR.

@ceteri
Collaborator

ceteri commented Aug 7, 2023

Thank you kindly @0dB , looks great.
I'm working to resolve the CI issue and get this merged.

@ceteri ceteri self-assigned this Aug 7, 2023