Clarification Needed on "C4 NoPunc" in Data Processing #162

Closed
codefly13 opened this issue May 16, 2024 · 1 comment

@codefly13

I am working with a dataset and noticed the term "C4 NoPunc" used in the context of data quality filtering. I would like to understand exactly what this term refers to. Specifically, does "C4 NoPunc" mean:

  1. Quality filters are applied except for the "lines_with_no_ending_punctuation" rule. This means all other C4 quality filters are applied, but lines are not removed based solely on the absence of ending punctuation.

  2. Only the "lines_with_no_ending_punctuation" rule is used in quality filtering. This means that the sole criterion for removing lines is the absence of ending punctuation, and no other C4 quality filters are applied.

Could you please provide some insight into which of these interpretations is correct, or if there's another meaning entirely?

@soldni
Member

soldni commented May 21, 2024

Hi @codefly13!

It's the latter: only the lines_with_no_ending_punctuation rule is used in quality filtering.

I'm closing this issue assuming that the above answers your question, but please re-open it in case you need further clarification!

Best,
Luca

@soldni soldni closed this as completed May 21, 2024
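
For readers who want to see what this interpretation amounts to in practice, here is a minimal sketch of a filter that applies only the ending-punctuation rule. It is an illustration, not the project's actual tagger code: the function names are hypothetical, and the punctuation set (period, exclamation mark, question mark, closing double quote) follows the original C4 description and is assumed here rather than taken from the implementation.

```python
# Hypothetical sketch of a "C4 NoPunc"-style filter: the only rule applied
# is dropping lines that do not end in terminal punctuation.
# ASSUMPTION: the punctuation set mirrors the original C4 description;
# the real implementation may use a different set.
TERMINAL_PUNCTUATION = (".", "!", "?", '"')


def ends_with_terminal_punctuation(line: str) -> bool:
    """Return True if the line, ignoring trailing whitespace, ends in terminal punctuation."""
    return line.rstrip().endswith(TERMINAL_PUNCTUATION)


def apply_nopunc_filter(text: str) -> str:
    """Keep only the lines that end in terminal punctuation; apply no other quality rule."""
    kept = [line for line in text.split("\n") if ends_with_terminal_punctuation(line)]
    return "\n".join(kept)


sample = (
    "Home | About | Contact\n"
    "This sentence ends properly.\n"
    "Click here to read more\n"
    "Is this kept?"
)
print(apply_nopunc_filter(sample))
# This sentence ends properly.
# Is this kept?
```

Under the first interpretation in the question, this rule would be the one skipped while the other C4 filters ran; under the second, which the reply confirms, it is the only rule applied.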