Draft for getting-started-preprocessing #183

Draft · wants to merge 9 commits into master
Conversation

@Iota87 Iota87 commented Sep 11, 2020

To be completed. Need preliminary feedback on:

  • Structure (also keeping in mind rendering on website)
  • Tone (e.g. use of examples)
  • Length / level of details

@henrifroese henrifroese (Collaborator) left a comment


Just found a few small typos. Overall, I really like the tutorial 👍 . Things I'd change:

  • the "stemming" part at the bottom (I think it's not from you) is not that nice, I'd probably redo that (although this should probably be moved to nlp.py anyways soon so maybe it'll just be left out here)
  • I would maybe include 2-3 more preprocessing functions that are characteristic of the module so users can familiarize themselves with our function style (e.g. remove_whitespace, remove_html_tags, and definitely tokenize)
  • Maybe drop lines 11-13 and call the first section "Overview" (there's the "--" in line 11, I'm not sure why?)
  • The "Custom Pipelines" section belongs to clean (it's also just calling clean), so I'd move it under the clean section
  • I think something like the following structure would be nice:

Overview

Key Functions

Clean

Custom Pipelines

Tokenize

Preprocessing API

Add quick examples with remove_whitespace / remove_html_tags so users know how to use the module (see the sketch after this outline). Then just what you already have here.

Recap

3-sentence summary with inline code.
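For the quick examples under "Preprocessing API", something along these lines could work. This is only a minimal sketch assuming the current names in texthero.preprocessing (remove_html_tags, remove_whitespace, tokenize); adjust to whatever the module actually exposes:

import pandas as pd
import texthero as hero

s = pd.Series(["<p>Hello   world,   how are you?</p>"])

s = hero.remove_html_tags(s)   # strip the <p> ... </p> markup
s = hero.remove_whitespace(s)  # collapse the repeated spaces
hero.tokenize(s)               # split each document into a list of tokens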

@Iota87 Iota87 (Author) commented Oct 7, 2020

Great comments Henri, and good catches on the typos.
I added tokenize, references and adjusted the structure in line with your input.
Let me know what you think.
I am a bit hesitant to add "remove_html_tags" here because I am not sure it can be explained in plain words, succinctly, to a complete beginner. It could be explained in a separate section/tutorial, but I am not sure you want to get into HTML tags in the getting started. What do you think?

@jbesomi jbesomi (Owner) commented Oct 9, 2020

Hi Guys!

Thank you Giovanni for the great start and Henri for the comments!

Sorry for reviewing this so late!

As a general comment, I think we need to make it more technical and concise. The end goal of the getting started preprocessing tutorial is to teach how to use Texthero to actually do text preprocessing.

As we want to guide the user through Texthero's preprocessing core, it's important to show them how to actually do the stuff.

Giovanni, do you think you can start from the comment below, test the code in a Jupyter Notebook, and then write the getting-started tutorial around it? I didn't go into the details to give you more freedom; if you want more advice or something is unclear, just let me know!

Kind regards,
Jonathan


(overview + what's important to keep in mind)

  • One of Texthero's pillars is text preprocessing
  • Need to mention the modularity approach (one function for one task), and that the user can customize the pipeline
  • Preprocessing is task- and domain-specific. The developer needs to know what they want; Texthero provides a tool to experiment quickly. It's advisable to start with the standard clean pipeline, see if that works, and otherwise iteratively try to solve the problem
  • Texthero preprocessing is meant more as a pre-processing step for bag-of-words models, where what matters is the content (not the grammar or punctuation). In bag-of-words models we want to get rid of punctuation and stopwords, and we want to normalize (stem); a sketch of this kind of cleanup follows this list. This is different from the more advanced and complex neural-network transformer architectures, where we might want to keep the punctuation as well as the stopwords. But if the text data are very dirty, a general cleaning might be useful anyway (for example, removing round brackets and their content generally helps, and replacing numbers like 12.3 with NUM might help as well)
  • Users come here after having read the "getting started" page, so they already know about the clean function. Here we want to offer something more and explain to them how to clean some text data; it's important to give users examples as well as guide them through the process
  • We want to teach users to use the preprocessing API, and we want to mention at least 50% of its functions
  • Tokenization part: hide for now, as we are making major changes there
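A sketch of the bag-of-words-oriented cleanup mentioned above (punctuation and stopwords removed, words stemmed). The function names remove_punctuation, remove_stopwords and stem are taken from the current preprocessing API and are an assumption insofar as that API may still change:

import pandas as pd
import texthero as hero

s = pd.Series(["Texthero (the library) is great, isn't it?"])

s = hero.remove_punctuation(s)  # drop punctuation, which bag-of-words models ignore
s = hero.remove_stopwords(s)    # drop very common words that carry little content
s = hero.stem(s)                # normalize the remaining words to their stems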

Preprocessing

Overview

Introduction to this new "chapter" and mention what we have seen before + an introductory sentence about preprocessing ... something like: "By now you should have a general overview of what Texthero is about. In the next sections we will dig a bit deeper into Texthero's core and see what we can get out of our beautiful text data."

Preprocessing API

Link + introduction

Doing it right

  • There is no magic formula that works in every situation; Texthero provides a modular approach to deal with data processing
  • The user needs to understand what they actually require.
  • Texthero is mostly used to get a first feeling for the data using bag-of-words approaches; in this case, the goal is to keep relevant and clean content
  • Mention the bag-of-words approach and explain how it differs from transformers. Here we really go from raw data (maybe coming from OCR or scraped from a website) to something cleaner.

Standard vs Custom pipeline (old "Key Functions" section)

Mention that there is the standard clean function and that we can also customize. Mention chaining: all preprocessing functions receive a Pandas Series as input and return a Pandas Series. This allows chaining multiple functions in a pandas-pythonic fashion (see the sketch below).
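To make the "Series in, Series out" point concrete, a snippet like the following could be shown. It is a sketch only; lowercase and remove_stopwords are just two preprocessing functions picked for illustration:

import pandas as pd
import texthero as hero

s = pd.Series(["This is RAW text!"])

# Each function maps a Pandas Series to a Pandas Series,
# so calls chain naturally with .pipe.
clean_s = s.pipe(hero.lowercase).pipe(hero.remove_stopwords)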

FAQ

FAQ questions, mostly to improve SEO.

Text preprocessing, From zero to hero

Preprocessing is about data cleaning. Let's assume we have some dirty data we want to clean; in particular, we want to keep only relevant and clean content.

import pandas as pd
import texthero as hero

df = pd.DataFrame([
    "I have the power! $$ (wow!)",
    "Flame on!",
    "HULK SMASH!",
    "Holy ____ Batman!",
    "I am the vengeance, I am the night, I am BATMAN!",
    "I am GROOT.",
    "I’m going ghost!",
    "I am the law!",
    "SPOOOON!!!",
], columns=["text"])

Let's start by calling clean ... see what happens.

hero.preprocessing.clean(df['text'])

...

comment ...

Now, assume we want to keep the punctuation marks but remove parentheses ... open the "preprocessing API" page and look for remove_brackets.

Show a custom pipeline and explain it:

df['clean'] = (
    df['text']
    .pipe(p.function1)
    .pipe(p.function2)
    .pipe(p.function3)
)
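As a hypothetical filling-in of the placeholders above (keeping punctuation but dropping the bracketed content, then tidying whitespace), the pipeline could look roughly like this; the function names remove_brackets and remove_whitespace come from the current preprocessing API and are assumptions, not the final choice for the tutorial:

import texthero.preprocessing as p

df['clean'] = (
    df['text']
    .pipe(p.remove_brackets)    # drop "(wow!)"-style bracketed content
    .pipe(p.remove_whitespace)  # normalize the leftover spacing
)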

Going further

Two or three high-quality links to other pages about text preprocessing + a getting-started tutorial on regex with Python.

Recap

@Iota87 Iota87 (Author) commented Oct 14, 2020

Sounds good, Jonathan! I reviewed your comments and suggestions; they are perfectly aligned with what we discussed on the call. Working on it!
Thanks,
Giovanni

Concise version. Structure should be final. More examples can be added.