How to check if page is readable? #78

Open
Feelnoobskill opened this issue Aug 1, 2016 · 4 comments

Comments

@Feelnoobskill

I want to check if the page is readable or not. Is that possible?

@haroldtreen
Collaborator

What do you mean by this?

When you use readability, the article returned will be null if nothing was found. Otherwise it will be whatever part of the page it determined to be the main content.

I've been wondering how to flag articles that aren't being extracted properly (e.g. a description has been extracted rather than the article). My current approach is to compare the size of the extracted article HTML with the size of the input HTML. I've found that if the content HTML is < 3% of the original page, chances are the main article was missed.
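The size comparison described above could be sketched roughly as follows. This is a minimal illustration, not part of node-readability: the function names `contentRatio` and `looksMisextracted` are hypothetical, and the 3% default is only the rule of thumb quoted in this comment.

```javascript
// Hypothetical helpers; they work on plain strings, so they do not depend
// on any particular extraction library.

// Ratio of extracted article HTML size to the original page HTML size.
function contentRatio(inputHtml, articleHtml) {
  if (!inputHtml) return 0; // avoid dividing by zero on empty input
  return (articleHtml || '').length / inputHtml.length;
}

// Flags an extraction as suspicious when the article is a tiny fraction
// of the page. The 0.03 default is the ~3% heuristic from this thread.
function looksMisextracted(inputHtml, articleHtml, threshold = 0.03) {
  return contentRatio(inputHtml, articleHtml) < threshold;
}
```

Passing `null` or an empty string as the article gives a ratio of 0, so a failed extraction is flagged as well.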

Does that help @Feelnoobskill ? Or maybe you can elaborate what you picture the solution looking like?

@Feelnoobskill
Author

Feelnoobskill commented Aug 3, 2016

@haroldtreen thanks for the response. Basically, I would like to create a reader mode like the one iOS Safari has.

Some pages are not suitable for opening in reader mode (for example, the Stack Overflow home page). Right now node-readability will extract some random text from such a page, which is not acceptable in my case. So I was wondering whether someone has already faced this problem and can share their experience.

@haroldtreen
Collaborator

Ah. Interesting. I wasn't aware that iOS did that.

Some ideas:

  • You could look at the metadata to determine if the page is an article, and use that to hide or force-show the reader button. For example:
    <meta property="og:type" content="article">
  • You could run readability and check the % reduction. As stated before, I find ~2.5-3% to be a good threshold for something having gone wrong.
  • Run readability and look at the length of the output. Reader mode will probably only be useful if the length of the output is > X characters/lines.

The metadata tag check is probably the closest you can get to a yes/no answer without actually running the algorithms on the page.
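As a rough sketch of the metadata and output-length ideas in the list above: the helper names below are hypothetical and not part of node-readability, the regex-based tag scan is a simplification of real HTML parsing, and the 500-character minimum is an arbitrary placeholder for the unspecified "X".

```javascript
// Returns true if the page declares <meta property="og:type" content="article">.
// Attribute order and quote style may vary, so each <meta> tag is checked
// for both attributes independently.
function hasArticleOgType(html) {
  const metaTags = html.match(/<meta\b[^>]*>/gi) || [];
  return metaTags.some((tag) =>
    /\bproperty\s*=\s*["']?og:type["'\s\/>]/i.test(tag) &&
    /\bcontent\s*=\s*["']?article["'\s\/>]/i.test(tag)
  );
}

// Approximate visible text length: strip tags, collapse whitespace.
function textLength(html) {
  return (html || '')
    .replace(/<[^>]*>/g, ' ')
    .replace(/\s+/g, ' ')
    .trim()
    .length;
}

// minChars is an assumed placeholder; tune it for your own content.
function hasEnoughText(articleHtml, minChars = 500) {
  return textLength(articleHtml) >= minChars;
}
```

`hasArticleOgType` can run on the raw page before any extraction, while `hasEnoughText` needs the extracted article HTML, so the first is the cheaper check for deciding whether to show a reader button at all.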

@NinoSkopac

This is good info @haroldtreen.

It would be great if the library had an API for this (e.g. isReadable).


3 participants