Lists consisting of mostly links get removed #747

Open
Liamolucko opened this issue Aug 8, 2023 · 0 comments

Expected Behavior

Postlight Parser should preserve all the actual content of the page.

Current Behavior

Postlight Parser removes bulleted and numbered lists that consist mostly of links.

Steps to Reproduce

Run Postlight Parser on https://faultlore.com/blah/defaults-affect-inference. The bulleted list a bit after the 'Some Wild Shit Swift Does' heading gets removed.

Picture of the list in question:

[Image: Screenshot 2023-08-08 at 9 10 09 pm]

Detailed Description

This is the code that causes the problem:

const density = linkDensity($node);
// Too high of link density, is probably a menu or
// something similar.
// console.log(weight, density, contentLength)
if (weight < 25 && density > 0.2 && contentLength > 75) {
  $node.remove();
  return;
}
// Too high of a link density, despite the score being
// high.
if (weight >= 25 && density > 0.5) {
  // Don't remove the node if it's a list and the
  // previous sibling starts with a colon though. That
  // means it's probably content.
  const tagName = $node.get(0).tagName.toLowerCase();
  const nodeIsList = tagName === 'ol' || tagName === 'ul';
  if (nodeIsList) {
    const previousNode = $node.prev();
    if (
      previousNode &&
      normalizeSpaces(previousNode.text()).slice(-1) === ':'
    ) {
      return;
    }
  }
  $node.remove();
  return;
}

It's meant to get rid of menus and similar navigation elements, which tend to consist almost entirely of links.
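To make it concrete why a list of links trips these checks, here is a minimal sketch of a link-density calculation. It assumes link density means the share of a node's text that sits inside links; the parser's actual linkDensity helper works on a cheerio node and may differ in details. The function below takes plain strings and is purely illustrative.

```javascript
// Illustrative stand-in for the parser's linkDensity (assumption: density is
// the ratio of link text length to total text length). Takes plain strings
// rather than a cheerio node.
function linkDensitySketch(totalText, linkText) {
  const total = totalText.trim().length;
  const link = linkText.trim().length;
  // Avoid dividing by zero for empty nodes.
  return total > 0 ? link / total : 0;
}

// A list whose items are all links has the same link text as total text,
// so its density is 1, well above both the 0.2 and 0.5 cutoffs.
const listText = 'First article Second article Third article';
console.log(linkDensitySketch(listText, listText));
```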

Possible Solution

The easiest solution would be to apply the special case from the weight >= 25 branch of the code above to the weight < 25 branch as well, so that any list following a paragraph that ends in a colon is kept. (The lists that currently get removed fall into the weight < 25 branch, which is why the existing special case never applies to them.)
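A rough sketch of that first idea, modelled on plain values instead of a cheerio node so it runs on its own. The names shouldRemove and followsColon and the shape of the input object are hypothetical, and only the two link-density branches quoted above are modelled; this is not the code from the linked branch.

```javascript
// Hypothetical helper: true when the node is a list whose previous sibling's
// text ends in a colon (the existing special case from the quoted code).
function followsColon(tagName, previousText) {
  const nodeIsList = tagName === 'ol' || tagName === 'ul';
  return nodeIsList && previousText.trim().slice(-1) === ':';
}

// Hypothetical decision function covering only the two link-density checks.
// The thresholds (25, 0.2, 75, 0.5) are copied from the quoted code.
function shouldRemove({ weight, density, contentLength, tagName, previousText = '' }) {
  const lowWeightLinky = weight < 25 && density > 0.2 && contentLength > 75;
  const highWeightLinky = weight >= 25 && density > 0.5;
  if (!lowWeightLinky && !highWeightLinky) return false;
  // The proposed change: the colon special case now guards both branches.
  return !followsColon(tagName, previousText);
}

// A low-weight list of links introduced by "…:" is kept.
console.log(shouldRemove({
  weight: 0, density: 0.9, contentLength: 200,
  tagName: 'ul', previousText: 'Some examples:',
}));
```

With the same inputs but a previous sibling that does not end in a colon, the list would still be removed, so ordinary menus are unaffected.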

Another solution I thought of would be to look at either the average or maximum length of the links in a list (or table, div, or anything else the tag-cleaning code gets applied to), and keep the element if that length is above some threshold. In theory that should differentiate between the short links in menus and the longer, sentence-length links in content. Looking at the example I provided again, though, those links are actually quite short, so that might not work as well as I'd hoped.
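For illustration, the link-length idea could look something like the sketch below. The function names and the default threshold of 40 characters are made up for the example; nothing like this exists in the parser, and a real threshold would need tuning against actual pages.

```javascript
// Hypothetical helper: average and maximum length of a list of link texts.
function linkLengthStats(linkTexts) {
  const lengths = linkTexts.map((text) => text.trim().length);
  if (lengths.length === 0) return { average: 0, max: 0 };
  const total = lengths.reduce((sum, n) => sum + n, 0);
  return { average: total / lengths.length, max: Math.max(...lengths) };
}

// Hypothetical check: treat the links as content when their average length
// reaches the threshold. The default of 40 is an arbitrary placeholder.
function looksLikeContentLinks(linkTexts, minAverageLength = 40) {
  return linkLengthStats(linkTexts).average >= minAverageLength;
}

// Typical menu entries are short, so they fall below the threshold.
console.log(looksLikeContentLinks(['Home', 'About', 'Contact']));
```

This also shows the weakness mentioned above: a content list whose links are only a few words long is indistinguishable from a menu by length alone.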

So yeah, probably that first solution. I've already implemented it at https://github.com/Liamolucko/postlight-parser/tree/fix-link-lists and confirmed that it works.
