Lists consisting of mostly links get removed #747

Liamolucko · 2023-08-08T11:46:18Z

Expected Behavior

Postlight Parser should preserve all the actual content of the page.

Current Behavior

Postlight Parser removes bulleted and numbered lists that consist mostly of links.

Steps to Reproduce

Run Postlight Parser on https://faultlore.com/blah/defaults-affect-inference. The bulleted list a bit after the 'Some Wild Shit Swift Does' heading gets removed.

[Screenshot of the list in question]

Detailed Description

This is the code that causes the problem:

parser/src/utils/dom/clean-tags.js

Lines 43 to 73 in e8ba7ec

    
    const density = linkDensity($node);

    // Too high of link density, is probably a menu or
    // something similar.
    // console.log(weight, density, contentLength)
    if (weight < 25 && density > 0.2 && contentLength > 75) {
      $node.remove();
      return;
    }

    // Too high of a link density, despite the score being
    // high.
    if (weight >= 25 && density > 0.5) {
      // Don't remove the node if it's a list and the
      // previous sibling starts with a colon though. That
      // means it's probably content.
      const tagName = $node.get(0).tagName.toLowerCase();
      const nodeIsList = tagName === 'ol' || tagName === 'ul';
      if (nodeIsList) {
        const previousNode = $node.prev();
        if (
          previousNode &&
          normalizeSpaces(previousNode.text()).slice(-1) === ':'
        ) {
          return;
        }
      }

      $node.remove();
      return;
    }

It's meant to get rid of menus and similar navigation blocks, which tend to be mostly links.
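To make the density thresholds concrete, here is a rough sketch of what a link-density measure looks like. It assumes density is the length of the link text divided by the length of all the text in the node; the parser's real linkDensity helper may compute it differently, and linkDensityOf is a hypothetical name that works on plain strings instead of a cheerio node.

```javascript
// Sketch only: assumes link density = link text length / total text length.
// `linkDensityOf` is a hypothetical stand-in, not the parser's own helper.
function linkDensityOf(totalText, linkTexts) {
  const totalLength = totalText.trim().length;
  // Sum the lengths of every link's visible text.
  const linkLength = linkTexts.reduce((sum, text) => sum + text.length, 0);
  // Avoid dividing by zero for empty nodes.
  return totalLength > 0 ? linkLength / totalLength : 0;
}

// A list whose items are nothing but links scores close to 1,
// well above both the 0.2 and 0.5 cutoffs in the code above.
console.log(linkDensityOf('Home About', ['Home', 'About'])); // 0.9
```

Under that assumption, a list of links inside an article looks the same as a menu, which is why the extra colon check matters.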

Possible Solution

The easiest solution would be to take the special case from the weight >= 25 branch of the code above, which keeps any list whose previous sibling ends in a colon, and apply it to the weight < 25 branch as well. (The lists that get removed fall into the weight < 25 camp, so the existing special case never gets a chance to save them.)
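A minimal sketch of that first option, written against plain values rather than a cheerio node so it can be read and run on its own. The names shouldRemove and endsWithColon and the shape of the argument object are hypothetical, and endsWithColon only approximates what normalizeSpaces does in the parser; this is not the code on the linked branch.

```javascript
// Sketch only: mirrors the two link-density branches from clean-tags.js on
// plain values. `shouldRemove` and `endsWithColon` are hypothetical names.

// True when the text before a list ends with a colon, e.g. "See also:".
// Collapsing whitespace and trimming stands in for normalizeSpaces.
function endsWithColon(text) {
  return text.replace(/\s+/g, ' ').trim().slice(-1) === ':';
}

function shouldRemove({ weight, density, contentLength, tagName, previousText }) {
  const nodeIsList = tagName === 'ol' || tagName === 'ul';
  const introducedByColon = nodeIsList && endsWithColon(previousText || '');

  // Low-weight branch, now with the same list exception as the branch below.
  if (weight < 25 && density > 0.2 && contentLength > 75) {
    return !introducedByColon;
  }

  // High-weight branch, same behavior as the original code.
  if (weight >= 25 && density > 0.5) {
    return !introducedByColon;
  }

  return false;
}

// A low-weight, link-heavy list is kept only when a colon introduces it.
const list = { weight: 0, density: 0.9, contentLength: 120, tagName: 'ul' };
console.log(shouldRemove({ ...list, previousText: 'Some examples:' })); // false
console.log(shouldRemove({ ...list, previousText: 'Main menu' })); // true
```

Other tags such as div are still removed in both branches, so menus built from non-list markup should be unaffected.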

Another solution I thought of would be to look at either the average or the maximum length of the links in a list (or table / div / anything else the tag-cleaning code gets applied to), and keep the node if that length is above some threshold. In theory that should differentiate between the short links in menus and the longer, sentence-length links in content. Looking at the example I provided again, though, those links are actually quite short, so that might not work as well as I'd hoped.
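For comparison, a rough sketch of the link-length idea. Everything here is hypothetical: the function names, the choice of the average over the maximum, and the 30-character threshold are made up for illustration, and the sample link texts are invented rather than taken from the page in the report.

```javascript
// Sketch only: `linkLengthStats`, `looksLikeContent`, and the default
// threshold of 30 characters are hypothetical, not taken from the parser.

// Average and maximum length of the visible text of each link.
function linkLengthStats(linkTexts) {
  const lengths = linkTexts.map(text => text.trim().length);
  if (lengths.length === 0) return { average: 0, max: 0 };
  const total = lengths.reduce((sum, len) => sum + len, 0);
  return { average: total / lengths.length, max: Math.max(...lengths) };
}

// Treat a node as content when its links are long on average.
function looksLikeContent(linkTexts, minAverageLength = 30) {
  return linkLengthStats(linkTexts).average >= minAverageLength;
}

// Menu-style links are short, sentence-style links are long.
console.log(looksLikeContent(['Home', 'About', 'Contact'])); // false
console.log(
  looksLikeContent([
    'How default type parameters affect inference in Rust',
    'A longer write-up of the same problem in Swift',
  ])
); // true

// Short links inside real content are misclassified as a menu.
console.log(looksLikeContent(['Rust', 'Swift'])); // false
```

The last call shows the weakness already noted above: a content list made of short links fails the check no matter where the threshold is set.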

So yeah, probably that first solution. I've already implemented it at https://github.com/Liamolucko/postlight-parser/tree/fix-link-lists and confirmed that it works.
