Is it intentional that dek elements need to be contained in the content? #676

Shepard · 2022-07-28T12:03:17Z

Platform: Windows 10 64 bit
Mercury Parser Version: 2.2.1
Browser Version (if a browser bug): Chrome 103

Expected Behavior

When defining a custom extractor, elements selected via the selector for the "dek" can be found anywhere in the document.

Current Behavior

The selector only finds something if the dek element is included in whatever the content selectors returned after selecting and cleaning.

Steps to Reproduce

I noticed this when writing a custom extractor for the site spektrum.de so I'll include the extractor code I have so far.

import Mercury from '@postlight/mercury-parser';

const SpektrumExtractor = {
  domain: 'www.spektrum.de',

  title: {
    selectors: [
      '.content__title'
    ],
  },

  author: {
    selectors: [
      '.content__author__info__name'
    ],
  },

  date_published: {
    selectors: [
      '.content__meta__date'
    ],
  },

  dek: {
    selectors: [
      '.content__intro'
    ],
  },

  lead_image_url: {
    selectors: [
      ['meta[name="og:image"]', 'value'],
      ['meta[property="og:image"]', 'content'],
      '.image__article__top img',
    ],
  },

  content: {
    selectors: [
      'article.content'
    ],
    clean: [
      '.breadcrumbs',
      '.hide-for-print',
      'aside',
      'header',
      '.image__article__top',
      '.content__author',
      '.copyright',
      '.callout-box',
    ],
  },
};

Mercury.addExtractor(SpektrumExtractor);

I then opened the article https://www.spektrum.de/news/genetik-das-geheimnis-der-parasitischen-rafflesien/2039026 and ran this code in the context of the page:

const result = await Mercury.parse(document.URL, {
	html: document.documentElement.outerHTML,
	fetchAllPages: false,
});
console.log(result.dek);

The console output will be null.
If I change the 'header' entry in the content's clean list to 'header h2', the dek element stays in the content, so it can be found and appears on the console.
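Written out, the adjusted content section might look like the sketch below. Only the 'header' entry changes; everything else is copied from the extractor above. The standalone `content` constant is just for illustration, in the real extractor it is the `content` property of the extractor object.

```javascript
// Sketch of the adjusted content section: 'header' becomes 'header h2',
// so only the h2 inside the header is cleaned away and the rest of the
// header (including the dek element) remains part of the content.
const content = {
  selectors: [
    'article.content',
  ],
  clean: [
    '.breadcrumbs',
    '.hide-for-print',
    'aside',
    'header h2', // was: 'header'
    '.image__article__top',
    '.content__author',
    '.copyright',
    '.callout-box',
  ],
};
```

With this change the dek is found, at the cost of the intro text also appearing in the extracted content.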

Detailed Description

I'm writing a custom extractor and I noticed that the dek property was always null after parsing. All the other properties were working and an element matching the selector I had defined for the dek was clearly contained in the document.
When debugging this, I noticed why it cannot be found: by the time the extraction code gets to the dek, the DOM the selector is applied to is no longer the original document but (from the looks of it) only what is left of it after extracting and cleaning the content property.

So, effectively, the dek has to be contained in the content in order to be found. I'm wondering if this is intentional. If so, I can adjust my selectors for the content to include the dek but I'd rather not have that bit in there. To me, the content should only be the main body of text.
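Until this is clarified, one possible workaround is to leave the content selectors alone and fill in a missing dek from the original document after parsing. The following is a minimal sketch under stated assumptions: `parseWithDekFallback` is a hypothetical helper, not part of Mercury Parser, and it takes the parse function as an argument so that it does not depend on any Mercury internals.

```javascript
// Hypothetical helper (not part of Mercury Parser): run the parser, then,
// if no dek was found, query the original, unmodified document for it.
//   parse       - function with the same call shape as Mercury.parse(url, options)
//   doc         - a DOM document (e.g. the page's `document`)
//   url         - the URL passed through to the parser
//   dekSelector - CSS selector for the dek element
async function parseWithDekFallback(parse, doc, url, dekSelector) {
  const result = await parse(url, {
    html: doc.documentElement.outerHTML,
    fetchAllPages: false,
  });

  // Only step in when the parser itself found no dek.
  if (result.dek == null) {
    const el = doc.querySelector(dekSelector);
    const text = el && el.textContent ? el.textContent.trim() : '';
    if (text) {
      return { ...result, dek: text };
    }
  }
  return result;
}
```

In the context of the page this could be called as `await parseWithDekFallback((url, opts) => Mercury.parse(url, opts), document, document.URL, '.content__intro')`. It duplicates the dek selector outside the extractor, so it only works around the behavior rather than resolving it.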
