Is it intentional that dek elements need to be contained in the content? #676

Open
Shepard opened this issue Jul 28, 2022 · 0 comments
Shepard commented Jul 28, 2022

  • Platform: Windows 10 64 bit
  • Mercury Parser Version: 2.2.1
  • Browser Version (if a browser bug): Chrome 103

Expected Behavior

When defining a custom extractor, elements selected via the selector for the "dek" can be found anywhere in the document.

Current Behavior

The selector only finds something if the dek element is included in whatever the content selectors returned after selecting and cleaning.

Steps to Reproduce

I noticed this when writing a custom extractor for the site spektrum.de so I'll include the extractor code I have so far.

import Mercury from '@postlight/mercury-parser';

const SpektrumExtractor = {
  domain: 'www.spektrum.de',

  title: {
    selectors: [
      '.content__title'
    ],
  },

  author: {
    selectors: [
      '.content__author__info__name'
    ],
  },

  date_published: {
    selectors: [
      '.content__meta__date'
    ],
  },

  dek: {
    selectors: [
      '.content__intro'
    ],
  },

  lead_image_url: {
    selectors: [
      ['meta[name="og:image"]', 'value'],
      ['meta[property="og:image"]', 'content'],
      '.image__article__top img',
    ],
  },

  content: {
    selectors: [
      'article.content'
    ],
    clean: [
      '.breadcrumbs',
      '.hide-for-print',
      'aside',
      'header',
      '.image__article__top',
      '.content__author',
      '.copyright',
      '.callout-box',
    ],
  },
};

// Register the custom extractor (the identifier must match the constant declared above).
Mercury.addExtractor(SpektrumExtractor);

I then opened the article https://www.spektrum.de/news/genetik-das-geheimnis-der-parasitischen-rafflesien/2039026 and ran this code in the context of the page:

const result = await Mercury.parse(document.URL, {
  html: document.documentElement.outerHTML,
  fetchAllPages: false,
});
console.log(result.dek);

The console output is null.
If I change the 'header' entry in the content's clean list to 'header h2', the dek element stays in the content, so the dek selector finds it and its text appears on the console.
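For reference, the adjustment described above would look roughly like this. It is a sketch of only the `content` section of the extractor, with every other entry unchanged from the code above; the standalone `content` constant is used here purely for illustration.

```javascript
// Adjusted `content` section of the extractor; only the 'header' entry differs.
const content = {
  selectors: ['article.content'],
  clean: [
    '.breadcrumbs',
    '.hide-for-print',
    'aside',
    // Was 'header'. Cleaning only 'header h2' left the dek element in the
    // content on the test article, so the dek selector could match it.
    'header h2',
    '.image__article__top',
    '.content__author',
    '.copyright',
    '.callout-box',
  ],
};
```

The downside is that the dek text then also ends up in the extracted content.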

Detailed Description

I'm writing a custom extractor and I noticed that the dek property was always null after parsing. All the other properties were working and an element matching the selector I had defined for the dek was clearly contained in the document.
While debugging, I noticed why it cannot be found: by the time the extraction code gets to the dek, the DOM the selector is applied to is no longer the original document but (from the looks of it) only what is left of it after extracting and cleaning the content property.

So, effectively, the dek has to be contained in the content in order to be found. I'm wondering if this is intentional. If so, I can adjust my selectors for the content to include the dek but I'd rather not have that bit in there. To me, the content should only be the main body of text.
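Until this is clarified, one possible stopgap is to read the dek from the original document after parsing, whenever the parser returns null. The following is a minimal sketch under that assumption. `withDekFallback` is a hypothetical helper, not part of Mercury Parser; it only relies on the `dek` property of the parse result and on the standard `querySelector` and `textContent` DOM APIs.

```javascript
// Hypothetical helper, not part of Mercury Parser.
// result:      the object returned by Mercury.parse()
// root:        anything with querySelector(), e.g. the page's `document`
// dekSelector: the selector that should match the dek in the original document
function withDekFallback(result, root, dekSelector) {
  // Keep whatever the parser found.
  if (result.dek) {
    return result;
  }
  const element = root.querySelector(dekSelector);
  const text = element && element.textContent ? element.textContent.trim() : '';
  // Return a copy with the dek filled in, or the untouched result if nothing matched.
  return text ? { ...result, dek: text } : result;
}

// Usage in the context of the page, after the Mercury.parse() call above:
//   const patched = withDekFallback(result, document, '.content__intro');
//   console.log(patched.dek);
```

This keeps the content selectors limited to the main body of text, at the cost of bypassing any cleaning the parser would normally apply to the dek.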
