Spidermon Field Coverage - Item Validation impossible with nested scrapy.Items #314

Open
Criamos opened this issue Sep 29, 2021 · 1 comment

Criamos commented Sep 29, 2021

Disclaimer: I'm a Python beginner and have been using Scrapy for the past 5 months, so please bear with me since this is the first GitHub Issue I've ever written.

Problem Description

I've tried integrating spidermon into an existing codebase with ~40 crawlers that use scrapy.Items as their data model. While integrating Item Validation (both via schematics and jsonschema) I noticed that spidermon only seems to be able to "see" the first level of a scrapy.Item (class: scrapy.Item), but not the other scrapy.Item classes that are nested within that item.

Source code examples

I've tried to illustrate the problem with a simplified abstraction that's close to the Scrapy Tutorial - here's my items.py:

# items.py
import scrapy
from scrapy import Field
from scrapy.loader import ItemLoader
from itemloaders.processors import TakeFirst  # scrapy.loader.processors is deprecated since Scrapy 2.3


class JoinMultivalues(object):
    def __init__(self, separator=u" "):
        self.separator = separator

    def __call__(self, values):
        return values


class LicenseItem(scrapy.Item):
    description = Field()


class QuotesItem(scrapy.Item):
    text = Field()
    author = Field()
    tags = Field()
    license = Field(output_processor=JoinMultivalues())


class QuotesItemLoader(ItemLoader):
    default_item_class = QuotesItem
    default_output_processor = TakeFirst()


class LicenseItemLoader(ItemLoader):
    default_item_class = LicenseItem
    default_output_processor = TakeFirst()

The main idea is: within the QuotesItem there's a LicenseItem that should hold a license description. A QuotesItem could also contain other nested scrapy.Items, sometimes several layers deep.

This is what a yielded item looks like in the terminal (please ignore the "raw" formatting of the license description string):

2021-09-29 14:03:15 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
{'author': ['Allen Saunders'],
 'license': [{'description': ['[\'<p class="copyright">\\n                Made with <span '
                 'class="sh-red"></span> by <a '
                 'href="https://scrapinghub.com">Scrapinghub</a>\\n            '
                 "</p>']"]}],
 'tags': ['fate', 'life', 'misattributed-john-lennon', 'planning', 'plans'],
 'text': ['“Life is what happens to us while we are making other plans.”']}

The crawler itself uses the basic Scrapy monitors and does little more than scrape the tutorial website, use the scrapy.loader.ItemLoader class to nest one scrapy.Item within another, and yield the QuotesItem at the end.

# quotes_spider.py
import scrapy
from scrapy.loader import ItemLoader

from ..items import QuotesItem, LicenseItem


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]
    custom_settings = {
        'SPIDERMON_SPIDER_CLOSE_MONITORS': 'spidermon.contrib.scrapy.monitors.SpiderCloseMonitorSuite',
        'SPIDERMON_MIN_ITEMS': 1,
        'SPIDERMON_VALIDATION_MODELS': 'tutorial.tutorial.validators',
        'SPIDERMON_ADD_FIELD_COVERAGE': True,
        'SPIDERMON_FIELD_COVERAGE_RULES': {
            'QuotesItem/text': 0.9,
            'QuotesItem/author': 0.8,
            'QuotesItem/tags': 1,
            'QuotesItem/license/description': 1
        }
    }

    def parse(self, response, **kwargs):
        for quote in response.css('div.quote'):
            item_loader = ItemLoader(item=QuotesItem(), response=response)
            # Implementation via nested dictionaries
            # quotes_item = {
            #     'text': quote.css('span.text::text').get(),
            #     'author': quote.css('small.author::text').get(),
            #     'tags': quote.css('div.tags a.tag::text').getall(),
            #     'nested_dict': {
            #         'nested_field_1': 'can you read this?',
            #         'nested_field_2': 'I hope you can',
            #         'we_must_go_deeper': {
            #             'description': 'but do we really have to?',
            #             "4th_level": {
            #                 "five_levels_deep": "we can't turn back now!"
            #             }
            #         }
            #     }
            # }
            # yield quotes_item

            # Implementation #2 via ItemLoaders - see items.py:
            item_loader.add_value('text', quote.css('span.text::text').get())
            item_loader.add_value('author', quote.css('small.author::text').get())
            item_loader.add_value('tags', quote.css('div.tags a.tag::text').getall())

            license_loader = ItemLoader(item=LicenseItem(), response=response)
            license_raw = response.xpath('//footer/div/p[@class="copyright"]').getall()
            license_description = str(license_raw)
            license_loader.add_value('description', license_description)

            item_loader.add_value('license', license_loader.load_item())
            yield item_loader.load_item()

Output examples

This is what the field coverage output looks like:

'spidermon_field_coverage/QuotesItem/author': 1.0,
 'spidermon_field_coverage/QuotesItem/license': 1.0,
 'spidermon_field_coverage/QuotesItem/tags': 1.0,
 'spidermon_field_coverage/QuotesItem/text': 1.0,
 'spidermon_item_scraped_count': 20,
 'spidermon_item_scraped_count/QuotesItem': 20,
 'spidermon_item_scraped_count/QuotesItem/author': 20,
 'spidermon_item_scraped_count/QuotesItem/license': 20,
 'spidermon_item_scraped_count/QuotesItem/tags': 20,
 'spidermon_item_scraped_count/QuotesItem/text': 20,

My expectation/hope was that I'd be able to "look inside" the LicenseItem and spidermon would show me spidermon_item_scraped_count/QuotesItem/license/description. But as you can see above, spidermon stops at the depth of QuotesItem/license. Spidermon's field coverage monitor can't look inside my license-Item and therefore fails while trying to access the description-field with the following output:

2021-09-29 14:03:15 [quotes] ERROR: [Spidermon] 
======================================================================
FAIL: Field Coverage Monitor/test_check_if_field_coverage_rules_are_met
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/criamos/PycharmProjects/spidermonTesting/venv/lib/python3.9/site-packages/spidermon/contrib/scrapy/monitors.py", line 275, in test_check_if_field_coverage_rules_are_met
    self.assertTrue(len(failures) == 0, msg=msg)
AssertionError: 
The following items did not meet field coverage rules:
QuotesItem/license/description (expected 1, got 0)
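The difference between the two outputs can be modelled with a small counter that descends into mappings but treats lists as opaque values. This is only a sketch of the observed behaviour, not spidermon's actual code: the function name `count_field_paths` and the "mappings only" rule are assumptions made for illustration.

```python
from collections import Counter
from collections.abc import Mapping


def count_field_paths(item, prefix, counts):
    """Record one hit per key path, descending into mappings only."""
    for key, value in item.items():
        path = f"{prefix}/{key}"
        counts[path] += 1
        # Hypothetical rule: a list is an opaque value, so a mapping
        # wrapped in a list is never descended into.
        if isinstance(value, Mapping):
            count_field_paths(value, path, counts)


counts = Counter()
# Shape of the yielded QuotesItem: the nested item sits inside a list.
count_field_paths(
    {"text": ["..."], "license": [{"description": ["..."]}]},
    "QuotesItem",
    counts,
)
# Shape of the plain-dict variant: the nested mapping is a direct value.
count_field_paths(
    {"text": "...", "nested_dict": {"nested_field_1": "..."}},
    "dict",
    counts,
)

print(sorted(counts))
```

Under this model `QuotesItem/license` is counted but `QuotesItem/license/description` never appears, while `dict/nested_dict/nested_field_1` does, which matches the stats shown in this issue.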

Item Validators - validators.py

(the commented-out parts were different approaches until I realized that the problem must lie elsewhere)

# validators.py
from schematics import Model
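# Explicit imports instead of a star import; the commented-out alternatives
# below would additionally need DictType and BaseType.
from schematics.types import ListType, ModelType, StringType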


class LicenseItemValidator(Model):
    description = StringType()


class QuoteItemValidator(Model):
    text = StringType(required=True)
    author = StringType(required=True)
    tags = ListType(StringType)
    # license = DictType(field=StringType, coerce_key=BaseType)
    # license = ModelType(model_spec=LicenseItemValidator, required=True)
    license = ListType(ModelType(LicenseItemValidator))

Expected Behaviour:

If I yield a normal Python dictionary (see "Implementation via nested dictionaries" in my quotes_spider example above), I get the following output:

'spidermon_item_scraped_count': 20,
 'spidermon_item_scraped_count/dict': 20,
 'spidermon_item_scraped_count/dict/author': 20,
 'spidermon_item_scraped_count/dict/nested_dict': 20,
 'spidermon_item_scraped_count/dict/nested_dict/nested_field_1': 20,
 'spidermon_item_scraped_count/dict/nested_dict/nested_field_2': 20,
 'spidermon_item_scraped_count/dict/nested_dict/we_must_go_deeper': 20,
 'spidermon_item_scraped_count/dict/nested_dict/we_must_go_deeper/4th_level': 20,
 'spidermon_item_scraped_count/dict/nested_dict/we_must_go_deeper/4th_level/five_levels_deep': 20,
 'spidermon_item_scraped_count/dict/nested_dict/we_must_go_deeper/description': 20,
 'spidermon_item_scraped_count/dict/tags': 20,
 'spidermon_item_scraped_count/dict/text': 20,

This works as expected: the nested dictionaries are accessible to spidermon.


Now I'm all out of ideas, since the only solution I currently see is to "flatten" all the sub-Items into one big scrapy.Item structure. This could totally be my fault and I'm simply using spidermon and schematics wrong here, but if anyone could confirm whether this is intended behaviour or not, it would be really appreciated. Thank you in advance for taking the time to read this wall of text (and thank you for developing scrapy / spidermon!)
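A less invasive alternative to flattening might be to convert the nested items into plain nested dictionaries just before yielding, since nested dictionaries are counted in the output above. The helper below is a minimal sketch under assumptions: `to_plain` is a hypothetical name, it relies on scrapy.Item behaving like a mapping, and unwrapping a list that holds exactly one mapping is a design choice made here to avoid the list wrapper, not something spidermon prescribes. It has not been verified against spidermon itself.

```python
from collections.abc import Mapping


def to_plain(value):
    """Recursively turn mappings (e.g. scrapy.Item) into plain dicts."""
    if isinstance(value, Mapping):
        return {key: to_plain(inner) for key, inner in value.items()}
    if isinstance(value, (list, tuple)):
        converted = [to_plain(inner) for inner in value]
        # Assumption: a list holding a single nested item stands for
        # that item, so the list wrapper is dropped.
        if len(converted) == 1 and isinstance(converted[0], dict):
            return converted[0]
        return converted
    return value


# Same shape as the yielded item shown earlier, with shortened values.
item = {
    "author": ["Allen Saunders"],
    "license": [{"description": ["Made with love by Scrapinghub"]}],
    "tags": ["fate", "life"],
}
plain = to_plain(item)
print(plain["license"])
```

In the spider this would be used as `yield to_plain(item_loader.load_item())`. Judging by the dict output above, the stats keys would then start with `dict/` instead of `QuotesItem/`, so the coverage rules would presumably have to be renamed accordingly.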

@mushtaqak

@Criamos I think this is expected. As of now, spidermon does not support nested items inside a list, such as the license field in your case, so we cannot apply coverage rules to license.description.
