Restock & Price monitor - Use itemprop where available #2041

Open
wants to merge 27 commits into
base: master

Conversation

dgtlmoon
Owner

@dgtlmoon dgtlmoon commented Dec 8, 2023

Re #2039 -

Make the restock monitor use machine-readable data first where available, then fall back to what the browser scraper returns.

https://schema.org/ItemAvailability

<link itemprop="availability" href="https://schema.org/OutOfStock" />
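
A minimal sketch of the "machine data first, scraper second" order described above, using only the Python standard library. The names `AvailabilityParser` and `resolve_in_stock` are hypothetical and not taken from this PR, and treating only `InStock` as "in stock" is an assumption.

```python
from html.parser import HTMLParser
from typing import Optional


class AvailabilityParser(HTMLParser):
    """Collect the value of every itemprop="availability" element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.values = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if attrs.get("itemprop") == "availability":
            # <link> carries the value in href, <meta> carries it in content
            value = attrs.get("href") or attrs.get("content")
            if value:
                self.values.append(value)


def resolve_in_stock(html: str, scraped_in_stock: Optional[bool]):
    """Return (in_stock, source): prefer embedded metadata, else the scraper's answer."""
    parser = AvailabilityParser()
    parser.feed(html)
    if parser.values:
        # Assumption: the first declared availability wins, and only
        # "InStock" counts as in stock.
        status = parser.values[0].rstrip("/").rsplit("/", 1)[-1]
        return status == "InStock", "itemprop"
    return scraped_in_stock, "scraped"
```

Returning the source alongside the value would also make it easy to record where the result came from.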

Todo

  • Add test
  • Check handling when two or more itemprop elements are on the page
  • Check it works well with existing watches
  • Add price extraction too (present it as stock, price next to tags)
  • Upper and lower price alerts, where the 'change text' could be "it exceeded lower" or "it exceeded higher"
  • Record in the watch where the data came from (scraped, LD-JSON, etc.)
  • Remove the confusing 'follow JSON-LD embedded data?' prompt
  • Update: all watches with 'track_ldjson_price_data' set should now be in "restock detection" mode
  • Update: every 'watch - in_stock' should become 'watch - restock - in_stock' (bool)
  • Check memory usage (the current PR uses 400 MB with 1200 entries :( )
  • Check that no exception:... errors exist with the test set
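
The upper/lower price alert from the list could look roughly like this. It is only a sketch: `price_alert` and its parameters are hypothetical names, and the exact wording of the change text is an assumption.

```python
from typing import Optional


def price_alert(price: float,
                lower: Optional[float] = None,
                upper: Optional[float] = None) -> Optional[str]:
    """Return a change text when the price leaves the [lower, upper] band, else None."""
    if lower is not None and price < lower:
        return f"Price {price:.2f} exceeded lower limit {lower:.2f}"
    if upper is not None and price > upper:
        return f"Price {price:.2f} exceeded upper limit {upper:.2f}"
    return None
```

Leaving either limit as `None` disables that side of the check, so a watch can alert on price drops only.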

Actually, a better addition here is an extra attribute on the watch that records the most recent price (and stock status?), so that we can add a setting to alert if the price moves.

Also add two extra columns to the watch-overview table (maybe this should be extensible, so that different features can announce that they want an extra column in the table).

Maybe add a toggle button like "follow content change" or "follow price/stock change".

@druppelt

druppelt commented Dec 8, 2023

This seems to currently only support links like https://schema.org/InStock and http://schema.org/InStock. Google states that "The short names without the URL prefix are also supported (for example, BackOrder).".
Edit: I misunderstood the code, I think this part is fine.

Also, I'm not sure if link elements are the only place these can occur. To me, the Google documentation looks like meta elements are also supported, e.g. <meta itemprop="availability" content="https://schema.org/OutOfStock" />

And, while it may be out of scope for this PR, there is also the RDFa format, which looks like this: <div property="schema:availability" content="https://schema.org/InStock"></div>
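
To illustrate the variants mentioned in this comment (link vs. meta elements, RDFa `property` attributes, and short names without the URL prefix), here is a rough standard-library sketch. `normalise_availability`, `AvailabilityCollector` and `find_availability` are hypothetical names, not code from this PR.

```python
from html.parser import HTMLParser

SCHEMA_PREFIXES = ("https://schema.org/", "http://schema.org/")


def normalise_availability(value: str) -> str:
    """Reduce 'https://schema.org/InStock' or 'http://schema.org/InStock' to 'InStock'."""
    value = value.strip()
    for prefix in SCHEMA_PREFIXES:
        if value.startswith(prefix):
            return value[len(prefix):]
    return value  # already a short name such as 'BackOrder'


class AvailabilityCollector(HTMLParser):
    """Gather availability from Microdata (itemprop) and RDFa (property) markup."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        is_microdata = attrs.get("itemprop") == "availability"
        is_rdfa = attrs.get("property") in ("availability", "schema:availability")
        if is_microdata or is_rdfa:
            # <link> uses href; <meta> and RDFa elements may use content
            value = attrs.get("href") or attrs.get("content")
            if value:
                self.found.append(normalise_availability(value))


def find_availability(html: str) -> list:
    """Return every normalised availability value found, in document order."""
    collector = AvailabilityCollector()
    collector.feed(html)
    return collector.found
```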

@dgtlmoon
Owner Author

dgtlmoon commented Dec 9, 2023

@druppelt I just realised that we store in_stock in the watch data structure anyway, so running with only that value means we don't send a stray notification/trigger/change, which is a nice little win.

What do you think about this XPath? Do all itemprop attributes have to be inside an itemtype?

//*[@itemtype='https://schema.org/Offer']//*[@itemprop='availability']/@href
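
For illustration, this is how the scoped expression behaves on a small, well-formed snippet. It is a sketch using the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath (no `/@href` attribute step) and needs XML-valid input; real pages would need a proper HTML parser. `SNIPPET` and `offer_availability` are made-up names.

```python
import xml.etree.ElementTree as ET

# Made-up test markup: one availability inside an Offer, one outside it.
SNIPPET = """
<body>
  <div itemprop="offers" itemscope="" itemtype="https://schema.org/Offer">
    <link itemprop="availability" href="https://schema.org/InStock" />
  </div>
  <link itemprop="availability" href="https://schema.org/OutOfStock" />
</body>
"""


def offer_availability(markup: str) -> list:
    """Return the href of every availability element nested inside an Offer."""
    root = ET.fromstring(markup)
    # ElementTree cannot select attributes directly, so read href per element.
    nodes = root.findall(
        ".//*[@itemtype='https://schema.org/Offer']//*[@itemprop='availability']"
    )
    return [node.get("href") for node in nodes]
```

With the scoping in place, the availability element outside the Offer is ignored, so pages that place itemprop outside an itemtype would not match.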

I also added RDFa-style detection, but I couldn't find a web page "in the wild" to test it on. Do you have any links I could test it against?

@dgtlmoon
Owner Author

dgtlmoon commented Dec 9, 2023

Ah, according to https://schema.org/availability

Microdata (completed in this PR) ✔️

  <div itemprop="offers" itemscope itemtype="https://schema.org/Offer">
    <span itemprop="priceCurrency" content="USD">$</span><span
          itemprop="price" content="1000.00">1,000.00</span>

    <link itemprop="availability" href="https://schema.org/InStock" />In stock

RDFa (not yet; it also needs availability handling, so far we have //*[@property='schema:availability']/@content)

  <div property="offers" typeof="Offer">
    <!--price is 1000, a number, with locale-specific thousands separator
        and decimal mark, and the $ character is marked up with the
        machine-readable code "USD" -->
    <span property="priceCurrency" content="USD">$</span><span
      property="price" content="1000.00">1,000.00</span>
    <link property="availability" href="https://schema.org/InStock" />In stock
  </div>

JSON-LD (handled by the 'follow JSON-LD embedded data?' prompt, but this should also work here; needs to be integrated)

<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Product",
  "aggregateRating": {
    "@type": "AggregateRating",
    "ratingValue": "3.5",
    "reviewCount": "11"
  },
  "description": "0.7 cubic feet countertop microwave. Has six preset cooking categories and convenience features like Add-A-Minute and Child Lock.",
  "name": "Kenmore White 17\" Microwave",
  "image": "kenmore-microwave-17in.jpg",
  "offers": {
    "@type": "Offer",
    "availability": "https://schema.org/InStock",
    "price": "55.00",
    "priceCurrency": "USD"
  },
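
A rough sketch of pulling the offer fields out of a JSON-LD block like the one above, standard library only. `LdJsonParser` and `offer_from_ldjson` are hypothetical names, and taking the first offer of the first Product is an assumption, not what this PR does.

```python
import json
from html.parser import HTMLParser


class LdJsonParser(HTMLParser):
    """Collect the raw text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_ldjson = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_ldjson = True
            self.blocks.append("")

    def handle_data(self, data):
        if self._in_ldjson:
            self.blocks[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_ldjson = False


def offer_from_ldjson(html: str):
    """Return price, currency and availability of the first Product offer, or None."""
    parser = LdJsonParser()
    parser.feed(html)
    for block in parser.blocks:
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # skip malformed blocks rather than failing the whole check
        if not isinstance(data, dict) or data.get("@type") != "Product":
            continue
        offers = data.get("offers")
        if isinstance(offers, list):
            offers = offers[0] if offers else None
        if isinstance(offers, dict):
            return {
                "price": offers.get("price"),
                "currency": offers.get("priceCurrency"),
                "availability": offers.get("availability"),
            }
    return None
```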

@dgtlmoon dgtlmoon changed the title from "Restock monitor - Use itemprop where available" to "Restock & Price monitor - Use itemprop where available" on May 3, 2024
@dgtlmoon
Owner Author

scrapinghub/extruct#232 stuck here

@@ -240,7 +240,7 @@ def _get_stripped_text_from_json_match(match):
# ensure_is_ldjson_info_type - str "product", optional, "@type == product" (I dont know how to do that as a json selector)
def extract_json_as_string(content, json_filter, ensure_is_ldjson_info_type=None):
    stripped_text_from_html = False

    # https://github.com/dgtlmoon/changedetection.io/pull/2041#issuecomment-1848397161w
Owner Author

all of this code can be replaced with extruct

@property
def has_restock_info(self):
    # has either price or availability
    if self.get('restock'):
Owner Author

this should be moved to the actual Restock object
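
One possible shape for that move, purely as a sketch: the `Restock` class below is a hypothetical dict subclass, and the `price` and `in_stock` keys are assumptions rather than the structure used in this PR.

```python
class Restock(dict):
    """Hypothetical container for the stock and price data stored on a watch."""

    @property
    def has_restock_info(self) -> bool:
        # True when either a price or an availability value is present
        return self.get("price") is not None or self.get("in_stock") is not None
```

Callers on the watch could then ask `watch['restock'].has_restock_info` instead of re-implementing the check.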


2 participants