fix: Support nested lists in scraper image field #2538

Draft
wants to merge 1 commit into
base: mealie-next


robot-rover

The image field returned by the scraper can take several different shapes. It is handled by the RecipeDataService#scrape_image() function in recipe_data_service.py, which already supports lists and dicts; this commit adds support for nested lists and other combinations.

What type of PR is this?

(REQUIRED)

  • bug

What this PR does / why we need it:

(REQUIRED)

Some recipe parsers return nested lists in the image field. For example, this recipe: https://www.tasteofhome.com/recipes/favorite-chicken-potpie/ has the tag

<script type="application/ld+json">{
    "@context": "https:\/\/schema.org",
    "@type": "Recipe",
    "@id": "https:\/\/www.tasteofhome.com\/recipes\/favorite-chicken-potpie\/",
    # [skipped lines]
    "image": [
        "https:\/\/tmbidigitalassetsazure.blob.core.windows.net\/rms3-prod\/attachments\/37\/1200x1200\/exps21444_TH132767B05_02_1b_WEB.jpg",
        [
            "https:\/\/tmbidigitalassetsazure.blob.core.windows.net\/toh\/GoogleImages\/exps21444_TH132767B05_02_1b_WEB.jpg"
        ],
        [
            "https:\/\/tmbidigitalassetsazure.blob.core.windows.net\/toh\/GoogleImagesPostCard\/exps21444_TH132767B05_02_1b_WEB.jpg"
        ]
    ],
    

The nested list was crashing the original implementation of scrape_image with the following stack trace:

ERROR: 05-Sep-23 10:40:26       Error Scraping Image: No connection adapters were found for "['https://tmbidigitalassetsazure.blob.core.windows.net/toh/GoogleImages/exps21444_TH132767B05_02_1b_WEB.jpg']"
Traceback (most recent call last):
  File "/app/./mealie/services/scraper/scraper.py", line 45, in create_from_url
    recipe_data_service.scrape_image(new_recipe.image)
  File "/app/./mealie/services/recipe/recipe_data_service.py", line 128, in scrape_image
    loop.run_until_complete(future)
  File "uvloop/loop.pyx", line 1501, in uvloop.loop.Loop.run_until_complete
  File "/app/./mealie/services/recipe/recipe_data_service.py", line 30, in largest_content_len
    for response in await asyncio.gather(*tasks):
  File "/usr/local/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/app/./mealie/services/recipe/recipe_data_service.py", line 25, in <lambda>
    loop.run_in_executor(executor, lambda: session.head(url, headers={"User-Agent": _FIREFOX_UA}))
  File "/opt/pysetup/.venv/lib/python3.10/site-packages/requests/sessions.py", line 622, in head
    return self.request("HEAD", url, **kwargs)
  File "/opt/pysetup/.venv/lib/python3.10/site-packages/requests/sessions.py", line 587, in request
    resp = self.send(prep, **send_kwargs)
  File "/opt/pysetup/.venv/lib/python3.10/site-packages/requests/sessions.py", line 695, in send
    adapter = self.get_adapter(url=request.url)
  File "/opt/pysetup/.venv/lib/python3.10/site-packages/requests/sessions.py", line 792, in get_adapter
    raise InvalidSchema(f"No connection adapters were found for {url!r}")
requests.exceptions.InvalidSchema: No connection adapters were found for "['https://tmbidigitalassetsazure.blob.core.windows.net/toh/GoogleImages/exps21444_TH132767B05_02_1b_WEB.jpg']"
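
The quoted value in the error message looks like the string form of a Python list rather than a URL. As far as I can tell, requests converts a non-string URL with str() and then picks a connection adapter by prefix match, so a stringified list starting with `[` matches neither `http://` nor `https://`. The sketch below imitates that prefix check with the standard library only; `has_adapter` and the example.com URL are made up for illustration and are not part of requests or Mealie.

```python
# Prefixes that a default requests.Session mounts adapters for.
ADAPTER_PREFIXES = ("https://", "http://")


def has_adapter(url: object) -> bool:
    """Hypothetical stand-in for the prefix match done when choosing an adapter."""
    text = str(url).lower()  # a list is turned into its repr, brackets included
    return any(text.startswith(prefix) for prefix in ADAPTER_PREFIXES)


# Hypothetical nested entry, shaped like the inner lists in the JSON-LD above.
nested_entry = ["https://example.com/images/potpie.jpg"]

print(str(nested_entry))             # the bracketed text that ends up in the error
print(has_adapter(nested_entry))     # False: the text starts with "[" not "https://"
print(has_adapter(nested_entry[0]))  # True: the inner string is a normal URL
```

This suggests the inner list has to be unwrapped before any request is made, rather than caught afterwards.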

This PR fixes the crash by moving the code that handles lists into a new function and calling it recursively.
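
A minimal sketch of the recursive idea, not the code in this PR: `collect_image_urls` is a hypothetical name, and reading dict entries through a `"url"` key is an assumption about how dict-shaped image values look. It flattens strings, dicts, and arbitrarily nested lists into one flat list of URL strings.

```python
from typing import Any


def collect_image_urls(image: Any) -> list[str]:
    """Flatten a scraped image field into a flat list of URL strings."""
    if isinstance(image, str):
        return [image]
    if isinstance(image, dict):
        # Assumption: dict-shaped entries carry the address under "url".
        url = image.get("url")
        return collect_image_urls(url) if url is not None else []
    if isinstance(image, (list, tuple)):
        urls: list[str] = []
        for item in image:
            # Recurse so that lists inside lists are unwrapped as well.
            urls.extend(collect_image_urls(item))
        return urls
    # None, numbers, and other unexpected types are ignored.
    return []


# Hypothetical input mirroring the shape of the image field shown above.
field = [
    "https://example.com/a.jpg",
    ["https://example.com/b.jpg"],
    [{"url": "https://example.com/c.jpg"}],
]
print(collect_image_urls(field))
```

With a flat list of strings in hand, the existing logic that picks the largest image could run unchanged on every candidate URL.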

Which issue(s) this PR fixes:

(REQUIRED)

None

Testing

(fill-in or delete this section)

I tested the recipe that was broken (https://www.tasteofhome.com/recipes/favorite-chicken-potpie/) as well as a few other recipes that worked originally. They all imported correctly in my development container.

Release Notes

(REQUIRED)

NONE

@hay-kot hay-kot marked this pull request as draft October 7, 2023 21:28

This PR is stale because it has been open 45 days with no activity.

@github-actions github-actions bot added the stale label Jan 24, 2024