Videos missing #4

Closed
rgaudin opened this issue Jan 14, 2021 · 7 comments

@rgaudin
Contributor

rgaudin commented Jan 14, 2021

While investigating openzim/zimit#71, I realized I can't scrape videos reliably with the current version.

Even a very simple test doesn't work:

The actual video content is not fetched.

I think this is the root cause behind fondamentaux missing most (if not all) of its videos. See openzim/zimit#78

That may indicate that the problem is not new (those November runs were using zimit:dev, which at that time used webrecorder/browsertrix-crawler:0.1.0). Maybe something changed on Youtube.com around that time?

@ikreymer
Member

Thanks for including specific examples! I found an issue with Youtube, both for capture and replay, which will require a fix to both pywb and wabac.js.

I can confirm from other reports that Youtube changed something about their player recently, which requires updating the rules.

You can follow this issue for the pywb fix: webrecorder/pywb#607
Will add one for wabac.js as well soon.

For Vimeo, I was not able to find an issue with manual capture.
Will try running the crawler on these to see if any issues can be reproduced.

However, lesfondamentaux is not using Youtube or Vimeo, so if there is an issue with those videos, it would actually be a separate issue.

It's possible that multiple issues are involved: youtube and vimeo each require custom rules, while the default fondamentaux player is, I believe, more standard html5. Will test more soon.

@ikreymer
Member

After further investigation, here are some findings:

  • It appears the autoplay script doesn't run correctly in all frames unless headless mode is used, with --headless. This is a bit odd, and may be something with puppeteer. I don't recall seeing this before, but with the headless flag the embedded videos are detected, while without it the script does not seem to run in those iframes.

  • An additional replay fix was needed in pywb for vimeo.

  • There seem to be no issues with lesfondamentaux. It's just using default html5 video, so it should not have been affected. The embedded youtube video on the home page was archived as well.

So it seems the solution, in addition to the youtube + vimeo-specific rule updates, is probably to default zimit to run with the --headless flag set. Still not sure why it doesn't work in regular mode, though.

@rgaudin
Contributor Author

rgaudin commented Jan 26, 2021

OK, can't really think of what could have happened to get different results. Maybe that working zimit:dev was a manual push with a different version of chrome?

Thanks for the progress.

@ikreymer
Member

Found the cause and fix. It turns out the chrome command-line args need an additional flag when not in headless mode, which I thought was the default already but isn't: --disable-features=IsolateOrigins,site-per-process

Background: Without it, iframes for youtube and vimeo are not accessible, and therefore cannot be autoplayed. See the related issue and work on that in puppeteer/puppeteer#4960

Will add that to default options in next image release.
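
To make the intended default concrete, here is a minimal sketch of how a launcher might assemble the Chrome arguments. The function name `build_chrome_args` and its parameters are hypothetical and are not taken from the crawler's code; only the flag itself comes from the comment above.

```python
from typing import List, Optional

# Flag quoted above. Per this thread, without it the youtube and vimeo
# iframes are not accessible when Chrome runs in non-headless mode.
IFRAME_ACCESS_FLAG = "--disable-features=IsolateOrigins,site-per-process"


def build_chrome_args(headless: bool, extra_args: Optional[List[str]] = None) -> List[str]:
    """Return Chrome command-line args (hypothetical helper, for illustration).

    The flag is appended only when not in headless mode, since that is the
    case described in this thread, and only if it is not already present.
    """
    args = list(extra_args or [])
    if not headless and IFRAME_ACCESS_FLAG not in args:
        args.append(IFRAME_ACCESS_FLAG)
    return args
```

For example, `build_chrome_args(headless=False)` yields a list containing only the flag, which could then be passed as the `args` launch option to whatever drives the browser.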

ikreymer added a commit to webrecorder/pywb that referenced this issue Jan 26, 2021
* rules: updated rule to fix replay of latest youtube watch and embed pages
include youtube-nocookie variant
fixes #607
part of fix for webrecorder/browsertrix-crawler#4

* rules: additional rules fix for vimeo
ikreymer added a commit to webrecorder/wabac.js that referenced this issue Jan 27, 2021
ikreymer added a commit that referenced this issue Jan 27, 2021
- fixes for iframes, as described in #4
- bump chrome to 88
- bump pywb to 2.5.0
- bump version to 1.0.5
@ikreymer
Member

With latest pywb and wabac.js, it should be possible to capture and replay these videos, both in pywb and replayweb.page.

However, for the zimit use case, it is unfortunately a bit more complicated, because additional custom fuzzy matching rules specific to zimit are needed. Youtube now requires a POST request to load some data for the video, which was not required before. Both wabac.js and pywb support a prefix-based query to find the 'best' match, while zimit currently requires an exact match only, which is less flexible. Perhaps we can discuss in more detail on a zimit issue.
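
The difference between the two lookup styles can be sketched roughly as follows. This is only an illustration of the idea and does not reproduce the actual rules in wabac.js, pywb, or zimit; the function names, the example URLs, and the "longest shared prefix wins" scoring are all assumptions.

```python
from typing import Iterable, Optional


def exact_match(archived: Iterable[str], url: str) -> Optional[str]:
    """Return the URL only if it was archived character for character."""
    return url if url in set(archived) else None


def _common_prefix_len(a: str, b: str) -> int:
    """Count how many leading characters two strings share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def prefix_match(archived: Iterable[str], url: str) -> Optional[str]:
    """Return the archived URL with the same path that best matches `url`.

    Candidates must share everything before the query string; among them,
    the one sharing the longest prefix with the request is chosen (assumed
    scoring, for illustration only).
    """
    base = url.split("?", 1)[0]
    candidates = [u for u in archived if u.split("?", 1)[0] == base]
    if not candidates:
        return None
    return max(candidates, key=lambda u: _common_prefix_len(u, url))


# Hypothetical archive: the request differs from the capture in one volatile parameter.
archived = [
    "https://example.com/player?v=abc&t=111",
    "https://example.com/player?v=xyz&t=222",
]
request = "https://example.com/player?v=abc&t=999"
print(exact_match(archived, request))   # None
print(prefix_match(archived, request))  # https://example.com/player?v=abc&t=111
```

An exact-match store fails on the request above, while the prefix-based lookup still finds the capture for the same video.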

@rgaudin
Contributor Author

rgaudin commented Jan 27, 2021

OK, please open a ticket there with the details. My understanding is that the problem will be in the replaying, as we need a fixed prefix so that we have a unique URL to query data from the ZIM, right?

ikreymer added a commit that referenced this issue Jan 29, 2021
- fixes for iframes, as described in #4
- bump chrome to 88
- bump pywb to 2.5.0
- bump version to 1.0.5
@ikreymer
Member

The crawler is now capable of capturing these videos, and they replay at least in pywb and replayweb.page.
ZIM replay will require some additional work in warc2zim, to be tracked in the linked issue.
