Videos missing #4
Comments
Thanks for including specific examples! I found an issue with YouTube, for both capture and replay, which will require a fix in both pywb and wabac.js. I can confirm from other reports that YouTube recently changed something about their player, which requires updating the rules. You can follow this issue for the pywb fix: webrecorder/pywb#607

For Vimeo, I was not able to find an issue with manual capture. However, lesfondamentaux is not using YouTube or Vimeo, so if there is a problem with those videos, it would be a separate issue. It's possible that multiple issues are involved: YouTube and Vimeo each require custom rules, while the default fondamentaux player is, I believe, more standard HTML5. Will test more soon.
After further investigation, here are some findings:
So it seems the solution, in addition to the YouTube- and Vimeo-specific rule updates, is probably to default zimit to run with
OK, I can't really think of what could have happened to get different results. Maybe that working Thanks for the progress.
Found the cause and fix. It turns out the Chrome command-line args need an additional flag when not in headless mode, which I thought was already the default but isn't:

Background: without it, iframes for YouTube and Vimeo are not accessible and therefore cannot be autoplayed. See the related issue and the work on it in puppeteer/puppeteer#4960. Will add that to the default options in the next image release.
* rules: updated rule to fix replay of latest youtube watch and embed pages, include youtube-nocookie variant, fixes #607, part of fix for webrecorder/browsertrix-crawler#4
* rules: additional rules fix for vimeo
…imeo content (part of work for webrecorder/browsertrix-crawler#4) bump version to 2.5.3
- fixes for iframes, as described in #4
- bump chrome to 88
- bump pywb to 2.5.0
- bump version to 1.0.5
With the latest pywb and wabac.js, it should be possible to capture and replay these videos, both in pywb and replayweb.page. However, for the zimit use case it is unfortunately a bit more complicated, because additional custom fuzzy matching rules specific to zimit are needed. YouTube now requires a POST request to load some data for the video, which was not required before. Both wabac.js and pywb support a prefix-based query to find the 'best' match, while zimit currently requires an exact match, which is less flexible. Perhaps we can discuss this in more detail on a zimit issue.
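To make the difference between the two lookup styles concrete, here is a minimal sketch. It is not the actual algorithm used by pywb, wabac.js, or zimit; the function names, the example URLs, and the "longest common prefix" scoring are all assumptions chosen for illustration only.

```python
from os.path import commonprefix
from typing import Dict, Optional


def exact_match(index: Dict[str, str], url: str) -> Optional[str]:
    """Return the record stored under exactly this URL, or None."""
    return index.get(url)


def prefix_best_match(index: Dict[str, str], url: str) -> Optional[str]:
    """Among captured URLs that share the request's path (everything before
    '?'), return the record whose URL has the longest common prefix with
    the requested URL. Hypothetical scoring, for illustration only."""
    base = url.split("?", 1)[0]
    candidates = [u for u in index if u.split("?", 1)[0] == base]
    if not candidates:
        return None
    best = max(candidates, key=lambda u: len(commonprefix([u, url])))
    return index[best]


# Hypothetical captured URLs; the query strings are made up.
captured = {
    "https://example.com/player?v=abc&t=111": "record-1",
    "https://example.com/player?v=xyz&t=222": "record-2",
}

# At replay time the page asks for the same video with a different,
# session-specific parameter value.
request = "https://example.com/player?v=abc&t=999"

print(exact_match(captured, request))        # None
print(prefix_best_match(captured, request))  # record-1
```

With an exact-match store, the replay request above finds nothing because one volatile parameter differs, whereas the prefix-based lookup still resolves to the capture of the same video.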
OK, please open a ticket there with the details. My understanding is that the problem will be in the replaying, as we need a fixed prefix so that we have a unique URL to query data from the ZIM, right?
The crawler is now capable of capturing these videos, and they replay at least in pywb and replayweb.page.
send 500 on exception
While investigating openzim/zimit#71, I realized I can't scrape videos reliably with the current version.
Even a very simple test doesn't work:
The actual video content is not fetched.
I think this is the root cause behind fondamentaux missing most (if not all) of its videos. See openzim/zimit#78
That may indicate that the problem is not new (those November runs were using `zimit:dev`, which at that time used `webrecorder/browsertrix-crawler:0.1.0`). Maybe something changed on Youtube.com around that time?