Twitter: aborts if media download yields "403 Forbidden", e.g. removed by copyright claim #52

Open
joonas-fi opened this issue Nov 30, 2019 · 4 comments

Comments

@joonas-fi
Contributor

joonas-fi commented Nov 30, 2019

Here's the Tweet: https://twitter.com/janl/status/1113015555064201216

Error message:

2019/11/30 18:04:02 [ERROR][twitter/joonas_fi] Getting latest: getting items from service: processing tweet from API: processing tweet 1113180316510957568: making item from tweet that this tweet (1113180316510957568) is in reply to (1113015555064201216): making item from tweet that this tweet (1113015555064201216) embeds (1112473455650172929): media resource returned HTTP status 403 Forbidden: https://pbs.twimg.com/ext_tw_video_thumb/1112471832232259585/pu/img/ywWGTl09hsnLnMOY.jpg

When opened in a browser, that image URL redirects to this DMCA warning (the behavior may differ for API requests).

Timeliner cannot cope with this: every re-run hits the same error and aborts, so retrieval can never continue.

@mholt
Owner

mholt commented Dec 13, 2019

Ah, oops. Not something I anticipated or encountered. How do you think we should handle this?

@joonas-fi
Contributor Author

joonas-fi commented Dec 13, 2019

I dunno, this is a pickle. The obvious problem is not being able to continue after a 403. My data retrieval process just aborts.

But what should we do about it? Sure, continue after the error. But personally, I am not a fan of losing any information. In this case the information is:

there once was an attachment, but we didn't manage to fetch it in time because it was later taken down because of a DMCA complaint

I'd prefer this to be stored in the data model. I haven't researched Timeliner's data model, but maybe something like attachment: {id: '987654321', permanentFetchFailureReason: '403 Forbidden - Twitter or the author removed it?'} ?

Things to think about:

  • there's a distinction between transient errors and "likely fetch will never work again" (like in this case)
  • I guess Timeliner already handles transient errors, since every re-run of my retrieval tried to fetch this failed attachment again
  • How does Timeliner behave on a "force full refresh" when an attachment that was fetched successfully before (so we already have a stored copy) is now unavailable? Obviously it should just shrug and carry on instead of aborting.
  • if we're doing "full refresh" and we have a permanentFetchFailureReason, should we still re-try fetching it? I guess chances are kinda slim that the likes of Twitter restore content. But then again, re-trying probably isn't expensive (if we do re-try, maybe don't pollute the log with errors we already expect the re-try to produce).
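To make the points above concrete, here is a rough Go sketch of how a fetch result could be recorded. Everything in it is an assumption for illustration: the Attachment type, the FailureKind values, and the classifyStatus and recordFetchResult helpers are hypothetical and not taken from Timeliner's actual data model, and the split between transient and permanent status codes is only one plausible choice.

```go
package main

import (
	"fmt"
	"net/http"
)

// FailureKind separates errors worth retrying soon from errors that
// will probably never go away. (Hypothetical type, not from Timeliner.)
type FailureKind int

const (
	NoFailure FailureKind = iota
	TransientFailure
	PermanentFailure
)

// Attachment is a hypothetical record, not Timeliner's real data model.
type Attachment struct {
	ID                          string
	Stored                      bool   // true once a copy has been saved locally
	PermanentFetchFailureReason string // empty unless the fetch looks permanently broken
}

// classifyStatus maps an HTTP status code to a FailureKind.
// Assumption: 408, 429 and 5xx are transient; any other non-2xx
// status (including 403) is treated as likely permanent.
func classifyStatus(code int) FailureKind {
	switch {
	case code >= 200 && code < 300:
		return NoFailure
	case code == http.StatusRequestTimeout,
		code == http.StatusTooManyRequests,
		code >= 500:
		return TransientFailure
	default:
		return PermanentFailure
	}
}

// recordFetchResult updates the attachment after one fetch attempt and
// reports whether the caller should try again on a later run.
func recordFetchResult(a *Attachment, code int) (retryLater bool) {
	switch classifyStatus(code) {
	case NoFailure:
		a.Stored = true
		a.PermanentFetchFailureReason = ""
	case TransientFailure:
		return true
	case PermanentFailure:
		// Stored is left untouched, so a copy fetched on an earlier
		// run survives a later "full refresh" that hits a 403.
		a.PermanentFetchFailureReason = fmt.Sprintf("%d %s", code, http.StatusText(code))
	}
	return false
}

func main() {
	a := &Attachment{ID: "987654321"}
	recordFetchResult(a, http.StatusForbidden)
	fmt.Printf("%+v\n", *a)
}
```

The sketch never aborts on a failed fetch: it only records why the attachment is missing, so the information that an attachment once existed is not lost.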

@Ruthalas

...if we're doing "full refresh" and we have a permanentFetchFailureReason, should we still re-try fetching it?

I concur with your conclusion that rechecking is inexpensive, and so worth trying.

@mholt
Owner

mholt commented Dec 13, 2019

A 403 is usually permanent; it only goes away if something is changed on the server.

Perhaps Timeliner should simply continue to the next item after seeing a 403. Log the 403, but continue on, since there's nothing we can do about it. This should probably be the behavior no matter what mode it's running in.

But I also agree that simply trying once or twice more before continuing on wouldn't be a bad idea, in case it was a fluke.
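A minimal sketch of that behavior in Go, under stated assumptions: fetchFunc, httpFetch, fetchWithRetry, and downloadAll are hypothetical names invented for this example and are not Timeliner functions. The retry loop has no delay or backoff, and the error text only imitates the message quoted at the top of this issue.

```go
package main

import (
	"fmt"
	"log"
	"net/http"
)

// fetchFunc returns the HTTP status code for a media URL. Abstracting the
// fetch makes the retry logic usable without network access.
type fetchFunc func(url string) (int, error)

// httpFetch is one possible fetchFunc, built on net/http.
func httpFetch(url string) (int, error) {
	resp, err := http.Get(url)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

// fetchWithRetry tries the URL up to attempts times (at least once) and
// returns nil on the first 200 OK, or the last error seen otherwise.
func fetchWithRetry(fetch fetchFunc, url string, attempts int) error {
	if attempts < 1 {
		attempts = 1
	}
	var lastErr error
	for i := 0; i < attempts; i++ {
		code, err := fetch(url)
		if err == nil && code == http.StatusOK {
			return nil
		}
		if err != nil {
			lastErr = err
		} else {
			lastErr = fmt.Errorf("media resource returned HTTP status %d %s: %s",
				code, http.StatusText(code), url)
		}
	}
	return lastErr
}

// downloadAll fetches every URL. A URL that still fails after all attempts
// is logged and skipped, so one bad item cannot abort the whole run.
func downloadAll(fetch fetchFunc, urls []string, attempts int) (skipped []string) {
	for _, u := range urls {
		if err := fetchWithRetry(fetch, u, attempts); err != nil {
			log.Printf("[ERROR] skipping media: %v", err)
			skipped = append(skipped, u)
			continue
		}
	}
	return skipped
}

func main() {
	// Stub fetcher so the example runs offline; pass httpFetch for real requests.
	stub := func(url string) (int, error) {
		if url == "https://example.com/removed.jpg" {
			return http.StatusForbidden, nil
		}
		return http.StatusOK, nil
	}
	urls := []string{"https://example.com/ok.jpg", "https://example.com/removed.jpg"}
	fmt.Println("skipped:", downloadAll(stub, urls, 3))
}
```

Returning the skipped URLs instead of an error keeps the decision about what to do with them (store a failure reason, retry on the next run) with the caller.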


3 participants