Fix online dump downloads for the various modes; closes #869, #872

Open · wants to merge 2 commits into master
Conversation

TheEaterr

It seems that changes in the way Wikidata dumps are structured have broken the automatic download of dump files.

Here are the fixes implemented and the reasoning behind them:
DAILY: updated the check for an available dump, as status.txt now indicates "done:all". See: https://dumps.wikimedia.org/other/incr/wikidatawiki/20240419/status.txt. I used startsWith, but it might be preferable to check for equality with "done:all"? I don't know the various possible values, but this version should work at least most of the time (compared to none right now).
JSON: the JSON dumps are currently downloaded from this folder: https://dumps.wikimedia.org/other/wikidata/ which is a bit of a mess, mixing full and incremental (?) dumps. I switched it to use https://dumps.wikimedia.org/wikidatawiki/entities/ (which seems to be equivalent to https://dumps.wikimedia.org/other/wikibase/wikidatawiki/; I don't know if anyone has insight into which link should be preferred). Moreover, not all listed dates feature the full JSON dump, so I added a check that the file is actually there (I have not found a status.txt or equivalent for the entity dump). Both this check and the DAILY one are sketched just after this list.
SITES: nothing done. I haven't tested it, but the file it looks for is still there, so it probably still works.
CURRENT / FULL: both of these look for a -pages-meta-current.xml.bz2 / -pages-meta-history.xml.bz2 file inside a dump (like https://dumps.wikimedia.org/wikidatawiki/20240401/). However, these files do not seem to be there. Parts of them ARE present (of the form https://dumps.wikimedia.org/wikidatawiki/20240401/wikidatawiki-20240401-pages-meta-current1.xml-p1p441397.bz2, for example), but not the single combined file. If anyone knows why the full file isn't there, I could try to fix the issue.
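
For reference, here is a minimal standalone sketch of the two checks described above (the "done:all" status check and the JSON-dump existence check). The helper names and the exact JSON dump filename pattern are assumptions for illustration, not the actual Wikidata-Toolkit code:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class DumpAvailabilityCheck {

    /** Returns true if the given status.txt reports the daily (incremental) dump as complete. */
    public static boolean isDailyDumpDone(String statusUrl) throws Exception {
        URL url = new URL(statusUrl);
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
            String status = reader.readLine();
            // status.txt now contains "done:all"; startsWith also tolerates a plain "done".
            return status != null && status.startsWith("done");
        }
    }

    /** Returns true if the full JSON dump file actually exists for the given date. */
    public static boolean jsonDumpExists(String dateStamp) throws Exception {
        // The filename pattern below is an assumption; adjust it to whatever
        // the entities folder really uses for a given date.
        URL url = new URL("https://dumps.wikimedia.org/wikidatawiki/entities/"
                + dateStamp + "/wikidata-" + dateStamp + "-all.json.gz");
        HttpURLConnection connection = (HttpURLConnection) url.openConnection();
        connection.setRequestMethod("HEAD"); // probe without downloading the multi-GB file
        return connection.getResponseCode() == HttpURLConnection.HTTP_OK;
    }
}
```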

@wetneb (Member) left a comment

Thanks a lot for updating this!

@TheEaterr (Author)

I made a new commit that uses HEAD requests instead. I had to tap into the specific semantics of WebRessourceFetcherImpl to change the type of the request; maybe a more general method should be added to the WebRessourceFetcher interface?
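
Purely as an illustration of that idea, a hypothetical sketch of what such a broader method on the fetcher interface could look like (this is not the project's actual WebResourceFetcher API):

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical sketch only; the real interface in the project may differ.
public interface WebResourceFetcher {

    /** Opens a stream for the given URL (the existing style of method). */
    InputStream getInputStreamForUrl(String urlString) throws IOException;

    /**
     * Hypothetical addition: probe a URL with a HEAD request so callers can
     * test whether a dump exists without casting to the implementation class
     * or downloading the file.
     */
    boolean urlExists(String urlString) throws IOException;
}
```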

Also, I updated the failing test to use the real WebRessourceFetcher instead of the mock, and changed what is expected accordingly. I imagine using MockWebRessourceFetcher was intentional, but for this specific use case it feels odd, since the part that broke is the online part and I feel the test should catch that. Otherwise, it is probably possible to restructure the code a bit to make it work with mocks. If checking against the live service is desirable, it would also be logical to update the other tests.

@wetneb (Member) commented May 13, 2024

Thanks a lot!

It looks like the new version of the test does not pass. In general, I would say it would be better to avoid making any requests to online services in the test suite, to keep it reliable. It is true that this prevents us from checking compatibility with the online service as it evolves, but in my opinion that's not the role of such a Java unit test suite. I don't know if this would have helped in this case, but it is also possible to change the mocking strategy and use a MockWebServer to test against a real but local web server serving pre-determined responses.
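
To illustrate the MockWebServer idea, a minimal sketch of a test that exercises the HTTP code path against a local server (this assumes okhttp3's mockwebserver artifact, JUnit 4, and the hypothetical isDailyDumpDone helper from the earlier sketch; it is not the project's actual test code):

```java
import okhttp3.mockwebserver.MockResponse;
import okhttp3.mockwebserver.MockWebServer;
import org.junit.Test;
import static org.junit.Assert.assertTrue;

public class DailyDumpStatusTest {

    @Test
    public void detectsCompletedDailyDump() throws Exception {
        MockWebServer server = new MockWebServer();
        // Serve a canned status.txt so the check runs against a local, predictable response.
        server.enqueue(new MockResponse().setBody("done:all\n"));
        server.start();
        try {
            String statusUrl = server
                    .url("/other/incr/wikidatawiki/20240419/status.txt")
                    .toString();
            assertTrue(DumpAvailabilityCheck.isDailyDumpDone(statusUrl));
        } finally {
            server.shutdown();
        }
    }
}
```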
