Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Validation criterion 3: biographical data #419

Merged
merged 23 commits into from
Aug 19, 2021
Merged

Validation criterion 3: biographical data #419

merged 23 commits into from
Aug 19, 2021

Conversation

marfox
Copy link
Member

@marfox marfox commented Aug 18, 2021

…iginal data; ensure complete comparison of dates
 ; return extra dates in 'ISO_date/precision' format
Fix bio values comparison; dump shared statements; feeling-lucky QID resolution;
pickle Wikidata cache, instead of JSON dump; simplify bio comparison;
revisit & rename variables.
…hared statements; revisit & rename variables
@marfox marfox added this to the Validator v2 milestone Aug 18, 2021
@marfox marfox added this to Upcoming in v2 via automation Aug 18, 2021
Copy link
Collaborator

@e-dorigatti e-dorigatti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fLGTM (functionally looks good to me) - maybe try to reduce code duplication?

@click.option(
'-s',
'--sandbox',
is_flag=True,
help=f'Perform all edits on the Wikidata sandbox item {vocabulary.SANDBOX_2}.',
)
def people_cli(catalog, statements, sandbox):
def people_cli(catalog, statements, criterion, sandbox):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there seems to be significant duplication among people_cli/works_cli and add_works_statements/add_people_statements, as well as other functions here (_add_or_reference/_add_or_reference_works) is it worth it to abstract a little to make future changes easier?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking the same: these functions were implemented in version 1.
I will reserve a new PR to refactor those parts.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #420

LOGGER.info('Dead identifiers dumped to %s', dead_ids_path)

# Dump Wikidata cache
if dump_wikidata:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this part is also repeated a few times, maybe worth to make a class for the cache so it's also easier to switch to a new backend if needed

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great idea, filed in #421

LOGGER.info(
'Identifiers gathered from Wikidata dumped to %s', wd_ids_path
)
dead, wd_cache = dead_ids(catalog, entity)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd just set wd_cache to None and call dead_ids once. this will make possible future refactorings easier:

if os.path.isfile(wd_cache_path):
    wd_cache = ....
else:
    wd_cache = None

dead, wd_cache = dead_ids(catalog, entity)

and update later code to save the new cache if needed. (also see my other comment on having an object to handle all this loading and saving of the cache)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion added to #421

@@ -410,21 +540,27 @@ def links(
of URL domains. Default: ``False``
:param wd_cache: (optional) a ``dict`` of links gathered from Wikidata
in a previous run. Default: ``None``
:return: 4 objects
:return: ``tuple`` of 6 objects
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a bit cumbersome to handle, consider using a NamedTuple or a dataclass

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Totally agree, filed in #422

LOGGER.info("No %s to be added, won't dump to file", log_msg_subject)


def _load_wd_cache(file_handle):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any reason why you removed this function altogether?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because we switched to Pickle: pickle.load replaces the old function

@marfox
Copy link
Member Author

marfox commented Aug 19, 2021

Thanks a lot for the review, really valuable.
Since we are currently prioritizing new features, I propose to tackle the refactoring suggestions in another PR. See #420 #421 #422.
This is also because I'll open a new PR with additional functionality on the same component right after merging this one.

@marfox marfox merged commit 81a8bb5 into master Aug 19, 2021
v2 automation moved this from Upcoming to Done Aug 19, 2021
@marfox marfox deleted the criterion-3 branch August 19, 2021 09:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
v2
  
Done
2 participants