
WikiCineData

This is a script intended to aid in reconciling film title records in CollectionSpace with data from Wikidata. It comes from a specific use case (getting film data for reuse in CollectionSpace), but ideally it could be applied to CSV input or other data sources as well. And since the Wikidata reconciliation service, originally written for OpenRefine, follows a proposed "reconciliation API" spec, it might also be extensible to other reconciliation services/endpoints.

Usage

python3 driver.py -m cspace

After setting up the config file (see below), you should be able to run the driver.py script as above and let it rip. Depending on the size of the data set, it might take a long time!

Background/details

Because there are so many variables in any given use case, there is a config.json file that lets you specify things like which authority in CSpace you want to search (works, people, objects, etc.), along with the unique ID for that authority in your system, and what item "type" you want to reconcile against in Wikidata ("films/Q11424", "humans/Q5", "paintings/Q3305213", etc.). It also lets you fine-tune how the script runs queries in parallel (more on that below).

To make the script more generalizable, it might be worth making other elements customizable from the config file too, like which columns the internal sqlite3 database uses, or other potentially nit-picky details.

There is also a secrets.json file that I'm using to store CSpace credentials.

You can look at the existing config.json file as well as the sample_secrets.json to see the expected format.
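
For illustration only, the config might look something like the sketch below. Aside from chunk_size (discussed under "Parallel queries"), the field names here are invented, so defer to the real config.json for the actual keys:

{
	"cspace_authority": "workauthorities",
	"cspace_authority_csid": "your-authority-csid-here",
	"wikidata_type": "Q11424",
	"chunk_size": 20
}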

Parallel queries

Since the main time consumer is data I/O during API calls, the Wikidata queries are set to run in parallel threads, with batch sizes set by the chunk_size variable in the config file. You can mess with this value to see how large a batch you can get away with :)

There's also a throttle on the number of threads allowed at once. It's currently hard-coded into the Database.chunk_me() method at 10, but that might be good as a configurable point too. When the threads are unlimited, it quickly crashes Python, runs into errors accessing the database file, and does other goofy stuff.
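
As a rough sketch of the pattern (this is not the actual Database.chunk_me() code, and reconcile_chunk is a hypothetical stand-in for whatever posts one batch of queries):

import concurrent.futures

MAX_THREADS = 10  # the throttle currently hard-coded in Database.chunk_me()

def run_chunks(items, chunk_size, reconcile_chunk):
	# split the full item list into batches of chunk_size (from config.json)
	chunks = [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]
	# cap simultaneous threads so Python doesn't crash and the sqlite file
	# isn't hit by too many writers at once
	with concurrent.futures.ThreadPoolExecutor(max_workers=MAX_THREADS) as pool:
		return list(pool.map(reconcile_chunk, chunks))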

Another potential config point would be the number of results returned per page in a CSpace query; it's currently hard-coded in cspace_utils.fetch_cspace_items().
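
For reference, fetching a page from the CSpace REST API looks roughly like the sketch below (assuming the standard pgSz/pgNum paging parameters; the exact path and names here are illustrative, not copied from cspace_utils):

import requests

PAGE_SIZE = 100  # hypothetical value standing in for the hard-coded page size

def fetch_page(base_url, authority, csid, page_num, auth):
	# CSpace paginates list results with pgSz (page size) and pgNum (page number)
	url = f"{base_url}/cspace-services/{authority}/{csid}/items"
	resp = requests.get(url, params={'pgSz': PAGE_SIZE, 'pgNum': page_num}, auth=auth)
	resp.raise_for_status()
	return resp.content  # XML, parsed downstream with lxml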

Output

The current output is just a sqlite3 database containing one row per item, along with data from the top Wikidata match, if any. Each row also records whether the top match is a "100% match", plus the matching score provided by the reconciliation service (the scoring is kind of opaque).

It should be pretty easy to make a CSV output of the sqlite data, but a potentially huge CSV file would be unwieldy. Maybe add another config point to say how many rows you want per CSV file, output in a series of batches? Like "2,000 rows per CSV" yielding 20 individual CSV files or whatever. TBD; a sketch of the idea follows.
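
If that gets built, a minimal version could look like this (the items table name is a guess; the real schema lives in the sqlite3 database the script creates):

import csv
import sqlite3

def export_batches(db_path, rows_per_csv=2000):
	conn = sqlite3.connect(db_path)
	cur = conn.execute('SELECT * FROM items')  # table name is hypothetical
	header = [col[0] for col in cur.description]
	batch_num = 0
	while True:
		rows = cur.fetchmany(rows_per_csv)
		if not rows:
			break
		# one CSV per batch: output_0.csv, output_1.csv, ...
		with open(f'output_{batch_num}.csv', 'w', newline='') as f:
			writer = csv.writer(f)
			writer.writerow(header)
			writer.writerows(rows)
		batch_num += 1
	conn.close()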

Reconciliation

One oddity is the way the Wikidata reconciliation service needs the JSON in each API call to be formatted: the whole batch of queries gets JSON-encoded into a single "queries" form field rather than posted as a plain JSON body. Here's an example:

import json
import requests

# service_url is the reconciliation service endpoint (for Wikidata, the service originally built for OpenRefine)

my_query_dict = {'q1': {'query': 'jaws', 'type': 'Q11424', 'properties': [{'pid': 'P1476', 'v': 'Tiburon'}]}, 'q2': {'query': 'Ernst Schwanhold'}}

# the form field "queries" holds the JSON-encoded query set
requests.post(service_url, data={'queries': json.dumps(my_query_dict)})

The return value for a single positive match looks like this:

{'q1':
	{'result': [
		{
			'features': [
				{'id': 'P577', 'value': 100},
				{'id': 'P57', 'value': 100},
				{'id': 'all_labels', 'value': 100}
			],
			'id': 'Q2740695',
			'match': True,
			'name': '10 on Ten',
			'score': 100.0,
			'type': [{'id': 'Q11424', 'name': 'film'}]
		},
		{
			'features': [
				{'id': 'P577', 'value': 0},
				{'id': 'P57', 'value': 38},
				{'id': 'all_labels', 'value': 58}
			],
			'id': 'Q746733',
			'match': False,
			'name': 'The Ten Commandments',
			'score': 40.666666666666664,
			'type': [{'id': 'Q11424', 'name': 'film'}]
		},
		{
			'features': [
				{'id': 'P577', 'value': 0},
				{'id': 'P57', 'value': 46},
				{'id': 'all_labels', 'value': 53}
			],
			'id': 'Q22098970',
			'match': False,
			'name': 'Ten Years',
			'score': 39.66666666666667,
			'type': [{'id': 'Q11424', 'name': 'film'}]
		}
		]
	}
}

Null results look like this:

{"q2":{
	"result":[]
	}
}
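
Pulling the top candidate out of a response is then just a matter of guarding against the empty-result case. A minimal sketch (the function and variable names here are mine, not from the script):

def top_match(response_json, query_key):
	# each query key maps to a ranked list of candidates; an empty list means no match
	results = response_json[query_key]['result']
	if not results:
		return None
	best = results[0]
	# 'match' is True only when the service considers it a sure thing
	return best['id'], best['name'], best['score'], best['match']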

I initially tried to include things like alternative titles, multiple creators, etc. in the reconciliation call as additional match points, but it seems the best match is retrieved by using just one main query point (one "title") and one optional value per "extra parameter" (one creator, one date, etc.); Wikidata will do some magic on its end to retrieve options.

Dependencies

pip3 install lxml requests
