Who's got dirt?

Note: This document is an early draft that will soon be updated with our more recent thinking.

Many (open) databases publish information about companies, institutions and people to the web. This might include details about company ownership, asset ownership, political finance, government contracts, and many other sources of information about political and economic influence.

Yet our story about how researchers would access these databases is incomplete: would a journalist visit all possible data sites each time they want to look up a set of people or companies? We want to automate this lookup by making sure many of our sites support one simple API call for asking: who knows something about this entity, who's got dirt?

This repository defines a common way of querying data providers about people and companies. The goal is to make lookups of entities across multiple databases easy, and to support consumer tools when they have to merge the data they find into the data they already have.

What is it for?

The standard will allow consumer tools to automatically search and integrate information about companies and people from data providers. This can be useful in the context of various activities:

Searching: generate a simple list of candidates containing all matching records from a variety of sources.
Enrichment: rather than just generating a results list, integrate all relevant data available in a data provider system into the consumer tool's data model.
Lead list: rather than just querying for entities by name, find entities and relationships that match certain graph patterns.
Notification: run a scheduled job to track changes in service responses, e.g. new entities or relationships.

Detailed use cases

More detailed documents describe the use cases for the trade negotiations case study.

Discussion

How does it work?

Data providers will implement a single API endpoint, with different degrees of completeness (based on the type of data they store, and the capabilities of their tools). The basic mechanism uses query-by-example, i.e. the user will submit all the known properties (e.g. the name) of an entity and expect a list of suggested matches as a response.

Since a common data schema is required for this mechanism, all data will be transferred using the JSON representation of Popolo, a standard way of expressing data about people, organisations and the relationships between them. The Influence Mappers team will extend the field and class definitions of Popolo over the course of the project.

A simple query

Let's look at a query for five matches amongst Organizations sorted by the edit distance between the name Big Corp and the entity name:

{
    "queries": {
        "q0": {
            "name~=": "Big Corp",
            "type": "popolo:Organization",
            "limit": 5
        }
    }
}

The syntax for this query is the Metaweb Query Language (MQL), which Freebase uses to formulate graph queries against its knowledge base. It is thus a more powerful variant of the better-known Refine Reconciliation API developed by the same team. One feature copied from the reconciliation API is queries, a wrapper object keyed by query identifier (q0 above) to support bulk requests in the future. Each dictionary inside it is a simple MQL query.
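On the provider side, the query above could be evaluated roughly as in the following sketch. It assumes, as the text describes, that name~= ranks candidates by edit distance and that limit caps the result list. The function names and the in-memory record list are hypothetical and exist only for illustration; a real provider would run this against its own datastore.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed one row at a time."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def match_by_example(query: dict, records: list) -> list:
    """Return records of the requested type, closest names first."""
    wanted_name = query["name~="].lower()
    candidates = [r for r in records if r.get("type") == query.get("type")]
    # sorted() is stable, so equally distant names keep their input order.
    ranked = sorted(
        candidates,
        key=lambda r: edit_distance(wanted_name, r["name"].lower()),
    )
    return ranked[: query.get("limit", len(ranked))]


# Hypothetical records, invented purely for illustration.
RECORDS = [
    {"name": "Small Holdings", "type": "popolo:Organization"},
    {"name": "Big Corp Ltd", "type": "popolo:Organization"},
    {"name": "Big Corp", "type": "popolo:Organization"},
    {"name": "Big Corp", "type": "popolo:Person"},
]

QUERY = {"name~=": "Big Corp", "type": "popolo:Organization", "limit": 2}
RESULTS = match_by_example(QUERY, RECORDS)
```

Comparing lower-cased names is a design choice of this sketch, not something the draft specifies.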

A query using MQL could also make use of link structures within the Popolo data, if the backend supports such queries.

Such a query could, for example, attempt to find all persons who are members of an organization with a given name. With wildcard queries (*), all properties of the person and membership would be returned.
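The original example for this kind of query is not reproduced here, so the following is only a sketch of what a nested query might look like. It assumes MQL's wildcard form ("*": null in JSON, None in Python) and assumes the backend exposes Popolo's memberships and organization properties as nested objects. The helper name members_of_query is hypothetical.

```python
import json


def members_of_query(organization_name: str) -> dict:
    """Build an MQL-style query for persons linked to a named organization."""
    return {
        "type": "popolo:Person",
        "*": None,  # wildcard: return all properties of the person
        "memberships": [{
            "*": None,  # wildcard: return all properties of the membership
            "organization": {
                "name~=": organization_name,
                "type": "popolo:Organization",
            },
        }],
    }


# Wrap the query in the same "queries" envelope used by the simple query.
PAYLOAD = {"queries": {"q0": members_of_query("Big Corp")}}
print(json.dumps(PAYLOAD, indent=4))
```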

Response format

A simple response would return the requested entity information in Popolo format, with sources information on a per-record level included.

Given the diverse nature of data providers and use cases, the API supports different response detail levels:

Full data responses: return a structured data record or set of records matching the query.
Reference lists: data maintained by the service may not be fully structured (such as relevant plain-text documents). In that case the returned information is limited to document references which can be accessed by a human researcher.
Information holder contact: in scenarios where the result data itself cannot be shared, the response should indicate contact information for the person holding the relevant material.

API conventions

All requests are sent as JSON queries over HTTP. They can use either the GET method (and a queries= argument) or the request body of a POST request. Responses are assumed to be formatted in JSON, even if the result is an error.
Data providers are free to supply links to other resources on their platform, outside of the existing request endpoint.
The proposed API will rely on the Popolo specification and Freebase MQL as far as possible; any further additions, extensions and standards will be documented in this repository.
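A minimal client-side sketch of these conventions, using only the Python standard library, might look as follows. The endpoint URL is hypothetical. The sketch also assumes that the queries= argument of a GET request carries the JSON-encoded object of keyed queries, while a POST body carries the full envelope; the draft does not pin down either detail. No network request is made here; the code only constructs the requests.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit
from urllib.request import Request

# Hypothetical endpoint; each data provider would publish its own.
ENDPOINT = "https://provider.example/api"


def key_queries(queries: list) -> dict:
    """Key each query as q0, q1, ... so bulk responses can be matched up."""
    return {f"q{i}": query for i, query in enumerate(queries)}


def build_get_url(endpoint: str, queries: list) -> str:
    """Encode the keyed queries into the queries= argument (assumed shape)."""
    return endpoint + "?" + urlencode({"queries": json.dumps(key_queries(queries))})


def build_post_request(endpoint: str, queries: list) -> Request:
    """Send the full envelope as a JSON request body (assumed shape)."""
    body = json.dumps({"queries": key_queries(queries)}).encode("utf-8")
    return Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def decode_get_queries(url: str) -> dict:
    """Provider-side counterpart: read the keyed queries back out of a URL."""
    return json.loads(parse_qs(urlsplit(url).query)["queries"][0])


QUERY = {"name~=": "Big Corp", "type": "popolo:Organization", "limit": 5}
GET_URL = build_get_url(ENDPOINT, [QUERY])
POST_REQUEST = build_post_request(ENDPOINT, [QUERY])
```

Because responses are assumed to be JSON even for errors, a client can pass every response body to json.loads before inspecting the HTTP status.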

Data model extensions

While Popolo provides a data model for Persons, Organizations and Memberships, it is mainly structured around parliamentary information use cases.

It does not currently have a notion of Companies, nor does it define relevant connection types such as Control (e.g. for ownership) and Transaction (e.g. for contracts). For a more detailed analysis of potential connection types, see James' inventory of terms.

Getting it adopted

Build a set of proxies around existing data providers (e.g. partners, but also ICIJ Offshore, CorpWatch CrocTail, even WhoIs)
Prototype client applications around different patterns of use

Who should be involved?

Partners of Influence Mappers (OpenCorporates, Poderopedia, LittleSis, OpenNorth, American Academy)
International Consortium of Investigative Journalists and the Organized Crime and Corruption Reporting Project
Consumer-side apps: detective.io, kumu.io, linkurio.us
Producer-side apps: Overview, DocumentCloud, PopIt, Investigative Dashboard, CrocTail

Existing efforts

Popolo data standard for people and organizations.
Linked Data, Linked Data Fragments, FOAF, SPARQL
OpenRefine Reconciliation API for fuzzy entity matching.
Metaweb Query Language, a more sophisticated query mechanism which supports matching entities in terms of their connections to other nodes in a graph. Also: CYPHER.
Survey of partner project data models.
IATI OrgID scheme for company identification.
OpenCorporates API, LittleSis API.
