
IATI.cloud dataset processing


Introduction

The following is an explanation of the dataset processing flow for IATI.cloud.

Process overview

We use the code4iati dataset metadata and publisher metadata dumps to access all of the available metadata.

  • publisher: the publisher metadata is flat data, so we index it more or less immediately.
  • dataset: We download the code4iati dataset dump to access all available IATI datasets from the IATI Registry. If update is true, we first check whether the hash of each dataset has changed compared to the already indexed datasets. We then loop over the datasets in the dataset metadata dump and trigger the subtask_process_dataset for each. Per dataset we:
    1. Clean the dataset metadata, extracting the nested resources and extras.
    2. Retrieve the filepath of the actual downloaded dataset based on the organisation name and dataset name.
    3. Check that the version is valid (in this case version 2).
    4. Determine the type of the file from the metadata or from the file content itself.
    5. Check the dataset validation.
    6. Clear the existing data for this dataset if it is already found in IATI.cloud and the update flag is True.
    7. Trigger the indexing of the actual dataset.
    8. Store the success state of that indexing run in iati_cloud_indexed and index the entire dataset metadata.
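The update/hash check can be sketched roughly as follows. The function and field names here are illustrative assumptions for this document, not the actual iati.cloud code:

```python
def needs_processing(dataset_meta, indexed_hashes, update=True):
    """Decide whether a dataset from the code4iati dump should be
    (re)processed. `indexed_hashes` maps a dataset id to the hash of
    the version already indexed. Names are illustrative only."""
    if not update:
        # Fresh run: process every dataset.
        return True
    known = indexed_hashes.get(dataset_meta["id"])
    # Process if we have never seen this dataset, or its hash changed.
    return known is None or known != dataset_meta["hash"]


indexed = {"org-dataset-001": "OLD", "org-dataset-002": "def456"}
print(needs_processing({"id": "org-dataset-001", "hash": "abc123"}, indexed))  # True: hash changed
print(needs_processing({"id": "org-dataset-002", "hash": "def456"}, indexed))  # False: unchanged
```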

Indexing the dataset

First, we parse the IATI XML dataset. We then convert it to a dict using the BadgerFish algorithm.
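As a minimal illustration of the BadgerFish convention (attributes become @-prefixed keys, element text goes under $, repeated child elements become lists); this sketch is for illustration only, not the converter iati.cloud actually uses:

```python
import xml.etree.ElementTree as ET

def badgerfish(elem):
    """Convert an ElementTree element to a BadgerFish-style dict."""
    node = {}
    for key, value in elem.attrib.items():
        node["@" + key] = value          # attributes get an '@' prefix
    text = (elem.text or "").strip()
    if text:
        node["$"] = text                 # element text goes under '$'
    for child in elem:
        converted = badgerfish(child)
        if child.tag in node:            # repeated elements become lists
            if not isinstance(node[child.tag], list):
                node[child.tag] = [node[child.tag]]
            node[child.tag].append(converted)
        else:
            node[child.tag] = converted
    return node

root = ET.fromstring('<iati-activity hierarchy="1"><narrative>Example</narrative></iati-activity>')
print(badgerfish(root))
# {'@hierarchy': '1', 'narrative': {'$': 'Example'}}
```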

We apply our cleaning and add custom fields. We then dump the dataset dict into a JSON file. Lastly, we extract the subtypes (budget, result and transaction).

Cleaning

We then recursively clean the dataset: @ prefixes are removed from attribute keys, @{http://www.w3.org/XML/1998/namespace}lang is replaced with lang, and key-value fields are extracted. Read more here.
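A minimal sketch of such a recursive cleaning pass, assuming @ prefixes are stripped from keys and the xml:lang attribute is renamed (the key-value extraction step is not covered here); this is not the actual iati.cloud implementation:

```python
XML_LANG = "@{http://www.w3.org/XML/1998/namespace}lang"

def clean(node):
    """Recursively rename keys in a BadgerFish-style dict."""
    if isinstance(node, list):
        return [clean(item) for item in node]
    if not isinstance(node, dict):
        return node
    cleaned = {}
    for key, value in node.items():
        if key == XML_LANG:
            key = "lang"                 # namespaced xml:lang -> lang
        elif key.startswith("@"):
            key = key[1:]                # drop the '@' attribute prefix
        cleaned[key] = clean(value)
    return cleaned

activity = {"@hierarchy": "1", "title": {"narrative": {XML_LANG: "en", "$": "Example"}}}
print(clean(activity))
# {'hierarchy': '1', 'title': {'narrative': {'lang': 'en', '$': 'Example'}}}
```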

Adding custom fields

We have several "custom fields" that we enrich the IATI data with.

  • Codelist fields: These fields are 'name' representations of numeric/code values in the IATI Standard. For example, an activity can report transaction-type.code: 3; we then enrich the activity with transaction-type.name: Disbursement.
  • Title narrative: We add a single-valued field containing only the first-reported title narrative.
  • Common activity dates: We add single-valued common start and end dates, so a start and an end date are immediately available without looking through the planned and actual date fields.
  • Combined policy marker: We add policy-marker.combined, which joins the policy marker code with its associated significance.
  • Currency conversion: Explained in depth here.
  • Dataset metadata: We add relevant dataset metadata fields to the activity.
  • Hierarchy default value: The IATI Standard states: "If hierarchy is not reported then 1 is assumed." We enforce this default.
  • JSON dumps: A stringified JSON object of different IATI activity fields.
  • Date quarters: For each iso-date reported, we also add a field indicating which quarter the date falls in.
  • Document link categories: A combined list of all the category codes for each document-link.
  • Currency aggregation: We add converted and aggregated values for budgets, disbursements and transactions/transaction subtypes.
  • Related activity data to parent activity: This 'raises' related activity budget data from the hierarchy 2 (child) activities to the hierarchy 1 (parent) activities.
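A few of these enrichments can be sketched as follows. The field layout, the format of the combined policy marker, and the codelist excerpt are assumptions for illustration, not the actual iati.cloud implementation:

```python
# Hypothetical excerpt; the real mapping comes from the full IATI
# TransactionType codelist.
TRANSACTION_TYPE_NAMES = {"1": "Incoming Funds", "3": "Disbursement"}

def add_custom_fields(activity):
    """Sketch of three enrichments; assumes list-valued fields."""
    # Codelist fields: add a human-readable name next to each code.
    for transaction in activity.get("transaction", []):
        code = transaction.get("transaction-type", {}).get("code")
        if code in TRANSACTION_TYPE_NAMES:
            transaction["transaction-type"]["name"] = TRANSACTION_TYPE_NAMES[code]
    # Combined policy marker: join code and significance (format assumed).
    for marker in activity.get("policy-marker", []):
        if "code" in marker and "significance" in marker:
            marker["combined"] = f'{marker["code"]}.{marker["significance"]}'
    # Hierarchy default value: "If hierarchy is not reported then 1 is assumed."
    activity.setdefault("hierarchy", 1)
    return activity

activity = {
    "transaction": [{"transaction-type": {"code": "3"}}],
    "policy-marker": [{"code": "2", "significance": "1"}],
}
enriched = add_custom_fields(activity)
print(enriched["transaction"][0]["transaction-type"]["name"])  # Disbursement
print(enriched["policy-marker"][0]["combined"])                # 2.1
print(enriched["hierarchy"])                                   # 1
```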

Check it out in depth here.

Extracting subtypes

We extract the subtypes into single-valued fields. Read more here.

Each of these is indexed separately into its respective core.
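The subtype extraction step can be sketched like this, assuming list- or dict-valued subtype fields from the BadgerFish conversion; the names are illustrative, not the actual iati.cloud code:

```python
def extract_subtypes(activity):
    """Split an activity dict into its subtype documents so each can be
    indexed into its own core. Field names are assumed for illustration."""
    subtype_docs = {}
    for subtype in ("budget", "result", "transaction"):
        items = activity.get(subtype, [])
        if isinstance(items, dict):   # single occurrence in BadgerFish output
            items = [items]
        # Carry the parent identifier so subtype docs stay linkable.
        subtype_docs[subtype] = [
            {**item, "iati-identifier": activity.get("iati-identifier")}
            for item in items
        ]
    return subtype_docs

activity = {
    "iati-identifier": "XX-1",
    "transaction": [{"value": 10}],
    "budget": {"value": 5},
}
print(extract_subtypes(activity)["transaction"])
# [{'value': 10, 'iati-identifier': 'XX-1'}]
```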

Final step

Lastly, if the previous steps were all successful, we index the IATI activity data.