Code and Dataset for the Paper 'Heri-Graphs: A Workflow of Creating Datasets for Multi-modal Machine Learning on Graphs of Heritage Values and Attributes with Social Media'


zzbn12345/Heri_Graphs


Heri-Graphs: A Dataset Creation Framework for Multi-modal Machine Learning on Graphs of Heritage Values and Attributes with Social Media

This repository contains the code and dataset for the paper 'Heri-Graphs: A Dataset Creation Framework for Multi-modal Machine Learning on Graphs of Heritage Values and Attributes with Social Media', published in ISPRS International Journal of Geo-Information. It covers the collection, preprocessing, and rearrangement of data related to heritage values and attributes in three cities that have canal-related UNESCO World Heritage properties: Venice, Suzhou, and Amsterdam.

Cite as

Bai, N., Nourian, P., Luo, R., & Pereira Roders, A. (2022). Heri-Graphs: A Dataset Creation Framework for Multi-Modal Machine Learning on Graphs of Heritage Values and Attributes with Social Media. ISPRS International Journal of Geo-Information, 11(9), 469. MDPI AG. Retrieved from http://dx.doi.org/10.3390/ijgi11090469

@article{Bai2022HeriGraphs, 
    title={Heri-Graphs: A Dataset Creation Framework for Multi-Modal Machine Learning on Graphs of Heritage Values and Attributes with Social Media}, 
    volume={11}, 
    ISSN={2220-9964}, 
    url={http://dx.doi.org/10.3390/ijgi11090469},
    DOI={10.3390/ijgi11090469}, 
    number={9}, 
    journal={ISPRS International Journal of Geo-Information}, 
    publisher={MDPI AG}, 
    author={Bai, Nan and Nourian, Pirouz and Luo, Renqian and Pereira Roders, Ana}, 
    year={2022}, 
    month={Aug}, 
    pages={469} }

or

Nan, Bai, Pirouz, Nourian, Renqian, Luo, & Ana, Pereira Roders. (2022, May 17). Heri_Graphs: arXiv supplementary material for HeriGraph (v1.1). Zenodo. https://doi.org/10.5281/zenodo.6556244

@software{nan_bai_2022,
  author       = {Nan, Bai,
                  Pirouz, Nourian,
                  Renqian, Luo, and
                  Ana, Pereira Roders},
  title        = {Heri\_Graphs: arXiv supplementary material for HeriGraph},
  month        = may,
  year         = 2022,
  publisher    = {Zenodo},
  version      = {v1.1},
  doi          = {10.5281/zenodo.6556244},
  url          = {https://doi.org/10.5281/zenodo.6556244}
}

Table of Contents

Dataset (numpy) Summary (updated 19 September 2023)

The following sections about the workflow can be skipped for those who only intend to use the provided datasets.

Requirements and Dependencies

deep_translator == 1.7.0

facenet_pytorch == 2.5.2

fastai == 2.5.3

flickrapi == 2.4.0

matplotlib == 3.5.1

networkx == 2.6.3

numpy == 1.22.2

opencv-python == 4.5.5.62

osmnx == 1.1.2

pandas == 1.4.0

pillow == 9.0.1

places365 (please download the repository places365 and put under the root as ./places365)

scipy == 1.8.0

scikit-learn == 1.0.2

torch == 1.10.2+cu113

torchvision == 0.11.3+cu113

transformers == 4.16.2

WHOSe_Heritage (please download the repository WHOSe_Heritage and put under the root as ./WHOSe_Heritage)

Workflow and Dataset

This project provides a workflow to construct the graph-based multi-modal datasets HeriGraph, concerning heritage values and attributes, using data from the social media platform Flickr. The workflow is illustrated as follows:

Workflow of Data

To protect the privacy and copyright of Flickr users, only the final processed (stored) datasets (thus no raw images) are provided in this repository. Users are invited to collect and construct datasets for the provided case study cities, or for any other new [city], according to their own interests.

Case Studies

Three cities related to UNESCO World Heritage and Historic Urban Landscape were selected as case studies: Amsterdam, the Netherlands (Seventeenth-Century Canal Ring Area of Amsterdam inside the Singelgracht), Suzhou, China (Classical Gardens of Suzhou), and Venice, Italy (Venice and its Lagoon).

The data of each case study has been put in a different folder: ./Amsterdam, ./Suzhou, and ./Venezia. Unless stated otherwise, all the code and data introduced below correspond to, and are stored in, the respective folder.

To construct your own dataset for any other [city], create a separate folder ./[city], and record the geo-location [city_lat], [city_lon] and the diameter [city_radius] of the area of interest.

| Case Study City | World Heritage Name | Latitude | Longitude | Diameter |
|---|---|---|---|---|
| Amsterdam (AMS) | Seventeenth-Century Canal Ring Area of Amsterdam inside the Singelgracht | 52.365000N | 4.887778E | 2 km |
| Suzhou (SUZ) | Classical Gardens of Suzhou | 31.302300N | 120.631300E | 5 km |
| Venice (VEN) | Venice and its Lagoon | 45.438759N | 12.327145E | 5 km |
| [city] | World Heritage status of [city] | [city_lat] | [city_lon] | [city_radius] |

Dataset (csv) Summary (skip the next parts of project workflow)

As the final outcome of this project, datasets for multi-modal machine learning on multi-graphs are provided for each [city]. The components of the datasets are respectively saved in ./dataset/[city]/, ready to be used for multiple tasks.

The merging and saving of datasets could be found following ./Dataset_Saving.ipynb.

| File Name | Column Size | Description | Notation |
|---|---|---|---|
| Visual_Features.csv | 984 | Visual Features extracted | Xvis |
| Textual_Features.csv | 776 | Textual Features extracted | Xtex |
| Value_Labels.csv | 26 | Soft and Hard Labels for Heritage Values together with confidence scores | YHV\|KHV |
| Attribute_Labels.csv | 19 | Soft and Hard Labels for Heritage Attributes together with confidence scores | YHA\|KHA |
| Edge_List.csv | 18 | Adjacency information of Multi-graphs with three types of links (currently only available on Google Drive) | A, ATEM, ASOC, ASPA |

The complete processed dataset could be obtained through the following Google Drive Link.

Dataset (numpy) Summary (skip the next parts of project workflow)

(updated 19 September 2023)

A numpy array version of the final dataset (currently only available for Venice) is available under ./dataset_np/[city]/, which is more efficient and compact than the csv version, especially for the link data as it uses the Scipy sparse matrices.

The saving process of numpy version dataset could be found following ./Dataset_Saving_np.ipynb, ./HeriGraph_Construction_Venezia(sparse)_np.ipynb and ./HeriGraph_Construction_Venice_Large(sparse)_np.ipynb.

| File Name | Array Size | Description | Notation |
|---|---|---|---|
| Visual_Features.npy | (*, 984) | Visual Features extracted | Xvis |
| Textual_Features.npy | (*, 776) | Textual Features extracted | Xtex |
| labels.npz | - | Values and Attributes combined, with respective keys "VAL_LAB" and "ATT_LAB" | YHV\|KHV, YHA\|KHA |
| node_types.npy | (*, ) | A boolean variable indicating if the sample only has visual features and no textual features (0) or has both features (1) | |
| train_val_test_idx.npz | - | Three arrays of indices indicating the training set, validation set, and test set for future tasks, with respective keys "train", "val", "test" | |
| A_simp.npz | - | Sparse matrix marking the links of the simplified composed graph | A |
| A_SOC.npz | - | Sparse matrix marking the weights of social links | ASOC |
| A_SPA.npz | - | Sparse matrix marking the weights of spatial links | ASPA |
| A_TEM.npz | - | Sparse matrix marking the weights of temporal links | ATEM |

To download the numpy version of Venice directly, use this Google Drive link.

To download the numpy version of Venice-XL directly, use this Google Drive Link.

For loading the numpy version of dataset in Pytorch-Geometric library, check the upcoming repository Stones_Venice in a follow-up project.
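As a rough sketch of how the files listed in the table above might be read with NumPy and SciPy: the file names and the keys "VAL_LAB", "ATT_LAB", "train", "val", and "test" come from the table, while the helper name `load_herigraph_np` is hypothetical, and the assumption that the A_*.npz files were written with `scipy.sparse.save_npz` is a guess that this README does not confirm.

```python
from pathlib import Path

import numpy as np
import scipy.sparse as sp


def load_herigraph_np(folder):
    """Read the numpy version of a HeriGraph dataset from `folder`.

    Hypothetical helper. Assumes the A_*.npz files were written with
    scipy.sparse.save_npz, which the repository does not confirm.
    """
    folder = Path(folder)
    labels = np.load(folder / "labels.npz")
    split = np.load(folder / "train_val_test_idx.npz")
    return {
        "X_vis": np.load(folder / "Visual_Features.npy"),
        "X_tex": np.load(folder / "Textual_Features.npy"),
        "Y_val": labels["VAL_LAB"],
        "Y_att": labels["ATT_LAB"],
        "node_types": np.load(folder / "node_types.npy"),
        "split": {key: split[key] for key in ("train", "val", "test")},
        # One sparse matrix per link type, plus the simplified composed graph.
        "A": {
            name: sp.load_npz(str(folder / f"A_{name}.npz"))
            for name in ("simp", "SOC", "SPA", "TEM")
        },
    }
```

A call such as `data = load_herigraph_np("./dataset_np/[city]")` would then expose the features, labels, splits, and adjacency matrices in one dictionary. If the feature arrays turn out to hold mixed types (for example string IDs), `np.load` may additionally need `allow_pickle=True`.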

Raw Data Collection

Flickr API Requirements

Apply for your own API key from the Flickr App Garden, and save the [api_key] and [api_secret] for later use whenever the API requests them.

Small Datasets (Recommended)

The code to download the raw data as IDs of Flickr posts and to save the images is given in ./[city]/save_image.py.

Input the respective [api_key], [api_secret], [city_lat], [city_lon], and [city_radius] to run the code. The number of IDs requested from the API is capped at 5000 to keep the datasets comparable to each other.

The downloaded metadata will be saved as ./[city]/data_storage/images/photos_sizes.csv, and the images whose owners allow downloading (candownload==True flag) will be saved in ./[city]/data_storage/images/150/ and ./[city]/data_storage/images/320/, respectively, for the Large Square - url_q (150×150 px) and Small 320 - url_n (320×240 px) versions of the original image.
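The capped ID collection described above can be sketched roughly as follows. This is not the content of save_image.py: the function name `collect_photo_ids` and its parameters are hypothetical, and the `flickr` argument is assumed to behave like a `flickrapi.FlickrAPI` client created with `format="parsed-json"`, so that `flickr.photos.search` returns a dictionary.

```python
def collect_photo_ids(flickr, lat, lon, radius, max_ids=5000, per_page=250):
    """Page through flickr.photos.search until max_ids IDs are collected.

    Hypothetical sketch: `flickr` is assumed to be a client such as
    flickrapi.FlickrAPI(api_key, api_secret, format="parsed-json").
    """
    ids, page = [], 1
    while len(ids) < max_ids:
        photos = flickr.photos.search(
            lat=lat, lon=lon, radius=radius,
            extras="url_q,url_n",  # ask for the 150 px and 320 px image URLs
            per_page=per_page, page=page,
        )["photos"]
        ids.extend(photo["id"] for photo in photos["photo"])
        # Stop when the last page is reached or a page comes back empty.
        if page >= int(photos["pages"]) or not photos["photo"]:
            break
        page += 1
    return ids[:max_ids]
```

With real credentials, the client would typically be created as `flickrapi.FlickrAPI(api_key, api_secret, format="parsed-json")` and passed in together with [city_lat], [city_lon], and [city_radius]. Passing the client as an argument keeps the paging logic easy to try out with a stand-in object before spending API quota.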

Large Datasets

To collect large datasets without the restriction of 5000 IDs, follow ./Venezia/collect_data.py to save all the IDs and metadata, and follow ./Venezia/save_image_all.py to download the images in the folder.

Input the respective [api_key], [api_secret], the minimum and maximum [city_lat] and [city_lon] as the bounding box of the region, the size of the grid (default 20), and the radius of inquiry in the grid (default 0.3) to run the code.

The IDs will be collected in a 20 by 20 grid with the name of ./[city]/data_storage/photo_ids_{}_{}.csv, while the summarized metadata will be saved in ./[city]/data_storage/photos_last.csv. All the saved images will be stored in the folder ./[city]/data_storage/images/grid/ with the Large Square - url_q (150×150 px) version of the original image.

Note that the Flickr API might return an error code during the data inquiry. Run both scripts iteratively to continue collecting data until the desired total amount is reached.
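To illustrate the grid-based inquiry, the sketch below splits a bounding box into a 20 by 20 grid and returns one query point per cell. Using the cell centres as query points is an assumption (collect_data.py may place the points differently), and the function name `grid_centers` is hypothetical.

```python
def grid_centers(lat_min, lat_max, lon_min, lon_max, size=20):
    """Return (row, col, lat, lon) for the centre of each cell of a size x size grid.

    Hypothetical helper: querying at cell centres is an assumption,
    not necessarily what collect_data.py does.
    """
    lat_step = (lat_max - lat_min) / size
    lon_step = (lon_max - lon_min) / size
    return [
        (i, j, lat_min + (i + 0.5) * lat_step, lon_min + (j + 0.5) * lon_step)
        for i in range(size)
        for j in range(size)
    ]
```

Each tuple could then drive one inquiry with the chosen radius, with the result written to a file following the photo_ids_{}_{}.csv pattern, for example `f"photo_ids_{i}_{j}.csv"`. Because every cell has its own file, a run interrupted by an API error can resume from the cells that are still missing.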

Multi-modal Feature Generation

Visual Features

The 512-dimensional vector of hidden visual features, 365-dimensional scene category predictions, and 102-dimensional scene attribute predictions could be obtained following ./Places_Prediction.ipynb. The results will be saved as ./[city]/data_storage/IMG_pred_150.csv (150×150 px small images only), and ./[city]/data_storage/IMG_pred.csv (images of both sizes for comparison of confidence and/or consistency).

The 3-dimensional vector of face prediction in images could be obtained following ./Face_Detection_in_Images.ipynb. The results will be saved as ./[city]/data_storage/Face_preds.csv.

The final merged visual features data (982-dimensional) are provided in ./dataset/[city]/Visual_Features.csv, which is effectively a 984-column table.

| Column Index | Name | Description | Data Type | Notation |
|---|---|---|---|---|
| 0 | ID | Unique Image Index from Flickr | String | - |
| 1 | IO_Type | Indoor/Outdoor Scene | String | - |
| 2-513 | Vis_Feat_[i] | Last 512-dimensional Hidden Layer of ResNet-18 pretrained on PlacesCNN as Visual Feature | Float | Hv |
| 514-516 | Face_[*] | Number of faces, confidence of face prediction, proportion of faces in the image | Float | F |
| 517-881 | SCE_[*] | Smoothened/Filtered 365-dimensional scene category prediction | Logit Float | σ(5)(Ls) |
| 882-983 | ATT_[*] | Smoothened/Filtered 102-dimensional scene attribute prediction | Logit Float | σ(10)(La) |
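The column ranges in the table above can be turned into positional slices, as in the following sketch with pandas. The ranges are copied from the table. The names `VISUAL_BLOCKS` and `split_visual_blocks` are hypothetical, and the sketch assumes the CSV is read without an extra leading index column.

```python
import pandas as pd

# Inclusive column index ranges, copied from the table above.
VISUAL_BLOCKS = {
    "Vis_Feat": (2, 513),
    "Face": (514, 516),
    "SCE": (517, 881),
    "ATT": (882, 983),
}


def split_visual_blocks(df):
    """Slice a Visual_Features table into its named feature blocks by position."""
    return {
        name: df.iloc[:, first:last + 1]
        for name, (first, last) in VISUAL_BLOCKS.items()
    }
```

For example, `split_visual_blocks(pd.read_csv("./dataset/[city]/Visual_Features.csv"))` would give blocks of 512, 3, 365, and 102 columns, which together make up the 982-dimensional visual feature vector.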

Textual Features

The data cleaning of textual data, and the 3-dimensional vector of original language of posts could be obtained following ./Dataset_Cleaning_and_Merging_[city].ipynb. The results will be saved as ./[city]/data_storage/metadata.csv in post level and ./[city]/data_storage/sentences.csv in sentence level.

The 768-dimensional vector of BERT [CLS] token could be obtained following ./bert_inference_HeriGraph.ipynb. The results will be saved as ./[city]/data_storage/metadata_bert.csv in post level and ./[city]/data_storage/sentences_bert.csv in sentence level.

The final merged textual features data (771-dimensional) are provided in ./dataset/[city]/Textual_Features.csv, which is effectively a 776-column table.

| Column Index | Name | Description | Data Type | Notation |
|---|---|---|---|---|
| 0 | index | Unique Image Index from Flickr | String | - |
| 1 | text_bool | Whether the original post has a valid textual data (as a filter) | Boolean | - |
| 2 | revised_text | The processed and filtered textual data of the post as combination of description, title, and tags. | String | S |
| 3-4 | num_sent/ text_len | Number of sentences and number of words in the revised text | Integer | - |
| 5-772 | BERT_[i] | The 768-dimensional output vector of [CLS] token | Float | HB |
| 773-775 | English/ Local_Lang/ Other_Lang | Detected original language in the posts | Boolean | O |

Contextual Features

The temporal features about the timestamps of the posts in their unique week counts could be obtained following ./Dataset_Cleaning_and_Merging_[city].ipynb. The results will be saved in ./[city]/data_storage/metadata.csv.

The social features about the social relations of the post owners could be obtained following ./Social_Links_of_Interests.ipynb. Input the [api_key] and [api_secret] to activate the queries of the public contacts and public groups of the Flickr users. The information will be respectively saved as ./[city]/data_storage/contacts.csv, ./[city]/data_storage/interest.csv, and ./[city]/data_storage/friendship.csv, while the final merged social information is saved as ./[city]/data_storage/social_links.csv.

The spatial features about the locations of the posts and their connectivity in geographical network could be obtained following ./Geographical_Graph_Construction.ipynb. Input the respective [city_lat], [city_lon], and [city_radius] to run the code. The spatial network information will be saved respectively as ./[city]/data_storage/GEO_nodes.csv showing the intersections in spatial network, ./[city]/data_storage/GEO_edges.csv showing the connectivity of spatial nodes with travel time information, and ./[city]/data_storage/GEO_node_dist.csv showing the travel time between any two nodes. The geo-node assigned to each post will be recorded in ./[city]/data_storage/GEO_metadata.csv.

Label Generation

Heritage Values

This project applied the heritage value definition in UNESCO WHL with regard to ten Outstanding Universal Value selection criteria plus one additional "other" class, which is introduced and trained in WHOSe_Heritage.

The predicted labels on heritage values by BERT could be obtained following ./bert_inference_HeriGraph.ipynb. The results will be saved as ./[city]/data_storage/metadata_bert.csv in post level and ./[city]/data_storage/sentences_bert.csv in sentence level.

The predicted labels on heritage values by ULMFiT could be obtained following ./ulmfit_inference_HeriGraph.ipynb. The results will be saved as ./[city]/data_storage/metadata_ulmfit.csv in post level and ./[city]/data_storage/sentences_ulmfit.csv in sentence level.

The comparison of both models for performance, coherence, and consistency on both post level and sentence level could be obtained following ./Diagram_Values.ipynb.

The final merged heritage value label data (11-dimensional) are provided in ./dataset/[city]/Value_Labels.csv, which is effectively a 26-column table. A sample is considered labelled if the average top-3 confidence of the BERT and ULMFiT models is larger than 0.75 and the Jaccard Index of their top-3 predictions is larger than 0.5. This leads to around 40-50% of textual samples being labelled (thus around 10-35% of all data samples in each city). Users are invited to adjust the labelling thresholds and experiment with the effects.

| Column Index | Name | Description | Data Type | Notation |
|---|---|---|---|---|
| 0 | index | Unique Image Index from Flickr | String | - |
| 1 | text_bool | Whether the original post has a valid textual data (as a filter) | Boolean | - |
| 2-12 | Criteria_[i]/ Others | The average predicted soft label of post text concerning heritage values in terms of OUV. | Float | YHV |
| 13-18 | max_[i]_val/ max_[i]_col | The predicted hard top-3 labels of heritage values | Float/ String | - |
| 19-20 | max_[i] | The top-k confidence of averaged soft label prediction | Float | - |
| 21-22 | conf_[i] | The average model confidence of BERT and ULMFiT for their top-k predictions | Float | κHV(0) |
| 23-24 | same_[i] | The model agreement/consistency of BERT and ULMFiT for their top-k predictions in terms of Jaccard Index | Float/ Boolean | κHV(1) |
| 25 | labelled | Whether the sample should be considered as "pseudo-labelled" data | Boolean | - |
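The labelling rule for heritage values (average top-3 confidence above 0.75 and Jaccard Index above 0.5) might be expressed as in the sketch below. It reads "top-3 confidence" as the summed probability of a model's three highest classes, which is an assumption about the wording rather than the confirmed implementation, and all function names are hypothetical.

```python
def top_k(probs, k=3):
    """Indices of the k largest probabilities, highest first."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]


def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two collections of class indices."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


def is_labelled(p_bert, p_ulmfit, k=3, conf_min=0.75, agree_min=0.5):
    """Apply the confidence and agreement thresholds to one sample.

    Assumption: the top-k confidence of a model is the sum of its k
    highest class probabilities, averaged over the two models.
    """
    top_b, top_u = top_k(p_bert, k), top_k(p_ulmfit, k)
    conf = (sum(p_bert[i] for i in top_b) + sum(p_ulmfit[i] for i in top_u)) / 2
    return conf > conf_min and jaccard(top_b, top_u) > agree_min
```

Under this reading, two top-3 sets can only have a Jaccard Index of 0, 0.2, 0.5, or 1, so a threshold of "larger than 0.5" requires both models to agree on all three classes. Changing `conf_min` or `agree_min` is one way to experiment with the thresholds.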

A demo of labelled heritage values can be seen with the following diagram:

Heritage Values

Heritage Attributes

This project applied the heritage definition by Veldpaus (2015) and Ginzarly et al. (2019), keeping a nine-class category of depicted scenery of an image.

A few models have been trained on the data presented by Ginzarly et al. (2019) in Tripoli, Lebanon. The training process together with hyper-parameter tuning with grid search cross validation with scikit-learn library could be found in ./Machine_Learning_Models_on_Heritage_Attributes_Tripoli.ipynb.

Model_Training

The trained ensemble VOTE and STACK classification models are saved in the folder ./Tripoli/model_storage/ respectively under ./Tripoli/model_storage/vote_classifier.joblib and ./Tripoli/model_storage/stack_classifier.joblib.

The predicted labels on heritage attributes by both classifiers could be obtained following ./Machine_Learning_Models_on_Heritage_Attributes_Tripoli.ipynb. The results will be saved as ./[city]/data_storage/IMG_pred_150_cat.csv.

The comparison of both models for performance, coherence, and consistency could be obtained following ./Diagram_Attributes.ipynb.

The final merged heritage attribute label data (9-dimensional) are provided in ./dataset/[city]/Attribute_Labels.csv, which is effectively a 19-column table. A sample is considered labelled if the average top-1 confidence of the VOTE and STACK models is larger than 0.7 and their top-1 predictions are the same. This leads to around 35-50% of samples being labelled. Users are invited to adjust the labelling thresholds and experiment with the effects.

| Column Index | Name | Description | Data Type | Notation |
|---|---|---|---|---|
| 0 | ID | Unique Image Index from Flickr | String | - |
| 1-9 | [various names] | The average predicted soft label of post image concerning heritage attributes in terms of depicted scenes. | Float | YHA |
| 10-11 | category[-/_id] | The predicted hard top-1 labels of heritage attributes | String | - |
| 12-15 | category/ cat_id_[model] | The top-1 hard label prediction of VOTE and STACK models | String | - |
| 16 | conf | The average model confidence of VOTE and STACK for their top-1 predictions | Float | κHA(0) |
| 17 | category_same | The model agreement/consistency of VOTE and STACK for their top-1 predictions | Boolean | κHA(1) |
| 18 | labelled | Whether the sample should be considered as "pseudo-labelled" data | Boolean | - |
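The corresponding rule for heritage attributes (same top-1 class and average top-1 confidence above 0.7) is simpler and might look like the following sketch. The function name `is_attribute_labelled` is hypothetical, and the inputs are assumed to be the nine-class probability vectors of the VOTE and STACK models.

```python
def is_attribute_labelled(p_vote, p_stack, conf_min=0.7):
    """Check top-1 agreement and average top-1 confidence for one image.

    Hypothetical helper: p_vote and p_stack are assumed to be the
    class probabilities predicted by the VOTE and STACK models.
    """
    top_v = max(range(len(p_vote)), key=lambda i: p_vote[i])
    top_s = max(range(len(p_stack)), key=lambda i: p_stack[i])
    conf = (p_vote[top_v] + p_stack[top_s]) / 2
    return top_v == top_s and conf > conf_min
```

The two conditions mirror the `category_same` and `conf` columns in the table above, so the same check could also be applied directly to those columns with a different threshold.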

A demo of labelled heritage attributes can be seen with the following diagram:

Heritage Attributes

Multi-graph Construction

The graph construction process for the Multi-Graphs, the three subgraphs with Temporal, Social, and Spatial links, as well as the simple composed graphs could be obtained following ./HeriGraph_Construction_[city].ipynb.

The Edge Lists that can be directly used by NetworkX or other software to construct graphs are provided as ./dataset/[city]/Edge_List.csv in the zip file shared on Google Drive, which is effectively a 16-column table. The [Temporal/Social/Spatial]_Similarity columns are the edge weights for each type of subgraph, and the column One_Edge is the adjacency indicator for the composed simple graph.

| Column Index | Name | Description | Data Type | Notation |
|---|---|---|---|---|
| 0 | 0 | Unique Image Index from Flickr for Node 0 | String | v0 |
| 1 | 1 | Unique Image Index from Flickr for Node 1 | String | v1 |
| 2-3 | Week_[i] | The timestamp ID in week level | String | ti |
| 4 | dist | The temporal distance of two nodes | String | - |
| 5 | Temporal_Similarity | The edge weight of temporal links | Float | ATEM |
| 6-7 | User_[i] | The user ID in Flickr | String | ui |
| 8 | relationship | The strength level of relationship of two users | Integer | - |
| 9 | Social_Similarity | The edge weight of social links | Float | ASOC |
| 10-11 | GEO_[i] | The GEO-location ID in the spatial network | String | υi |
| 12 | geo_distance | The spatial distance of two nodes in terms of travel time | Float | we |
| 13 | Spatial_Similarity | The edge weight of spatial links | Float | ASPA |
| 14 | One_Edge | The adjacency indicator for the composed simple graph | Boolean | A |
| 15 | Same_Node | Whether the two nodes are the same one | Boolean | - |
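As an illustration of how the edge list might be turned into graphs with NetworkX, the sketch below builds one weighted graph per link type and the composed simple graph. The column names come from the table above. The function name `build_graphs` is hypothetical, and treating a zero similarity as "no link of this type" is an assumption.

```python
import networkx as nx
import pandas as pd


def build_graphs(edges):
    """Build the three weighted subgraphs and the composed simple graph.

    Hypothetical helper operating on a DataFrame with the Edge_List columns.
    """
    graphs = {}
    for kind in ("Temporal", "Social", "Spatial"):
        col = f"{kind}_Similarity"
        # Assumption: a zero weight means there is no link of this type.
        linked = edges[edges[col] > 0]
        graphs[kind] = nx.from_pandas_edgelist(
            linked, source="0", target="1", edge_attr=col
        )
    simple = edges[edges["One_Edge"].astype(bool)]
    graphs["Simple"] = nx.from_pandas_edgelist(simple, source="0", target="1")
    return graphs
```

With the real file, `edges` would come from something like `pd.read_csv("./dataset/[city]/Edge_List.csv", dtype={"0": str, "1": str})`, where the `dtype` argument keeps the Flickr IDs as strings. Rows flagged by Same_Node may need to be dropped first if self-loops are not wanted.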

The statistics of generated graphs following the standard of PyTorch-Geometric could be found in the following table:

| Name | #graphs/ subgraphs | #nodes | #edges | #features | #classes/ tasks |
|---|---|---|---|---|---|
| HeriGraph | 3 | ~3271.7 | ~907,393.3 | - | - |
| └─Amsterdam | 3+1 | 3727 | 1,271,171 | 1753 | 11+9 |
| └─Suzhou | 3+1 | 3137 | 916,496 | 1753 | 11+9 |
| └─Venice | 3+1 | 2951 | 534,513 | 1753 | 11+9 |

Acknowledgements and License

This project applies the pretrained models of the following projects, which are openly released on GitHub or published as Python packages. Part of the code is adapted from the original code.

Places365-CNNs

Face Recognition Using Pytorch

WHOSe_Heritage

The workflows and datasets of this paper can be used under the Creative Commons License (Attribution CC BY 4.0). Please give appropriate credit, such as providing a link to our paper or to this GitHub repository. The copyright of all the downloaded and processed images belongs to the image owners.
