MohamedAliHadjTaieb/Semantic-measure-assessment-review-study

Semantic-measure-assessment-review-study

This project studies the assessment protocols for measures of semantic distance between words and concepts. If you find this helpful, please consider citing:

Hadj Taieb, M.A., Zesch, T. & Ben Aouicha, M. A survey of semantic relatedness evaluation datasets and procedures. Artif Intell Rev (2019). https://doi.org/10.1007/s10462-019-09796-3

Datasets

This folder contains the datasets used to assess semantic similarity/relatedness measures for different languages.
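As a rough illustration of how such a dataset can be used, the sketch below scores a measure against human ratings with the Pearson and Spearman correlations, which are commonly reported for these benchmarks. It is not this repository's protocol: the tab-separated `word1<TAB>word2<TAB>score` layout, the function names, the toy pairs and the `toy_measure` placeholder are all assumptions made for the example.

```python
import csv
import io
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the tie-averaged ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


def load_pairs(handle, delimiter="\t"):
    """Read (word1, word2, human_score) rows; this column layout is assumed."""
    return [
        (row[0], row[1], float(row[2]))
        for row in csv.reader(handle, delimiter=delimiter)
        if row
    ]


def evaluate(pairs, measure):
    """Correlate the scores of `measure` with the human ratings."""
    gold = [score for _, _, score in pairs]
    predicted = [measure(w1, w2) for w1, w2, _ in pairs]
    return pearson(predicted, gold), spearman(predicted, gold)


def toy_measure(w1, w2):
    """Hypothetical placeholder measure: Jaccard overlap of character sets."""
    a, b = set(w1), set(w2)
    return len(a & b) / len(a | b)


# Invented toy ratings, not taken from any dataset listed in this README.
SAMPLE = "cat\tcats\t4.0\ncat\tdog\t2.0\ncat\tstone\t0.5\n"

pairs = load_pairs(io.StringIO(SAMPLE))
r, rho = evaluate(pairs, toy_measure)
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```

To try it on a real file, replace `io.StringIO(SAMPLE)` with an opened dataset file and adjust the delimiter and column order to that file's actual layout.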

Dataset |pairs| Year Type Ref Link
Arabic (AR)
Almarsoomi 70 2013 Sim (Almarsoomi et al., 2013) link
MC 30 2009 Sim (Hassan and Mihalcea, 2009) link
SaifAr 40 2014 Rel (Saif et al., 2014) link
WordSim 352 2009 Rel (Hassan and Mihalcea, 2009) link1
link2
Chinese (CN)
PKU 500 2016 Sim (Wu and Li, 2016) link1
link2
Czech (CS)
WordSim 353 2016 Rel (Cinková, 2016) link
Dutch (NL)
MC 30 2018 Sim (Barzegar et al., 2018) link
RG 65 2018 Sim (Barzegar et al., 2018) link
SimLex 999 2018 Sim (Barzegar et al., 2018) link
WordSim 353 2018 Rel (Barzegar et al., 2018) link2
English (EN)
RG 65 1965 Sim (Rubenstein and Goodenough, 1965) link
MC 30 1991 Sim (Miller and Charles, 1991) link
Martinez&Aldana 28 2013 Sim (Martinez-Gil and Aldana-Montes, 2013) link
SimLex 999 2015 Sim (Hill et al., 2015) link
WP 300 2013 Sim (Li et al., 2013) link
MTurk 287 2011 Rel (Radinsky et al., 2011) link
MTurk 771 2012 Rel (Halawi et al., 2012) link
SL 7576 2014 Sim (Silberer and Lapata, 2014) link
Rel 122 2013 Rel (Szumlanski et al., 2013) link
Zie25 25 2006 Rel (Ziegler et al., 2006) link
Zie30 30 2006 Rel (Ziegler et al., 2006) link
GM 30 2008 Rel (Gracia and Mena, 2008) link
RareWords 2034 2013 Rel (Luong et al., 2013) link1
link2
YP_Verb 130 2006 Sim (Yang and Powers, 2006) link1
SimVerb 3500 2016 Sim (Gerz et al., 2016) link1
link2
link3
RD_Verb 37 2000 Sim (Resnik and Diab, 2000) link1
link2
Baker_Verb 144 2014 Sim (Baker et al., 2014) link1
link2
French (FR)
RG 65 2011 Sim (Joubarne and Inkpen, 2011) link1
link2
German (DE)
Gur 65 2005 Sim (Gurevych, 2005) link1
link2
AG 201 2015 Sim (Leviant and Reichart, 2015) link1
link2
Gur 350 2005 Rel (Gurevych, 2005) link1
link2
ZG 222 2006 Rel (Zesch and Gurevych, 2006) link1
link2
Cramer 100 2008 Rel (Cramer and Finthammer, 2008) link1
link2
WordSim 252 2005 Rel (Leviant and Reichart, 2015) link1
link2
link3
Indian languages
Gujarati-WS-Indian 163 2017 Sim (Akhtar et al., 2017) link
Punjabi-WS-Indian 143 2017 Sim (Akhtar et al., 2017) link
Tamil-WS-Indian 97 2017 Sim (Akhtar et al., 2017) link
Telugu-WS-Indian 111 2017 Sim (Akhtar et al., 2017) link
Urdu-WS-Indian 100 2017 Sim (Akhtar et al., 2017) link
Hungarian (HU)
MC 31 2013 Sim (Tóth, 2013) link1
link2
Italian (IT)
link
Japanese (JA)
JWSD_Noun 1103 2017 Sim (Sakaizawa and Komachi, 2017) link1
JWSD_Verb 1464 2017 Sim (Sakaizawa and Komachi, 2017) link1
JWSD_Adjective 960 2017 Sim (Sakaizawa and Komachi, 2017) link1
JWSD_Adverb 902 2017 Sim (Sakaizawa and Komachi, 2017) link1
Persian (FA)
link
Portuguese (PT)
RG 65 2014 Sim (Granada et al., 2014) link1
link2
Romanian (RO)
link
Russian (RU)
link
Spanish (ES)
RG 65 2015 Sim (Camacho-Collados et al., 2015) link1
link2
Swedish (SE)
link
Turkish (TR)
AnlamVer 500 2018 Sim (Ercan and Yıldız, 2018) link1
link2
Ugur 101 2016 Rel (Sopaoglu and Ercan, 2016) link1
link2
AnlamVer 500 2018 Rel (Ercan and Yıldız, 2018) link1
link2
Vietnamese (VI)
SimLex 999 2017 Sim (Tan et al., 2017) link1
link2
ViData 400 2018 Sim (Nguyen et al., 2018) link1
link2

Biomedical datasets

Dataset |pairs| Year Type Ref Link
English (EN)
MayoSRS 101 2011 Rel (Pakhomov et al., 2011) link1
link2
link3
UMNSRS 587 2010 Rel (Pakhomov et al., 2010) link1
link2
Chinese (CN)
Words 240 2012 Rel (Wang et al., 2012) link1
link2

Geographic datasets

Dataset |pairs| Year Type Ref Link
English (EN)
GTRD 66 2018 Rel (Chen et al., 2018) link1
link2

Multilingual datasets

RG65 65 Sim
Language Year Type Ref Link
English EN link1
link2
French FR 2011 Sim (Joubarne and Inkpen, 2011) link1
link2
French FR 2018 Sim (Barzegar et al., 2018) link1
link2
Persian FA link1
link2
Portuguese PT 2014 Sim (Granada et al., 2014) link1
link2
Spanish ES link1
link2
Swedish SE link1
link2
WordSim353 353 Rel
link1
link2
SimLex999 999 Sim
link1
link2

Cross-Lingual datasets

Dataset Language 1 Language 2 |pairs| Year Type Ref Link
WordSim353_DE_IT German Italian 589 2015 Rel (Camacho-Collados et al., 2015) link1
link2
RG65_DE_ES German Spanish 125 2015 Sim (Camacho-Collados et al., 2015) link1
link2
RG65_PT_FA Portuguese Persian 122 2015 Sim (Camacho-Collados et al., 2015) link1
link2
RG65_FR_PT French Portuguese 92 2015 Sim (Camacho-Collados et al., 2015) link1
link2
RG65_FR_FA French Persian 100 2015 Sim (Camacho-Collados et al., 2015) link1
link2
RG65_FR_ES French Spanish 103 2015 Sim (Camacho-Collados et al., 2015) link1
link2
RG65_FR_DE French German 96 2015 Sim (Camacho-Collados et al., 2015) link1
link2
RG65_ES_PT Spanish Portuguese 113 2015 Sim (Camacho-Collados et al., 2015) link1
link2
RG65_ES_FA Spanish Persian 122 2015 Sim (Camacho-Collados et al., 2015) link1
link2
RG65_EN_PT English Portuguese 120 2015 Sim (Camacho-Collados et al., 2015) link1
link2
RG65_EN_FR English French 100 2015 Sim (Camacho-Collados et al., 2015) link1
link2
RG65_EN_FA English Persian 120 2015 Sim (Camacho-Collados et al., 2015) link1
link2
RG65_EN_ES English Spanish 126 2015 Sim (Camacho-Collados et al., 2015) link1
link2
RG65_EN_DE English German 125 2015 Sim (Camacho-Collados et al., 2015) link1
link2
RG65_DE_PT German Portuguese 118 2015 Sim (Camacho-Collados et al., 2015) link1
link2
RG65_DE_FA German Persian 122 2015 Sim (Camacho-Collados et al., 2015) link1
link2

References

  1. Almarsoomi, F.A., O’Shea, J., Bandar, Z., Crockett, K.A., 2013. AWSS: An Algorithm for Measuring Arabic Word Semantic Similarity.
  2. Saif, A., Aziz, M.J.A., Omar, N., 2014. Evaluating knowledge-based semantic measures on Arabic. International Journal on Communications Antenna and Propagation 4, 180–194.
  3. Rubenstein, H., Goodenough, J.B., 1965. Contextual Correlates of Synonymy. Commun. ACM 8, 627–633.
  4. Miller, G.A., Charles, W.G., 1991. Contextual correlates of semantic similarity. Language & Cognitive Processes 6, 1–28.
  5. Wu, Y., Li, W., 2016. Overview of the NLPCC-ICCPOL 2016 Shared Task: Chinese Word Similarity Measurement, in: Natural Language Understanding Intelligent Applications - 5th CCF Conference Natural Language Processing Chinese Computing, NLPCC 2016, 24th International Conference Computer Processing Oriental Languages, ICCPOL2016, Kunming, China, December 26, 2016, Proceedings. pp. 828–839.
  6. Camacho-Collados, J., Pilehvar, M.T., Navigli, R., 2015. A Framework for the Construction of Monolingual and Cross-lingual Word Similarity Datasets, in: ACL(2). The Association for Computer Linguistics, pp. 1–7.
  7. Gurevych, I., 2005. Using the Structure of a Conceptual Network in Computing Semantic Relatedness, in: Natural Language Processing IJCNLP 2005, Second International Joint Conference, Jeju Island, Korea, October 11-13, 2005, Proceedings. pp. 767–778.
  8. Joubarne, C., Inkpen, D., 2011. Comparison of Semantic Similarity for Different Languages Using the Google n-gram Corpus and Second-Order Co-occurrence Measures, in: Advances Artificial Intelligence - 24th Canadian Conference Artificial Intelligence, Canadian AI 2011, St. John’s, Canada, May 25-27, 2011. Proceedings. pp. 216–221.
  9. Pakhomov, S.V.S., Pedersen, T., McInnes, B., Melton, G.B., Ruggieri, A., Chute, C.G., 2011. Towards a framework for developing semantic relatedness reference standards. Journal of Biomedical Informatics 44, 251–265.
  10. Pakhomov, S., McInnes, B., Adam, T., Liu, Y., Pedersen, T., Melton, G.B., 2010. Semantic Similarity and Relatedness between Clinical Terms: An Experimental Study. AMIA Annual Symposium Proceedings 2010, 572–576.
  11. Chen, Z., Song, J., Yang, Y., 2018. An Approach to Measuring Semantic Relatedness of Geographic Terminologies Using a Thesaurus and Lexical Database Sources. International Journal of Geo-Information.
  12. Wang, X., Jia, Y., Zhou, B., Ding, Z., Liang, Z., 2012. Computing semantic relatedness using Chinese Wikipedia links and taxonomy. J. Chinese Comput. Syst. 32(11), 2237–2242.
  13. Cramer, I., Finthammer, M., 2008. An Evaluation Procedure for Word Net Based Lexical Chaining: Methods and Issues, in: Proceedings Fourth Global WordNet Conference (GWC 2008). University of Szeged, Department of Informatics, Szeged, Hungary.
  14. Hill, F., Reichart, R., Korhonen, A., 2015. SimLex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics 41(4), 665–695.
  15. Leviant, I., Reichart, R., 2015. Separated by an un-common language: Towards judgment language informed vector space modeling. arXiv preprint arXiv:1508.00106.
  16. Radinsky, K., Agichtein, E., Gabrilovich, E., Markovitch, S., 2011. A Word at a Time: Computing Word Relatedness Using Temporal Semantic Analysis, in: Proceedings 20th International Conference World Wide Web, WWW’11. ACM, Hyderabad, India, pp. 337–346.
  17. Halawi, G., Dror, G., Gabrilovich, E., Koren, Y., 2012. Large-scale Learning of Word Relatedness with Constraints, in: Proceedings 18th ACM SIGKDD International Conference Knowledge Discovery Data Mining, KDD’12. ACM, Beijing, China, pp. 1406–1414.
  18. Szumlanski, S.R., Gomez, F., Sims, V.K., 2013. A New Set of Norms for Semantic Relatedness Measures, in: ACL(2). The Association for Computer Linguistics, pp. 890–895.
  19. Barzegar, S., Davis, B., Zarrouk, M., Handschuh, S., Freitas, A., 2018. SemR-11: A Multi-Lingual Gold-Standard for Semantic Similarity and Relatedness for Eleven Languages, in: LREC 2018.
  20. Camacho-Collados, J., Pilehvar, M.T., Navigli, R., 2015. A Framework for the Construction of Monolingual and Cross-lingual Word Similarity Datasets, in: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics (ACL), Short Papers, Beijing, China, July 27-29, 2015.
  21. Granada, R., Santos, C.T. dos, Vieira, R., 2014. Comparing Semantic Relatedness between Word Pairs in Portuguese Using Wikipedia, in: Computational Processing Portuguese Language - 11th International Conference, PROPOR 2014, São Carlos/SP, Brazil, October 6-8, 2014. Proceedings. pp. 170–175.
  22. Ziegler, C.-N., Simon, K., Lausen, G., 2006. Automatic Computation of Semantic Proximity Using Taxonomic Knowledge, in: Proceedings 15th ACM International Conference Information Knowledge Management, CIKM’06. ACM, Arlington, Virginia, USA, pp. 465–474.
  23. Gracia, J., Mena, E., 2008. Web-based Measure of Semantic Relatedness, in: Proceedings 9th International Conference Web Information Systems Engineering (WISE 2008), Auckland, New Zealand. Springer, pp. 136–150.
  24. Luong, T., Socher, R., Manning, C., 2013. Better Word Representations with Recursive Neural Networks for Morphology, in: Proceedings Seventeenth Conference Computational Natural Language Learning. Association for Computational Linguistics, Sofia, Bulgaria, pp. 104–113.
  25. Hassan, S., Mihalcea, R., 2009. Cross-lingual Semantic Relatedness Using Encyclopedic Knowledge, in: Proceedings 2009 Conference Empirical Methods Natural Language Processing: Volume 3, EMNLP’09. Association for Computational Linguistics, Singapore, pp. 1192–1201.
  26. Nguyen, K.A., Schulte im Walde, S., Vu, N.T., 2018. Introducing two Vietnamese Datasets for Evaluating Semantic Models of (Dis-)Similarity and Relatedness, in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), New Orleans, Louisiana, June 2018.
  27. Bui Van Tan, Nguyen Phuong Thai, Pham Van Lam, 2017. Construction of a word similarity dataset and evaluation of word similarity techniques for Vietnamese, in: KSE 2017, Hue, Vietnam, pp. 65–70.
  28. Sopaoglu, U., Ercan, G., 2016. Evaluation of Semantic Relatedness Measures for Turkish Language, in: Proceedings of the 17th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2016), Konya, Turkey.
  29. Ercan, G., Yıldız, O.T., 2018. AnlamVer: Semantic Model Evaluation Dataset for Turkish - Word Similarity and Relatedness, in: COLING 2018, pp. 3819–3836.
  30. Zesch, T., Gurevych, I., 2006. Automatically creating datasets for measures of semantic relatedness, in: COLING/ACL 2006 Workshop Linguistic Distances. Sydney, Australia, pp. 16–24.
  31. Akhtar, S.S., Gupta, A., Vajpayee, A., Srivastava, A., Shrivastava, M., 2017. Word Similarity Datasets for Indian Languages: Annotation and Baseline Systems, in: LAW@ACL. Association for Computational Linguistics, pp. 91–94.
  32. Tóth, Á., 2013. How Similar: Word Similarity Judgments in English and Hungarian.
  33. Sakaizawa, Y., Komachi, M., 2017. Construction of a Japanese Word Similarity Dataset. CoRR abs/1703.05916.
  34. Li, P., Wang, H., Zhu, K.Q., Wang, Z., Wu, X., 2013. Computing Term Similarity by Large Probabilistic isA Knowledge, in: Proceedings 22nd ACM International Conference Information & Knowledge Management, CIKM’13. ACM, San Francisco, California, USA, pp. 1401–1410.
  35. Yang, D., Powers, D.M.W., 2006. Verb Similarity on the Taxonomy of WordNet, in: 3rd International WordNet Conference (GWC-06), Jeju Island, Korea.
  36. Gerz, D., Vulic, I., Hill, F., Reichart, R., Korhonen, A., 2016. SimVerb-3500: A Large-Scale Evaluation Set of Verb Similarity, in: Proceedings 2016 Conference Empirical Methods Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016. pp. 2173–2182.
  37. Resnik, P., Diab, M., 2000. Measuring verb similarity, in: Proceedings of the Twenty-second Annual Conference of the Cognitive Science Society, August 13-15, 2000, Institute for Research in Cognitive Science, University of Pennsylvania, Philadelphia, PA.
  38. Martinez-Gil, J., Aldana-Montes, J.F., 2013. Semantic similarity measurement using historical google search patterns. Information Systems Frontiers 15(3), 399–410.
  39. Baker, S., Reichart, R., Korhonen, A., 2014. An Unsupervised Model for Instance Level Subcategorization Acquisition, in: Proceedings 2014 Conference Empirical Methods Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting SIGDAT, Special Interest Group ACL. pp. 278–289.
  40. Silberer, C., Lapata, M., 2014. Learning Grounded Meaning Representations with Autoencoders, in: Proceedings of ACL 2014, Baltimore, MD.
  41. Cinková, S., 2016. WordSim353 for Czech, in: Sojka, P., Horák, A., Kopecek, I., Pala, K. (Eds.), Text, Speech, Dialogue: 19th International Conference, TSD 2016, Brno, Czech Republic, September 12-16, 2016, Proceedings. Springer International Publishing, Cham, pp. 190–197.