Temporal Classification of HathiTrust OCRed Texts (code for the paper published at iConference 2015)

zachguo/TCoHOT

Temporal Classification of HathiTrust OCRed Texts

Paper published in the iConference 2015 Proceedings: http://hdl.handle.net/2142/73656

This is also a course project for Z604 (Big Data Analytics for Web and Text), taught in Spring 2014 by Xiaozhong Liu and Miao Chen.

Abstract

In large-scale digital libraries, it is not uncommon for some bibliographic fields in metadata records to be incomplete or missing. Filling in incomplete or missing metadata can greatly facilitate users' search for and access to digital library resources. Temporal information, such as publication date, is a key descriptor of digital resources. In this study, we investigate text mining methods to automatically resolve missing publication dates for the HathiTrust corpora, a large collection of documents digitized by optical character recognition (OCR). In comparison with previous approaches using only unigrams as features, our experimental results show that methods incorporating higher-order n-gram features, e.g., bigrams and trigrams, can more effectively classify a document into discrete temporal intervals, or "chronons". Our approach can be generalized to classify volumes within other digital libraries.
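
To make the two ideas in the abstract concrete, here is a minimal sketch of (a) counting unigram, bigram, and trigram features from a text and (b) mapping a publication year to a chronon label. It is an illustration only, not the code used in the paper: the function names, the whitespace tokenizer, and the 50-year chronon width starting at 1700 are all assumptions made for this example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams (as tuples) from a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_features(text, max_n=3):
    """Count every 1..max_n-gram in lowercased, whitespace-tokenized text.

    Whitespace tokenization is an assumption for this sketch; real OCR
    text would likely need more careful cleaning.
    """
    tokens = text.lower().split()
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(tokens, n))
    return counts


def chronon(year, start=1700, width=50):
    """Map a year to a fixed-width interval label.

    The start year and width are hypothetical; they are not taken
    from the paper.
    """
    lower = start + ((year - start) // width) * width
    return f"{lower}-{lower + width - 1}"


features = ngram_features("the old the old book")
label = chronon(1887)
```

The resulting `features` counter could be fed to any standard classifier, with `label` as the target class; which classifier and which chronon boundaries work best is an empirical question that this sketch does not address.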

Team

  • Siyuan Guo @zachguo
  • Trevor Edelblute @tedelblu
  • Bin Dai @bindai
