# NewsClustering #

This is the repository for a CS473 final project. It aims to cluster news threads from the Purdue Newsroom and generate a summary for each thread.

# Getting Started #

  1. Install Python (2.5 <= version < 3.0)
  2. Install gensim's dependencies, NumPy and SciPy (`sudo apt-get install python-dev` may be needed to build them)
  3. Install gensim
  4. Install stemming 1.0 (`easy_install -U stemming` if easy_install is available)

Detailed installation steps can be found on the gensim website.

  1. Run parse.py
  2. Run transform.py
  3. Run lda.py (see the sketch after this list)
  4. Run index.py
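
As a rough illustration of what the transform.py and lda.py steps do, here is a minimal gensim sketch; the sample documents and variable names are placeholders, not the actual code in this repo:

```python
from gensim import corpora, models
from stemming.porter2 import stem

# Hypothetical raw input; in this project the documents come from parse.py.
raw_documents = [
    "Purdue students win engineering award",
    "New agriculture program announced at Purdue",
]

# Tokenize and stem each document.
texts = [[stem(w) for w in doc.lower().split()] for doc in raw_documents]

# Map each stemmed word to an integer id and build a sparse bag-of-words corpus.
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Train an LDA model over the corpus (num_topics is an arbitrary choice here).
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=20, passes=10)

# Topic distribution for one document: a list of (topic_id, probability) pairs.
print(lda[corpus[0]])
```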

# To-do #

  1. Save document URLs
  2. Transform the corpus to a sparse representation
  3. LDA step
  4. Invert the hash from stemmed word back to the original word
  5. Apply a TF-IDF transform to the corpus
  6. Similarity queries over all documents (clustering), as sketched below
  7. Summarize topics
  8. Evaluation datasets for single-label text categorization
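
For the TF-IDF and similarity-query items, a minimal gensim sketch might look like the following. It reuses the `dictionary` and `corpus` built in the sketch above; this is an illustration, not code from the repo:

```python
from gensim import models, similarities

# Weight the bag-of-words corpus with TF-IDF.
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]

# Build a similarity index so every document can be queried against all others.
index = similarities.MatrixSimilarity(corpus_tfidf, num_features=len(dictionary))

# Cosine similarities between document 0 and every document in the corpus;
# thresholding or grouping these scores is one way to form clusters/threads.
sims = index[tfidf[corpus[0]]]
print(list(enumerate(sims)))
```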

# Data #

The file `2012data.txt` in the Data directory is used as the input for this project. It has the following structure:

    url\n
    <Content>\n
    title\n
    content\n
    <\Content>\n
    \n
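
A minimal sketch of reading records in this format, assuming each field occupies a single line as shown above (the actual parse.py may handle the fields differently):

```python
def read_records(path):
    """Yield (url, title, content) tuples from a file in the 2012data.txt format."""
    with open(path) as f:
        lines = [line.rstrip("\n") for line in f]

    i = 0
    while i < len(lines):
        # Skip blank separator lines between records.
        if not lines[i].strip():
            i += 1
            continue
        url = lines[i]
        # lines[i + 1] is the opening <Content> marker.
        title = lines[i + 2]
        content = lines[i + 3]
        # lines[i + 4] is the closing marker, followed by a blank separator line.
        yield url, title, content
        i += 6

# Example usage:
# for url, title, content in read_records("Data/2012data.txt"):
#     print(url, title)
```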

# Doc_index #

doc[doc_b] is a list of pairs; each pair is (topic_id, probability). wordlist is a list of topics, each a list of [weight, stemmed word] pairs, e.g. [['0.006', 'student'], ['0.005', 'engin'], ['0.004', 'agricultur'], ['0.004', 'program'], ...].
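
These structures map naturally onto gensim's LDA outputs. A hedged sketch, assuming the `lda` and `corpus` objects from the earlier sketch (note that the tuple order returned by show_topic differs between gensim versions):

```python
# Per-document topic distribution: doc[doc_b] -> [(topic_id, probability), ...]
doc = {doc_b: lda[bow] for doc_b, bow in enumerate(corpus)}

# Per-topic word list: one list of [weight, stemmed_word] pairs per topic.
# Older gensim versions return (weight, word) from show_topic, newer ones
# return (word, weight); adjust the unpacking accordingly.
wordlist = []
for topic_id in range(lda.num_topics):
    pairs = lda.show_topic(topic_id, topn=10)
    wordlist.append([["%.3f" % prob, word] for word, prob in pairs])

print(doc[0])
print(wordlist[0])
```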

# Resources #

Python Official Tutorial
