# NewsClustering #

This is the repository for a CS473 final project. It aims to cluster news threads from the Purdue Newsroom and generate a summary for each thread.

# Getting Started #

  1. Install Python (2.5 <= version < 3.0)
  2. Install gensim's dependencies, NumPy and SciPy (`sudo apt-get install python-dev` may be needed to build them)
  3. Install gensim
  4. Install stemming 1.0 (`easy_install -U stemming` if easy_install is available)

Detailed installation steps can be found on the gensim website.

  1. Run parse.py
  2. Run transform.py
  3. Run lda.py (see the sketch after this list)
  4. Run index.py
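
As a rough illustration of what the transform.py and lda.py steps do, here is a minimal gensim sketch; the sample documents and variable names are placeholders, not the actual code in this repo:

```python
from gensim import corpora, models
from stemming.porter2 import stem

# Hypothetical raw input; in this project the documents come from parse.py.
raw_documents = [
    "Purdue students win engineering award",
    "New agriculture program announced at Purdue",
]

# Tokenize and stem each document.
texts = [[stem(w) for w in doc.lower().split()] for doc in raw_documents]

# Map each stemmed word to an integer id and build a sparse bag-of-words corpus.
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Train an LDA model over the corpus (num_topics is an arbitrary choice here).
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=20, passes=10)

# Topic distribution for one document: a list of (topic_id, probability) pairs.
print(lda[corpus[0]])
```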

# To-do #

  1. Save document URLs
  2. Transform the corpus to a sparse representation
  3. LDA step
  4. Invert the hash from stemmed word back to the original word
  5. Apply a TF-IDF transform to the corpus
  6. Similarity queries over all documents (clustering), as sketched below
  7. Summarize topics
  8. Evaluation datasets for single-label text categorization
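
For the TF-IDF and similarity-query items, a minimal gensim sketch might look like the following. It reuses the `dictionary` and `corpus` built in the sketch above; this is an illustration, not code from the repo:

```python
from gensim import models, similarities

# Weight the bag-of-words corpus with TF-IDF.
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]

# Build a similarity index so every document can be queried against all others.
index = similarities.MatrixSimilarity(corpus_tfidf, num_features=len(dictionary))

# Cosine similarities between document 0 and every document in the corpus;
# thresholding or grouping these scores is one way to form clusters/threads.
sims = index[tfidf[corpus[0]]]
print(list(enumerate(sims)))
```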

# Data #

The file `2012data.txt` in the Data directory is used as the input for this project. It has the following structure:

    url\n
    <Content>\n
    title\n
    content\n
    <\Content>\n
    \n
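
A minimal sketch of reading records in this format, assuming each field occupies a single line as shown above (the actual parse.py may handle the fields differently):

```python
def read_records(path):
    """Yield (url, title, content) tuples from a file in the 2012data.txt format."""
    with open(path) as f:
        lines = [line.rstrip("\n") for line in f]

    i = 0
    while i < len(lines):
        # Skip blank separator lines between records.
        if not lines[i].strip():
            i += 1
            continue
        url = lines[i]
        # lines[i + 1] is the opening <Content> marker.
        title = lines[i + 2]
        content = lines[i + 3]
        # lines[i + 4] is the closing marker, followed by a blank separator line.
        yield url, title, content
        i += 6

# Example usage:
# for url, title, content in read_records("Data/2012data.txt"):
#     print(url, title)
```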

# Doc_index #

doc[doc_b] is a list of pairs; each pair is (topic_id, probability). wordlist is a list of topics, each a list of [weight, stemmed word] pairs, e.g. [['0.006', 'student'], ['0.005', 'engin'], ['0.004', 'agricultur'], ['0.004', 'program'], ...].
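
These structures map naturally onto gensim's LDA outputs. A hedged sketch, assuming the `lda` and `corpus` objects from the earlier sketch (note that the tuple order returned by show_topic differs between gensim versions):

```python
# Per-document topic distribution: doc[doc_b] -> [(topic_id, probability), ...]
doc = {doc_b: lda[bow] for doc_b, bow in enumerate(corpus)}

# Per-topic word list: one list of [weight, stemmed_word] pairs per topic.
# Older gensim versions return (weight, word) from show_topic, newer ones
# return (word, weight); adjust the unpacking accordingly.
wordlist = []
for topic_id in range(lda.num_topics):
    pairs = lda.show_topic(topic_id, topn=10)
    wordlist.append([["%.3f" % prob, word] for word, prob in pairs])

print(doc[0])
print(wordlist[0])
```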

# Resources #

Python Official Tutorial
