homework5

Latest commit: move the homework5 data files to sourcedata. (e4a7213 · Dec 8, 2017)

Name	Last commit message	Last commit date
initialResult	homework5 finished.	Nov 28, 2017
1-2500-393018_2017-11-30 18:07:33.h5	homework5 most best result 0.29648.use this modelel and codes.	Nov 30, 2017
README.md	commit hw5 to moodle.this is last version of hw5.maybe.	Dec 8, 2017
TF_IDF.py	commit hw5 to moodle.this is last version of hw5.maybe.	Dec 8, 2017
dictionary.py	keras train initial finished.	Nov 21, 2017
documentTF.py	homework5 finished.	Nov 28, 2017
getDocL.py	homework5 initial finish.	Nov 23, 2017
getFileList.py	homework5 algorithm modify , for input data.	Nov 22, 2017
getQuerysList.py	homework5 initial finish.	Nov 23, 2017
getTF_IDF.py	homework5 finished.	Nov 28, 2017
idfResult.py	CBOW maybe right and need more train time and small training data siz…	Nov 25, 2017
keras_10.py	commit hw5 to moodle.this is last version of hw5.maybe.	Dec 8, 2017
main_withTFIDF.py	commit hw5 to moodle.this is last version of hw5.maybe.	Dec 8, 2017
queryTF.py	homework5 finished.	Nov 28, 2017
submission_TFIDF_MAP100.txt	add keras_100.	Dec 1, 2017
tFIDF_result.txt	homework5 finished.	Nov 28, 2017

README.md

IR Homework5 Introduction

This is homework 5 of the NTUST CSIE IR course, fall 2017.

Run the main script to get the top 100 results of the final ranking:

python3 main_withTFIDF.py

The final result file is submission_TFIDF_MAP100.txt.

To regenerate the TF_IDF scores, modify TF_IDF.py as needed and then run the command below. This takes about three hours:

python3 TF_IDF.py

To regenerate the CBOW scores, modify keras_10.py as needed and then run the command below. This takes many hours:

python3 keras_10.py

However, the Keras training may not be correct. The CBOW result may actually be better with only 20~50 training rounds; the good result from my 2500-round training is probably just a coincidence.

Also, following the teacher's advice, a word that appears in the query but not in the document should perhaps be given a score taken from the three lowest TF_IDF scores.
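The fallback idea above can be sketched as follows. This is only one possible reading of the advice: it assumes the fallback is the mean of the three lowest TF_IDF weights in the document, which the README does not confirm. The names `fallback_score` and `weight_for` are hypothetical and do not come from the repository.

```python
import heapq


def fallback_score(doc_weights, k=3):
    """Mean of the k lowest TF_IDF weights in one document.

    ASSUMPTION: "the score of the three lowest TF_IDF scores" is read
    here as their average; other readings (min, sum) are possible.
    """
    lowest = heapq.nsmallest(k, doc_weights.values())
    return sum(lowest) / len(lowest) if lowest else 0.0


def weight_for(term, doc_weights):
    """Weight of a query term in a document, with the fallback applied
    when the term does not occur in the document."""
    if term in doc_weights:
        return doc_weights[term]
    return fallback_score(doc_weights)
```

For example, with document weights `{"a": 0.1, "b": 0.2, "c": 0.3, "d": 0.9}`, a query word missing from the document would receive the average of 0.1, 0.2 and 0.3 under this reading.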

step1 Vector model

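1 : get the query list
2 : get the dictionary, the document TF and IDF, and the query TF
3 : get the document term weights, taking one document as a column vector; then get the query term weights as a row vector
4 : compute the dot product of the two vectors
5 : compute the lengths of the two vectors
6 : compute the cosine (cosVal)
7 : sort the documents for this query; the closer the match, the larger the cosVal
8 : find the cosVal for all queries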
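The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the code in TF_IDF.py: it assumes raw term counts for TF and `log(N / df)` for IDF, and the function names (`tf_idf_vector`, `cosine`, `rank_documents`) are hypothetical.

```python
import math
from collections import Counter


def tf_idf_vector(tokens, idf):
    """Weight each term by its raw count times its IDF (assumed scheme)."""
    counts = Counter(tokens)
    return {term: tf * idf.get(term, 0.0) for term, tf in counts.items()}


def cosine(vec_a, vec_b):
    """Cosine of two sparse vectors stored as {term: weight} dicts."""
    # Step 4: dot product of the two vectors.
    dot = sum(w * vec_b.get(term, 0.0) for term, w in vec_a.items())
    # Step 5: lengths of the two vectors.
    len_a = math.sqrt(sum(w * w for w in vec_a.values()))
    len_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if len_a == 0.0 or len_b == 0.0:
        return 0.0
    # Step 6: the cosine value (cosVal).
    return dot / (len_a * len_b)


def rank_documents(query_tokens, docs):
    """Return (doc_name, cosVal) pairs, most similar document first."""
    n_docs = len(docs)
    doc_freq = Counter(t for tokens in docs.values() for t in set(tokens))
    idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}
    query_vec = tf_idf_vector(query_tokens, idf)
    scores = {
        name: cosine(query_vec, tf_idf_vector(tokens, idf))
        for name, tokens in docs.items()
    }
    # Step 7: a larger cosVal means a closer match.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Calling `rank_documents` once per query covers step 8. The vectors are kept as dictionaries because most terms do not occur in any given document.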

step2 Word Embedding(CBOW)

Merge all files into a single file.
Initialize the data: every three words form one group of input data.
Initialize the context list from the first and third words of each group.
Initialize the middle-word list, taken from the second word of each three-word group.
Convert the middle words to one-hot encoding.
Train with Keras using the CBOW word-embedding method, and save the weights.
Load the embedding result from the weight file.
Compute cosVal from the embedding result.
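The data preparation in the first steps of this list might look like the sketch below. It is an illustration only and assumes the three-word groups come from a sliding window over the merged text; the README does not say whether the groups overlap. The function name `build_cbow_samples` is hypothetical, and the Keras training itself is not shown.

```python
def build_cbow_samples(tokens):
    """Build CBOW training samples from three-word groups.

    ASSUMPTION: groups are taken with a sliding window of size three.
    Returns (vocab, contexts, targets):
      vocab    maps each word to an integer id,
      contexts holds the (first word id, third word id) of each group,
      targets  holds the one-hot vector of each group's middle word.
    """
    vocab = {word: idx for idx, word in enumerate(sorted(set(tokens)))}
    contexts, targets = [], []
    for left, middle, right in zip(tokens, tokens[1:], tokens[2:]):
        contexts.append((vocab[left], vocab[right]))
        one_hot = [0] * len(vocab)
        one_hot[vocab[middle]] = 1
        targets.append(one_hot)
    return vocab, contexts, targets
```

For the token list `["the", "cat", "sat", "on", "the", "mat"]` this yields four groups; the first has context ids for "the" and "sat" and a one-hot target for "cat". The contexts would be the model input and the targets the expected output during training.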

step3 Add and get final result

Get the cosVal of all documents for all queries from step1.
Get the cosVal of all documents for all queries from step2.
Add the two groups of cosVal.
Sort the result.
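For a single query, the combination above could be sketched as follows. This assumes an unweighted sum of the two cosVal scores and a cut-off at the top 100 documents; the function name `combine_and_rank` and the dictionary layout are hypothetical, not taken from main_withTFIDF.py.

```python
def combine_and_rank(tfidf_scores, cbow_scores, top_k=100):
    """Add the step1 and step2 cosVal per document and sort the result.

    Both inputs map a document name to its cosVal for one query.
    A document missing from one input contributes 0.0 from that side.
    """
    docs = set(tfidf_scores) | set(cbow_scores)
    total = {
        doc: tfidf_scores.get(doc, 0.0) + cbow_scores.get(doc, 0.0)
        for doc in docs
    }
    # Highest combined score first; ties are broken by document name.
    ranked = sorted(total.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_k]
```

Breaking ties by document name keeps the output order the same from run to run.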

Others:

With 2500 training rounds and the input data split into 10 groups for one training run, the best result for this homework5 was obtained when only three of the groups were used for training. It is surprising, but true.
