IMFS-Summer-Kaggle

What

The Office of Investment Management Fintech Strategies (IMFS) is excited to announce the kick-off of our second annual data science collaborative and friendly competition. This collaborative effort brings together crew from across IMG and IMS that have an interest in learning about data science and getting some hands on experience. Beginning in late June, all interested crew members will be split up into teams to participate in a Kaggle competition. Kaggle is a platform that brings together data scientists from across the world to work on data problems. The IMFS data science team has lined up an interesting problem to tackle and we hope that as many people that are interested in can join us.

Timeline

Kickoff - June 28th 2018
Weekly meeting 1 - July 6th 2018
Weekly meeting 2 - July 13th 2018
Weekly meeting 3 - July 20th 2018
Weekly meeting 4 - July 27th 2018
Demo day and celebratory lunch - August 3rd 2018

How

Here’s a brief rundown of what you need to do:

Join the competition on Slack. Once in Slack, make sure to join your team's channel.
Review git commands. Here is a good refresher
Clone this repository locally
Create a git branch where you will work on your own
Start with "Phase 1 - similarityCompetition"
Train your model using "train.csv"
Run your model using 'test.csv' as your input
Submit your work using the "submission_phase1_yourteamname.csv" template
Save your output results in the "submission" folder under "Phase 1 - submission". Make sure to name your output CSV as "submission_phase1_yourteamname.csv".
If you still have time, move on to "Phase 2 - stockDirection"
Train your model using "train.csv"
Run your model using 'test.csv' as your input
Submit your work using the "submission_phase2_yourteamname.csv" template
Save your output results in the "submission" folder under "Phase 2 - submission". Make sure to name your output CSV as "submission_phase2_yourteamname.csv".
Merge your branch to the main (master) repo
Confirm your submissions are present in the "submission" folder in the main repository
Post "completed with project - yourteamname" in the "general" Slack channel

Data and Rules

Data

Phase 1 - similarityCompetition

Analysts tend to use different descriptions to refer to the same security. In this challenge, you are asked to make use of these descriptions to predict whether a pair of descriptions refers to the same security or not.

Refer to the "dataset" folder.

The "train.csv" = your training dataset

description_x = description on a security x
description_y = description on a security y
ticker_x = ticker for security x
ticker_y = ticker for security y
same_security = binary (true/False). Whether x and y refer to the same security (true) or not (false)

The "test.csv" = your test dataset used to grade your results

description_x = description on a security x
description_y = description on a security y
same_security = binary (true/False). The output you need to predict.

Refer to the "submission" folder.

The "submission_phase1_yourteamname.csv" = what your output file must look like

description_x = description on a security x
description_y = description on a security y
same_security = your predictions (binary true/false)

Phase 2 - stockDirection

Can you use eight years of daily news headlines to predict stock market movement? You have historical news headlines from Reddit WorldNews Channel ranked by reddit users' votes, and only the top 25 headlines are considered for a single date. (Range: 2008-06-08 to 2016-07-01). You are also provided with Dow Jones Industrial Average (DJIA) stock data (Range: 2008-08-08 to 2016-07-01). The challenge is to use news headlines data to predict stock directionality, i.e. whether the DJIA Adj. Close value decreases, or stays the same/increases.

Some algorithms you may consider: Naives Bayes, Logistic Regression, SVM and Random Forest.

Refer to the "dataset" folder.

The "train.csv" = your training dataset

Date = date stock/news were pulled
Open = stock price at market open on the day
High = highest stock price on the day
Low = lowest stock price on the day
Close = stock price at market close on the day
Volume = volume
Adj Close = stock price at market close, adjusted for fair value
Top1 thru Top25 = top 25 news headlines from reddit

The "test.csv" = your test dataset used to grade your results

Date = date stock/news were pulled
Top1 thru Top25 = top 25 news headlines from reddit

Refer to the "submission" folder.

The "submission_phase2_yourteamname.csv" = what your output file must look like

Date = date stock/news were pulled
stock_directionality = your prediction of stock movement. Must be binary (1 if (DJIA Adj. Close - Open)>= 0 | 0 otherwise).

Rules

We evaluate the accuracy of your predictions using RMSE or AUC
Your submission must follow the template "submission_phasenumber_yourteamname.csv"

Name		Name	Last commit message	Last commit date
Latest commit History 53 Commits
codes		codes
dataset		dataset
scores		scores
submission		submission
.gitignore		.gitignore
Final_scores_2018_07_31_sorted.csv		Final_scores_2018_07_31_sorted.csv
Final_scores_2018_08_01_sorted.csv		Final_scores_2018_08_01_sorted.csv
LICENSE.md		LICENSE.md
README.md		README.md
Teams.csv		Teams.csv
scorer.py		scorer.py

License

ehameyie/IMFS-Summer-Kaggle

Folders and files

Latest commit

History

Repository files navigation

IMFS-Summer-Kaggle

What

Timeline

How

Data and Rules

Data

Phase 1 - similarityCompetition

Refer to the "dataset" folder.

Refer to the "submission" folder.

Phase 2 - stockDirection

Refer to the "dataset" folder.

Refer to the "submission" folder.

Rules

[Participants] (https://github.com/ehameyie/IMFS-Summer-Kaggle/blob/master/Teams.csv)

[Scoring Board Leaders] (https://github.com/ehameyie/IMFS-Summer-Kaggle/blob/master/Final_scores_2018_07_31_sorted.csv)

About

Topics

Resources

License

Stars

Watchers

Forks

Languages