
Github_Crawler

Crawl GitHub data using the GitHub API (or by scraping, without the API), then store it in a MySQL database.

You need to install:

File introduction

  1. CrawlerWithoutAPI.py: a demo that crawls GitHub without the API; given a repository URL, it returns the crawled result (sketched after this list)
  2. GithubCrawler.py: crawls GitHub through the API; searches all Java projects and extracts the readme, description, topics, and all dependency files (.gradle and .pom)
  3. MysqlOption.py: MySQL operations; creates the database and tables, and inserts/searches records
  4. CleanUtils.py: tools for cleaning and extraction
  5. token_key: your GitHub API token
  6. data_prepare.py: prepares data from the database for deep learning or data analysis
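
For orientation, here is a minimal sketch of the no-API approach in CrawlerWithoutAPI.py, assuming the `requests` and `beautifulsoup4` packages; the function name and the extraction logic are illustrative, not the repo's actual code.

```python
# Illustrative sketch only, not the code in CrawlerWithoutAPI.py.
# Assumes `requests` and `beautifulsoup4` are installed.
import requests
from bs4 import BeautifulSoup

def crawl_repo(repo_url):
    """Fetch a repository page and return some of its visible text."""
    resp = requests.get(repo_url, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    # Real scraping would target specific page elements instead.
    return soup.get_text(separator="\n", strip=True)

if __name__ == "__main__":
    print(crawl_repo("https://github.com/yang1young/GithubCrawler")[:300])
```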

User guide

Follow these steps:

  1. Generate your GitHub personal access token at https://github.com/settings/tokens
  2. Create a new file in the project root named token_key, then copy and paste your personal access token into it (no need to add a trailing \n)
  3. Modify MysqlOption.py and set your MySQL USER and PASSWORD
  4. Run MysqlOption.py to create the database and table
  5. Modify GithubCrawler.py: set START_FROM_TIME, END_TO_TIME, and START_FROM_PAGE. The search matches roughly 3.2 million projects, but each query returns at most 1000 results in total, and each request returns at most 100. The crawler therefore splits the results into windows by repository creation time, keeping each query's total under 1000, and then pages through each window to collect everything (see the sketch after this list)
  6. Happily run GithubCrawler.py
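
The time-window and paging strategy in step 5 can be pictured as follows. This is a hedged sketch against the public GitHub Search API using `requests`; the function name and example dates are hypothetical, and the real GithubCrawler.py may structure this differently.

```python
# Sketch of step 5's strategy: split by creation time, then page.
# Uses the public GitHub Search API via `requests`; names and dates
# here are illustrative, not taken from GithubCrawler.py.
import requests

TOKEN = open("token_key").read().strip()
HEADERS = {"Authorization": f"token {TOKEN}"}

def search_java_repos(created_from, created_to, page):
    """Return one page (up to 100 repos) of Java projects created in a window."""
    params = {
        "q": f"language:java created:{created_from}..{created_to}",
        "per_page": 100,  # API maximum per request
        "page": page,     # pages 1-10 cover the 1000-result cap per query
    }
    resp = requests.get("https://api.github.com/search/repositories",
                        headers=HEADERS, params=params)
    resp.raise_for_status()
    return resp.json()["items"]

# A window narrow enough that the query matches fewer than 1000 repos.
for page in range(1, 11):
    items = search_java_repos("2017-01-01", "2017-01-07", page)
    if not items:
        break
    for repo in items:
        print(repo["full_name"])
```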

If you find this project helpful, would you give the repo a STAR? :)
