pullreqs-dnn

Trying to predict whether a PR will be merged just by examining its diff

Installing dependencies

This project uses Keras. Keras supports multiple backends, including Tensorflow and Theano; the instructions below are for Tensorflow.
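
Keras selects its backend via the KERAS_BACKEND environment variable (or ~/.keras/keras.json). If Keras defaults to Theano on your machine, you can switch explicitly:

# Make Keras use the Tensorflow backend
export KERAS_BACKEND=tensorflow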

It is also advisable to use CUDA if you have a compatible GPU; installation instructions can be found on NVIDIA's site.

Then, pick a Tensorflow build by exporting one of the following environment variables.

# Tensorflow without GPU
export TF_BINARY_URL=https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-0.12.0rc0-cp27-none-linux_x86_64.whl

# Tensorflow with GPU
export TF_BINARY_URL=https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-0.12.0rc0-cp27-none-linux_x86_64.whl

Then, install all dependencies:

sudo apt-get install python-pip python-dev libhdf5-dev
sudo pip install --upgrade $TF_BINARY_URL
sudo pip install pandas keras h5py pbr funcsigs
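
To check that the installation worked, you can import both libraries once (this sanity check is optional and not part of the original instructions):

# Should print the Tensorflow version and "Using TensorFlow backend."
python -c "import tensorflow as tf; import keras; print(tf.__version__)"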

Running

Training the network consists of three steps: 1. split the data, 2. preprocess the data, 3. train the model. The idea is that you tag a dataset with a prefix while splitting, and then reuse that prefix for preprocessing and training so that all steps operate on the same dataset version, as in the sketch below.
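
A minimal end-to-end run might look like this (the prefix and language are illustrative; all other options keep their defaults):

# Split, preprocess and train one dataset version tagged "ruby"
./split.py --prefix=ruby --langs ruby
./preprocess.py --prefix=ruby
./train_model1.py --prefix=ruby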

Splitting

The splitting script downloads the required datasets (~15 GB) and transforms them into a format the preprocessing script can read. It also lets you filter by language and configure the balance ratio between merged and unmerged PRs (the dataset is very unbalanced: roughly 85% of PRs on GitHub are merged). The data is then randomly split into a test set and a training set, and the training set is split once more into a training and a validation set.

The script can be configured as follows:

./split.py --help
usage: split.py [-h] [--prefix PREFIX] [--balance_ratio BALANCE_RATIO]
                [--langs [LANGS [LANGS ...]]]
                [--validation_split VALIDATION_SPLIT]
                [--test_split TEST_SPLIT]

optional arguments:
  -h, --help            show this help message and exit
  --prefix PREFIX
  --balance_ratio BALANCE_RATIO
  --langs [LANGS [LANGS ...]]
  --validation_split VALIDATION_SPLIT
  --test_split TEST_SPLIT
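
For example, to build a Ruby-only dataset (the balance ratio and split fractions below are example values, not recommendations):

# Keep only Ruby PRs and tag the resulting dataset with the "ruby" prefix
./split.py --prefix=ruby --langs ruby --balance_ratio 0.5 --validation_split 0.1 --test_split 0.2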

Preprocessing

The preprocessing script takes the CSV files generated by the split script, tokenizes the data and creates a vocabulary from the tokens.

./preprocess.py --help
usage: preprocess.py [-h] [--prefix PREFIX]
                     [--diff_vocabulary_size DIFF_VOCABULARY_SIZE]
                     [--comment_vocabulary_size COMMENT_VOCABULARY_SIZE]
                     [--title_vocabulary_size TITLE_VOCABULARY_SIZE]
                     [--max_diff_sequence_length MAX_DIFF_SEQUENCE_LENGTH]
                     [--max_comment_sequence_length MAX_COMMENT_SEQUENCE_LENGTH]
                     [--max_title_sequence_length MAX_TITLE_SEQUENCE_LENGTH]

optional arguments:
  -h, --help            show this help message and exit
  --prefix PREFIX
  --diff_vocabulary_size DIFF_VOCABULARY_SIZE
  --comment_vocabulary_size COMMENT_VOCABULARY_SIZE
  --title_vocabulary_size TITLE_VOCABULARY_SIZE
  --max_diff_sequence_length MAX_DIFF_SEQUENCE_LENGTH
  --max_comment_sequence_length MAX_COMMENT_SEQUENCE_LENGTH
  --max_title_sequence_length MAX_TITLE_SEQUENCE_LENGTH

You can run the preprocessing script like this (note that language filtering and the validation split are handled by split.py, not here):

# Preprocess the previously split Ruby dataset
./preprocess.py --prefix=ruby --diff_vocabulary_size=20000 --max_diff_sequence_length=200

Training

After preprocessing the data, you'll need to train a model. We have implemented two models: the first accepts only the diffs, while the second also uses the title and the description of the PR. You can configure multiple parameters, for example the number of outputs of the LSTM layer, the number of outputs of the embedding layer, the number of epochs to run, and the batch size (the number of samples per iteration).

Training is configured to stop early if the validation loss has not improved for 5 epochs.

./train_model1.py --help
usage: train_model1.py [-h] [--prefix PREFIX] [--batch_size BATCH_SIZE]
                       [--epochs EPOCHS] [--dropout DROPOUT]
                       [--lstm_output LSTM_OUTPUT]
                       [--embedding_output EMBEDDING_OUTPUT]
                       [--checkpoint CHECKPOINT]

optional arguments:
  -h, --help            show this help message and exit
  --prefix PREFIX
  --batch_size BATCH_SIZE
  --epochs EPOCHS
  --dropout DROPOUT
  --lstm_output LSTM_OUTPUT
  --embedding_output EMBEDDING_OUTPUT
  --checkpoint CHECKPOINT

./train_model2.py --help
usage: train_model2.py [-h] [--prefix PREFIX] [--batch_size BATCH_SIZE]
                       [--epochs EPOCHS] [--dropout DROPOUT]
                       [--lstm_diff_output LSTM_DIFF_OUTPUT]
                       [--lstm_title_output LSTM_TITLE_OUTPUT]
                       [--lstm_comment_output LSTM_COMMENT_OUTPUT]
                       [--diff_embedding_output DIFF_EMBEDDING_OUTPUT]
                       [--title_embedding_output TITLE_EMBEDDING_OUTPUT]
                       [--comment_embedding_output COMMENT_EMBEDDING_OUTPUT]
                       [--checkpoint CHECKPOINT]

optional arguments:
  -h, --help            show this help message and exit
  --prefix PREFIX
  --batch_size BATCH_SIZE
  --epochs EPOCHS
  --dropout DROPOUT
  --lstm_diff_output LSTM_DIFF_OUTPUT
  --lstm_title_output LSTM_TITLE_OUTPUT
  --lstm_comment_output LSTM_COMMENT_OUTPUT
  --diff_embedding_output DIFF_EMBEDDING_OUTPUT
  --title_embedding_output TITLE_EMBEDDING_OUTPUT
  --comment_embedding_output COMMENT_EMBEDDING_OUTPUT
  --checkpoint CHECKPOINT

# Train on the previously produced data
./train_model1.py --prefix=ruby --batch_size=256 --epochs=20
./train_model2.py --prefix=ruby --batch_size=256 --epochs=20
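
The remaining flags from the help output can be passed in the same way; the values below are purely illustrative:

# Example with explicit dropout, LSTM and embedding sizes (illustrative values)
./train_model1.py --prefix=ruby --batch_size=256 --epochs=20 --dropout=0.2 --lstm_output=64 --embedding_output=128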