BERTOverflow

This repository contains a BERT model pre-trained on StackOverflow data, which achieves state-of-the-art performance (with a CRF layer) on software-domain NER. The checkpoints can be downloaded here.

For further details, see the accompanying paper: Code and Named Entity Recognition in StackOverflow

Note: this repository is meant as a reference for BERT pre-training on your own data. First download the original BERT codebase, then apply for TPU access through TFRC, and finally follow this README to run the pre-training.
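
A minimal sketch of using the released model, assuming the checkpoint has been converted to the Hugging Face Transformers format and saved locally (the directory name below is a placeholder, not part of this repository):

# Assumes a locally saved, Transformers-format copy of the checkpoint;
# "./bertoverflow-checkpoint" is a placeholder path.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./bertoverflow-checkpoint")
model = AutoModel.from_pretrained("./bertoverflow-checkpoint")

# Encode a StackOverflow-style sentence and inspect the contextual embeddings.
inputs = tokenizer("Call numpy.reshape() before passing the array to the model.",
                   return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, hidden_size)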

Data

We extract 152M sentences from StackOverflow questions and answers.
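
The extraction script itself is not included here. As a rough sketch, sentences can be pulled from the Posts.xml file of the official StackOverflow data dump along the following lines; the file name, HTML stripping, and sentence splitter are assumptions, not the actual extraction pipeline:

# Rough sketch only: extract plain-text sentences from the StackOverflow
# data dump. Posts.xml and the cleaning choices below are assumptions.
import re
import xml.etree.ElementTree as ET

import nltk  # any sentence splitter would do
nltk.download("punkt", quiet=True)

TAG_RE = re.compile(r"<[^>]+>")

with open("all_lines_from_ques_ans_xml_excluded_ann.txt", "w", encoding="utf-8") as out:
    for _, row in ET.iterparse("Posts.xml"):
        if row.tag != "row":
            continue
        text = TAG_RE.sub(" ", row.attrib.get("Body", ""))  # strip HTML markup
        for sentence in nltk.sent_tokenize(text):            # one sentence per line
            out.write(sentence.strip() + "\n")
        row.clear()  # free the parsed element to limit memory use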

Vocabulary

We create an 80K cased WordPiece vocabulary, limiting the character alphabet to 2K symbols:

import tokenizers

# Train a cased WordPiece vocabulary from scratch on the raw sentence file.
bwpt = tokenizers.BertWordPieceTokenizer(
    vocab_file=None,
    add_special_tokens=True,
    unk_token='[UNK]',
    sep_token='[SEP]',
    cls_token='[CLS]',
    clean_text=True,
    lowercase=False,              # cased vocabulary
    handle_chinese_chars=True,
    strip_accents=True,
    wordpieces_prefix='##'
)
bwpt.train(
    files=["all_lines_from_ques_ans_xml_excluded_ann.txt"],
    vocab_size=80000,             # 80K WordPiece vocabulary
    min_frequency=30,             # drop subwords seen fewer than 30 times
    limit_alphabet=2000,          # cap the character alphabet at 2K symbols
    special_tokens=['[PAD]', '[UNK]', '[CLS]', '[MASK]', '[SEP]']
)
bwpt.save("./", "soft-bert-vocab")
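
As a quick, purely illustrative sanity check, the freshly trained tokenizer can be used to encode a sample sentence:

# Purely illustrative: encode a sample sentence with the trained tokenizer.
encoding = bwpt.encode("How do I convert a std::string to an int in C++?")
print(encoding.tokens)  # WordPiece tokens, including [CLS] and [SEP]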

TF Records

We split the large file and create TFRecords in parallel:

split -l 400000 ../../data/my_data.all my-data- --verbose

ls ../saved_model/softbert/raw_txt_data/ | xargs -n 1 -P 16 -I{} \
    python create_pretraining_data.py \
        --input_file=../saved_model/softbert/raw_txt_data/{} \
        --output_file=../saved_model/softbert/tf_records_data/{}.tfrecord \
        --vocab_file=../saved_model/softbert/vocab.txt \
        --max_seq_length=128 \
        --max_predictions_per_seq=20 \
        --masked_lm_prob=0.15 \
        --random_seed=12345 \
        --dupe_factor=5 \
        --do_whole_word_mask=False \
        --do_lower_case=False
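
To spot-check the output, a single record can be inspected as below; this is an illustrative check, not part of the original pipeline, and the file name is a placeholder:

# Illustrative check of one generated TFRecord; the file name is a placeholder.
# Written for eager execution (default in TF 2.x; under TF 1.15 call
# tf.enable_eager_execution() first).
import tensorflow as tf

dataset = tf.data.TFRecordDataset("../saved_model/softbert/tf_records_data/my-data-aa.tfrecord")
for raw_record in dataset.take(1):
    example = tf.train.Example()
    example.ParseFromString(raw_record.numpy())
    # Expected features: input_ids, input_mask, segment_ids, masked_lm_positions,
    # masked_lm_ids, masked_lm_weights, next_sentence_labels
    print(sorted(example.features.feature.keys()))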

Pre-training

Pre-training is run on a TPU v2-8 with TensorFlow 1.15:

python3 run_pretraining.py \
    --input_file=gs://softbert_data/processed_data/*.tfrecord \
    --output_dir=gs://softbert_data/model_base/ \
    --do_train=True \
    --do_eval=True \
    --bert_config_file=gs://softbert_data/model_base/bert_config.json \
    --train_batch_size=512 \
    --max_seq_length=128 \
    --max_predictions_per_seq=20 \
    --num_train_steps=1500000 \
    --num_warmup_steps=10000 \
    --learning_rate=1e-4 \
    --use_tpu=True \
    --tpu_name=$TPU_NAME \
    --save_checkpoints_steps 100000
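
The bert_config.json referenced above is not shown in this README. Assuming the standard BERT-Base architecture (the model_base output directory suggests this), it can be generated with the vocabulary size matched to the 80K WordPiece vocabulary trained earlier; the values below are illustrative:

# Illustrative only: write a BERT-Base style bert_config.json whose vocab_size
# matches the 80K WordPiece vocabulary above. Adjust the values to your setup.
import json

bert_config = {
    "attention_probs_dropout_prob": 0.1,
    "hidden_act": "gelu",
    "hidden_dropout_prob": 0.1,
    "hidden_size": 768,
    "initializer_range": 0.02,
    "intermediate_size": 3072,
    "max_position_embeddings": 512,
    "num_attention_heads": 12,
    "num_hidden_layers": 12,
    "type_vocab_size": 2,
    "vocab_size": 80000,
}
with open("bert_config.json", "w") as f:
    json.dump(bert_config, f, indent=2)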
