This folder contains the pre-training scripts for LERT, which are mainly based on Google's BERT implementation. The original implementation targets TensorFlow 1.15 with Cloud TPU devices. `run.pretrain.sh` is the starting script (on TPU):
```bash
#!/bin/bash
set -ex
TPU_NAME="your-tpu-name"
TPU_ZONE="your-tpu-zone"
DATA_DIR=./your-path-to-tfrecords
MODEL_DIR=./your-path-to-model-saving
CONFIG_FILE=./your-path-to-config-file
# run pretraining
python run_pretraining.py \
  --input_file=${DATA_DIR}/tf_examples.tfrecord.* \
  --output_dir=${MODEL_DIR} \
  --do_train=True \
  --bert_config_file=${CONFIG_FILE} \
  --train_batch_size=1024 \
  --eval_batch_size=1024 \
  --max_seq_length=512 \
  --max_predictions_per_seq=75 \
  --num_train_steps=2000000 \
  --num_warmup_steps=10000 \
  --save_checkpoints_steps=50000 \
  --learning_rate=1e-4 \
  --do_lower_case=True \
  --use_tpu=True \
  --tpu_name=${TPU_NAME} \
  --tpu_zone=${TPU_ZONE}
```
- The linguistic tags used in the paper are listed in lines 110-112 of `run_pretraining.py`: the POS feature has 28 tags, the NER feature has 13 tags, and the DEP feature has 14 tags. Please do not change their order, as they map directly to our released pre-trained weights (if you would like to load these weights and perform further pre-training or linguistic prediction). A minimal sketch of turning these lists into label IDs is given after these notes.
```python
POS_LIST = ["POS-n", "POS-v", "POS-wp", "POS-u", "POS-d", "POS-a", "POS-m", "POS-p", "POS-r", "POS-ns", "POS-c", "POS-q", "POS-nt", "POS-nh", "POS-nd", "POS-j", "POS-i", "POS-b", "POS-ni", "POS-nz", "POS-nl", "POS-z", "POS-k", "POS-ws", "POS-o", "POS-h", "POS-e", "POS-%"]
NER_LIST = ["NER-O", "NER-S-Ns", "NER-S-Nh", "NER-B-Ni", "NER-E-Ni", "NER-I-Ni", "NER-S-Ni", "NER-B-Ns", "NER-E-Ns", "NER-I-Ns", "NER-B-Nh", "NER-E-Nh", "NER-I-Nh"]
DEP_LIST = ["DEP-ATT", "DEP-WP", "DEP-ADV", "DEP-VOB", "DEP-SBV", "DEP-COO", "DEP-RAD", "DEP-HED", "DEP-POB", "DEP-CMP", "DEP-LAD", "DEP-FOB", "DEP-DBL", "DEP-IOB"]
```
- To perform linguistically-informed pre-training, please specify the end step of scaling for each linguistic feature. Lines 292-294 in `run_pretraining.py` show the values used in our paper; an illustrative (hypothetical) schedule is also sketched after these notes.
- You MUST generate the tfrecords yourself before using this script. For intellectual property reasons, we DO NOT provide scripts for data generation. You can use BERT's original `create_pretraining_data.py` implementation (possibly with this tutorial) and adjust it to our pre-training task. The adjustments include:
  - Performing whole word masking and N-gram masking.
  - Generating linguistic features for the masked tokens using LTP (or another similar tool). Once again, please note that if you would like to reuse the pre-trained linguistic head weights, generate linguistic tags only from the lists provided above (note #1). A sketch of aligning word-level tags to wordpiece tokens is given after these notes.
  - Removing the next sentence prediction (NSP) task.
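As referenced in the first note, here is a minimal, hypothetical sketch (not part of the released code) of how the tag lists can be turned into label IDs. `build_tag_to_id` is an illustrative helper name; the DEP list is copied verbatim from the lists above.

```python
# Minimal sketch (assumption, not from the repository): map each linguistic tag
# to an integer ID by its position in the list. The order must match the lists
# above, because those indices are what the released linguistic-head weights
# were trained against.

def build_tag_to_id(tag_list):
    """Return a dict mapping every tag string to its index in the list."""
    return {tag: idx for idx, tag in enumerate(tag_list)}

# Example with the DEP tags, copied from the lists above:
DEP_LIST = ["DEP-ATT", "DEP-WP", "DEP-ADV", "DEP-VOB", "DEP-SBV", "DEP-COO",
            "DEP-RAD", "DEP-HED", "DEP-POB", "DEP-CMP", "DEP-LAD", "DEP-FOB",
            "DEP-DBL", "DEP-IOB"]

dep_to_id = build_tag_to_id(DEP_LIST)
assert dep_to_id["DEP-HED"] == 7  # "DEP-HED" is the 8th entry in the list above
```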
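For the second note, the actual schedule is defined in `run_pretraining.py` (lines 292-294). Purely as an illustration of what an "end step of scaling" could mean, the sketch below assumes a simple linear ramp of each linguistic task's loss weight; both the schedule shape and the end-step values are placeholders, not the ones used in the paper.

```python
# Hypothetical illustration only: a linear ramp of each linguistic task's loss
# weight up to its end step. The real schedule and end steps are defined in
# run_pretraining.py (lines 292-294); do not treat these numbers as the paper's.

def linguistic_task_scale(global_step, end_step):
    """Weight ramps linearly from 0 to 1 until end_step, then stays at 1."""
    return min(float(global_step) / float(end_step), 1.0)

END_STEPS = {"pos": 500000, "ner": 1000000, "dep": 1500000}  # placeholder end steps
task_loss = {"pos": 0.8, "ner": 1.1, "dep": 1.3}             # placeholder loss values
global_step = 750000                                          # example step

total_linguistic_loss = sum(
    linguistic_task_scale(global_step, END_STEPS[name]) * task_loss[name]
    for name in ("pos", "ner", "dep"))
# At step 750000: pos weight = 1.0, ner weight = 0.75, dep weight = 0.5.
```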
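For the data-generation note, one detail that is easy to get wrong is propagating word-level LTP tags to the wordpiece tokens that actually get masked. The sketch below is a hypothetical illustration (none of these helpers exist in the repository); it assumes you already have, for each segmented word, a (POS, NER, DEP) tag triple drawn from the lists in note #1.

```python
# Hypothetical sketch: give every wordpiece the linguistic tags of the word it
# came from, so whole-word / N-gram masking can label all pieces of a masked
# word consistently. `tokenize_fn` stands in for a real WordPiece tokenizer
# (e.g. BERT's FullTokenizer.tokenize).

def align_tags_to_pieces(words, word_tags, tokenize_fn):
    """Expand word-level (POS, NER, DEP) tag triples to wordpiece level."""
    pieces, piece_tags = [], []
    for word, tags in zip(words, word_tags):
        word_pieces = tokenize_fn(word)
        pieces.extend(word_pieces)
        piece_tags.extend([tags] * len(word_pieces))  # every piece inherits its word's tags
    return pieces, piece_tags

# Toy usage with a fake character-level tokenizer (the tags are illustrative):
toy_tokenize = lambda w: list(w)
pieces, piece_tags = align_tags_to_pieces(
    ["哈尔滨", "是", "省会"],
    [("POS-ns", "NER-S-Ns", "DEP-SBV"),
     ("POS-v", "NER-O", "DEP-HED"),
     ("POS-n", "NER-O", "DEP-VOB")],
    toy_tokenize)
# pieces == ['哈', '尔', '滨', '是', '省', '会']; the first three share one tag triple.
```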