EmuBert Creator

EmuBert is the largest open-source masked language model for Australian law. This repository preserves the code used to create EmuBert.

If you're looking to download EmuBert, you may do so on Hugging Face.

Setup 🛠️

The EmuBert Creator has only been tested on Python 3.11 but should work for later versions and may also work for earlier versions.

To set up the Creator, start by running the following commands:

git clone https://github.com/umarbutler/emubert-creator.git
cd emubert-creator
pip install -r requirements.txt

Next, download the version of the Open Australian Legal Corpus you'd like to train EmuBert on by navigating to its changelog, clicking on the version number you'd like to use, clicking on the file named corpus.jsonl and finally hitting 'download'. Any version of the Corpus that begins with the number 4 should be compatible with the Creator. The specific version of the Corpus used to produce EmuBert is 4.2.1 and can be downloaded here.

Finally, you can either place the Corpus in a directory named data in the root of the repository, define an environment variable named OALC that points to the Corpus or override the corpus_path variable in scripts/config.py.
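The three options above can be pictured as a simple lookup. The following is a minimal sketch only and is not taken from scripts/config.py: the function name resolve_corpus_path, the precedence order (explicit override, then the OALC environment variable, then the data directory) and the default filename corpus.jsonl inside data are all assumptions made for illustration.

```python
import os
from pathlib import Path


def resolve_corpus_path(repo_root, override=None, env=None):
    """Return the path the Corpus would be read from.

    Hypothetical helper: the precedence order and the default filename
    are assumptions, not the Creator's actual logic.
    """
    # Allow a mapping to be injected so the lookup is easy to test.
    env = os.environ if env is None else env

    # 1. An explicit override, standing in for editing corpus_path in config.py.
    if override is not None:
        return Path(override)

    # 2. The OALC environment variable, if it is set and non-empty.
    if env.get("OALC"):
        return Path(env["OALC"])

    # 3. Fall back to the data directory in the root of the repository
    #    (assumed filename: corpus.jsonl).
    return Path(repo_root) / "data" / "corpus.jsonl"
```

For example, resolve_corpus_path(".") falls back to data/corpus.jsonl when OALC is not set and no override is given.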

Usage 👩‍💻

To train EmuBert, run the following scripts in the scripts directory in order:

preprocess.py, which cleans documents, splits them into training, validation and test sets, filters out short documents from the training set, deduplicates the training set, trains a tokeniser and finally saves the resulting data.
block.py, which splits texts into blocks of the same size as EmuBert's context window and saves them.
train.py, which trains EmuBert and saves it to a directory named model (unless the model_dir variable in config.py is overridden). If training is interrupted at any point, set the script's RESUME variable to True.
convert.py, which converts EmuBert from a Better Transformer into a vanilla Transformer.
benchmark.py, which benchmarks EmuBert against other popular masked language models.
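If you would rather not launch each script by hand, the order above can be captured in a small wrapper. This is a rough sketch and not part of the repository: the helper names build_commands and run_pipeline are hypothetical, and it assumes the scripts take no command-line arguments and can be launched from the root of the repository.

```python
import subprocess
import sys
from pathlib import Path

# The scripts in the order given above.
PIPELINE = ["preprocess.py", "block.py", "train.py", "convert.py", "benchmark.py"]


def build_commands(scripts_dir="scripts", python=sys.executable):
    """Build one command per script, in pipeline order (hypothetical helper)."""
    return [[python, str(Path(scripts_dir) / name)] for name in PIPELINE]


def run_pipeline(scripts_dir="scripts"):
    """Run every script in order, stopping at the first failure."""
    for command in build_commands(scripts_dir):
        # check=True raises CalledProcessError if a script exits with an error,
        # so later stages never run on incomplete outputs.
        subprocess.run(command, check=True)
```

Calling run_pipeline() from the root of the repository would then run all five stages back to back. Since an interrupted training run is resumed by setting RESUME to True in train.py, it may be more practical to run train.py on its own than to restart the whole pipeline.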

Licence 📜

The Creator is licensed under the MIT License.
