tokenizers

Here are 21 public repositories matching this topic...

xebia-functional / xef

Building applications with LLMs through composability, in Kotlin, Scala, ...

kotlin scala ai functional-programming embeddings artificial-intelligence openai multiplatform agents tokenizers llm chatgpt-api

Updated Jun 4, 2024
Kotlin

jshuadvd / LongRoPE

Implementation of the paper "LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens"

nlp machine-learning natural-language-processing ai deep-learning transformers artificial-intelligence gpt language-model natural-language-inference natural tokenization natural-language-understanding attention-is-all-you-need attention-mechanisms transformer-architecture natural-language-procressing tokenizers llm

Updated Jun 1, 2024
Python

sayakpaul / count-tokens-hf-datasets

This project shows how to derive the total number of training tokens in a large text dataset from 🤗 datasets using Apache Beam and Dataflow.

transformers dataflow apache-beam tokenizers hf-datasets unigram-tokenization

Updated Oct 20, 2022
Python

Prismadic / magnet

the small distributed language model toolkit; fine-tune state-of-the-art LLMs anywhere, rapidly

Updated Mar 29, 2024
Python

megagonlabs / ginza-transformers

Use custom tokenizers in spacy-transformers

nlp natural-language-processing transformers spacy ginza spacy-transformers tokenizers sudachitra

Updated Aug 9, 2022
Python

unfoldingWord / string-punctuation-tokenizer

Small library that provides functions to tokenize a string into an array of words with or without punctuation

javascript nlp segmentation nlp-library tokenizers scripture-open-components

Updated Aug 9, 2023
JavaScript

Hugging-Face-Supporter / tftokenizers

Use Hugging Face Transformers and Tokenizers as reusable TensorFlow SavedModels

nlp natural-language-processing tensorflow tokenizer transformers bert tensorflow-hub tokenizers sentencepie

Updated Mar 29, 2022
Python

arturom / search-analysis

A graphical user interface for the Elasticsearch Analyze API

react elasticsearch text-analysis filters analyzers tokenizers analyze-api

Updated Nov 1, 2023
JavaScript

Beomi / megatronlm_dataset_autotokenizer

Megatron-LM/GPT-NeoX compatible Text Encoder with 🤗Transformers AutoTokenizer.

transformers gpt-neox tokenizers megatron-lm

Updated Nov 16, 2023
Python

Anush008 / tokenizers

Multi-arch bindings for @huggingface/tokenizers.

huggingface tokenizers

Updated Sep 17, 2023
Rust

mickymultani / LLM-Architecture

Visualize some important concepts related to LLM architectures.

transformers attention-mechanism huggingface huggingface-transformers tokenizers llm llm-inference llm-architecture

Updated Oct 16, 2023
Jupyter Notebook

kojix2 / blingfire-crystal

crystal tokenizers

Updated Apr 3, 2023
Crystal

jungsoh / transformers-question-answering

Fine-tuning pre-trained transformer models in TensorFlow and PyTorch for question answering

tensorflow pytorch question-answering babi-dataset pytorch-api distilbert-model huggingface-transformers gradient-tape tokenizers

Updated Feb 5, 2022
Jupyter Notebook

adkwn1 / question-answer-app

Question-and-answer web application using fine-tuned and pre-trained T5 models. The application runs on Streamlit.

python transformers text-generation question-answering summarization t5 streamlit tokenizers

Updated Nov 7, 2023
Jupyter Notebook

symanto-research / merge-tokenizers

Package to align tokens from different tokenizations.

distance transformers tokens tokenizers

Updated Mar 25, 2024
Python

victoryosiobe / kingchop

Kingchop ⚔️ is a JavaScript library for tokenizing (chopping) English text. It applies an extensive set of tokenization rules, which you can adjust easily.

nodejs javascript natural-language-processing text-processing sentence-tokenizer text-tokenization word-tokenizer tokenizers paragraph-tokenizer

Updated Jan 22, 2024
JavaScript

Matesxs / CodeTransformer

github model tokenizer transformers python3 pytorch gpt tensorflow2 gpt2 tokenizers

Updated Jul 31, 2021
Python

s2458588 / wsm-tokenizer

Bachelor Thesis Repository. Wsm-tokenizer (word shape mapping) uses vocabulary comparisons to find probable morphemes in lexemic tokens.

nlp machine-learning lexemes tokenizers

Updated Feb 5, 2023
Jupyter Notebook

OmkarBorhade98 / Text_Summarization

Text Summarization using NLP

nlp transformers tokenizers

Updated Jan 20, 2024
Jupyter Notebook

DanielPFlorian / Transformers-Github-Semantic-Search

NLP Dataset Creation and Semantic Search Demonstration

nlp natural-language-processing transformers semantic-search text-embedding huggingface tokenizers

Updated Feb 27, 2024
Jupyter Notebook
