The Llama family of models are large language models released by Meta (formerly Facebook). These decoder-only transformer models are used for generation tasks. Nearly all decoder-only models nowadays use the Byte-Pair Encoding (BPE) algorithm for tokenization. In this article, you will learn about BPE. In particular, you will learn:
- What BPE is compared to other tokenization algorithms
- How to prepare a dataset and train a BPE tokenizer
- How to use the tokenizer
Training a Tokenizer for a Llama Model
Photo by Joss Woodhead. Some rights reserved.
Let's get started.
Overview
This article is divided into four parts; they are:
- Understanding BPE
- Training a BPE tokenizer with the Hugging Face tokenizers library
- Training a BPE tokenizer with the SentencePiece library
- Training a BPE tokenizer with the tiktoken library
Understanding BPE
Byte-Pair Encoding (BPE) is a tokenization algorithm used to split text into sub-word units. Instead of splitting text into only words and punctuation, BPE can further break up words so that prefixes, stems, and suffixes can each be associated with meaning in the language model. Without sub-word tokenization, a language model would find it difficult to learn that "happy" and "unhappy" are antonyms of each other.
BPE is not the only sub-word tokenization algorithm. WordPiece, which is the default for BERT, is another. A well-implemented BPE does not need an "unknown" token in the vocabulary, and nothing is OOV (out of vocabulary) in BPE. This is because BPE can start with the 256 byte values (hence it is commonly known as byte-level BPE) and then merge the most frequent pairs of tokens into new vocabulary entries until the desired vocabulary size is reached.
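The merge loop at the heart of BPE can be sketched in a few lines of plain Python. Below is a toy character-level illustration of the idea, not the optimized byte-level implementation the libraries use; `train_toy_bpe` is a hypothetical helper name for this sketch:

```python
from collections import Counter

def train_toy_bpe(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Greedy BPE: repeatedly merge the most frequent adjacent pair of symbols."""
    # Start with each word as a sequence of characters
    # (real byte-level BPE starts from the 256 byte values instead)
    words = [list(word) for word in corpus]
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair across the corpus
        pairs = Counter()
        for symbols in words:
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the best pair with the merged symbol
        merged = best[0] + best[1]
        for i, symbols in enumerate(words):
            out, j = [], 0
            while j < len(symbols):
                if j + 1 < len(symbols) and (symbols[j], symbols[j + 1]) == best:
                    out.append(merged)
                    j += 2
                else:
                    out.append(symbols[j])
                    j += 1
            words[i] = out
    return merges

merges = train_toy_bpe(["happy", "unhappy", "happier"], num_merges=4)
print(merges)  # [('h', 'a'), ('ha', 'p'), ('hap', 'p'), ('happ', 'y')]
```

Notice how the shared stem "happ" emerges from the merges, which is exactly why "happy" and "unhappy" end up sharing sub-word tokens.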
Nowadays, BPE is the tokenization algorithm of choice for most decoder-only models. However, you do not need to implement your own BPE tokenizer from scratch. Instead, you can use tokenizer libraries such as Hugging Face's tokenizers, OpenAI's tiktoken, or Google's sentencepiece.
Training a BPE Tokenizer with the Hugging Face tokenizers Library
To train a BPE tokenizer, you need to prepare a dataset so the tokenizer algorithm can determine the most frequent pairs of tokens to merge. For decoder-only models, a subset of the model's training data is usually acceptable.
Training a tokenizer is time-consuming, especially for large datasets. However, unlike a language model, a tokenizer does not need to learn the language context of the text, only how often tokens appear in a typical text corpus. While you may have trillions of tokens to train a language model, you only need a few million tokens to train a tokenizer.
As mentioned in a previous article, there are several well-known text datasets for language model training. For a toy project, you may want a smaller dataset for faster experimentation. The HuggingFaceFW/fineweb dataset is a good choice for this purpose. At its full size, it is a 15-trillion-token dataset, but it also comes in 10B, 100B, and 350B sizes for smaller projects. The dataset is derived from Common Crawl and filtered by Hugging Face to improve data quality.
Below is how you can print a few samples from the dataset:
```python
import datasets

dataset = datasets.load_dataset("HuggingFaceFW/fineweb", name="sample-10BT", split="train", streaming=True)
count = 0
for sample in dataset:
    print(sample)
    count += 1
    if count >= 5:
        break
```
Running this code will print the following:
```
{'text': '*sigh* Fundamentalist community, let me pass on some advice to you I learne…',
 'id': '…',
 'url': 'http://endogenousretrovirus.blogspot.com/2007/11/if-you-have-set-yourself-on…',
 'date': '2013-05-18T06:43:03Z',
 'file_path': 's3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/warc…',
 'language': 'en',
 'language_score': 0.9737711548805237,
 'token_count': 703}
…
```
For training a tokenizer (or even a language model), you only need the text field of each sample.
To train a BPE tokenizer using the tokenizers library, you simply feed the text samples to the trainer. Below is the complete code:
```python
from typing import Iterator

import datasets
from tokenizers import Tokenizer, models, trainers, pre_tokenizers, decoders, normalizers

# Load FineWeb 10B sample (using only a slice for demo to save memory)
dataset = datasets.load_dataset("HuggingFaceFW/fineweb", name="sample-10BT", split="train", streaming=True)

def get_texts(dataset: datasets.Dataset, limit: int = 100_000) -> Iterator[str]:
    """Get texts from the dataset until the limit is reached or the dataset is exhausted"""
    count = 0
    for sample in dataset:
        yield sample["text"]
        count += 1
        if limit and count >= limit:
            break

# Initialize a BPE model: either byte_fallback=True or set unk_token="[UNK]"
tokenizer = Tokenizer(models.BPE(byte_fallback=True))
tokenizer.normalizer = normalizers.NFKC()
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True, use_regex=False)
tokenizer.decoder = decoders.ByteLevel()

# Trainer
trainer = trainers.BpeTrainer(
    vocab_size=25_000,
    min_frequency=2,
    special_tokens=["[PAD]", "[CLS]", "[SEP]", "[MASK]"],
    show_progress=True,
)

# Train and save the tokenizer to disk
texts = get_texts(dataset, limit=10_000)
tokenizer.train_from_iterator(texts, trainer=trainer)
tokenizer.save("bpe_tokenizer.json")

# Reload the tokenizer from disk
tokenizer = Tokenizer.from_file("bpe_tokenizer.json")

# Test: encode/decode
text = "Let's have a pizza party! 🍕"
enc = tokenizer.encode(text)
print("Token IDs:", enc.ids)
print("Decoded:", tokenizer.decode(enc.ids))
```
When you run this code, you will see:
```
Resolving data files: 100%|███████████████████████| 27468/27468 [00:03<00:00, 7792.97it/s]
[00:00:01] Pre-processing sequences ████████████████████████████ 0 / 0
[00:00:02] Tokenize words           ████████████████████████████ 10000 / 10000
[00:00:00] Count pairs              ████████████████████████████ 10000 / 10000
[00:00:38] Compute merges           ████████████████████████████ 24799 / 24799
Token IDs: [3548, 277, 396, 1694, 14414, 227, 12060, 715, 9814, 180, 188]
Decoded: Let's have a pizza party! 🍕
```
To avoid loading the full dataset at once, use the streaming=True argument in the load_dataset() function. The tokenizers library expects only text for training BPE, so the get_texts() function yields text samples one at a time. The loop terminates when the limit is reached, since the whole dataset is not needed to train a tokenizer.
To create a byte-level BPE, set the byte_fallback=True argument in the BPE model and configure the ByteLevel pre-tokenizer and decoder. Adding an NFKC normalizer is also recommended to clean up Unicode text for better tokenization.
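To see the kind of cleanup NFKC performs, you can experiment with Python's standard unicodedata module, which implements the same Unicode normalization forms as the tokenizers normalizer:

```python
import unicodedata

# NFKC maps compatibility characters to their canonical equivalents:
# the ligature "ﬁ" becomes "fi", fullwidth letters become ASCII,
# and the circled digit "①" becomes "1"
for s in ["ﬁle", "Ｈｅｌｌｏ", "①"]:
    print(s, "->", unicodedata.normalize("NFKC", s))
```

Without this step, visually identical strings such as "ﬁle" and "file" would be tokenized differently, wasting vocabulary entries.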
For a decoder-only model, you will also need special tokens such as <pad>, <bos>, and <eos>. The <eos> token signals the end of a text sequence, allowing the model to declare when sequence generation is complete.
Once the tokenizer is trained, save it to a file for later use. To use a tokenizer, call the encode() method to convert text into a sequence of token IDs, or the decode() method to convert token IDs back to text.
Note that the code above sets a small vocabulary size of 25,000 and limits the training dataset to 10,000 samples for demonstration purposes, so that training completes in a reasonable time. In practice, use a larger vocabulary size and training dataset so the language model can capture the diversity of the language. For reference, the vocabulary size of Llama 2 is 32,000 and that of Llama 3 is 128,256.
Training a BPE Tokenizer with the SentencePiece Library
As an alternative to Hugging Face's tokenizers library, you can use Google's sentencepiece library. The library is written in C++ and is fast, though its API and documentation are less polished than those of the tokenizers library.
The previous code rewritten using the sentencepiece library is as follows:
```python
from typing import Iterator

import datasets
import sentencepiece as spm

# Load FineWeb 10B sample (using only a slice for demo to save memory)
dataset = datasets.load_dataset("HuggingFaceFW/fineweb", name="sample-10BT", split="train", streaming=True)

def get_texts(dataset: datasets.Dataset, limit: int = 100_000) -> Iterator[str]:
    """Get texts from the dataset until the limit is reached or the dataset is exhausted"""
    count = 0
    for sample in dataset:
        yield sample["text"]
        count += 1
        if limit and count >= limit:
            break

# Train; special token IDs are defined explicitly
spm.SentencePieceTrainer.Train(
    sentence_iterator=get_texts(dataset, limit=10_000),
    byte_fallback=True,
    model_prefix="sp_bpe",
    vocab_size=32_000,
    model_type="bpe",
    unk_id=0,
    bos_id=1,
    eos_id=2,
    pad_id=3,  # set to -1 to disable
    character_coverage=1.0,
    input_sentence_size=10_000,
    shuffle_input_sentence=False,
)

# Load the trained SentencePiece model
sp = spm.SentencePieceProcessor(model_file="sp_bpe.model")

# Test: encode/decode
text = "Let's have a pizza party! 🍕"
ids = sp.encode(text, out_type=int, enable_sampling=False)  # default: no special tokens
tokens = sp.encode(text, out_type=str, enable_sampling=False)
print("Tokens:", tokens)
print("Token IDs:", ids)
decoded = sp.decode(ids)
print("Decoded:", decoded)
```
When you run this code, you will see:
```
…
Tokens: ['▁Let', "'", 's', '▁have', '▁a', '▁pizza', '▁party', '!', '▁', '<0xF0>', '<0x9F>', '<0x8D>', '<0x95>']
Token IDs: [2703, 31093, 31053, 422, 261, 10404, 3064, 31115, 31046, 244, 163, 145, 153]
Decoded: Let's have a pizza party! 🍕
```
The trainer in SentencePiece is more verbose than the one in tokenizers, both in code and output. The key is to set byte_fallback=True in the SentencePieceTrainer; otherwise, the tokenizer may require an unknown token. The emoji in the test text serves as a corner case to verify that the tokenizer can handle unseen Unicode characters, which byte-level BPE should handle gracefully.
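You can verify those byte tokens yourself with plain Python: the pizza emoji is four bytes in UTF-8, which is exactly what the byte fallback emits as individual `<0x..>` tokens:

```python
# The pizza emoji is not in the vocabulary, so byte fallback splits it
# into its UTF-8 bytes, matching the <0xF0><0x9F><0x8D><0x95> tokens above
raw = "🍕".encode("utf-8")
print([f"<0x{b:02X}>" for b in raw])  # ['<0xF0>', '<0x9F>', '<0x8D>', '<0x95>']
print(raw.decode("utf-8"))            # 🍕
```

Because any Unicode string can be decomposed into bytes this way, no input is ever out of vocabulary.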
Training a BPE Tokenizer with the tiktoken Library
The third library you can use for BPE tokenization is OpenAI's tiktoken library. While it is easy to load pre-trained tokenizers, training with this library is not recommended.
The code in the previous sections can be rewritten using the tiktoken library as follows:
```python
from typing import Iterator

import datasets
import tiktoken
from tiktoken._educational import SimpleBytePairEncoding

# Load FineWeb 10B sample (using only a slice for demo to save memory)
dataset = datasets.load_dataset("HuggingFaceFW/fineweb", name="sample-10BT", split="train", streaming=True)

def get_texts(dataset: datasets.Dataset, limit: int = 100_000) -> Iterator[str]:
    """Get texts from the dataset until the limit is reached or the dataset is exhausted"""
    count = 0
    for sample in dataset:
        yield sample["text"]
        count += 1
        if count >= limit:
            break

# Collect texts up to some manageable limit for tokenizer training
limit = 1_000
texts = "\n".join(get_texts(dataset, limit=limit))

# Train a simple BPE tokenizer
pat_str = r"""'s|'t|'re|'ve|'m|'ll|'d| ?[\p{L}]+| ?[\p{N}]+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
enc_simple = SimpleBytePairEncoding.train(training_data=texts, vocab_size=300, pat_str=pat_str)

# Convert to a real tiktoken Encoding
enc = tiktoken.Encoding(
    name="my_bpe",
    pat_str=enc_simple.pat_str,  # same regex used during training
    mergeable_ranks=enc_simple.mergeable_ranks,
    special_tokens={},
)

# Test
text = "Let's have a pizza party! 🍕"
tok_ids = enc.encode(text)
print("Token IDs:", tok_ids)
print("Decoded:", enc.decode(tok_ids))
```
When you run this code, you will see:
```
…
Token IDs: [76, 101, 116, 39, 115, 293, 97, 118, 101, 257, 278, 105, 122, 122, 97, 278, 286, 116, 121, 33, 32, 240, 159, 141, 149]
Decoded: Let's have a pizza party! 🍕
```
The tiktoken library does not have an optimized trainer. The only available module is a Python implementation of the BPE algorithm via the SimpleBytePairEncoding class. To train a tokenizer, you need to define how the input text should be split into words using the pat_str argument, which defines a "word" with a regular expression.
The training output is a dictionary called mergeable ranks, which contains pairs of tokens that can be merged, together with their merge priorities. To create a tokenizer, simply pass the pat_str and mergeable_ranks arguments to the Encoding class.
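To see how mergeable ranks drive encoding, here is a simplified pure-Python sketch of the idea: at each step, the adjacent pair whose merged form has the lowest rank is merged first. The tiny rank table is hypothetical, for illustration only, and this is not tiktoken's actual (optimized) implementation:

```python
def bpe_encode(word: bytes, ranks: dict[bytes, int]) -> list[bytes]:
    """Merge the adjacent pair with the lowest rank until no merge applies."""
    parts = [bytes([b]) for b in word]  # start from individual bytes
    while True:
        best_i, best_rank = None, None
        # Find the adjacent pair whose merged token has the lowest rank
        for i in range(len(parts) - 1):
            rank = ranks.get(parts[i] + parts[i + 1])
            if rank is not None and (best_rank is None or rank < best_rank):
                best_i, best_rank = i, rank
        if best_i is None:
            return parts  # no more merges possible
        parts = parts[:best_i] + [parts[best_i] + parts[best_i + 1]] + parts[best_i + 2:]

# Hypothetical tiny rank table: single bytes first, then learned merges
ranks = {b"p": 0, b"i": 1, b"z": 2, b"a": 3, b"zz": 4, b"pi": 5, b"pizz": 6, b"pizza": 7}
print(bpe_encode(b"pizza", ranks))  # [b'pizza']
print(bpe_encode(b"piz", ranks))    # [b'pi', b'z']
```

Lower rank means higher merge priority, so the ranks learned during training fully determine how any byte string is tokenized.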
Note that the tokenizer in tiktoken does not have a save function. Instead, save the pat_str and mergeable_ranks values if needed.
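One way to persist the ranks is the plain-text format that tiktoken's own .tiktoken vocabulary files use: one base64-encoded token and its rank per line. A minimal sketch, using a hypothetical tiny rank table and file name:

```python
import base64

def save_ranks(mergeable_ranks: dict[bytes, int], path: str) -> None:
    """Write one 'base64(token) rank' pair per line, as tiktoken's .tiktoken files do."""
    with open(path, "w") as f:
        for token, rank in mergeable_ranks.items():
            f.write(f"{base64.b64encode(token).decode()} {rank}\n")

def load_ranks(path: str) -> dict[bytes, int]:
    """Read the format back into a mergeable-ranks dictionary."""
    ranks = {}
    with open(path) as f:
        for line in f:
            token_b64, rank = line.split()
            ranks[base64.b64decode(token_b64)] = int(rank)
    return ranks

# Round-trip a tiny hypothetical rank table
ranks = {b"a": 0, b"b": 1, b"ab": 2}
save_ranks(ranks, "my_bpe.tiktoken")
assert load_ranks("my_bpe.tiktoken") == ranks
```

The pat_str is an ordinary string, so it can be stored alongside the file or in your code.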
Since training is done in pure Python, it is very slow. Training your own tokenizer this way is not recommended.
Further Readings
Below are some resources that you may find useful:
Summary
In this article, you learned about byte-level BPE and how to train a BPE tokenizer. Specifically, you learned how to train a BPE tokenizer with the tokenizers, sentencepiece, and tiktoken libraries. You also learned that a tokenizer can encode text into a list of integer token IDs and decode them back to text.


