Out of Vocabulary

我们在NLP任务中一般都会有一个词表，这个词表一般可以使用一些大牛论文中的词表或者一些大公司的词表，或者是从自己的数据集中提取的词。但是无论当后续的训练还是预测，总有可能会出现并不包含在词表中的词，这种情况叫做Out of Vocabulary。

那么当我们遇到OOV的问题时，有以下解决方式。

Ignore

直接忽略OOV的情形，也就是不做处理，效果肯定不好。

`UNK`

这种方式就是把所有超出词表的词，转换为统一的\<unk>标识，对于性能的提升有限，特别是当超出词表的词是比较重要的词时。

Enlarge vocabulary

扩充词表就是把所有的词都尽可能的添加到词表中（包括非常低频率的词），这样就会带来问题：

计算量大，速度变慢，因为每次要预测的数量会变大；
低频率词的训练样本会很少，难以充分训练，最终结果就是训练不到一个理想的效果。

Individual Character

拆分成字符（如英文字母）训练，缺点就是丢失了语义和语法。

Spell Check

拼写检查，避免因为数据错误而导致的OOV。

`subword`

子词（类似于词根的思想），此方法介于单词和字符之间，借助n-gram实现。

缺点：会生成非常多的子词，为了解决这一问题，利用hash将词哈希到数字以减少内存的占用。

`BPE`（Byte Pair Encoding）

BPE的方法其实是对subword方法的一种优化，就是减少子词。在GPT和RoBERTa中就是使用的BPE的方法。

代码实现

自己动手实现（简化版）

import re
from typing import Dict, Tuple, List, Set

# 这里的词表需要根据自己的文本，把词的频率统计出来，并按字符分割
vocab = {
    'l o w </w>': 5,
    'l o w e r </w>': 2,
    'n e w e s t </w>': 6,
    'w i d e s t </w>': 3,
    'h a p p i e r </w>': 2
}


# 获取所有的组合和频率
def get_pair_status(vocab: Dict[str, int]) -> Dict[Tuple[str, str], int]:
    pairs = {}
    for word, frequency in vocab.items():
        symbols = word.split()

        for i in range(len(symbols) - 1):
            pair = (symbols[i], symbols[i+1])
            current_frequency = pairs.get(pair, 0)
            pairs[pair] = current_frequency+frequency
    return pairs


pair_status = get_pair_status(vocab)
print(pair_status)

{('l', 'o'): 7, ('o', 'w'): 7, ('w', '</w>'): 5, ('w', 'e'): 8, ('e', 'r'): 4, ('r', '</w>'): 4, ('n', 'e'): 6, ('e', 'w'): 6, ('e', 's'): 9, ('s', 't'): 9, ('t', '</w>'): 9, ('w', 'i'): 3, ('i', 'd'): 3, ('d', 'e'): 3, ('h', 'a'): 2, ('a', 'p'): 2, ('p', 'p'): 2, ('p', 'i'): 2, ('i', 'e'): 2}

# 根据已经筛选到的最大频率组合，合并词表
def merge_vocab(pair: Tuple[str, str], vocab_in: Dict[str, int]) -> Dict[str, int]:
    vocab_out = {}
    pattern = re.escape(' '.join(pair))
    replacement = ''.join(pair)

    for word_in in vocab_in:
        word_out = re.sub(pattern, replacement, word_in)
        vocab_out[word_out] = vocab_in[word_in]

    return vocab_out


# 根据频率值，获取最大字符对
best_pair = max(pair_status, key=pair_status.get)
print(best_pair)

new_vocab = merge_vocab(best_pair, vocab)
print(new_vocab)


('e', 's')
{'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w es t </w>': 6, 'w i d es t </w>': 3, 'h a p p i e r </w>': 2}

上述过程只是一轮迭代的结果，在BPE算法中，最终要生成预先设定好的词表大小的字符对，那么也就是需要不断循环来实现。

# 完整迭代
bpe_codes = {}
nums_merge = 10
for i in range(nums_merge):
    print('\niteration', i)
    pair_status = get_pair_status(vocab)
    if not pair_status:
        break

    best_pair = max(pair_status, key=pair_status.get)
    bpe_codes[best_pair] = i

    print('vocabulary: ', vocab)
    print('best pair: ', best_pair)
    vocab = merge_vocab(best_pair, vocab)

print('\n final vocabulary: ', vocab)
print('\n byte pair encoding: ', bpe_codes)


iteration 0
vocabulary:  {'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w e s t </w>': 6, 'w i d e s t </w>': 3, 'h a p p i e r </w>': 2}
best pair:  ('e', 's')
iteration 1
vocabulary:  {'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w es t </w>': 6, 'w i d es t </w>': 3, 'h a p p i e r </w>': 2}
best pair:  ('es', 't')
iteration 2
vocabulary:  {'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w est </w>': 6, 'w i d est </w>': 3, 'h a p p i e r </w>': 2}
best pair:  ('est', '</w>')
iteration 3
vocabulary:  {'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w est</w>': 6, 'w i d est</w>': 3, 'h a p p i e r </w>': 2}
best pair:  ('l', 'o')
iteration 4
vocabulary:  {'lo w </w>': 5, 'lo w e r </w>': 2, 'n e w est</w>': 6, 'w i d est</w>': 3, 'h a p p i e r </w>': 2}
best pair:  ('lo', 'w')
iteration 5
vocabulary:  {'low </w>': 5, 'low e r </w>': 2, 'n e w est</w>': 6, 'w i d est</w>': 3, 'h a p p i e r </w>': 2}
best pair:  ('n', 'e')
iteration 6
vocabulary:  {'low </w>': 5, 'low e r </w>': 2, 'ne w est</w>': 6, 'w i d est</w>': 3, 'h a p p i e r </w>': 2}
best pair:  ('ne', 'w')
iteration 7
vocabulary:  {'low </w>': 5, 'low e r </w>': 2, 'new est</w>': 6, 'w i d est</w>': 3, 'h a p p i e r </w>': 2}
best pair:  ('new', 'est</w>')
iteration 8
vocabulary:  {'low </w>': 5, 'low e r </w>': 2, 'newest</w>': 6, 'w i d est</w>': 3, 'h a p p i e r </w>': 2}
best pair:  ('low', '</w>')
iteration 9
vocabulary:  {'low</w>': 5, 'low e r </w>': 2, 'newest</w>': 6, 'w i d est</w>': 3, 'h a p p i e r </w>': 2}
best pair:  ('e', 'r')
    
 final vocabulary:  {'low</w>': 5, 'low er </w>': 2, 'newest</w>': 6, 'w i d est</w>': 3, 'h a p p i er </w>': 2}
 byte pair encoding:  {('e', 's'): 0, ('es', 't'): 1, ('est', '</w>'): 2, ('l', 'o'): 3, ('lo', 'w'): 4, ('n', 'e'): 5, ('ne', 'w'): 6, ('new', 'est</w>'): 7, ('low', '</w>'): 8, ('e', 'r'): 9}

这里我们自行实现（思考：怎么判断生成的词数量达到预先设定的词表大小了？），锻炼动手能力，而实际上已经有相关的包封装了BPE的实现过程，主要有下面两种。

Google的`sentencepiece`

参考github项目：https://github.com/google/sentencepiece

安装：pip install sentencepiece

input_file = 'botchan.txt'  # 在上面的github项目中
max_num_words = 10000
model_type = 'bpe'
model_prefix = 'sentencepiece'
pad_id = 0
unk_id = 1
bos_id = 2
eos_id = 3

sentencepiece_params = ' '.join([
    '--input={}'.format(input_file),
    '--model_type={}'.format(model_type),
    '--model_prefix={}'.format(model_prefix),
    '--vocab_size={}'.format(max_num_words),
    '--pad_id={}'.format(pad_id),
    '--unk_id={}'.format(unk_id),
    '--bos_id={}'.format(bos_id),
    '--eos_id={}'.format(eos_id)
])

print(sentencepiece_params)
SentencePieceTrainer.train(sentencepiece_params)

这里方法名叫做train，实际上也就是计算，适合大规模的数据处理，速度比较快。最后会生成并保存两个文件分别为：sentencepiece.model和sentencepiece.vocab。

通过以下方式可以加载上面训练得到的词表

sp = SentencePieceProcessor()
sp.load("{}.model".format(model_prefix))
print('Found {} unique tokens.'.format(sp.get_piece_size()))

Found 10000 unique tokens.

original = 'This is a test'
encoded_piece = sp.encode_as_pieces(original)
print(encoded_piece)

encoded_ids = sp.encode_as_ids(original)
print(encoded_ids)

['▁This', '▁is', '▁a', '▁t', 'est']
[475, 98, 6, 4, 264]

decoded_pieces = sp.decode_pieces(encoded_piece)
print(decoded_pieces)

decoded_ids = sp.decode_ids(encoded_ids)
print(decoded_ids)

This is a test
This is a test

piece_id = sp.piece_to_id('▁This')
print(piece_id)

print(sp.id_to_piece(piece_id))

475
▁This

`huggingface transformer的 tokenizers`

安装：pip install tokenizers

tokenizer = Tokenizer(BPE())

tokenizer.normalizer = Sequence([NFKC(), Lowercase()])
tokenizer.pre_tokenizer = ByteLevel()

tokenizer.decoder = ByteLevelDecoder()
trainer = BpeTrainer(vocab_size=10000, show_progress=True, initial_alphabet=ByteLevel.alphabet())
tokenizer.train(files=["botchan.txt"], trainer=trainer)
print("Trained vocab size: {}".format(tokenizer.get_vocab_size()))
tokenizer.model.save(".")

tokenizer.model = BPE('vocab.json', 'merges.txt')
encoding = tokenizer.encode("This is a simple input to be tokenized")
print("Encoded string: {}".format(encoding.tokens))

decoded = tokenizer.decode(encoding.ids)
print("Decoded string: {}".format(decoded))

Trained vocab size: 9579
Encoded string: ['Ġthis', 'Ġis', 'Ġa', 'Ġsimple', 'Ġin', 'p', 'ut', 'Ġto', 'Ġbe', 'Ġto', 'ken', 'ized']
Decoded string:  this is a simple input to be tokenized
import sys; print('Python %s on %s' % (sys.version, sys.platform))

`WordPiece`

wordpiece与BPE算法本质一样，都是基于subword的优化算法，主要区别在于如何选择子词进行合并，下面的unigram language model也是这样。

from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers import normalizers
from tokenizers.normalizers import Lowercase, NFD, StripAccents
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.processors import TemplateProcessing
from tokenizers.trainers import WordPieceTrainer


bert_tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
bert_tokenizer.normalizer = normalizers.Sequence([NFD(), Lowercase(), StripAccents()])
bert_tokenizer.pre_tokenizer = Whitespace()

bert_tokenizer.post_process = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[
        ("[CLS]", 1),
        ("[SEP]", 2)
    ],
)

trainer = WordPieceTrainer(
    vocab_size=10000, special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]
)
files = ["botchan.txt"]
bert_tokenizer.train(files, trainer)

encoding = bert_tokenizer.encode("This is a simple input to be tokenized")
print("Encoded string: {}".format(encoding.tokens))
decoded = bert_tokenizer.decode(encoding.ids)
print("Decoded string: {}".format(decoded))

Encoded string: ['this', 'is', 'a', 'simple', 'in', '##p', '##ut', 'to', 'be', 'to', '##ke', '##n', '##ized']
Decoded string: this is a simple in ##p ##ut to be to ##ke ##n ##ized

`Unigram Language Model`

同样可以借助Google的SentencePiece包来实现，只需要把model_type改为unigram即可。