
tiktoken: OpenAI's Efficient Tokenizer for Large Models


Does every ChatGPT model compute tokens the same way?

Questions this section answers

  1. When do you need to count tokens?
  2. A small issue in the get_token_ids method
  3. What if MODEL_TO_ENCODING has no encoding_name for my model?
  4. Why is the `with _lock` lock needed?
  5. What do the attributes in openai_public.py mean?
  6. Why does tiktoken's encoding of Chinese get longer?

tiktoken

tiktoken is a fast BPE tokenizer for use with OpenAI's models. Converting text into a sequence of tokens is what lets the models process it, and BPE is one way of doing that conversion. Beyond the conversion itself, tiktoken has the following properties:

  1. It is reversible and lossless: tokens can be converted back into the original text (see the sketch right after this list).
  2. It works on arbitrary text, even text that is not in the tokenizer's training data.
  3. It compresses text: the token sequence is shorter than the bytes of the original text. On average, each token corresponds to roughly 4 bytes.
  4. It tries to let the model see common subwords. For instance, "ing" is a common subword in English, so BPE will typically split "encoding" into "encod" and "ing" (rather than "enc" and "oding"). Because the model will see the "ing" token again in different contexts, this helps it understand grammar better.
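
A minimal round-trip sketch, assuming the PyPI tiktoken package is installed (the sample string is arbitrary):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("hello world")   # text -> token ids
print(tokens)
# Reversible and lossless: decoding yields back the exact original text
assert enc.decode(tokens) == "hello world"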

tiktoken also ships an educational submodule; if you want to learn more about the details of BPE, including code that helps visualize the BPE process, you can use it.

You can use tiktoken's API by installing the PyPI release of tiktoken.

tiktoken is also extensible: you can register your own encodings via the tiktoken_ext plugin mechanism, and tiktoken.get_encoding will then be able to find them.

Example code


from langchain.llms import OpenAI

llm = OpenAI(temperature=0)

with open('data/scene.md', 'r', encoding="utf-8") as file:
    text = file.read()

# Print the first 285 characters as a preview
print(text[:285])

# Count the tokens the model would see for this text
num_tokens = llm.get_num_tokens(text)
print(f"There are {num_tokens} tokens in your file")

Debugging walkthrough

encoding_for_model

def encoding_for_model(model_name: str) -> Encoding:
    """Returns the encoding used by a model."""
    encoding_name = None
    # From model.py inside tiktoken
    if model_name in MODEL_TO_ENCODING:
        encoding_name = MODEL_TO_ENCODING[model_name]
    else:
        # Check if the model matches a known prefix
        # Prefix matching avoids needing library updates for every model version release
        # Note that this can match on non-existent models (e.g., gpt-3.5-turbo-FAKE)
        for model_prefix, model_encoding_name in MODEL_PREFIX_TO_ENCODING.items():
            if model_name.startswith(model_prefix):
                return get_encoding(model_encoding_name)

    if encoding_name is None:
        raise KeyError(
            f"Could not automatically map {model_name} to a tokeniser. "
            "Please use `tiktoken.get_encoding` to explicitly get the tokeniser you expect."
        ) from None

    return get_encoding(encoding_name)
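
As a quick usage sketch (the printed names match tiktoken's mapping tables at the time of writing):

import tiktoken

# Exact match through MODEL_TO_ENCODING
enc = tiktoken.encoding_for_model("gpt-4")
print(enc.name)  # cl100k_base

# Prefix match through MODEL_PREFIX_TO_ENCODING: "gpt-3.5-turbo-" is a known
# prefix, so even a dated release name resolves without a library update
enc = tiktoken.encoding_for_model("gpt-3.5-turbo-0301")
print(enc.name)  # cl100k_base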


How encoding_name is resolved

# First, try an exact match on model_name
if model_name in MODEL_TO_ENCODING:
    encoding_name = MODEL_TO_ENCODING[model_name]
else:
    # If there is no exact match, fall back to prefix matching
    for model_prefix, model_encoding_name in MODEL_PREFIX_TO_ENCODING.items():
        if model_name.startswith(model_prefix):
            return get_encoding(model_encoding_name)

If there is still no match, an error is raised:

if encoding_name is None:
    raise KeyError(
        f"Could not automatically map {model_name} to a tokeniser. "
        "Please use `tiktoken.get_encoding` to explicitly get the tokeniser you expect."
    ) from None

Fetching the encoding by encoding_name

def get_encoding(encoding_name: str) -> Encoding:
    if encoding_name in ENCODINGS:
        return ENCODINGS[encoding_name]

    with _lock:
        if encoding_name in ENCODINGS:
            return ENCODINGS[encoding_name]
        # This step must not run concurrently, hence the lock
        if ENCODING_CONSTRUCTORS is None:
            _find_constructors()
            assert ENCODING_CONSTRUCTORS is not None
        # If encoding_name is not in ENCODING_CONSTRUCTORS, fail immediately
        if encoding_name not in ENCODING_CONSTRUCTORS:
            raise ValueError(f"Unknown encoding {encoding_name}")

        constructor = ENCODING_CONSTRUCTORS[encoding_name]
        enc = Encoding(**constructor())
        ENCODINGS[encoding_name] = enc
        return enc
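
One observable consequence of the ENCODINGS cache: repeated calls return the very same object, so an encoding is constructed only once per process. A tiny check:

import tiktoken

a = tiktoken.get_encoding("cl100k_base")
b = tiktoken.get_encoding("cl100k_base")
assert a is b  # served from the ENCODINGS cache, not rebuilt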

Questions

When do you need to count tokens?

Because tokens are the billing unit for GPT-4 and other paid GPT models, counting them up front lets you estimate what a request will cost (and check that the input fits the model's context window).
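
A minimal counting helper using tiktoken directly (the model name and sample text here are placeholders):

import tiktoken

def count_tokens(text: str, model: str = "gpt-3.5-turbo") -> int:
    # Resolve the model's encoding, then count the ids it produces
    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(text))

print(count_tokens("How many tokens is this sentence?"))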

A small issue in the get_token_ids method

def get_token_ids(self, text: str) -> List[int]:
    """Get the token IDs using the tiktoken package."""
    # tiktoken NOT supported for Python < 3.8
    # Today's version is 3.8.16, so this reads the minor version, 8; if Python
    # ever reaches 4.1.1, the check becomes 1 < 8 and wrongly takes the fallback
    if sys.version_info[1] < 8:
        return super().get_num_tokens(text)
    ...

The check only inspects the minor version: on today's 3.8.16 it reads the 8, but on a hypothetical Python 4.1.1 it would read the 1, and 1 < 8 would wrongly route us into the fallback. It should be written as:

if sys.version_info[0] <= 3 and sys.version_info[1] < 8:
    return super().get_num_tokens(text)
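
An equivalent and more idiomatic form compares the whole version tuple at once, which handles any future major version correctly:

import sys

# Tuple comparison covers major and minor version in a single check
if sys.version_info < (3, 8):
    ...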

What if MODEL_TO_ENCODING has no encoding_name for my model?

As the KeyError message above suggests, call tiktoken.get_encoding yourself with an encoding you choose explicitly; the tiktoken_ext plugin mechanism mentioned earlier also lets you register a brand-new encoding.
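
A hedged sketch of the explicit fallback ("my-custom-gpt" is a hypothetical model name, not a real mapping):

import tiktoken

model_name = "my-custom-gpt"  # hypothetical: not in MODEL_TO_ENCODING

try:
    enc = tiktoken.encoding_for_model(model_name)
except KeyError:
    # Fall back to an encoding you pick explicitly, as the error message advises
    enc = tiktoken.get_encoding("cl100k_base")

print(enc.encode("hello"))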

Why is the `with _lock` lock needed?

ENCODINGS and ENCODING_CONSTRUCTORS are process-wide mutable state that is populated lazily. Without the lock, two threads calling get_encoding at the same time could both run _find_constructors or both construct the same Encoding; the second check of ENCODINGS inside the lock ensures the work happens exactly once (classic double-checked locking).


At this point there is only one such plugin file for _find_constructors to discover: 'tiktoken_ext.openai_public'.

openai_public.py

from tiktoken.load import data_gym_to_mergeable_bpe_ranks, load_tiktoken_bpe

ENDOFTEXT = "<|endoftext|>"
FIM_PREFIX = "<|fim_prefix|>"
FIM_MIDDLE = "<|fim_middle|>"
FIM_SUFFIX = "<|fim_suffix|>"
ENDOFPROMPT = "<|endofprompt|>"


def gpt2():
    mergeable_ranks = data_gym_to_mergeable_bpe_ranks(
        vocab_bpe_file="https://openaipublic.blob.core.windows.net/gpt-2/encodings/main/vocab.bpe",
        encoder_json_file="https://openaipublic.blob.core.windows.net/gpt-2/encodings/main/encoder.json",
    )
    return {
        "name": "gpt2",
        "explicit_n_vocab": 50257,
        "pat_str": r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""",
        "mergeable_ranks": mergeable_ranks,
        "special_tokens": {"<|endoftext|>": 50256},
    }


def r50k_base():
    mergeable_ranks = load_tiktoken_bpe(
        "https://openaipublic.blob.core.windows.net/encodings/r50k_base.tiktoken"
    )
    return {
        "name": "r50k_base",
        "explicit_n_vocab": 50257,
        "pat_str": r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""",
        "mergeable_ranks": mergeable_ranks,
        "special_tokens": {ENDOFTEXT: 50256},
    }


def p50k_base():
    mergeable_ranks = load_tiktoken_bpe(
        "https://openaipublic.blob.core.windows.net/encodings/p50k_base.tiktoken"
    )
    return {
        "name": "p50k_base",
        "explicit_n_vocab": 50281,
        "pat_str": r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""",
        "mergeable_ranks": mergeable_ranks,
        "special_tokens": {ENDOFTEXT: 50256},
    }


def p50k_edit():
    mergeable_ranks = load_tiktoken_bpe(
        "https://openaipublic.blob.core.windows.net/encodings/p50k_base.tiktoken"
    )
    special_tokens = {ENDOFTEXT: 50256, FIM_PREFIX: 50281, FIM_MIDDLE: 50282, FIM_SUFFIX: 50283}
    return {
        "name": "p50k_edit",
        "pat_str": r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""",
        "mergeable_ranks": mergeable_ranks,
        "special_tokens": special_tokens,
    }


def cl100k_base():
    mergeable_ranks = load_tiktoken_bpe(
        "https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken"
    )
    special_tokens = {
        ENDOFTEXT: 100257,
        FIM_PREFIX: 100258,
        FIM_MIDDLE: 100259,
        FIM_SUFFIX: 100260,
        ENDOFPROMPT: 100276,
    }
    return {
        "name": "cl100k_base",
        "pat_str": r"""(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+""",
        "mergeable_ranks": mergeable_ranks,
        "special_tokens": special_tokens,
    }


ENCODING_CONSTRUCTORS = {
    "gpt2": gpt2,
    "r50k_base": r50k_base,
    "p50k_base": p50k_base,
    "p50k_edit": p50k_edit,
    "cl100k_base": cl100k_base,
}


Finally, _find_constructors iterates over this map and merges it into the global ENCODING_CONSTRUCTORS object.
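
Roughly, the discovery step works like this (a simplified sketch of _find_constructors, not the verbatim tiktoken source):

import importlib
import pkgutil

import tiktoken_ext

ENCODING_CONSTRUCTORS = {}
# tiktoken_ext is a namespace package: any module inside it (openai_public.py
# is the built-in one) can contribute its own ENCODING_CONSTRUCTORS dict
for _, mod_name, _ in pkgutil.iter_modules(tiktoken_ext.__path__, tiktoken_ext.__name__ + "."):
    mod = importlib.import_module(mod_name)
    ENCODING_CONSTRUCTORS.update(mod.ENCODING_CONSTRUCTORS)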


What do the attributes in openai_public.py mean?

These fields correspond to the arguments of the Encoding constructor, documented in its docstring:

Args:
    name: The name of the encoding. It should be clear from the name of the encoding
        what behaviour to expect, in particular, encodings with different special tokens
        should have different names.
    pat_str: A regex pattern string that is used to split the input text.
    mergeable_ranks: A dictionary mapping mergeable token bytes to their ranks. The ranks
        must correspond to merge priority.
    special_tokens: A dictionary mapping special token strings to their token values.
    explicit_n_vocab: The number of tokens in the vocabulary. If provided, it is checked
        that the number of mergeable tokens and special tokens is equal to this number.
For example, p50k_base fills these fields in as follows:

def p50k_base():
    mergeable_ranks = load_tiktoken_bpe(
       "https://openaipublic.blob.core.windows.net/encodings/p50k_base.tiktoken"
    )
    return {
        "name": "p50k_base",
        "explicit_n_vocab": 50281,
        "pat_str": r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""",
        "mergeable_ranks": mergeable_ranks,
        "special_tokens": {ENDOFTEXT: 50256},
    }
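
You can sanity-check explicit_n_vocab against a constructed encoding; Encoding exposes the total vocabulary size as n_vocab:

import tiktoken

enc = tiktoken.get_encoding("p50k_base")
print(enc.n_vocab)  # 50281: mergeable tokens plus special tokens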

Why does tiktoken's encoding of Chinese get longer?

Why so long? Using the vocabulary file OpenAI released, cl100k_base_vocab.json, we can look up the vocabulary entry behind each of these token ids:

["山", "东", "b'\\xe6\\xb7'", "b'\\x84'", "b'\\xe5\\x8d'", "b'\\x9a'", "b'\\xe5\\x90'", "b'\\x83'", "b'\\xe7'", "b'\\x83'", "b'\\xa7'", "b'\\xe7'", "b'\\x83'", "b'\\xa4'"]

Notice that apart from "山" and "东", two relatively common characters that appear directly in the vocabulary, everything else is an odd-looking escaped byte sequence. With a little inspection we find that tokens [85315, 226] correspond to "b'\\xe6\\xb7'" and "b'\\x84'"; concatenated and decoded as UTF-8, b'\xe6\xb7\x84'.decode('utf-8') gives back "淄".

Readers familiar with BPE will have an aha moment here: to support many languages with a single tokenizer, OpenAI works on a universal representation of text, its UTF-8 encoding. UTF-8 is a variable-length character encoding for Unicode that maps each character to a sequence of 1 to 4 bytes. For example:

>>> '山东淄博吃烧烤'.encode('utf-8')
b'\xe5\xb1\xb1\xe4\xb8\x9c\xe6\xb7\x84\xe5\x8d\x9a\xe5\x90\x83\xe7\x83\xa7\xe7\x83\xa4'
>>> '淄博'.encode('utf-8')
b'\xe6\xb7\x84\xe5\x8d\x9a'
>>> '淄'.encode('utf-8')
b'\xe6\xb7\x84'
>>> '博'.encode('utf-8')
b'\xe5\x8d\x9a'

In the output, \x marks a hexadecimal byte value: "淄博" is encoded into 6 such bytes, 3 bytes per character. GPT-4 then treats every 2 hex digits, i.e. every single byte, as the smallest-granularity token, and runs BPE's iterative merging on top of that to build the vocabulary. Common characters such as "山" and "东" get merged all the way up into single tokens, while rarer ones like "淄" remain split across several byte-level tokens, which is why Chinese text tends to cost more tokens.
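
You can reproduce this byte-level view with tiktoken itself; decode_single_token_bytes shows the raw bytes behind each token id:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("山东淄博吃烧烤")
# Whole characters for common ones like 山 and 东, partial
# UTF-8 byte sequences for rarer ones like 淄
print([enc.decode_single_token_bytes(t) for t in tokens])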