We have already covered the classic APIs and the overall architecture of HF's Transformers library fairly thoroughly in the previous sections. So how do the Tokenizer and a specific model actually work? For each model, the three most important classes live in its tokenization, config, and modeling modules.

LlamaConfig: the network parameter configuration

Every model-specific SpecificConfig inherits from PretrainedConfig, e.g. `class LlamaConfig(PretrainedConfig):`. Its main parameters are:
- `vocab_size`: vocabulary size, default 32000
- `hidden_size`: dimension of the hidden representations (i.e., the width of each layer), default 4096
- `num_hidden_layers`: number of hidden layers, default 32
- `intermediate_size`: dimension of the MLP (feed-forward) representations, default 11008
- `num_attention_heads`: number of attention heads in each attention layer, default 32
- `hidden_act`: activation function of the non-linear layer, default `silu`
- `max_position_embeddings`: maximum position embedding size (i.e., the maximum input sequence length), used for positional encoding, default 2048
- `initializer_range`: standard deviation of the `truncated_normal_initializer` used to initialize all weight matrices, default 0.02
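To see how these fields fit together, here is a minimal sketch, assuming the `transformers` package is installed; the small sizes are made-up illustration values, not LLaMA defaults.

```python
from transformers import LlamaConfig

# A scaled-down LLaMA-style config: only the overridden fields differ
# from the defaults listed above (the tiny values are illustrative).
tiny_config = LlamaConfig(
    vocab_size=32000,          # vocabulary size
    hidden_size=256,           # width of each hidden layer
    num_hidden_layers=4,       # number of decoder layers
    num_attention_heads=8,     # attention heads per layer
    intermediate_size=688,     # width of the MLP block
    max_position_embeddings=512,
)

print(tiny_config.hidden_size)  # 256
print(tiny_config.hidden_act)   # "silu" (the default)
```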
Note that different models' configs may not share exactly the same parameters. The docstring of `LlamaConfig` lists the full set:

````python
class LlamaConfig(PretrainedConfig):
    r"""
    Args:
        vocab_size (`int`, *optional*, defaults to 32000):
            Vocabulary size of the LLaMA model. Defines the number of different tokens that can be represented by the
            `inputs_ids` passed when calling [`LlamaModel`]
        hidden_size (`int`, *optional*, defaults to 4096):
            Dimension of the hidden representations.
        intermediate_size (`int`, *optional*, defaults to 11008):
            Dimension of the MLP representations.
        num_hidden_layers (`int`, *optional*, defaults to 32):
            Number of hidden layers in the Transformer encoder.
        num_attention_heads (`int`, *optional*, defaults to 32):
            Number of attention heads for each attention layer in the Transformer encoder.
        hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
            The non-linear activation function (function or string) in the decoder.
        max_position_embeddings (`int`, *optional*, defaults to 2048):
            The maximum sequence length that this model might ever be used with. Typically set this to something large
            just in case (e.g., 512 or 1024 or 2048).
        initializer_range (`float`, *optional*, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        rms_norm_eps (`float`, *optional*, defaults to 1e-12):
            The epsilon used by the rms normalization layers.
        use_cache (`bool`, *optional*, defaults to `True`):
            Whether or not the model should return the last key/values attentions (not used by all models). Only
            relevant if `config.is_decoder=True`.
        tie_word_embeddings(`bool`, *optional*, defaults to `False`):
            Whether to tie weight embeddings

    Example:

    ```python
    >>> from transformers import LlamaModel, LlamaConfig

    >>> # Initializing a LLaMA llama-7b style configuration
    >>> configuration = LlamaConfig()

    >>> # Initializing a model from the llama-7b style configuration
    >>> model = LlamaModel(configuration)

    >>> # Accessing the model configuration
    >>> configuration = model.config
    ```"""

    model_type = "llama"
````
The config also sets the default special token ids:

- `pad_token_id=0`
- `bos_token_id=1`
- `eos_token_id=2`
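A quick hedged check of these defaults (again assuming `transformers` is installed; newer releases may default `pad_token_id` to `None` instead of 0):

```python
from transformers import LlamaConfig

config = LlamaConfig()
# Default special token ids as described above
print(config.pad_token_id, config.bos_token_id, config.eos_token_id)  # 0 1 2
```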
LlamaTokenizer: the tokenization tool

LlamaTokenizer naturally inherits from PreTrainedTokenizer as well. It declares the vocabulary files it needs, and its model inputs are the token ids (`input_ids`) and the attention mask (`attention_mask`); a short encoding sketch follows the snippet below.
```python
VOCAB_FILES_NAMES = {"vocab_file": "tokenizer.model"}

PRETRAINED_VOCAB_FILES_MAP = {
    "vocab_file": {
        "hf-internal-testing/llama-tokenizer": "https://huggingface.co/hf-internal-testing/llama-tokenizer/resolve/main/tokenizer.model",
    },
    "tokenizer_file": {
        "hf-internal-testing/llama-tokenizer": "https://huggingface.co/hf-internal-testing/llama-tokenizer/resolve/main/tokenizer_config.json",
    },
}

PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {
    "hf-internal-testing/llama-tokenizer": 2048,
}


class LlamaTokenizer(PreTrainedTokenizer):
    """
    Construct a Llama tokenizer. Based on byte-level Byte-Pair-Encoding.

    Args:
        vocab_file (`str`):
            Path to the vocabulary file.
    """

    vocab_files_names = VOCAB_FILES_NAMES
    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
    model_input_names = ["input_ids", "attention_mask"]
```
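As a hedged usage sketch, calling the tokenizer returns exactly those two model inputs; this assumes network access to download the small `hf-internal-testing/llama-tokenizer` checkpoint referenced in the map above.

```python
from transformers import LlamaTokenizer

# Small test checkpoint referenced in PRETRAINED_VOCAB_FILES_MAP above.
tok = LlamaTokenizer.from_pretrained("hf-internal-testing/llama-tokenizer")

enc = tok("Hello, LLaMA!")
print(list(enc.keys()))       # ['input_ids', 'attention_mask']
print(enc["input_ids"])       # starts with the bos id (1), then the token ids of the text
print(enc["attention_mask"])  # all ones for a single unpadded sequence
```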
The `__init__` method loads a few parameters, registers the `bos`, `eos`, `unk`, and `pad` tokens, builds a `sentencepiece.SentencePieceProcessor`, and loads the corresponding vocabulary file:

```python
bos_token = AddedToken(bos_token, lstrip=False, rstrip=False) if isinstance(bos_token, str) else bos_token
eos_token = AddedToken(eos_token, lstrip=False, rstrip=False) if isinstance(eos_token, str) else eos_token
unk_token = AddedToken(unk_token, lstrip=False, rstrip=False) if isinstance(unk_token, str) else unk_token
pad_token = AddedToken(pad_token, lstrip=False, rstrip=False) if isinstance(pad_token, str) else pad_token

self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs
self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
self.sp_model.Load(vocab_file)
```
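Because the tokenizer is a thin wrapper around that SentencePiece model, you can inspect `sp_model` directly once a tokenizer is loaded; a sketch under the same assumptions as above:

```python
from transformers import LlamaTokenizer

tok = LlamaTokenizer.from_pretrained("hf-internal-testing/llama-tokenizer")

# The SentencePieceProcessor constructed and loaded in __init__
print(tok.sp_model.get_piece_size())                     # vocabulary size
print(tok.sp_model.encode("Hello world", out_type=str))  # subword pieces, e.g. ['▁Hello', '▁world']
```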
On top of that, the class provides the usual get / set / convert helpers:

- `vocab_size`: the vocabulary size;
- `get_vocab()`: returns all vocabulary entries as a dict;
- `convert_tokens_to_string(tokens)`: converts the input tokens into the corresponding text string;
- `_tokenize(text)`: tokenizes the input text;
- `_convert_token_to_id`: converts a token (`str`) into an id;
- `_convert_id_to_token`: converts an id into a token (`str`).

For example, `vocab_size` and `get_vocab` are implemented as:

```python
@property
def vocab_size(self):
    """Returns vocab size"""
    return self.sp_model.get_piece_size()

def get_vocab(self):
    """Returns vocab as a dict"""
    vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)}
    vocab.update(self.added_tokens_encoder)
    return vocab
```
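A short hedged sketch of calling these two helpers on a loaded tokenizer (same test checkpoint as before):

```python
from transformers import LlamaTokenizer

tok = LlamaTokenizer.from_pretrained("hf-internal-testing/llama-tokenizer")

print(tok.vocab_size)               # 32000 for the LLaMA vocabulary
vocab = tok.get_vocab()             # token -> id mapping, plus any added tokens
print(vocab["<s>"], vocab["</s>"])  # 1 2 (bos and eos)
```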
To summarize the conversion pipeline:

- the input string (`str`) is tokenized into tokens (`List[str]`);
- the tokens are converted into their corresponding ids (`List[int]`);
- ids (`List[int]`) can be converted back into the original tokens (`List[str]`);
- tokens (`List[str]`) can be restored to the original string (`str`).
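A hedged end-to-end sketch of this round trip, again using the test checkpoint from earlier:

```python
from transformers import LlamaTokenizer

tok = LlamaTokenizer.from_pretrained("hf-internal-testing/llama-tokenizer")

text = "Hello world"
tokens = tok.tokenize(text)                             # str       -> List[str], e.g. ['▁Hello', '▁world']
ids = tok.convert_tokens_to_ids(tokens)                 # List[str] -> List[int]
back_tokens = tok.convert_ids_to_tokens(ids)            # List[int] -> List[str]
back_text = tok.convert_tokens_to_string(back_tokens)   # List[str] -> str
print(back_text)                                        # "Hello world"
```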