Coursera Course Notes: Natural Language Processing in TensorFlow
Padding fills sequences with 0s so that all sentences have the same length and can be combined into a single matrix.
Example 1.
```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

sentences = [
    'i love my dog',
    'I love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?'
]

tokenizer = Tokenizer(num_words=100, oov_token='<OOV>')
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
sequences = tokenizer.texts_to_sequences(sentences)

padded = pad_sequences(sequences)  # default: pad at the front (padding='pre')
print(word_index)
print(sequences)
print(padded)
```
Output:
```
{'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}
[[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]
[[ 0  0  0  5  3  2  4]
 [ 0  0  0  5  3  2  7]
 [ 0  0  0  6  3  2  4]
 [ 8  6  9  2  4 10 11]]
```
As you can see, shorter sentences are padded with 0s at the front by default. If you want to pad at the end instead, specify the parameter padding='post'.
Example 2.
```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

sentences = [
    'i love my dog',
    'I love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?'
]

tokenizer = Tokenizer(num_words=100, oov_token='<OOV>')
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
sequences = tokenizer.texts_to_sequences(sentences)

#padded = pad_sequences(sequences)
padded = pad_sequences(sequences, padding='post')  # pad at the end instead
#padded = pad_sequences(sequences, padding='post', maxlen=5)
print(word_index)
print(sequences)
print(padded)
```
Output:
```
{'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}
[[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]
[[ 5  3  2  4  0  0  0]
 [ 5  3  2  7  0  0  0]
 [ 6  3  2  4  0  0  0]
 [ 8  6  9  2  4 10 11]]
```
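A side note (not from the course, but part of the same pad_sequences API): the fill value does not have to be 0. pad_sequences also accepts value= (the padding value, default 0.0) and dtype= (default 'int32'). A minimal sketch:

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

sequences = [[5, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]

# Pad at the end with -1 instead of the default 0.
padded = pad_sequences(sequences, padding='post', value=-1)
print(padded)
# [[ 5  3  2  4 -1 -1 -1]
#  [ 8  6  9  2  4 10 11]]
```

This matters if 0 could be a real token id; Tokenizer starts its word_index at 1, so the default fill value of 0 is safe here.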
You can also specify a maximum length with the parameter maxlen=5.
Example 3:
```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

sentences = [
    'i love my dog',
    'I love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?'
]

tokenizer = Tokenizer(num_words=100, oov_token='<OOV>')
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
sequences = tokenizer.texts_to_sequences(sentences)

#padded = pad_sequences(sequences)
#padded = pad_sequences(sequences, padding='post')
padded = pad_sequences(sequences, padding='post', maxlen=5)  # cap every sequence at 5 tokens
print(word_index)
print(sequences)
print(padded)
```
Output:
```
{'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}
[[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]
[[ 5  3  2  4  0]
 [ 5  3  2  7  0]
 [ 6  3  2  4  0]
 [ 9  2  4 10 11]]
```
As you can see, sentences longer than maxlen are truncated from the front by default. Similarly, if you want to truncate from the end, specify the parameter truncating='post'.
Example 4:
```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

sentences = [
    'i love my dog',
    'I love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?'
]

tokenizer = Tokenizer(num_words=100, oov_token='<OOV>')
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
sequences = tokenizer.texts_to_sequences(sentences)

#padded = pad_sequences(sequences)
#padded = pad_sequences(sequences, padding='post')
#padded = pad_sequences(sequences, padding='post', maxlen=5)
padded = pad_sequences(sequences, padding='post', truncating='post', maxlen=5)  # truncate from the end
print(word_index)
print(sequences)
print(padded)
```
Output:
```
{'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}
[[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]
[[5 3 2 4 0]
 [5 3 2 7 0]
 [6 3 2 4 0]
 [8 6 9 2 4]]
```
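Finally, a forward-looking note: in recent TensorFlow releases, the tf.keras.preprocessing APIs used above are deprecated (pad_sequences also lives at tf.keras.utils.pad_sequences), and the recommended replacement is the tf.keras.layers.TextVectorization layer, which tokenizes, pads, and truncates in a single step. A minimal sketch of the equivalent pipeline; note the integer indices it assigns will differ from the word_index above, with 0 reserved for padding and 1 for the OOV token '[UNK]':

```python
import tensorflow as tf

sentences = [
    'i love my dog',
    'I love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?'
]

# Lowercases, strips punctuation, tokenizes, then pads/truncates
# every sequence to exactly output_sequence_length tokens.
vectorize_layer = tf.keras.layers.TextVectorization(
    max_tokens=100,
    output_sequence_length=5)
vectorize_layer.adapt(sentences)  # builds the vocabulary, like fit_on_texts

print(vectorize_layer.get_vocabulary())
print(vectorize_layer(sentences).numpy())
```

Unlike the pad_sequences defaults, TextVectorization pads and truncates at the end of each sequence.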