NLP-TF2.0-C3W1L6-Padding

Course notes for Coursera's Natural Language Processing in TensorFlow.

Padding pads sequences with zeros so that all sentences end up the same length, producing a single matrix.

Example 1.

    from tensorflow.keras.preprocessing.text import Tokenizer
    from tensorflow.keras.preprocessing.sequence import pad_sequences

    sentences = [
        'i love my dog',
        'I love my cat',
        'You love my dog!',
        'Do you think my dog is amazing?'
    ]

    tokenizer = Tokenizer(num_words=100, oov_token='<OOV>')
    tokenizer.fit_on_texts(sentences)
    word_index = tokenizer.word_index
    sequences = tokenizer.texts_to_sequences(sentences)
    padded = pad_sequences(sequences)

    print(word_index)
    print(sequences)
    print(padded)

Output:

    {'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}
    [[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]
    [[ 0  0  0  5  3  2  4]
     [ 0  0  0  5  3  2  7]
     [ 0  0  0  6  3  2  4]
     [ 8  6  9  2  4 10 11]]
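For intuition, here is a minimal pure-Python sketch of what the default pre-padding does. This is an illustration only, not the library's actual implementation (pad_sequences also handles truncation, dtype, and returns a NumPy array):

    # Pre-pad each sequence with zeros by hand (illustration only).
    sequences = [[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]
    maxlen = max(len(s) for s in sequences)          # length of the longest sequence
    padded = [[0] * (maxlen - len(s)) + s for s in sequences]  # zeros go in front
    for row in padded:
        print(row)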

As you can see, shorter sentences are padded with zeros at the front by default. If you want to pad at the end instead, specify the parameter padding='post'.

Example 2.

    from tensorflow.keras.preprocessing.text import Tokenizer
    from tensorflow.keras.preprocessing.sequence import pad_sequences

    sentences = [
        'i love my dog',
        'I love my cat',
        'You love my dog!',
        'Do you think my dog is amazing?'
    ]

    tokenizer = Tokenizer(num_words=100, oov_token='<OOV>')
    tokenizer.fit_on_texts(sentences)
    word_index = tokenizer.word_index
    sequences = tokenizer.texts_to_sequences(sentences)
    # padded = pad_sequences(sequences)
    padded = pad_sequences(sequences, padding='post')
    # padded = pad_sequences(sequences, padding='post', maxlen=5)

    print(word_index)
    print(sequences)
    print(padded)

Output:

    {'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}
    [[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]
    [[ 5  3  2  4  0  0  0]
     [ 5  3  2  7  0  0  0]
     [ 6  3  2  4  0  0  0]
     [ 8  6  9  2  4 10 11]]

You can also use the maxlen=5 parameter to specify a maximum sequence length.

Example 3.

    from tensorflow.keras.preprocessing.text import Tokenizer
    from tensorflow.keras.preprocessing.sequence import pad_sequences

    sentences = [
        'i love my dog',
        'I love my cat',
        'You love my dog!',
        'Do you think my dog is amazing?'
    ]

    tokenizer = Tokenizer(num_words=100, oov_token='<OOV>')
    tokenizer.fit_on_texts(sentences)
    word_index = tokenizer.word_index
    sequences = tokenizer.texts_to_sequences(sentences)
    # padded = pad_sequences(sequences)
    # padded = pad_sequences(sequences, padding='post')
    padded = pad_sequences(sequences, padding='post', maxlen=5)

    print(word_index)
    print(sequences)
    print(padded)

Output:

    {'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}
    [[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]
    [[ 5  3  2  4  0]
     [ 5  3  2  7  0]
     [ 6  3  2  4  0]
     [ 9  2  4 10 11]]

As you can see, sentences longer than maxlen are truncated from the front by default. Likewise, if you want to truncate from the end instead, specify the parameter truncating='post'.

Example 4.

    from tensorflow.keras.preprocessing.text import Tokenizer
    from tensorflow.keras.preprocessing.sequence import pad_sequences

    sentences = [
        'i love my dog',
        'I love my cat',
        'You love my dog!',
        'Do you think my dog is amazing?'
    ]

    tokenizer = Tokenizer(num_words=100, oov_token='<OOV>')
    tokenizer.fit_on_texts(sentences)
    word_index = tokenizer.word_index
    sequences = tokenizer.texts_to_sequences(sentences)
    # padded = pad_sequences(sequences)
    # padded = pad_sequences(sequences, padding='post')
    # padded = pad_sequences(sequences, padding='post', maxlen=5)
    padded = pad_sequences(sequences, padding='post', truncating='post', maxlen=5)

    print(word_index)
    print(sequences)
    print(padded)

Output:

    {'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}
    [[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]
    [[5 3 2 4 0]
     [5 3 2 7 0]
     [6 3 2 4 0]
     [8 6 9 2 4]]
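In practice, you apply the same fitted tokenizer to new text before padding it. A minimal sketch (test_sentences is a made-up list for illustration; words never seen by fit_on_texts map to the <OOV> index 1):

    # Reuse the tokenizer fitted above on unseen sentences;
    # 'really', 'loves', and 'manatee' were not in the training
    # vocabulary, so they become the <OOV> token (index 1).
    test_sentences = [
        'i really love my dog',
        'my dog loves my manatee'
    ]
    test_seq = tokenizer.texts_to_sequences(test_sentences)
    test_padded = pad_sequences(test_seq, padding='post', maxlen=5)
    print(test_seq)
    print(test_padded)

pad_sequences also accepts a value parameter if you want to pad with a number other than 0.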

 
