
Paper reproduction | pointer-generator

Code for the paper: https://github.com/becxer/pointer-generator/

I. Data (CNN, Daily Mail)
Data-processing code: https://github.com/becxer/cnn-dailymail/

Process the dataset into binary format.

1. Download the data
You need a way around the Great Firewall (e.g. a VPN) to download the two stories archives, one for CNN and one for Daily Mail.

Some of the examples in these files have missing articles; the updated processing code filters them out.

2. Download Stanford CoreNLP (the latest release is currently 3.8.0, but it did not work in my tests; you must use version 3.7.0)

Environment: Linux

We need Stanford CoreNLP to tokenize the data.
Add the following line to your .bashrc (vim .bashrc):

export CLASSPATH=/path/to/stanford-corenlp-full-2016-10-31/stanford-corenlp-3.7.0.jar

Replace /path/to/ with the path to wherever you saved stanford-corenlp-full-2016-10-31.

To verify the setup, run the following command:

echo "Please tokenize this text." | java edu.stanford.nlp.process.PTBTokenizer

You should see the following output:

Please
tokenize
this
text
.
PTBTokenizer tokenized 5 tokens at 68.97 tokens per second.
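
If you would rather run this check from Python (the repo's make_datafiles.py also drives PTBTokenizer as an external Java process), a minimal sketch along the following lines works; it is not part of the repo and only assumes the CLASSPATH export and the tokenizer class shown above:

import os
import subprocess

# Minimal sanity check (not from the repo): confirm the CoreNLP jar is on
# CLASSPATH and that PTBTokenizer runs, mirroring the shell command above.
assert "stanford-corenlp" in os.environ.get("CLASSPATH", ""), "CLASSPATH is not set"

result = subprocess.run(
    ["java", "edu.stanford.nlp.process.PTBTokenizer"],
    input="Please tokenize this text.",
    capture_output=True, text=True, check=True,
)
print(result.stdout)  # expected: one token per line, as in the output above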

3. Process into .bin and vocab files

Run:

python make_datafiles.py /path/to/cnn/stories /path/to/dailymail/stories

Replace /path/to/cnn/stories with the path to your cnn/stories directory, and do the same for dailymail/stories.

This script does the following:

1. It creates two directories, cnn_stories_tokenized and dm_stories_tokenized, containing the tokenized versions of cnn/stories and dailymail/stories. This may take some time. You may see several "Untokenizable:" warnings from the Stanford Tokenizer; these appear to be related to certain Unicode characters.
2. For each of all_train.txt, all_val.txt and all_test.txt, the corresponding tokenized stories are lowercased and written to the binary files train.bin, val.bin and test.bin, which are placed in a newly created finished_files directory. This also takes a while.
3. Additionally, a vocab file is built from the training data and is also placed in finished_files.
4. Finally, train.bin, val.bin and test.bin are split into chunks of 1000 examples each. These chunk files are saved in finished_files/chunked, e.g. train_000.bin, train_001.bin, ..., train_287.bin (see the sketch below). You can use either the single files or the chunked files as input to the model.
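
To make the chunking step concrete, here is a rough Python sketch of that logic. It assumes the .bin format that make_datafiles.py writes (each record is an 8-byte length prefix followed by a serialized tf.Example); the script in the repo is the authoritative implementation.

import struct

CHUNK_SIZE = 1000  # examples per chunk, as described above

def chunk_bin_file(in_path, out_prefix):
    # Copy length-prefixed records from in_path into out_prefix_000.bin,
    # out_prefix_001.bin, ... with at most CHUNK_SIZE records per chunk.
    with open(in_path, "rb") as reader:
        writer, chunk_idx, written = None, 0, 0
        while True:
            len_bytes = reader.read(8)
            if not len_bytes:            # end of the input file
                break
            if writer is None or written == CHUNK_SIZE:
                if writer is not None:
                    writer.close()
                writer = open("%s_%03d.bin" % (out_prefix, chunk_idx), "wb")
                chunk_idx += 1
                written = 0
            str_len = struct.unpack("q", len_bytes)[0]
            writer.write(len_bytes)              # keep the same length prefix
            writer.write(reader.read(str_len))   # copy the serialized example
            written += 1
        if writer is not None:
            writer.close()

# e.g. chunk_bin_file("finished_files/train.bin", "finished_files/chunked/train")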

Output:

Untokenizable: ‪ (U+202A, decimal: 8234)
Untokenizable: ₩ (U+20A9, decimal: 8361)
Untokenizable: ₩ (U+20A9, decimal: 8361)
Untokenizable:   (U+202F, decimal: 8239)
Untokenizable: ️ (U+FE0F, decimal: 65039)
Untokenizable: ‬ (U+202C, decimal: 8236)
Untokenizable: ‪ (U+202A, decimal: 8234)
Untokenizable: ₩ (U+20A9, decimal: 8361)
Untokenizable: ‪ (U+202A, decimal: 8234)
Untokenizable:   (U+202F, decimal: 8239)
Untokenizable: ₩ (U+20A9, decimal: 8361)
Untokenizable:  (U+F06E, decimal: 61550)
Untokenizable: ‬ (U+202C, decimal: 8236)
Untokenizable: ₩ (U+20A9, decimal: 8361)
Untokenizable: ₩ (U+20A9, decimal: 8361)
PTBTokenizer tokenized 80044550 tokens at 864874.49 tokens per second.
Stanford CoreNLP Tokenizer has finished.
Successfully finished tokenizing ../cnn/stories to cnn_stories_tokenized.

Preparing to tokenize ../dailymail/stories to dm_stories_tokenized... (same as above, details omitted)
......
PTBTokenizer tokenized 203071165 tokens at 916186.85 tokens per second.
Stanford CoreNLP Tokenizer has finished.
Successfully finished tokenizing ../dailymail/stories to dm_stories_tokenized.


Making bin file for URLs listed in url_lists/all_val.txt...
Writing story 0 of 13368; 0.00 percent done
Writing story 1000 of 13368; 7.48 percent done
Writing story 2000 of 13368; 14.96 percent done
Writing story 3000 of 13368; 22.44 percent done
Writing story 4000 of 13368; 29.92 percent done
Writing story 5000 of 13368; 37.40 percent done
Writing story 6000 of 13368; 44.88 percent done
Writing story 7000 of 13368; 52.36 percent done
Writing story 8000 of 13368; 59.84 percent done
Writing story 9000 of 13368; 67.32 percent done
Writing story 10000 of 13368; 74.81 percent done
Writing story 11000 of 13368; 82.29 percent done
Writing story 12000 of 13368; 89.77 percent done
Writing story 13000 of 13368; 97.25 percent done
Finished writing file finished_files/val.bin

Making bin file for URLs listed in url_lists/all_train.txt... (same as the previous two, details omitted)
......
Writing story 287000 of 287227; 99.92 percent done
Finished writing file finished_files/train.bin

Writing vocab file...
Finished writing vocab file
Splitting train data into chunks...
Splitting val data into chunks...
Splitting test data into chunks...
Saved chunked data in finished_files/chunked

Additional note:

Within each article, sentences are separated by spaces (i.e. 'sentence1 sentence2 sentence3 ...'), and each sentence is itself already tokenized (PTB tokenization).
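
As a quick way to inspect this format, the sketch below reads one example back out of a chunk file. It assumes the layout written by make_datafiles.py: length-prefixed serialized tf.Example records with 'article' and 'abstract' byte features (the feature names, the <s>/</s> abstract markers, and the paths here are all assumptions based on that processing code).

import glob
import struct
from tensorflow.core.example import example_pb2

# Read the first example from the first training chunk (path is illustrative).
path = sorted(glob.glob("finished_files/chunked/train_*.bin"))[0]
with open(path, "rb") as f:
    str_len = struct.unpack("q", f.read(8))[0]
    example = example_pb2.Example.FromString(f.read(str_len))

article = example.features.feature["article"].bytes_list.value[0].decode()
abstract = example.features.feature["abstract"].bytes_list.value[0].decode()
print(article[:200])   # space-separated, already-tokenized sentences
print(abstract[:200])  # summary sentences, each wrapped in <s> ... </s> (assumed)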
