当前位置:   article > 正文

毕业设计 nlp深度学习项目 - tensorflow_nlp毕设

nlp毕设


0 项目说明

基于tensorflow的nlp深度学习项目

提示:适合用于课程设计或毕业设计,工作量达标,源码开放


1 介绍

本项目支持的NLP任务包括 分类、匹配、序列标注、文本生成等.

  • 对于分类任务,目前支持多分类、多标签分类,通过选择不同的loss即可。
  • 对于匹配任务,目前已支持交互模型和表示模型。
  • 对于NER任务,目前已支持rnn+crf,idcnn+crf以及bert+crf

2 数据

训练数据(目前data下均内置了样例数据):
(1)对于分类任务的数据使用csv格式,csv头部包括列名‘target’和‘text’;
(2)对于匹配任务的数据使用csv格式,csv头部包括列名‘target’,‘text’ 或者‘target’,‘text_a’,‘text_b’
(3)对于NER任务的数据,参考”data/ner/train_data”,或者使用其它格式的数据的话,修改task/ner.py中的read_data方法即可。
预训练数据(目前在分类和匹配任务上已支持):

  • 如果使用到bert作为预训练(直接下载google训练好的模型即可),直接运行”sh scripts/prepare.sh”
  • 如果使用elmo作为预训练,需要准备一份corpus.txt训练语料放在language_model/bilm_tf/data/目录下
然后执行指令进行预训练: 
      cd language_model/bilm_tf
      sh start.sh
  • 1
  • 2
  • 3

3 快速开始

[依赖]

环境:python3+tensorflow 1.10(python2.7已支持)
pip3 install --user -r requirements.txt
  • 1
  • 2

各类任务的参数定义在conf/model/内的以任务名命名的yml文件中"conf/model/***.yml"
目前已支持的常见任务如下:

[分类]

 1.生成tfrecords数据,训练:
    python3 run.py classify.yml mode=train
   或者直接使用脚本:
    sh scripts/restart.sh classify.yml
 
 2.测试:
   单个测试:python3 run.py classify.yml model=test_one
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7

[匹配]

 1.生成tfrecords数据,训练:
     python3 run.py match.yml mode=train
   或者直接使用脚本:
     sh scripts/restart.sh match.yml
 2.测试:
    单个测试:python3 run.py match.yml model=test_one
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6

[序列标注]

...
sh scripts/restart.sh ner.yml
  • 1
  • 2

[翻译]

...
sh scripts/restart.sh translation.yml
  • 1
  • 2

4 模块

1. encoder
    cnn
    fasttext
    text_cnn
    dcnn
    idcnn
    dpcnn
    vdcnn
    rnn        
    rcnn
    attention_rnn
    capsule
    esim
    han
    matchpyramid
    abcnn
    transformer

2. common 
    loss
    attention
    lr
    ...

3. utils
    data process
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14
  • 15
  • 16
  • 17
  • 18
  • 19
  • 20
  • 21
  • 22
  • 23
  • 24
  • 25
  • 26

5 项目源码

#-*- coding:utf-8 -*-
import gensim
import sys,os
ROOT_PATH = '/'.join(os.path.abspath(__file__).split('/')[:-2])
sys.path.append(ROOT_PATH)
import numpy as np
from itertools import chain
import tensorflow as tf
from utils.preprocess import *
from embedding.embedding_base import Base
from common.layers import get_initializer
import collections
import pickle
import pandas as pd
import pdb


class WordEmbedding(Base):
    def __init__(self, text_list, dict_path, vocab_dict, random = False,\
                 maxlen = 20, embedding_size = 128, **kwargs):
        super(WordEmbedding, self).__init__(**kwargs)
        self.embedding_path = kwargs['conf']['word_embedding_path']
        self.vocab_dict = vocab_dict
        self.maxlen= maxlen
        self.dict_path = dict_path
        self.size = embedding_size
        self.trainable = kwargs['conf'].get('embedding_trainable', True)
        if random:
            self.embedding = tf.get_variable("embeddings",
                                         shape = [len(self.vocab_dict), self.size],
                                         initializer=get_initializer('xavier'),
                                         trainable = self.trainable)


        else:
            loaded_embedding = self._get_embedding(self.vocab_dict)
            self.embedding = tf.get_variable("embeddings",
                                     shape = [len(self.vocab_dict),self.size],
                                     initializer=get_initializer('xavier'),
                                     trainable = self.trainable)
            tf.assign(self.embedding, loaded_embedding)
        self.input_ids = {}

    def __call__(self, features = None, name = "word_embedding"):
        """define placeholder"""
        if features == None:
            self.input_ids[name] = tf.placeholder(dtype=tf.int32, shape=[None,
                                                                     self.maxlen], name = name)
        else:
            self.input_ids[name] = features[name]
        return tf.nn.embedding_lookup(self.embedding, self.input_ids[name])

    def feed_dict(self, input_x, name = 'word_embedding'):
        feed_dict = {}
        feed_dict[self.input_ids[name]] = input_x
        return feed_dict

    def pb_feed_dict(self, graph, input_x, name = 'word_embedding'):
        feed_dict = {}
        input_x_node = graph.get_operation_by_name(name).outputs[0]
        feed_dict[input_x_node] = input_x
        return feed_dict

    @staticmethod
    def build_dict(dict_path, text_list = None,  mode = "train"):
        if not os.path.exists(dict_path) or mode == "train":
            assert text_list != None, "text_list can't be None in train mode"
            words = list()
            for content in text_list:
                for word in word_tokenize(clean_str(content)):
                    words.append(word)

            word_counter = collections.Counter(words).most_common()
            vocab_dict = dict()
            vocab_dict["<pad>"] = 0
            vocab_dict["<unk>"] = 1
            for word, _ in word_counter:
                vocab_dict[word] = len(vocab_dict)

            with open(dict_path, "wb") as f:
                pickle.dump(vocab_dict, f)
        else:
            with open(dict_path, "rb") as f:
                vocab_dict = pickle.load(f)

        return vocab_dict

    @staticmethod
    def text2id(text_list, vocab_dict, maxlen, need_preprocess = True):
        """
        文本id化
        """
        if need_preprocess:
            pre = Preprocess()
            text_list = [pre.get_dl_input_by_text(text) for text in text_list]
        x = list(map(lambda d: word_tokenize(clean_str(d)), text_list))
        x_len = [min(len(text), maxlen) for text in x]
        x = list(map(lambda d: list(map(lambda w: vocab_dict.get(w, vocab_dict["<unk>"]), d)), x))
        x = list(map(lambda d: d[:maxlen], x))
        x = list(map(lambda d: d + (maxlen - len(d)) * [vocab_dict["<pad>"]], x))
        return text_list, x, x_len

    def _get_embedding(self, vocab_dict, add_embedding_word = True):
        """get embedding vector by dict and embedding_file"""
        model = self._load_embedding_file(self.embedding_path)
        embedding = []
        dict_rev = {vocab_dict[word]:word for word in vocab_dict}
        for idx in range(len(vocab_dict)):
            word = dict_rev[idx]
            if word in model:
                embedding.append(model[word])
            else:
                embedding.append(self._get_rand_embedding())
        if add_embedding_word:
            for key in model.vocab.keys():
                if key not in vocab_dict:
                    vocab_dict[key] = len(vocab_dict)
                    embedding.append(model[key])
            with open(self.dict_path, "wb") as f:
                pickle.dump(vocab_dict, f)
        return tf.convert_to_tensor(np.array(embedding), tf.float32)

    def _get_rand_embedding(self):
        """random embedding"""
        return np.random.randn(self.size)

    def _load_embedding_file(self, path):
        """
        模型格式有两种bin和model,使用方式:
        a. bin模式:model = gensim.models.KeyedVectors.load_word2vec_format(model_path, binary=True)
        b. model模式:model = gensim.models.Word2Vec.load(model_path)
        model from 
        """
        model = gensim.models.KeyedVectors.load_word2vec_format(path,
                                                                binary=False)
        assert model.vector_size == self.size, "the size of vector\
            from embedding file {} != defined embedding_size {}".format(
                model.vector_size, self.size)
        return model

if __name__ == '__main__':
    embedding = WordEmbedding()

  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14
  • 15
  • 16
  • 17
  • 18
  • 19
  • 20
  • 21
  • 22
  • 23
  • 24
  • 25
  • 26
  • 27
  • 28
  • 29
  • 30
  • 31
  • 32
  • 33
  • 34
  • 35
  • 36
  • 37
  • 38
  • 39
  • 40
  • 41
  • 42
  • 43
  • 44
  • 45
  • 46
  • 47
  • 48
  • 49
  • 50
  • 51
  • 52
  • 53
  • 54
  • 55
  • 56
  • 57
  • 58
  • 59
  • 60
  • 61
  • 62
  • 63
  • 64
  • 65
  • 66
  • 67
  • 68
  • 69
  • 70
  • 71
  • 72
  • 73
  • 74
  • 75
  • 76
  • 77
  • 78
  • 79
  • 80
  • 81
  • 82
  • 83
  • 84
  • 85
  • 86
  • 87
  • 88
  • 89
  • 90
  • 91
  • 92
  • 93
  • 94
  • 95
  • 96
  • 97
  • 98
  • 99
  • 100
  • 101
  • 102
  • 103
  • 104
  • 105
  • 106
  • 107
  • 108
  • 109
  • 110
  • 111
  • 112
  • 113
  • 114
  • 115
  • 116
  • 117
  • 118
  • 119
  • 120
  • 121
  • 122
  • 123
  • 124
  • 125
  • 126
  • 127
  • 128
  • 129
  • 130
  • 131
  • 132
  • 133
  • 134
  • 135
  • 136
  • 137
  • 138
  • 139
  • 140
  • 141
  • 142
  • 143

6 最后

声明:本文内容由网友自发贡献,不代表【wpsshop博客】立场,版权归原作者所有,本站不承担相应法律责任。如您发现有侵权的内容,请联系我们。转载请注明出处:https://www.wpsshop.cn/w/很楠不爱3/article/detail/337123
推荐阅读
相关标签
  

闽ICP备14008679号