Overview: The "DataGrand Cup" Text Intelligence Information Extraction Challenge has attracted more than 2,400 contestants from 26 countries and regions, including China, the US, the UK, France, and Germany, and is still in full swing (click the original link for the competition page; the QQ group is in the image above or the QR code at the end). DataGrand has already hosted two live technical sessions around the competition and open-sourced a baseline model. This article summarizes both sessions, covering traditional and state-of-the-art information extraction algorithms, an introduction to the competition, and an analysis of the baseline code together with suggestions for improvement.
In the first half, DataGrand co-founder Gao Xiang gave a detailed walkthrough of information extraction techniques in natural language processing. In the second half, DataGrand engineers presented the baseline code for the "DataGrand Cup" challenge along with suggestions for improving it. Finally, the three experts answered questions from contestants and other viewers.
Gao Xiang is a co-founder of DataGrand and head of its front-end product and text mining groups. An expert in natural language processing, he leads development of text-reading products, search engines, text mining, and big-data scheduling systems, with extensive theoretical and engineering experience in NLP and machine learning.
Part 1: Text Information Extraction in Detail
Part 2: "DataGrand Cup" Baseline Code Walkthrough
Part 3: Q&A
Let's look at the baseline code. First, import the required libraries:
```python
import codecs
import os
```
The whole script is organized into five steps (plus a setup step 0):
```python
# 0 install CRF++: https://taku910.github.io/crfpp/
# 1 train data in
# 2 test data in
# 3 crf train
# 4 crf test
# 5 submit test
```
First we need the CRF++ toolkit, which can be downloaded from https://taku910.github.io/crfpp/. Now let's walk through the code.
Step 1: preprocess the training data
```python
# step 1: convert train.txt into CRF++ column format (one token per line)
with codecs.open('train.txt', 'r', encoding='utf-8') as f:
    lines = f.readlines()
results = []
for line in lines:
    features = []
    tags = []
    samples = line.strip().split(' ')
    for sample in samples:
        # each chunk looks like 'w1_w2_w3/x': tokens joined by '_', then '/tag'
        sample_list = sample[:-2].split('_')
        tag = sample[-1]
        features.extend(sample_list)
        if tag == 'o':
            tags.extend(['O'] * len(sample_list))
        else:
            # BIO scheme: first token gets B-tag, the rest I-tag
            tags.extend(['B-' + tag] + ['I-' + tag] * (len(sample_list) - 1))
    results.append({'features': features, 'tags': tags})

train_write_list = []
with codecs.open('dg_train.txt', 'w', encoding='utf-8') as f_out:
    for result in results:
        for i in range(len(result['tags'])):
            train_write_list.append(result['features'][i] + '\t' + result['tags'][i] + '\n')
        train_write_list.append('\n')  # blank line separates sentences for CRF++
    f_out.writelines(train_write_list)
```
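To make the chunk-to-BIO conversion above concrete, here is a minimal, self-contained sketch of the same logic on a single hypothetical line (the anonymized token IDs below are made up, following the `tokens_joined_by_underscore/tag` format the loop assumes):

```python
def line_to_bio(line):
    """Convert one training line of 'w1_w2/tag' chunks into (features, tags)."""
    features, tags = [], []
    for chunk in line.strip().split(' '):
        pieces = chunk[:-2].split('_')  # drop the trailing '/x'
        tag = chunk[-1]                 # the tag letter after '/'
        features.extend(pieces)
        if tag == 'o':
            tags.extend(['O'] * len(pieces))
        else:
            tags.extend(['B-' + tag] + ['I-' + tag] * (len(pieces) - 1))
    return features, tags

# hypothetical line: an unlabeled chunk, then a chunk labeled 'a'
feats, tags = line_to_bio('12_34/o 56_78_90/a')
print(feats)  # ['12', '34', '56', '78', '90']
print(tags)   # ['O', 'O', 'B-a', 'I-a', 'I-a']
```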

Step 2: preprocess the test data

```python
# step 2: convert test.txt into CRF++ column format (features only, no tags)
with codecs.open('test.txt', 'r', encoding='utf-8') as f:
    lines = f.readlines()
results = []
for line in lines:
    features = []
    sample_list = line.strip().split('_')  # strip the trailing newline first
    features.extend(sample_list)
    results.append({'features': features})

test_write_list = []
with codecs.open('dg_test.txt', 'w', encoding='utf-8') as f_out:
    for result in results:
        for i in range(len(result['features'])):
            test_write_list.append(result['features'][i] + '\n')
        test_write_list.append('\n')
    f_out.writelines(test_write_list)
```

Step 3: train the CRF model

```python
# step 3: train the model with CRF++
# -f 3: keep only features that occur at least 3 times
crf_train = "crf_learn -f 3 template.txt dg_train.txt dg_model"
os.system(crf_train)
```

Training uses the following feature template (`template.txt`). It defines unigram features over a window of ±3 tokens plus combined n-gram features, and the single `B` line enables bigram features over adjacent output tags:

```
# Unigram
U00:%x[-3,0]
U01:%x[-2,0]
U02:%x[-1,0]
U03:%x[0,0]
U04:%x[1,0]
U05:%x[2,0]
U06:%x[3,0]
U07:%x[-2,0]/%x[-1,0]/%x[0,0]
U08:%x[-1,0]/%x[0,0]/%x[1,0]
U09:%x[0,0]/%x[1,0]/%x[2,0]
U10:%x[-3,0]/%x[-2,0]
U11:%x[-2,0]/%x[-1,0]
U12:%x[-1,0]/%x[0,0]
U13:%x[0,0]/%x[1,0]
U14:%x[1,0]/%x[2,0]
U15:%x[2,0]/%x[3,0]

# Bigram
B
```
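Each `U` line defines a unigram feature template, where `%x[r,c]` refers to column `c` of the token `r` rows away from the current one. The following sketch illustrates roughly how such macros expand into feature strings (this is an illustration of the idea, not CRF++'s actual implementation; the `_B-1`/`_B+1` padding mimics CRF++'s out-of-range convention):

```python
import re

def expand(template, tokens, pos):
    """Expand one CRF++-style macro like 'U03:%x[0,0]' at position pos."""
    def lookup(m):
        r, c = int(m.group(1)), int(m.group(2))
        i = pos + r
        if i < 0:
            return '_B-%d' % -i                     # before sentence start
        if i >= len(tokens):
            return '_B+%d' % (i - len(tokens) + 1)  # past sentence end
        return tokens[i][c]
    return re.sub(r'%x\[(-?\d+),(\d+)\]', lookup, template)

tokens = [['12'], ['34'], ['56']]
print(expand('U03:%x[0,0]', tokens, 1))           # U03:34
print(expand('U12:%x[-1,0]/%x[0,0]', tokens, 1))  # U12:12/34
print(expand('U02:%x[-1,0]', tokens, 0))          # U02:_B-1
```

CRF++ instantiates every template at every token position, so a wider window or more n-gram combinations multiplies the feature count; the `-f 3` cutoff above keeps that growth in check.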

Step 4: run prediction on the test data

```python
# step 4: tag the test data with the trained model
crf_test = "crf_test -m dg_model dg_test.txt -o dg_result.txt"
os.system(crf_test)
```
Step 5: generate the submission file

```python
# step 5: convert the CRF++ output back into the submission format
f_write = codecs.open('dg_submit.txt', 'w', encoding='utf-8')
with codecs.open('dg_result.txt', 'r', encoding='utf-8') as f:
    lines = f.read().split('\n\n')  # sentences are separated by blank lines
    for line in lines:
        if line == '':
            continue
        tokens = line.split('\n')
        features = []
        tags = []
        for token in tokens:
            feature_tag = token.split()
            features.append(feature_tag[0])
            tags.append(feature_tag[-1])
        samples = []
        i = 0
        while i < len(features):
            sample = []
            if tags[i] == 'O':
                # merge a run of 'O' tokens into one '/o' chunk
                sample.append(features[i])
                j = i + 1
                while j < len(features) and tags[j] == 'O':
                    sample.append(features[j])
                    j += 1
                samples.append('_'.join(sample) + '/o')
            else:
                if tags[i][0] != 'B':
                    # an entity should start with a 'B-' tag; skip malformed spans
                    print(tags[i][0] + ' error start')
                    j = i + 1
                else:
                    # collect 'B-x' followed by consecutive 'I-x' of the same type
                    sample.append(features[i])
                    j = i + 1
                    while j < len(features) and tags[j][0] == 'I' and tags[j][-1] == tags[i][-1]:
                        sample.append(features[j])
                        j += 1
                    samples.append('_'.join(sample) + '/' + tags[i][-1])
            i = j
        f_write.write(' '.join(samples) + '\n')
f_write.close()  # flush the submission file
```
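One concrete improvement over the baseline: it trains on all of `train.txt` and has no local validation, so you only learn your score after submitting. A common fix is to hold out part of the training data and compute entity-level F1 locally. A minimal sketch (the span extraction mirrors the B/I merging in step 5; the toy `gold`/`pred` sequences are made up):

```python
def bio_to_spans(tags):
    """Extract (start, end, type) entity spans from a BIO tag sequence."""
    spans, i = [], 0
    while i < len(tags):
        if tags[i].startswith('B-'):
            t, j = tags[i][2:], i + 1
            while j < len(tags) and tags[j] == 'I-' + t:
                j += 1
            spans.append((i, j, t))
            i = j
        else:
            i += 1
    return spans

def entity_f1(gold, pred):
    """Micro-averaged entity-level F1 over parallel tag sequences."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        gs, ps = set(bio_to_spans(g)), set(bio_to_spans(p))
        tp += len(gs & ps)
        fp += len(ps - gs)
        fn += len(gs - ps)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

gold = [['B-a', 'I-a', 'O', 'B-b']]
pred = [['B-a', 'I-a', 'O', 'O']]
print(entity_f1(gold, pred))  # prec=1.0, rec=0.5 -> F1 = 2/3
```

With a local metric in place, template variants, the `-f`/`-c` options of `crf_learn`, or alternative models can be compared without spending submissions.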
