Introduction to Data Science Lab: XML and the Stanford Parser

I had used the xml.etree.ElementTree package before, and Beautiful Soup's API is similar, so lxml was quick to pick up.
One caveat: don't keep a .py file in the working directory with the same name as a package you import, e.g. xml.py, because it shadows the real module.

XML parsing

from lxml import etree

parser = etree.XMLParser(recover=True)
tree = etree.parse('reviews/video/reviews.xml',parser)
root = tree.getroot()
root
<Element reviews at 0x289a8d440c8>

This first raised an error: 'utf-8' codec can't decode byte 0xe8 in position 3278: invalid continuation byte

The UTF-8 decoder itself is fine; the file's actual encoding must be something else. Re-saving the file with UTF-8 encoding in Sublime Text makes the error go away.
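
If you'd rather fix it in code than in an editor, re-encoding in Python works too. A minimal sketch, assuming the file is actually GBK-encoded (common for Chinese text; the source encoding here is a guess, so verify yours first):

```python
def reencode_to_utf8(src, dst, src_encoding='gbk'):
    """Rewrite a file as UTF-8. src_encoding is an assumption; verify it first."""
    with open(src, 'rb') as f:
        # Replace any bytes that still fail to decode rather than crashing.
        text = f.read().decode(src_encoding, errors='replace')
    with open(dst, 'w', encoding='utf-8') as f:
        f.write(text)
```

After this, etree.parse on the rewritten file no longer trips over the bad bytes, provided the guessed source encoding was right.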

Converting to pandas

Inspecting the file shows that reviews contains many review elements; each review is one record with 11 fields.
The unique_id and product_type fields each appear twice per review, so they need to be merged into lists.

import pandas as pd

lists = []

for review in root.findall('review'):
    id_list = [j.text.strip() for j in review.findall("unique_id")]
    asin = review.find("asin").text.strip()
    product_name = review.find("product_name").text.strip()
    type_list = [j.text.strip() for j in review.findall("product_type")]
    helpful = review.find("helpful").text.strip()
    rating = review.find("rating").text.strip()
    title = review.find("title").text.strip()
    date = review.find("date").text.strip()
    reviewer = review.find("reviewer").text.strip()
    reviewer_location = review.find("reviewer_location").text.strip()
    review_text = review.find("review_text")
    review_text = review_text.text.strip() if review_text is not None else "" 

    lists.append([id_list, asin, product_name, type_list, 
                    helpful, rating, title, date, reviewer,
                    reviewer_location, review_text])

cols = ['unique_id', 'asin', 'product_name', 'product_type',"helpful","rating","title","date","reviewer","reviewer_location","review_text"]
df = pd.DataFrame(lists, columns=cols)
df.head(3)
| | unique_id | asin | product_name | product_type | helpful | rating | title | date | reviewer | reviewer_location | review_text |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | [B00006IUOZ :disappointi… | B00006IUOZ | Insomnia: Video: Al Pacino, Martin Donovan (II)… | [video, video] | 1 of 2 | 2.0 | Disappointing in more ways than as a remake | October 5, 2005 | PipBoy | particularly the internal affairs sub-plot | It's not just that this film is a bloated rema… |
| 1 | [B00006JE7V :disappointi… | B00006JE7V | Insomnia (2002) (Spanish) (Sub): Video: Al Pac… | [video, video] | 1 of 2 | 2.0 | Disappointing in more ways than as a remake | October 5, 2005 | PipBoy | particularly the internal affairs sub-plot | It's not just that this film is a bloated rema… |
| 2 | [1569494088 :ambrose_b… | 1569494088 | Schoolhouse Rock! - America Rock: Video: Jack … | [video, video] | 0 of 18 | 1.0 | Ambrose Bierce is a Better Authority | July 28, 2004 | Charlotte Tellson "The Keeper of Bierce" | | I have rented this video for the sake of a tri… |

Saving the output

for i, review_text in enumerate(df['review_text']):
    # Zero-pad the file index to five digits, e.g. review_text00042.txt
    with open('reviews/output/review_text{:05d}.txt'.format(i), 'w', encoding='utf-8') as f:
        if review_text != "":
            f.write(review_text)

Using the Stanford Parser (NLP)

The Stanford Parser runs on both Linux and Windows; the corresponding launcher scripts have .sh and .bat extensions.
The tool also ships with a GUI.

Loading the parser

Loading the files

Dependency trees

Modifying lexparser

Since I'm on Windows, I edited the .bat file:

java -mx150m -cp "*;" edu.stanford.nlp.parser.lexparser.LexicalizedParser -outputFormat "penn,typedDependencies" edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz %1

-mx150m caps the JVM's maximum heap. It's worth raising; otherwise long sentences will crash the run partway through.

Also, setting outputFormat to xml doesn't work; it has to be xmlTree.

Checking the output file:

批量处理

查到了bat怎么循环,但没查到数字怎么前面补0,换python做,但是不在改bat同路径下会报找不到类的错误,按网上改了CLASSPATH也不行,-cp后的改成绝对路径也不行,还是在StanfordParse文件夹里运行python了

import os
from tqdm import tqdm

filelist = os.listdir('../reviews/output')
for file in tqdm(filelist):
    os.system(r'.\dependencyparser.bat ../reviews/output/review_text{0}.txt > ../reviews_xml/review_parsed{0}.xml'.format(file[11:-4]))

This works… but at this pace the whole run would take an estimated 15 hours, which is too slow, so I moved the job to a Linux server instead:

for i in $(seq 0 15351)
do
  id=$(printf "%05d" "$i")
  ./dependencyparser.sh "../reviews/output/review_text${id}.txt" > "../reviews_xml/review_parsed${id}.xml"
done

It then ran overnight.

| File | Size (B) |
|---|---|
| review_text | 63220 |
| review_parsed | 406324 |

Sentiment word extraction

The abbreviations to know in the xml output:

  • JJ: adjective or numeral/ordinal, e.g. great
  • ADJP: adjective phrase, wrapping one or more JJ
  • NN: noun
  • NP: noun phrase

The sentences we want to find generally have this structure:

NP

|-- JJ /ADJP +

|-- NN {1,1}

Because an NP can nest another NP, we can't simply grab every NP. Instead, start from the bottom: for each JJ, take its nearest enclosing NP, then look down for the NN.

Searching bottom-up, though, means checking whether two hits belong to the same NP; if they do, the adjectives go into one list, which can later be joined into a comma-separated string.

Alternatively, query NP/JJ and NP/ADJP separately; from the JJ's or ADJP's parent NP, look down for the NN and for all the JJ/ADJP children.
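
On a small hand-written fragment that mimics the parser's xmlTree shape (node elements carrying a value attribute, with leaf elements for the words, the same structure the script below queries), the bottom-up lookup works like this:

```python
from lxml import etree

# Toy fragment in xmlTree shape: <node value="POS"> wrapping <leaf value="word"/>.
xml = '''<node value="NP">
  <node value="JJ"><leaf value="bloated"/></node>
  <node value="NN"><leaf value="remake"/></node>
</node>'''
tree = etree.XML(xml)

pairs = []
for jj in tree.xpath('//node[@value="NP"]/node[starts-with(@value, "JJ")]'):
    np = jj.getparent()                                  # nearest enclosing NP
    nns = np.xpath('./node[starts-with(@value, "NN")]')  # its noun child, if any
    if nns:
        pairs.append((nns[0].find('leaf').attrib['value'],
                      jj.find('leaf').attrib['value']))
print(pairs)  # [('remake', 'bloated')]
```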

Since one file's output is not a single tree, we wrap the content in a root element before parsing:

import os
from lxml import etree
from tqdm import tqdm_notebook

result = []
for filename in tqdm_notebook(os.listdir('./reviews_xml')):
    with open('./reviews_xml/' + filename) as f:
        content = '<root>' + ''.join(f.readlines()) + '</root>'
        tree = etree.XML(content)
        dic = {}
        nps = set() # keep node references alive so their id()s aren't reused
        for jj in tree.xpath('//node[@value="NP"]/node[starts-with(@value, "JJ")]'):
            np = jj.getparent()
            if len(np.xpath('./node[starts-with(@value, "NN")]')) == 0:
                continue
            nn = np.xpath('./node[starts-with(@value, "NN")]')[0]
            if id(nn) not in dic:
                dic[id(nn)] = [nn.find('leaf').attrib['value'], jj.find('leaf').attrib['value']]
                nps.add(np)
            else:
                dic[id(nn)] += [jj.find('leaf').attrib['value']]
                          
        for adjp in tree.xpath('//node[@value="NP"]/node[@value="ADJP"]'):
            np = adjp.getparent()
            if len(np.xpath('./node[starts-with(@value, "NN")]')) == 0:
                continue
            nn = np.xpath('./node[starts-with(@value, "NN")]')[0]
            ad = " ".join([leaf.attrib['value'] for leaf in adjp.xpath('.//leaf')])
            if id(np) not in dic:
                dic[id(np)] = [nn.find('leaf').attrib['value'], ad]
                nps.add(np)
            else:
                dic[id(np)] += [ad]
        result += [(value[0], ", ".join(list(value[1:]))) for value in dic.values()]
    

import pandas as pd

df = pd.DataFrame(result, columns=['target', 'sentiment words or phrase'])
df
| | target | sentiment words or phrase |
|---|---|---|
| 0 | remake | bloated |
| 1 | thriller | quick, clean, near-perfect, minimalist |
| 2 | affairs | internal, moral |
| 3 | shell | toned-down |
| 4 | performance | believable |
| 5 | bit | method-acting |
| 6 | scene | single |
| 7 | imitation | pale |
| 8 | roles | better |
| 9 | performances | fine |
| 10 | contemporaries | adult |
| 11 | performances | far-too-short |
| 12 | detective | dangerously off-kilter |
| 13 | characters | more complicated |
| 14 | director | sublimely talented |
| 15 | hometown | down-to-earth |
| 16 | change | refreshing |
| 17 | remake | bloated |
| 18 | thriller | quick, clean, near-perfect, minimalist |
| 19 | affairs | internal, moral |
| 20 | shell | toned-down |
| 21 | performance | believable |
| 22 | bit | method-acting |
| 23 | scene | single |
| 24 | imitation | pale |
| 25 | roles | better |
| 26 | performances | fine |
| 27 | contemporaries | adult |
| 28 | performances | far-too-short |
| 29 | detective | dangerously off-kilter |
| … | … | … |
| 124108 | film | Sci-Fi, Semi-Porn |
| 124109 | movies | randy |
| 124110 | science | hokey |
| 124111 | movies | other, similar |
| 124112 | Features | following, Special |
| 124113 | movie | average |
| 124114 | sense | commercial |
| 124115 | hokey | average |
| 124116 | bodies | great |
| 124117 | public | sexually repressed |
| 124118 | male | similarly attired |
| 124119 | sh | scared |
| 124120 | vignette | last and most bizarre |
| 124121 | sci-fi | extremely weak |
| 124122 | star | low four |
| 124123 | extravaganza | better than your average hokey Sci-Fi Semi-Porn |
| 124124 | science | erotic |
| 124125 | employee | former |
| 124126 | bit | least |
| 124127 | content | sexual |
| 124128 | employers | former |
| 124129 | knowledge | very |
| 124130 | stories | individual |
| 124131 | abduction | alien |
| 124132 | pilot | female |
| 124133 | planet | distant |
| 124134 | number | significant, lovely |
| 124135 | softcore | impressive |
| 124136 | proof | further |
| 124137 | movies | best, erotic |

124138 rows × 2 columns
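
Since the two Insomnia reviews are near-duplicates, many (target, sentiment) pairs appear twice. If that's unwanted, pandas' drop_duplicates collapses them; a sketch on toy data standing in for the extracted pairs:

```python
import pandas as pd

# Toy rows; the repeated first row mimics the duplicated review.
df = pd.DataFrame([('remake', 'bloated'),
                   ('remake', 'bloated'),
                   ('thriller', 'quick, clean, near-perfect, minimalist')],
                  columns=['target', 'sentiment words or phrase'])
deduped = df.drop_duplicates().reset_index(drop=True)
print(len(deduped))  # 2
```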
