I have used the xml.ElementTree module before, and Beautiful Soup's interface is similar, so lxml was quick to pick up.
Also note: the working directory must not contain a .py file with the same name as a package you import, e.g. xml.py.
```python
from lxml import etree

parser = etree.XMLParser(recover=True)
tree = etree.parse('reviews/video/reviews.xml', parser)
root = tree.getroot()
root
```

```
<Element reviews at 0x289a8d440c8>
```
This raised an error: `'utf-8' codec can't decode byte 0xe8 in position 3278: invalid continuation byte`.
The UTF-8 decoder is not the problem; the file's encoding is. Re-saving the file with UTF-8 encoding in Sublime Text makes the error go away.
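For reference, the same fix can also be scripted. This is only a sketch: the file's real encoding is not known here, so `latin-1` is just a stand-in codec that maps every byte to a code point without raising; check the rewritten text before relying on it.

```python
# Hypothetical re-encoding step; 'latin-1' is a stand-in, not the file's known encoding.
raw = open('reviews/video/reviews.xml', 'rb').read()
text = raw.decode('latin-1')  # decodes any byte sequence; swap in the real codec if known
with open('reviews/video/reviews_utf8.xml', 'w', encoding='utf-8') as f:
    f.write(text)
```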
Inspecting the file shows that `reviews` contains multiple `review` elements, one per line, each with 11 fields. The `unique_id` and `product_type` fields each appear twice per review, so they need to be collected into lists.
```python
import pandas as pd

lists = []
for review in root.findall('review'):
    # unique_id and product_type each occur twice per review, so collect them as lists
    id_list = [j.text.strip() for j in review.findall("unique_id")]
    asin = review.find("asin").text.strip()
    product_name = review.find("product_name").text.strip()
    type_list = [j.text.strip() for j in review.findall("product_type")]
    helpful = review.find("helpful").text.strip()
    rating = review.find("rating").text.strip()
    title = review.find("title").text.strip()
    date = review.find("date").text.strip()
    reviewer = review.find("reviewer").text.strip()
    reviewer_location = review.find("reviewer_location").text.strip()
    review_text = review.find("review_text")
    review_text = review_text.text.strip() if review_text is not None else ""
    lists.append([id_list, asin, product_name, type_list, helpful, rating, title,
                  date, reviewer, reviewer_location, review_text])

cols = ['unique_id', 'asin', 'product_name', 'product_type', 'helpful', 'rating',
        'title', 'date', 'reviewer', 'reviewer_location', 'review_text']
df = pd.DataFrame(lists, columns=cols)
df.head(3)
```
|   | unique_id | asin | product_name | product_type | helpful | rating | title | date | reviewer | reviewer_location | review_text |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | [B00006IUOZ :disappointi… | B00006IUOZ | Insomnia: Video: Al Pacino,Martin Donovan (II)… | [video, video] | 1 of 2 | 2.0 | Disappointing in more ways than as a remake | October 5, 2005 | PipBoy | particularly the internal affairs sub-plot | It’s not just that this film is a bloated rema… |
| 1 | [B00006JE7V :disappointi… | B00006JE7V | Insomnia (2002) (Spanish) (Sub): Video: Al Pac… | [video, video] | 1 of 2 | 2.0 | Disappointing in more ways than as a remake | October 5, 2005 | PipBoy | particularly the internal affairs sub-plot | It’s not just that this film is a bloated rema… |
| 2 | [1569494088 :ambrose_b… | 1569494088 | Schoolhouse Rock! - America Rock: Video: Jack … | [video, video] | 0 of 18 | 1.0 | Ambrose Bierce is a Better Authority | July 28, 2004 | Charlotte Tellson “The Keeper of Bierce” | I have rented this video for the sake of a tri… |
```python
i = 0
for row in df.iterrows():
    # one output file per review; empty reviews still get an (empty) file,
    # so the numbering stays aligned with the row index
    with open('reviews/output/review_text{}.txt'.format('%05d' % i), 'w', encoding='utf-8') as f:
        review_text = row[1]['review_text']
        if review_text != "":
            f.write(review_text)
    i += 1
```
The Stanford Parser runs on both Linux and Windows; the launch scripts end in .sh and .bat respectively, and the tool also ships a GUI.
Since I am on Windows here, I edit the .bat file:
```bat
java -mx150m -cp "*;" edu.stanford.nlp.parser.lexparser.LexicalizedParser -outputFormat "penn,typedDependencies" edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz %1
```
`-mx150m` caps the JVM's maximum heap. Raise it, otherwise long sentences make the parser error out and abort.
Also, setting `outputFormat` to `xml` does not work; it has to be `xmlTree`.
Inspecting the output file:
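A quick way to peek at one parsed file from Python (a sketch; `review_parsed00000.xml` is just an example name, and the `<root>` wrapping anticipates the fact, noted further down, that each output file holds several trees rather than one well-formed document):

```python
from lxml import etree

# Example file name; the xmlTree output contains several trees, so wrap it first.
with open('../reviews_xml/review_parsed00000.xml') as f:
    tree = etree.XML('<root>' + f.read() + '</root>')

# Each <leaf value="..."> carries one token; print the tokens of each tree.
for sent in tree:
    print(" ".join(leaf.attrib['value'] for leaf in sent.xpath('.//leaf')))
```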
I found how to write a loop in a .bat file, but not how to zero-pad the counter, so I switched to Python. Running it outside the Stanford Parser directory, however, throws a class-not-found error; setting CLASSPATH as suggested online did not help, nor did an absolute path after -cp, so in the end I just ran the Python script inside the StanfordParser folder.
```python
import os
from tqdm import tqdm

filelist = os.listdir('../reviews/output')
for file in tqdm(filelist):
    os.system(r'.\dependencyparser.bat ../reviews/output/review_text{0}.txt > ../reviews_xml/review_parsed{0}.xml'.format(file[11:-4]))
```
This works, more or less, but I estimated it would take about 15 hours to finish. Too slow, so I moved the job to a Linux server instead:
```bash
for i in $(seq 0 15351)
do
    id=`printf "%05d" $i`
    echo $(./dependencyparser.sh "../reviews/output/review_text"$id".txt" > "../reviews_xml/review_parsed"$id".xml")
done
```
Then it ran overnight.
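If even an overnight run is too long, one option (not what I did here) is to fan the per-file parses out over several worker processes from Python. A sketch, assuming the same dependencyparser.sh and directory layout as in the shell loop above; note that every invocation starts its own JVM and loads the model, so max_workers is bounded by available RAM:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def parse_one(i):
    # Mirrors one iteration of the shell loop: parser output goes to review_parsedNNNNN.xml
    fid = '%05d' % i
    with open('../reviews_xml/review_parsed{}.xml'.format(fid), 'w') as out:
        subprocess.run(['./dependencyparser.sh',
                        '../reviews/output/review_text{}.txt'.format(fid)],
                       stdout=out)

# Threads are enough here because the heavy work happens in the Java subprocesses.
with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(parse_one, range(15352)))
```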
| File | Size (bytes) |
|---|---|
| review_text | 63220 |
| review_parsed | 406324 |
The tag abbreviations to watch for in the XML:

- `JJ`: adjective or numeral, ordinal, e.g. *great*
- `ADJP`: adjective phrase, which wraps one or more `JJ`s
- `NN`: noun
- `NP`: noun phrase

The sentences we want to find generally have this structure:
```
NP
|-- JJ / ADJP   (one or more)
|-- NN          (exactly one)
```
Because an `NP` can itself be nested inside another `NP`, we cannot simply search for `NP` directly. Instead, start from the bottom-level `JJ`s, find each `JJ`'s nearest enclosing `NP`, and then look down from that `NP` for the `NN`.
When searching bottom-up like this, we also have to check whether two `JJ`s belong to the same `NP`; if they do, collect them in one list so they can later be joined into a single comma-separated string.
Alternatively, search separately for `NP/JJ` and `NP/ADJP`, then go to the parent `NP` of each `JJ` or `ADJP` and, from there, find the `NN` below it together with all of its `JJ`s and `ADJP`s.
Because a single file's output is not one tree but several, wrap the content in an extra element before parsing:
```python
import os
import re
from lxml import etree
from tqdm import tqdm_notebook

result = []
for filename in tqdm_notebook(os.listdir('./reviews_xml')):
    with open('./reviews_xml/' + filename) as f:
        content = '<root>' + ''.join(f.readlines()) + '</root>'
    tree = etree.XML(content)
    dic = {}
    nps = set()  # keep references so the elements' id()s are not reused
    for jj in tree.xpath('//node[@value="NP"]/node[starts-with(@value, "JJ")]'):
        np = jj.getparent()
        if len(np.xpath('./node[starts-with(@value, "NN")]')) == 0:
            continue
        nn = np.xpath('./node[starts-with(@value, "NN")]')[0]
        if id(nn) not in dic:
            dic[id(nn)] = [nn.find('leaf').attrib['value'], jj.find('leaf').attrib['value']]
            nps.add(np)
        else:
            dic[id(nn)] += [jj.find('leaf').attrib['value']]
    for adjp in tree.xpath('//node[@value="NP"]/node[@value="ADJP"]'):
        np = adjp.getparent()
        if len(np.xpath('./node[starts-with(@value, "NN")]')) == 0:
            continue
        nn = np.xpath('./node[starts-with(@value, "NN")]')[0]
        ad = " ".join([leaf.attrib['value'] for leaf in adjp.xpath('.//leaf')])
        if id(np) not in dic:
            dic[id(np)] = [nn.find('leaf').attrib['value'], ad]
            nps.add(np)
        else:
            dic[id(np)] += [ad]
    result += [(value[0], ", ".join(list(value[1:]))) for value in dic.values()]
```
```python
import pandas as pd

df = pd.DataFrame(result, columns=['target', 'sentiment words or phrase'])
df
```
|   | target | sentiment words or phrase |
|---|---|---|
| 0 | remake | bloated |
| 1 | thriller | quick, clean, near-perfect, minimalist |
| 2 | affairs | internal, moral |
| 3 | shell | toned-down |
| 4 | performance | believable |
| 5 | bit | method-acting |
| 6 | scene | single |
| 7 | imitation | pale |
| 8 | roles | better |
| 9 | performances | fine |
| 10 | contemporaries | adult |
| 11 | performances | far-too-short |
| 12 | detective | dangerously off-kilter |
| 13 | characters | more complicated |
| 14 | director | sublimely talented |
| 15 | hometown | down-to-earth |
| 16 | change | refreshing |
| 17 | remake | bloated |
| 18 | thriller | quick, clean, near-perfect, minimalist |
| 19 | affairs | internal, moral |
| 20 | shell | toned-down |
| 21 | performance | believable |
| 22 | bit | method-acting |
| 23 | scene | single |
| 24 | imitation | pale |
| 25 | roles | better |
| 26 | performances | fine |
| 27 | contemporaries | adult |
| 28 | performances | far-too-short |
| 29 | detective | dangerously off-kilter |
| … | … | … |
| 124108 | film | Sci-Fi, Semi-Porn |
| 124109 | movies | randy |
| 124110 | science | hokey |
| 124111 | movies | other, similar |
| 124112 | Features | following, Special |
| 124113 | movie | average |
| 124114 | sense | commercial |
| 124115 | hokey | average |
| 124116 | bodies | great |
| 124117 | public | sexually repressed |
| 124118 | male | similarly attired |
| 124119 | sh | scared |
| 124120 | vignette | last and most bizarre |
| 124121 | sci-fi | extremely weak |
| 124122 | star | low four |
| 124123 | extravaganza | better than your average hokey Sci-Fi Semi-Porn |
| 124124 | science | erotic |
| 124125 | employee | former |
| 124126 | bit | least |
| 124127 | content | sexual |
| 124128 | employers | former |
| 124129 | knowledge | very |
| 124130 | stories | individual |
| 124131 | abduction | alien |
| 124132 | pilot | female |
| 124133 | planet | distant |
| 124134 | number | significant, lovely |
| 124135 | softcore | impressive |
| 124136 | proof | further |
| 124137 | movies | best, erotic |

124138 rows × 2 columns
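Note that the pairs extracted from the first review repeat starting at row 17: the same review text appears under two different products in the source data (the two Insomnia rows in df.head(3) above), so both copies were written out and parsed. A sketch of an optional cleanup and export step; the output path here is only illustrative:

```python
# Collapse pairs duplicated by reviews that appear under more than one unique_id,
# then save the result (the file name below is just an example).
df_unique = df.drop_duplicates()
df_unique.to_csv('reviews/target_sentiment_pairs.csv', index=False)
```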