I have used the xml.ElementTree module before, and Beautiful Soup's interface is similar, so lxml was quick to pick up.
Also note: the working directory must not contain a .py file with the same name as a package you import, e.g. xml.py.
```python
from lxml import etree

parser = etree.XMLParser(recover=True)
tree = etree.parse('reviews/video/reviews.xml', parser)
root = tree.getroot()
root
```

```
<Element reviews at 0x289a8d440c8>
```
This raised an error: `'utf-8' codec can't decode byte 0xe8 in position 3278: invalid continuation byte`.
The UTF-8 decoder is not the problem; the file's encoding is. Re-saving the file with UTF-8 encoding in Sublime Text makes the error go away.
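For reference, the same fix can also be scripted. This is only a sketch: the file's real encoding is not known here, so `latin-1` is just a stand-in codec that maps every byte to a code point without raising; check the rewritten text before relying on it.

```python
# Hypothetical re-encoding step; 'latin-1' is a stand-in, not the file's known encoding.
raw = open('reviews/video/reviews.xml', 'rb').read()
text = raw.decode('latin-1')  # decodes any byte sequence; swap in the real codec if known
with open('reviews/video/reviews_utf8.xml', 'w', encoding='utf-8') as f:
    f.write(text)
```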
Inspecting the file shows that `reviews` contains multiple `review` elements, one per line, each with 11 fields. The `unique_id` and `product_type` fields each appear twice per review, so they need to be collected into lists.
```python
import pandas as pd

lists = []
for review in root.findall('review'):
    # unique_id and product_type each occur twice per review, so collect them as lists
    id_list = [j.text.strip() for j in review.findall("unique_id")]
    asin = review.find("asin").text.strip()
    product_name = review.find("product_name").text.strip()
    type_list = [j.text.strip() for j in review.findall("product_type")]
    helpful = review.find("helpful").text.strip()
    rating = review.find("rating").text.strip()
    title = review.find("title").text.strip()
    date = review.find("date").text.strip()
    reviewer = review.find("reviewer").text.strip()
    reviewer_location = review.find("reviewer_location").text.strip()
    review_text = review.find("review_text")
    review_text = review_text.text.strip() if review_text is not None else ""
    lists.append([id_list, asin, product_name, type_list, helpful, rating, title,
                  date, reviewer, reviewer_location, review_text])

cols = ['unique_id', 'asin', 'product_name', 'product_type', 'helpful', 'rating',
        'title', 'date', 'reviewer', 'reviewer_location', 'review_text']
df = pd.DataFrame(lists, columns=cols)
df.head(3)
```
|   | unique_id | asin | product_name | product_type | helpful | rating | title | date | reviewer | reviewer_location | review_text |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | [B00006IUOZ :disappointi… | B00006IUOZ | Insomnia: Video: Al Pacino,Martin Donovan (II)… | [video, video] | 1 of 2 | 2.0 | Disappointing in more ways than as a remake | October 5, 2005 | PipBoy | particularly the internal affairs sub-plot | It’s not just that this film is a bloated rema… |
| 1 | [B00006JE7V :disappointi… | B00006JE7V | Insomnia (2002) (Spanish) (Sub): Video: Al Pac… | [video, video] | 1 of 2 | 2.0 | Disappointing in more ways than as a remake | October 5, 2005 | PipBoy | particularly the internal affairs sub-plot | It’s not just that this film is a bloated rema… |
| 2 | [1569494088 :ambrose_b… | 1569494088 | Schoolhouse Rock! - America Rock: Video: Jack … | [video, video] | 0 of 18 | 1.0 | Ambrose Bierce is a Better Authority | July 28, 2004 | Charlotte Tellson “The Keeper of Bierce” | I have rented this video for the sake of a tri… |
```python
i = 0
for row in df.iterrows():
    # one output file per review; empty reviews still get an (empty) file,
    # so the numbering stays aligned with the row index
    with open('reviews/output/review_text{}.txt'.format('%05d' % i), 'w', encoding='utf-8') as f:
        review_text = row[1]['review_text']
        if review_text != "":
            f.write(review_text)
    i += 1
```
The Stanford Parser runs on both Linux and Windows; the launch scripts end in .sh and .bat respectively, and the tool also ships a GUI.
Since I am on Windows here, I edit the .bat file:
```bat
java -mx150m -cp "*;" edu.stanford.nlp.parser.lexparser.LexicalizedParser -outputFormat "penn,typedDependencies" edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz %1
```
`-mx150m` caps the JVM's maximum heap. Raise it, otherwise long sentences make the parser error out and abort.
Also, setting `outputFormat` to `xml` does not work; it has to be `xmlTree`.
Inspecting the output file:
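A quick way to peek at one parsed file from Python (a sketch; `review_parsed00000.xml` is just an example name, and the `<root>` wrapping anticipates the fact, noted further down, that each output file holds several trees rather than one well-formed document):

```python
from lxml import etree

# Example file name; the xmlTree output contains several trees, so wrap it first.
with open('../reviews_xml/review_parsed00000.xml') as f:
    tree = etree.XML('<root>' + f.read() + '</root>')

# Each <leaf value="..."> carries one token; print the tokens of each tree.
for sent in tree:
    print(" ".join(leaf.attrib['value'] for leaf in sent.xpath('.//leaf')))
```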
I found how to write a loop in a .bat file, but not how to zero-pad the counter, so I switched to Python. Running it outside the Stanford Parser directory, however, throws a class-not-found error; setting CLASSPATH as suggested online did not help, nor did an absolute path after -cp, so in the end I just ran the Python script inside the StanfordParser folder.
```python
import os
from tqdm import tqdm

filelist = os.listdir('../reviews/output')
for file in tqdm(filelist):
    os.system(r'.\dependencyparser.bat ../reviews/output/review_text{0}.txt > ../reviews_xml/review_parsed{0}.xml'.format(file[11:-4]))
```
This works, more or less, but I estimated it would take about 15 hours to finish. Too slow, so I moved the job to a Linux server instead:
```bash
for i in $(seq 0 15351)
do
    id=`printf "%05d" $i`
    echo $(./dependencyparser.sh "../reviews/output/review_text"$id".txt" > "../reviews_xml/review_parsed"$id".xml")
done
```
Then it ran overnight.
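If even an overnight run is too long, one option (not what I did here) is to fan the per-file parses out over several worker processes from Python. A sketch, assuming the same dependencyparser.sh and directory layout as in the shell loop above; note that every invocation starts its own JVM and loads the model, so max_workers is bounded by available RAM:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def parse_one(i):
    # Mirrors one iteration of the shell loop: parser output goes to review_parsedNNNNN.xml
    fid = '%05d' % i
    with open('../reviews_xml/review_parsed{}.xml'.format(fid), 'w') as out:
        subprocess.run(['./dependencyparser.sh',
                        '../reviews/output/review_text{}.txt'.format(fid)],
                       stdout=out)

# Threads are enough here because the heavy work happens in the Java subprocesses.
with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(parse_one, range(15352)))
```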
| File | Size (bytes) |
|---|---|
| review_text | 63220 |
| review_parsed | 406324 |
The tag abbreviations to watch for in the XML:

- `JJ`: adjective or numeral, ordinal, e.g. *great*
- `ADJP`: adjective phrase, which wraps one or more `JJ`s
- `NN`: noun
- `NP`: noun phrase

The sentences we want to find generally have this structure:
```
NP
|-- JJ / ADJP   (one or more)
|-- NN          (exactly one)
```
Because an `NP` can itself be nested inside another `NP`, we cannot simply search for `NP` directly. Instead, start from the bottom-level `JJ`s, find each `JJ`'s nearest enclosing `NP`, and then look down from that `NP` for the `NN`.
When searching bottom-up like this, we also have to check whether two `JJ`s belong to the same `NP`; if they do, collect them in one list so they can later be joined into a single comma-separated string.
Alternatively, search separately for `NP/JJ` and `NP/ADJP`, then go to the parent `NP` of each `JJ` or `ADJP` and, from there, find the `NN` below it together with all of its `JJ`s and `ADJP`s.
Because a single file's output is not one tree but several, wrap the content in an extra element before parsing:
```python
import os
import re
from lxml import etree
from tqdm import tqdm_notebook

result = []
for filename in tqdm_notebook(os.listdir('./reviews_xml')):
    with open('./reviews_xml/' + filename) as f:
        content = '<root>' + ''.join(f.readlines()) + '</root>'
    tree = etree.XML(content)
    dic = {}
    nps = set()  # keep references so the elements' id()s are not reused
    for jj in tree.xpath('//node[@value="NP"]/node[starts-with(@value, "JJ")]'):
        np = jj.getparent()
        if len(np.xpath('./node[starts-with(@value, "NN")]')) == 0:
            continue
        nn = np.xpath('./node[starts-with(@value, "NN")]')[0]
        if id(nn) not in dic:
            dic[id(nn)] = [nn.find('leaf').attrib['value'], jj.find('leaf').attrib['value']]
            nps.add(np)
        else:
            dic[id(nn)] += [jj.find('leaf').attrib['value']]
    for adjp in tree.xpath('//node[@value="NP"]/node[@value="ADJP"]'):
        np = adjp.getparent()
        if len(np.xpath('./node[starts-with(@value, "NN")]')) == 0:
            continue
        nn = np.xpath('./node[starts-with(@value, "NN")]')[0]
        ad = " ".join([leaf.attrib['value'] for leaf in adjp.xpath('.//leaf')])
        if id(np) not in dic:
            dic[id(np)] = [nn.find('leaf').attrib['value'], ad]
            nps.add(np)
        else:
            dic[id(np)] += [ad]
    result += [(value[0], ", ".join(list(value[1:]))) for value in dic.values()]
```
```python
import pandas as pd

df = pd.DataFrame(result, columns=['target', 'sentiment words or phrase'])
df
```
|   | target | sentiment words or phrase |
|---|---|---|
| 0 | remake | bloated |
| 1 | thriller | quick, clean, near-perfect, minimalist |
| 2 | affairs | internal, moral |
| 3 | shell | toned-down |
| 4 | performance | believable |
| 5 | bit | method-acting |
| 6 | scene | single |
| 7 | imitation | pale |
| 8 | roles | better |
| 9 | performances | fine |
| 10 | contemporaries | adult |
| 11 | performances | far-too-short |
| 12 | detective | dangerously off-kilter |
| 13 | characters | more complicated |
| 14 | director | sublimely talented |
| 15 | hometown | down-to-earth |
| 16 | change | refreshing |
| 17 | remake | bloated |
| 18 | thriller | quick, clean, near-perfect, minimalist |
| 19 | affairs | internal, moral |
| 20 | shell | toned-down |
| 21 | performance | believable |
| 22 | bit | method-acting |
| 23 | scene | single |
| 24 | imitation | pale |
| 25 | roles | better |
| 26 | performances | fine |
| 27 | contemporaries | adult |
| 28 | performances | far-too-short |
| 29 | detective | dangerously off-kilter |
| … | … | … |
| 124108 | film | Sci-Fi, Semi-Porn |
| 124109 | movies | randy |
| 124110 | science | hokey |
| 124111 | movies | other, similar |
| 124112 | Features | following, Special |
| 124113 | movie | average |
| 124114 | sense | commercial |
| 124115 | hokey | average |
| 124116 | bodies | great |
| 124117 | public | sexually repressed |
| 124118 | male | similarly attired |
| 124119 | sh | scared |
| 124120 | vignette | last and most bizarre |
| 124121 | sci-fi | extremely weak |
| 124122 | star | low four |
| 124123 | extravaganza | better than your average hokey Sci-Fi Semi-Porn |
| 124124 | science | erotic |
| 124125 | employee | former |
| 124126 | bit | least |
| 124127 | content | sexual |
| 124128 | employers | former |
| 124129 | knowledge | very |
| 124130 | stories | individual |
| 124131 | abduction | alien |
| 124132 | pilot | female |
| 124133 | planet | distant |
| 124134 | number | significant, lovely |
| 124135 | softcore | impressive |
| 124136 | proof | further |
| 124137 | movies | best, erotic |

124138 rows × 2 columns
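Note that the pairs extracted from the first review repeat starting at row 17: the same review text appears under two different products in the source data (the two Insomnia rows in df.head(3) above), so both copies were written out and parsed. A sketch of an optional cleanup and export step; the output path here is only illustrative:

```python
# Collapse pairs duplicated by reviews that appear under more than one unique_id,
# then save the result (the file name below is just an example).
df_unique = df.drop_duplicates()
df_unique.to_csv('reviews/target_sentiment_pairs.csv', index=False)
```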