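I have been learning Neo4j recently. Using the Douban Top 250 movie list as the data set, I scrape the pages online with Python and write the results into Neo4j to build a knowledge graph. The steps are described in detail below.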
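1. Knowledge graph design
Analysing the page shows that scraping it yields movie, country, type, time, director, actor and score. Here I model movie, country, type, time, director and actor as nodes and store score as a property of movie. Many tutorials online make only movie, director and actor nodes and keep everything else as properties of movie. I tried that earlier, but the result was not what I wanted; I come back to the reason later in this post. The node and relationship design is shown in the figure below.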
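As a rough summary of this design, the labels and relationship types can be written down as plain Python data. The names below are taken from the Cypher statements in the code of section 2; the `NODE_LABELS` and `RELATIONSHIPS` structures and the helper `relations_for` are only illustrative and are not part of the original script.

```python
# Illustrative summary of the graph design (not part of the original script).
# Labels, property keys and relationship types follow the Cypher in section 2.
NODE_LABELS = {
    "Movie": ["name", "score"],  # score is a property, not a node
    "Person": ["name"],          # directors and actors share one label
    "Time": ["year"],
    "Country": ["name"],
    "Type": ["name"],
}

# (start label, relationship type, end label)
RELATIONSHIPS = [
    ("Person", "directed", "Movie"),
    ("Person", "acted_in", "Movie"),
    ("Movie", "created_in", "Time"),
    ("Movie", "produced_by", "Country"),
    ("Movie", "belong_to", "Type"),
]


def relations_for(label):
    """Return the relationship types that start or end at the given label."""
    return [rel for start, rel, end in RELATIONSHIPS if label in (start, end)]


if __name__ == "__main__":
    print(relations_for("Movie"))
```

Every relationship touches a Movie node, which is what later allows movies to be reached from a country, type or year node.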
2. Scraping the data and writing it to Neo4j
Here is the code:
- import random
- import re
- from urllib.request import Request, urlopen
-
- from bs4 import BeautifulSoup
- from py2neo import Graph
-
- # User-Agent strings to rotate between requests
- ua_list = [
-     "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36",  # Chrome
-     "Mozilla/5.0 (Windows NT 6.1; rv:40.0) Gecko/20100101 Firefox/40.0",  # Firefox
-     "Mozilla/5.0 (compatible, MSIE 11, Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko",  # IE
-     "Opera/9.99 (Windows NT 5.1; U; zh-CN) Presto/9.9.9",  # Opera
- ]
-
- # characters stripped from names before they are stored
- punc = ':· .-'
-
-
- def clean(name):
-     """Remove special characters and spaces from a name."""
-     return re.sub(r"[%s]+" % punc, "", name.strip())
-
-
- def link(graph, seen, label, key, value, film, pattern):
-     """Create the node the first time it is seen, then relate it to the movie."""
-     if value not in seen:
-         graph.run("CREATE (:%s {%s: $value})" % (label, key), value=value)
-         seen.append(value)
-     graph.run(
-         "MATCH (p:%s {%s: $value}), (b:Movie {name: $film}) CREATE %s" % (label, key, pattern),
-         value=value, film=film)
-
-
- if __name__ == "__main__":
-     # connect to graph
-     graph = Graph(
-         "http://localhost:11010/",
-         username="admin",
-         password="password"
-     )
-     # the Top 250 list spans 10 pages of 25 movies each
-     for page_no in range(10):
-         ua = random.choice(ua_list)
-         url = 'https://movie.douban.com/top250?start=' + str(page_no * 25) + '&filter='
-         req = Request(url, headers={'User-agent': ua})
-         html = urlopen(req).read()
-         soup = BeautifulSoup(html, 'lxml')
-         page = soup.find_all('div', {'class': 'item'})
-         list_item = []  # values already created as nodes on this page
-         for item in page:
-             try:
-                 info = item.find('p', {'class': ""}).text.strip().split('\n')
-                 text0 = info[0]  # director and cast line
-                 text1 = info[1]  # year / country / type line
-                 # get film
-                 film = clean(item.find('span', {'class': 'title'}).text)
-                 # get score
-                 score = item.find('span', {'class': 'rating_num'}).text.strip()
-                 graph.run("CREATE (movie:Movie {name: $name, score: $score})", name=film, score=score)
-                 # director and cast are separated by a run of spaces
-                 people = re.split(r"\s{2,}", text0.strip())
-                 # get director (several directors are kept as one "A/B" name)
-                 directors = clean(people[0].split(':')[1])
-                 if len(directors.split('/')) > 1:
-                     print(film + ' has more than one director')
-                 link(graph, list_item, 'Person', 'name', directors, film, '(p)-[:directed]->(b)')
-                 # get actors
-                 raw_cast = people[1].split(':')[1].strip()
-                 actors = clean(raw_cast).split('/')
-                 if raw_cast.endswith('...'):
-                     actors = actors[:-1]  # the last entry is cut off by the page
-                 for actor in actors:
-                     if actor:
-                         link(graph, list_item, 'Person', 'name', actor, film, '(p)-[:acted_in]->(b)')
-                 fields = text1.strip().split('/')
-                 # get time
-                 year = fields[0].strip()
-                 link(graph, list_item, 'Time', 'year', year, film, '(b)-[:created_in]->(p)')
-                 # get country (there may be more than one; only the first is kept)
-                 country = fields[1].strip().split()[0]
-                 link(graph, list_item, 'Country', 'name', country, film, '(b)-[:produced_by]->(p)')
-                 # get type (a movie may have one or several)
-                 for genre in fields[2].strip().split():
-                     link(graph, list_item, 'Type', 'name', genre, film, '(b)-[:belong_to]->(p)')
-             except Exception as exc:
-                 # entries whose layout does not match are skipped, but reported
-                 print('skipped part of an entry:', exc)
The code is fairly rough and will be improved later.
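The string handling is the most fragile part of the script, so it can help to try it on a single entry without touching the network or the database. The sketch below is only an illustration: `parse_info` and `clean` are hypothetical helpers, and `SAMPLE` is an assumed example of the text inside one entry's `<p class="">` element, not output captured from the site.

```python
import re

PUNC = ":· .-"  # characters the script strips from names

# Assumed example of one entry's info block (\xa0 is a non-breaking space).
SAMPLE = (
    "导演: 弗兰克·德拉邦特 Frank Darabont\xa0\xa0\xa0主演: 蒂姆·罗宾斯 Tim Robbins /...\n"
    "    1994\xa0/\xa0美国\xa0/\xa0犯罪 剧情"
)


def clean(name):
    """Remove separators and spaces from a name."""
    return re.sub(r"[%s]+" % PUNC, "", name.strip())


def parse_info(text):
    """Split the two-line info block of one entry into its fields."""
    lines = text.strip().split("\n")
    # Director and cast are separated by a run of two or more spaces.
    people = re.split(r"\s{2,}", lines[0].strip())
    director = clean(people[0].split(":")[1])
    raw_cast = people[1].split(":")[1].strip()
    actors = clean(raw_cast).split("/")
    if raw_cast.endswith("..."):
        actors = actors[:-1]  # last entry is empty or cut off
    fields = lines[1].split("/")
    return {
        "director": director,
        "actors": [a for a in actors if a],
        "year": fields[0].strip(),
        "country": fields[1].strip().split()[0],  # first country only
        "types": fields[2].strip().split(),
    }


if __name__ == "__main__":
    print(parse_info(SAMPLE))
```

Keeping the parsing in a function of its own makes it possible to check awkward entries (several directors, a missing cast list) before any node is written.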
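3. Knowledge graph showcase
The overall result is shown in the figure above: related movies can be found explicitly through the country, type and time nodes. If only movie, director and actor were nodes, you would have to click an individual node to see properties such as country, type and time.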
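To illustrate what retrieving through these nodes looks like in a query, the sketch below builds a Cypher statement that lists the movies attached to a given country, type or year node. `movies_by` and `FACETS` are hypothetical names; the labels, property keys and relationship types come from the script in section 2. The call to `graph.run` is left as a comment because it needs a running database.

```python
# (label, property key, relationship type) for each node movies point to
FACETS = {
    "country": ("Country", "name", "produced_by"),
    "type": ("Type", "name", "belong_to"),
    "time": ("Time", "year", "created_in"),
}


def movies_by(facet, value):
    """Build a parameterised Cypher query for movies linked to one node."""
    label, key, rel = FACETS[facet]
    query = (
        "MATCH (m:Movie)-[:%s]->(n:%s {%s: $value}) "
        "RETURN m.name AS name, m.score AS score" % (rel, label, key)
    )
    return query, {"value": value}


if __name__ == "__main__":
    query, params = movies_by("country", "美国")
    print(query)
    # With a py2neo Graph object named `graph`, as in section 2:
    # for record in graph.run(query, **params):
    #     print(record["name"], record["score"])
```

With country, type and time stored only as movie properties, the same lookup would be a property filter on Movie nodes, and the shared value would not appear as a node in the visualisation.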
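With that, a simple Douban Top 250 knowledge graph is complete. One problem remains, however: duplicated data. After finishing, I found that not only nodes but, surprisingly, relationships are duplicated as well. I am still investigating this.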
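One likely contributor, judging from the script alone: `list_item` is reset for every page, so a person, country or year that appears on several pages is created again with `CREATE`. A later `MATCH ... CREATE` then matches every copy of that node and adds one relationship per copy. Running the script twice against the same database would duplicate everything as well. A common way to avoid this in Cypher is `MERGE`, which creates a node or relationship only when no match exists. The sketch below is an assumption about how the statements could be adapted, not a verified fix; `merge_statement` is a hypothetical helper.

```python
def merge_statement(label, key, pattern):
    """Build a Cypher statement that reuses existing nodes and relationships.

    `pattern` is the relationship to ensure between p (the other node)
    and b (the movie), for example "(p)-[:directed]->(b)".
    """
    return (
        "MERGE (p:%s {%s: $value}) "
        "MERGE (b:Movie {name: $film}) "
        "MERGE %s" % (label, key, pattern)
    )


if __name__ == "__main__":
    stmt = merge_statement("Person", "name", "(p)-[:directed]->(b)")
    print(stmt)
    # With a py2neo Graph object named `graph`, as in section 2:
    # graph.run(stmt, value=director_name, film=film_name)
```

With `MERGE` the `list_item` bookkeeping becomes unnecessary, since the database itself decides whether a node already exists.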