Scraping Data Online with Python and Importing It into Neo4j to Build a Knowledge Graph

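I have recently been learning Neo4j. Using the Douban Top 250 movie list as the data set, I scraped the data online with Python and wrote it into Neo4j to build a knowledge graph. The steps are described in detail below.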

1. Knowledge Graph Design

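Analyzing and scraping the page yields movie, country, type, time, director, actor, and score information. Here I treat movie, country, type, time, director, and actor as nodes, and score as a property of movie. Many tutorials online make only movie, director, and actor nodes and store everything else as properties of movie. I tried that approach earlier, but the result was not what I wanted; the reason is explained in section 3. The node and relationship design is shown in the figure below.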
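As a rough sketch of this design, the node labels and relationship types can be written down as plain Python data. The label and relationship names are the ones used by the script in section 2; the `NODE_KEYS` and `RELATIONSHIPS` structures and the `pattern` helper are hypothetical names introduced only for illustration.

```python
# Node labels and the property that identifies each node.
# score is not a node: it is stored as a property of Movie.
NODE_KEYS = {
    "Movie": "name",
    "Person": "name",   # directors and actors share one label
    "Time": "year",
    "Country": "name",
    "Type": "name",
}

# (source label, relationship type, target label)
RELATIONSHIPS = [
    ("Person", "directed", "Movie"),
    ("Person", "acted_in", "Movie"),
    ("Movie", "created_in", "Time"),
    ("Movie", "produced_by", "Country"),
    ("Movie", "belong_to", "Type"),
]


def pattern(source, rel, target):
    """Render one relationship as a Cypher-style pattern string."""
    return "(:%s)-[:%s]->(:%s)" % (source, rel, target)


# Print the whole design, one relationship per line.
for triple in RELATIONSHIPS:
    print(pattern(*triple))
```

Writing the design down this way makes it easy to see that every non-movie node is exactly one hop away from a movie, which is what allows browsing by country, type, or year later on.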

2. Scraping the Data and Writing It to Neo4j

Here is the code:

    import random
    import re
    from urllib.request import Request, urlopen

    from bs4 import BeautifulSoup
    from py2neo import Graph

    # A few browser User-Agent strings; one is picked at random for each request.
    ua_list = [
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36",  # Chrome
        "Mozilla/5.0 (Windows NT 6.1; rv:40.0) Gecko/20100101 Firefox/40.0",  # Firefox
        "Mozilla/5.0 (compatible, MSIE 11, Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko",  # IE
        "Opera/9.99 (Windows NT 5.1; U; zh-CN) Presto/9.9.9",  # Opera
    ]

    # Special characters that are removed from names before they are written to Neo4j.
    punc = ':· - ...:-'
    punc_re = re.compile(r"[%s]+" % punc)


    def link(graph, list_item, film, label, key, value, rel, movie_is_source=True):
        """Create the (label {key: value}) node if it is new on this page, then link it to the movie."""
        # Values are passed as query parameters, so quotes inside a name cannot break the query.
        if value not in list_item:
            graph.run("CREATE (n:%s {%s: $value})" % (label, key), value=value)
            list_item.append(value)
        arrow = "(b)-[:%s]->(p)" if movie_is_source else "(p)-[:%s]->(b)"
        graph.run(
            "MATCH (p:%s {%s: $value}), (b:Movie {name: $film}) CREATE %s"
            % (label, key, arrow % rel),
            value=value, film=film,
        )


    if __name__ == "__main__":
        # Connect to the graph. Depending on the py2neo version, the credentials
        # may need to be passed as auth=("admin", "password") instead.
        graph = Graph(
            "http://localhost:11010/",
            username="admin",
            password="password",
        )
        # The Top 250 list is spread over 10 pages of 25 films each.
        for page_no in range(10):
            ua = random.choice(ua_list)
            url = 'https://movie.douban.com/top250?start=' + str(page_no * 25) + '&filter='
            req = Request(url, headers={'User-agent': ua})
            html = urlopen(req).read()
            soup = BeautifulSoup(html, 'lxml')
            page = soup.find_all('div', {'class': 'item'})
            # Names of the nodes already created; note that it starts empty on every page.
            list_item = []
            for item in page:
                try:
                    lines = item.find('p', {'class': ""}).text.strip().split('\n')
                    text0, text1 = lines[0], lines[1]
                    # get film
                    film = item.find('span', {'class': 'title'}).text.strip()
                    film = punc_re.sub("", film)
                    # get score (stored as a string property of the movie)
                    score = item.find('span', {'class': 'rating_num'}).text.strip()
                    graph.run(
                        "CREATE (movie:Movie {name: $name, score: $score})",
                        name=film, score=score,
                    )
                    # get director: director and cast are separated by a run of whitespace
                    people = re.split(r"\s{3,}", text0.strip())
                    directors = punc_re.sub("", people[0].strip().split(':')[1].strip())
                    if len(directors.split('/')) > 1:
                        print(film + ' has more than one director')
                    link(graph, list_item, film, "Person", "name", directors,
                         "directed", movie_is_source=False)
                    # get actor
                    actors = punc_re.sub("", people[1].strip().split(':')[1].strip()).split('/')
                    if len(actors) > 1:
                        # The last entry is usually cut off on the list page, so it is skipped.
                        actors = actors[:-1]
                    for actor in actors:
                        link(graph, list_item, film, "Person", "name", actor,
                             "acted_in", movie_is_source=False)
                    # get time
                    fields = text1.strip().split('/')
                    year = fields[0].strip()
                    link(graph, list_item, film, "Time", "year", year, "created_in")
                    # get country (there may be more than one; only the first is used)
                    country = fields[1].strip().split(' ')[0]
                    link(graph, list_item, film, "Country", "name", country, "produced_by")
                    # get type (one node per genre)
                    for type_name in fields[2].strip().split(' '):
                        link(graph, list_item, film, "Type", "name", type_name, "belong_to")
                except Exception as exc:
                    # Skip entries whose layout does not match the assumptions above.
                    print('skipped one entry:', exc)
                    continue

The code is fairly rough and will be refined later.
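One possible refinement is to separate the string parsing from the database writes, so the parsing can be checked without a running Neo4j instance. The sketch below is only an illustration under stated assumptions: `parse_info` is a hypothetical helper, and the two sample lines are hand-written in the shape the script expects (director and cast on the first line, year / country / genres on the second), not fetched from the live page.

```python
import re

# Same character set the script strips from names.
PUNC_RE = re.compile(r"[%s]+" % ':· - ...:-')


def parse_info(text0, text1):
    """Split the two info lines of one list entry into separate fields."""
    # Director and cast are separated by three or more whitespace characters
    # (\s also matches non-breaking spaces).
    parts = re.split(r"\s{3,}", text0.strip())
    director = PUNC_RE.sub("", parts[0].split(":")[1].strip())
    actors = []
    if len(parts) > 1 and ":" in parts[1]:
        names = PUNC_RE.sub("", parts[1].split(":")[1].strip()).split("/")
        # With several names, the last one is assumed to be cut off on the list page.
        actors = names if len(names) == 1 else names[:-1]
    fields = [f.strip() for f in text1.strip().split("/")]
    return {
        "director": director,
        "actors": actors,
        "year": fields[0],
        "country": fields[1].split(" ")[0],   # only the first country is kept
        "types": fields[2].split(" "),
    }


# Assumed sample input, written by hand for illustration.
sample0 = "导演: 弗兰克·德拉邦特 Frank Darabont\xa0\xa0\xa0主演: 蒂姆·罗宾斯 Tim Robbins /..."
sample1 = "1994\xa0/\xa0美国\xa0/\xa0犯罪 剧情"
print(parse_info(sample0, sample1))
```

Unlike the script, this helper returns an empty cast list instead of raising when the cast part is missing; that is a design choice of the sketch, not behavior of the original code.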

3. Showing the Knowledge Graph

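The overall result is shown in the figure above: related information can be retrieved explicitly through the country, type, and time nodes. If only movie, director, and actor were nodes, you would have to click an individual node to see its country, type, time, and other properties.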
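To illustrate what retrieving through these nodes could look like in a query, here is a minimal sketch that builds a Cypher statement for "all movies attached to one attribute node". The labels and relationship types are the ones from the script in section 2; `movies_linked_to` is a hypothetical helper, and the value passed at run time (for example a country name) is up to the caller.

```python
def movies_linked_to(label, key, rel):
    """Build a Cypher query listing the movies connected to one attribute node."""
    return (
        "MATCH (m:Movie)-[:%s]->(n:%s {%s: $value}) "
        "RETURN m.name AS name, m.score AS score" % (rel, label, key)
    )


by_country = movies_linked_to("Country", "name", "produced_by")
by_type = movies_linked_to("Type", "name", "belong_to")
by_year = movies_linked_to("Time", "year", "created_in")
print(by_country)

# With a py2neo Graph object named `graph`, as in section 2, a query could be run as:
#   graph.run(by_country, value="美国").data()
```

Because country, type, and year are nodes rather than properties, each of these lookups is a single relationship hop from the attribute node to its movies.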

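With that, a simple Douban Top 250 knowledge graph is built. One problem remains, however: duplicated data. After finishing, I found that not only nodes but also relationships are duplicated. I am still looking into this.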
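Judging from the script in section 2, one plausible contributor is that the list tracking already-created nodes starts empty on every page, so a name that appears on several pages is created again, and a later MATCH ... CREATE then links a movie to every copy. Running the script more than once would duplicate everything as well. A commonly used way to avoid this is Cypher's `MERGE`, which creates a node or relationship only when no match exists. The sketch below is an assumption about how the writes could be restructured, not the method used above; `merge_link_query` is a hypothetical helper.

```python
def merge_link_query(label, key, rel, movie_is_source=True):
    """Build one Cypher statement that merges both nodes and their relationship."""
    arrow = "(m)-[:%s]->(n)" if movie_is_source else "(n)-[:%s]->(m)"
    return (
        "MERGE (m:Movie {name: $film}) "
        "MERGE (n:%s {%s: $value}) "
        "MERGE %s" % (label, key, arrow % rel)
    )


type_query = merge_link_query("Type", "name", "belong_to")
director_query = merge_link_query("Person", "name", "directed", movie_is_source=False)
print(type_query)

# With a py2neo Graph object named `graph`, as in section 2, this could be run as:
#   graph.run(type_query, film=film, value=type_name)
```

With this approach the Python-side tracking list is no longer needed, since the database itself decides whether a node or relationship already exists.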

 
