
# Python study notes: scraping website data with Python and saving it to a database

The previous post covered how to scrape website data with Python by extracting page elements and exporting it to Excel. Today let's look at how to crawl data by fetching JSON and save it into a MySQL database.

This post covers three topics:

1. Getting the site's API endpoint with a packet-capture tool

2. Parsing JSON data with Python

3. Connecting to the database from Python and writing the data into it

Packet capture isn't the focus of this post; you can search for "fiddler 手机抓包" (Fiddler mobile packet capture) to learn how it works. By the way, the Jianshu post on that topic also lists the APIs of several sites, so you can grab endpoints directly from there.

OK, straight to the point. First, let's see how Python fetches and parses the JSON.

Fetching the JSON data:

```python
import urllib.request

def getHtmlData(url):
    # Build the request with a mobile User-Agent so the API serves the app's JSON
    headers = {
        'User-Agent': 'Mozilla/5.0 (Linux; Android 4.1.1; Nexus 7 Build/JRO03D) '
                      'AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166 Safari/535.19'}
    request = urllib.request.Request(url, headers=headers)
    response = urllib.request.urlopen(request)
    data = response.read()
    # Decode the raw bytes as UTF-8
    data = data.decode('utf-8')
    return data
```
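Before sending anything over the network, you can sanity-check that the header is actually attached to the request. One quirk worth knowing: `urllib` stores header names capitalized internally, so the lookup key is `'User-agent'`. A minimal sketch (no network call is made):

```python
import urllib.request

# Build the same kind of request as getHtmlData, but don't send it
headers = {'User-Agent': 'Mozilla/5.0 (Linux; Android 4.1.1; Nexus 7 Build/JRO03D) '
                         'AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166 Safari/535.19'}
request = urllib.request.Request('https://api.dongqiudi.com/app/tabs/android/1.json',
                                 headers=headers)

# urllib capitalizes header names internally, hence 'User-agent'
print(request.get_header('User-agent')[:11])  # Mozilla/5.0
```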

Parsing the JSON:

Before parsing, let's first look at the JSON we got back (there is a lot of data; some entries with identical structure have been omitted):

```json
{
  "id": 1,
  "label": "头条",
  "prev": "https://api.dongqiudi.com/app/tabs/android/1.json?before=1658116800",
  "next": "https://api.dongqiudi.com/app/tabs/android/1.json?after=1500443152&page=2",
  "max": 1658116800,
  "min": 1500443152,
  "page": 1,
  "articles": [
    {
      "id": 375248,
      "title": "还记得他们吗?那些年,我们也有自己的留洋军团",
      "share_title": "还记得他们吗?那些年,我们也有自己的留洋军团",
      "description": "",
      "comments_total": 1026,
      "share": "https://www.dongqiudi.com/article/375248",
      "thumb": "http://img1.dongqiudi.com/fastdfs1/M00/97/55/180x135/crop/-/pIYBAFlkjm-AMc7AAAL4n-oihZs769.jpg",
      "top": true,
      "top_color": "#4782c4",
      "url": "https://api.dongqiudi.com/article/375248.html?from=tab_1",
      "url1": "https://api.dongqiudi.com/article/375248.html?from=tab_1",
      "scheme": "dongqiudi:///news/375248",
      "is_video": false,
      "new_video_detail": null,
      "collection_type": null,
      "add_to_tab": "0",
      "show_comments": true,
      "published_at": "2022-07-18 12:00:00",
      "sort_timestamp": 1658116800,
      "channel": "article",
      "label": "深度",
      "label_color": "#4782c4"
    },
    {
      "id": 382644,
      "title": "连续三年英超主场负于水晶宫,今晚克洛普的扑克牌怎么打呢?",
      "share_title": "连续三年英超主场负于水晶宫,今晚克洛普的扑克牌怎么打呢?",
      "comments_total": 0,
      "share": "https://www.dongqiudi.com/article/382644",
      "thumb": "",
      "top": false,
      "top_color": "",
      "url": "https://api.dongqiudi.com/article/382644.html?from=tab_1",
      "url1": "https://api.dongqiudi.com/article/382644.html?from=tab_1",
      "scheme": null,
      "is_video": true,
      "new_video_detail": "1",
      "collection_type": null,
      "add_to_tab": null,
      "show_comments": true,
      "published_at": "2017-07-19 14:55:25",
      "sort_timestamp": 1500447325,
      "channel": "video"
    },
    {
      "id": 382599,
      "title": "梦想不会褪色!慈善机构圆孟买贫民区女孩儿的足球梦",
      "share_title": "梦想不会褪色!慈善机构圆孟买贫民区女孩儿的足球梦",
      "comments_total": 9,
      "share": "https://www.dongqiudi.com/article/382599",
      "thumb": "http://img1.dongqiudi.com/fastdfs1/M00/9C/D3/180x135/crop/-/o4YBAFlu8F2AcFtwAACX_DJbrwo612.jpg",
      "top": false,
      "top_color": "",
      "url": "https://api.dongqiudi.com/article/382599.html?from=tab_1",
      "url1": "https://api.dongqiudi.com/article/382599.html?from=tab_1",
      "scheme": null,
      "is_video": true,
      "new_video_detail": "1",
      "collection_type": null,
      "add_to_tab": null,
      "show_comments": true,
      "published_at": "2017-07-19 14:45:20",
      "sort_timestamp": 1500446720,
      "channel": "video"
    }
  ],
  "hotwords": "JJ同学",
  "ad": [],
  "quora": [
    {
      "id": 182,
      "type": "ask",
      "title": "足坛历史上有哪些有名的更衣室故事?",
      "ico": "",
      "thumb": "http://img1.dongqiudi.com/fastdfs1/M00/9B/BE/pIYBAFlt3uyACqEnAADhb9FVavU28.jpeg",
      "answer_total": 222,
      "scheme": "dongqiudi:///ask/182",
      "position": 7,
      "sort_timestamp": 1500533674,
      "published_at": "2017-07-20 14:54:34"
    }
  ]
}
```


Good. Now let's pull the data out of the articles array; this step alone shows why Python is so popular:

First import the package for parsing JSON:

```python
import json
```

Then parse:

```python
dataList = json.loads(data)['articles']
```

You read that right: this single step extracts the entire articles array.
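To see this in miniature, here is the same call applied to a trimmed-down, hypothetical sample of the response shown earlier:

```python
import json

# A hypothetical, trimmed-down sample of the tab JSON shown above
data = '''{"id": 1, "label": "头条",
           "articles": [{"id": 375248, "title": "还记得他们吗?"},
                        {"id": 382644, "title": "连续三年英超主场负于水晶宫"}]}'''

# json.loads turns the string into a dict; ['articles'] picks out the array
dataList = json.loads(data)['articles']
print(len(dataList))       # 2
print(dataList[0]['id'])   # 375248
```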

Next, take each object out of articles and collect its fields into a Python list, to be written into the database later:

```python
for index in range(len(dataList)):
    newsObj = dataList[index]
    # print(newsObj.get('title'))
    # first_label is the tab label, parsed from the top-level JSON (see insertNews below)
    newsObjs = [newsObj.get('id'), newsObj.get('title'), newsObj.get('share_title'), newsObj.get('description'),
                newsObj.get('comments_total'), newsObj.get('share'), newsObj.get('thumb'), newsObj.get('top'),
                newsObj.get('top_color'), newsObj.get('url'), newsObj.get('url1'), newsObj.get('scheme'),
                newsObj.get('is_video'), newsObj.get('new_video_detail'), newsObj.get('collection_type'),
                newsObj.get('add_to_tab'), newsObj.get('show_comments'), newsObj.get('published_at'),
                newsObj.get('channel'), str(first_label), newsObj.get('comments_total')]
```
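Note the use of `.get()` rather than indexing: in the sample JSON above, the second article has no `"description"` key at all. `.get()` returns `None` for a missing key instead of raising a `KeyError`, so the row can still be built (and MySQL receives NULL). A minimal illustration with a hypothetical article dict:

```python
# Hypothetical article dict missing the "description" key,
# like the second article in the sample JSON
newsObj = {"id": 382644, "title": "..."}

print(newsObj.get('description'))       # None
print(newsObj.get('description', ''))   # falls back to an empty string

try:
    newsObj['description']              # plain indexing would crash the loop
except KeyError:
    print('indexing raises KeyError')
```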

That's it for parsing the JSON. Next, connecting to the database:

```python
# Execute a SQL statement
def executeSql(sql, values):
    conn = pymysql.connect(host=str(etAddress.get()), port=int(etPort.get()), user=str(etName.get()),
                           passwd=str(etPassWd.get()), db=str(etDBName.get()))
    cursor = conn.cursor()
    conn.set_charset('utf8')
    effect_row = cursor.execute(sql, values)
    # Commit, otherwise inserts/updates are not saved
    conn.commit()
    # Close the cursor
    cursor.close()
    # Close the connection
    conn.close()
```

Looks familiar, doesn't it? Connecting to a database in Python is indeed much like in Java and similar languages: open a connection with the MySQL host, port, username, password and database name, run statements through a cursor, and finally close both the cursor and the connection. (The `et*.get()` calls read the connection settings from input fields in the small GUI mentioned at the end. Python needs the pymysql package here; install it with pip, then import it.) Writing the SQL statements is also similar to Java. The whole flow looks like this:
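One thing worth checking when building a parameterized statement like the one below: the number of `%s` placeholders must equal the number of values passed to `cursor.execute()`, including the extra one consumed by the `ON DUPLICATE KEY UPDATE` clause. A quick sketch of that check, using a shortened, hypothetical column list:

```python
# Shortened, hypothetical version of the insert statement:
# 3 placeholders in VALUES plus 1 in the ON DUPLICATE KEY UPDATE clause
sql = ("insert into news(id,title,comments_total) values(%s,%s,%s) "
       "ON DUPLICATE KEY UPDATE comments_total = %s")
values = [375248, "还记得他们吗?", 1026, 1026]

# pymysql substitutes each %s from `values`, so the counts must match
assert sql.count('%s') == len(values)
print(sql.count('%s'))  # 4
```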

```python
# Insert the news records
def insertNews(data):
    if len(data) > 2:
        dataList = json.loads(data)['articles']
        first_label = json.loads(data)['label']
        for index in range(len(dataList)):
            newsObj = dataList[index]
            # print(newsObj.get('title'))
            newsObjs = [newsObj.get('id'), newsObj.get('title'), newsObj.get('share_title'), newsObj.get('description'),
                        newsObj.get('comments_total'), newsObj.get('share'), newsObj.get('thumb'), newsObj.get('top'),
                        newsObj.get('top_color'), newsObj.get('url'), newsObj.get('url1'), newsObj.get('scheme'),
                        newsObj.get('is_video'), newsObj.get('new_video_detail'), newsObj.get('collection_type'),
                        newsObj.get('add_to_tab'), newsObj.get('show_comments'), newsObj.get('published_at'),
                        newsObj.get('channel'), str(first_label), newsObj.get('comments_total')]
            sql = "insert into news(id,title,share_title,description,comments_total," \
                  "share,thumb,top,top_color,url,url1,scheme,is_video,new_video_detail," \
                  "collection_type,add_to_tab,show_comments,published_at,channel,label)" \
                  "values(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s) " \
                  "ON DUPLICATE KEY UPDATE comments_total = %s"
            executeSql(sql=sql, values=newsObjs)

# Execute a SQL statement
def executeSql(sql, values):
    conn = pymysql.connect(host=str(etAddress.get()), port=int(etPort.get()), user=str(etName.get()),
                           passwd=str(etPassWd.get()), db=str(etDBName.get()))
    cursor = conn.cursor()
    conn.set_charset('utf8')
    effect_row = cursor.execute(sql, values)
    # Commit, otherwise inserts/updates are not saved
    conn.commit()
    # Close the cursor
    cursor.close()
    # Close the connection
    conn.close()
```

Finally, in main, call:

```python
data = getHtmlData(url)
insertNews(data=data)
```

and the data ends up in the database:
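If you want to see the whole flow without a live API or database, you can dry-run it with the sample JSON and a stubbed `executeSql` that just records what would be written. This is a simplified sketch with a hypothetical three-column table, not the full statement from the code above:

```python
import json

executed = []  # records (sql, values) pairs instead of hitting MySQL

def executeSql(sql, values):
    executed.append((sql, values))

def insertNews(data):
    parsed = json.loads(data)
    first_label = parsed['label']
    for newsObj in parsed['articles']:
        values = [newsObj.get('id'), newsObj.get('title'), str(first_label)]
        executeSql("insert into news(id,title,label) values(%s,%s,%s)", values)

sample = '{"label": "头条", "articles": [{"id": 375248, "title": "a"}, {"id": 382644, "title": "b"}]}'
insertNews(sample)
print(len(executed))       # 2
print(executed[0][1][0])   # 375248
```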

Of course, you can also build a small GUI on top of it for fun:


If there's interest, I'll upload the demo as a starting point!


