
Need VIP to get a document? Here's how to solve it with Python (source code included)

How do you download a site's VIP-only files?


As a fresh graduate hunting for a job, you can't do without a résumé, yet most résumé templates these days require VIP membership or a fee to download. As a new graduate and a Python learner, I had a small (admittedly immature) idea: use Python to crawl the documents straight from Baidu Wenku. The script below fetches a document page, reads the document type out of the page source, then saves text documents as .txt files and grabs other types page by page as images.


import requests
import re
import json
import os

session = requests.session()

def fetch_url(url):
    # Wenku pages and their JSON endpoints are served gbk-encoded
    return session.get(url).content.decode('gbk')

def get_doc_id(url):
    # the document id sits between "view/" and ".html" in the URL
    return re.findall('view/(.*).html', url)[0]

def parse_type(content):
    # pull the docType field out of the inline page config
    return re.findall(r"docType.*?\:.*?\'(.*?)\'\,", content)[0]

def parse_title(content):
    return re.findall(r"title.*?\:.*?\'(.*?)\'\,", content)[0]

def parse_doc(content):
    result = ''
    # collect the escaped JSON page URLs embedded in the page source
    url_list = re.findall('(https.*?0.json.*?)\\\\x22}', content)
    url_list = [addr.replace("\\\\\\/", "/") for addr in url_list]
    for url in url_list[:-5]:
        content = fetch_url(url)
        y = 0
        # each fragment carries its text ("c") and vertical position ("y");
        # a change in y marks the start of a new line in the original layout
        txtlists = re.findall('"c":"(.*?)".*?"y":(.*?),', content)
        for item in txtlists:
            if y != item[1]:
                y = item[1]
                n = '\n'
            else:
                n = ''
            result += n
            # turn \uXXXX escape sequences back into real characters
            result += item[0].encode('utf-8').decode('unicode_escape', 'ignore')
    return result

def parse_txt(doc_id):
    content_url = 'https://wenku.baidu.com/api/doc/getdocinfo?callback=cb&doc_id=' + doc_id
    content = fetch_url(content_url)
    # the md5sum field comes back with a leading "&" parameter prefix,
    # so it is concatenated straight into the query string below
    md5 = re.findall('"md5sum":"(.*?)"', content)[0]
    pn = re.findall('"totalPageNum":"(.*?)"', content)[0]
    rsign = re.findall('"rsign":"(.*?)"', content)[0]
    content_url = 'https://wkretype.bdimg.com/retype/text/' + doc_id + '?rn=' + pn + '&type=txt' + md5 + '&rsign=' + rsign
    content = json.loads(fetch_url(content_url))
    result = ''
    for item in content:
        for i in item['parags']:
            result += i['c'].replace('\\r', '\r').replace('\\n', '\n')
    return result

def parse_other(doc_id):
    # ppt and other non-text types are served as page images
    content_url = "https://wenku.baidu.com/browse/getbcsurl?doc_id=" + doc_id + "&pn=1&rn=99999&type=ppt"
    content = fetch_url(content_url)
    url_list = re.findall('{"zoom":"(.*?)","page"', content)
    url_list = [item.replace("\\", '') for item in url_list]
    if not os.path.exists(doc_id):
        os.mkdir(doc_id)
    for index, url in enumerate(url_list):
        content = session.get(url).content
        path = os.path.join(doc_id, str(index) + '.jpg')
        with open(path, 'wb') as f:
            f.write(content)
    print("Images saved in folder " + doc_id)

def save_file(filename, content):
    with open(filename, 'w', encoding='utf8') as f:
        f.write(content)
    print('Saved as: ' + filename)

def main():
    url = input('Enter the Wenku document URL to download: ')
    content = fetch_url(url)
    doc_id = get_doc_id(url)
    doc_type = parse_type(content)  # renamed from "type" to avoid shadowing the built-in
    title = parse_title(content)
    if doc_type == 'doc':
        result = parse_doc(content)
        save_file(title + '.txt', result)
    elif doc_type == 'txt':
        result = parse_txt(doc_id)
        save_file(title + '.txt', result)
    else:
        parse_other(doc_id)

if __name__ == "__main__":
    main()
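The trickiest step in parse_doc is how it reconstructs line breaks: each text fragment arrives with a vertical coordinate "y", and a newline is emitted only when that coordinate changes between fragments. Here is a minimal sketch of that idea in isolation, using made-up fragments rather than real Wenku data:

# made-up (text, y) fragments in the shape parse_doc extracts
fragments = [('Hello', '100'), (' world', '100'), ('Next line', '130')]

result, y = '', 0
for text, frag_y in fragments:
    if y != frag_y:    # y changed: this fragment starts a new visual line
        y = frag_y
        result += '\n'
    result += text
print(repr(result))    # '\nHello world\nNext line'

This also explains why the script's output starts with a blank line: the first fragment never matches the initial y of 0, so a newline is always emitted before any text.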

Conclusion

Thanks for reading; comments and private messages are welcome.
