Pitfalls encountered with Scrapy

scrapy: 'dict' object has no attribute 'decode'

0. Installing Scrapy on Windows

  1. Install wheel: type pip install wheel in a command prompt and the install completes automatically.
  2. Install lxml: go to https://www.lfd.uci.edu/~gohlke/pythonlibs/, scroll down to lxml and download the .whl file matching your OS and Python version (cp27, cp35 and so on stand for Python 2.7, 3.5; win32 means 32-bit Windows, win_amd64 means 64-bit). After downloading, right-click the file, open Properties > Security > Object name and copy the full path, then go back to the command prompt and run pip install followed by the pasted path.
  3. Install pyOpenSSL: go to https://pypi.python.org/pypi/pyOpenSSL#downloads, scroll down, download the .whl file, and install it the same way as lxml.
  4. Install Twisted: go to https://www.lfd.uci.edu/~gohlke/pythonlibs/#Twisted, scroll down to Twisted, download the .whl matching your OS and Python version, and install it the same way as lxml.
  5. Install Pywin32: go to https://sourceforge.net/projects/pywin32/files/pywin32/Build 220/, download the installer matching your OS and Python version, and double-click it. It locates the Python directory by itself, so there is nothing to configure; just keep clicking Next.
  6. Install Scrapy: once steps 1-5 are done this part is easy; type pip install scrapy in the command prompt and you are finished. (A sample command sequence is shown below.)
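
A hedged example of what the full sequence might look like (the .whl file names are illustrative; substitute the files that match your own Python version and OS):

  pip install wheel
  pip install C:\Downloads\lxml-4.2.5-cp36-cp36m-win_amd64.whl
  pip install C:\Downloads\pyOpenSSL-17.5.0-py2.py3-none-any.whl
  pip install C:\Downloads\Twisted-17.9.0-cp36-cp36m-win_amd64.whl
  pip install scrapy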

 

1. No module named win32api

pip install pypiwin32

 

2. Files under the folder cannot be found, or No module named 'scrapy.pipelines', or No module named xxx.items?

The scrapy project sits in a subdirectory of the PyCharm project, so PyCharm cannot resolve items. My fix: right-click the scrapy project folder and choose Mark Directory as > Sources Root; once the folder changes color, the imports resolve.
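
Once the folder is marked as Sources Root, an absolute import of the items module resolves in both PyCharm and at runtime. A minimal sketch, assuming the spider_first project used elsewhere in this post (the item class name is hypothetical):

# hypothetical item class; use the class actually defined in your items.py
from spider_first.items import SpiderFirstItem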

 

3. No module named PIL

pip install pillow

 

4. Downloading images locally and extracting the local save path

1) In settings.py, uncomment ITEM_PIPELINES and add the images pipeline:

    ITEM_PIPELINES = {
        'spider_first.pipelines.SpiderFirstPipeline': 300,
        'scrapy.pipelines.images.ImagesPipeline': 5,   # the number is the priority; pipelines run from smallest to largest
    }

2) Also in settings.py add (note that os must be imported):

    import os

    IMAGES_URLS_FIELD = "image_url"   # image_url is the field in items.py holding the crawled image URLs
    # build the local save directory
    project_dir = os.path.abspath(os.path.dirname(__file__))   # absolute path of the crawler project
    IMAGES_STORE = os.path.join(project_dir, 'images')          # where downloaded images are stored

Extracting the saved path back into the item is sketched below.
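
The settings above only download the images. To capture where each image was saved, one option (a minimal sketch, not from the original post; it reuses the image_url and img_path fields that appear in the item elsewhere in this post) is to subclass ImagesPipeline and read the path Scrapy reports in item_completed:

    # pipelines.py: capture the local save path after the download finishes
    from scrapy.pipelines.images import ImagesPipeline

    class ArticleImagePipeline(ImagesPipeline):
        def item_completed(self, results, item, info):
            # results is a list of (success, file_info) tuples; for a successful
            # download, file_info["path"] is the path relative to IMAGES_STORE
            if "image_url" in item:
                for ok, value in results:
                    if ok:
                        item["img_path"] = value["path"]
            return item

Register this class in ITEM_PIPELINES in place of (or in addition to) scrapy.pipelines.images.ImagesPipeline.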

 

5. Installing the MySQL module for Python

Windows:

pip install mysqlclient

Ubuntu (install the client headers first, then run pip install mysqlclient):

sudo apt-get install libmysqlclient-dev

CentOS (likewise, then run pip install mysqlclient):

sudo yum install python-devel mysql-devel
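
A quick way to confirm the driver is usable (a minimal sketch; host, credentials and database name are placeholders):

    # sanity check: import the driver and open/close a connection
    import MySQLdb

    conn = MySQLdb.connect(host="127.0.0.1", user="root", passwd="password",
                           db="test", charset="utf8")
    conn.close()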

 

 

6. IndentationError: unindent does not match any outer indentation level

This error usually means tabs and spaces are mixed in the same file. Don't start a method definition with spaces when the rest of the file is indented with tabs; keep the indentation characters consistent throughout the file (and let the editor convert one to the other).

 

7. Connection-pool pipeline code

from twisted.enterprise import adbapi
import MySQLdb.cursors


class MysqlTwistedPipline(object):
    # insert into MySQL asynchronously through a twisted connection pool
    def __init__(self, dbpool):
        self.dbpool = dbpool

    @classmethod
    def from_settings(cls, settings):
        dbparms = dict(
            host=settings["MYSQL_HOST"],
            db=settings["MYSQL_DBNAME"],
            user=settings["MYSQL_USER"],
            passwd=settings["MYSQL_PASSWORD"],
            charset=settings["MYSQL_CHARSET"],
            cursorclass=MySQLdb.cursors.DictCursor,
            use_unicode=settings["MYSQL_USE_UNICODE"],
        )
        dbpool = adbapi.ConnectionPool("MySQLdb", **dbparms)
        return cls(dbpool)

    def process_item(self, item, spider):
        # make the MySQL insert asynchronous via twisted
        query = self.dbpool.runInteraction(self.do_insert, item)
        query.addErrback(self.handle_error, item, spider)  # handle exceptions
        return item

    def handle_error(self, failure, item, spider):
        # report errors from the asynchronous insert
        print(failure)

    def do_insert(self, cursor, item):
        insert_sql = """
            insert into jobbole(post_url_id,post_url,re_selector,img_url,img_path,zan,shoucang,pinglun,zhengwen,riqi,fenlei)
            VALUES(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)
        """
        cursor.execute(insert_sql, (
            item["post_url_id"], item["post_url"], item["re_selector"], item["img_url"][0], item["img_path"],
            item["zan"], item["shoucang"], item["pinglun"], item["zhengwen"], item["riqi"], item["fenlei"]))
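
For this pipeline to run, it has to be registered and the MYSQL_* settings it reads must exist. A minimal sketch of the matching settings.py entries (the values and the spider_first.pipelines module path are placeholders; adjust them to your project):

    # settings.py
    MYSQL_HOST = "127.0.0.1"
    MYSQL_DBNAME = "article_spider"
    MYSQL_USER = "root"
    MYSQL_PASSWORD = "password"
    MYSQL_CHARSET = "utf8"
    MYSQL_USE_UNICODE = True

    ITEM_PIPELINES = {
        'spider_first.pipelines.MysqlTwistedPipline': 1,
    }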

 

8. Dealing with captchas when crawling

  1. If a site serves a captcha and, like me, you have not gotten deep into machine learning, the only way is to handle it manually: download the captcha image, then type in the coordinates or the captcha text by hand. The real goal is just to complete the login, get inside the enemy camp and absorb its nutrients.
  2. For a captcha URL like https://www.zhihu.com/captcha.gif?r=1514042860066&type=login&lang=cn, remember to fetch it through the same session used for login (the headers are easy to find on GitHub, so I will not paste them); a fuller sketch follows after this list:

     session = requests.session()
     response = session.get("https://www.zhihu.com/", headers=header)

  3. The pitfall I hit when downloading the image: if the URL opens directly as an image in the browser, write response.content, not response.text.encode():

     with open(file_name, 'wb') as f:
         f.write(response.content)

     (the with block already closes the file, so a separate f.close() is not needed)
  4. Some sites return Unicode-escaped JSON that is unpleasant to read. The fix:

     print(response.text.encode('latin-1').decode('unicode_escape'))
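
Putting those pieces together, a minimal sketch of downloading the captcha through the login session (the header dict and the r timestamp parameter are assumptions on my part; the endpoint is the one quoted above and may no longer be live):

    import time
    import requests

    header = {"User-Agent": "Mozilla/5.0"}   # placeholder; use the full headers from your login code

    session = requests.session()
    session.get("https://www.zhihu.com/", headers=header)   # pick up cookies first

    captcha_url = "https://www.zhihu.com/captcha.gif?r=%d&type=login&lang=cn" % int(time.time() * 1000)
    response = session.get(captcha_url, headers=header)

    with open("captcha.gif", "wb") as f:
        f.write(response.content)   # binary image data, so use .content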

 

9. Resolving primary-key conflicts with ON DUPLICATE KEY UPDATE (MySQL only)

    insert into zhihu_question
    (zhihu_id,topics,url,title,content,creat_time,update_time,answer_num,comments_num,watch_user_num,click_num,crawl_time,crawl_update_time)
    VALUES(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)
    ON DUPLICATE KEY UPDATE comments_num=VALUES(comments_num),watch_user_num=VALUES(watch_user_num),click_num=VALUES(click_num)

 

10. Scrapy items are powerful: by letting the pipeline call back into the item, the insert logic can be controlled dynamically

items.py

import datetime

import scrapy

# NOTE: SQL_DATETIME_FORMAT and get_nums are project-specific helpers used
# below; import them from wherever they are defined in your project.


class ZhihuQuestionItem(scrapy.Item):
    zhihu_id = scrapy.Field()
    topics = scrapy.Field()
    url = scrapy.Field()
    title = scrapy.Field()
    content = scrapy.Field()
    creat_time = scrapy.Field()
    update_time = scrapy.Field()
    answer_num = scrapy.Field()
    comments_num = scrapy.Field()
    watch_user_num = scrapy.Field()
    click_num = scrapy.Field()

    def get_insert_sql(self):
        insert_sql = """
                      insert into zhihu_question
                      (zhihu_id,topics,url,title,content,creat_time,update_time,answer_num,comments_num,watch_user_num,click_num,crawl_time,crawl_update_time)
                       VALUES(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)
                       ON DUPLICATE KEY UPDATE comments_num=VALUES(comments_num),watch_user_num=VALUES(watch_user_num),click_num=VALUES(click_num)
                  """
        zhihu_id = self["zhihu_id"][0]
        topics = ",".join(self["topics"])
        url = "".join(self["url"])
        title = "".join(self["title"])
        content = "".join(self["content"])
        creat_time =  datetime.datetime.now().strftime(SQL_DATETIME_FORMAT)
        update_time =  datetime.datetime.now().strftime(SQL_DATETIME_FORMAT)
        answer_num = self["answer_num"][0]
        comments_num = get_nums(self["comments_num"][0])
        watch_user_num =  self["watch_user_num"][0]
        click_num = self["watch_user_num"][1]
        crawl_time = datetime.datetime.now().strftime(SQL_DATETIME_FORMAT)
        crawl_update_time =datetime.datetime.now().strftime(SQL_DATETIME_FORMAT)
        params = (zhihu_id,topics,url,title,content,creat_time,update_time,answer_num,comments_num,watch_user_num,click_num,crawl_time,crawl_update_time)
        return insert_sql,params

 pipelines.py 

from twisted.enterprise import adbapi
import MySQLdb.cursors


class MysqlTwistedZhihuPipline(object):
    # asynchronous MySQL insert through a twisted connection pool;
    # works with any item that implements get_insert_sql()
    def __init__(self, dbpool):
        self.dbpool = dbpool

    @classmethod
    def from_settings(cls, settings):
        dbparms = dict(
            host=settings["MYSQL_HOST"],
            db=settings["MYSQL_DBNAME"],
            user=settings["MYSQL_USER"],
            passwd=settings["MYSQL_PASSWORD"],
            charset=settings["MYSQL_CHARSET"],
            cursorclass=MySQLdb.cursors.DictCursor,
            use_unicode=settings["MYSQL_USE_UNICODE"],
        )
        dbpool = adbapi.ConnectionPool("MySQLdb", **dbparms)
        return cls(dbpool)

    def process_item(self, item, spider):
        # make the MySQL insert asynchronous via twisted
        query = self.dbpool.runInteraction(self.do_insert, item)
        query.addErrback(self.handle_error, item, spider)  # handle exceptions
        return item

    def handle_error(self, failure, item, spider):
        # report errors from the asynchronous insert
        print(failure)

    def do_insert(self, cursor, item):
        # the item builds its own SQL, so this pipeline stays generic
        insert_sql, params = item.get_insert_sql()
        cursor.execute(insert_sql, params)

 

 

11. 'dict' object has no attribute 'has_key': Python 3 removed the has_key() method

Change

    if adict.has_key(key1):

to

    if key1 in adict:

 

12. Crawler error: Max retries exceeded with url

Don't scatter direct requests.post calls across the page code; use one shared requests.session(), set keep_alive to False, and close the session when the run is finished, as in the sketch below.

    s = requests.session()
    s.keep_alive = False
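
A minimal sketch of that advice (urls, payload and header are placeholders; note that keep_alive is not an official requests attribute, just the workaround quoted above):

    import requests

    urls = ["https://httpbin.org/post"]       # placeholder request list
    payload = {"key": "value"}                # placeholder form data
    header = {"User-Agent": "Mozilla/5.0"}    # placeholder headers

    s = requests.session()
    s.keep_alive = False        # workaround from the note above
    try:
        for url in urls:
            resp = s.post(url, data=payload, headers=header)
    finally:
        s.close()               # release pooled connections when the run is finished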

 

13. The smoothest way to install Python 3 on an Aliyun CentOS server (compiling with make is a deep pit; use yum instead)

    sudo yum install epel-release
    sudo yum install python34
    wget --no-check-certificate https://bootstrap.pypa.io/get-pip.py
    python3 get-pip.py
    pip3 -V

 
