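The previous article outlined the overall architecture of crawling with Scrapy; this one expands on a few technical problems.
We again use the Shenzhen real estate information system as the example.
I used to write ASP.NET, where many controls are built by drag and drop, so much of the code is generated automatically rather than written by hand. The "next page" action on this site is driven by such generated JavaScript, and Scrapy does not execute JavaScript. We do know, however, that the link calls the __doPostBack function, so let's look at how __doPostBack is defined.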
- function __doPostBack(eventTarget, eventArgument) {
-     if (!theForm.onsubmit || (theForm.onsubmit() != false)) {
-         theForm.__EVENTTARGET.value = eventTarget;
-         theForm.__EVENTARGUMENT.value = eventArgument;
-         theForm.submit();
-     }
- }
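Because the function only copies its two arguments into hidden form fields and submits the form, a crawler needs nothing more than those two values. The sketch below pulls them out of a pager link with a regular expression. The href shape (javascript:__doPostBack('AspNetPager1','2')) is an assumption based on this page, and parse_postback is a hypothetical helper name, not part of Scrapy.

```python
import re

# Assumed href shape for the pager link; the real page may differ.
# The pattern accepts one or two leading underscores.
POSTBACK_RE = re.compile(r"_doPostBack\('([^']*)'\s*,\s*'([^']*)'\)")


def parse_postback(href):
    """Return (event_target, event_argument) from a postback href, or None."""
    match = POSTBACK_RE.search(href or "")
    if match is None:
        return None
    return match.group(1), match.group(2)


print(parse_postback("javascript:__doPostBack('AspNetPager1','2')"))
# → ('AspNetPager1', '2')
```

Returning None for a missing or non-matching href gives the caller an explicit signal that there is no further page to request.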
Now let's see how Form1 is defined:
- <form name="Form1" method="post" action="index.aspx?javascript%3a_doPostBack('AspNetPager1'%2c'2')" id="Form1">
- <input type="hidden" name="__EVENTTARGET" id="__EVENTTARGET" value="">
- <input type="hidden" name="__EVENTARGUMENT" id="__EVENTARGUMENT" value="">
- <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="dghA4AVaavTQEp48h3UsU0pxBkQIXyBt5k9xMnYc7lAPiW4jhB6DeTz+bLHqNiicT340W3E8oWRZP/u/oso9HuTTmRjM7qcpd2VbxKKgRY9CT0ZM3xlJEZLNtaTdBldZpfmozLsjBqdp/jzVFyHqrtybakbNR4CK2KXFmDJkIkynac0a5GIVwD0w2VYDmh40llArcntW86hbqVAUcLnh7aybgU5zdn0uuBbpxHJ/e8POBnkadJvDPV/zThfxPqsZs2wP9+NJL//WDUQQ6/exRnA3mdSsrOfeT1O6Tpl7Z0blgxzhBpOthPwgPdYBx8nLqeEoObijxvjFqQm8F8YaaElSJYJbsa8VRhd2dreBqDXItev6MlKHFy/TYA8CizrBIozDGzvVvZsBLNVxLvi8kVoYH7FvWF5Bf8WJ40ADdXwR6DSdbFFi8EdaQe2AVkZKo5pmNFSmKAJQgE6CnN3itTkSld5SHI3CXHlYluq6FONEnBdRnUH73OjURAWdcrAmK9gGicsfygadkCOG4QS5WVZ+9GEj0uFpzbk0G3nqtali/aEuZWmMz6Kz6bgMhlKzGGNzZs3RPHk5CtoZXlpnUCCVkqvxsg70xawRoqf/J//ETdGSwEtzUeDu7MS9k9fGkKJkIpgBqEdektdEu4ZOH1XaYjNk4/wX6KLASaBpd45JVaA4nShBtvinzME2vBtp4dJNlx9e0evmmmGaZ6g6rr84rgASw/QR3BR5BS5ND0QZ0lhhWLP521XcfTbtg8eWDDb4bFrvqkI5dYsBhtg4QjkQ+3DAyzWqO1nmpalAuzkLemqkSzWmBVrbx1pbhkLF/bR2dBnGdNWJPNKiKId2F+y8+QorPEkK7dykHUVw1wos5CQXYRsbXHJ7WMiP6Gm5Zz+llsbrfKz4pVJxxLjpKQbvq+pNseUV1prTxOOxPDhfl52c7ZNJNzsvjTpKu5Z7WiLUEVRK/71AAawaIMfzyaKKs+qCcn0l0b8F9FcHtHWgJgpNtmgB7Sv86b+cfF412X3puqMzzGdFYaQwDdT3M9AEOn7rTLbwAjTPA4mSmtfT7qyVnve4/MEZ15xjnvgUx36F6UnwSa0WKdmpNbdpTxquJE2/y8jLQtbJcHv0P0eCMl6NAJmZWI0s6wa8iowWRC2nuEWnTlK/RYpwwCfaiyh0iHKEKnCmWaeLMLsuXPO6k9w/V6xvtL/ndjE4q9YF0RV3oB5+2Af+4GCOWdELqgjH+/sf3uQNDlCOUmNlFDMZ4/7WXKGGwpbDK/KlSHyyRhY6RUB/ZvBSQ/xcYwUB5o2i8WUhcU6ZHMS21Tz5qkeHkp1nO0TMmpcyd/ehfZF7Mv8zI2A0u3kNZut7DRaARHiXXeZg25TeZrQIGe8TudSAV3k+vdg/ISEa8imP3tVQMV1OqHOGshgzpvrxY5mnSFJOeHxIeBFxCDIyuLKVbNQ19VPZLGqxgJSNQHdD0o6pETQor/IJhwVMhCXcV5JYmfx9vE+3SZy26GFxRck47ExVAgzSq4zFqkRoD8Q1jW5roRlwrxg4g46Wr1AbuLeSWmXTq2addliIVbI3W7oyGqQy9ox5ranMcupogFCQVgl8VshD7l9AqW7F8FvOVRHM9I4haQBm3sQz0GciNzPeoPEavJsMLJdn4zHTaZGhaMWCj6ADIXrfdpWFQ9AWzmAJ/FQdcrXDh2wohvydjPBB+uhv2cUY++wx/yp1ylfa9DA/Dlrbi3We9wJ74BiBB2uNe0rUjwzGGZ418u5yHdCrLme57Af75aXiQkhLjeXgRcE1N+kQ5aJhqv6+TFsG0cW/Rwu3cUW9cdOUjyXDmPk//GCizaGAoHlu2AzSlLpKtWC+GdUe6WtEmpkPlxsC6kecYSbx0cZBVBhdNNcG16b6mpg+WnIJROMJAfv+VIyRmMochjwssPL+0uXh/U
qQlasRv8u4POVJWls/OKASqxfUVae2k79/6SYFb82buISfV5AZlw8YnVDCNQ7bPubKUnDyMzcN9fxjZbnTFOdVaQBAzjd2DB/OTAR+/tIDvWEFK+YpNaiyFl+HrJFfWGQ+9+ryPGSwQJKa4uVsMrsF2/jQa8i2c0ce+S7ya9YX6P4PRhEkyswvLua51VkFJlnkLerbN/C7yLuvtAy5gXAhSUsbSuSYXDP+RJZIiiefWMKPQdAjIIXwt2YiHNgoMHy+2L2TROXkTR2pxx+m7s4n71NiMOU1rVsEh2vLq7JocQKNjWPj2ZEFZyOQ7FnfHl98LBKrCGT6Y2DFz9xAZ/VGzeJrVVE2vHu+qLmf9l3exE1sGFI9NlmNjIWGhYJMbQWbEBfqgvjXtPQ8W6RSrGm1sBuPc+vGOsDRwNXC4sPSVRvG5yXLEExMNdvdSCQaX5hUPXewDDXZi0COpaEofhP4esF4UQL4p+P+16QV9J37cI0rNgAHNmG6124VxnGFFO6vvZeAXW5Zo9ymyDCvM7fHoM3sLsAwTnvzgv2mGrtKGlAd8v+QGWkP1BkH6i9Lu3mZ3bi0aq1e1dts3ASWzwGJPP8eED1wlinCNAcZ+q9WiGujKADo4dhPbTBBEc4mn/yCdQlZ5/v2Dw27a6C8qnb3dSmFSKLrso0h2zPXgJSWTsz2ZH3MjTi+wFk+GOU7EdmJwxNYSqOFapOJfKZjd8QNEVmR27PEY1bigK0UgLWmBNdZ8xhiBwzQIP/WCvqVsS2fPQj3OvEHfB44c9hGLggKt3u5ybcGmcp2NCSDQQao3Xu2e4q2jKpRw2BlluCJoX7sIDZ9fpRZbQ9Mb5Ik0FSuxve2317J4R9nHQuvKXcacom8/vciaCoBlDBPeOmdNxa6JW7AyWjX4rBB6kDj+Jrj2+oo3BubmNtf4yl3HovYev+FZ6GhBOzze0EMJBIufW8AZ6yUHwOi4VeGg2qCXC9YtVjZ7PiZjwk+tv+t3BxWvJ6XvmyOfVt3FX+ufHueKTxY/HnCburCuU66A6I4rpZp1DOOuak5XY7pS9BLaHFu1KoZLuLUnynNP8pPK9dXtMMlbMzR2p+Xn53C8Jfjq+rMKq9zn/386cAEwVlhdQ1fFBkhJ5BK0whGaGAscoq7NCWkJn+SmyZTES8HgIea+QQnDCPIYt+ie/SWdZx+BrOqGnCMdhivhRgdi+3fdli2mBUfaioyCeU4YUwGX76Rfdp+OqNr8jqryrGJVl5z08YMqtNbKPHTPVxFAqhgda/c6iBOiXZhkS7TA6d17NfVQ1Oc8db1oEEaq5Kb2y7Y20/tAMYo5jnwgcO7ceSpkdbMQY1O6OL9tMb3OKl0T9J1F7FaB2iRQN7qZW9KMWbT4Dd2h2gIlQgpNxAplxJgmOxysLMwS4nzWHeSH6nG4Bt2Z4S5nOrF+pVCpLaVBpnezDha6l6Sw3AmncG+Wl6GFwe86EDRyYkxuvDX0rN+K6J5J+L81XHvusq+NhoN7duYdlIFuhMoIkHSmUKWJxwDBCd9Npf0XNpkC0Etbsj3GNvE3GrsQqtE5HJfwfp8Vc6ndGWjZ27/X19Xy6XC4augblHJZPf3A9jFL1pt+Uw0XnmBvf2L/Z0amKeoD/dCRLp7QMfY2W15U5b63FzVJ9VtVl5tZcfUn0duiH/Mu9QGFKfLELXtnZp72rfeau0rjd7mQVpxkOLfuKMeYIekZJwz24CaA+5ariA3RTxj/ei900JfT168vn53wr3Wl3U5K5PJeDmMOF2fbAq4pUTgcUqNK9zwVGmFxefkF+NjtP+s7vBqqePLY7Ak6gzUbfptrzR3VUmZBCp3yNQkD1znBRk/JtI7RPhgNUbBZHDNBWGjNF3h/UDa86wWuiChVk7A9lm7Wh0X1zTnOA+98j9NJtPzPqHNaP8SvnGCYFZp7av5ORCCa/gP1UwmhZ642pMwD37qOkeldvVJ2Q6FcUmPWNIBdk/n3XexbbVXv+6YxHizWw
8Lczw5WGPCw1nxnR6qa8eNdXG9+1FQqFkj4vg4rdfPdFEJzELjZivAJrp2LMS9CMm4ENdsv5iH2wHZWkQA7qRgcoPEtOoRpbI4OJFXX7qN+jGMGljosu8Ouc5hRo7iwWcqvuUtQ7QJjB9K4AcmcLM3JZUuehHWtSU+VFuzWhiApt5CNab/+QoyRKitAX/2nbQ0CVdkBULeTUtdlInjLN83LZphMU5XIlCZkAl8MJkQrWUKn+wQECjbg/QynXKrhYqqQ1PBdSMKNUcGjdtwJyhuXJKV1uWjJqvhkuQPYNN9WYykR+9vtUu8EsFUzuhNIn0bKiLtsBpmsby22Z8NJAXzOSnm+ofCrUzXY8cvZquK40ZV9IDN/DA/0n7a6uhxwxIDAAM9kJiRlWZTHTVMUln/xqWl2ZSO2rqycIwX6mtO/g19KRSiagfpo2QspLvhstpxDbeYWme6+bzSGNgGNQe5nc1HqD0zAqEOgOJq7AHP7Aut4HPlf/1VyhPJkyVH1XMmNBK/ys6qwEyBG3a6eicXwuA+ulH7sWoLC2TQ2RwbWIvQoAbKMrrqVKucDJAAccMKZKFHwmF1zbHQl6YOu6Js0cujCMn2v4PuOcoP5HfZTwP1sik1TVQ8ffhS+hO9XLruD4shuiPhNyEEGc2AiCoa/bm6HCupa+XKcq+MUV6J69G0abtGRmhsCaV0cTPjcPop73VpinFtHwKsxqiTAnFU44K5MbNOueH7Fg5iS6SaY/qBbFS1T8A0eg0eiKGZO/C7whMyUJufNnxeGcC6558Kn78Fdrl0xQxGwMUSrlxPKtdaEoWnSbbGmGjVE3MFJZwMtZ5t9xWA6TxmYpeGxgK2Kz9FO2VUqcu1EzOPKkqyXMqXEM02U8jz/QNbaR2w7f6O64ivT5PilQOi+fgUJUvMt0TAva94W3BWZEb6BPQxDcX3OvOR1KkNsChxOfLFVxrUlltx5sSPNhP/8eTBkNp+UJ0QPEhsjKyktgguazWGEk9nc5n7K24cXOIF+CVRW8O5UWwuyMl7P0S8z8MvtnHLtk0jfGFYpIHaOTW8v/4G5ohPH8JYETBAXTaaq6zIYCriq3/KRuRFZ1sJ0MRX43ryax7H51St07rQeDerdoiSqEnHnMxqKxMow8Bo37Cis/rtnyPyqsdaCIxKMnrEi0o/ak//I1sVmNovicpeAQguv9rs8Jqq/OC3hNa0ePqV87TjoICDBezoqjd0qGOkAZ8tCb63pklF+QlyEol25zWoRSwmm7ngLIV/DeMp6aVPj+5NCQdz0uyL50eT2BnEf0fx4kKKvqjMI736r3g82ap+VlCRpzyGwE9xD12JdAKI45PpKvG+mbLSbQZYsskfhWT9C0WU5u4NVhC2ETPqdGdzZGhY41J49O4MSIrA2k8XaJ4nb4jaOoT6KD1HFKS2p0N/HNvNNw0N+taMFXgtT7Gjaz20Yim5A0Ltqypt2dvHm/NEXrf3Or5N0HEvUgvroOHwq432ZLdssutaw/883bmGwla1fKmzI+w04jVt+fnw2XfEYBrZNPE5ckYah4Eo+qUdji6/wNaR5vFQNrfPDOBoME7TZEEYUdBsm83KAIvdHtpHp+eMJ2+Xx+j/oqOIrCdrMliQL5xjZ4nymztxBL/g3WFlv4jXiugNa8rItrSWh6x3thsdNY7XE1a4UPxvUsuih7CKi/ahD+6jvzKn8Wvm7uw2QOyupcstQBwwg2WeM+heJFEgkmVCzeWnc608mGfUjKPISs+f5ck1uwjh4ciNhu7rjG7RKDURAMHFrV99893NEM+V5eOItuvjgrzEWQiESlJiPQ4MhsWZGIZD72UWvNzricvkAcrf3SYjvVd+AKBFpEHyCR4r+VvuP6OlRZqbTHHG45J+ZB92BA6eO2LgY41uVe2exi7gWvjP0FlXoFa6EDozxNZrj+jrxJ9K6w4r8VWZ/i3HnDHjNykDG/v9E2uR5ydnM0QSxuFPqDOrDZhFdmPgLjj1xY7o6s
LUxXU9zdY1W4in+2Xa2pNrHgzsFdoKfM6p2TNtCp2HdLhijUGkm5NYjepBHpvyRGSVYNffwX8F/F3XcsXk5y4KOvaGeSi2Dy2AJUKtgbiy5tFE9inzZcOh+OBFlYFAwA5EEIQU6gTSYZ8zaZ0Ybw+7s9EUooqam9Dp4rv13AhAsA5WODEPHt25eXhrl7/8bGUggMojGU8sApfi6L/lfbsqZAh1hI6QHgLfwHLRXgnDxfylHgCs3FZXYWY58iOscfxt0gK1mU9EHySaQFNRzw8vUQ+jT/wfFIrEafPsMREPc1/eEHbKAdZVG92G/MDelrrj">
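The form contains many hidden controls whose values are long and generated automatically; I am not sure exactly what they all mean.
What we do know is how it works: the page sends a POST request back to itself to change pages, which is why the URL does not change. Since a POST is involved, what parameters does it carry? It occurred to me that, rather than studying how the JavaScript assembles the request, it is easier to capture the traffic and see what is actually sent.
Here I use Wireshark, a tool I reach for often; it is simple and shows things clearly.
After clicking "next page", we can see that an HTTP request is sent using the POST method.
Expanding the request shows every parameter listed one by one. The parameters sent are exactly the hidden fields on the page.
Now back to the pagination code I wrote, which queries recursively: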
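The hidden fields can be collected mechanically without understanding their contents. The following is a minimal sketch using only the standard library's html.parser. HiddenFieldCollector and collect_hidden_fields are hypothetical names, and the sample markup is a shortened stand-in for the real form.

```python
from html.parser import HTMLParser


class HiddenFieldCollector(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            # A missing or empty value attribute is stored as an empty string.
            self.fields[attr["name"]] = attr.get("value") or ""


def collect_hidden_fields(html):
    collector = HiddenFieldCollector()
    collector.feed(html)
    return collector.fields


# Shortened stand-in for the real page; the actual __VIEWSTATE is far longer.
sample = '''
<form name="Form1" method="post" action="index.aspx" id="Form1">
<input type="hidden" name="__EVENTTARGET" id="__EVENTTARGET" value="">
<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="abc123">
<input type="text" name="tep_name" value="">
</form>
'''
print(collect_hidden_fields(sample))
# → {'__EVENTTARGET': '', '__VIEWSTATE': 'abc123'}
```

Inside a Scrapy callback the same values would normally be read with response.xpath, as the spider code in this article does.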
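What the capture shows can be reproduced by hand: take the hidden fields, overwrite the two event fields, and URL-encode the result. This is a sketch with the standard library only. build_postback_body is a hypothetical helper and the __VIEWSTATE value is a made-up short string.

```python
from urllib.parse import parse_qs, urlencode


def build_postback_body(hidden_fields, event_target, event_argument):
    """Return a form-encoded POST body that imitates a __doPostBack submit."""
    data = dict(hidden_fields)  # copy so the caller's dict is left untouched
    data["__EVENTTARGET"] = event_target
    data["__EVENTARGUMENT"] = event_argument
    return urlencode(data)


# Made-up value; a real __VIEWSTATE is a long base64 string.
hidden = {"__VIEWSTATE": "dghA4AVa+bLHq/u=", "__EVENTTARGET": "", "__EVENTARGUMENT": ""}
body = build_postback_body(hidden, "AspNetPager1", "2")
print(body)
print(parse_qs(body)["__EVENTARGUMENT"])
```

The base64 characters +, / and = have to be percent-encoded in the body. Scrapy's FormRequest URL-encodes formdata itself, so the spider never builds this string by hand. Scrapy also provides FormRequest.from_response, which pre-fills form fields from the response; whether it suits this particular page was not tested here.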
- def parse(self, response):
-     # The project link sits in the third cell of each list row.
-     context = response.xpath('//tr[@bgcolor="#F5F9FC"]/td[3]')
-     dbhelp = RishomePipeline()
-     for item in context:
-         title = item.xpath('a/text()').extract_first()
-         idstr = item.xpath('a/@href').extract_first()
-         idstr = idstr[idstr.find('=') + 1:]
-         # Project already stored: stop here (this also skips the remaining rows and pagination).
-         if dbhelp.ispropertyexits(idstr):
-             return
-         request = scrapy.Request(url='http://ris.szpl.gov.cn/bol/projectdetail.aspx?id=' + idstr, method='GET', callback=self.showdetailpage)
-         yield request
-     # Pagination: assemble the post_data dict and send it as a POST with
-     # scrapy.FormRequest, using parse as the callback again.
-     next_page = response.xpath('//*[@id="AspNetPager1"]/div[2]/a[3]/@href')
-     href = next_page.extract_first()
-     # On the last page the "next page" button is no longer a link and has no href,
-     # so pagination (the recursion) ends here.
-     if href is None or ',' not in href:
-         return
-     pnum = href.split(',')[1].replace("'", "").replace(")", "")
-     post_data = {
-         "__EVENTTARGET": "AspNetPager1",
-         "__EVENTARGUMENT": pnum,
-         "__VIEWSTATEENCRYPTED": "",
-         "tep_name": "",
-         "organ_name": "",
-         "site_address": "",
-         "AspNetPager1_input": "1"}
-
-     # Copy the generated hidden fields from the current page; default to '' if one is absent.
-     a = response.xpath('//*[@id="__VIEWSTATE"]/@value')
-     post_data['__VIEWSTATE'] = a.extract_first(default='')
-     b = response.xpath('//*[@id="__VIEWSTATEGENERATOR"]/@value')
-     post_data['__VIEWSTATEGENERATOR'] = b.extract_first(default='')
-     c = response.xpath('//*[@id="__EVENTVALIDATION"]/@value')
-     post_data['__EVENTVALIDATION'] = c.extract_first(default='')
-
-     if pnum:
-         # dont_filter=True because the URL is the same for every page.
-         yield scrapy.FormRequest(url=response.url, formdata=post_data, callback=self.parse, dont_filter=True)
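Sometimes the content of two pages has to be merged into a single item. In that case, when yielding the scrapy.Request, you need to pass some parameters along to the callback for the next page. It can be done like this: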
- request=scrapy.Request(houseurl,method='GET',callback=self.showhousedetail)
- request.meta['biid']=biid
- yield request
-
-
- def showhousedetail(self, response):
-     house = HouseItem()
-     house['bulidingid'] = response.meta['biid']
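The callback for each page builds items and hands them to the pipelines for processing, but a pipeline has only one entry point, the def process_item(self, item, spider) function. Here is a way to tell the item types apart: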
- def process_item(self, item, spider):
-     if str(type(item)) == "<class 'rishome.items.RishomeItem'>":
-         self.saverishome(item)
-     if str(type(item)) == "<class 'rishome.items.BulidingItem'>":
-         self.savebuliding(item)
-     if str(type(item)) == "<class 'rishome.items.HouseItem'>":
-         self.savehouse(item)
-     return item  # process_item must return the item
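Comparing the string form of the type works, but it breaks silently if the module path changes. An alternative is to dispatch with isinstance. The sketch below is not the original project's code: the three item classes are plain dict stand-ins for the real scrapy.Item subclasses in rishome.items, and the save methods only record calls instead of writing to a database.

```python
class RishomeItem(dict):
    """Stand-in for rishome.items.RishomeItem."""


class BulidingItem(dict):
    """Stand-in for rishome.items.BulidingItem."""


class HouseItem(dict):
    """Stand-in for rishome.items.HouseItem."""


class RishomePipeline:
    def __init__(self):
        self.saved = []  # records calls; the real pipeline would hit the database
        # Ordered (item class, handler) pairs checked by process_item.
        self.handlers = [
            (RishomeItem, self.saverishome),
            (BulidingItem, self.savebuliding),
            (HouseItem, self.savehouse),
        ]

    def saverishome(self, item):
        self.saved.append(("rishome", item))

    def savebuliding(self, item):
        self.saved.append(("buliding", item))

    def savehouse(self, item):
        self.saved.append(("house", item))

    def process_item(self, item, spider):
        for item_class, handler in self.handlers:
            if isinstance(item, item_class):
                handler(item)
                break
        return item  # always hand the item on to any later pipeline


pipeline = RishomePipeline()
house = HouseItem(bulidingid="42")
pipeline.process_item(house, spider=None)
print(pipeline.saved)
```

In the real project the stand-in classes would be replaced by an import of the item classes from rishome.items.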