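This article is for learning purposes only!
Do not use it for anything illegal or against regulations; you alone are responsible for what you do with it.
Selenium is one of the most widely used open-source suites for automated Web UI (user interface) testing.
Integrating it with Java essentially means using Java code to call a browser driver, which then simulates the actions a person would perform by hand.
Selenium supports several browsers; this article uses Google Chrome as the example.
The Selenium driver can be downloaded from two places; pick either one.
① First check your browser version: enter chrome://settings/ in the address bar and open "About Chrome".
② Open either of the URLs below and download the matching version. (If no version matches exactly, pick the highest build within the same major version. In this article the browser is 104.0.5112.102, so the candidates are the versions starting with 104, and the best choice is the highest build among them.)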
http://chromedriver.storage.googleapis.com/index.html
http://npm.taobao.org/mirrors/chromedriver/
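The version-matching rule in step ② can be sketched in plain Java. This is only an illustration and not part of Selenium: the class `DriverVersionPicker`, its methods, and the sample list of available versions are made up for the example (only 104.0.5112.102 comes from the article). It prefers an exact match and otherwise takes the highest build that shares the browser's major version.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical helper (not a Selenium API): picks a chromedriver version for a Chrome version.
// Assumes purely numeric dotted versions such as "104.0.5112.102".
final class DriverVersionPicker {

    // Splits "104.0.5112.102" into {104, 0, 5112, 102}.
    static int[] parse(String version) {
        return Arrays.stream(version.trim().split("\\."))
                .mapToInt(Integer::parseInt)
                .toArray();
    }

    // Compares two dotted versions numerically, part by part.
    static int compare(String a, String b) {
        int[] x = parse(a);
        int[] y = parse(b);
        for (int i = 0; i < Math.max(x.length, y.length); i++) {
            int xi = i < x.length ? x[i] : 0;
            int yi = i < y.length ? y[i] : 0;
            if (xi != yi) {
                return Integer.compare(xi, yi);
            }
        }
        return 0;
    }

    // Exact match if present; otherwise the highest version with the same major number.
    static Optional<String> pick(String browserVersion, List<String> available) {
        if (available.contains(browserVersion)) {
            return Optional.of(browserVersion);
        }
        int major = parse(browserVersion)[0];
        return available.stream()
                .filter(v -> parse(v)[0] == major)
                .max(DriverVersionPicker::compare);
    }

    public static void main(String[] args) {
        // Sample driver list, invented for this demo
        List<String> available = List.of(
                "103.0.5060.134", "104.0.5112.20", "104.0.5112.79", "105.0.5195.19");
        System.out.println(pick("104.0.5112.102", available).orElse("no matching driver"));
    }
}
```

With the invented list above, the browser version 104.0.5112.102 has no exact match, so the sketch falls back to the highest 104 build in the list.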
package com.mengkeng.selenium_demo.test;

import org.openqa.selenium.By;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;

import java.util.concurrent.TimeUnit;

public class BaiduDemo {
    public static void main(String[] args) throws Exception {
        // D://chromedriver.exe is an example; use the path where you actually saved the driver
        System.setProperty("webdriver.chrome.driver", "D://chromedriver.exe");
        ChromeOptions chromeOptions = new ChromeOptions();
        ChromeDriver driver = new ChromeDriver(chromeOptions);
        try {
            // Maximize the window
            driver.manage().window().maximize();
            // Implicit wait: element lookups keep retrying for up to 10 seconds
            driver.manage().timeouts().implicitlyWait(10, TimeUnit.SECONDS);
            Thread.sleep(1000);
            // Open the Baidu home page
            driver.get("https://www.baidu.com/");
            // Locate the search box
            WebElement text = driver.findElement(By.id("kw"));
            // Locate the search button
            WebElement button = driver.findElement(By.id("su"));
            text.sendKeys("123");
            button.click();
        } finally {
            // Leave the result on screen for a while, then shut the browser down
            sleep(10000);
            driver.quit();
        }
    }

    public static void sleep(int millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }
}
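With just a few lines of code we opened a web page and searched for '123'. Next come the commonly used APIs. There is no need to memorize them; understand them and look them up when you need them.
// Remember to change this to the actual location of your driver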
System.setProperty("webdriver.chrome.driver", "D://chromedriver.exe");
WebDriver driver = new ChromeDriver();
driver.get("https://www.baidu.com/");
Note: if several elements on the page share the same attribute value, use an XPath locator to pinpoint the one you want.
driver.findElement(By.id("pnum"));
driver.findElement(By.name("name"));
driver.findElement(By.className("pgo"));
driver.findElement(By.linkText("link"));
driver.findElement(By.xpath("//div[@id='1']/div/div/h3/a[1]"));
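Method | Description |
---|---|
sendKeys() | Simulates typing the given text |
clear() | Clears the entered text |
getText() | Gets the element's text |
getAttribute() | Gets the value of the given attribute |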
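OK, once you have this part down you can write a simple crawler. If you are interested, try the following exercise:
Requirement:
Log in to QQ Mail and open the inbox page
The implementation is shown below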
package com.mengkeng.selenium_demo.test;

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;

import java.util.Objects;

public class QQEmaIlLoginDemo {
    public static void main(String[] args) throws InterruptedException {
        // Path of the driver to use; replace it with your own
        System.setProperty("webdriver.chrome.driver", "D://chromedriver.exe");
        ChromeDriver driver = new ChromeDriver();
        driver.manage().window().maximize();
        try {
            Thread.sleep(1000);
            driver.get("https://mail.qq.com/");
            // The login form sits inside an iframe, so switch into it first
            driver.switchTo().frame("login_frame");
            WebElement username = driver.findElement(By.id("u"));
            WebElement password = driver.findElement(By.id("p"));
            username.sendKeys("xxxxxx@qq.com");
            password.sendKeys("xxxxxx");
            WebElement submit = driver.findElement(By.id("login_button"));
            submit.click();
            Thread.sleep(1000);
            // Switch back to the top-level document
            driver.switchTo().defaultContent();
            WebElement folder = validElement("//a[@id='folder_1']", driver);
            if (Objects.nonNull(folder)) {
                folder.click();
            } else {
                System.out.println("Failed to open the inbox");
            }
        } finally {
            Thread.sleep(10000);
            // quit() closes every window and ends the driver process
            driver.quit();
        }
    }

    // Returns the element matching the XPath, or null if it does not exist
    public static WebElement validElement(String str, WebDriver driver) {
        try {
            return driver.findElement(By.xpath(str));
        } catch (Exception e) {
            System.out.println("Element not found: " + str);
        }
        return null;
    }
}
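The examples above are simple ones. What about mouse actions and jumping between several pages? No rush, that comes next.
Note: a chain of mouse actions must end with perform(); without it the actions are never executed.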
Method | Description |
---|---|
click() | Left click |
contextClick() | Right click |
doubleClick() | Double click |
dragAndDrop() | Drag and drop |
moveToElement() | Hover the mouse over an element |
perform() | Executes all actions queued in the Actions object |
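When clicking a page element makes the browser open a new window, you have to switch to that new window.
driver.switchTo().window(frontHandle); // frontHandle is a window handle (a String); obtain it with driver.getWindowHandle() and keep it for later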
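One way to find the newly opened window is to compare the handles before and after the click. The sketch below is plain Java that works on the `Set<String>` returned by `driver.getWindowHandles()`; the class `WindowHandles` and the method `findNewHandle` are hypothetical names, and the handle strings in `main` are invented. In a real script the result would be passed to `driver.switchTo().window(...)`.

```java
import java.util.LinkedHashSet;
import java.util.Optional;
import java.util.Set;

// Hypothetical helper: operates on the handle sets returned by driver.getWindowHandles().
final class WindowHandles {

    // Returns a handle that exists after the click but did not exist before it.
    static Optional<String> findNewHandle(Set<String> before, Set<String> after) {
        Set<String> added = new LinkedHashSet<>(after);
        added.removeAll(before);
        return added.stream().findFirst();
    }

    public static void main(String[] args) {
        // Invented values; real handles are opaque strings produced by the driver
        Set<String> before = Set.of("window-A");
        Set<String> after = new LinkedHashSet<>();
        after.add("window-A");
        after.add("window-B");
        System.out.println(findNewHandle(before, after).orElse("no new window"));
    }
}
```

Capturing the handle set right before the click keeps the comparison local to one step, so no shared list of handles has to be maintained across the crawl.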
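Simulate scrolling the page
driver.executeScript("window.scrollTo(0,300)");
When a page element cannot be clicked (blocked by anti-crawler measures)
driver.executeScript("arguments[0].click();", element); // element is the button or other element to click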
ChromeOptions chromeOptions = new ChromeOptions();
chromeOptions.setPageLoadStrategy(PageLoadStrategy.EAGER); // eager loading: do not wait for images and stylesheets
chromeOptions.addArguments("--incognito"); // incognito (private) window
chromeOptions.addArguments("--blink-settings=imagesEnabled=false"); // do not load images
chromeOptions.addArguments("--headless"); // headless mode
chromeOptions.addArguments("--no-sandbox"); // disable the sandbox
chromeOptions.addArguments("--disable-gpu"); // disable GPU acceleration
chromeOptions.addArguments("--proxy-server=" + proxy); // use a proxy
// Set a random User-Agent; options only take effect if added before the driver is created
String[] arr = {
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Safari/605.1.15",
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.53 Safari/537.36 Edg/103.0.1264.37",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 9.50",
    "Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)"
};
Random random = new Random();
chromeOptions.addArguments("--user-agent=" + arr[random.nextInt(arr.length)]);
ChromeDriver driver = new ChromeDriver(chromeOptions);
// Global implicit wait
driver.manage().timeouts().implicitlyWait(10, TimeUnit.SECONDS);
// Maximize the window
driver.manage().window().maximize();
// Hide the navigator.webdriver flag that marks a Selenium-driven browser (affects the currently loaded page only)
String js1 = "Object.defineProperties(navigator, {webdriver:{get:()=>undefined}});";
driver.executeScript(js1);
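Choosing the user agent can be kept separate from the driver setup. The sketch below is plain Java; `UserAgentPicker` and `randomArgument` are hypothetical names, and the list is shortened to two of the strings used above. It builds a `--user-agent=` argument, which is Chrome's command-line switch for overriding the user agent, ready to be passed to `chromeOptions.addArguments(...)` before the driver is created.

```java
import java.util.List;
import java.util.Random;

// Hypothetical helper: builds the Chrome argument that overrides the User-Agent.
final class UserAgentPicker {

    // Shortened list; extend it with whatever user agents you want to rotate through
    static final List<String> USER_AGENTS = List.of(
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Safari/605.1.15");

    // Picks one entry at random and prefixes it with the Chrome switch.
    static String randomArgument(Random random) {
        // Using the list size avoids a hard-coded bound that goes stale when the list changes
        String userAgent = USER_AGENTS.get(random.nextInt(USER_AGENTS.size()));
        return "--user-agent=" + userAgent;
    }

    public static void main(String[] args) {
        System.out.println(randomArgument(new Random()));
    }
}
```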
When parsing list pages, each parsing task creates its own browser object
private void parsePagePre(SetOperations ops) {
    ThreadPoolExecutor pagepoolExecutor = new ThreadPoolExecutor(2, 8, 30L, TimeUnit.SECONDS, new LinkedBlockingQueue<>());
    HashOperations<String, Object, Object> opsForHash = redisTemplate.opsForHash();
    List<BuildAreaUrlLj> buildAreaUrlLjs = buildAreaUrlLjMapper.selectList(null);
    for (BuildAreaUrlLj buildAreaUrlLj : buildAreaUrlLjs) {
        // Each task creates, uses, and quits its own browser
        pagepoolExecutor.execute(() -> parsePage(ops, opsForHash, buildAreaUrlLj));
    }
}

private void parsePage(SetOperations ops, HashOperations<String, Object, Object> opsForHash, BuildAreaUrlLj buildAreaUrlLj) {
    ChromeDriver driver = getChromeDriver();
    try {
        driver.get(buildAreaUrlLj.getAreaUrl());
        // business logic
    } finally {
        driver.quit();
    }
}

private ChromeDriver getChromeDriver() {
    String nextProxy = proxyService.getNextProxy();
    String[] arr = {
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Safari/605.1.15",
        "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.53 Safari/537.36 Edg/103.0.1264.37",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 9.50",
        "Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)"
    };
    ChromeOptions chromeOptions = new ChromeOptions();
    chromeOptions.setPageLoadStrategy(PageLoadStrategy.EAGER);
    chromeOptions.addArguments("--incognito");
    chromeOptions.addArguments("--blink-settings=imagesEnabled=false");
    chromeOptions.addArguments("--headless");
    chromeOptions.addArguments("--no-sandbox");
    chromeOptions.addArguments("--disable-gpu");
    // Only use a proxy when one is available
    if (StringUtils.isNotBlank(nextProxy) && !nextProxy.equals("local")) {
        chromeOptions.addArguments("--proxy-server=" + nextProxy);
    }
    // Keep WebRTC from exposing the real IP behind the proxy
    HashMap<String, Object> map = new HashMap<>();
    map.put("webrtc.ip_handling_policy", "disable_non_proxied_udp");
    map.put("webrtc.multiple_routes_enabled", false);
    map.put("webrtc.nonproxied_udp_enabled", false);
    chromeOptions.setExperimentalOption("prefs", map);
    Random random = new Random();
    chromeOptions.addArguments("--user-agent=" + arr[random.nextInt(arr.length)]);
    ChromeDriver driver = new ChromeDriver(chromeOptions);
    driver.manage().window().maximize();
    return driver;
}
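One detail of the pool configuration is worth knowing. According to the `ThreadPoolExecutor` documentation, threads beyond the core size are only started when the work queue is full, so with an unbounded `LinkedBlockingQueue` a pool created with a core size of 2 and a maximum of 8 never grows past 2 threads, which means at most 2 browsers run at once. The sketch below is plain Java with no browser involved; `PoolDemo`, `runAll`, and the dummy task are illustrative assumptions. It runs dummy tasks and reports the highest number that ran at the same time.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative only: measures how many tasks a pool really runs in parallel.
final class PoolDemo {

    // Submits `tasks` dummy jobs, waits for them, and returns the peak concurrency observed.
    static int runAll(ThreadPoolExecutor pool, int tasks) {
        AtomicInteger running = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        for (int i = 0; i < tasks; i++) {
            pool.execute(() -> {
                int now = running.incrementAndGet();
                peak.accumulateAndGet(now, Math::max);
                try {
                    Thread.sleep(50); // stands in for "open a browser and parse a page"
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    running.decrementAndGet(); // in the crawler, driver.quit() belongs in this finally
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(30, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return peak.get();
    }

    public static void main(String[] args) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                2, 8, 30L, TimeUnit.SECONDS, new LinkedBlockingQueue<>());
        System.out.println("peak concurrency: " + runAll(pool, 20));
    }
}
```

If more parallel browsers are wanted, one option is to set the core size equal to the maximum size; a bounded queue is another, at the cost of having to handle rejected tasks.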
package com.mengkeng.selenium_demo.test;

import com.alibaba.fastjson.JSON;
import com.mengkeng.selenium_demo.config.RestTemplateConfig;
import com.mengkeng.selenium_demo.entity.TkBuildingsPriceAjk;
import lombok.extern.slf4j.Slf4j;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.data.redis.core.RedisTemplate;
import org.springframework.data.redis.core.SetOperations;
import org.springframework.http.*;
import org.springframework.util.CollectionUtils;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;
import org.springframework.web.client.RestTemplate;

import java.math.BigDecimal;
import java.util.*;
import java.util.concurrent.TimeUnit;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Date: 2022-07-10 13:50
 * Description:
 */
@RestController
@RequestMapping("fang")
@Slf4j
public class FangtianxiaDemo {
    @Autowired
    private RedisTemplate redisTemplate;

    private static LinkedList<String> pages = new LinkedList<>();

    /** Base URL of the price API */
    public static final String PRICE_URL = "https://pinggun.fang.com/RunChartNew/MakeChartData/";

    /** Redis key that records the pages already processed */
    public static final String SKIP_URLS = "SKIP_URLS";

    /** Success flag */
    public static String TEMP_FLAG = "fail";

    @RequestMapping("sync")
    public String sync() {
        // Retry with a fresh browser until one full pass completes
        while (!TEMP_FLAG.equals("success")) {
            System.setProperty("webdriver.chrome.driver", "D://chromedriver.exe");
            ChromeOptions chromeOptions = new ChromeOptions();
            chromeOptions.addArguments("--headless");
            chromeOptions.addArguments("--no-sandbox");
            chromeOptions.addArguments("--disable-gpu");
            chromeOptions.addArguments("--disable-dev-shm-usage");
            WebDriver driver = new ChromeDriver(chromeOptions);
            driver.manage().window().maximize();
            driver.manage().timeouts().implicitlyWait(10, TimeUnit.SECONDS);
            driver.get("https://esf.fang.com/housing/");
            sleep(2000);
            try {
                parseFTX(driver);
            } catch (Exception e) {
                log.error("Pass failed, retrying", e);
                sleep(10000);
            } finally {
                sleep(10000);
                driver.quit();
            }
        }
        return "ok";
    }

    /**
     * Parse Fangtianxia
     */
    private void parseFTX(WebDriver driver) {
        SetOperations ops = redisTemplate.opsForSet();
        List<WebElement> elements = driver.findElements(By.xpath("//div[@class='qxName']/a"));
        // Districts
        for (int i = 2; i <= elements.size() - 3; i++) {
            WebElement element = driver.findElement(By.xpath("//div[@class='qxName']/a[" + i + "]"));
            element.click();
            sleep(800);
            // Business areas
            List<WebElement> elementsShangquan = driver.findElements(By.xpath("//p[@id='shangQuancontain']/a"));
            for (int sq = 2; sq <= elementsShangquan.size(); sq++) {
                WebElement elementsq = driver.findElement(By.xpath("//p[@id='shangQuancontain']/a[" + sq + "]"));
                String tempHref = elementsq.getAttribute("href");
                // if (ops.isMember(SKIP_URLS, tempHref)) {
                //     System.out.println("Skipping link " + tempHref);
                //     continue;
                // }
                elementsq.click();
                parsePage(driver);
                ops.add(SKIP_URLS, tempHref);
                sleep(800);
            }
        }
        // One full pass finished normally
        TEMP_FLAG = "success";
    }

    /**
     * Parse the paginated list
     *
     * @param driver
     */
    private void parsePage(WebDriver driver) {
        // Pagination
        try {
            driver.findElement(By.className("txt")).getText();
        } catch (Exception e) {
            log.info("No data in this category, url is " + driver.getCurrentUrl());
            return;
        }
        // The page text is in Chinese, so the characters being stripped must stay as they are
        String pageTotal = driver.findElement(By.className("txt")).getText().replaceAll("共", "").replaceAll("页", "");
        for (int page = 0; page < Integer.parseInt(pageTotal); page++) {
            List<WebElement> houseList = driver.findElements(By.xpath("//div[@class='houseList']/div"));
            for (int i = 1; i < houseList.size(); i++) {
                String communityName = driver.findElement(By.xpath("//div[@class='houseList']/div[" + i + "]/dl/dd/p[1]/a[1]")).getText();
                String communityCode = driver.findElement(By.xpath("//div[@class='houseList']/div[" + i + "]/dl/dd/p[1]/a[2]")).getAttribute("projcode");
                String areaName = driver.findElement(By.xpath("//div[@class='houseList']/div[" + i + "]/dl/dd/p[2]/a[1]")).getText();
                // Jump to the detail page
                pages.addAll(driver.getWindowHandles());
                driver.findElement(By.xpath("//div[@class='houseList']/div[" + i + "]/dl/dd/p[1]/a[1]")).click();
                sleepAndCutoverNewPage(800, driver);
                parseDetail(communityCode, communityName, areaName);
                driver.close();
                driver.switchTo().window(pages.getLast());
                sleep(1000);
            }
            if (page + 1 == Integer.parseInt(pageTotal)) {
                break;
            }
            String pageNow = driver.findElement(By.xpath("//div[@id='houselist_B14_01']/a[last()-1]")).getAttribute("href");
            System.out.println("Next page is ------------" + pageNow + "----" + pageTotal);
            driver.findElement(By.xpath("//div[@id='houselist_B14_01']/a[last()-1]")).click();
            sleep(600);
        }
    }

    /**
     * Parse the detail data
     *
     * @param communityCode
     * @param communityName
     * @param areaName
     */
    public void parseDetail(String communityCode, String communityName, String areaName) {
        HashMap<String, Object> map = new HashMap<>();
        map.put("newcode", communityCode);
        map.put("city", cnToUnicode("北京"));
        map.put("district", cnToUnicode(areaName));
        HttpHeaders headers = new HttpHeaders();
        headers.setContentType(MediaType.APPLICATION_JSON_UTF8);
        HttpEntity<String> entity = new HttpEntity<>(JSON.toJSONString(map), headers);
        RestTemplate restTemplate;
        try {
            restTemplate = new RestTemplate(RestTemplateConfig.generateHttpRequestFactory());
        } catch (Exception e) {
            // Without a client there is nothing to request
            log.error("Failed to create RestTemplate", e);
            return;
        }
        ResponseEntity<String> stringResponseEntity = restTemplate.exchange(PRICE_URL, HttpMethod.POST, entity, String.class);
        Pattern compile = Pattern.compile(",(\\w+)]");
        Matcher matcher = compile.matcher(stringResponseEntity.getBody());
        Pattern compileMonth = Pattern.compile("年(\\w+)月");
        Matcher matcherMonth = compileMonth.matcher(stringResponseEntity.getBody());
        ArrayList<String> list = new ArrayList<>();
        while (matcherMonth.find()) {
            list.add(matcherMonth.group(1));
        }
        Pattern compileYear = Pattern.compile("&(\\w+)年");
        Matcher matcherYear = compileYear.matcher(stringResponseEntity.getBody());
        int year = 2020;
        while (matcherYear.find()) {
            year = Integer.parseInt(matcherYear.group(1));
        }
        ArrayList months = null;
        // Both a start month and an end month are needed
        if (list.size() >= 2) {
            months = getMonths(year, Integer.parseInt(list.get(0)), Integer.parseInt(list.get(1)));
        }
        while (matcher.find()) {
            TkBuildingsPriceAjk ajk = new TkBuildingsPriceAjk();
            ajk.setDataOrigin("fangtianxia");
            ajk.setCommunityCode(communityCode);
            ajk.setCommunity(communityName);
            ajk.setAvgPrice(new BigDecimal(matcher.group(1)));
            System.out.println("Persist =======================================" + ajk);
        }
    }

    private static void sleep(int millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }

    /**
     * Wait, then switch to the newly opened window
     *
     * @param millis
     * @param driver
     * @return
     */
    private static String sleepAndCutoverNewPage(int millis, WebDriver driver) {
        try {
            Thread.sleep(millis);
            for (String handle : driver.getWindowHandles()) {
                if (!pages.contains(handle)) {
                    driver.switchTo().window(handle);
                }
            }
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        return null;
    }

    /**
     * Convert a string to \\uXXXX Unicode escapes
     *
     * @param cn
     * @return
     */
    private static String cnToUnicode(String cn) {
        char[] chars = cn.toCharArray();
        StringBuilder returnStr = new StringBuilder();
        for (int i = 0; i < chars.length; i++) {
            returnStr.append("\\u").append(Integer.toString(chars[i], 16));
        }
        return returnStr.toString();
    }

    /**
     * Build a list of yyyyMM values; only covers the given year and the following year
     *
     * @param year  start year
     * @param start start month
     * @param end   end month
     * @return
     */
    private static ArrayList getMonths(int year, int start, int end) {
        ArrayList res = new ArrayList();
        for (int i = start; i <= (end == 12 ? 12 : end + 12); i++) {
            if (i > 12) {
                res.add((year + 1) + String.format("%02d", i - 12));
            } else {
                res.add(year + String.format("%02d", i));
            }
        }
        return res;
    }
}
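The page count in `parsePage` is read from text of the form `共N页` by deleting the two characters around the number, so any unexpected text makes `Integer.parseInt` throw. A slightly more defensive variant, sketched here in plain Java, uses a regular expression and falls back to 0 when the text does not match. `PageTotalParser` and `parse` are hypothetical names and the sample inputs are invented; the two Chinese characters are written as Unicode escapes so the file compiles regardless of source encoding.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts N from pagination text shaped like "<U+5171>N<U+9875>".
final class PageTotalParser {

    // \u5171 and \u9875 are the characters that surround the number in the page text
    private static final Pattern TOTAL = Pattern.compile("\u5171\\s*(\\d+)\\s*\u9875");

    // Returns the page count, or 0 if the text is null or does not match.
    static int parse(String text) {
        if (text == null) {
            return 0;
        }
        Matcher matcher = TOTAL.matcher(text);
        return matcher.find() ? Integer.parseInt(matcher.group(1)) : 0;
    }

    public static void main(String[] args) {
        System.out.println(parse("\u517112\u9875")); // well-formed pagination text
        System.out.println(parse("loading..."));     // unexpected text falls back to 0
    }
}
```

Returning 0 lets the caller skip the category, much as the existing code does when the pagination element is missing.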
package com.mengkeng.selenium_demo.test;

import com.alibaba.fastjson.JSON;
import com.mengkeng.selenium_demo.entity.BuildAreaUrlLj;
import com.mengkeng.selenium_demo.entity.IdAndNamePO;
import com.mengkeng.selenium_demo.entity.TkBuildingsAreaInfolj;
import com.mengkeng.selenium_demo.entity.TkBuildingsMonthPriceLj;
import com.mengkeng.selenium_demo.mapper.BuildAreaUrlLjMapper;
import com.mengkeng.selenium_demo.service.ProxyService;
import lombok.extern.slf4j.Slf4j;
import org.apache.commons.lang3.StringUtils;
import org.apache.commons.lang3.time.DateFormatUtils;
import org.openqa.selenium.By;
import org.openqa.selenium.PageLoadStrategy;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.data.redis.core.HashOperations;
import org.springframework.data.redis.core.SetOperations;
import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

import java.time.LocalDate;
import java.util.*;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Date: 2022-09-05 13:58
 * Description: residential communities
 */
@RestController
@RequestMapping("areaInfo")
@Slf4j
public class LianjiaAreaInfoDemo {
    @Autowired
    private StringRedisTemplate redisTemplate;
    @Autowired
    private BuildAreaUrlLjMapper buildAreaUrlLjMapper;
    @Autowired
    private ProxyService proxyService;

    public static final String SKIP_URLS = "SKIP_URLS_AREAINFO_LIANJIA";
    public static final String URLS = "URLS_AREAINFO_LIANJIA";
    public static final String AREA_INFO_COMMUNITY_CODE_LJ = "AREA_INFO_COMMUNITY_CODE_LJ";

    private static LinkedList<String> pages = new LinkedList<>();

    ThreadPoolExecutor pagepoolExecutor = new ThreadPoolExecutor(2, 10, 30L, TimeUnit.SECONDS, new LinkedBlockingQueue<>());

    @RequestMapping("sync")
    public void sync() throws InterruptedException {
        System.setProperty("webdriver.chrome.driver", "D://chromedriver.exe");
        boolean flag = false;
        // Retry with a fresh browser until one pass completes without an exception
        while (!flag) {
            try {
                ChromeDriver driver = getChromeDriver();
                SetOperations ops = redisTemplate.opsForSet();
                try {
                    getUrls(driver, ops);
                    parsePagePre(ops);
                } finally {
                    sleep(1000);
                    driver.quit();
                }
            } catch (Exception e) {
                Thread.sleep(10000);
                continue;
            }
            flag = true;
        }
        System.out.println("Done");
    }

    /**
     * Create a browser object
     *
     * @return
     */
    private ChromeDriver getChromeDriver() {
        String nextProxy = proxyService.getNextProxy();
        System.out.println("Current IP is " + nextProxy);
        String[] arr = {
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Safari/605.1.15",
            "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36",
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.53 Safari/537.36 Edg/103.0.1264.37",
            "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 9.50",
            "Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50",
            "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)"
        };
        ChromeOptions chromeOptions = new ChromeOptions();
        chromeOptions.setPageLoadStrategy(PageLoadStrategy.EAGER);
        chromeOptions.addArguments("--incognito");
        chromeOptions.addArguments("--blink-settings=imagesEnabled=false");
        chromeOptions.addArguments("--headless");
        chromeOptions.addArguments("--no-sandbox");
        chromeOptions.addArguments("--disable-gpu");
        if (StringUtils.isNotBlank(nextProxy) && !nextProxy.equals("local")) {
            chromeOptions.addArguments("--proxy-server=" + nextProxy);
        }
        // Keep WebRTC from exposing the real IP behind the proxy
        HashMap<String, Object> map = new HashMap<>();
        map.put("webrtc.ip_handling_policy", "disable_non_proxied_udp");
        map.put("webrtc.multiple_routes_enabled", false);
        map.put("webrtc.nonproxied_udp_enabled", false);
        chromeOptions.setExperimentalOption("prefs", map);
        Random random = new Random();
        chromeOptions.addArguments("--user-agent=" + arr[random.nextInt(arr.length)]);
        ChromeDriver driver = new ChromeDriver(chromeOptions);
        driver.manage().window().maximize();
        return driver;
    }

    private void parsePagePre(SetOperations ops) {
        HashOperations<String, Object, Object> opsForHash = redisTemplate.opsForHash();
        List<BuildAreaUrlLj> buildAreaUrlLjs = buildAreaUrlLjMapper.selectList(null);
        // Only entries 1..3499 are processed in this run; this requires at least 3500 rows
        List<BuildAreaUrlLj> buildAreaUrlLjs1 = buildAreaUrlLjs.subList(1, 3500);
        for (BuildAreaUrlLj buildAreaUrlLj : buildAreaUrlLjs1) {
            if (ops.isMember(SKIP_URLS, buildAreaUrlLj.getAreaUrl())) {
                System.out.println("Skipping area " + buildAreaUrlLj.getCityName() + "-" + buildAreaUrlLj.getCountyName());
                continue;
            }
            pagepoolExecutor.execute(() -> parsePage(ops, opsForHash, buildAreaUrlLj));
        }
    }

    /**
     * Parse the list pages
     *
     * @param ops
     * @param opsForHash
     * @param buildAreaUrlLj
     */
    private void parsePage(SetOperations ops, HashOperations<String, Object, Object> opsForHash, BuildAreaUrlLj buildAreaUrlLj) {
        ChromeDriver driver = getChromeDriver();
        try {
            driver.get(buildAreaUrlLj.getAreaUrl());
            String windowHandlePage = driver.getWindowHandle();
            WebElement totalNumStr = validElement("//h2[@class='total fl']/span", driver);
            if (null != totalNumStr) {
                Integer total = Integer.valueOf(totalNumStr.getText());
                // There is data
                if (total > 1) {
                    String pageData = driver.findElement(By.xpath("//div[@class='page-box house-lst-page-box']")).getAttribute("page-data");
                    Integer pageNumStr = Integer.valueOf(JSON.parseObject(pageData).getString("totalPage"));
                    System.out.println("Pages in this area: " + pageNumStr + "---" + buildAreaUrlLj.getAreaUrl());
                    for (int x = 1; x <= pageNumStr; x++) {
                        List<WebElement> elements = driver.findElements(By.xpath("//ul[@class='listContent']/li/div[1]/div[1]/a"));
                        for (int i = 0; i < elements.size(); i++) {
                            WebElement item = elements.get(i);
                            String code = "";
                            Pattern compile1 = Pattern.compile("xiaoqu/(\\w+)/");
                            Matcher matcher1 = compile1.matcher(item.getAttribute("href"));
                            while (matcher1.find()) {
                                code = matcher1.group(1);
                            }
                            driver.executeScript("arguments[0].click();", item);
                            sleepAndCutoverNewPage(300, driver);
                            // Skip the detail page if the code is already in Redis
                            if (!opsForHash.hasKey(AREA_INFO_COMMUNITY_CODE_LJ, code)) {
                                parseDetail(driver, code, buildAreaUrlLj, opsForHash);
                            } else {
                                System.out.println("Code already exists in Redis: " + code);
                                // update
                                // new TkBuildingsMonthPriceLj();
                            }
                            driver.close();
                            driver.switchTo().window(windowHandlePage);
                            sleep(200);
                            // Re-query the list; the old element references are stale after switching windows
                            elements = driver.findElements(By.xpath("//ul[@class='listContent']/li/div[1]/div[1]/a"));
                        }
                        if (x != pageNumStr) {
                            String nextPage = buildAreaUrlLj.getAreaUrl() + "pg" + (x + 1) + "/";
                            driver.get(nextPage);
                            System.out.println("Next page is " + nextPage);
                            sleep(200);
                        }
                    }
                }
            }
            ops.add(SKIP_URLS, buildAreaUrlLj.getAreaUrl());
        } catch (NumberFormatException e) {
            throw new RuntimeException("Exception in worker thread: " + e.getMessage());
        } finally {
            driver.quit();
        }
    }

    /**
     * Parse the detail page
     *
     * @param driver
     * @param communityCode
     * @param buildAreaUrlLj
     * @param opsForHash
     */
    private void parseDetail(ChromeDriver driver, String communityCode, BuildAreaUrlLj buildAreaUrlLj, HashOperations<String, Object, Object> opsForHash) {
        if (null != validElement("//span[@class='xiaoquUnitPrice']", driver)) {
            TkBuildingsMonthPriceLj lj = new TkBuildingsMonthPriceLj();
            lj.setCommunityCode(communityCode);
            String year = String.valueOf(LocalDate.now().getYear());
            // The labels below are matched against the Chinese page text and must stay as they are
            if (driver.findElement(By.className("xiaoquUnitPriceDesc")).getText().equals("挂牌均价")) {
                lj.setYearmonth(DateFormatUtils.format(new Date(), "yyyyMM"));
            } else {
                String monthStr = driver.findElement(By.className("xiaoquUnitPriceDesc")).getText().replace("月参考均价", "");
                String month = String.format("%02d", Integer.parseInt(monthStr));
                lj.setYearmonth(year + month);
            }
            lj.setAvgPrice(Integer.valueOf(driver.findElement(By.className("xiaoquUnitPrice")).getText()));
            lj.setGenerateType("0");
            lj.setCreateBy("1");
            lj.setCreateDate(new Date());
            lj.setUpdateBy("1");
            lj.setUpdateDate(new Date());
            lj.setDelFlag("0");
            System.out.println("Persist price " + lj);
        }
        TkBuildingsAreaInfolj infolj = new TkBuildingsAreaInfolj();
        infolj.setDataOrigin("lianjia");
        infolj.setGenerateType("0");
        infolj.setProvince(buildAreaUrlLj.getProvinceId());
        infolj.setCity(buildAreaUrlLj.getCityId());
        infolj.setArea(buildAreaUrlLj.getCountyId());
        infolj.setCommunity(validElement("//h1[@class='detailTitle']", driver) == null ? "" : driver.findElement(By.xpath("//h1[@class='detailTitle']")).getText());
        infolj.setCommunityCode(communityCode);
        infolj.setBuildingYear(infoValue(driver, "建筑年代"));
        infolj.setBuildingType(infoValue(driver, "建筑类型"));
        infolj.setManageCost(infoValue(driver, "物业费用"));
        infolj.setManageCompany(infoValue(driver, "物业公司"));
        infolj.setManageDevlop(infoValue(driver, "开发商"));
        infolj.setBuildingCount(infoValue(driver, "楼栋总数"));
        infolj.setRoomCount(infoValue(driver, "房屋总数"));
        infolj.setCreateBy("1");
        infolj.setCreateDate(new Date());
        infolj.setUpdateBy("1");
        infolj.setUpdateDate(new Date());
        infolj.setDelFlag("0");
        System.out.println("Persist community " + infolj);
    }

    /**
     * Read the value shown next to a label on the detail page, or "" if the label is absent
     */
    private static String infoValue(WebDriver driver, String label) {
        if (validElement("//span[text()='" + label + "']", driver) == null) {
            return "";
        }
        return driver.findElement(By.xpath("//span[text()='" + label + "']/parent::div/span[2]")).getText();
    }

    /**
     * Crawl the area links
     *
     * @param driver
     * @param ops
     */
    private void getUrls(ChromeDriver driver, SetOperations ops) {
        driver.get("https://www.lianjia.com/city/");
        int count = 0;
        List<WebElement> elements = driver.findElements(By.xpath("//ul[@class='city_list_ul']/li/div[2]/div/ul/li/a"));
        for (int i = 0; i < elements.size(); i++) {
            WebElement element = elements.get(i);
            String provinceName = element.findElement(By.xpath("./parent::li/parent::ul/parent::div/div")).getText();
            String areaName = element.getText();
            Boolean memberFlag = ops.isMember(URLS, areaName);
            if (memberFlag) {
                System.out.println("Area already processed, skipping " + areaName);
                continue;
            }
            driver.executeScript("arguments[0].click();", element);
            String frontPage = driver.getWindowHandle();
            WebElement ershoufang = null;
            try {
                // "小区" is the link text on the page
                ershoufang = driver.findElement(By.linkText("小区"));
            } catch (Exception e) {
                ops.add(URLS, areaName);
                sleep(200);
                System.out.println(areaName + " has no community link ====");
                driver.get("https://www.lianjia.com/city/");
                elements = driver.findElements(By.xpath("//ul[@class='city_list_ul']/li/div[2]/div/ul/li/a"));
                continue;
            }
            driver.executeScript("arguments[0].click();", ershoufang);
            sleepAndCutoverNewPage(500, driver);
            List<WebElement> citys = driver.findElements(By.xpath("//div[@data-role='ershoufang']/div[1]/a"));
            citys.forEach(e -> System.out.println("City level ============" + e.getText() + "==" + e.getAttribute("href")));
            for (int j = 0; j < citys.size(); j++) {
                String countyName = citys.get(j).getText();
                driver.executeScript("arguments[0].click();", citys.get(j));
                sleep(200);
                if (validElement("//h2[@class='total fl']/span", driver) != null) {
                    String text = driver.findElement(By.xpath("//h2[@class='total fl']/span")).getText();
                    count += Integer.parseInt(text);
                    System.out.println(countyName + ": " + text);
                    System.out.println("Running total is " + count);
                }
                List<WebElement> areas = null;
                try {
                    areas = driver.findElements(By.xpath("//div[@data-role='ershoufang']/div[2]/a"));
                } catch (Exception e) {
                    citys = driver.findElements(By.xpath("//div[@data-role='ershoufang']/div[1]/a"));
                    saveDataCity(countyName, areaName, provinceName, citys);
                    break;
                }
                if (areas.size() == 0) {
                    citys = driver.findElements(By.xpath("//div[@data-role='ershoufang']/div[1]/a"));
                    saveDataCity(countyName, areaName, provinceName, citys);
                    break;
                }
                saveDataCounty(countyName, areaName, provinceName, areas);
                sleep(100);
                citys = driver.findElements(By.xpath("//div[@data-role='ershoufang']/div[1]/a"));
            }
            ops.add(URLS, areaName);
            driver.close();
            driver.switchTo().window(frontPage);
            driver.get("https://www.lianjia.com/city/");
            sleep(200);
            elements = driver.findElements(By.xpath("//ul[@class='city_list_ul']/li/div[2]/div/ul/li/a"));
        }
        System.out.println("Total is " + count);
    }

    private void saveDataCounty(String countyName, String areaName, String provinceName, List<WebElement> list) {
        for (WebElement element : list) {
            String url = element.getAttribute("href");
            BuildAreaUrlLj buildAreaUrlLj = new BuildAreaUrlLj();
            IdAndNamePO provincepo = queryProvinceCityArea(1, provinceName, null);
            buildAreaUrlLj.setProvinceName(provincepo.getBusinessName());
            buildAreaUrlLj.setProvinceId(provincepo.getBusinessId());
            IdAndNamePO areapo = queryProvinceCityArea(2, areaName, provincepo.getBusinessId());
            buildAreaUrlLj.setCityName(areapo.getBusinessName());
            buildAreaUrlLj.setCityId(areapo.getBusinessId());
            IdAndNamePO countypo = queryProvinceCityArea(3, countyName, areapo.getBusinessId());
            buildAreaUrlLj.setCountyName(countypo.getBusinessName());
            buildAreaUrlLj.setCountyId(countypo.getBusinessId());
            buildAreaUrlLj.setAreaUrl(url);
            buildAreaUrlLj.setCreateTime(new Date());
            buildAreaUrlLj.setUpdateTime(new Date());
            System.out.println("Persist link " + buildAreaUrlLj);
        }
    }

    private void saveDataCity(String countyName, String areaName, String provinceName, List<WebElement> list) {
        for (WebElement element : list) {
            String url = element.getAttribute("href");
            BuildAreaUrlLj buildAreaUrlLj = new BuildAreaUrlLj();
            IdAndNamePO provincepo = queryProvinceCityArea(1, provinceName, null);
            buildAreaUrlLj.setProvinceName(provinceName);
            buildAreaUrlLj.setProvinceId(provincepo.getBusinessId());
            buildAreaUrlLj.setCityName(areaName);
            IdAndNamePO areapo = queryProvinceCityArea(2, areaName, provincepo.getBusinessId());
            buildAreaUrlLj.setCityId(areapo.getBusinessId());
            IdAndNamePO countypo = queryProvinceCityArea(3, countyName, areapo.getBusinessId());
            buildAreaUrlLj.setCountyName(countypo.getBusinessName());
            buildAreaUrlLj.setCountyId(countypo.getBusinessId());
            buildAreaUrlLj.setAreaUrl(url);
            buildAreaUrlLj.setCreateTime(new Date());
            buildAreaUrlLj.setUpdateTime(new Date());
            System.out.println("Persist link " + buildAreaUrlLj);
        }
    }

    /**
     * Look up province, city, or county information by name
     *
     * @param type         1 = province, 2 = city, 3 = district
     * @param businessName name
     * @param parentId     parent id
     * @return
     */
    private IdAndNamePO queryProvinceCityArea(Integer type, String businessName, String parentId) {
        if (StringUtils.isNotBlank(parentId)) {
            // Province-level municipalities: Chongqing, Beijing, Shanghai, Tianjin
            ArrayList<String> citys = new ArrayList<>(8);
            citys.add("50");
            citys.add("11");
            citys.add("31");
            citys.add("12");
            if (citys.contains(parentId)) {
                businessName = "市辖区";
            }
        }
        IdAndNamePO po = null;
        try {
            if (type == 1) {
                // po = buildingsAvgMapper.queryProvinceIdByName(businessName);
            } else if (type == 2) {
                // po = buildingsAvgMapper.queryCityIdByName(businessName, parentId);
            } else if (type == 3) {
                // po = buildingsAvgMapper.querycountyIdByName(businessName, parentId);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        if (null == po) {
            po = new IdAndNamePO();
            po.setBusinessId("-1");
            po.setBusinessName(businessName);
        }
        return po;
    }

    private static String sleepAndCutoverNewPage(int millis, WebDriver driver) {
        try {
            Thread.sleep(millis);
            for (String handle : driver.getWindowHandles()) {
                if (!pages.contains(handle)) {
                    driver.switchTo().window(handle);
                }
            }
        } catch (InterruptedException e) {
        }
        return null;
    }

    private static void sleep(int millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
        }
    }

    // Returns the element matching the XPath, or null if it does not exist
    public static WebElement validElement(String str, WebDriver driver) {
        try {
            return driver.findElement(By.xpath(str));
        } catch (Exception e) {
            System.out.println("Element not found: " + str);
        }
        return null;
    }
}
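The community code is pulled out of each link with the pattern `xiaoqu/(\w+)/`. Isolating that step in a small function makes it easy to check without a browser. This is a plain-Java sketch: `CommunityCode` and `fromHref` are hypothetical names and the sample URL is invented. Unlike the loop in the listing, which keeps the last match, this version returns the first match and makes the "no match" case explicit.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts the community code from a link such as ".../xiaoqu/<code>/".
final class CommunityCode {

    // Same pattern as in the crawler, compiled once instead of once per link
    private static final Pattern CODE = Pattern.compile("xiaoqu/(\\w+)/");

    // Returns the first code found in the href, or empty if there is none.
    static Optional<String> fromHref(String href) {
        if (href == null) {
            return Optional.empty();
        }
        Matcher matcher = CODE.matcher(href);
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Invented URL that follows the shape the pattern expects
        System.out.println(fromHref("https://example.com/xiaoqu/1111027377595/").orElse("none"));
        System.out.println(fromHref("https://example.com/ershoufang/").orElse("none"));
    }
}
```

An empty result can then be used to skip the link instead of looking up an empty code in Redis.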
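1. driver.close() closes the current window, while driver.quit() ends the whole browser process. When looping over a list, browsers that are never quit will eat up all the memory.
2. After navigating to a page, use an explicit wait where possible, so that lookups do not fail because an element has not loaded yet.
3. Do not send requests too frequently; add proxies if you have special requirements.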
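For point 2, Selenium ships `WebDriverWait` for explicit waits. The idea behind an explicit wait is a polling loop, which the following plain-Java sketch illustrates without a browser. `Poller` and `waitUntil` are hypothetical names, not Selenium APIs, and the probe in `main` is a stand-in for "the element is now present".

```java
import java.time.Duration;
import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical helper: shows the polling idea behind an explicit wait.
final class Poller {

    // Calls `probe` repeatedly until it yields a value or the timeout elapses.
    static <T> Optional<T> waitUntil(Supplier<Optional<T>> probe, Duration timeout, Duration interval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            Optional<T> result = probe.get();
            if (result.isPresent()) {
                return result;
            }
            if (System.nanoTime() >= deadline) {
                return Optional.empty();
            }
            try {
                Thread.sleep(interval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return Optional.empty();
            }
        }
    }

    public static void main(String[] args) {
        // Pretend the element shows up 300 ms from now
        long readyAt = System.currentTimeMillis() + 300;
        Supplier<Optional<String>> probe = () ->
                System.currentTimeMillis() >= readyAt ? Optional.of("element") : Optional.empty();
        Optional<String> found = waitUntil(probe, Duration.ofSeconds(2), Duration.ofMillis(50));
        System.out.println(found.orElse("timed out"));
    }
}
```

Compared with a fixed `Thread.sleep`, polling returns as soon as the condition holds and only waits the full timeout when the element never appears.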
Source code for the examples above
https://download.csdn.net/download/DoAsOnePleases/86772623