
Big Data: A Visualization Experiment Based on Lagou Job-Posting Data — Beginner Tutorial (Part 4: Fetching the Data)

Crawling Lagou's big-data job postings with Java

Configuring Eclipse

We will use Eclipse to write Java code that crawls job-posting data from Lagou.

Prepare hadoop-2.7.3.tar.gz, Eclipse, hadoop-eclipse-plugin-2.7.3.jar, and the hadoop.dll and winutils.exe support files.

hadoop.dll, winutils.exe and related files:
Link: https://pan.baidu.com/s/1DnTw3lChFJy_fRfkKXInBg
Extraction code: xzyp

hadoop-2.7.3.tar.gz:
Link: https://pan.baidu.com/s/1I1FvgICCyeBURzGx62l4QA
Extraction code: xzyp

  1. Put hadoop.dll and winutils.exe into Hadoop's bin folder.
  2. Copy the plugin jar into the plugins folder of your Eclipse installation.

Once that is done, configure Maven.

Download and unpack Maven.

Setting up the environment variables is covered by countless guides online, so it is not repeated here.

Create a new repository directory under the Maven installation directory.

Configure localRepository in conf/settings.xml.
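For example, if the repository directory was created at D:\maven\repository (a placeholder path; substitute your own), the entry in conf/settings.xml looks like this. The localRepository element sits directly under the top-level settings element:

```xml
<settings>
  <!-- Point Maven's local repository at the directory created above -->
  <localRepository>D:\maven\repository</localRepository>
</settings>
```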

Press Win+R, type cmd, and run:

mvn help:system

If the command succeeds, your repository directory will be populated with some files.

Configuring Maven in Eclipse

Open Eclipse and click Window -> Preferences. In the Maven preference pages, point Eclipse at your Maven installation and at the settings.xml configured earlier. (The original screenshots of these dialogs are omitted.)

With that, Maven is configured.

Crawling the Data

Create a Maven project: File -> New -> Other -> Maven -> Maven Project.

 

This is the directory layout after the project is created. (Screenshot omitted.)

Add the following code to pom.xml (adjust the jdk.tools systemPath to your own JDK location):

```xml
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.itcast.jobcase</groupId>
  <artifactId>jobcase-l</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <dependencies>
    <dependency>
      <groupId>org.apache.httpcomponents</groupId>
      <artifactId>httpclient</artifactId>
      <version>4.5.4</version>
    </dependency>
    <dependency>
      <groupId>jdk.tools</groupId>
      <artifactId>jdk.tools</artifactId>
      <version>1.8</version>
      <scope>system</scope>
      <systemPath>C:/Program Files/Java/jdk1.8.0_341/lib/tools.jar</systemPath>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-common</artifactId>
      <version>2.7.3</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>2.7.3</version>
    </dependency>
  </dependencies>
</project>
```

The project directory after completion:

Create a package named com.position.l under src/main/java, and create four classes in it.

First: HttpClientData.java

```java
package com.position.l;

import java.util.HashMap;
import java.util.Map;

public class HttpClientData {
    public static void main(String[] args) throws Exception {
        // Set the request headers (replace the Cookie with your own; see the note below)
        Map<String, String> headers = new HashMap<String, String>();
        headers.put("Cookie","RECOMMEND_TIP=true; user_trace_token=20230509172245-850b8329-0db6-49d5-8ee5-788463473366; LGUID=20230509172245-ee291504-af55-4823-8b8f-da7830adea64; _ga=GA1.2.1941570256.1683624167; index_location_city=%E5%85%A8%E5%9B%BD; _gid=GA1.2.744431736.1684134362; privacyPolicyPopup=false; __lg_stoken__=00ef87c190275da025cc19a93d14d5da80c4c3ff29516c88d738dd7350f8601ae184994af7785dc2260517aa65b80ae0048d5bdb5ea64e76bf2b4df769b1de46bfa3cc6bd487; SEARCH_ID=c6e9d66fa6f64d48874952a58bf47660; gate_login_token=v1####da9e29af0db73d825a22a9a882bf9ddbb316eae052443e729b53cab3f19a8e70; LG_HAS_LOGIN=1; hasDeliver=0; __SAFETY_CLOSE_TIME__26120270=1; JSESSIONID=ABAAABAABEIABCI02041966E510E7120309F7B2F34013BF; WEBTJ-ID=20230515193746-1881f33c337109-00aed6e8f8a02b-7b515477-1327104-1881f33c33814dd; _putrc=743692222AE66441123F89F2B170EADC; login=true; unick=%E7%94%A8%E6%88%B77560; showExpriedIndex=1; showExpriedCompanyHome=1; showExpriedMyPublish=1; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1683624167,1684134362,1684150667; sensorsdata2015session=%7B%7D; X_HTTP_TOKEN=d5afe4428dfdf76486605148610ad9240e30d415e3; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1684150668; TG-TRACK-CODE=index_zhaopin; LGRID=20230515193751-230d34f5-3f31-4ddd-8bb0-23d58400b756; sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%2226120270%22%2C%22first_id%22%3A%22187ffd2094fb06-0d35cf63f3901a-7b515477-1327104-187ffd20950ca3%22%2C%22props%22%3A%7B%22%24latest_traffic_source_type%22%3A%22%E8%87%AA%E7%84%B6%E6%90%9C%E7%B4%A2%E6%B5%81%E9%87%8F%22%2C%22%24latest_search_keyword%22%3A%22%E6%9C%AA%E5%8F%96%E5%88%B0%E5%80%BC%22%2C%22%24latest_referrer%22%3A%22https%3A%2F%2Fcn.bing.com%2F%22%2C%22%24os%22%3A%22Windows%22%2C%22%24browser%22%3A%22Chrome%22%2C%22%24browser_version%22%3A%22113.0.0.0%22%7D%2C%22%24device_id%22%3A%22187ffd2094fb06-0d35cf63f3901a-7b515477-1327104-187ffd20950ca3%22%7D");
        headers.put("Connection", "keep-alive");
        headers.put("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7");
        headers.put("Accept-Language", "zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6");
        headers.put("User-Agent", "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) "
                + "AppleWebKit/537.36 (KHTML, like Gecko) "
                + "Chrome/113.0.0.0 Mobile Safari/537.36 Edg/113.0.1774.42");
        headers.put("Content-Type", "text/html; charset=utf-8");
        headers.put("Referer", "https://www.lagou.com/jobs/list_%E5%A4%A7%E6%95%B0%E6%8D%AE/p-city_0?&cl=false&fromSearch=true&labelWords=&suginput=");
        headers.put("Origin", "https://www.lagou.com");
        headers.put("X-Requested-With", "XMLHttpRequest");
        headers.put("X-Anit-Forge-Token", "None");
        headers.put("Cache-Control", "no-cache");
        headers.put("X-Anit-Forge-Code", "0");
        headers.put("Host", "www.lagou.com");
        // Request parameters: search keyword and city
        Map<String, String> params = new HashMap<String, String>();
        params.put("kd", "大数据");
        params.put("city", "全国");
        // Fetch pages 1..30, writing each page of results to HDFS
        for (int i = 1; i < 31; i++) {
            params.put("pn", String.valueOf(i));
            HttpClientResp result = HttpClientUtils.doPost("https://www.lagou.com/jobs/positionAjax.json?" + "needAddtionalResult=false&first=true&px=default", headers, params);
            HttpClientHdfsUtils.createFileBySysTime("hdfs://192.168.25.128:9000", "page" + i, result.toString());
            // Pause briefly between requests
            Thread.sleep(500);
        }
    }
}
```

If the crawl returns no data, the request headers are the problem: register an account on Lagou, capture your own request headers (in particular the Cookie), and substitute them into the code above.

Second: HttpClientHdfsUtils.java

```java
package com.position.l;

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.net.URI;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HttpClientHdfsUtils {
    public static void createFileBySysTime(String url, String fileName, String data) {
        // Specify the user that operates on HDFS
        System.setProperty("HADOOP_USER_NAME", "root");
        Path path = null;
        // Read the system time
        Calendar calendar = Calendar.getInstance();
        Date time = calendar.getTime();
        // Format the system time as year-month-day
        SimpleDateFormat format = new SimpleDateFormat("yyyyMMdd");
        // The formatted date string names the directory the data is stored in
        String filePath = format.format(time);
        // Build a Configuration object holding the Hadoop settings
        Configuration conf = new Configuration();
        // Build a URI from the HDFS address
        URI uri = URI.create(url);
        // FileSystem handles file- and directory-related operations
        FileSystem fileSystem;
        try {
            // Get the file system object
            fileSystem = FileSystem.get(uri, conf);
            // Define the target path
            path = new Path("/success/" + filePath);
            // Create the directory if it does not exist yet
            if (!fileSystem.exists(path)) {
                fileSystem.mkdirs(path);
            }
            // Create the file under the target directory
            FSDataOutputStream fsDataOutputStream = fileSystem.create(
                    new Path(path.toString() + "/" + fileName));
            // Write the data into the file
            IOUtils.copyBytes(new ByteArrayInputStream(data.getBytes()),
                    fsDataOutputStream, conf, true);
            // Close the connection and release resources
            fileSystem.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
```
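The date-based directory naming used above can be checked in isolation. Here is a minimal sketch with no Hadoop dependency; DatedPath and dailyDir are names invented for this example, and the /success prefix mirrors the class above:

```java
import java.text.SimpleDateFormat;
import java.util.Date;

public class DatedPath {
    // Build the same per-day directory name the class above uses,
    // e.g. "/success/20230515" for May 15, 2023.
    public static String dailyDir(Date time) {
        SimpleDateFormat format = new SimpleDateFormat("yyyyMMdd");
        return "/success/" + format.format(time);
    }

    public static void main(String[] args) {
        // Prints today's target directory
        System.out.println(dailyDir(new Date()));
    }
}
```

Because every run on the same day produces the same directory name, repeated crawls within a day land their page files side by side under one dated folder.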

Third: HttpClientResp.java

```java
package com.position.l;

import java.io.Serializable;

public class HttpClientResp implements Serializable {
    private static final long serialVersionUID = -2224539827395038194L;
    // Response status code
    private int code;
    // Response body
    private String content;

    // No-arg constructor (Eclipse shortcut: Alt+Shift+S)
    public HttpClientResp() {
    }

    public HttpClientResp(int code) {
        super();
        this.code = code;
    }

    public HttpClientResp(String content) {
        this.content = content;
    }

    public HttpClientResp(int code, String content) {
        this.code = code;
        this.content = content;
    }

    // Getters and setters
    public int getCode() {
        return code;
    }

    public void setCode(int code) {
        this.code = code;
    }

    public String getContent() {
        return content;
    }

    public void setContent(String content) {
        this.content = content;
    }

    // Override toString
    @Override
    public String toString() {
        return "HttpClientResp [code=" + code + ", content=" + content + "]";
    }
}
```

Fourth: HttpClientUtils.java

```java
package com.position.l;

import java.io.IOException;
import java.io.UnsupportedEncodingException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import java.util.Set;

import org.apache.http.HttpStatus;
import org.apache.http.NameValuePair;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpEntityEnclosingRequestBase;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.client.methods.HttpRequestBase;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.util.EntityUtils;

public class HttpClientUtils {
    // Encoding for all outgoing requests: UTF-8 throughout
    private static final String ENCODING = "UTF-8";
    // Connection timeout in milliseconds
    private static final int CONNECT_TIMEOUT = 6000;
    // Timeout for receiving data (response time) in milliseconds
    private static final int SOCKET_TIMEOUT = 6000;

    // Package the HTTP request headers
    public static void packageHeader(Map<String, String> params, HttpRequestBase httpMethod) {
        if (params != null) {
            // entrySet() returns the set of key-value pairs in params;
            // the for-each loop takes out one entry at a time
            Set<Entry<String, String>> entrySet = params.entrySet();
            for (Entry<String, String> entry : entrySet) {
                // Set each key and value on the HttpRequestBase request object
                httpMethod.setHeader(entry.getKey(), entry.getValue());
            }
        }
    }

    // Package the HTTP request parameters
    public static void packageParam(Map<String, String> params, HttpEntityEnclosingRequestBase httpMethod)
            throws UnsupportedEncodingException {
        if (params != null) {
            // NameValuePair is a simple name-value pair type, commonly used to
            // hold the parameters when sending a POST request to a URL
            List<NameValuePair> nvps = new ArrayList<NameValuePair>();
            Set<Entry<String, String>> entrySet = params.entrySet();
            for (Entry<String, String> entry : entrySet) {
                // Put each key and value from the entry into the nvps list
                nvps.add(new BasicNameValuePair(entry.getKey(), entry.getValue()));
            }
            // Set the parameters on the request; ENCODING is the constant defined above
            httpMethod.setEntity(new UrlEncodedFormEntity(nvps, ENCODING));
        }
    }

    // Obtain the HTTP response content
    public static HttpClientResp getHttpClientResult(CloseableHttpResponse httpResponse,
            CloseableHttpClient httpClient, HttpRequestBase httpMethod) throws Exception {
        // Execute the request
        httpResponse = httpClient.execute(httpMethod);
        // Read the response
        if (httpResponse != null && httpResponse.getStatusLine() != null) {
            String content = "";
            if (httpResponse.getEntity() != null) {
                // Convert the response body to a String using the given encoding
                content = EntityUtils.toString(httpResponse.getEntity(), ENCODING);
            }
            // Return a HttpClientResp object; the two arguments map to the
            // code and content properties: response code and response body
            return new HttpClientResp(httpResponse.getStatusLine().getStatusCode(), content);
        }
        // If no response was received, return an error status
        return new HttpClientResp(HttpStatus.SC_INTERNAL_SERVER_ERROR);
    }

    // doPost() submits the headers and parameters via an HTTP POST and returns
    // the status code and JSON body sent back by the server
    public static HttpClientResp doPost(String url, Map<String, String> headers, Map<String, String> params)
            throws Exception {
        // Create the HttpClient object
        CloseableHttpClient httpClient = HttpClients.createDefault();
        // Create the HttpPost object
        HttpPost httpPost = new HttpPost(url);
        // setConnectTimeout: connection timeout in ms.
        // setSocketTimeout: time to wait for response data in ms; if the server
        // returns nothing within this window, the call is abandoned.
        RequestConfig requestConfig = RequestConfig.custom()
                .setConnectTimeout(CONNECT_TIMEOUT)
                .setSocketTimeout(SOCKET_TIMEOUT).build();
        // Apply the config to the POST request
        httpPost.setConfig(requestConfig);
        // Set the request headers via packageHeader()
        packageHeader(headers, httpPost);
        // Set the request parameters via packageParam()
        packageParam(params, httpPost);
        // The response object
        CloseableHttpResponse httpResponse = null;
        try {
            // Execute the request and return the result
            return getHttpClientResult(httpResponse, httpClient, httpPost);
        } finally {
            // Release resources
            release(httpResponse, httpClient);
        }
    }

    // release() frees the httpResponse (HTTP response) and httpClient
    // (HTTP request) object resources
    private static void release(CloseableHttpResponse httpResponse, CloseableHttpClient httpClient)
            throws IOException {
        if (httpResponse != null) {
            httpResponse.close();
        }
        if (httpClient != null) {
            httpClient.close();
        }
    }
}
```
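For reference, what UrlEncodedFormEntity does with the params map can be sketched without HttpClient at all: each key and value is percent-encoded and the pairs are joined with "&". FormBody and buildFormBody are hypothetical names for this illustration; the UTF-8 choice mirrors the ENCODING constant above:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

public class FormBody {
    // Encode a parameter map as application/x-www-form-urlencoded,
    // the same wire format UrlEncodedFormEntity produces.
    public static String buildFormBody(Map<String, String> params) {
        try {
            StringBuilder sb = new StringBuilder();
            for (Map.Entry<String, String> entry : params.entrySet()) {
                if (sb.length() > 0) sb.append('&');
                sb.append(URLEncoder.encode(entry.getKey(), "UTF-8"))
                  .append('=')
                  .append(URLEncoder.encode(entry.getValue(), "UTF-8"));
            }
            return sb.toString();
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available, so this cannot happen
            throw new AssertionError(e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<String, String>();
        params.put("kd", "大数据");
        params.put("pn", "1");
        // Prints: kd=%E5%A4%A7%E6%95%B0%E6%8D%AE&pn=1
        System.out.println(buildFormBody(params));
    }
}
```

This is why the crawler can pass "大数据" as a plain Java string: the form-entity layer percent-encodes it before it ever reaches the network.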

Right-click the project name -> Run As -> Java Application.

After the run you will see the data in HDFS (for example, by listing the /success directory with hdfs dfs -ls).

Note that Hadoop must be running while the crawler executes, otherwise the program will throw errors. With the data fetched, the next part will cover processing it.
