
Scraping Kugou Music with Java in 4 Steps


Required jars: jsoup, HttpClient, and net.sf.json (json-lib). Download them yourself, or declare them as Maven dependencies.

 

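1. Check whether the TOP500 chart can be retrieved

First, open the Kugou home page and look at the Kugou TOP500 chart.

[Screenshot: image.png]

Does the page really show only these songs, or can the rest be reached too? To find out, I looked at the link of the TOP500 chart:

https://www.kugou.com/yy/rank/home/1-6666.html?from=rank

There is a 1 right after home, which looks like a page number. Changing the 1 to 2 does load the second page, so the whole 500-song chart can be collected from these pages.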
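The paging observation above can be sketched as a small URL builder. This is only an illustration: the class name `RankPageUrls`, the `PAGE` placeholder, and the page range used in `main` are assumptions made for the example, and the URL pattern is simply the link shown above with its page number replaced.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: builds chart page URLs by substituting the page number.
class RankPageUrls {
    // The link from the article, with the page number replaced by a PAGE placeholder.
    static final String TEMPLATE =
            "https://www.kugou.com/yy/rank/home/PAGE-6666.html?from=rank";

    // Returns the URLs for pages firstPage..lastPage (both inclusive).
    static List<String> pageUrls(int firstPage, int lastPage) {
        List<String> urls = new ArrayList<>();
        for (int page = firstPage; page <= lastPage; page++) {
            urls.add(TEMPLATE.replace("PAGE", String.valueOf(page)));
        }
        return urls;
    }

    public static void main(String[] args) {
        // Print the first three page URLs; how many pages the chart really has
        // must be checked against the site.
        for (String url : pageUrls(1, 3)) {
            System.out.println(url);
        }
    }
}
```

Keeping the template in one constant means only that string needs to change if the chart id in the link changes.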

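2. Find the real mp3 download URL (this part is a bit roundabout)

Click a song to open its playback page, open the Elements panel in the Chrome developer tools, and search for mp3. The location of the MP3 is easy to find.

[Screenshot: image.png]

However, the HTML fetched from Java does not contain the mp3 URL, so the page must be loading it with JavaScript. Reload the page and look at the requests it makes. There are quite a few; focus on the js and php requests and check whether any of them contains the mp3 URL. I will skip the details of that analysis.

In the end, I found the full mp3 URL in this request from the list:

https://wwwapi.kugou.com/yy/index.php?r=play/getdata&callback=jQuery191027067069941080546_1546235744250&hash=667939C6E784265D541DEEE65AE4F2F8&album_id=0&_=1546235744251

The response contains:

"play_url": "http:\/\/fs.w.kugou.com\/201812311325\/dcf5b6449160903c6ee48035e11434bb\/G128\/M08\/02\/09\/IIcBAFrZqf2ANOadADn94ubOmaU995.mp3",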
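Because the request carries a callback parameter, the response is JSONP: a JSON body wrapped in a function call, with slashes escaped as \/ inside the string values. The sketch below shows one way to unwrap it and read play_url. It uses only the JDK and a regular expression so that it runs without extra jars; the article's own code uses json-lib instead. The class name `JsonpPlayUrl` and the sample response in `main` are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: unwraps a JSONP response and extracts play_url.
class JsonpPlayUrl {
    // Matches "play_url":"..." and captures the still-escaped value.
    private static final Pattern PLAY_URL =
            Pattern.compile("\"play_url\"\\s*:\\s*\"([^\"]*)\"");

    // Strips the callback(...) wrapper and returns the JSON text inside it.
    static String unwrapJsonp(String jsonp) {
        int start = jsonp.indexOf('(');
        int end = jsonp.lastIndexOf(')');
        if (start < 0 || end <= start) {
            throw new IllegalArgumentException("not a JSONP response");
        }
        return jsonp.substring(start + 1, end);
    }

    // Returns the play_url with \/ unescaped to /, or null if the field is absent.
    static String playUrl(String json) {
        Matcher m = PLAY_URL.matcher(json);
        if (!m.find()) {
            return null;
        }
        return m.group(1).replace("\\/", "/");
    }

    public static void main(String[] args) {
        // Made-up sample in the same shape as the real response.
        String sample = "jQuery1910_1546235744250({\"status\":1,\"data\":"
                + "{\"play_url\":\"http:\\/\\/fs.w.kugou.com\\/G128\\/song.mp3\"}});";
        System.out.println(playUrl(unwrapJsonp(sample)));
    }
}
```

Using `lastIndexOf(')')` rather than a fixed offset from the end keeps the unwrapping independent of trailing semicolons or newlines.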

 

 

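How does this request know which song is meant? The only candidate is the hash parameter. Searching the playback page for that hash shows it in the following JavaScript:

var dataFromSmarty = [{"hash":"667939C6E784265D541DEEE65AE4F2F8","timelength":"237051","audio_name":"\u767d\u5c0f\u767d - \u6700\u7f8e\u5a5a\u793c","author_name":"\u767d\u5c0f\u767d","song_name":"\u6700\u7f8e\u5a5a\u793c","album_id":0}],// song info for the current page
    playType = "search_single";// current play type
</script>

Next, fetch the page from Java and check whether the hash can be scraped. It can: the fetched HTML contains this JavaScript. With both the mp3 URL and the chart located, all that remains is to implement the process in code.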
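A minimal sketch of pulling the hash out of the page source with a regular expression, assuming the inline script keeps the "hash":"..." shape shown above. The class name `SongHash` is made up for the example, and the sample string is a shortened copy of the snippet from the playback page.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: finds the first song hash in a page's inline JavaScript.
class SongHash {
    // Captures the hexadecimal value of the first "hash":"..." pair.
    private static final Pattern HASH =
            Pattern.compile("\"hash\"\\s*:\\s*\"([0-9A-Fa-f]+)\"");

    // Returns the hash, or null when the page does not contain one.
    static String extractHash(String html) {
        Matcher m = HASH.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "var dataFromSmarty = [{\"hash\":\"667939C6E784265D541DEEE65AE4F2F8\","
                + "\"timelength\":\"237051\",\"album_id\":0}],";
        System.out.println(extractHash(html));
    }
}
```

A capture group returns the value directly, so no follow-up string replacement is needed to strip the key and the quotes.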

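3. Implement the Kugou mp3 scraper in Java

First, a look at the scraping result:

[Screenshot: image.png]

With the resources located, the implementation is straightforward. It relies on a few small utility classes of my own.

Code:

Overview:

SpiderKugou.java            # main launcher class

HttpGetConnect.java      # HttpClient utility class

HtmlManage.java            # jsoup HTML parsing helper class

FileDownload.java          # file download

(1) Main launcher class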

package com.wurao;

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import net.sf.json.JSONObject;

import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

/**
 * Main launcher class.
 * @author 勿扰
 * Created: 2019-01-10
 * Modified: 2019-01-10
 */
public class SpiderKugou {

    // Directory the mp3 files are saved to
    public static String filePath = "D:/music/";

    // getdata request template; HASH and TIME are replaced for each song
    public static String mp3 = "https://wwwapi.kugou.com/yy/index.php?r=play/getdata&callback=jQuery191027067069941080546_1546235744250&"
            + "hash=HASH&album_id=0&_=TIME";

    // Chart page template; PAGE is replaced with the page number
    public static String LINK = "https://www.kugou.com/yy/rank/home/PAGE-8888.html?from=rank";
    //"https://www.kugou.com/yy/rank/home/PAGE-23784.html?from=rank";

    public static void main(String[] args) throws IOException {
        for (int i = 1; i < 23; i++) {
            String url = LINK.replace("PAGE", i + "");
            getTitle(url);
            //download("https://www.kugou.com/song/mfy6je5.html");
        }
    }

    /** Fetches one chart page and downloads every song listed on it. */
    public static String getTitle(String url) throws IOException {
        String content = HttpGetConnect.connect(url, "utf-8");
        HtmlManage html = new HtmlManage();
        Document doc = html.manage(content);
        Elements lists = doc.getElementsByClass("pc_temp_songlist");
        if (lists.isEmpty()) {
            // The page could not be fetched or has no song list
            return null;
        }
        Elements eles = lists.get(0).getElementsByTag("li");
        for (int i = 0; i < eles.size(); i++) {
            Element item = eles.get(i);
            String title = item.attr("title").trim();
            String link = item.getElementsByTag("a").first().attr("href");
            download(link, title);
        }
        return null;
    }

    /** Resolves the mp3 URL of one playback page and saves the file under the given name. */
    public static String download(String url, String name) throws IOException {
        String hash = "";
        String content = HttpGetConnect.connect(url, "utf-8");
        // The song hash sits in the page's inline JavaScript; group 1 is its value
        Pattern pattern = Pattern.compile("\"hash\":\"([0-9A-Z]+)\"");
        Matcher matcher = pattern.matcher(content);
        if (matcher.find()) {
            hash = matcher.group(1);
        }
        String item = mp3.replace("HASH", hash);
        item = item.replace("TIME", System.currentTimeMillis() + "");
        System.out.println(item);
        String mp = HttpGetConnect.connect(item, "utf-8");
        // Strip the JSONP wrapper callback(...) to get the JSON body
        mp = mp.substring(mp.indexOf("(") + 1, mp.lastIndexOf(")"));
        JSONObject json = JSONObject.fromObject(mp);
        String playUrl = json.getJSONObject("data").getString("play_url");
        System.out.print(playUrl + "  ==  ");
        FileDownload.download(playUrl, filePath + name + ".mp3");
        System.out.println(name + ", download finished");
        return playUrl;
    }
}
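The main class builds the output path from the song title (filePath + name + ".mp3"). A title containing characters such as / or ? cannot be used as a file name on Windows, so a small sanitizing step may be worth adding before the download call. The helper below is a hypothetical addition, not part of the original program; the class name `FileNames` and the "untitled" fallback are assumptions.

```java
// Hypothetical helper: makes a song title safe to use as a file name.
class FileNames {
    // Replaces characters that Windows forbids in file names with '_'.
    static String sanitize(String title) {
        String cleaned = title.replaceAll("[\\\\/:*?\"<>|]", "_").trim();
        // Fall back to a fixed name if nothing usable is left.
        return cleaned.isEmpty() ? "untitled" : cleaned;
    }

    public static void main(String[] args) {
        System.out.println(sanitize("AC/DC - Who?"));
    }
}
```

With such a helper, the save path would be built as filePath + FileNames.sanitize(name) + ".mp3".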

 

 

(2) HttpClient utility class

package com.wurao;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.http.HttpEntity;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.BasicHttpClientConnectionManager;

/**
 * HttpClient utility class.
 * @author 勿扰
 * Created: 2019-01-10
 * Modified: 2019-01-10
 */
public class HttpGetConnect {

    private static Log log = LogFactory.getLog(HttpGetConnect.class);

    /**
     * Fetches the content of a URL as text.
     * @param url the address to request
     * @param charsetName charset used to decode the response, e.g. UTF-8 or GB2312
     * @return the response body, or an empty string if the request fails or the status is not 2xx
     * @throws IOException if closing the client fails
     */
    public static String connect(String url, String charsetName) throws IOException {
        BasicHttpClientConnectionManager connManager = new BasicHttpClientConnectionManager();
        CloseableHttpClient httpclient = HttpClients.custom()
                .setConnectionManager(connManager)
                .build();
        String content = "";
        try {
            HttpGet httpget = new HttpGet(url);
            RequestConfig requestConfig = RequestConfig.custom()
                    .setSocketTimeout(5000)
                    .setConnectTimeout(50000)
                    .setConnectionRequestTimeout(50000)
                    .build();
            httpget.setConfig(requestConfig);
            // Browser-like headers
            httpget.setHeader("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8");
            // sdch is left out because HttpClient only decodes gzip and deflate
            httpget.setHeader("Accept-Encoding", "gzip,deflate");
            httpget.setHeader("Accept-Language", "zh-CN,zh;q=0.8");
            httpget.setHeader("Connection", "keep-alive");
            httpget.setHeader("Upgrade-Insecure-Requests", "1");
            httpget.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36");
            httpget.setHeader("cache-control", "max-age=0");
            // try-with-resources closes the response and the reader even on errors
            try (CloseableHttpResponse response = httpclient.execute(httpget)) {
                int status = response.getStatusLine().getStatusCode();
                if (status >= 200 && status < 300) {
                    HttpEntity entity = response.getEntity();
                    try (BufferedReader br = new BufferedReader(
                            new InputStreamReader(entity.getContent(), charsetName))) {
                        StringBuilder sb = new StringBuilder();
                        String line;
                        while ((line = br.readLine()) != null) {
                            sb.append(line).append("\n");
                        }
                        content = sb.toString();
                    }
                }
            }
        } catch (Exception e) {
            log.error("Request failed: " + url, e);
        } finally {
            httpclient.close();
        }
        return content;
    }
}

 

(3) jsoup HTML parsing helper class

package com.wurao;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

/**
 * jsoup HTML parsing helper class.
 * @author 勿扰
 * Created: 2019-01-10
 * Modified: 2019-01-10
 */
public class HtmlManage {

    private static Log log = LogFactory.getLog(HtmlManage.class);

    /** Parses an HTML string into a jsoup Document. */
    public Document manage(String html) {
        return Jsoup.parse(html);
    }

    /**
     * Fetches a URL with jsoup and parses the result.
     * @param url the address to fetch
     * @return the parsed document
     * @throws IOException if the request fails
     */
    public Document manageDirect(String url) throws IOException {
        return Jsoup.connect(url).get();
    }

    /**
     * Collects the inner HTML of every element with the given tag name.
     * @param doc the document to search
     * @param tag the tag name
     * @return the inner HTML of each matching element
     */
    public List<String> manageHtmlTag(Document doc, String tag) {
        return innerHtml(doc.getElementsByTag(tag));
    }

    /**
     * Collects the inner HTML of every element with the given class.
     * @param doc the document to search
     * @param clas the class name
     * @return the inner HTML of each matching element
     */
    public List<String> manageHtmlClass(Document doc, String clas) {
        return innerHtml(doc.getElementsByClass(clas));
    }

    /**
     * Collects the inner HTML of every element whose attribute has the given value.
     * @param doc the document to search
     * @param key the attribute name
     * @param value the attribute value
     * @return the inner HTML of each matching element
     */
    public List<String> manageHtmlKey(Document doc, String key, String value) {
        return innerHtml(doc.getElementsByAttributeValue(key, value));
    }

    // Shared by the three lookups above
    private static List<String> innerHtml(Elements elements) {
        List<String> list = new ArrayList<String>();
        for (int i = 0; i < elements.size(); i++) {
            list.add(elements.get(i).html());
        }
        return list;
    }
}

 

(4) File download

package com.wurao;

import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

/**
 * File download.
 * @author 勿扰
 * Created: 2019-01-10
 * Modified: 2019-01-10
 */
public class FileDownload {

    private static Log log = LogFactory.getLog(FileDownload.class);

    /**
     * Downloads a file.
     * @param url the address to download from
     * @param path the target path, including the file name
     * @return true if the file was saved, false otherwise
     */
    public static boolean download(String url, String path) {
        boolean flag = false;
        RequestConfig requestConfig = RequestConfig.custom().setSocketTimeout(2000)
                .setConnectTimeout(2000).build();
        // try-with-resources closes the client, each response, and both streams
        try (CloseableHttpClient httpclient = HttpClients.createDefault()) {
            // Try up to 3 times
            for (int i = 0; i < 3; i++) {
                HttpGet get = new HttpGet(url);
                get.setConfig(requestConfig);
                try (CloseableHttpResponse result = httpclient.execute(get)) {
                    System.out.println(result.getStatusLine());
                    if (result.getStatusLine().getStatusCode() == 200) {
                        File file = new File(path);
                        // Create the target directory if it does not exist yet
                        File dir = file.getParentFile();
                        if (dir != null) {
                            dir.mkdirs();
                        }
                        try (BufferedInputStream in = new BufferedInputStream(result.getEntity().getContent());
                             BufferedOutputStream out = new BufferedOutputStream(new FileOutputStream(file))) {
                            byte[] buffer = new byte[1024];
                            int len;
                            while ((len = in.read(buffer, 0, 1024)) > -1) {
                                out.write(buffer, 0, len);
                            }
                        }
                        flag = true;
                        break;
                    }
                    // Any other status (for example 500): try again
                }
            }
        } catch (Exception e) {
            log.error("Download failed: " + url, e);
            flag = false;
        }
        return flag;
    }
}

 

(5) Maven dependencies (pom.xml)

<dependencies>
    <!-- commons libraries -->
    <dependency>
        <groupId>commons-lang</groupId>
        <artifactId>commons-lang</artifactId>
        <version>2.6</version>
    </dependency>
    <dependency>
        <groupId>commons-collections</groupId>
        <artifactId>commons-collections</artifactId>
        <version>3.2.1</version>
    </dependency>
    <!-- commons-io was declared twice (2.4 and 2.6); only 2.6 is kept -->
    <dependency>
        <groupId>commons-io</groupId>
        <artifactId>commons-io</artifactId>
        <version>2.6</version>
    </dependency>
    <dependency>
        <groupId>commons-net</groupId>
        <artifactId>commons-net</artifactId>
        <version>3.3</version>
    </dependency>
    <dependency>
        <groupId>commons-beanutils</groupId>
        <artifactId>commons-beanutils</artifactId>
        <version>1.8.3</version>
    </dependency>
    <!-- JSON handling -->
    <dependency>
        <groupId>net.sf.json-lib</groupId>
        <artifactId>json-lib</artifactId>
        <version>2.4</version>
        <classifier>jdk15</classifier>
    </dependency>
    <!-- httpclient -->
    <dependency>
        <groupId>org.apache.httpcomponents</groupId>
        <artifactId>httpclient</artifactId>
        <version>4.5.3</version>
    </dependency>
    <!-- jsoup -->
    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.8.3</version>
    </dependency>
</dependencies>
<build>
    <plugins>
        <!-- resource copy plugin -->
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-resources-plugin</artifactId>
            <version>2.7</version>
            <configuration>
                <encoding>UTF-8</encoding>
            </configuration>
        </plugin>
        <!-- Java compiler plugin -->
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.2</version>
            <configuration>
                <source>1.8</source>
                <target>1.8</target>
                <encoding>UTF-8</encoding>
            </configuration>
        </plugin>
    </plugins>
</build>

 

The main code works: the program runs and downloads the songs.
