
Network Programming: Crawling Web Pages over HTTP with HttpClient

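First, find the matching a tags on the list page and extract the URL of the page that hosts each image, for example https://desk.3gbizhi.com/deskMV/438.html. The image itself is then taken from that page.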

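Building the Form UI

The form needs a text box for the URL to crawl (textBox1), text boxes for the first and last page numbers (textBox3 and textBox4), and a button that starts the crawl.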

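The Http class

    using System.Net.Http;

    public static class Http
    {
        // Single HttpClient shared by every request.
        public static HttpClient Client { get; }

        static Http()
        {
            HttpClientHandler handler = new HttpClientHandler(); // message handler
            // ServerCertificateCustomValidationCallback decides whether a server certificate is accepted.
            // Some sites have certificates a browser would reject; returning true skips validation for them.
            handler.ServerCertificateCustomValidationCallback = (message, cert, chain, errors) => true;
            Client = new HttpClient(handler); // request object
        }
    }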

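Regexes for the image page URL and the image URL

    // Matches the link to each image's detail page on the list page; group 1 is the page URL.
    Regex imgHtml = new Regex(@"<a href=""(https://[a-zA-Z0-9/\.]+\.html)"" class=""[a-zA-Z0-9]* imgw"" target=""_blank"">");
    // Sample tag on the detail page:
    // <a href="https://pic.3gbizhi.com/uploadmark/20231006/c54bae39ffc4a10b023fc5c7adfee803.jpg" class="arrows" target="_blank"><i class="fa fa-search-plus fa-fw"></i></a>
    // Matches the full-size image link on the detail page; group 1 is the image URL.
    Regex picReg = new Regex(@"<a href=""(https://pic\.3gbizhi\.com/uploadmark/\d+/[a-zA-Z0-9]+\.(jpg|png))"" class=""arrows"" target=""_blank"">");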
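
The image pattern can be checked without touching the site by running it against the sample tag quoted in the comment. The sketch below assumes the real markup has no spaces around `<a`, `href`, and `=` (the spaces in the comment look like formatting residue). `RegexDemo` and `ExtractImageUrl` are illustrative names and not part of the original program.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexDemo
{
    // Same pattern as picReg: group 1 is the full image URL, group 2 the extension.
    private static readonly Regex PicReg = new Regex(
        @"<a href=""(https://pic\.3gbizhi\.com/uploadmark/\d+/[a-zA-Z0-9]+\.(jpg|png))"" class=""arrows"" target=""_blank"">");

    // Sample markup from the comment, with spacing normalized (an assumption).
    public const string SampleHtml =
        @"<a href=""https://pic.3gbizhi.com/uploadmark/20231006/c54bae39ffc4a10b023fc5c7adfee803.jpg"" class=""arrows"" target=""_blank""><i class=""fa fa-search-plus fa-fw""></i></a>";

    // Returns the captured image URL, or null when the pattern does not match.
    public static string ExtractImageUrl(string html)
    {
        Match m = PicReg.Match(html);
        return m.Success ? m.Groups[1].Value : null;
    }

    public static void Main()
    {
        Console.WriteLine(ExtractImageUrl(SampleHtml));
        Console.WriteLine(ExtractImageUrl("<p>no link here</p>") ?? "(no match)");
    }
}
```

Checking `Match.Success` before reading `Groups[1]` matters in the crawler too: on a failed match `Groups[1].Value` is an empty string, which would otherwise be passed on as a download URL.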
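
Button click handler

    // Body of the button's Click handler; the handler must be declared async.
    string url = this.textBox1.Text;            // URL to crawl, e.g. one ending in index_23.html
    int start = int.Parse(this.textBox3.Text);  // first page, e.g. 1 for index_1.html
    int end = int.Parse(this.textBox4.Text);    // last page, e.g. 2 for index_2.html
    Regex reg = new Regex(@"index_\d+\.html$");
    // Strip the trailing index_N.html and slash, leaving e.g. https://desk.3gbizhi.com/deskMV
    url = reg.Replace(url, "").TrimEnd('/');

    for (int i = start; i <= end; i++)
    {
        string nowURL = $"{url}/index_{i}.html";
        HttpResponseMessage res = await Http.Client.GetAsync(nowURL);
        // Full HTML of the list page.
        string data = await res.Content.ReadAsStringAsync();
        // Matches returns a MatchCollection of every substring of data that satisfies the regex.
        MatchCollection matches = imgHtml.Matches(data);
        foreach (Match item in matches)
        {
            // URL of the image's detail page (group 1 of imgHtml).
            string picURL = item.Groups[1].Value;
            // Fetch the detail page and match the image link in the format shown above.
            var res1 = await Http.Client.GetAsync(picURL);
            string data1 = await res1.Content.ReadAsStringAsync();
            Match pic = picReg.Match(data1);
            if (!pic.Success) continue; // no image link found on this page
            string picURL1 = pic.Groups[1].Value;
            Console.WriteLine(picURL1);
            await downLoad(picURL1);
        }
    }

    // Downloads one image and saves it under its original file name.
    // Returns Task rather than void so the caller can await it and see exceptions.
    public async Task downLoad(string url)
    {
        var res = await Http.Client.GetAsync(url);
        byte[] b1 = await res.Content.ReadAsByteArrayAsync();
        // Target folder: C:\Users\Administrator\Desktop\PP (created if it does not exist).
        string dir = @"C:\Users\Administrator\Desktop\PP";
        Directory.CreateDirectory(dir);
        File.WriteAllBytes(Path.Combine(dir, Path.GetFileName(url)), b1);
    }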
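
The list-page URLs requested by the loop can be checked in isolation before any request is sent. This is a minimal sketch, assuming the list pages follow the index_N.html naming used in the comments. `PageUrls` and `Build` are illustrative names. The trailing slash is trimmed from the base so that joining with `/index_N.html` does not produce a double slash.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PageUrls
{
    // Matches a trailing index_N.html segment, as in the click handler.
    private static readonly Regex IndexSuffix = new Regex(@"index_\d+\.html$");

    // Builds the list-page URLs for pages start..end (inclusive).
    public static List<string> Build(string url, int start, int end)
    {
        // Drop the index_N.html suffix and any trailing slash to get the base URL.
        string baseUrl = IndexSuffix.Replace(url, "").TrimEnd('/');
        var pages = new List<string>();
        for (int i = start; i <= end; i++)
        {
            pages.Add($"{baseUrl}/index_{i}.html");
        }
        return pages;
    }

    public static void Main()
    {
        foreach (string page in Build("https://desk.3gbizhi.com/deskMV/index_23.html", 1, 2))
        {
            Console.WriteLine(page);
        }
    }
}
```

With the input above, the two entries are https://desk.3gbizhi.com/deskMV/index_1.html and https://desk.3gbizhi.com/deskMV/index_2.html. The same result is produced whether the text box holds the bare directory URL or a URL ending in index_N.html.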
