0. Prerequisites
Regular expressions
Java multithreading and thread pools
The HttpClient networking library, plus JSON and HTML structure
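1. Fetching the main topics
Zhihu has 33 main topics, which together contain 15,776 subtopics, so the first step is to fetch the 33 main topics.
(PS: I originally planned to use HttpURLConnection for network requests, but later steps need to send POST requests and submit a form, so the later code switches to HttpClient.)
The data-id attribute shown in the figure is the main topic's id.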
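To make the role of data-id concrete, here is a minimal sketch of pulling the ids and topic names out of a page with regular expressions. The sample markup, the class name TopicIdExtractor, and the method names are made up for illustration; the real page structure may differ. The patterns use a capturing group so that only the value inside the quotes is returned.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original post.
public class TopicIdExtractor {
    // Group 1 captures the digits inside data-id="...".
    private static final Pattern DATA_ID = Pattern.compile("data-id=\"([0-9]{1,6})\"");
    // Group 1 captures the text after '#' inside href="#...".
    private static final Pattern HREF_NAME = Pattern.compile("href=\"#(.*?)\"");

    // Collects group 1 of every match of the pattern in the input.
    private static List<String> firstGroups(Pattern pattern, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = pattern.matcher(input);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static List<String> extractTopicIds(String html) {
        return firstGroups(DATA_ID, html);
    }

    public static List<String> extractTopicNames(String html) {
        return firstGroups(HREF_NAME, html);
    }

    public static void main(String[] args) {
        // Assumed sample markup, only loosely modeled on a topic list.
        String sample = "<li data-id=\"1761\"><a href=\"#topic-a\">A</a></li>"
                + "<li data-id=\"3324\"><a href=\"#topic-b\">B</a></li>";
        System.out.println(extractTopicIds(sample));   // prints [1761, 3324]
        System.out.println(extractTopicNames(sample)); // prints [topic-a, topic-b]
    }
}
```

Running the extraction on a local string like this is a convenient way to check the patterns before pointing them at a live response.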
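// Imports needed by the code shown here
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public static void getTopicId(){
    new Thread(new Runnable() {
        @Override
        public void run() {
            Connection connection;
            int id = 1;
            try {
                // Request the topics page with a plain GET
                URL url = new URL("https://www.zhihu.com/topics");
                HttpURLConnection conn = (HttpURLConnection) url.openConnection();
                conn.setRequestMethod("GET");
                conn.connect();
                // Read the whole response body into one string
                BufferedReader bfr = new BufferedReader(new InputStreamReader(conn.getInputStream(),"UTF-8"));
                String line = null;
                StringBuilder sb = new StringBuilder();
                while ((line = (bfr.readLine())) != null){
                    sb.append(line);
                }
                String result = sb.toString();
                // Matches the data-id attribute that holds a main topic's id
                String regex = "data-id=\"[0-9]{0,6}\"";
                Pattern pattern = Pattern.compile(regex);
                Matcher m = pattern.matcher(result);
                // Matches the href="#..." anchor that follows each topic entry
                String regx = "href=\"#.*?\"";
                Pattern p = Pattern.compile(regx);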