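Preparing the data
The data is scraped with Jsoup, a Java HTML parser that can parse a URL or a string of HTML directly and lets you extract and manipulate data through DOM methods, CSS selectors, and jQuery-like operations. For details, see the Jsoup Chinese user manual (Jsoup中文使用手册).
Take the JD search page as an example. Inspecting the page source shows that the product information sits inside the div whose id is J_goodsList, with one li element per product.
The code is as follows: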
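import lombok.AllArgsConstructor;
import lombok.Data;
import lombok.NoArgsConstructor;

// One product entry scraped from the search results
@Data
@AllArgsConstructor
@NoArgsConstructor
public class JdContent {
    private String img;
    private String price;
    private String title;
}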
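import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.springframework.stereotype.Component;

import java.io.IOException;
import java.net.URL;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.List;

@Component
public class HtmlParseUtil {
    // JD search URL; the keyword (for example "Java") is appended to it
    private final String url = "https://search.jd.com/Search?keyword=";

    public List<JdContent> parseJD(String keyword) throws IOException {
        // Fetch and parse the page (3-second timeout); the returned Document
        // plays the same role as the document object in browser JavaScript
        Document document = Jsoup.parse(new URL(url + URLEncoder.encode(keyword, "UTF-8")), 3000);
        // Locate the container element by its id
        Element element = document.getElementById("J_goodsList");
        List<JdContent> list = new ArrayList<>();
        if (element == null) {
            // Container not found, so there is nothing to parse
            return list;
        }
        // Get all li elements; each one is a product entry
        Elements elements = element.getElementsByTag("li");
        for (Element el : elements) {
            String img = el.getElementsByTag("img").eq(0).attr("src");
            String price = el.getElementsByClass("p-price").eq(0).text();
            String title = el.getElementsByClass("p-name").eq(0).text();
            list.add(new JdContent(img, price, title));
        }
        return list;
    }
}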
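Because the keyword is concatenated straight into the query string, characters such as spaces, "+", or Chinese text should be percent-encoded first. The following is a minimal sketch of that step using only the JDK (Java 10 or later is assumed for the Charset overload); the class name JdSearchUrl and the method name build are hypothetical and do not come from the article.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class JdSearchUrl {
    // Base URL taken from the article's HtmlParseUtil.
    private static final String BASE = "https://search.jd.com/Search?keyword=";

    // Form-encodes the keyword (a space becomes '+', '+' becomes %2B)
    // and appends it to the base URL.
    static String build(String keyword) {
        return BASE + URLEncoder.encode(keyword, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(build("spring boot"));
        System.out.println(build("c++"));
    }
}
```

The resulting string could then be passed to new URL(...) before handing it to Jsoup.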
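Note, however, that the image URL may not be retrieved this way, for example when the network is slow. To speed up page loads, sites usually lazy-load images, so the src attribute can be empty when the HTML is fetched. Looking at the page source again, the img tag carries a source-data-lazy-img attribute, and the image URL can be read from that instead: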
String img = el.getElementsByTag("img").eq(0).attr("source-data-lazy-img");
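One way to make the image lookup more robust is to prefer the lazy-load attribute and fall back to src when it is empty. The sketch below works on plain strings so it runs without Jsoup; the class LazyImage and the method pick are hypothetical, and the handling of protocol-relative values such as "//host/a.jpg" is an assumption about what the page may return, not something the article states.

```java
final class LazyImage {
    // Prefers the lazy-load attribute value, falls back to the src value,
    // and turns a protocol-relative URL ("//host/path") into an https URL.
    // Returns an empty string when neither value is usable.
    static String pick(String lazyValue, String srcValue) {
        String url = (lazyValue != null && !lazyValue.isBlank()) ? lazyValue : srcValue;
        if (url == null || url.isBlank()) {
            return "";
        }
        url = url.trim();
        return url.startsWith("//") ? "https:" + url : url;
    }

    public static void main(String[] args) {
        // Placeholder values; in the scraper they would come from the two attr(...) calls.
        System.out.println(pick("//img.example.com/a.jpg", ""));
        System.out.println(pick("", "https://img.example.com/b.jpg"));
    }
}
```

In the scraper, the two arguments would be the results of attr("source-data-lazy-img") and attr("src") on the first img element.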
Business logic
Inserting data
Insert the scraped data into the jd index, which has already been created.
@Autowired
private RestHighLevelClient restHighLevelClient;
@Autowired
private HtmlParseUtil htmlParseUtil;

public boolean parseContent(String keyword) throws IOException {
    // Scrape the products for this keyword
    List<JdContent> jdContents = htmlParseUtil.parseJD(keyword);
    // Collect all documents into one bulk request
    BulkRequest bulkRequest = new BulkRequest();
    bulkRequest.timeout("1m");
    for (JdContent jdContent : jdContents) {
        bulkRequest.add(new IndexRequest("jd").source(JSON.toJSONString(jdContent), XContentType.JSON));
    }
    BulkResponse bulkResponse = restHighLevelClient.bulk(bulkRequest, RequestOptions.DEFAULT);
    // true only if every document was indexed successfully
    return !bulkResponse.hasFailures();
}
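Since the scraper visits every li in the result list, some entries may come back with an empty title (an assumption about the page, not something the article states). A minimal sketch of dropping such entries before they are added to the bulk request follows; Item is a hypothetical stand-in for JdContent so that the example runs on its own, and withTitle is likewise a hypothetical helper.

```java
import java.util.ArrayList;
import java.util.List;

final class BulkPrep {
    // Stand-in for the article's JdContent (hypothetical, fields only).
    static final class Item {
        final String img;
        final String price;
        final String title;

        Item(String img, String price, String title) {
            this.img = img;
            this.price = price;
            this.title = title;
        }
    }

    // Keeps only the entries that have a non-blank title.
    static List<Item> withTitle(List<Item> items) {
        List<Item> kept = new ArrayList<>();
        for (Item item : items) {
            if (item.title != null && !item.title.isBlank()) {
                kept.add(item);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Item> scraped = List.of(
                new Item("a.jpg", "59.00", "Java"),
                new Item("", "", ""));
        System.out.println(withTitle(scraped).size());
    }
}
```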
Providing search
public List<Map<String, Object>> search(String keyword, int pageNo, int pageSize) throws IOException {
    if (pageNo < 1) {
        pageNo = 1;
    }
    // Build the search request against the jd index
    SearchRequest searchRequest = new SearchRequest("jd");
    SearchSourceBuilder sourceBuilder = new SearchSourceBuilder();
    // Paging: from is a zero-based offset, not a page number
    sourceBuilder.from((pageNo - 1) * pageSize);
    sourceBuilder.size(pageSize);
    // Match the keyword against the title field
    MatchQueryBuilder matchQueryBuilder = QueryBuilders.matchQuery("title", keyword);
    sourceBuilder.query(matchQueryBuilder);
    sourceBuilder.timeout(new TimeValue(50, TimeUnit.SECONDS));
    // Execute the search
    searchRequest.source(sourceBuilder);
    SearchResponse searchResponse = restHighLevelClient.search(searchRequest, RequestOptions.DEFAULT);
    // Collect the source of each hit
    List<Map<String, Object>> list = new ArrayList<>();
    for (SearchHit hit : searchResponse.getHits().getHits()) {
        list.add(hit.getSourceAsMap());
    }
    return list;
}
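The from parameter of a search is a zero-based offset into the hits rather than a page number, so a 1-based page number has to be converted before it is passed in. A minimal sketch of that conversion, with the hypothetical helper Paging.offset:

```java
final class Paging {
    // Converts a 1-based page number into the zero-based offset expected by "from".
    // Page numbers below 1 are treated as page 1, mirroring the guard in the search method.
    static int offset(int pageNo, int pageSize) {
        int page = Math.max(pageNo, 1);
        int size = Math.max(pageSize, 0);
        return (page - 1) * size;
    }

    public static void main(String[] args) {
        // Page 1 starts at the first hit; page 3 with 10 hits per page skips the first 20.
        System.out.println(offset(1, 10));
        System.out.println(offset(3, 10));
    }
}
```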
Front-end interaction
new Vue({
    el: '#app',
    data: {
        keyword: '',  // search keyword
        results: []   // search results
    },
    methods: {
        searchKey() {
            var keyword = this.keyword;
            // console.log(keyword);
            // Request page 1 with 10 results per page
            axios.get('search/' + keyword + "/1/10").then(response => {
                // console.log(response);
                this.results = response.data;
            })
        }
    }
})
Here, app is the id of the root element, keyword is the model bound to the input box, and searchKey is the handler triggered by the search (搜索) button.
Then simply iterate over the returned results to render them.
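Keyword highlighting
The logic is largely the same as for an ordinary query. You only need to configure the highlighting and then, in each hit returned by ES, replace the original field value with the highlighted version.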
public List<Map<String, Object>> searchHignLight(String keyword, int pageNo, int pageSize) throws IOException {
    if (pageNo < 1) {
        pageNo = 1;
    }
    // Build the search request against the jd index
    SearchRequest searchRequest = new SearchRequest("jd");
    SearchSourceBuilder sourceBuilder = new SearchSourceBuilder();
    // Paging: from is a zero-based offset, not a page number
    sourceBuilder.from((pageNo - 1) * pageSize);
    sourceBuilder.size(pageSize);
    // Match the keyword against the title field
    MatchQueryBuilder matchQueryBuilder = QueryBuilders.matchQuery("title", keyword);
    sourceBuilder.query(matchQueryBuilder);
    sourceBuilder.timeout(new TimeValue(50, TimeUnit.SECONDS));
    // Highlighting
    HighlightBuilder highlightBuilder = new HighlightBuilder();
    highlightBuilder.field("title");
    // Tags wrapped around each match. The original values were lost from the text;
    // a red <span> is assumed here for illustration.
    highlightBuilder.preTags("<span style='color:red'>");
    highlightBuilder.postTags("</span>");
    // Do not require the highlighted field to be the one the query matched
    highlightBuilder.requireFieldMatch(false);
    sourceBuilder.highlighter(highlightBuilder);
    // Execute the search
    searchRequest.source(sourceBuilder);
    SearchResponse searchResponse = restHighLevelClient.search(searchRequest, RequestOptions.DEFAULT);
    // Parse the results
    List<Map<String, Object>> list = new ArrayList<>();
    for (SearchHit hit : searchResponse.getHits().getHits()) {
        // Read the highlighted version of the title, if any
        Map<String, HighlightField> highlightFields = hit.getHighlightFields();
        HighlightField title = highlightFields.get("title");
        // Replace the original field with the highlighted one
        Map<String, Object> sourceAsMap = hit.getSourceAsMap();
        if (title != null) {
            Text[] fragments = title.fragments();
            StringBuilder temValue = new StringBuilder();
            for (Text fragment : fragments) {
                temValue.append(fragment);
            }
            sourceAsMap.put("title", temValue.toString()); // replace
        }
        list.add(sourceAsMap);
    }
    return list;
}
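The replacement step at the end of the method can be looked at in isolation: join the highlight fragments and overwrite the field in the hit's source map, leaving the map untouched when no highlight was returned. The sketch below uses plain strings in place of the Elasticsearch Text type so that it runs without the client library; HighlightMerge and apply are hypothetical names, and the em tag in the demo is only a placeholder.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class HighlightMerge {
    // Returns a copy of source in which field is replaced by the joined fragments.
    // When there are no fragments, the copy keeps the original value.
    static Map<String, Object> apply(Map<String, Object> source, String field, List<String> fragments) {
        Map<String, Object> result = new HashMap<>(source);
        if (fragments != null && !fragments.isEmpty()) {
            result.put(field, String.join("", fragments));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Object> source = new HashMap<>();
        source.put("title", "Java in Action");
        source.put("price", "59.00");
        System.out.println(apply(source, "title", List.of("<em>Java</em> in Action")));
    }
}
```

Returning a copy rather than mutating the hit's map is a design choice of this sketch; the article's method updates the map in place.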
The front-end page then only needs to render the returned HTML.