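Server and node configuration for ElasticSearch 7.6.2 is not repeated here; see the earlier article on that topic.
Ingest-Attachment is an out-of-the-box plugin, built on Apache Tika, that extracts text from mainstream file formats (PDF, DOC, and so on) and indexes it automatically.
Install (alternatively, download the plugin package manually and place it in the es plugins directory):
Open cmd, change to the elasticsearch bin directory, run the following command, and wait for the plugin to install. Restart the node afterwards so that the plugin is loaded.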
elasticsearch-plugin install ingest-attachment
Uninstall:
Open cmd, change to the elasticsearch bin directory, and run the following command:
elasticsearch-plugin remove ingest-attachment
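After installing or removing the plugin, it can be useful to confirm which nodes actually report it. The following is a minimal Python sketch, assuming a node reachable at http://localhost:9200 without authentication (both are assumptions, not part of the original setup). It reads the JSON form of `GET _cat/plugins`; the helper names are illustrative.

```python
import json
import urllib.request


def nodes_with_plugin(cat_plugins_rows, plugin="ingest-attachment"):
    """Return the names of the nodes whose plugin list contains `plugin`."""
    return {
        row["name"]
        for row in cat_plugins_rows
        if row.get("component") == plugin
    }


def fetch_cat_plugins(base_url="http://localhost:9200"):
    """Call GET _cat/plugins?format=json (base_url is an assumed address)."""
    url = f"{base_url}/_cat/plugins?format=json"
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Hypothetical rows in the shape returned by _cat/plugins?format=json.
sample_rows = [
    {"name": "node-1", "component": "ingest-attachment", "version": "7.6.2"},
    {"name": "node-1", "component": "analysis-ik", "version": "7.6.2"},
    {"name": "node-2", "component": "analysis-ik", "version": "7.6.2"},
]
print(sorted(nodes_with_plugin(sample_rows)))
```

Against a live cluster, `nodes_with_plugin(fetch_cat_plugins())` would give the same kind of answer; every node that runs ingest pipelines needs the plugin.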
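Set up file storage in ElasticSearch so that files can be searched by name and by content
1. Create the text-extraction pipeline (run it once; it is then available to every index)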
- PUT _ingest/pipeline/attachment
- {
- "description": "Extract attachment information",
- "processors": [
- {
- "attachment": {
- "field": "data",
- "indexed_chars": -1,
- "ignore_missing": true
- }
- },
- {
- "remove": {
- "field": "data"
- }
- }
- ]
- }
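Before indexing real files, the pipeline can be tried out with the simulate API (`POST _ingest/pipeline/attachment/_simulate`), which runs the processors against sample documents without storing anything. The sketch below only builds the request body in Python; the helper name and the sample text are illustrative assumptions.

```python
import base64
import json


def build_simulate_body(text):
    """Build a body for POST _ingest/pipeline/attachment/_simulate.

    The attachment processor reads base64 from the "data" field,
    matching the "field" setting of the pipeline defined above.
    """
    encoded = base64.b64encode(text.encode("utf-8")).decode("ascii")
    return {"docs": [{"_source": {"data": encoded}}]}


simulate_body = build_simulate_body("hello attachment")
print("POST _ingest/pipeline/attachment/_simulate")
print(json.dumps(simulate_body, indent=2))
```

Pasting the printed request into the Kibana console should return the document as the pipeline would store it, with the extracted text under attachment.content and the data field removed.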
2. Create the index filedata
Fields: file name, file extension, file path, and the extracted file content
- PUT /filedata
- {
- "mappings": {
- "properties": {
- "filename": {
- "type": "text",
- "analyzer": "ik_max_word"
- },
- "fileext": {
- "type": "keyword"
- },
- "filepath": {
- "type": "keyword"
- },
- "attachment.content": {
- "type": "text",
- "analyzer": "ik_max_word"
- }
- }
- }
- }
3. Bulk-create data in the Kibana Dev Tools console
- PUT /filedata/_bulk?pipeline=attachment&pretty=true
- {"index":{}}
- {"filename":"小黑","fileext":"txt","filepath":"d:/tempfile", "data":"5LiJ5aSp5LiN5omT5LiK5oi/5o+t55OmIOS9oOivtOeahOWvueS4jeWvuQ=="}
- {"index":{}}
- {"filename":"小白","fileext":"txt","filepath":"d:/tempfile","data":"5Lit5Y2O5Lq65ZCN5YWx5ZKM5Zu9IOaIkeeahOelluWbvQ=="}
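After the request completes, each stored document holds the extracted text under attachment.content; the original data field has been removed by the pipeline.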
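Typing base64 by hand quickly becomes impractical, so the bulk body is usually generated. The sketch below assembles the same newline-delimited body in Python from raw bytes; the function name and the sample texts are assumptions made for illustration, while the field names follow the filedata mapping.

```python
import base64
import json


def build_bulk_body(files):
    """Build the NDJSON body for PUT /filedata/_bulk?pipeline=attachment.

    `files` is an iterable of (filename, fileext, filepath, raw_bytes).
    """
    lines = []
    for filename, fileext, filepath, raw in files:
        # Action line: let ElasticSearch generate the document id.
        lines.append(json.dumps({"index": {}}))
        # Source line: the pipeline reads the base64 payload from "data".
        lines.append(json.dumps({
            "filename": filename,
            "fileext": fileext,
            "filepath": filepath,
            "data": base64.b64encode(raw).decode("ascii"),
        }, ensure_ascii=False))
    # The bulk API requires the final line to end with a newline as well.
    return "\n".join(lines) + "\n"


sample_files = [
    ("小黑", "txt", "d:/tempfile", "first sample text".encode("utf-8")),
    ("小白", "txt", "d:/tempfile", "second sample text".encode("utf-8")),
]
bulk_body = build_bulk_body(sample_files)
print(len(bulk_body.splitlines()), "lines in the bulk body")
```

The resulting string can be sent as UTF-8 with the header Content-Type: application/x-ndjson.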
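4. Query using the IK analysis plugin
This query searches by file name. A term query does not analyze the search value, so the value must exactly equal one of the tokens that IK produced when the document was indexed. highlight marks the matched text in the response.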
- GET /filedata/_search
- {
- "query": {
- "term": {
- "filename": {
- "value": "小"
- }
- }
- },
- "highlight": {
- "fragment_size": 40,
- "fields": {
- "filename": { }
- }
- }
- }
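The query result is shown in the figure:
5. Query by file content, using the text that ingest-attachment extracted through the pipeline
A match query analyzes the search text with the field's analyzer before matching; highlight marks the matched text in the response.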
- GET /filedata/_search
- {
- "query": {
- "match": {
- "attachment.content": "共和国"
- }
- },
- "highlight": {
- "fragment_size": 40,
- "fields": {
- "attachment.content": { }
- }
- }
- }
The query result is as follows:
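To work with such a result programmatically, the highlighted fragments can be read from hits.hits[].highlight in the search response, where matches are wrapped in <em> tags by default. The following Python sketch shows one way to collect them; the sample response is hypothetical and trimmed to the keys the function reads.

```python
def extract_highlights(response, field="attachment.content"):
    """Return (filename, fragments) pairs for every hit in a search response."""
    results = []
    for hit in response.get("hits", {}).get("hits", []):
        # A hit has no "highlight" entry when the field did not match.
        fragments = hit.get("highlight", {}).get(field, [])
        results.append((hit.get("_source", {}).get("filename"), fragments))
    return results


# Hypothetical response, not output captured from a real cluster.
sample_response = {
    "hits": {
        "hits": [
            {
                "_source": {"filename": "report"},
                "highlight": {
                    "attachment.content": ["an <em>example</em> fragment"]
                },
            }
        ]
    }
}
for name, fragments in extract_highlights(sample_response):
    print(name, fragments)
```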
6. ElasticSearch bool query combining term and match. Both clauses sit under should, so a document that matches either one is returned.
- GET /filedata/_search
- {
- "query": {
- "bool": {
- "should": [
- {
- "term": {
- "filename": {
- "value": "黑"
- }
- }
- },
- {
- "match": {
- "attachment.content": "共和国"
- }
- }
- ]
- }
- },
- "highlight": {
- "fragment_size": 100,
- "fields": {
- "attachment.content": { }
- }
- }
- }
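The query result is as follows:
That completes installing and using the ingest-attachment plugin (its role here is the text extraction performed by the pipeline).
Note: the file content must be base64-encoded before it is sent. Index the base64 string through the pipeline, and ingest-attachment automatically extracts the text from it and stores it in attachment.content.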
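As an illustration of the note above, the sketch below reads a file as raw bytes, base64-encodes it, and builds a document with the fields of the filedata index. The function names, the localhost address, and the demo file are assumptions; a small text file stands in for a real PDF or DOC.

```python
import base64
import json
import os
import tempfile
import urllib.request


def file_to_document(path):
    """Read a file as raw bytes and build the document the pipeline expects."""
    with open(path, "rb") as fh:  # binary mode: PDF and DOC are not text
        raw = fh.read()
    name, ext = os.path.splitext(os.path.basename(path))
    return {
        "filename": name,
        "fileext": ext.lstrip("."),
        "filepath": os.path.dirname(os.path.abspath(path)),
        "data": base64.b64encode(raw).decode("ascii"),
    }


def index_file(path, base_url="http://localhost:9200"):
    """POST the document to /filedata/_doc through the attachment pipeline.

    base_url is an assumed address; this function is not called below.
    """
    req = urllib.request.Request(
        f"{base_url}/filedata/_doc?pipeline=attachment",
        data=json.dumps(file_to_document(path)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Demo without a cluster: encode a temporary file and inspect the document.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "notes.txt")
    with open(demo_path, "wb") as fh:
        fh.write(b"plain text stands in for a real PDF or DOC here")
    demo_doc = file_to_document(demo_path)
print(demo_doc["filename"], demo_doc["fileext"], len(demo_doc["data"]))
```

Because the whole file travels inside one JSON document, very large files increase request size and memory use on the ingest node, so a size limit on the client side is worth considering.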