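Normalized design in relational databases aims mainly to reduce unnecessary updates, but it often comes with side effects.
Denormalized design does not use relationships; instead, it stores redundant copies of the data inside each document.
Relational databases generally normalize data; in Elasticsearch, you usually denormalize it.
Elasticsearch is not good at handling relationships. The following four approaches are commonly used to handle them:
Object type: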
DELETE blog
# Define the mapping for blog
PUT /blog
{
"mappings": {
"properties": {
"content": {
"type": "text"
},
"time": {
"type": "date"
},
"user": {
"properties": {
"city": {
"type": "text"
},
"userid": {
"type": "long"
},
"username": {
"type": "keyword"
}
}
}
}
}
}
# Index one blog document
PUT /blog/_doc/1
{
"content":"I like Elasticsearch",
"time":"2022-01-01T00:00:00",
"user":{
"userid":1,
"username":"Test",
"city":"Beijing"
}
}
# Search blog documents
POST /blog/_search
{
"query": {
"bool": {
"must": [
{"match": {"content": "Elasticsearch"}},
{"match": {"user.username": "Test"}}
]
}
}
}
DELETE /my_movies
# Mapping for movies
PUT /my_movies
{
"mappings" : {
"properties" : {
"actors" : {
"properties" : {
"first_name" : {
"type" : "keyword"
},
"last_name" : {
"type" : "keyword"
}
}
},
"title" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
}
}
# Index one movie
POST /my_movies/_doc/1
{
"title":"Speed",
"actors":[
{
"first_name":"Keanu",
"last_name":"Reeves"
},
{
"first_name":"Dennis",
"last_name":"Hopper"
}
]
}
# Search movies (first name of one actor, last name of another)
POST /my_movies/_search
{
"query": {
"bool": {
"must": [
{"match": {"actors.first_name": "Keanu"}},
{"match": {"actors.last_name": "Hopper"}}
]
}
}
}
response
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 0.723315,
"hits" : [
{
"_index" : "my_movies",
"_type" : "_doc",
"_id" : "1",
"_score" : 0.723315,
"_source" : {
"title" : "Speed",
"actors" : [
{
"first_name" : "Keanu",
"last_name" : "Reeves"
},
{
"first_name" : "Dennis",
"last_name" : "Hopper"
}
]
}
}
]
}
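}
When the document is stored, the boundaries of the inner objects are not taken into account: the JSON is flattened into key-value pairs. A query on several fields therefore returns unexpected results. The Nested data type solves this problem.
"title": "Speed"
"actors.first_name": ["Keanu", "Dennis"]
"actors.last_name": ["Reeves", "Hopper"]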
"type": "nested"
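What is the Nested data type
Nested data type: allows each object in an array of objects to be indexed independently
With the nested and properties keywords, every entry in actors is indexed into its own separate document
Internally, each nested object is stored as a separate hidden Lucene document next to the root document, and they are joined at query time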
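The difference between the two field types can be sketched in plain Python. This is only a simplified model of the matching behavior, not Elasticsearch's implementation; the helper names `flatten`, `object_match` and `nested_match` are made up for this illustration.

```python
from collections import defaultdict

movie = {
    "title": "Speed",
    "actors": [
        {"first_name": "Keanu", "last_name": "Reeves"},
        {"first_name": "Dennis", "last_name": "Hopper"},
    ],
}

def flatten(doc, prefix=""):
    """Flatten inner objects into dotted keys, losing the object boundaries."""
    flat = defaultdict(list)
    for key, value in doc.items():
        path = f"{prefix}{key}"
        items = value if isinstance(value, list) else [value]
        for item in items:
            if isinstance(item, dict):
                for sub_path, sub_values in flatten(item, path + ".").items():
                    flat[sub_path].extend(sub_values)
            else:
                flat[path].append(item)
    return dict(flat)

def object_match(doc, conditions):
    """Object-type model: every condition is checked against the flattened values."""
    flat = flatten(doc)
    return all(value in flat.get(field, []) for field, value in conditions.items())

def nested_match(doc, path, conditions):
    """Nested model: a single object under `path` must satisfy every condition."""
    return any(
        all(obj.get(field) == value for field, value in conditions.items())
        for obj in doc.get(path, [])
    )

flat = flatten(movie)
# Keanu + Hopper belong to two different actors.
object_hit = object_match(
    movie, {"actors.first_name": "Keanu", "actors.last_name": "Hopper"}
)
nested_hit = nested_match(
    movie, "actors", {"first_name": "Keanu", "last_name": "Hopper"}
)
print(flat)
print(object_hit, nested_hit)
```

In this model the object-style check matches because "Keanu" and "Hopper" both appear somewhere in the flattened lists, while the nested-style check does not, because no single actor has both values.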
DELETE /my_movies
# Create the mapping with a nested field
PUT /my_movies
{
"mappings" : {
"properties" : {
"actors" : {
"type": "nested",
"properties" : {
"first_name" : {"type" : "keyword"},
"last_name" : {"type" : "keyword"}
}},
"title" : {
"type" : "text",
"fields" : {"keyword":{"type":"keyword","ignore_above":256}}
}
}
}
}
POST /my_movies/_doc/1
{
"title":"Speed",
"actors":[
{
"first_name":"Keanu",
"last_name":"Reeves"
},
{
"first_name":"Dennis",
"last_name":"Hopper"
}
]
}
nested query
# Nested query
POST /my_movies/_search
{
"query": {
"bool": {
"must": [
{"match": {"title": "Speed"}},
{
"nested": {
"path": "actors",
"query": {
"bool": {
"must": [
{"match": {
"actors.first_name": "Keanu"
}},
{"match": {
"actors.last_name": "Hopper"
}}
]
}
}
}
}
]
}
}
}
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 0,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
}
}
# Nested Aggregation
POST /my_movies/_search
{
"size": 0,
"aggs": {
"actors": {
"nested": {
"path": "actors"
},
"aggs": {
"actor_name": {
"terms": {
"field": "actors.first_name",
"size": 10
}
}
}
}
}
}
response
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"actors" : {
"doc_count" : 2,
"actor_name" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "Dennis",
"doc_count" : 1
},
{
"key" : "Keanu",
"doc_count" : 1
}
]
}
}
}
}
# A regular (non-nested) aggregation does not work on nested fields
POST /my_movies/_search
{
"size": 0,
"aggs": {
"NAME": {
"terms": {
"field": "actors.first_name",
"size": 10
}
}
}
}
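Limitation of objects and nested objects: every update may require re-indexing the whole document (the root object together with its nested objects)
ES provides something similar to a join in a relational database. With the join data type you maintain a parent/child relationship, which keeps the two objects separate
Parent documents and child documents are independent documents
Updating a parent does not require re-indexing its children. Adding, updating or deleting a child affects neither the parent nor the other children
Define the parent/child mapping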
DELETE /my_blogs
# Define the parent/child mapping
PUT /my_blogs
{
"settings": {
"number_of_shards": 2
},
"mappings": {
"properties": {
"blog_comments_relation": {
"type": "join",
"relations": {
"blog": "comment"
}
},
"content": {
"type": "text"
},
"title": {
"type": "keyword"
}
}
}
}
# Index a parent document
PUT /my_blogs/_doc/blog1
{
"title":"Learning Elasticsearch",
"content":"learning ELK ",
"blog_comments_relation":{
"name":"blog"
}
}
# Index a parent document
PUT /my_blogs/_doc/blog2
{
"title":"Learning Hadoop",
"content":"learning Hadoop",
"blog_comments_relation":{
"name":"blog"
}
}
Index child documents
# Index a child document
PUT /my_blogs/_doc/comment1?routing=blog1
{
"comment":"I am learning ELK",
"username":"DaDa",
"blog_comments_relation":{
"name":"comment",
"parent":"blog1"
}
}
# Index a child document
PUT /my_blogs/_doc/comment2?routing=blog2
{
"comment":"I like Hadoop!!!!!",
"username":"MiaoMiao",
"blog_comments_relation":{
"name":"comment",
"parent":"blog2"
}
}
# Index a child document
PUT /my_blogs/_doc/comment3?routing=blog2
{
"comment":"Hello Hadoop",
"username":"XiaoXiao",
"blog_comments_relation":{
"name":"comment",
"parent":"blog2"
}
}
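Note:
Parent and child documents must live on the same shard, which keeps join queries fast
When indexing a child document, you must specify the ID of its parent document. Use the routing parameter to make sure the child is assigned to the same shard as its parent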
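Why routing by the parent ID keeps a family on one shard can be shown with a small Python sketch. It is a simplification: the shard is modeled as `hash(routing) % number_of_shards`, and `zlib.crc32` is only a stand-in for the hash function Elasticsearch really uses. The function name `shard_for` is hypothetical.

```python
import zlib

NUMBER_OF_SHARDS = 2  # same value as in the my_blogs settings

def shard_for(routing: str, number_of_shards: int) -> int:
    """Pick a shard from the routing value (crc32 is a stand-in hash)."""
    return zlib.crc32(routing.encode("utf-8")) % number_of_shards

# A parent is routed by its own ID.
parents = ["blog1", "blog2"]
# Each child is indexed with ?routing=<parent id>.
children = {"comment1": "blog1", "comment2": "blog2", "comment3": "blog2"}

placement = {doc_id: shard_for(doc_id, NUMBER_OF_SHARDS) for doc_id in parents}
for child_id, parent_id in children.items():
    placement[child_id] = shard_for(parent_id, NUMBER_OF_SHARDS)

for doc_id, shard in placement.items():
    print(doc_id, "-> shard", shard)
```

Because a child's shard is computed from the parent ID rather than from its own ID, every comment ends up on the shard that holds its blog. Without the routing value, the child would be routed by its own ID and could land elsewhere.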
# Search all documents
POST /my_blogs/_search
# Get a parent document by ID
GET /my_blogs/_doc/blog2
# parent_id query: returns the children of the given parent
POST /my_blogs/_search
{
"query": {
"parent_id": {
"type": "comment",
"id": "blog2"
}
}
}
# has_child query: returns parent documents
POST /my_blogs/_search
{
"query": {
"has_child": {
"type": "comment",
"query" : {
"match": {
"username" : "MiaoMiao"
}
}
}
}
}
# has_parent query: returns the related child documents
POST /my_blogs/_search
{
"query": {
"has_parent": {
"parent_type": "blog",
"query" : {
"match": {
"title" : "Learning Hadoop"
}
}
}
}
}
# Get a child document by ID only (without routing it may not be found)
GET /my_blogs/_doc/comment3
# Get a child document by ID and routing
GET /my_blogs/_doc/comment3?routing=blog2
# Update a child document
PUT /my_blogs/_doc/comment3?routing=blog2
{
"comment": "Hello Hadoop??",
"blog_comments_relation": {
"name": "comment",
"parent": "blog2"
}
}
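| | Nested Object | Parent / Child |
|---|---|---|
| Pros | Documents are stored together, so reads are fast | Parent and child documents can be updated independently |
| Cons | Updating a nested child requires updating the whole document | Needs extra memory to maintain the relationship; reads are comparatively slow |
| Use case | Children are updated only occasionally; mostly queries | Children are updated frequently |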
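Use case: repairing and enriching data at write time
Example
Requirement: in the tags field, the comma-separated text should be an array rather than a single string, because tags will later be used in aggregations
Repairing and enriching data at write time
# Blog data with 3 fields; tags are separated by commas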
PUT tech_blogs/_doc/1
{
"title":"Introducing big data......",
"tags":"hadoop,elasticsearch,spark",
"content":"You konw, for big data"
}
Ingest Node is a node type introduced in Elasticsearch 5.0. With the default configuration, every node is an ingest node.
It can preprocess data without Logstash.
Pipeline & Processor
https://www.elastic.co/guide/en/elasticsearch/reference/7.17/ingest-processors.html
POST _ingest/pipeline/_simulate
{
"pipeline": {
"description": "to split blog tags",
"processors": [
{
"split": {
"field": "tags",
"separator": ","
}
}
]
},
"docs": [
{
"_index": "index",
"_id": "id",
"_source": {
"title": "Introducing big data......",
"tags": "hadoop,elasticsearch,spark",
"content": "You konw, for big data"
}
},
{
"_index": "index",
"_id": "idxx",
"_source": {
"title": "Introducing cloud computering",
"tags": "openstack,k8s",
"content": "You konw, for cloud"
}
}
]
}
# Also add a field to each document: the blog's view count
POST _ingest/pipeline/_simulate
{
"pipeline": {
"description": "to split blog tags",
"processors": [
{
"split": {
"field": "tags",
"separator": ","
}
},
{
"set":{
"field": "views",
"value": 0
}
}
]
},
"docs": [
{
"_index":"index",
"_id":"id",
"_source":{
"title":"Introducing big data......",
"tags":"hadoop,elasticsearch,spark",
"content":"You konw, for big data"
}
},
{
"_index":"index",
"_id":"idxx",
"_source":{
"title":"Introducing cloud computering",
"tags":"openstack,k8s",
"content":"You konw, for cloud"
}
}
]
}
PUT _ingest/pipeline/blog_pipeline
{
"description": "a blog pipeline",
"processors": [
{
"split": {
"field": "tags",
"separator": ","
}
},
{
"set":{
"field": "views",
"value": 0
}
}
]
}
# View the pipeline
GET _ingest/pipeline/blog_pipeline
# Index a document without the pipeline
PUT tech_blogs/_doc/1
{
"title":"Introducing big data......",
"tags":"hadoop,elasticsearch,spark",
"content":"You konw, for big data"
}
# Index a document through the pipeline
PUT tech_blogs/_doc/2?pipeline=blog_pipeline
{
"title": "Introducing cloud computering",
"tags": "openstack,k8s",
"content": "You konw, for cloud"
}
# This _update_by_query causes an error: document 2 already holds tags as an array, and split expects a string
POST tech_blogs/_update_by_query?pipeline=blog_pipeline
{
}
# Add a query condition to _update_by_query so that only unprocessed documents are updated
POST tech_blogs/_update_by_query?pipeline=blog_pipeline
{
"query": {
"bool": {
"must_not": {
"exists": {
"field": "views"
}
}
}
}
}
GET tech_blogs/_search
response (of the first _update_by_query above, the one without a query condition)
{
"took" : 5,
"timed_out" : false,
"total" : 2,
"updated" : 1,
"deleted" : 0,
"batches" : 1,
"version_conflicts" : 0,
"noops" : 0,
"retries" : {
"bulk" : 0,
"search" : 0
},
"throttled_millis" : 0,
"requests_per_second" : -1.0,
"throttled_until_millis" : 0,
"failures" : [
{
"index" : "tech_blogs",
"type" : "_doc",
"id" : "2",
"cause" : {
"type" : "illegal_argument_exception",
"reason" : "field [tags] of type [java.util.ArrayList] cannot be cast to [java.lang.String]"
},
"status" : 400
}
]
}
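The failure above and its fix can be reproduced with a small Python model of the pipeline. This is a sketch of the behavior only: `split_processor`, `set_processor` and `blog_pipeline` are hypothetical stand-ins for the split and set processors, and a plain dict plays the role of the tech_blogs index.

```python
def split_processor(doc, field, separator):
    """Model of the split processor: it only accepts a string value."""
    value = doc[field]
    if not isinstance(value, str):
        raise TypeError(
            f"field [{field}] of type [{type(value).__name__}] cannot be cast to [str]"
        )
    return {**doc, field: value.split(separator)}

def set_processor(doc, field, value):
    """Model of the set processor: write a fixed value into a field."""
    return {**doc, field: value}

def blog_pipeline(doc):
    """Model of blog_pipeline: split tags, then set views to 0."""
    return set_processor(split_processor(doc, "tags", ","), "views", 0)

index = {
    # Document 1 was indexed without the pipeline.
    "1": {"title": "Introducing big data......", "tags": "hadoop,elasticsearch,spark"},
    # Document 2 was indexed through the pipeline.
    "2": blog_pipeline({"title": "Introducing cloud computering", "tags": "openstack,k8s"}),
}

# Equivalent of the must_not/exists condition: skip documents that already have views.
pending = {doc_id: doc for doc_id, doc in index.items() if "views" not in doc}
for doc_id, doc in pending.items():
    index[doc_id] = blog_pipeline(doc)

print(index["1"])
print(index["2"])
```

Running the pipeline over every document would call `split_processor` on document 2 again, where tags is already a list, which mirrors the cast error in the response. Filtering on the missing views field leaves only the documents that have not been processed yet.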
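| | Logstash | Ingest Node |
|---|---|---|
| Data input and output | Reads from many different data sources and writes to many different data sources | Receives data through the ES REST API and writes to Elasticsearch |
| Data buffering | Implements a simple data queue and supports rewriting | No buffering |
| Data processing | Supports a large number of plugins as well as custom development | Built-in processors; can be extended by developing a plugin (updating a plugin requires a restart) |
| Configuration and use | Adds some architectural complexity | Supported by default, no extra deployment needed |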
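Painless was introduced in Elasticsearch 5.x. It is designed specifically for Elasticsearch and extends Java's syntax. Since 6.0, ES supports only Painless; Groovy, JavaScript and Python are no longer supported. Painless supports all Java data types and a subset of the Java API.
Features of Painless scripts:
Uses of Painless:
Accessing fields from a Painless script
Context | Syntax |
---|---|
Ingestion | ctx.field_name |
Update | ctx._source.field_name |
Search & Aggregation | doc["field_name"] |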
POST _ingest/pipeline/_simulate
{
"pipeline": {
"description": "to split blog tags",
"processors": [
{
"split": {
"field": "tags",
"separator": ","
}
},
{
"script": {
"source": """
if(ctx.containsKey("content")){
ctx.content_length = ctx.content.length();
}else{
ctx.content_length=0;
}
"""
}
},
{
"set":{
"field": "views",
"value": 0
}
}
]
},
"docs": [
{
"_index":"index",
"_id":"id",
"_source":{
"title":"Introducing big data......",
"tags":"hadoop,elasticsearch,spark",
"content":"You konw, for big data"
}
},
{
"_index":"index",
"_id":"idxx",
"_source":{
"title":"Introducing cloud computering",
"tags":"openstack,k8s",
"content":"You konw, for cloud"
}
}
]
}
response
{
"docs" : [
{
"doc" : {
"_index" : "index",
"_type" : "_doc",
"_id" : "id",
"_source" : {
"title" : "Introducing big data......",
"content" : "You konw, for big data",
"content_length" : 22,
"views" : 0,
"tags" : [
"hadoop",
"elasticsearch",
"spark"
]
},
"_ingest" : {
"timestamp" : "2022-07-28T02:35:41.221266994Z"
}
}
},
{
"doc" : {
"_index" : "index",
"_type" : "_doc",
"_id" : "idxx",
"_source" : {
"title" : "Introducing cloud computering",
"content" : "You konw, for cloud",
"content_length" : 19,
"views" : 0,
"tags" : [
"openstack",
"k8s"
]
},
"_ingest" : {
"timestamp" : "2022-07-28T02:35:41.221275922Z"
}
}
}
]
}
DELETE tech_blogs
PUT tech_blogs/_doc/1
{
"title":"Introducing big data......",
"tags":"hadoop,elasticsearch,spark",
"content":"You konw, for big data",
"views":0
}
POST tech_blogs/_update/1
{
"script": {
"source": "ctx._source.views += params.new_views",
"params": {
"new_views":100
}
}
}
# Check the views count
POST tech_blogs/_search
# Store the script in the cluster state
POST _scripts/update_views
{
"script":{
"lang": "painless",
"source": "ctx._source.views += params.new_views"
}
}
POST tech_blogs/_update/1
{
"script": {
"id": "update_views",
"params": {
"new_views":1000
}
}
}
GET tech_blogs/_search
{
"script_fields": {
"rnd_views": {
"script": {
"lang": "painless",
"source": """
java.util.Random rnd = new Random();
doc['views'].value+rnd.nextInt(1000);
"""
}
}
},
"query": {
"match_all": {}
}
}
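Compiling a script is expensive, so Elasticsearch caches compiled scripts
Both inline scripts and stored scripts are cached
By default, 100 scripts are cached
Parameter | Description |
---|---|
script.cache.max_size | Maximum number of cached scripts |
script.cache.expire | Cache expiry time |
script.max_compilations_rate | By default, at most 75 compilations per 5 minutes (75/5m) |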
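The caching behavior suggests why the update examples pass `new_views` through params instead of writing the number into the script source. The Python sketch below models a compiled-script cache keyed by source text. It is an illustration under assumptions: `ScriptCache` is a hypothetical class, Python expressions stand in for Painless, and the eviction policy is a plain LRU.

```python
from collections import OrderedDict

class ScriptCache:
    """Toy cache of compiled scripts, keyed by the script source text."""

    def __init__(self, max_size=100):
        self.max_size = max_size
        self.compilations = 0
        self._cache = OrderedDict()

    def get(self, source):
        if source in self._cache:
            self._cache.move_to_end(source)  # mark as recently used
            return self._cache[source]
        self.compilations += 1  # cache miss: compile the script
        compiled = compile(source, "<script>", "eval")
        self._cache[source] = compiled
        if len(self._cache) > self.max_size:
            self._cache.popitem(last=False)  # evict the least recently used entry
        return compiled

    def run(self, source, **variables):
        return eval(self.get(source), {}, variables)

# Parameterized script: one source text, different params on each call.
with_params = ScriptCache()
for new_views in (100, 1000, 5):
    with_params.run("views + params['new_views']", views=0, params={"new_views": new_views})

# Hard-coded values: every call has a different source text.
hard_coded = ScriptCache()
for new_views in (100, 1000, 5):
    hard_coded.run(f"views + {new_views}", views=0)

print(with_params.compilations, hard_coded.compilations)
```

In this model the parameterized script is compiled once and reused, while each hard-coded variant is a new cache entry and a new compilation, which is the pattern that runs into a compilation rate limit.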
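Object: consider denormalization first
Nested: when the data contains arrays of objects that also need to be queried
Child/Parent: when the related documents are updated very frequently
index.mapping.total_fields.limit: caps the maximum number of fields in an index (1000 by default)
In production, try not to leave dynamic mapping switched on; the strict setting can be used to control how new fields are added
true: unknown fields are added to the mapping automatically
false: new fields are not indexed, but they are still kept in _source
strict: new fields are not indexed, and the document write fails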
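The three dynamic settings can be compared with a small Python model of a single write. It is a sketch of the documented behavior, not Elasticsearch code; the function `index_document` and its return shape are made up for this example, and field types are reduced to Python type names.

```python
def index_document(mapping, doc, dynamic):
    """Model one write; return (updated_mapping, indexed_fields, stored_source)."""
    new_fields = [field for field in doc if field not in mapping]
    if new_fields and dynamic == "strict":
        # strict: reject the whole document
        raise ValueError(f"strict mapping rejects unknown fields: {new_fields}")
    updated = dict(mapping)
    if dynamic == "true":
        # true: unknown fields are added to the mapping
        for field in new_fields:
            updated[field] = type(doc[field]).__name__
    # Only mapped fields are searchable; _source always keeps the full document.
    indexed = {field: value for field, value in doc.items() if field in updated}
    return updated, indexed, dict(doc)

mapping = {"title": "str"}
doc = {"title": "Speed", "views": 1}

mapping_true, indexed_true, source_true = index_document(mapping, doc, "true")
mapping_false, indexed_false, source_false = index_document(mapping, doc, "false")

try:
    index_document(mapping, doc, "strict")
    strict_rejected = False
except ValueError:
    strict_rejected = True

print(mapping_true, indexed_true)
print(mapping_false, indexed_false, source_false)
print(strict_rejected)
```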
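For fields with many attributes, such as cookies or product attributes, consider using Nested
Regular expression, wildcard and prefix queries are term-level queries, but they do not perform well. Putting the wildcard at the beginning is especially disastrous for performance
# Turn the version string into an object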
PUT softwares/
{
"mappings": {
"properties": {
"version": {
"properties": {
"display_name": {
"type": "keyword"
},
"hot_fix": {
"type": "byte"
},
"marjor": {
"type": "byte"
},
"minor": {
"type": "byte"
}
}
}
}
}
}
# Index several documents using the inner object
PUT softwares/_doc/1
{
"version":{
"display_name":"7.1.0",
"marjor":7,
"minor":1,
"hot_fix":0
}
}
PUT softwares/_doc/2
{
"version":{
"display_name":"7.2.0",
"marjor":7,
"minor":2,
"hot_fix":0
}
}
PUT softwares/_doc/3
{
"version":{
"display_name":"7.2.1",
"marjor":7,
"minor":2,
"hot_fix":1
}
}
# Query with a bool filter
POST softwares/_search
{
"query": {
"bool": {
"filter": [
{
"match":{
"version.marjor":7
}
},
{
"match":{
"version.minor":2
}
}
]
}
}
}
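On the client side, the conversion from a version string to this object could look like the Python sketch below. It assumes versions always have the three-part form used in the examples; `parse_version` is a hypothetical helper, and the field name marjor is kept exactly as it is spelled in the mapping above.

```python
def parse_version(display_name):
    """Split 'major.minor.hot_fix' into the object stored under `version`."""
    marjor, minor, hot_fix = (int(part) for part in display_name.split("."))
    return {
        "version": {
            "display_name": display_name,
            "marjor": marjor,  # spelled as in the softwares mapping
            "minor": minor,
            "hot_fix": hot_fix,
        }
    }

docs = [parse_version(v) for v in ("7.1.0", "7.2.0", "7.2.1")]

# Same condition as the bool filter: marjor == 7 and minor == 2.
matches = [
    d["version"]["display_name"]
    for d in docs
    if d["version"]["marjor"] == 7 and d["version"]["minor"] == 2
]
print(matches)
```

With the parts stored as numbers, the question "all 7.2.x releases" becomes two exact-match filters rather than a wildcard such as 7.2.* on a string field.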
# Avoid null values (here via null_value) to fix the aggregation problem
DELETE /scores
PUT /scores
{
"mappings": {
"properties": {
"score": {
"type": "float",
"null_value": 0
}
}
}
}
PUT /scores/_doc/1
{
"score": 100
}
PUT /scores/_doc/2
{
"score": null
}
POST /scores/_search
{
"size": 0,
"aggs": {
"avg": {
"avg": {
"field": "score"
}
}
}
}
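The effect of null_value on the average can be checked with a few lines of Python. This is a model of the behavior only: the hypothetical `avg` function skips missing values unless a substitute is given, which is how the avg aggregation treats a null score without and with null_value.

```python
def avg(scores, null_value=None):
    """Average that skips None, unless null_value replaces it first."""
    values = []
    for score in scores:
        if score is None:
            if null_value is None:
                continue  # no null_value: the document has no value to aggregate
            score = null_value  # null_value: null is indexed as this value
        values.append(score)
    return sum(values) / len(values) if values else None

scores = [100, None]  # documents 1 and 2 of the scores index

without_null_value = avg(scores)
with_null_value = avg(scores, null_value=0)
print(without_null_value, with_null_value)
```

Under this model the average is 100.0 when the null score is ignored and 50.0 when it counts as 0, so the choice of null_value directly changes what the aggregation reports.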
PUT /my_index
{
"mappings": {
"_meta": {
"index_version_mapping": "1.1"
}
}
}