ElasticSearch (ES) search: getting-started notes



You know, for search (and analysis)

Elasticsearch is the distributed search and analytics engine at the heart of the Elastic Stack. Logstash and Beats facilitate collecting, aggregating, and enriching your data and storing it in Elasticsearch. Kibana enables you to interactively explore, visualize, and share insights into your data and manage and monitor the stack. Elasticsearch is where the indexing, search, and analysis magic happens.




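  1. Download the version you need from the official site: https://www.elastic.co/cn/downloads/elasticsearch

  2. Extract the downloaded archive into a directory of your choice:

    tar -xzf /Users/cc/Downloads/elasticsearch-8.11.3-darwin-aarch64.tar.gz -C /Applications
  3. Change into the installation directory and run ./bin/elasticsearch to start ES. (Note: recent ES versions start with security enabled by default. On Windows the start command is .\bin\elasticsearch.bat.)

  4. Verify that the node is up: curl -k -u elastic:password https://localhost:9200, replacing password with the password of the elastic user. (Note: older versions that started without security need no username or password: curl 'http://localhost:9200/?pretty'.)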
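
The check in step 4 can also be done from Python with only the standard library. This is a minimal sketch, not an official client: build_check_request and check_cluster are hypothetical helper names, the credentials are placeholders, and turning off certificate verification mirrors curl -k, so it is only suitable for a local test node.

```python
import base64
import json
import ssl
import urllib.request


def build_check_request(user, password, url="https://localhost:9200"):
    """Build a GET request that carries HTTP Basic credentials (like curl -u)."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})


def check_cluster(user, password, url="https://localhost:9200"):
    """Send the request without certificate verification (like curl -k)."""
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    request = build_check_request(user, password, url)
    with urllib.request.urlopen(request, context=context) as response:
        return json.loads(response.read())


# Usage (needs a running local node and the real password of the elastic user):
# info = check_cluster("elastic", "password")
# print(info["version"]["number"])
```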


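  1. Download the version you need from the official site: https://www.elastic.co/downloads/kibana

  2. Extract the downloaded archive into a directory of your choice:

    tar -xzf /Users/cc/Downloads/kibana-8.11.3-darwin-aarch64.tar.gz -C /Applications
  3. Change into the installation directory and run ./bin/kibana to start Kibana. (On Windows the start command is .\bin\kibana.bat.)

  4. After Kibana starts, it has to be connected to ES. Open the link printed in the terminal (for example http://localhost:5601/?code=971215) and paste the enrollment token that ES generated on startup. (Note: an enrollment token is valid for 30 minutes. Once it has expired, generate a new one with bin/elasticsearch-create-enrollment-token -s kibana --url https://localhost:9200. The --url option had to be given explicitly; without it the command failed with ERROR: Failed to determine the health of the cluster. , with exit code 69.)

  5. Once ES is configured, log in with the ES username and password to use Kibana. From the left-hand menu on the Kibana home page, Management > Dev Tools opens Console, a graphical debugging interface. (This is the Console used in the examples of the official ES documentation; it is more convenient than curl for development and debugging.)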




 PUT /my-index-000001
 {
   "mappings": {
     "properties": {
       "age":    { "type": "integer" },
       "email":  { "type": "keyword" },
       "name":   { "type": "text" }
     }
   }
 }

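  • text is the field type that is analyzed (split into terms) by default. If no analyzer is specified, ES splits the text with the standard analyzer.

  • keyword is suited to raw text that should not be analyzed, such as email addresses, IDs, tags, and host names.

  • Numeric types are long, integer, short, byte, double, float, half_float, scaled_float, and unsigned_long. For integer types (byte, short, integer, long), choose the smallest type whose range covers the use case. For floating-point values, scaled_float is often the more efficient choice: it has a scaling_factor parameter and stores each value as an integer after multiplying by that factor. When scaled_float does not fit, choose the lowest-precision type that satisfies the use case.

  • date accepts formatted date strings such as "2024-01-01" or "2024/01/01 12:10:30", millisecond timestamps, and so on. By default, dates in an index are in UTC, which is 8 hours behind Beijing time, so pay attention to time zones whenever you use the date type.

  • boolean stores true and false. The strings "true" and "false" are also accepted, as is "" (the empty string, treated as false).

  • binary stores a Base64-encoded string. By default it is not indexed and cannot be searched.

  • geo_point stores latitude/longitude information. It can be used to find documents inside a given geographic area, aggregate documents by distance, sort by distance, or adjust scoring by location.

  • object: a field can itself be a JSON object, which may in turn contain further objects.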
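
To make the scaled_float bullet above concrete, here is a small sketch of the scaling idea: the value is multiplied by scaling_factor and rounded to an integer for storage. The helper names to_scaled_long and from_scaled_long are hypothetical, and Python's round() only approximates how ES rounds, so treat this as an illustration rather than the engine's exact behavior.

```python
def to_scaled_long(value, scaling_factor):
    """Integer that a scaled_float-like field would store: round(value * scaling_factor)."""
    return int(round(value * scaling_factor))


def from_scaled_long(stored, scaling_factor):
    """Recover the (possibly rounded) floating-point value from the stored integer."""
    return stored / scaling_factor


price = 12.3456
stored = to_scaled_long(price, 100)      # two decimal places are kept
restored = from_scaled_long(stored, 100)
print(stored, restored)
```

With a scaling_factor of 100, anything beyond two decimal places is lost, which is the trade-off for storing a compact integer.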


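     # Mapping in which manager is an object that contains another object (name)
     PUT my-index-000001
     {
       "mappings": {
         "properties": {
           "region": { "type": "keyword" },
           "manager": {
             "properties": {
               "age":  { "type": "integer" },
               "name": {
                 "properties": {
                   "first": { "type": "text" },
                   "last":  { "type": "text" }
                 }
               }
             }
           }
         }
       }
     }

     # Index a document that contains nested JSON objects
     PUT my-index-000001/_doc/1
     {
       "region": "US",
       "manager": {
         "age":     30,
         "name": {
           "first": "John",
           "last":  "Smith"
         }
       }
     }

     # Internally the document is indexed as flat key-value pairs
     {
       "region":             "US",
       "manager.age":        30,
       "manager.name.first": "John",
       "manager.name.last":  "Smith"
     }
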
     # An array of objects under the default object type
     PUT my-index-000001
     {
       "mappings": {
         "properties": {
           "group": { "type": "keyword" },
           "user": {
             "properties": {
               "first": { "type": "text" },
               "last":  { "type": "text" }
             }
           }
         }
       }
     }

    PUT my-index-000001/_doc/1
    {
      "group" : "fans",
      "user" : [
        { "first" : "John",  "last" : "Smith" },
        { "first" : "Alice", "last" : "White" }
      ]
    }

    Because ES flattens object fields when it stores them, the document is stored in the following form:

    {
      "group" :        "fans",
      "user.first" : [ "alice", "john" ],
      "user.last" :  [ "smith", "white" ]
    }

    # The link between "alice" and "white" is lost, so this query for Alice Smith still matches the document
    GET my-index-000001/_search
    {
      "query": {
        "bool": {
          "must": [
            { "match": { "user.first": "Alice" }},
            { "match": { "user.last":  "Smith" }}
          ]
        }
      }
    }

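     # With type nested, each element of the user array is indexed as its own hidden document
     PUT my-index-000001
     {
       "mappings": {
         "properties": {
           "group": { "type": "keyword" },
           "user": {
             "type": "nested",
             "properties": {
               "first": { "type": "text" },
               "last":  { "type": "text" }
             }
           }
         }
       }
     }

    PUT my-index-000001/_doc/1
    {
      "group" : "fans",
      "user" : [
        { "first" : "John",  "last" : "Smith" },
        { "first" : "Alice", "last" : "White" }
      ]
    }

    # Alice and Smith are not in the same nested object, so this query does not match
    GET my-index-000001/_search
    {
      "query": {
        "nested": {
          "path": "user",
          "query": {
            "bool": {
              "must": [
                { "match": { "user.first": "Alice" }},
                { "match": { "user.last":  "Smith" }}
              ]
            }
          }
        }
      }
    }
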
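
The difference between the two examples can be reproduced with a toy model in plain Python. This is only a sketch of the idea: flatten, object_match, and nested_match are hypothetical helpers, and text analysis (lowercasing, tokenizing) is left out on purpose.

```python
def flatten(doc, prefix=""):
    """Flatten objects and arrays of objects into {dotted.field: [values]}."""
    flat = {}
    for key, value in doc.items():
        path = f"{prefix}{key}"
        items = value if isinstance(value, list) else [value]
        for item in items:
            if isinstance(item, dict):
                for sub_path, sub_values in flatten(item, path + ".").items():
                    flat.setdefault(sub_path, []).extend(sub_values)
            else:
                flat.setdefault(path, []).append(item)
    return flat


def object_match(doc, conditions):
    """object semantics: every condition is checked against the flattened document."""
    flat = flatten(doc)
    return all(value in flat.get(field, []) for field, value in conditions.items())


def nested_match(doc, path, conditions):
    """nested semantics: all conditions must hold within one element of the array."""
    return any(
        all(element.get(field) == value for field, value in conditions.items())
        for element in doc.get(path, [])
    )


doc = {
    "group": "fans",
    "user": [
        {"first": "John", "last": "Smith"},
        {"first": "Alice", "last": "White"},
    ],
}

print(flatten(doc))
print(object_match(doc, {"user.first": "Alice", "user.last": "Smith"}))  # cross-object match
print(nested_match(doc, "user", {"first": "Alice", "last": "Smith"}))    # no single element fits
```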
Mapping parameters


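  • dynamic: controls whether fields can be added dynamically, that is, whether an indexed document may contain fields that are not defined in the mapping. The default is true, which allows new fields. dynamic can be set for the whole index; fields inherit the index-level value and can override it with a different one. (Companies commonly require dynamic to be set to strict on their ES clusters.) The accepted values are:
      • true: New fields are added to the mapping (default).
      • runtime: New fields are added to the mapping as runtime fields. These fields are not indexed, and are loaded from _source at query time.
      • false: New fields are ignored. These fields will not be indexed or searchable, but will still appear in the _source field of returned hits. These fields will not be added to the mapping, and new fields must be added explicitly.
      • strict: If new fields are detected, an exception is thrown and the document is rejected. New fields must be explicitly added to the mapping.
  • index: controls whether field values are indexed. Accepts true or false; the default is true.
  • store: by default ES does not store a field's original value separately (it is only available as part of _source). Setting store to true stores the value, which can then be fetched at query time with stored_fields.
  • enabled: when set to false, the field is kept in _source but is not indexed. The default is true.
  • copy_to: copies the values of several fields into one combined field, which can then be searched like a single field.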

In the example below, the values of first_name and last_name are copied into full_name, which can then be queried:

PUT my-index-000001
{
  "mappings": {
    "properties": {
      "first_name": {
        "type": "text",
        "copy_to": "full_name"
      },
      "last_name": {
        "type": "text",
        "copy_to": "full_name"
      },
      "full_name": {
        "type": "text"
      }
    }
  }
}

PUT my-index-000001/_doc/1
{
  "first_name": "John",
  "last_name": "Smith"
}

GET my-index-000001/_search
{
  "query": {
    "match": {
      "full_name": {
        "query": "John Smith",
        "operator": "and"
      }
    }
  }
}

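  • fields: sometimes the same field needs to be indexed in different ways for different purposes, which is what multi-fields are for. A string field can be mapped as text for full-text search and, at the same time, as keyword for exact matching or aggregations. It is even possible to define several text sub-fields that differ only in their analyzer.


PUT my-index-000001
{
  "mappings": {
    "properties": {
      "city": {
        "type": "text",
        "fields": {
          "raw": { "type": "keyword" }
        }
      }
    }
  }
}
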
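
A search request can then use both views of the field: city for full-text matching and city.raw for sorting and aggregations. The sketch below only builds the request body as a Python dict; build_city_search and the aggregation name cities are hypothetical, and the body would still have to be sent to ES, for example through Console or a client.

```python
import json


def build_city_search(city_text):
    """Full-text match on the text field, sort and aggregate on the keyword sub-field."""
    return {
        "query": {"match": {"city": city_text}},
        "sort": {"city.raw": "asc"},
        "aggs": {"cities": {"terms": {"field": "city.raw"}}},
    }


body = build_city_search("york")
print(json.dumps(body, indent=2))
```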
ES ships eight built-in analyzers. If a text field does not specify an analyzer, the standard analyzer is used by default.

Standard Analyzer

The standard analyzer divides text into terms on word boundaries, as defined by the Unicode Text Segmentation algorithm. It removes most punctuation, lowercases terms, and supports removing stop words.

Simple Analyzer

The simple analyzer divides text into terms whenever it encounters a character which is not a letter. It lowercases all terms.

Whitespace Analyzer

The whitespace analyzer divides text into terms whenever it encounters any whitespace character. It does not lowercase terms.

Stop Analyzer

The stop analyzer is like the simple analyzer, but also supports removal of stop words.

Keyword Analyzer

The keyword analyzer is a “noop” analyzer that accepts whatever text it is given and outputs the exact same text as a single term.

Pattern Analyzer

The pattern analyzer uses a regular expression to split the text into terms. It supports lower-casing and stop words.

Language Analyzers

Elasticsearch provides many language-specific analyzers like english or french.

Fingerprint Analyzer

The fingerprint analyzer is a specialist analyzer which creates a fingerprint which can be used for duplicate detection.

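The built-in analyzers can be used directly without any configuration. Some of them can also be configured to change their behavior; the standard analyzer, for example, can be configured to remove stop words.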

# Create an index whose std_english analyzer is the standard analyzer with English stop words enabled
PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "std_english": {
          "type":      "standard",
          "stopwords": "_english_"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "my_text": {
        "type":     "text",
        "analyzer": "standard",
        "fields": {
          "english": {
            "type":     "text",
            "analyzer": "std_english"
          }
        }
      }
    }
  }
}

# Test the plain standard analyzer; tokens: [ the, old, brown, cow ]
POST my-index-000001/_analyze
{
  "field": "my_text",
  "text": "The old brown cow"
}

# Test the standard analyzer configured with stop words; tokens: [ old, brown, cow ]
POST my-index-000001/_analyze
{
  "field": "my_text.english",
  "text": "The old brown cow"
}

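  • At index time: when an index mapping contains a text field, the standard analyzer is used by default to analyze the text. To use a different analyzer, specify it for that text field in the mapping.
  • At search time: a full-text query against a text field also triggers text analysis, this time on the search text. By default the same analyzer as at index time is used (the standard analyzer unless configured otherwise); it can be changed with the analyzer parameter of the query or the search_analyzer mapping parameter. To keep search behavior consistent, the index-time and search-time analyzers are normally the same. For Chinese, however, a finer-grained tokenizer is commonly used at index time and a coarser-grained one at search time.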


PUT my-index-000001
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "whitespace",
        "search_analyzer": "simple"
      }
    }
  }
}

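In Elasticsearch, a complete analysis chain consists of zero or more character filters, exactly one tokenizer, and zero or more token filters. Text passes through the character filters first, then the tokenizer, and finally the token filters.

  • character filters: apply simple character-level filtering and conversion to the raw text. The built-in HTML strip character filter, for example, removes HTML tags from the text. ES defines three built-in character filters: html_strip, mapping, and pattern_replace.
  • tokenizers: split the raw text into individual terms according to some rule. The built-in whitespace tokenizer, for example, splits on whitespace and turns "Quick brown fox!" into [Quick, brown, fox!]. The tokenizer also records the position of each term in the original text. ES has many built-in tokenizers, and different languages usually need different ones. Third-party tokenizers can be installed as well; the ik tokenizer is a common choice for Chinese.
  • token filters: further filter and convert the terms produced by the tokenizer. The stop token filter, for example, discards terms that carry little meaning, such as the article "a" or the preposition "of", so that they do not affect search results. ES also has dozens of built-in token filters that can be used when defining a custom analyzer.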


  • type: Analyzer type. Accepts built-in analyzer types. For custom analyzers, use custom or omit this parameter.
  • tokenizer: A built-in or customised tokenizer. (Required)
  • char_filter: An optional array of built-in or customised character filters.
  • filter: An optional array of built-in or customised token filters.
  • position_increment_gap: When indexing an array of text values, Elasticsearch inserts a fake “gap” between the last term of one value and the first term of the next value to ensure that a phrase query doesn’t match two terms from different array elements. Defaults to 100. See position_increment_gap for more.


PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "char_filter": [ "emoticons" ],
          "tokenizer": "punctuation",
          "filter": [ "lowercase", "english_stop" ]
        }
      },
      "tokenizer": {
        "punctuation": {
          "type": "pattern",
          "pattern": "[ .,!?]"
        }
      },
      "char_filter": {
        "emoticons": {
          "type": "mapping",
          "mappings": [
            ":) => _happy_",
            ":( => _sad_"
          ]
        }
      },
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        }
      }
    }
  }
}

# Test the analyzer
POST my-index-000001/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "I'm a :) person, and you?"
}

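
The three stages of this custom analyzer can be imitated in a few lines of plain Python. This is a toy model, not how ES implements analysis: it assumes the chain is the emoticons character filter, the punctuation pattern tokenizer, then lowercasing and English stop-word removal, and STOP_WORDS is only an approximation of the _english_ list.

```python
import re

# Character filter: the two emoticon mappings from the example above.
CHAR_MAPPINGS = {":)": "_happy_", ":(": "_sad_"}

# Approximation of the _english_ stop-word list (assumption, not read from ES).
STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in",
    "into", "is", "it", "no", "not", "of", "on", "or", "such", "that", "the",
    "their", "then", "there", "these", "they", "this", "to", "was", "will", "with",
}


def char_filter(text):
    """Stage 1: replace emoticons before tokenizing."""
    for source, target in CHAR_MAPPINGS.items():
        text = text.replace(source, target)
    return text


def tokenize(text):
    """Stage 2: split on the pattern [ .,!?] and drop empty pieces."""
    return [piece for piece in re.split(r"[ .,!?]", text) if piece]


def token_filters(tokens):
    """Stage 3: lowercase every token, then remove stop words."""
    lowered = [token.lower() for token in tokens]
    return [token for token in lowered if token not in STOP_WORDS]


def analyze(text):
    return token_filters(tokenize(char_filter(text)))


print(analyze("I'm a :) person, and you?"))  # ["i'm", '_happy_', 'person', 'you']
```

The order matters: if the tokenizer ran first, ":)" would be split off as a separate piece before the character filter could map it.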
The analyze API provided by ES can be used to test what an analyzer does:

POST _analyze
{
  "analyzer": "whitespace",
  "text":     "I'm studying ElasticSearch"
}

The analyze API can also test a combination of tokenizer, token filters, and character filters:

POST _analyze
{
  "tokenizer": "standard",
  "filter":  [ "lowercase", "asciifolding" ],
  "text":      "I'm studying ElasticSearch"
}

A custom analyzer defined when an index is created can be tested with the analyze API on that index. The example below defines the custom analyzer std_folded at index creation and uses it for the field my_text; the analyze API is then called with the index name:

# Create the index and define the custom analyzer std_folded
PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "std_folded": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "lowercase", "asciifolding" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "my_text": {
        "type": "text",
        "analyzer": "std_folded"
      }
    }
  }
}

# Test the custom analyzer std_folded by name on index my-index-000001
GET my-index-000001/_analyze
{
  "analyzer": "std_folded",
  "text":     "Is this déjà vu?"
}

# Test through the field my_text, which uses std_folded
GET my-index-000001/_analyze
{
  "field": "my_text",
  "text":  "Is this déjà vu?"
}

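Installation: from the IK plugin's GitHub page, download the release archive that matches your ES version, extract it, and place the extracted files in the plugins/ik folder of the ES installation directory. After ES is restarted, the analyzers ik_max_word and ik_smart provided by the plugin are available.

  • ik_max_word: fine-grained segmentation (generally used at index time)

  • ik_smart: coarse-grained segmentation (generally used at search time)

(The analyze API can be used to compare ik_max_word and ik_smart.)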


A normalizer is similar to an analyzer, but it always emits a single token. It therefore has no tokenizer and accepts only a subset of the char filters and token filters.

Only filters that work on a per-character basis can be used in a normalizer: a lowercasing filter is allowed, for example, but a stemming filter is not. The filters supported by normalizers are: arabic_normalization, asciifolding, bengali_normalization, cjk_width, decimal_digit, elision, german_normalization, hindi_normalization, indic_normalization, lowercase, pattern_replace, persian_normalization, scandinavian_folding, serbian_normalization, sorani_normalization, trim, uppercase.



PUT index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "quote": {
          "type": "mapping",
          "mappings": [
            "« => \"",
            "» => \""
          ]
        }
      },
      "normalizer": {
        "my_normalizer": {
          "type": "custom",
          "char_filter": ["quote"],
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "foo": {
        "type": "keyword",
        "normalizer": "my_normalizer"
      }
    }
  }
}

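
What my_normalizer does to a keyword value can be approximated in plain Python. This is a rough sketch: normalize_keyword is a hypothetical helper, and stripping combining marks after Unicode decomposition covers only part of what the real asciifolding filter handles.

```python
import unicodedata

# Character filter "quote": map guillemets to a plain double quote.
QUOTE_MAPPINGS = {"«": '"', "»": '"'}


def normalize_keyword(value):
    """Apply the quote mapping, lowercase, then fold accented letters to ASCII."""
    for source, target in QUOTE_MAPPINGS.items():
        value = value.replace(source, target)
    value = value.lower()
    decomposed = unicodedata.normalize("NFKD", value)
    # Dropping combining marks turns "é" into "e" (approximation of asciifolding).
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(normalize_keyword("«Déjà Vu»"))  # "deja vu" wrapped in plain double quotes
```

Because the whole value stays one token, the values "BÀR" and "bar" end up as the same indexed term, which is what makes case-insensitive exact matching on a keyword field possible.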
  • Once the mapping of an index has been created, new fields can still be added to it, but existing fields cannot be deleted or changed.

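ES search

Elasticsearch provides a query DSL (Domain Specific Language) in which every search request is defined as JSON. (The query DSL is large; these notes only record the queries I have used. For a specific scenario, check whether another query type fits better.)

match_all query


GET my-index-000001/_search
{
  "query": {
    "match_all": {}
  }
}
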
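  • The term query returns documents that contain the exact search term. It is typically used on keyword fields and is similar to an "=" comparison in SQL.
POST my-index-000001/_search
{
  "query": {
    "term": {
      "name.keyword": { "value": "张三" }
    }
  }
}

# terms query: matches documents whose field equals any of the listed values
POST my-index-000001/_search
{
  "query": {
    "terms": {
      "name.keyword": ["张三", "李四"]
    }
  }
}

# ids query: matches documents by their _id
POST my-index-000001/_search
{
  "query": {
    "ids" : {
      "values" : ["1", "4", "100"]
    }
  }
}

# exists query: matches documents that contain an indexed value for the field
GET /_search
{
  "query": {
    "exists": { "field": "user" }
  }
}

# prefix query: matches documents whose field starts with the given prefix
GET /_search
{
  "query": {
    "prefix": {
      "user.id": { "value": "ki" }
    }
  }
}

# regexp query: matches terms against a regular expression
GET /_search
{
  "query": {
    "regexp": {
      "user.id": {
        "value": "k.*y",
        "flags": "ALL",
        "case_insensitive": true,
        "max_determinized_states": 10000,
        "rewrite": "constant_score_blended"
      }
    }
  }
}

# wildcard query: matches terms against a wildcard pattern
GET /_search
{
  "query": {
    "wildcard": {
      "user.id": {
        "value": "ki*y",
        "boost": 1.0,
        "rewrite": "constant_score_blended"
      }
    }
  }
}
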
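  • The match query analyzes the search text and scores each document by how well it matches. A document is returned as soon as any term of the search text matches a term of the field. It is mainly used for full-text search on a text field and is one of the most commonly used queries.
GET /_search
{
  "query": {
    "match": {
      "message": { "query": "this is a test" }
    }
  }
}

The match query accepts several parameters. boost sets the weight of this query relative to other query clauses; it defaults to 1.

GET /_search
{
  "query": {
    "match": {
      "title": { "query": "quick brown fox", "boost": 2 }
    }
  }
}

The operator parameter controls the boolean logic between the terms of the search text: whether all terms must match (AND) or any one term is enough (OR). The default is OR.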
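
The effect of operator can be shown with a toy matcher in Python. It is a deliberately simplified sketch: analyze only lowercases and splits on whitespace (a stand-in for a real analyzer), match is a hypothetical helper, and no relevance score is computed.

```python
def analyze(text):
    """Toy analyzer: lowercase and split on whitespace."""
    return text.lower().split()


def match(field_text, query_text, operator="or"):
    """Return True if the analyzed query terms hit the field under the given operator."""
    field_terms = set(analyze(field_text))
    hits = [term in field_terms for term in analyze(query_text)]
    return all(hits) if operator == "and" else any(hits)


title = "The quick brown fox"
print(match(title, "quick dog"))          # OR: one matching term is enough
print(match(title, "quick dog", "and"))   # AND: "dog" is missing
print(match(title, "Quick FOX", "and"))   # AND: both terms present after lowercasing
```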

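  • match_phrase analyzes the search text and then looks for all of its terms in the index, requiring them to be adjacent. The slop parameter sets the maximum distance allowed between the terms. With the default slop of 0, every term of the analyzed search text must appear in the field, in the same order and consecutively.
GET /_search
{
  "query": {
    "match_phrase": {
      "message": { "query": "this is a test" }
    }
  }
}
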
bool query

The default query for combining multiple leaf or compound query clauses, as must, should, must_not, or filter clauses. The must and should clauses have their scores combined — the more matching clauses, the better — while the must_not and filter clauses are executed in filter context.

boosting query

Return documents which match a positive query, but reduce the score of documents which also match a negative query.

constant_score query

A query which wraps another query, but executes it in filter context. All matching documents are given the same “constant” _score.

dis_max query

A query which accepts multiple queries, and returns any documents which match any of the query clauses. While the bool query combines the scores from all matching queries, the dis_max query uses the score of the single best-matching query clause.

function_score query

Modify the scores returned by the main query with functions to take into account factors like popularity, recency, distance, or custom algorithms implemented with scripting.


  • must: The clause (query) must appear in matching documents and will contribute to the score.
  • filter: The clause (query) must appear in matching documents. However unlike must the score of the query will be ignored. Filter clauses are executed in filter context, meaning that scoring is ignored and clauses are considered for caching.
  • should: The clause (query) should appear in the matching document.
  • must_not: The clause (query) must not appear in the matching documents. Clauses are executed in filter context meaning that scoring is ignored and clauses are considered for caching. Because scoring is ignored, a score of 0 for all documents is returned.

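The minimum_should_match parameter sets the minimum number of should clauses a document must match in order to be returned; the value can be written in several forms. If the bool query contains at least one should clause and no must or filter clause, it defaults to 1; otherwise it defaults to 0.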

POST _search
{
  "query": {
    "bool" : {
      "must" : {
        "term" : { "user.id" : "kimchy" }
      },
      "filter": {
        "term" : { "tags" : "production" }
      },
      "must_not" : {
        "range" : {
          "age" : { "gte" : 10, "lte" : 20 }
        }
      },
      "should" : [
        { "term" : { "tags" : "env1" } },
        { "term" : { "tags" : "deployed" } }
      ],
      "minimum_should_match" : 1,
      "boost" : 1.0
    }
  }
}

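
The default of minimum_should_match follows a simple rule in the Elasticsearch bool query reference: it is 1 when the query has at least one should clause and no must or filter clause, and 0 otherwise. The helper below, default_minimum_should_match, is a hypothetical name used only to spell that rule out on the body of a bool query.

```python
def default_minimum_should_match(bool_query):
    """Default used when a bool query does not set minimum_should_match itself."""
    has_should = bool(bool_query.get("should"))
    has_must_or_filter = bool(bool_query.get("must")) or bool(bool_query.get("filter"))
    return 1 if has_should and not has_must_or_filter else 0


only_should = {"should": [{"term": {"tags": "env1"}}]}
with_must = {
    "must": {"term": {"user.id": "kimchy"}},
    "should": [{"term": {"tags": "env1"}}],
}

print(default_minimum_should_match(only_should))  # 1: at least one should clause has to match
print(default_minimum_should_match(with_must))    # 0: should clauses only influence the score
```

In the request above, must and filter are present, so without the explicit "minimum_should_match": 1 the should clauses would be optional.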
How ES computes search scores

How the ES score is computed

Getting custom scores with script score and function score: script score, function score


To find out why a document does or does not appear in the search results, use the explain API:

GET /my-index-000001/_explain/0
{
  "query" : {
    "match" : { "message" : "elasticsearch" }
  }
}

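Python ES client

ES provides a Python client. Install it with: pip install elasticsearch

import json
from elasticsearch import Elasticsearch

# Placeholder connection settings; replace them with your own values
es_host, es_port = "localhost", 9200
es_username, es_password = "elastic", "password"

query_dsl = {
    "match": {
        "message": {
            "query": "this is a test"
        }
    }
}

# Open the connection (8.x client: the port is part of the URL and basic_auth replaces http_auth;
# verify_certs=False mirrors curl -k and is only meant for a local test node)
elastic_search = Elasticsearch(f"https://{es_host}:{es_port}",
                               basic_auth=(es_username, es_password), verify_certs=False)
query = elastic_search.search(index="my-index-000001", query=query_dsl)
# Search results
res = query["hits"]["hits"]
print(json.dumps(res, indent=2, ensure_ascii=False))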
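
Each hit in the result list wraps the stored document in metadata such as _id and _score. The sketch below pulls out the parts that are usually needed; extract_sources is a hypothetical helper, and fake_response is a hand-written dict shaped like a search response so that the example runs without a cluster.

```python
def extract_sources(response):
    """Return one flat dict per hit: _id, _score, and the fields of _source."""
    hits = response.get("hits", {}).get("hits", [])
    return [
        {"_id": hit.get("_id"), "_score": hit.get("_score"), **hit.get("_source", {})}
        for hit in hits
    ]


# Hand-written stand-in for a search response (not real cluster output).
fake_response = {
    "hits": {
        "total": {"value": 1, "relation": "eq"},
        "hits": [
            {
                "_index": "my-index-000001",
                "_id": "1",
                "_score": 1.2,
                "_source": {"message": "this is a test"},
            }
        ],
    }
}

print(extract_sources(fake_response))
```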

