Elasticsearch Basic Concepts and Architecture

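Learning Objectives

After completing this chapter, you will understand:
- The core concepts and terminology of Elasticsearch
- The design principles behind the Elasticsearch architecture
- How a distributed search engine works
- How Elasticsearch differs from traditional databases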

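1. Introduction to Elasticsearch

1.1 What Is Elasticsearch

Elasticsearch is a distributed, RESTful search and analytics engine built on Apache Lucene. It covers a growing range of use cases:

Full-text search: fast, relevance-ranked text search
Structured search: queries over numbers, dates, geographic locations, and other structured data
Analytics: aggregations that produce complex analyses and statistics
Near real-time: newly indexed data becomes searchable within about one second by default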

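1.2 Core Features

Distributed: designed as a distributed system that scales horizontally
Highly available: automatic failover and data replication
RESTful API: a simple HTTP interface
Schema-free: dynamic mapping means the structure does not have to be defined in advance
Multi-tenancy: one cluster hosts many indices, and a single request can target several of them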

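1.3 Use Cases

graph TD
    A[Elasticsearch Use Cases] --> B[Search Engines]
    A --> C[Log Analytics]
    A --> D[Monitoring]
    A --> E[Business Intelligence]
    A --> F[Security Analytics]
    
    B --> B1[Site Search]
    B --> B2[Enterprise Search]
    B --> B3[E-commerce Search]
    
    C --> C1[ELK Stack]
    C --> C2[Application Logs]
    C --> C3[System Logs]
    
    D --> D1[APM]
    D --> D2[Infrastructure Monitoring]
    D --> D3[Business Metrics Monitoring]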

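2. Core Concepts

2.1 Terminology Compared

Elasticsearch	Relational database	Notes
Index	Database	A collection of documents, loosely comparable to a database
Type	Table	Comparable to a table (deprecated in 7.x, removed in 8.0)
Document	Row	Comparable to a row
Field	Column	Comparable to a column
Mapping	Schema	Comparable to a table schema
Query DSL	SQL	Query language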

2.2 Document

A document is the basic unit of information in Elasticsearch and is expressed as JSON:

{
  "_index": "products",
  "_type": "_doc",
  "_id": "1",
  "_source": {
    "name": "iPhone 14",
    "brand": "Apple",
    "price": 999.99,
    "category": "smartphone",
    "description": "Latest iPhone with advanced features",
    "tags": ["mobile", "apple", "smartphone"],
    "created_at": "2024-01-15T10:30:00Z"
  }
}

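2.3 Index

An index is a collection of documents that share similar characteristics:

# Index naming rules
- Must be lowercase (digits and other characters not listed below are allowed)
- Cannot contain \, /, *, ?, ", <, >, |, space, comma, or #; since 7.0 the colon (:) is rejected as well
- Cannot start with -, _, or +
- Cannot be . or ..
- Cannot be longer than 255 bytes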
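
The naming rules above are easy to check before sending a create request. The following is a minimal sketch rather than Elasticsearch's own validation; the function name `index_name_problems` is hypothetical, and the server remains the final authority on what it accepts.

```python
# Characters the naming rules forbid anywhere in an index name.
# ':' is included because Elasticsearch 7.0 and later reject it as well.
FORBIDDEN_CHARS = set('\\/*?"<>| ,#:')
FORBIDDEN_PREFIXES = ("-", "_", "+")


def index_name_problems(name: str) -> list[str]:
    """Return the naming rules that `name` violates (empty list if none)."""
    problems = []
    if name != name.lower():
        problems.append("must be lowercase")
    if FORBIDDEN_CHARS & set(name):
        problems.append("contains a forbidden character")
    if name.startswith(FORBIDDEN_PREFIXES):
        problems.append("starts with -, _ or +")
    if name in (".", ".."):
        problems.append("cannot be . or ..")
    # The limit is in bytes, so multi-byte characters count more than once.
    if len(name.encode("utf-8")) > 255:
        problems.append("longer than 255 bytes")
    return problems


print(index_name_problems("products-2024"))   # []
print(index_name_problems("_Products 2024"))  # three violations
```

Returning a list of violations instead of a boolean makes it possible to show the user every problem at once.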

2.4 Mapping

A mapping defines how a document and its fields are stored and indexed:

{
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "standard"
      },
      "price": {
        "type": "double"
      },
      "created_at": {
        "type": "date",
        "format": "yyyy-MM-dd'T'HH:mm:ss'Z'"
      },
      "location": {
        "type": "geo_point"
      },
      "tags": {
        "type": "keyword"
      }
    }
  }
}

2.5 Field Types

Core data types

{
  "mappings": {
    "properties": {
      "title": {"type": "text"},           // full-text search
      "status": {"type": "keyword"},       // exact matching
      "age": {"type": "integer"},          // integer
      "price": {"type": "double"},         // floating-point number
      "published": {"type": "boolean"},    // boolean
      "publish_date": {"type": "date"},    // date
      "content": {"type": "binary"}        // binary (Base64-encoded string)
    }
  }
}

Complex data types

{
  "mappings": {
    "properties": {
      "location": {"type": "geo_point"},   // geographic point
      "area": {"type": "geo_shape"},       // geographic shape
      "ip_addr": {"type": "ip"},           // IP address
      "completion": {"type": "completion"}, // autocomplete suggestions
      "tags": {"type": "keyword"},         // array (any field may hold multiple values)
      "user": {                             // object
        "properties": {
          "name": {"type": "text"},
          "email": {"type": "keyword"}
        }
      }
    }
  }
}

3. Elasticsearch Architecture

3.1 Cluster Architecture

graph TB
    subgraph "Elasticsearch Cluster"
        subgraph "Node 1 (Master)"
            N1["Node 1<br/>Master Eligible<br/>Data Node"]
            S1["Shard 1 (Primary)"]
            S2R["Shard 2 (Replica)"]
        end
        
        subgraph "Node 2 (Data)"
            N2["Node 2<br/>Data Node"]
            S2["Shard 2 (Primary)"]
            S3R["Shard 3 (Replica)"]
        end
        
        subgraph "Node 3 (Data)"
            N3["Node 3<br/>Data Node"]
            S3["Shard 3 (Primary)"]
            S1R["Shard 1 (Replica)"]
        end
    end
    
    Client["Client Application"] --> N1
    Client --> N2
    Client --> N3

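3.2 Node Types

Master node

# elasticsearch.yml (7.9 and later)
node.roles: [ master ]
# Legacy form, deprecated in 7.9: node.master: true, node.data: false, node.ingest: false

Responsibilities:
- Managing cluster state
- Creating and deleting indices
- Allocating shards
- Tracking nodes that join or leave the cluster

Data node

# elasticsearch.yml (7.9 and later)
node.roles: [ data ]
# Legacy form, deprecated in 7.9: node.master: false, node.data: true, node.ingest: false

Responsibilities:
- Storing data
- Executing searches and aggregations
- Indexing and deleting documents

Ingest node

# elasticsearch.yml (7.9 and later)
node.roles: [ ingest ]
# Legacy form, deprecated in 7.9: node.master: false, node.data: false, node.ingest: true

Responsibilities:
- Pre-processing data
- Transforming documents
- Enriching data

Coordinating node

# elasticsearch.yml (7.9 and later): an empty role list makes a coordinating-only node
node.roles: [ ]
# Legacy form, deprecated in 7.9: node.master: false, node.data: false, node.ingest: false

Responsibilities:
- Routing requests
- Merging results
- Load balancing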

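3.3 Sharding

Primary shards

{
  "settings": {
    "number_of_shards": 3,      // number of primary shards (fixed once the index is created)
    "number_of_replicas": 1     // replica copies per primary shard (can be changed at any time)
  }
}

Replica shards

Provide high availability: a replica is promoted if its primary is lost
Improve search performance: replicas serve searches alongside primaries
Keep redundant copies of the data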
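
The primary shard count is fixed because a document's shard is derived from it. In simplified form the routing rule is shard = hash(_routing) % number_of_primary_shards, where `_routing` defaults to the document `_id`. The sketch below illustrates the idea under one loud assumption: it uses CRC32 as a stand-in hash, whereas Elasticsearch uses Murmur3, so the shard numbers it prints will not match a real cluster.

```python
import zlib


def pick_shard(routing_key: str, num_primary_shards: int) -> int:
    """Map a routing key (the document _id by default) to a primary shard."""
    # Stand-in hash for illustration only; Elasticsearch uses Murmur3.
    return zlib.crc32(routing_key.encode("utf-8")) % num_primary_shards


doc_ids = [str(i) for i in range(1, 9)]
with_3 = {doc_id: pick_shard(doc_id, 3) for doc_id in doc_ids}
with_4 = {doc_id: pick_shard(doc_id, 4) for doc_id in doc_ids}

# Documents whose shard changes if the index went from 3 to 4 primaries.
moved = [doc_id for doc_id in doc_ids if with_3[doc_id] != with_4[doc_id]]
print(f"{len(moved)} of {len(doc_ids)} documents would map to a different shard")
```

Any document that maps to a different shard after the change could no longer be found by its `_id`, which is why changing the primary count means rebuilding the index (for example by reindexing) rather than editing a setting.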

3.4 Shard Allocation

graph LR
    subgraph "Index: products (3 shards, 1 replica)"
        subgraph "Node A"
            P0["Primary 0"]
            R1["Replica 1"]
        end
        
        subgraph "Node B"
            P1["Primary 1"]
            R2["Replica 2"]
        end
        
        subgraph "Node C"
            P2["Primary 2"]
            R0["Replica 0"]
        end
    end
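
The layout in the diagram can be reproduced with a toy placement rule. This is only a sketch: `place_shards` is a hypothetical helper, and the real allocator also weighs disk usage, shard balance, and allocation awareness. The one rule it shares with Elasticsearch is that a replica is never placed on the same node as its primary.

```python
def place_shards(nodes: list[str], num_primaries: int) -> dict[str, list[str]]:
    """Toy placement: primaries round-robin, each replica on the previous node."""
    if len(nodes) < 2:
        raise ValueError("a replica needs a second node to live on")
    layout = {node: [] for node in nodes}
    for shard in range(num_primaries):
        layout[nodes[shard % len(nodes)]].append(f"P{shard}")
    for shard in range(num_primaries):
        # Shift by one node so a replica never shares a node with its primary.
        layout[nodes[(shard - 1) % len(nodes)]].append(f"R{shard}")
    return layout


for node, shards in place_shards(["Node A", "Node B", "Node C"], 3).items():
    print(node, shards)
# Node A ['P0', 'R1']
# Node B ['P1', 'R2']
# Node C ['P2', 'R0']
```

With this layout, losing any single node still leaves one copy of every shard on the remaining two nodes.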

4. Data Flow

4.1 Indexing Flow

sequenceDiagram
    participant C as Client
    participant CN as Coordinating Node
    participant PN as Primary Node
    participant RN as Replica Node
    
    C->>CN: Index Document
    CN->>CN: Route to Primary Shard
    CN->>PN: Forward Request
    PN->>PN: Index Document
    PN->>RN: Replicate to Replica
    RN->>PN: Acknowledge
    PN->>CN: Success Response
    CN->>C: Return Response

4.2 Search Flow

sequenceDiagram
    participant C as Client
    participant CN as Coordinating Node
    participant N1 as Node 1
    participant N2 as Node 2
    participant N3 as Node 3
    
    C->>CN: Search Request
    CN->>CN: Determine Target Shards
    
    par Query Phase
        CN->>N1: Query
        CN->>N2: Query
        CN->>N3: Query
    end
    
    par Response
        N1->>CN: Document IDs + Scores
        N2->>CN: Document IDs + Scores
        N3->>CN: Document IDs + Scores
    end
    
    CN->>CN: Merge and Sort Results
    
    par Fetch Phase
        CN->>N1: Fetch Documents
        CN->>N2: Fetch Documents
    end
    
    par Response
        N1->>CN: Document Content
        N2->>CN: Document Content
    end
    
    CN->>C: Final Results
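
The two phases in the diagram can be simulated in memory. The sketch below is an illustration, not client code: `SHARDS`, `query_phase`, and `search` are hypothetical names, and the scores are made up. It shows why the query phase returns only IDs and scores, and why the fetch phase contacts only the shards that own a winning hit.

```python
import heapq

# Hypothetical per-shard data: doc id -> (relevance score, stored document).
SHARDS = {
    "shard-0": {"1": (2.4, {"name": "iPhone 14"}), "4": (0.7, {"name": "Phone case"})},
    "shard-1": {"2": (1.9, {"name": "iPhone 14 Pro"})},
    "shard-2": {"3": (1.1, {"name": "Phone charger"}), "5": (0.3, {"name": "Headphones"})},
}


def query_phase(shard: str, size: int) -> list[tuple[float, str, str]]:
    """Each shard returns only (score, shard, doc id) for its local top hits."""
    hits = [(score, shard, doc_id) for doc_id, (score, _) in SHARDS[shard].items()]
    return heapq.nlargest(size, hits)


def search(size: int) -> list[dict]:
    # Query phase: gather lightweight hits from every shard.
    candidates = [hit for shard in SHARDS for hit in query_phase(shard, size)]
    # The coordinating node merges them and keeps the global top `size`.
    top_hits = heapq.nlargest(size, candidates)
    # Fetch phase: load full documents only for the winning hits.
    return [
        {"_id": doc_id, "_score": score, "_source": SHARDS[shard][doc_id][1]}
        for score, shard, doc_id in top_hits
    ]


for hit in search(size=2):
    print(hit)
```

With `size=2` the winners live on `shard-0` and `shard-1`, so `shard-2` is never asked for document content, which mirrors the fetch phase in the diagram reaching only two of the three nodes.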

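5. Comparison with Traditional Databases

5.1 Data Model Comparison

Feature	Elasticsearch	Relational database
Data model	Document-oriented (JSON)	Relational (tables)
Schema	Dynamic mapping	Fixed schema
Query language	Query DSL	SQL
Transactions	Limited: single-document operations are atomic, with no multi-document transactions	Full ACID
Scalability	Horizontal scaling	Mostly vertical scaling
Search capability	Powerful full-text search	Basic text matching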

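5.2 Performance Characteristics

A LIKE pattern with a leading wildcard cannot use an ordinary B-tree index, so the database has to scan every row. Elasticsearch answers the comparable query from its inverted index and ranks the hits by relevance. The relevance_score column in the SQL below is hypothetical, because standard SQL has no built-in relevance score.

# Traditional database query
SELECT * FROM products 
WHERE name LIKE '%phone%' 
AND price BETWEEN 500 AND 1000
ORDER BY relevance_score DESC;

# Elasticsearch query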
{
  "query": {
    "bool": {
      "must": [
        {"match": {"name": "phone"}},
        {"range": {"price": {"gte": 500, "lte": 1000}}}
      ]
    }
  },
  "sort": [{"_score": {"order": "desc"}}]
}
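
The performance gap comes from the data structure behind each query. The sketch below contrasts a toy inverted index with a row scan; the product names and helper names are hypothetical, and whitespace splitting stands in for a real analyzer.

```python
from collections import defaultdict

# Hypothetical product names, keyed by document id.
DOCS = {1: "iPhone 14 Pro", 2: "Samsung Galaxy S23", 3: "Phone case for iPhone"}


def build_inverted_index(docs: dict[int, str]) -> dict[str, set[int]]:
    """Map each lowercased term to the ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def like_scan(docs: dict[int, str], needle: str) -> set[int]:
    """What LIKE '%needle%' does without an index: inspect every row."""
    return {doc_id for doc_id, text in docs.items() if needle in text.lower()}


INDEX = build_inverted_index(DOCS)
print(sorted(INDEX.get("iphone", set())))  # [1, 3]  one dictionary lookup
print(sorted(INDEX.get("phone", set())))   # [3]     whole terms only
print(sorted(like_scan(DOCS, "phone")))    # [1, 3]  substring match, full scan
```

The second line of output is worth noting: a term lookup matches whole tokens, so a `match` query for "phone" does not find "iPhone" the way `LIKE '%phone%'` does. The two queries in this section are therefore comparable rather than strictly equivalent.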

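5.3 Choosing the Right Tool

Choose Elasticsearch when you need:

Full-text search
Large volumes of unstructured data
Near real-time analytics and aggregations
Log analysis
Geospatial search

Choose a relational database when you need:

Strong consistency
Complex transaction processing
Strict data integrity
Complex join queries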

6. Hands-On Examples

6.1 Creating an Index

# Create the products index
PUT /products
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1,
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "stop"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "my_analyzer",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        }
      },
      "brand": {"type": "keyword"},
      "price": {"type": "double"},
      "category": {"type": "keyword"},
      "description": {
        "type": "text",
        "analyzer": "my_analyzer"
      },
      "tags": {"type": "keyword"},
      "created_at": {
        "type": "date",
        "format": "yyyy-MM-dd'T'HH:mm:ss'Z'"
      },
      "location": {"type": "geo_point"},
      "specs": {
        "type": "object",
        "properties": {
          "color": {"type": "keyword"},
          "storage": {"type": "keyword"},
          "weight": {"type": "double"}
        }
      }
    }
  }
}
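
To see what `my_analyzer` does to text, the chain of standard tokenizer, `lowercase` filter, and `stop` filter can be approximated in a few lines. This is a rough sketch: a `\w+` regular expression stands in for the standard tokenizer (which follows Unicode segmentation rules), and the stop list is the default English set used by the `stop` filter. On a live cluster, the `_analyze` API on the index gives the authoritative answer.

```python
import re

# Default English stop words used by the `stop` token filter.
ENGLISH_STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that",
    "the", "their", "then", "there", "these", "they", "this", "to", "was",
    "will", "with",
}


def my_analyzer(text: str) -> list[str]:
    """Approximate the custom analyzer: tokenize, lowercase, drop stop words."""
    tokens = re.findall(r"\w+", text)      # stand-in for the standard tokenizer
    tokens = [t.lower() for t in tokens]   # lowercase filter
    return [t for t in tokens if t not in ENGLISH_STOP_WORDS]  # stop filter


print(my_analyzer("Latest iPhone with advanced features"))
# ['latest', 'iphone', 'advanced', 'features']
```

The terms in the output are what ends up in the inverted index for the `name` and `description` fields, while the `name.keyword` sub-field keeps the original string untouched for sorting and exact matching.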

6.2 Indexing Documents

# Index a single document
POST /products/_doc/1
{
  "name": "iPhone 14 Pro",
  "brand": "Apple",
  "price": 1099.99,
  "category": "smartphone",
  "description": "Professional iPhone with advanced camera system",
  "tags": ["mobile", "apple", "smartphone", "pro"],
  "created_at": "2024-01-15T10:30:00Z",
  "location": {
    "lat": 37.7749,
    "lon": -122.4194
  },
  "specs": {
    "color": "Space Black",
    "storage": "256GB",
    "weight": 206.0
  }
}

# Bulk indexing (newline-delimited JSON: an action line followed by a document line)
POST /products/_bulk
{"index":{"_id":"2"}}
{"name":"Samsung Galaxy S23","brand":"Samsung","price":899.99,"category":"smartphone"}
{"index":{"_id":"3"}}
{"name":"MacBook Pro","brand":"Apple","price":1999.99,"category":"laptop"}
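
The `_bulk` body is newline-delimited JSON rather than a JSON array, and every line, including the last, must end with a newline. A minimal sketch of building such a body by hand follows; `build_bulk_body` is a hypothetical helper, and in practice a client library usually does this for you.

```python
import json


def build_bulk_body(docs: dict[str, dict]) -> str:
    """Serialize {_id: source} pairs into the NDJSON body expected by _bulk."""
    lines = []
    for doc_id, source in docs.items():
        lines.append(json.dumps({"index": {"_id": doc_id}}))  # action line
        lines.append(json.dumps(source))                      # document line
    # Terminate every line, including the last one, with a newline.
    return "".join(line + "\n" for line in lines)


body = build_bulk_body({
    "2": {"name": "Samsung Galaxy S23", "brand": "Samsung", "price": 899.99, "category": "smartphone"},
    "3": {"name": "MacBook Pro", "brand": "Apple", "price": 1999.99, "category": "laptop"},
})
print(body, end="")
```

When sending this body over HTTP, the request should use the `application/x-ndjson` content type.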

6.3 Basic Queries

# Simple search
GET /products/_search
{
  "query": {
    "match": {
      "name": "iPhone"
    }
  }
}

# Compound query
GET /products/_search
{
  "query": {
    "bool": {
      "must": [
        {"match": {"category": "smartphone"}}
      ],
      "filter": [
        {"range": {"price": {"gte": 500, "lte": 1500}}},
        {"term": {"brand": "Apple"}}
      ]
    }
  },
  "sort": [
    {"price": {"order": "desc"}}
  ],
  "size": 10,
  "from": 0
}
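
The `from` and `size` parameters in the compound query implement offset-based paging. The sketch below converts a 1-based page number into those parameters; `page_params` is a hypothetical helper, and the limit of 10,000 is the default value of the `index.max_result_window` setting, which an index may override.

```python
MAX_RESULT_WINDOW = 10_000  # default value of index.max_result_window


def page_params(page: int, size: int = 10) -> dict[str, int]:
    """Translate a 1-based page number into `from`/`size` search parameters."""
    if page < 1 or size < 1:
        raise ValueError("page and size must be positive")
    offset = (page - 1) * size
    if offset + size > MAX_RESULT_WINDOW:
        raise ValueError("from + size exceeds index.max_result_window")
    return {"from": offset, "size": size}


print(page_params(1))      # {'from': 0, 'size': 10}
print(page_params(3, 20))  # {'from': 40, 'size': 20}
```

Deep pages are expensive because every shard has to produce `from + size` hits before the coordinating node can merge them. For deep result sets, `search_after` is generally the better fit.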

7. Best Practices

7.1 Index Design Principles

Choose a sensible number of shards
```
# Rule of thumb for the shard count
Number of shards = total data size / target size per shard (20-50 GB), rounded up
```
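
As a worked example of the formula, the sketch below computes the range of primary shard counts that keeps each shard inside the 20-50 GB band. The function name is hypothetical, the 600 GB figure is an arbitrary illustration, and the result is a starting point to validate against real workloads rather than a hard rule.

```python
import math


def shard_count_range(total_gb: float, min_shard_gb: float = 20,
                      max_shard_gb: float = 50) -> tuple[int, int]:
    """Return (fewest, most) primary shards that keep shards in the size band."""
    # Fewest shards: each one filled to the maximum size, rounded up.
    fewest = max(1, math.ceil(total_gb / max_shard_gb))
    # Most shards: each one at least the minimum size, rounded down.
    most = max(fewest, math.floor(total_gb / min_shard_gb))
    return fewest, most


print(shard_count_range(600))  # (12, 30)
```

For 600 GB of data, anywhere from 12 shards of 50 GB to 30 shards of 20 GB satisfies the guideline. Expected data growth usually argues for the upper half of that range.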

Choose appropriate field types

{
  "mappings": {
    "properties": {
      "id": {"type": "keyword"},        // keyword for exact matching
      "title": {"type": "text"},        // text for full-text search
      "status": {"type": "keyword"},    // keyword for enumerated values
      "timestamp": {"type": "date"}     // date for time fields
    }
  }
}

Disable features you do not need

{
 "mappings": {
   "properties": {
     "large_text": {
       "type": "text",
       "index": false,      // never searched, so skip indexing
       "store": true        // but the original value must still be retrievable
     }
   }
 }
}

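7.2 Performance Tuning Tips

Batch operations

# Use the bulk API to improve indexing throughput
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

# Assumes a local, unsecured cluster; adjust the URL and credentials as needed
es = Elasticsearch("http://localhost:9200")

def generate_docs():
    """Yield one bulk action per product."""
    for i in range(1000):
        yield {
            "_index": "products",
            "_id": i,
            "name": f"Product {i}",
            "price": i * 10
        }

# bulk() returns the number of successful actions and a list of errors
success, errors = bulk(es, generate_docs())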

Use filters where scoring is not needed
```
{
 "query": {
   "bool": {
     "must": [
       {"match": {"title": "search term"}}  // contributes to the score
     ],
     "filter": [
       {"term": {"status": "published"}},   // does not affect the score and can be cached
       {"range": {"date": {"gte": "2024-01-01"}}}
     ]
   }
 }
}
```
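
Chapter Summary

This chapter covered the basic concepts and architecture of Elasticsearch:
1. Core concepts: documents, indices, mappings, and related terms
2. Architecture: the distributed design built from clusters, nodes, and shards
3. Data flow: the complete indexing and search paths
4. Hands-on practice: basic indexing and query operations, shown through examples

The next chapter covers installing, deploying, and configuring Elasticsearch in preparation for real-world use.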

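Exercises
1. Explain the relationship between documents, indices, and shards in Elasticsearch
2. Design an index structure for the products of an e-commerce site
3. Analyze the scenarios in which Elasticsearch is a better choice than a traditional database
4. Calculate how many shards an index holding 1 TB of data needs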
