# Elasticsearch: Full-Text Search in Applications
Elasticsearch is a distributed search and analytics engine that has revolutionized how applications handle full-text search. Built on top of Apache Lucene, it offers exceptional speed, scalability, and flexibility. In this article, we will explore how Elasticsearch works under the hood, how to deploy it, and how to leverage its full potential in production applications.
## What is Elasticsearch?
Elasticsearch is an open-source NoSQL database designed specifically for full-text search and real-time data analytics. Its key features include:
- Distributed engine - automatic sharding and data replication
- Full-text search - advanced ranking algorithms (BM25)
- Real-time analytics - live aggregations and metrics
- RESTful API - communication via HTTP/JSON
- Horizontal scalability - easy addition of nodes to a cluster
- Near real-time - documents become searchable within ~1 second of indexing
Elasticsearch is used wherever fast, intelligent search is needed - in e-commerce stores, logging systems, and analytics platforms alike.
## The Inverted Index - The Heart of Elasticsearch
The key to Elasticsearch's performance is the inverted index. Unlike traditional databases that scan documents sequentially, an inverted index maps terms (words) to the documents that contain them.
### How Does the Inverted Index Work?
Suppose we have three documents:
| Document | Content |
|----------|---------|
| Doc 1 | "Elasticsearch is fast" |
| Doc 2 | "Fast data search" |
| Doc 3 | "Elasticsearch supports full-text search" |
The inverted index looks like this:
| Term | Documents |
|------|-----------|
| elasticsearch | Doc 1, Doc 3 |
| is | Doc 1 |
| fast | Doc 1, Doc 2 |
| search | Doc 2, Doc 3 |
| data | Doc 2 |
| supports | Doc 3 |
| full-text | Doc 3 |
This way, a lookup for "search" immediately yields documents 2 and 3 without scanning the entire database, and a multi-term query such as "elasticsearch search" reduces to a set operation over two postings lists. This is what enables Elasticsearch to search millions of documents in milliseconds.
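The mechanism can be sketched in a few lines of Python (an illustration of the idea, not Elasticsearch's actual implementation):

```python
from collections import defaultdict

# Toy corpus from the table above
docs = {
    1: "Elasticsearch is fast",
    2: "Fast data search",
    3: "Elasticsearch supports full-text search",
}

# Map each term to the set of document IDs that contain it
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

def search(*terms):
    """Documents containing ALL the given terms: a pure set intersection."""
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()
```

A single-term lookup such as `search("fast")` answers straight from the postings list, and multi-term queries reduce to set operations - no document is ever scanned.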
## Running Elasticsearch with Docker
The quickest way to start Elasticsearch is with Docker. Here is a Docker Compose configuration:
```yaml
version: '3.8'

services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.12.0
    container_name: elasticsearch
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false
      - "ES_JAVA_OPTS=-Xms1g -Xmx1g"
    ports:
      - "9200:9200"
    volumes:
      - es_data:/usr/share/elasticsearch/data
    networks:
      - elastic

  kibana:
    image: docker.elastic.co/kibana/kibana:8.12.0
    container_name: kibana
    ports:
      - "5601:5601"
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200
    depends_on:
      - elasticsearch
    networks:
      - elastic

volumes:
  es_data:
    driver: local

networks:
  elastic:
    driver: bridge
```
After running `docker-compose up -d`, Elasticsearch will be available at http://localhost:9200 and Kibana at http://localhost:5601.
```bash
# Check if the cluster is running
curl -X GET "localhost:9200/_cluster/health?pretty"
```
## Mapping and Analyzers

### What is Mapping?
Mapping is the data schema in Elasticsearch - it defines field types and how they are indexed. It is analogous to a table schema in relational databases, but more flexible.
```
PUT /products
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "english",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        }
      },
      "description": {
        "type": "text",
        "analyzer": "english"
      },
      "price": {
        "type": "float"
      },
      "category": {
        "type": "keyword"
      },
      "created_at": {
        "type": "date"
      },
      "in_stock": {
        "type": "boolean"
      },
      "tags": {
        "type": "keyword"
      }
    }
  }
}
```
### Field Types

- `text` - analyzed text for full-text search
- `keyword` - exact values (filters, sorting, aggregations)
- `integer`/`float`/`long` - numeric values
- `date` - dates and timestamps
- `boolean` - logical values
- `nested` - nested objects with preserved relationships
- `geo_point` - geographic coordinates
### Text Analyzers
Analyzers transform text before it is stored in the inverted index. The analysis process consists of three stages:
- Character filters - character transformation (e.g., HTML removal)
- Tokenizer - splitting text into tokens (words)
- Token filters - token modification (lowercasing, stemming, synonyms)
```
PUT /products
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_english": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "english_stop",
            "english_stemmer",
            "synonym_filter"
          ]
        }
      },
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "english_stemmer": {
          "type": "stemmer",
          "language": "english"
        },
        "synonym_filter": {
          "type": "synonym",
          "synonyms": [
            "laptop,notebook,portable computer",
            "phone,smartphone,mobile"
          ]
        }
      }
    }
  }
}
```
Thanks to analyzers, a search for "computers" will also match documents containing "computer" or "computing" (via stemming), and can even match "notebook" through the synonym expansion above.
## Full-Text Search Queries

### Indexing Documents
Before searching, we need to add documents to the index:
```
POST /products/_doc/1
{
  "name": "Dell XPS 15 Laptop",
  "description": "High-performance laptop for work and entertainment with Intel i7 processor",
  "price": 1499.99,
  "category": "laptops",
  "tags": ["dell", "xps", "intel", "laptop"],
  "in_stock": true,
  "created_at": "2024-01-15"
}

POST /products/_doc/2
{
  "name": "MacBook Pro 14 M3",
  "description": "Professional Apple notebook with M3 Pro chip for creative tasks",
  "price": 1999.00,
  "category": "laptops",
  "tags": ["apple", "macbook", "m3", "laptop"],
  "in_stock": true,
  "created_at": "2024-02-01"
}
```
For bulk indexing, use the Bulk API:
```
POST /products/_bulk
{"index": {"_id": "3"}}
{"name": "Samsung Galaxy S24", "description": "Flagship smartphone with AI features", "price": 899.00, "category": "phones", "tags": ["samsung", "galaxy", "smartphone"], "in_stock": true}
{"index": {"_id": "4"}}
{"name": "iPhone 15 Pro", "description": "Latest Apple phone with titanium body", "price": 1199.00, "category": "phones", "tags": ["apple", "iphone", "smartphone"], "in_stock": false}
```
### Basic Queries
**Match Query - Full-Text Search**

```
GET /products/_search
{
  "query": {
    "match": {
      "description": "high-performance laptop for work"
    }
  }
}
```
**Multi-Match Query - Searching Across Multiple Fields**

```
GET /products/_search
{
  "query": {
    "multi_match": {
      "query": "apple laptop",
      "fields": ["name^3", "description", "tags^2"],
      "type": "best_fields",
      "fuzziness": "AUTO"
    }
  }
}
```
The `^3` boost means the name field carries three times more weight in result ranking.
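The effect of field weights can be sketched in Python (a naive illustration - real ranking uses BM25, and the hit-counting here is invented for the example):

```python
# A term hit in "name" (weight 3) outranks the same hit in "description" (weight 1)
FIELD_WEIGHTS = {"name": 3.0, "description": 1.0}

def best_fields_score(doc, query_terms):
    # "best_fields" takes the single highest-scoring field, as in the query above
    field_scores = []
    for field, weight in FIELD_WEIGHTS.items():
        terms = doc.get(field, "").lower().split()
        hits = sum(1 for q in query_terms if q in terms)
        field_scores.append(hits * weight)
    return max(field_scores)

name_hit = best_fields_score({"name": "Apple MacBook laptop", "description": ""},
                             ["apple", "laptop"])
desc_hit = best_fields_score({"name": "USB hub", "description": "for any apple laptop"},
                             ["apple", "laptop"])
```

Both documents contain both query terms, but the one that matches in `name` scores three times higher.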
**Bool Query - Combining Conditions**

```
GET /products/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "description": "laptop" } }
      ],
      "filter": [
        { "range": { "price": { "gte": 500, "lte": 2000 } } },
        { "term": { "in_stock": true } }
      ],
      "should": [
        { "match": { "tags": "intel" } }
      ],
      "must_not": [
        { "term": { "category": "accessories" } }
      ]
    }
  }
}
```
- `must` - required conditions (affect the score)
- `filter` - required conditions (no score impact, cacheable)
- `should` - optional conditions (boost the score)
- `must_not` - exclusion conditions
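These semantics can be sketched in Python (a simplification - the scores here are stand-ins for BM25, but the gating and boosting behaviour mirrors the clause types above):

```python
def bool_match(doc, must=(), filter_=(), should=(), must_not=()):
    """Simplified bool semantics: must and filter gate the match, must_not
    excludes, should only adds to the score of documents that already match."""
    if not all(p(doc) for p in must) or not all(p(doc) for p in filter_):
        return None  # document does not match at all
    if any(p(doc) for p in must_not):
        return None
    # must clauses contribute to the score; filter clauses deliberately do not
    return len(must) + sum(1 for p in should if p(doc))

doc = {"description": "gaming laptop", "price": 1200, "in_stock": True,
       "category": "laptops", "tags": ["intel"]}

score = bool_match(
    doc,
    must=[lambda d: "laptop" in d["description"]],
    filter_=[lambda d: 500 <= d["price"] <= 2000, lambda d: d["in_stock"]],
    should=[lambda d: "intel" in d["tags"]],
    must_not=[lambda d: d["category"] == "accessories"],
)
```

The document passes every gate and picks up a `should` boost; a failing `must` or a matching `must_not` would exclude it entirely.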
**Fuzzy Search - Typo-Tolerant Search**

```
GET /products/_search
{
  "query": {
    "match": {
      "name": {
        "query": "laptpo",
        "fuzziness": "AUTO"
      }
    }
  }
}
```
Fuzziness allows finding results despite typos - "laptpo" will match "laptop".
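`AUTO` picks the maximum edit distance from the term length: 0 for 1-2 characters, 1 for 3-5, and 2 for longer terms. A plain Levenshtein distance in Python shows why "laptpo" still matches (Elasticsearch additionally counts a transposition as a single edit by default, which this sketch does not):

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance (insert / delete / substitute)
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,            # deletion
                            curr[j - 1] + 1,        # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def auto_fuzziness(term):
    # AUTO: 0 edits for 1-2 chars, 1 for 3-5 chars, 2 for longer terms
    return 0 if len(term) <= 2 else 1 if len(term) <= 5 else 2
```

"laptpo" is six characters, so `AUTO` allows 2 edits, and its distance to "laptop" fits within that budget.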
### Phrase Search and Highlighting

```
GET /products/_search
{
  "query": {
    "match_phrase": {
      "description": {
        "query": "laptop for work",
        "slop": 2
      }
    }
  },
  "highlight": {
    "fields": {
      "description": {
        "pre_tags": ["<strong>"],
        "post_tags": ["</strong>"],
        "fragment_size": 150
      }
    }
  }
}
```
The `slop` parameter specifies how many positional moves are allowed between the words of the phrase.
## Aggregations - Data Analytics
Aggregations are a powerful Elasticsearch mechanism for data analysis - the equivalent of GROUP BY in SQL, but significantly more versatile.
### Bucket Aggregations

```
GET /products/_search
{
  "size": 0,
  "aggs": {
    "categories": {
      "terms": {
        "field": "category",
        "size": 10
      }
    },
    "price_ranges": {
      "range": {
        "field": "price",
        "ranges": [
          { "key": "budget", "to": 500 },
          { "key": "mid-range", "from": 500, "to": 1000 },
          { "key": "premium", "from": 1000, "to": 2000 },
          { "key": "luxury", "from": 2000 }
        ]
      }
    }
  }
}
```
### Metric Aggregations

```
GET /products/_search
{
  "size": 0,
  "aggs": {
    "avg_price": {
      "avg": { "field": "price" }
    },
    "price_stats": {
      "stats": { "field": "price" }
    },
    "category_avg_price": {
      "terms": { "field": "category" },
      "aggs": {
        "avg_price": {
          "avg": { "field": "price" }
        }
      }
    }
  }
}
```
Aggregations are incredibly useful for creating faceted filters in e-commerce stores - such as dynamic price filters, categories with product counts, and popular tags.
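What a `terms` aggregation with a nested `avg` computes can be sketched over in-memory documents in Python (the sample data mirrors the products indexed earlier):

```python
from collections import defaultdict

products = [
    {"name": "Dell XPS 15", "category": "laptops", "price": 1499.99},
    {"name": "MacBook Pro 14", "category": "laptops", "price": 1999.00},
    {"name": "Samsung Galaxy S24", "category": "phones", "price": 899.00},
    {"name": "iPhone 15 Pro", "category": "phones", "price": 1199.00},
]

# terms aggregation: bucket documents by the "category" keyword field
buckets = defaultdict(list)
for p in products:
    buckets[p["category"]].append(p)

# nested avg sub-aggregation: average price per bucket
category_avg_price = {
    cat: {"doc_count": len(items),
          "avg_price": sum(p["price"] for p in items) / len(items)}
    for cat, items in buckets.items()
}
```

This is the GROUP BY analogy made concrete: one bucket per category, one metric per bucket - which is exactly what a faceted filter needs.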
## ELK Stack - Elasticsearch, Logstash, Kibana
The ELK Stack (also known as the Elastic Stack) is a suite of tools for centralized logging and data analysis:
### Components
- Elasticsearch - data storage and search
- Logstash - log collection, transformation, and loading
- Kibana - data visualization and exploration
- Beats - lightweight data collection agents (Filebeat, Metricbeat, Heartbeat)
### Logstash Configuration

```
input {
  beats {
    port => 5044
  }
}

filter {
  if [type] == "nginx" {
    grok {
      match => {
        "message" => '%{IPORHOST:remote_addr} - %{USER:remote_user} \[%{HTTPDATE:time_local}\] "%{WORD:method} %{URIPATHPARAM:request} HTTP/%{NUMBER:http_version}" %{NUMBER:status} %{NUMBER:body_bytes_sent}'
      }
    }
    date {
      match => ["time_local", "dd/MMM/yyyy:HH:mm:ss Z"]
    }
    geoip {
      source => "remote_addr"
    }
  }
}

output {
  elasticsearch {
    hosts => ["http://elasticsearch:9200"]
    index => "logs-nginx-%{+YYYY.MM.dd}"
  }
}
```
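To see what the grok pattern extracts, here is a simplified Python equivalent, with named groups playing the role of the `%{PATTERN:field}` captures (the sample log line is made up):

```python
import re

# Simplified Python counterpart of the grok pattern above
NGINX_ACCESS = re.compile(
    r'(?P<remote_addr>\S+) - (?P<remote_user>\S+) '
    r'\[(?P<time_local>[^\]]+)\] '
    r'"(?P<method>\w+) (?P<request>\S+) HTTP/(?P<http_version>[\d.]+)" '
    r'(?P<status>\d+) (?P<body_bytes_sent>\d+)'
)

line = '203.0.113.7 - alice [15/Jan/2024:10:30:00 +0000] "GET /products?q=laptop HTTP/1.1" 200 1234'
event = NGINX_ACCESS.match(line).groupdict()
```

Each named group becomes a field on the event, which is what the `date` and `geoip` filters then operate on.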
### Kibana - Visualization
Kibana enables creating interactive dashboards with Elasticsearch data. Key features include:
- Discover - raw data exploration with filtering
- Visualize - chart creation (pie, bar, line, heatmap)
- Dashboard - combining visualizations into interactive panels
- Dev Tools - console for executing REST API queries
- Lens - intuitive drag-and-drop visualization builder
## Use Cases

### E-commerce Search
Elasticsearch is ideal for building advanced search into online stores:
- Autocomplete - suggestions while typing
- Faceted filters - dynamic price, category, and brand filters
- Fuzzy search - typo tolerance
- Synonyms - "phone" = "smartphone" = "mobile"
- Personalization - boosting results based on user history
```
GET /products/_search
{
  "suggest": {
    "product-suggest": {
      "prefix": "sam",
      "completion": {
        "field": "suggest",
        "size": 5,
        "fuzzy": {
          "fuzziness": 1
        }
      }
    }
  }
}
```

(This assumes the mapping defines a `suggest` field of type `completion`.)
### Centralized Logging
Collecting logs from multiple microservices into a single location:
- Application logs (errors, warnings, debug)
- Access logs (Nginx, Apache)
- System metrics (CPU, RAM, disk)
- Request tracing
### Business Analytics
- Sales trend analysis over time
- Real-time KPI monitoring
- Reporting and dashboards
- User behavior analysis
### Geospatial Search

```
GET /stores/_search
{
  "query": {
    "geo_distance": {
      "distance": "10km",
      "location": {
        "lat": 51.5074,
        "lon": -0.1278
      }
    }
  },
  "sort": [
    {
      "_geo_distance": {
        "location": { "lat": 51.5074, "lon": -0.1278 },
        "order": "asc"
      }
    }
  ]
}
```
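Under the hood, `geo_distance` filters by great-circle distance. A haversine sketch in Python shows the idea (the store names and coordinates are invented for the example):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two (lat, lon) points in kilometres
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(h))

centre = (51.5074, -0.1278)  # the London coordinates used in the query above

stores = {
    "Covent Garden": (51.5129, -0.1240),  # hypothetical store locations
    "Paris Opera":   (48.8720, 2.3316),
}
nearby = {name for name, (lat, lon) in stores.items()
          if haversine_km(*centre, lat, lon) <= 10}
```

Only the store within the 10 km radius survives the filter; the sort clause in the query orders survivors by the same distance.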
## Comparison with Alternatives

### Elasticsearch vs Algolia
| Feature | Elasticsearch | Algolia |
|---------|---------------|---------|
| Hosting | Self-hosted / Cloud | SaaS |
| Cost | Open-source (infrastructure) | Pay per search |
| Configuration | Advanced | Minimal |
| Scalability | Unlimited | Automatic |
| Customization | Full | Limited |
| Analytics | Aggregations, ELK | Built-in |
| Latency | ~10-50ms | ~1-20ms |
| Use case | Complex systems | Quick search-as-a-service |
Algolia is better for simple search use cases in front-end applications where instant setup matters. Elasticsearch excels with complex requirements, data analysis, and full infrastructure control.
### Elasticsearch vs Meilisearch
| Feature | Elasticsearch | Meilisearch |
|---------|---------------|-------------|
| Language | Java | Rust |
| Memory | Demanding (JVM) | Lightweight |
| Configuration | Complex | Simple |
| Features | Comprehensive | Basic |
| Aggregations | Advanced | Basic filters |
| Production-readiness | Enterprise-ready | Maturing |
| Typo tolerance | Fuzzy + phonetic | Built-in typo-tolerant |
Meilisearch is an excellent choice for smaller projects needing fast search with minimal configuration. Elasticsearch remains the standard for enterprise systems and advanced analytics.
## Performance Optimization

### 1. Mapping Optimization

```
PUT /products
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1,
    "refresh_interval": "5s"
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "english"
      },
      "internal_id": {
        "type": "keyword",
        "index": false
      }
    }
  }
}
```
- Set `index: false` for fields you do not search on
- Disable `_source` for large documents (if you do not need the original)
- Use `keyword` instead of `text` for fields that do not require analysis
### 2. Query Optimization

```
GET /products/_search
{
  "_source": ["name", "price", "category"],
  "query": {
    "bool": {
      "filter": [
        { "term": { "category": "laptops" } },
        { "range": { "price": { "gte": 500 } } }
      ]
    }
  },
  "size": 20,
  "from": 0
}
```
- Use `filter` instead of `must` for conditions without scoring - filtered results are cached
- Limit the fields in `_source` to what is required
- Avoid deep pagination (`from` > 10000) - use `search_after` instead
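The `search_after` idea can be sketched in Python: instead of skipping N rows, each page starts strictly after the sort key of the previous page's last hit (this is an in-memory stand-in for an index sorted by price, then id):

```python
# Stand-in for an index sorted by (price, id) - 100 sample documents
docs = sorted(
    [{"id": i, "price": float(1000 - i)} for i in range(1, 101)],
    key=lambda d: (d["price"], d["id"]),
)

def search_after_page(docs, after=None, size=20):
    # Resume strictly after the previous page's last sort key - no offset skipping
    if after is not None:
        docs = [d for d in docs if (d["price"], d["id"]) > after]
    return docs[:size]

page1 = search_after_page(docs)
last = page1[-1]
page2 = search_after_page(docs, after=(last["price"], last["id"]))
```

Because pagination keys into the sort order rather than counting offsets, page 10,001 costs the same as page 1 - which is why Elasticsearch recommends it over deep `from`/`size`.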
### 3. Bulk Operations

```
POST /_bulk
{"index": {"_index": "products", "_id": "1"}}
{"name": "Product 1", "price": 100}
{"index": {"_index": "products", "_id": "2"}}
{"name": "Product 2", "price": 200}
```
Always use the Bulk API for mass operations - it is significantly more efficient than indexing documents one at a time.
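The bulk body is newline-delimited JSON: an action line followed by a source line per document. Building it in Python might look like this (a sketch - in practice the official client's `helpers.bulk` does this for you):

```python
import json

# Build an NDJSON bulk body: one action line, then one document line, per item
products = [
    {"_id": "1", "doc": {"name": "Product 1", "price": 100}},
    {"_id": "2", "doc": {"name": "Product 2", "price": 200}},
]

lines = []
for p in products:
    lines.append(json.dumps({"index": {"_index": "products", "_id": p["_id"]}}))
    lines.append(json.dumps(p["doc"]))
body = "\n".join(lines) + "\n"  # the Bulk API requires a trailing newline
```

The whole batch is then sent as a single `POST /_bulk` request, amortising the per-request overhead across all documents.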
### 4. Cluster Monitoring

```bash
# Cluster health
curl -X GET "localhost:9200/_cluster/health?pretty"

# Index statistics
curl -X GET "localhost:9200/_cat/indices?v"

# Node load
curl -X GET "localhost:9200/_cat/nodes?v&h=name,heap.percent,cpu,load_1m"

# Hot threads - diagnosing slow operations
curl -X GET "localhost:9200/_nodes/hot_threads"
```
### 5. Index Lifecycle Management (ILM)
Automated index lifecycle management - particularly useful for logs:
```
PUT /_ilm/policy/logs_policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_size": "50GB",
            "max_age": "7d"
          }
        }
      },
      "warm": {
        "min_age": "30d",
        "actions": {
          "shrink": { "number_of_shards": 1 },
          "forcemerge": { "max_num_segments": 1 }
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}
```
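The hot-phase rollover rule reads as: roll over when either threshold is reached. As a tiny Python sketch:

```python
# Sketch of the hot-phase rollover rule above: a new index is started
# when EITHER the size or the age threshold is reached.
MAX_SIZE_GB = 50
MAX_AGE_DAYS = 7

def should_rollover(size_gb, age_days):
    return size_gb >= MAX_SIZE_GB or age_days >= MAX_AGE_DAYS
```

Rolling over on either condition keeps individual indices bounded in both size and age, which is what makes the later warm and delete phases cheap to apply.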
## Deployment Checklist
Before going to production, make sure you have:
- [ ] Mapping defined (do not rely on dynamic mapping)
- [ ] Analyzers configured for supported languages
- [ ] Shard count matched to data size
- [ ] Replicas enabled (minimum 1 replica)
- [ ] JVM heap set to max 50% of RAM (no more than 32GB)
- [ ] Snapshot and restore configured
- [ ] Cluster monitoring enabled
- [ ] ILM for log indices
- [ ] Security (authentication and authorization) enabled
- [ ] Load tests completed
## Summary
Elasticsearch is a powerful search engine that:
- Provides lightning-fast search - millisecond responses to complex queries
- Handles full-text search - with morphological analysis, synonyms, and fuzzy search
- Scales horizontally - from a single node to clusters with hundreds of TB of data
- Delivers real-time analytics - aggregations and dashboards in Kibana
- Integrates with the ecosystem - ELK stack, Beats, clients for every language
For smaller projects, consider Meilisearch or Algolia, but when you need full control, advanced aggregations, and enterprise-grade scalability - Elasticsearch remains the undisputed leader.
## Need Help?
At MDS Software Solutions Group, we help with:
- Implementing Elasticsearch search in applications
- Setting up the ELK Stack for centralized logging
- Optimizing query performance and cluster configuration
- Migrating from other search solutions
- Building analytics systems on the Elastic Stack
Contact us to discuss your project!
Team of programming experts specializing in modern web technologies.