# Elasticsearch: Full-Text Search in Applications
Elasticsearch is a distributed search and analytics engine that has revolutionized how applications handle full-text search. Built on top of Apache Lucene, it offers exceptional speed, scalability, and flexibility. In this article, we will explore how Elasticsearch works under the hood, how to deploy it, and how to leverage its full potential in production applications.
## What is Elasticsearch?
Elasticsearch is an open-source NoSQL database designed specifically for full-text search and real-time data analytics. Its key features include:
- Distributed engine - automatic sharding and data replication
- Full-text search - advanced ranking algorithms (BM25)
- Real-time analytics - live aggregations and metrics
- RESTful API - communication via HTTP/JSON
- Horizontal scalability - easy addition of nodes to a cluster
- Near real-time - documents become searchable within ~1 second of indexing
Elasticsearch is used wherever fast, intelligent search is needed - in e-commerce stores, logging systems, and analytics platforms alike.
## The Inverted Index - The Heart of Elasticsearch
The key to Elasticsearch's performance is the inverted index. Unlike traditional databases that scan documents sequentially, an inverted index maps terms (words) to the documents that contain them.
### How Does the Inverted Index Work?
Suppose we have three documents:
| Document | Content |
|----------|---------|
| Doc 1 | "Elasticsearch is fast" |
| Doc 2 | "Fast data search" |
| Doc 3 | "Elasticsearch supports full-text search" |
The inverted index looks like this:
| Term | Documents |
|------|-----------|
| elasticsearch | Doc 1, Doc 3 |
| is | Doc 1 |
| fast | Doc 1, Doc 2 |
| search | Doc 2, Doc 3 |
| data | Doc 2 |
| supports | Doc 3 |
| full-text | Doc 3 |
This way, a lookup for "search" immediately yields documents 2 and 3 without scanning the entire database, and a multi-term query such as "elasticsearch search" reduces to a set operation over two postings lists. This is what enables Elasticsearch to search millions of documents in milliseconds.
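The mechanism can be sketched in a few lines of Python (an illustration of the idea, not Elasticsearch's actual implementation):

```python
from collections import defaultdict

# Toy corpus from the table above
docs = {
    1: "Elasticsearch is fast",
    2: "Fast data search",
    3: "Elasticsearch supports full-text search",
}

# Map each term to the set of document IDs that contain it
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

def search(*terms):
    """Documents containing ALL the given terms: a pure set intersection."""
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()
```

A single-term lookup such as `search("fast")` answers straight from the postings list, and multi-term queries reduce to set operations - no document is ever scanned.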
## Running Elasticsearch with Docker
The quickest way to start Elasticsearch is with Docker. Here is a Docker Compose configuration:
```yaml
version: '3.8'

services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.12.0
    container_name: elasticsearch
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false
      - "ES_JAVA_OPTS=-Xms1g -Xmx1g"
    ports:
      - "9200:9200"
    volumes:
      - es_data:/usr/share/elasticsearch/data
    networks:
      - elastic

  kibana:
    image: docker.elastic.co/kibana/kibana:8.12.0
    container_name: kibana
    ports:
      - "5601:5601"
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200
    depends_on:
      - elasticsearch
    networks:
      - elastic

volumes:
  es_data:
    driver: local

networks:
  elastic:
    driver: bridge
```
After running `docker-compose up -d`, Elasticsearch will be available at http://localhost:9200 and Kibana at http://localhost:5601.
```bash
# Check if the cluster is running
curl -X GET "localhost:9200/_cluster/health?pretty"
```
## Mapping and Analyzers

### What is Mapping?
Mapping is the data schema in Elasticsearch - it defines field types and how they are indexed. It is analogous to a table schema in relational databases, but more flexible.
```
PUT /products
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "english",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        }
      },
      "description": {
        "type": "text",
        "analyzer": "english"
      },
      "price": {
        "type": "float"
      },
      "category": {
        "type": "keyword"
      },
      "created_at": {
        "type": "date"
      },
      "in_stock": {
        "type": "boolean"
      },
      "tags": {
        "type": "keyword"
      }
    }
  }
}
```
### Field Types

- `text` - analyzed text for full-text search
- `keyword` - exact values (filters, sorting, aggregations)
- `integer`/`float`/`long` - numeric values
- `date` - dates and timestamps
- `boolean` - logical values
- `nested` - nested objects with preserved relationships
- `geo_point` - geographic coordinates
### Text Analyzers
Analyzers transform text before it is stored in the inverted index. The analysis process consists of three stages:
- Character filters - character transformation (e.g., HTML removal)
- Tokenizer - splitting text into tokens (words)
- Token filters - token modification (lowercasing, stemming, synonyms)
```
PUT /products
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_english": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "english_stop",
            "english_stemmer",
            "synonym_filter"
          ]
        }
      },
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "english_stemmer": {
          "type": "stemmer",
          "language": "english"
        },
        "synonym_filter": {
          "type": "synonym",
          "synonyms": [
            "laptop,notebook,portable computer",
            "phone,smartphone,mobile"
          ]
        }
      }
    }
  }
}
```
Thanks to analyzers, a search for "computers" will also match documents containing "computer" or "computing" (via stemming), and can even match "notebook" through the synonym expansion above.
## Full-Text Search Queries

### Indexing Documents
Before searching, we need to add documents to the index:
```
POST /products/_doc/1
{
  "name": "Dell XPS 15 Laptop",
  "description": "High-performance laptop for work and entertainment with Intel i7 processor",
  "price": 1499.99,
  "category": "laptops",
  "tags": ["dell", "xps", "intel", "laptop"],
  "in_stock": true,
  "created_at": "2024-01-15"
}

POST /products/_doc/2
{
  "name": "MacBook Pro 14 M3",
  "description": "Professional Apple notebook with M3 Pro chip for creative tasks",
  "price": 1999.00,
  "category": "laptops",
  "tags": ["apple", "macbook", "m3", "laptop"],
  "in_stock": true,
  "created_at": "2024-02-01"
}
```
For bulk indexing, use the Bulk API:
```
POST /products/_bulk
{"index": {"_id": "3"}}
{"name": "Samsung Galaxy S24", "description": "Flagship smartphone with AI features", "price": 899.00, "category": "phones", "tags": ["samsung", "galaxy", "smartphone"], "in_stock": true}
{"index": {"_id": "4"}}
{"name": "iPhone 15 Pro", "description": "Latest Apple phone with titanium body", "price": 1199.00, "category": "phones", "tags": ["apple", "iphone", "smartphone"], "in_stock": false}
```
### Basic Queries
**Match Query - Full-Text Search**

```
GET /products/_search
{
  "query": {
    "match": {
      "description": "high-performance laptop for work"
    }
  }
}
```
**Multi-Match Query - Searching Across Multiple Fields**

```
GET /products/_search
{
  "query": {
    "multi_match": {
      "query": "apple laptop",
      "fields": ["name^3", "description", "tags^2"],
      "type": "best_fields",
      "fuzziness": "AUTO"
    }
  }
}
```
The `^3` boost means the name field carries three times more weight in result ranking.
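The effect of field weights can be sketched in Python (a naive illustration - real ranking uses BM25, and the hit-counting here is invented for the example):

```python
# A term hit in "name" (weight 3) outranks the same hit in "description" (weight 1)
FIELD_WEIGHTS = {"name": 3.0, "description": 1.0}

def best_fields_score(doc, query_terms):
    # "best_fields" takes the single highest-scoring field, as in the query above
    field_scores = []
    for field, weight in FIELD_WEIGHTS.items():
        terms = doc.get(field, "").lower().split()
        hits = sum(1 for q in query_terms if q in terms)
        field_scores.append(hits * weight)
    return max(field_scores)

name_hit = best_fields_score({"name": "Apple MacBook laptop", "description": ""},
                             ["apple", "laptop"])
desc_hit = best_fields_score({"name": "USB hub", "description": "for any apple laptop"},
                             ["apple", "laptop"])
```

Both documents contain both query terms, but the one that matches in `name` scores three times higher.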
**Bool Query - Combining Conditions**

```
GET /products/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "description": "laptop" } }
      ],
      "filter": [
        { "range": { "price": { "gte": 500, "lte": 2000 } } },
        { "term": { "in_stock": true } }
      ],
      "should": [
        { "match": { "tags": "intel" } }
      ],
      "must_not": [
        { "term": { "category": "accessories" } }
      ]
    }
  }
}
```
- `must` - required conditions (affect the score)
- `filter` - required conditions (no score impact, cacheable)
- `should` - optional conditions (boost the score)
- `must_not` - exclusion conditions
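These semantics can be sketched in Python (a simplification - the scores here are stand-ins for BM25, but the gating and boosting behaviour mirrors the clause types above):

```python
def bool_match(doc, must=(), filter_=(), should=(), must_not=()):
    """Simplified bool semantics: must and filter gate the match, must_not
    excludes, should only adds to the score of documents that already match."""
    if not all(p(doc) for p in must) or not all(p(doc) for p in filter_):
        return None  # document does not match at all
    if any(p(doc) for p in must_not):
        return None
    # must clauses contribute to the score; filter clauses deliberately do not
    return len(must) + sum(1 for p in should if p(doc))

doc = {"description": "gaming laptop", "price": 1200, "in_stock": True,
       "category": "laptops", "tags": ["intel"]}

score = bool_match(
    doc,
    must=[lambda d: "laptop" in d["description"]],
    filter_=[lambda d: 500 <= d["price"] <= 2000, lambda d: d["in_stock"]],
    should=[lambda d: "intel" in d["tags"]],
    must_not=[lambda d: d["category"] == "accessories"],
)
```

The document passes every gate and picks up a `should` boost; a failing `must` or a matching `must_not` would exclude it entirely.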
**Fuzzy Search - Typo-Tolerant Search**

```
GET /products/_search
{
  "query": {
    "match": {
      "name": {
        "query": "laptpo",
        "fuzziness": "AUTO"
      }
    }
  }
}
```
Fuzziness allows finding results despite typos - "laptpo" will match "laptop".
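`AUTO` picks the maximum edit distance from the term length: 0 for 1-2 characters, 1 for 3-5, and 2 for longer terms. A plain Levenshtein distance in Python shows why "laptpo" still matches (Elasticsearch additionally counts a transposition as a single edit by default, which this sketch does not):

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance (insert / delete / substitute)
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,            # deletion
                            curr[j - 1] + 1,        # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def auto_fuzziness(term):
    # AUTO: 0 edits for 1-2 chars, 1 for 3-5 chars, 2 for longer terms
    return 0 if len(term) <= 2 else 1 if len(term) <= 5 else 2
```

"laptpo" is six characters, so `AUTO` allows 2 edits, and its distance to "laptop" fits within that budget.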
### Phrase Search and Highlighting

```
GET /products/_search
{
  "query": {
    "match_phrase": {
      "description": {
        "query": "laptop for work",
        "slop": 2
      }
    }
  },
  "highlight": {
    "fields": {
      "description": {
        "pre_tags": ["<strong>"],
        "post_tags": ["</strong>"],
        "fragment_size": 150
      }
    }
  }
}
```
The `slop` parameter specifies how many positional moves are allowed between the words of the phrase.
## Aggregations - Data Analytics
Aggregations are a powerful Elasticsearch mechanism for data analysis - the equivalent of GROUP BY in SQL, but significantly more versatile.
### Bucket Aggregations

```
GET /products/_search
{
  "size": 0,
  "aggs": {
    "categories": {
      "terms": {
        "field": "category",
        "size": 10
      }
    },
    "price_ranges": {
      "range": {
        "field": "price",
        "ranges": [
          { "key": "budget", "to": 500 },
          { "key": "mid-range", "from": 500, "to": 1000 },
          { "key": "premium", "from": 1000, "to": 2000 },
          { "key": "luxury", "from": 2000 }
        ]
      }
    }
  }
}
```
### Metric Aggregations

```
GET /products/_search
{
  "size": 0,
  "aggs": {
    "avg_price": {
      "avg": { "field": "price" }
    },
    "price_stats": {
      "stats": { "field": "price" }
    },
    "category_avg_price": {
      "terms": { "field": "category" },
      "aggs": {
        "avg_price": {
          "avg": { "field": "price" }
        }
      }
    }
  }
}
```
Aggregations are incredibly useful for creating faceted filters in e-commerce stores - such as dynamic price filters, categories with product counts, and popular tags.
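What a `terms` aggregation with a nested `avg` computes can be sketched over in-memory documents in Python (the sample data mirrors the products indexed earlier):

```python
from collections import defaultdict

products = [
    {"name": "Dell XPS 15", "category": "laptops", "price": 1499.99},
    {"name": "MacBook Pro 14", "category": "laptops", "price": 1999.00},
    {"name": "Samsung Galaxy S24", "category": "phones", "price": 899.00},
    {"name": "iPhone 15 Pro", "category": "phones", "price": 1199.00},
]

# terms aggregation: bucket documents by the "category" keyword field
buckets = defaultdict(list)
for p in products:
    buckets[p["category"]].append(p)

# nested avg sub-aggregation: average price per bucket
category_avg_price = {
    cat: {"doc_count": len(items),
          "avg_price": sum(p["price"] for p in items) / len(items)}
    for cat, items in buckets.items()
}
```

This is the GROUP BY analogy made concrete: one bucket per category, one metric per bucket - which is exactly what a faceted filter needs.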
## ELK Stack - Elasticsearch, Logstash, Kibana
The ELK Stack (also known as the Elastic Stack) is a suite of tools for centralized logging and data analysis:
### Components
- Elasticsearch - data storage and search
- Logstash - log collection, transformation, and loading
- Kibana - data visualization and exploration
- Beats - lightweight data collection agents (Filebeat, Metricbeat, Heartbeat)
### Logstash Configuration

```
input {
  beats {
    port => 5044
  }
}

filter {
  if [type] == "nginx" {
    grok {
      match => {
        "message" => '%{IPORHOST:remote_addr} - %{USER:remote_user} \[%{HTTPDATE:time_local}\] "%{WORD:method} %{URIPATHPARAM:request} HTTP/%{NUMBER:http_version}" %{NUMBER:status} %{NUMBER:body_bytes_sent}'
      }
    }
    date {
      match => ["time_local", "dd/MMM/yyyy:HH:mm:ss Z"]
    }
    geoip {
      source => "remote_addr"
    }
  }
}

output {
  elasticsearch {
    hosts => ["http://elasticsearch:9200"]
    index => "logs-nginx-%{+YYYY.MM.dd}"
  }
}
```
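To see what the grok pattern extracts, here is a simplified Python equivalent, with named groups playing the role of the `%{PATTERN:field}` captures (the sample log line is made up):

```python
import re

# Simplified Python counterpart of the grok pattern above
NGINX_ACCESS = re.compile(
    r'(?P<remote_addr>\S+) - (?P<remote_user>\S+) '
    r'\[(?P<time_local>[^\]]+)\] '
    r'"(?P<method>\w+) (?P<request>\S+) HTTP/(?P<http_version>[\d.]+)" '
    r'(?P<status>\d+) (?P<body_bytes_sent>\d+)'
)

line = '203.0.113.7 - alice [15/Jan/2024:10:30:00 +0000] "GET /products?q=laptop HTTP/1.1" 200 1234'
event = NGINX_ACCESS.match(line).groupdict()
```

Each named group becomes a field on the event, which is what the `date` and `geoip` filters then operate on.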
### Kibana - Visualization
Kibana enables creating interactive dashboards with Elasticsearch data. Key features include:
- Discover - raw data exploration with filtering
- Visualize - chart creation (pie, bar, line, heatmap)
- Dashboard - combining visualizations into interactive panels
- Dev Tools - console for executing REST API queries
- Lens - intuitive drag-and-drop visualization builder
## Use Cases

### E-commerce Search
Elasticsearch is ideal for building advanced search into online stores:
- Autocomplete - suggestions while typing
- Faceted filters - dynamic price, category, and brand filters
- Fuzzy search - typo tolerance
- Synonyms - "phone" = "smartphone" = "mobile"
- Personalization - boosting results based on user history
```
GET /products/_search
{
  "suggest": {
    "product-suggest": {
      "prefix": "sam",
      "completion": {
        "field": "suggest",
        "size": 5,
        "fuzzy": {
          "fuzziness": 1
        }
      }
    }
  }
}
```

(This assumes the mapping defines a `suggest` field of type `completion`.)
### Centralized Logging
Collecting logs from multiple microservices into a single location:
- Application logs (errors, warnings, debug)
- Access logs (Nginx, Apache)
- System metrics (CPU, RAM, disk)
- Request tracing
### Business Analytics
- Sales trend analysis over time
- Real-time KPI monitoring
- Reporting and dashboards
- User behavior analysis
### Geospatial Search

```
GET /stores/_search
{
  "query": {
    "geo_distance": {
      "distance": "10km",
      "location": {
        "lat": 51.5074,
        "lon": -0.1278
      }
    }
  },
  "sort": [
    {
      "_geo_distance": {
        "location": { "lat": 51.5074, "lon": -0.1278 },
        "order": "asc"
      }
    }
  ]
}
```
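Under the hood, `geo_distance` filters by great-circle distance. A haversine sketch in Python shows the idea (the store names and coordinates are invented for the example):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two (lat, lon) points in kilometres
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(h))

centre = (51.5074, -0.1278)  # the London coordinates used in the query above

stores = {
    "Covent Garden": (51.5129, -0.1240),  # hypothetical store locations
    "Paris Opera":   (48.8720, 2.3316),
}
nearby = {name for name, (lat, lon) in stores.items()
          if haversine_km(*centre, lat, lon) <= 10}
```

Only the store within the 10 km radius survives the filter; the sort clause in the query orders survivors by the same distance.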
## Comparison with Alternatives

### Elasticsearch vs Algolia
| Feature | Elasticsearch | Algolia |
|---------|---------------|---------|
| Hosting | Self-hosted / Cloud | SaaS |
| Cost | Open-source (infrastructure) | Pay per search |
| Configuration | Advanced | Minimal |
| Scalability | Unlimited | Automatic |
| Customization | Full | Limited |
| Analytics | Aggregations, ELK | Built-in |
| Latency | ~10-50ms | ~1-20ms |
| Use case | Complex systems | Quick search-as-a-service |
Algolia is better for simple search use cases in front-end applications where instant setup matters. Elasticsearch excels with complex requirements, data analysis, and full infrastructure control.
### Elasticsearch vs Meilisearch
| Feature | Elasticsearch | Meilisearch |
|---------|---------------|-------------|
| Language | Java | Rust |
| Memory | Demanding (JVM) | Lightweight |
| Configuration | Complex | Simple |
| Features | Comprehensive | Basic |
| Aggregations | Advanced | Basic filters |
| Production-readiness | Enterprise-ready | Maturing |
| Typo tolerance | Fuzzy + phonetic | Built-in typo-tolerant |
Meilisearch is an excellent choice for smaller projects needing fast search with minimal configuration. Elasticsearch remains the standard for enterprise systems and advanced analytics.
## Performance Optimization

### 1. Mapping Optimization

```
PUT /products
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1,
    "refresh_interval": "5s"
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "english"
      },
      "internal_id": {
        "type": "keyword",
        "index": false
      }
    }
  }
}
```
- Set `index: false` for fields you do not search on
- Disable `_source` for large documents (if you do not need the original)
- Use `keyword` instead of `text` for fields that do not require analysis
### 2. Query Optimization

```
GET /products/_search
{
  "_source": ["name", "price", "category"],
  "query": {
    "bool": {
      "filter": [
        { "term": { "category": "laptops" } },
        { "range": { "price": { "gte": 500 } } }
      ]
    }
  },
  "size": 20,
  "from": 0
}
```
- Use `filter` instead of `must` for conditions without scoring - filtered results are cached
- Limit the fields in `_source` to what is required
- Avoid deep pagination (`from` > 10000) - use `search_after` instead
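The `search_after` idea can be sketched in Python: instead of skipping N rows, each page starts strictly after the sort key of the previous page's last hit (this is an in-memory stand-in for an index sorted by price, then id):

```python
# Stand-in for an index sorted by (price, id) - 100 sample documents
docs = sorted(
    [{"id": i, "price": float(1000 - i)} for i in range(1, 101)],
    key=lambda d: (d["price"], d["id"]),
)

def search_after_page(docs, after=None, size=20):
    # Resume strictly after the previous page's last sort key - no offset skipping
    if after is not None:
        docs = [d for d in docs if (d["price"], d["id"]) > after]
    return docs[:size]

page1 = search_after_page(docs)
last = page1[-1]
page2 = search_after_page(docs, after=(last["price"], last["id"]))
```

Because pagination keys into the sort order rather than counting offsets, page 10,001 costs the same as page 1 - which is why Elasticsearch recommends it over deep `from`/`size`.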
### 3. Bulk Operations

```
POST /_bulk
{"index": {"_index": "products", "_id": "1"}}
{"name": "Product 1", "price": 100}
{"index": {"_index": "products", "_id": "2"}}
{"name": "Product 2", "price": 200}
```
Always use the Bulk API for mass operations - it is significantly more efficient than indexing documents one at a time.
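The bulk body is newline-delimited JSON: an action line followed by a source line per document. Building it in Python might look like this (a sketch - in practice the official client's `helpers.bulk` does this for you):

```python
import json

# Build an NDJSON bulk body: one action line, then one document line, per item
products = [
    {"_id": "1", "doc": {"name": "Product 1", "price": 100}},
    {"_id": "2", "doc": {"name": "Product 2", "price": 200}},
]

lines = []
for p in products:
    lines.append(json.dumps({"index": {"_index": "products", "_id": p["_id"]}}))
    lines.append(json.dumps(p["doc"]))
body = "\n".join(lines) + "\n"  # the Bulk API requires a trailing newline
```

The whole batch is then sent as a single `POST /_bulk` request, amortising the per-request overhead across all documents.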
### 4. Cluster Monitoring

```bash
# Cluster health
curl -X GET "localhost:9200/_cluster/health?pretty"

# Index statistics
curl -X GET "localhost:9200/_cat/indices?v"

# Node load
curl -X GET "localhost:9200/_cat/nodes?v&h=name,heap.percent,cpu,load_1m"

# Hot threads - diagnosing slow operations
curl -X GET "localhost:9200/_nodes/hot_threads"
```
### 5. Index Lifecycle Management (ILM)
Automated index lifecycle management - particularly useful for logs:
```
PUT /_ilm/policy/logs_policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_size": "50GB",
            "max_age": "7d"
          }
        }
      },
      "warm": {
        "min_age": "30d",
        "actions": {
          "shrink": { "number_of_shards": 1 },
          "forcemerge": { "max_num_segments": 1 }
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}
```
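The hot-phase rollover rule reads as: roll over when either threshold is reached. As a tiny Python sketch:

```python
# Sketch of the hot-phase rollover rule above: a new index is started
# when EITHER the size or the age threshold is reached.
MAX_SIZE_GB = 50
MAX_AGE_DAYS = 7

def should_rollover(size_gb, age_days):
    return size_gb >= MAX_SIZE_GB or age_days >= MAX_AGE_DAYS
```

Rolling over on either condition keeps individual indices bounded in both size and age, which is what makes the later warm and delete phases cheap to apply.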
## Deployment Checklist
Before going to production, make sure you have:
- [ ] Mapping defined (do not rely on dynamic mapping)
- [ ] Analyzers configured for supported languages
- [ ] Shard count matched to data size
- [ ] Replicas enabled (minimum 1 replica)
- [ ] JVM heap set to max 50% of RAM (no more than 32GB)
- [ ] Snapshot and restore configured
- [ ] Cluster monitoring enabled
- [ ] ILM for log indices
- [ ] Security (authentication and authorization) enabled
- [ ] Load tests completed
## Summary
Elasticsearch is a powerful search engine that:
- Provides lightning-fast search - millisecond responses to complex queries
- Handles full-text search - with morphological analysis, synonyms, and fuzzy search
- Scales horizontally - from a single node to clusters with hundreds of TB of data
- Delivers real-time analytics - aggregations and dashboards in Kibana
- Integrates with the ecosystem - ELK stack, Beats, clients for every language
For smaller projects, consider Meilisearch or Algolia, but when you need full control, advanced aggregations, and enterprise-grade scalability - Elasticsearch remains the undisputed leader.
## Need Help?
At MDS Software Solutions Group, we help with:
- Implementing Elasticsearch search in applications
- Setting up the ELK Stack for centralized logging
- Optimizing query performance and cluster configuration
- Migrating from other search solutions
- Building analytics systems on the Elastic Stack
Contact us to discuss your project!
Team of programming experts specializing in modern web technologies.