Design Elasticsearch
Apr 30, 2024
Elasticsearch is a search engine based on the Lucene library. It provides a distributed, multitenant-capable full-text search engine with an HTTP web interface and schema-free JSON documents.
Functional Requirements:
- Search documents with given queries. (Full Text Search)
- Find top documents (Ranking).
Non-Functional Requirements:
- We have billions of documents.
- Eventual Consistency: it’s okay if new documents show up in search results a little later.
- Highly Available.
Examples:
- E-commerce Website: Review Search, Product Search
- Social Media: Search bar (Post, comment, tweets)
- Search Engine
How to store Documents:
We will store documents in the form of an inverted index.
Let’s see an example of how this is stored. Each word maps to a list of (documentId, position) pairs:
Inverted Index (HashMap):
- Product -> [(1,4), (2,0), (3,0)]
- Using -> [(1,2)]
- Working -> [(1,8)]
- Nice -> [(1,9)]
- Great -> [(2,2)]
- Feature -> [(2,4)]
- OverPriced -> [(3,2)]
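The inverted index above can be sketched in a few lines of Python. This is a minimal illustration, not Elasticsearch’s actual implementation: the stop-word list and the word-offset positions are assumptions for the example.

```python
from collections import defaultdict

# a tiny illustrative stop-word list; real lists are much longer
STOP_WORDS = {"a", "an", "are", "is", "of", "the"}

def build_inverted_index(docs):
    """docs: {doc_id: text}. Returns {term: [(doc_id, position), ...]},
    where position is the word offset inside the document."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for pos, token in enumerate(text.lower().split()):
            if token not in STOP_WORDS:
                index[token].append((doc_id, pos))
    return dict(index)
```

Indexing `{1: "the product is great"}` produces `{"product": [(1, 1)], "great": [(1, 3)]}`: stop words are dropped, and each surviving word remembers where it occurred.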
Documents often contain numerous unnecessary words. Let’s explore ways to remove them.
- Remove Stop Words: common words that appear often but carry little meaning, e.g. of, the, are, a.
- Stemming/Lemmatization
- Other Cleaning Methods: Removing negative/abusive/sensitive words.
Stemming: chops suffixes off words with crude rules. It’s fast and simple.
- Escaped -> Escap
- Escaping -> Escap
- Escape -> Escap
- Caring -> Car
Lemmatization: reduces a word to its dictionary form (lemma), taking context into account.
- Caring -> Care
- Escaping -> Escape
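The stemming examples above can be reproduced with a toy suffix-stripper. This is a deliberately naive sketch; real stemmers (e.g. the Porter algorithm) apply many ordered rules and handle exceptions.

```python
def naive_stem(word):
    """Toy stemmer: strip the first matching suffix and stop.
    Reproduces Escaped/Escaping/Escape -> escap, Caring -> car."""
    w = word.lower()
    for suffix in ("ing", "ed", "e"):
        if w.endswith(suffix):
            return w[: -len(suffix)]
    return w
```

Note how it produces non-words like "escap": stemming only needs different forms of a word to collapse to the same token, not for that token to be a valid word. Lemmatization, by contrast, would need a dictionary and part-of-speech context.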
Example of Search Query:
- Search Query: The Product Quality is great.
- Cleaning: product quality great.
- Intersection: find all docs that contain all of the words.
- Union: find all docs that contain any of the words.
- Maintain word order (phrase match): product at position i, quality at position i+1, great at position i+2.
Design DeepDive:
Single Node:
- A single node is a single point of failure.
- It can run out of space.
- It can’t handle high load.
- Solution: Sharding.
CAP Theorem:
- Eventual Consistency: It’s okay if new documents are included in search queries later.
- High Availability: the system should be available whenever we search something.
- Partition Tolerance: the system should continue to work despite network partitions and communication breakdowns.
Sharding:
Shard by Document Id [Recommended]:
DocumentId can be:
- ReviewId
- LogId
- ProductId
- PostId/TweetId
Pros:
- Latency is highly predictable: response time doesn’t depend on the search query.
- No need to perform a cross-shard intersection; each shard answers its part of the query locally.
- Can skip unresponsive/slow shards.
Cons:
- Reads are complex: every query must run on all shards.
- Writes are simple: each document goes to exactly 1 shard.
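Shard-by-documentId routing can be sketched as a hash of the id. The function names and the substring-match "search" are illustrative assumptions, used only to show the write path touching one shard while the read path fans out to all of them:

```python
import hashlib

def shard_for_doc(doc_id, num_shards):
    # deterministic hash routing: every document lives on exactly one shard
    digest = hashlib.sha1(str(doc_id).encode()).hexdigest()
    return int(digest, 16) % num_shards

def index_document(shards, doc_id, text):
    # write path: exactly one shard is touched
    shards[shard_for_doc(doc_id, len(shards))][doc_id] = text

def search(shards, query):
    # read path: fan out to every shard, then merge the results
    return sorted(doc_id for shard in shards
                  for doc_id, text in shard.items() if query in text)
```

Because the hash is deterministic, updates and deletes for a document always land on the shard that indexed it.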
Shard by Words:
- Insert/Update/Delete: the write goes to every shard that owns a word in the document (multiple shards).
- Query: go to the shards owning the query words; the results must then be intersected across shards.
Let’s see how query processing works:
Map Reduce:
- Search “product quality good”: go to all shards and run the query on every shard.
- Each shard returns a list of docs that match the query.
- Collect and merge them all.
- The map step runs on a single shard and the reduce step combines the results from 2 shards, so each piece easily fits in memory.
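The map/reduce flow above can be sketched as follows. Each shard holds its own slice of the inverted index; the function names are illustrative:

```python
from functools import reduce

def map_step(shard_index, terms):
    # run the query locally on a single shard
    return {doc for t in terms for doc, _ in shard_index.get(t, [])}

def reduce_step(left, right):
    # combine the results of two shards; each pairwise merge stays small
    return left | right

def distributed_search(shard_indexes, terms):
    return reduce(reduce_step,
                  (map_step(s, terms) for s in shard_indexes),
                  set())
```

This sketch merges with a union; swapping `left | right` for `left & right` (and adjusting `map_step` accordingly) would give the intersection semantics instead.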
Replication:
- Replication keeps the system working when a node goes down.
- Parameters: number of shards, replication factor.
- n -> number of nodes, M -> number of shards, R -> replication factor.
- n ≤ M*R, so that every node holds at least one shard copy.
- Master-slave replication: writes go to the master and are copied to the slaves.
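A simple round-robin placement shows how the n ≤ M*R constraint plays out. This is an illustrative scheme, not how Elasticsearch allocates shards; it assumes R ≤ n so that copies of the same shard land on distinct nodes:

```python
def assign_replicas(num_shards, replication_factor, nodes):
    """Round-robin placement of M * R shard copies across n nodes.
    Consecutive copies of a shard land on distinct nodes as long as
    replication_factor <= len(nodes)."""
    placement = {node: [] for node in nodes}
    slot = 0
    for shard in range(num_shards):
        for _ in range(replication_factor):
            placement[nodes[slot % len(nodes)]].append(shard)
            slot += 1
    return placement
```

With M = 3 shards, R = 2, and n = 3 nodes there are 6 shard copies: every node holds some data (n ≤ M*R), and losing any single node still leaves one copy of every shard.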
Ranking:
- Rank all the documents that match a given query.
- TF-IDF is a popular ranking method.
TF-IDF: tf(t, d) measures how often term t appears in document d, and idf(t) = log(N / df(t)) down-weighs terms that appear in many of the N documents; the score is tf × idf.
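A minimal sketch of one common TF-IDF variant (tf × log(N / df); Lucene and others use refinements of this formula):

```python
import math

def tf_idf(term, doc_tokens, corpus):
    """doc_tokens: token list of one document; corpus: list of token lists.
    Assumes the term occurs in at least one document (df > 0)."""
    tf = doc_tokens.count(term) / len(doc_tokens)          # term frequency
    df = sum(1 for doc in corpus if term in doc)           # document frequency
    return tf * math.log(len(corpus) / df)                 # rare terms weigh more
```

A term that appears in every document gets idf = log(1) = 0, so ubiquitous words contribute nothing to the ranking, while rare, distinctive words dominate the score.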
Thank you! Please do give suggestions.