Design Elasticsearch
Apr 30, 2024
Elasticsearch is a search engine based on the Lucene library. It provides a distributed, multitenant-capable full-text search engine with an HTTP web interface and schema-free JSON documents.
Functional Requirements:
- Search documents with given queries. (Full Text Search)
- Find top documents (Ranking).
Non-Functional Requirements:
- We have billions of documents.
- Eventual Consistency: it’s okay if new documents show up in search results a little later.
- Highly Available.
Examples:
- E-commerce Website: Review Search, Product Search
- Social Media: Search bar (Post, comment, tweets)
- Search Engine
How to store Documents:
We will store documents in the form of an inverted index.
Let’s see an example of how this is stored. Each word maps to a list of (documentId, position) pairs:
Inverted Index (HashMap):
- Product -> [(1,4), (2,0), (3,0)]
- Using -> [(1,2)]
- Working -> [(1,8)]
- Nice -> [(1,9)]
- Great -> [(2,2)]
- Feature -> [(2,4)]
- OverPriced -> [(3,2)]
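The inverted index above can be sketched in a few lines of Python. This is a minimal illustration, not Elasticsearch’s actual implementation: the stop-word list and the word-offset positions are assumptions for the example.

```python
from collections import defaultdict

# a tiny illustrative stop-word list; real lists are much longer
STOP_WORDS = {"a", "an", "are", "is", "of", "the"}

def build_inverted_index(docs):
    """docs: {doc_id: text}. Returns {term: [(doc_id, position), ...]},
    where position is the word offset inside the document."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for pos, token in enumerate(text.lower().split()):
            if token not in STOP_WORDS:
                index[token].append((doc_id, pos))
    return dict(index)
```

Indexing `{1: "the product is great"}` produces `{"product": [(1, 1)], "great": [(1, 3)]}`: stop words are dropped, and each surviving word remembers where it occurred.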
Documents often contain numerous unnecessary words. Let’s explore ways to remove them.
- Remove Stop Words: common words that appear often but carry little meaning, e.g. of, the, are, a.
- Stemming/Lemmatization
- Other Cleaning Methods: Removing negative/abusive/sensitive words.
Stemming: chops suffixes off words with crude rules. It’s fast and simple.
- Escaped -> Escap
- Escaping -> Escap
- Escape -> Escap
- Caring -> Car
Lemmatization: reduces a word to its dictionary form (lemma), taking context into account.
- Caring -> Care
- Escaping -> Escape
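The stemming examples above can be reproduced with a toy suffix-stripper. This is a deliberately naive sketch; real stemmers (e.g. the Porter algorithm) apply many ordered rules and handle exceptions.

```python
def naive_stem(word):
    """Toy stemmer: strip the first matching suffix and stop.
    Reproduces Escaped/Escaping/Escape -> escap, Caring -> car."""
    w = word.lower()
    for suffix in ("ing", "ed", "e"):
        if w.endswith(suffix):
            return w[: -len(suffix)]
    return w
```

Note how it produces non-words like "escap": stemming only needs different forms of a word to collapse to the same token, not for that token to be a valid word. Lemmatization, by contrast, would need a dictionary and part-of-speech context.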
Example of Search Query:
- Search Query: The Product Quality is great.
- Cleaning: product quality great.
- Intersection: find all docs that contain all of the words.
- Union: find all docs that contain any of the words.
- Maintain word order (phrase match): product at position i, quality at position i+1, great at position i+2.
Design DeepDive:
Single Node:
- A single node is a single point of failure.
- It can run out of space.
- It can’t handle high load.
- Solution: Sharding.
CAP Theorem:
- Eventual Consistency: It’s okay if new documents are included in search queries later.
- High Availability: the system should be available whenever we search something.
- Partition Tolerance: the system should continue to work despite network partitions and communication breakdowns.
Sharding:
Shard by Document Id [Recommended]:
DocumentId can be:
- ReviewId
- LogId
- ProductId
- PostId/TweetId
Pros:
- Latency is highly predictable: response time doesn’t depend on the search query.
- No need to perform a cross-shard intersection; each shard answers its part of the query locally.
- Can skip unresponsive/slow shards.
Cons:
- Reads are complex: every query must run on all shards.
- Writes are simple: each document goes to exactly 1 shard.
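Shard-by-documentId routing can be sketched as a hash of the id. The function names and the substring-match "search" are illustrative assumptions, used only to show the write path touching one shard while the read path fans out to all of them:

```python
import hashlib

def shard_for_doc(doc_id, num_shards):
    # deterministic hash routing: every document lives on exactly one shard
    digest = hashlib.sha1(str(doc_id).encode()).hexdigest()
    return int(digest, 16) % num_shards

def index_document(shards, doc_id, text):
    # write path: exactly one shard is touched
    shards[shard_for_doc(doc_id, len(shards))][doc_id] = text

def search(shards, query):
    # read path: fan out to every shard, then merge the results
    return sorted(doc_id for shard in shards
                  for doc_id, text in shard.items() if query in text)
```

Because the hash is deterministic, updates and deletes for a document always land on the shard that indexed it.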
Shard by Words:
- Insert/Update/Delete: the write goes to every shard that owns a word in the document (multiple shards).
- Query: go to the shards owning the query words; the results must then be intersected across shards.
Let’s see how query processing works:
Map Reduce:
- Search “product quality good”: go to all shards and run the query on every shard.
- Each shard returns a list of docs that match the query.
- Collect and merge them all.
- The map step runs on a single shard and the reduce step combines the results from 2 shards, so each piece easily fits in memory.
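The map/reduce flow above can be sketched as follows. Each shard holds its own slice of the inverted index; the function names are illustrative:

```python
from functools import reduce

def map_step(shard_index, terms):
    # run the query locally on a single shard
    return {doc for t in terms for doc, _ in shard_index.get(t, [])}

def reduce_step(left, right):
    # combine the results of two shards; each pairwise merge stays small
    return left | right

def distributed_search(shard_indexes, terms):
    return reduce(reduce_step,
                  (map_step(s, terms) for s in shard_indexes),
                  set())
```

This sketch merges with a union; swapping `left | right` for `left & right` (and adjusting `map_step` accordingly) would give the intersection semantics instead.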
Replication:
- Replication keeps the system working when a node goes down.
- Parameters: number of shards, replication factor.
- n -> number of nodes, M -> number of shards, R -> replication factor.
- n ≤ M*R, so that every node holds at least one shard copy.
- Master-slave replication: writes go to the master and are copied to the slaves.
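A simple round-robin placement shows how the n ≤ M*R constraint plays out. This is an illustrative scheme, not how Elasticsearch allocates shards; it assumes R ≤ n so that copies of the same shard land on distinct nodes:

```python
def assign_replicas(num_shards, replication_factor, nodes):
    """Round-robin placement of M * R shard copies across n nodes.
    Consecutive copies of a shard land on distinct nodes as long as
    replication_factor <= len(nodes)."""
    placement = {node: [] for node in nodes}
    slot = 0
    for shard in range(num_shards):
        for _ in range(replication_factor):
            placement[nodes[slot % len(nodes)]].append(shard)
            slot += 1
    return placement
```

With M = 3 shards, R = 2, and n = 3 nodes there are 6 shard copies: every node holds some data (n ≤ M*R), and losing any single node still leaves one copy of every shard.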
Ranking:
- Rank all the documents that match a given query.
- TF-IDF is a popular ranking method.
TF-IDF: tf(t, d) measures how often term t appears in document d, and idf(t) = log(N / df(t)) down-weighs terms that appear in many of the N documents; the score is tf × idf.
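A minimal sketch of one common TF-IDF variant (tf × log(N / df); Lucene and others use refinements of this formula):

```python
import math

def tf_idf(term, doc_tokens, corpus):
    """doc_tokens: token list of one document; corpus: list of token lists.
    Assumes the term occurs in at least one document (df > 0)."""
    tf = doc_tokens.count(term) / len(doc_tokens)          # term frequency
    df = sum(1 for doc in corpus if term in doc)           # document frequency
    return tf * math.log(len(corpus) / df)                 # rare terms weigh more
```

A term that appears in every document gets idf = log(1) = 0, so ubiquitous words contribute nothing to the ranking, while rare, distinctive words dominate the score.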
Thank you! Please do give suggestions.