Prefix, Contains and Fuzzy match – Part 3: Feature by feature Autocomplete implementation


Welcome again, my friend! I am glad that you have come this far with me in our journey of Autocomplete implementation.  And if you have landed here directly without going through previous articles in the same series, you may want to visit them first before continuing –

Part 1 – What is predictive search or auto complete search? – The Must Have Features

Part 2 – How to implement autocomplete search API – Getting Basics out of the way!

Let us get going… In the previous articles we discussed the MUST HAVE features of a typical autocomplete implementation, and we have already built a basic front-to-back working version. In this article we shall add most of the core 'autocomplete search' features to that basic implementation and make it feature rich, well, almost ready for rollout. Did I leave a hint there that we shall have a Part 4 to this series as well 😄?

In Part 2, as per the code snippet below, I showed that fields in the mappings section of the index are assigned analyzers, for example "analyzer" : "nGram_analyzer".

PUT smart-stocklist-index
{
  "settings": {
    "index": {
      "analysis": {
        <TO BE DISCUSSED LATER>
      }
    }
  },
  "mappings": {
    "properties": {
      "BSE_SECURITY_CODE": {
        "type": "text"
      },
      "BSE_SYMBOL": {
        "type": "text",
        "analyzer": "enGram_analyzer",
        "search_analyzer": "whitespace_analyzer"
      },
      "ISIN_NUMBER": {
        "type": "text"
      },
      "NSE_SYMBOL": {
        "type": "text",
        "analyzer": "enGram_analyzer",
        "search_analyzer": "whitespace_analyzer"
      },
      "SECURITY_NAME": {
        "type": "text",
        "copy_to": [ "eSECURITY_NAME" ],
        "analyzer": "nGram_analyzer",
        "search_analyzer": "whitespace_analyzer"
      }
    }
  }
}
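By the way, once the index is created you can always pull the stored definition back out to confirm what actually went in. Nothing here is specific to my example, these are just the standard mapping and settings APIs –

GET smart-stocklist-index/_mapping

GET smart-stocklist-index/_settings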

However, I did not describe the analyzers then, and I had blanked out the analysis section of the index definition. These analyzers define the scheme by which the corresponding fields are indexed. In this article, I will show you the details of these analyzers and how they solve the autocomplete problem.

Index that enables the prefix match, contains match and fuzzy match – My example

Now, let me first show the analysis section and the analyzers used in my index before describing the problems they solve and how they solve them.

"analysis" : {
          "filter" : {
		"enGram_filter" : {
             			 "token_chars" : [ "letter", "digit", "punctuation", "symbol" ],
              			"min_gram" : "2",
            			  "type" : "edge_ngram",
             			 "max_gram" : "20"
            		},
            		"nGram_filter" : {
              			"token_chars" : [ "letter", "digit", "punctuation", "symbol" ],
              			"min_gram" : "2",
             			 "type" : "ngram",
              			"max_gram" : "20"
            		},
		"english_stop" : {
              			"type" : "stop",
              			"stopwords" : [ "limited", "ltd" ]
           		 }
          },
          "analyzer" : {
"nGram_analyzer" : {
              			"filter" : [ "lowercase","english_stop", "nGram_filter" ],
              			"char_filter" : [ "replaceCharWithSpace", "specialCharactersFilter" ],
              			"type" : "custom",
              			"tokenizer" : "whitespace"
            		},
		"enGram_analyzer" : {
              			"filter" : [ "lowercase", "enGram_filter" ],
              			"char_filter" : [ "replaceCharWithSpace", "specialCharactersFilter" ],
              			"type" : "custom",
              			"tokenizer" : "whitespace"
            		}
          },
          "char_filter" : {
            "replaceCharWithSpace" : {
              "pattern" : "[.-]",
              "type" : "pattern_replace",
              "replacement" : " "
            },
            "specialCharactersFilter" : {
              "pattern" : "[^A-Za-z0-9 ]",
              "type" : "pattern_replace",
              "replacement" : ""
            }
          }
        }

Special characters and autocomplete search query

Notice the replaceCharWithSpace and specialCharactersFilter char_filters used within the analyzers; let me describe how they are useful.

Well, I had this kind of data in my index: HCL-INSYS, BAJAJ-AUTO, D-Link (India) Limited, Mold-Tek Packaging Limited, GAIL (India) Limited. So the data has non-alphanumeric characters like hyphens (-) and parentheses. It is not fair to expect a user to remember to type these into the input field, and it is highly unlikely that they would ever provide them correctly. So how do you ignore them if they are provided, and how do you ensure that the comparison still works? And of course, while these characters can be removed from searching / comparing, they still need to be retained in the results being returned, as they are relevant for display and end-user consumption.

That is where these char_filters come in handy. The replaceCharWithSpace char_filter replaces characters like hyphens (-) and dots (.) with spaces. Then the specialCharactersFilter char_filter eliminates any character other than alphanumeric and space characters. Only after this filtering stage is the data indexed.

Now querying for BAJAJ-AUTO as well as BAJAJ AUTO will be successful. Likewise, a query for D-Link (India) as well as for D Link India will also be successful.
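If you would like to see these char_filters in action yourself, the _analyze API lets you run just the character-filter stage against some sample text. A minimal sketch, assuming the index above has already been created –

GET smart-stocklist-index/_analyze
{
  "char_filter": [ "replaceCharWithSpace", "specialCharactersFilter" ],
  "tokenizer": "whitespace",
  "filter": [ "lowercase" ],
  "text": "D-Link (India) Limited"
}

The response should list the tokens d, link, india and limited; the hyphen has become a space and the parentheses have disappeared before tokenization even begins.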

Ignoring special characters from Elasticsearch query

What are ngram analyzers and how they help with contains style partial word matching

Let me explain with the above index example; you can see that the "SECURITY_NAME" field is mapped to an index analyzer / filter of type 'ngram'. This analyzer ensures that data like "Bharti Airtel" in the "SECURITY_NAME" field gets indexed as "bh", "bha", "bhar", "bhart", "bharti", "ha", "har", "hart", "harti", "ar", "art", "arti", "rt", "rti", "ti", "ai", "air", "airt", "airte", "airtel", "ir", "irt", "irte", "irtel", "rt", "rte", "rtel", "te", "tel", "el"

Now you can see that the data is broken into every possible substring of length 2 or more before being added to the index. And of course, the "min_gram" and "max_gram" settings of the nGram_filter control this.

Now, when the user types any part of "Bharti Airtel" into the input field, the above index comes in handy in offering the autocomplete suggestion.
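To try the 'contains' behaviour, a query for a mid-word fragment like harti is enough. Here is a sketch that simply reuses the multi_match query shape shown later in this article, minus the fuzziness –

GET smart-stocklist-index/_search?filter_path=hits.hits._source
{
  "size": 10,
  "query": {
    "multi_match": {
      "query": "harti",
      "fields": ["NSE_SYMBOL", "SECURITY_NAME", "ISIN_NUMBER", "BSE_SECURITY_CODE", "BSE_SYMBOL"],
      "operator": "and",
      "type": "most_fields"
    }
  }
}

Because "harti" was indexed as one of the ngrams of "bharti", this query should still bring back Bharti Airtel even though the user never typed the beginning of the word.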

What are edge_ngram analyzers and how they help with prefix style partial word matching

Now, edge_ngram is only slightly different from ngram; again, let me show an example. This analyzer / filter combination ensures that a value like "bhartiartl" in the BSE_SYMBOL and NSE_SYMBOL fields gets indexed as "bh", "bha", "bhar", "bhart", "bharti", "bhartia", "bhartiar", "bhartiart", "bhartiartl"

So only startswith or prefix substrings are added to the index. In the case of single-word strings like business codes / identifiers, it is more meaningful to use edge_ngram than ngram analyzers, as it is less likely that a user would meaningfully remember or identify contains-style substrings.

Also, edge_ngram saves significant space over an ngram-analysed index. If we had analysed the same data using ngram, the following combinations would have been indexed –

"bh", "bha", "bhar", "bhart", "bharti", "bhartia", "bhartiar", "bhartiart", "bhartiartl", "ha", "har", "hart", "harti", "hartia", "hartiar", "hartiart", "hartiartl", "ar", "art", "arti", "artia", "artiar", "artiart", "artiartl", "rt", "rti", "rtia", "rtiar", "rtiart", "rtiartl", "ti", "tia", "tiar", "tiart", "tiartl", "ia", "iar", "iart", "iartl", "ar", "art", "artl", "rt", "rtl", "tl"

Clearly, it would have taken 4-5 times more space in the index without adding any value for these kinds of data fields.
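Again, the _analyze API is a handy way to confirm what the edge_ngram side produces; a quick sketch against the enGram_analyzer defined above –

GET smart-stocklist-index/_analyze
{
  "analyzer": "enGram_analyzer",
  "text": "BHARTIARTL"
}

The response should contain only the nine prefix tokens listed earlier, from "bh" all the way up to "bhartiartl".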

Removing common and insignificant words from Elasticsearch index

Words that occur frequently in your data do not provide any additional value during searching, and hence during indexing. In my case of securities data, almost every other security name has words like "limited", "ltd.", "Private" or "pvt." within it. So there is little value in adding these to the index; in fact, they only increase the index storage and lower the quality of search results.

So I employed the following stop words filter in my ngram analyzer to eliminate such words from the indexing process –

"english_stop" : {
              			"type" : "stop",
              			"stopwords" : [ "limited", "ltd" ]
           		 }
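You can also verify the stop filter in isolation with the _analyze API; a minimal sketch, assuming the index above exists –

GET smart-stocklist-index/_analyze
{
  "tokenizer": "whitespace",
  "filter": [ "lowercase", "english_stop" ],
  "text": "Bharti Airtel Limited"
}

Only "bharti" and "airtel" should come back; "limited" is dropped before the nGram_filter ever sees it, so it never bloats the index.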

How to autocorrect words misspelled by humans or by speech-to-text voice assistants using Elasticsearch

Users have come to expect that systems will correct their typos automatically; so a search for "Larson" should still find "Larsen", and a search for "SIPLA" should still find "CIPLA". Additionally, in my case, voice-enabled assistants / services were throwing weird text strings at random: "Awrobindoh" was coming in for "Aurobindo", "Bharati Airtell" for "Bharti Airtel", "Godrage" for "Godrej" and so on… What fun 😄, is it not?

All we need to do is add "fuzziness" to our Elasticsearch search query 😄, as shown in the code snippet below. With "fuzziness": "AUTO", Elasticsearch tolerates one character edit for terms of 3 to 5 characters and two edits for longer terms, which is exactly what absorbs these kinds of typos.

GET smart-stocklist-index/_search?filter_path=hits.hits._source
{
   "size": 10,
   "query": { 
     "multi_match": {
      "query": "godrage",
      "fields": ["NSE_SYMBOL", "SECURITY_NAME","ISIN_NUMBER","BSE_SECURITY_CODE","BSE_SYMBOL"],
      "operator": "and", 
      "type": "most_fields", 
      "fuzziness": "AUTO"
    }
   }
}

And the following successful results start to appear; obviously, without "fuzziness" these queries were failing.

Fuzzy search providing Auto correction Feature in Elasticsearch

Working software gives immense pleasure, and here are some shining examples.

Ngram and edge_ngram working together in a single query

ngram and edge_ngram working together in a single query

Ngram, edge_ngram and fuzziness all working together

That is partial & incorrectly spelled word appears corrected as prefix and also in “contains” matching. So in example below, a user commits a typo; i.e. user types infas may be its intention was to type Infosys or may  be it was to type Infrastructure  .. who knows 😄

Prefix, Contains and Fuzzy Match, all together in Elasticsearch

Conclusion

We have implemented most of the features described as MUST HAVE features in Part 1 of the series. Or have we…? Can you spot some errors and limitations? You can leave comments or share those issues with me 😄. It will be fun!

I can leave one error as a teaser; a search for "Apollo t" fails, but one for "Apollo ty" works… Why?

There are more errors when the features have to work in combination with each other. I hope you will be able to spot them. Either way, see you in the next part with the mystery resolved 😄. Happy searching!

Part 1 – What is predictive search or auto complete search? – The Must Have Features

Part 2 – How to implement autocomplete search API – Getting Basics out of the way!

