How to implement autocomplete search API – Part 2: Getting Basics out of the way!


Welcome friends! This article continues my previous one, What is predictive search or auto complete search? – Part 1: The Must Have Features, where I talked about the MUST HAVE features for any autocomplete or predictive search implementation. If you have not read that yet, I would encourage you to read it first; otherwise, crack on 😄.

Enough easy talk about MUST HAVE features 😄. Let us drop the Product Owner hat and put on the hat of the mighty software engineer who makes things happen! In this article, I will walk through the basic steps needed to get a multi field / multi match index and query implementation working: a single search term as input, executed as a single search against multiple fields of your index data in one go.

Creating an OpenSearch or Elasticsearch Index that enables multi field search

I have used AWS OpenSearch (Elasticsearch) Service in my implementation; the journey there starts by creating an index.  Let me show you the index that I created for my security search implementation to bring the concept to life –

PUT smart-stocklist-index
{
   "settings": {
    "index" : {
      "analysis": {
        <TO BE DISCUSSED LATER>
      }
    },
    "mappings": {
      "properties": {
        "BSE_SECURITY_CODE" : {
          "type" : "text"
        },
        "BSE_SYMBOL" : {
          "type" : "text",
          "analyzer" : "enGram_analyzer"
        },
        "ISIN_NUMBER" : {
          "type" : "text"
        },
        "NSE_SYMBOL" : {
          "type" : "text",
          "analyzer" : "enGram_analyzer"
        },
        "SECURITY_NAME" : {
          "type" : "text",
          "analyzer" : "nGram_analyzer"
        }
      }
     }
    }
}

Note that this index comprises 5 different fields, each with its own analyzer.

  • ISIN_NUMBER and BSE_SECURITY_CODE use the default (standard) analyzer, because a full / exact match scheme is to be used for them.
  • BSE_SYMBOL and NSE_SYMBOL use an edge_ngram type analyzer, which indexes the field's data so that it suits starts-with or prefix style auto completion. So "bhartiartl" gets indexed as "bh", "bha", "bhar", "bhart", "bharti", "bhartia", "bhartiar", "bhartiart", "bhartiartl".
  • SECURITY_NAME uses an nGram type analyzer, which indexes the field's data so that it suits contains style auto completion. So "Bharti Airtel" in the field gets indexed as "bh", "bha", "bhar", "bhart", "bharti", "ha", "har", "hart", "harti", "ar", "art", "arti", "rt", "rti", "ti", "ai", "air", "airt", "airte", "airtel", "ir", "irt", "irte", "irtel", "rt", "rte", "rtel", "te", "tel", "el".
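To make the two schemes concrete, here is a small Python sketch (not OpenSearch code, purely an illustration) of the grams each tokenizer style produces. The min_gram and max_gram values here are assumptions; the real ones live in the analyzer settings covered in the next article.

```python
# Illustrative only: these helpers mimic what edge_ngram and ngram
# tokenizers emit; min_gram/max_gram values are assumptions.

def edge_ngrams(token, min_gram=2):
    # Prefix-style grams: "bh", "bha", ..., up to the whole token
    return [token[:i] for i in range(min_gram, len(token) + 1)]

def ngrams(token, min_gram=2, max_gram=6):
    # Sliding-window grams taken from every position in the token
    return [token[start:start + size]
            for size in range(min_gram, max_gram + 1)
            for start in range(len(token) - size + 1)]

print(edge_ngrams("bhartiartl")[:4])   # ['bh', 'bha', 'bhar', 'bhart']
```

You can see why edge_ngram supports prefix matching only, while ngram (at the cost of a much larger index) supports contains matching anywhere in the token.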

The implementation of these custom analyzers will be covered in the next article; at this stage the goal is to build up the concepts incrementally.

Populating the OpenSearch / Elastic search index with data

With the index ready, let us now populate it with data. The obvious first step is data preparation. In my case the data came from two sources, NSE and BSE, and in different formats, so I had to transform it to create consistent field headers and record formats, remove unnecessary trailing spaces and characters, and then merge the multiple files into a single one that could be loaded into the index. I used the Python pandas library for these preparation steps. There are numerous sources for reading more about them, so I am skipping the details; they are not the core topic of this article.
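For the curious, here is a minimal pandas sketch of those preparation steps. It is not my exact script: the two tiny in-memory frames stand in for the real NSE and BSE CSV downloads, and the raw column names on the left of each rename are assumptions.

```python
import pandas as pd

# Stand-ins for the raw NSE and BSE files (column names are assumed)
nse = pd.DataFrame({
    "SYMBOL": ["20MICRONS "],            # note the trailing space
    "NAME OF COMPANY": ["20 Microns Limited"],
    "ISIN NUMBER": ["INE144J01027"],
})
bse = pd.DataFrame({
    "Security Code": [533022],
    "Security Id": ["20MICRONS"],
    "ISIN No": ["INE144J01027"],
})

# 1. Consistent field headers across both sources
nse = nse.rename(columns={"SYMBOL": "NSE_SYMBOL",
                          "NAME OF COMPANY": "SECURITY_NAME",
                          "ISIN NUMBER": "ISIN_NUMBER"})
bse = bse.rename(columns={"Security Code": "BSE_SECURITY_CODE",
                          "Security Id": "BSE_SYMBOL",
                          "ISIN No": "ISIN_NUMBER"})

# 2. Strip unnecessary trailing spaces from every text column
for df in (nse, bse):
    for col in df.select_dtypes(include="object"):
        df[col] = df[col].str.strip()

# 3. Merge the two sources into one record per security, keyed on ISIN
merged = nse.merge(bse, on="ISIN_NUMBER", how="inner")
# merged.to_csv("Securities.csv", index=False)
```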

OpenSearch (and Elasticsearch) accepts data in JSON format, so I had to transform the CSV data that I had into JSON.

NSE_SYMBOL | SECURITY_NAME | ISIN_NUMBER | BSE_SECURITY_CODE | BSE_SYMBOL
20MICRONS | 20 Microns Limited | INE144J01027 | 533022 | 20MICRONS
21STCENMGM | 21st Century Management Services Limited | INE253B01015 | 526921 | 21STCENMGM
3IINFOLTD | 3i Infotech Limited | INE748C01038 | 532628 | 3IINFOTECH

……..

Had to be transformed into …

{"index": {}}
{"NSE_SYMBOL": "20MICRONS", "SECURITY_NAME": "20 Microns Limited", "ISIN_NUMBER": "INE144J01027", "BSE_SECURITY_CODE": "533022", "BSE_SYMBOL": "20MICRONS"}
{"index": {}}
{"NSE_SYMBOL": "21STCENMGM", "SECURITY_NAME": "21st Century Management Services Limited", "ISIN_NUMBER": "INE253B01015", "BSE_SECURITY_CODE": "526921", "BSE_SYMBOL": "21STCENMGM"}
{"index": {}}
{"NSE_SYMBOL": "3IINFOLTD", "SECURITY_NAME": "3i Infotech Limited", "ISIN_NUMBER": "INE748C01038", "BSE_SECURITY_CODE": "532628", "BSE_SYMBOL": "3IINFOTECH"}
….

And, the following CSV to JSON format conversion function came in handy in that regard –

import csv
import json
from collections import OrderedDict

def csv_to_json(csvFilePath, jsonFilePath):
    # Every document must be preceded by an {"index": {}} action line,
    # as the OpenSearch _bulk API expects
    action_line = json.dumps(OrderedDict([('index', {})]))
    with open(csvFilePath, encoding='utf-8') as csvf:
        with open(jsonFilePath, 'w', encoding='utf-8') as jsonf:
            csvReader = csv.DictReader(csvf)
            for row in csvReader:
                jsonf.write(action_line)
                jsonf.write("\n")
                jsonf.write(json.dumps(row))
                jsonf.write("\n")

csvFilePath = '../../data/final_data/Securities.csv'
jsonFilePath = '../../data/final_data/Securities.json'
csv_to_json(csvFilePath, jsonFilePath)

Then I used the curl utility to load this JSON data into my OpenSearch index, as follows –

C:\Users\*******>curl -XPOST -u "<user-id>:<password>" "<URL-to-OpenSearchService>/smart-stocklist-index/_bulk" --data-binary @<File Path to JSON File>/Securities.json -H "Content-Type:application/json"
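The curl call above simply POSTs the NDJSON body to the _bulk endpoint. If you prefer, the same payload can be built and sanity-checked in Python first; this sketch only assembles the body (no network call) and any HTTP client, or the opensearch-py client shown later in this article, can then send it.

```python
import json

def build_bulk_body(docs):
    # Interleave {"index": {}} action lines with document lines,
    # ending with the trailing newline the _bulk API requires
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"

docs = [{"NSE_SYMBOL": "20MICRONS", "SECURITY_NAME": "20 Microns Limited",
         "ISIN_NUMBER": "INE144J01027", "BSE_SECURITY_CODE": "533022",
         "BSE_SYMBOL": "20MICRONS"}]
body = build_bulk_body(docs)
```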

Performing the multi field search query on OpenSearch or Elasticsearch Index

Now, with the index ready and the data loaded into it, we are off to multi field search querying, of course including auto completions. The OpenSearch Dashboards Dev Tools console is an easy way of building and validating queries before integrating them into your programs. Attached below is a screenshot of it: on the left hand side of the console is the GET search query with a single search term "BEL" issued in one go against multiple fields, and the right hand side shows the results. Some results appear because the search term "BEL" appears as a prefix in the NSE_SYMBOL or BSE_SYMBOL fields; others appear because the search term appears anywhere within the SECURITY_NAME field.

Multi field Elasticsearch query example

Here is the same query for easy copy / paste 😄

GET smart-stocklist-index/_search?filter_path=hits.hits._source
{
   "size": 10,
   "query": { 
     "multi_match": {
      "query": "bel",
      "fields": ["NSE_SYMBOL", "SECURITY_NAME","ISIN_NUMBER","BSE_SECURITY_CODE","BSE_SYMBOL"],
      "operator": "and", 
      "type": "most_fields"
    }
   }
}

Now, if the search term had contained an ISIN number like INE263A01024, the same search query would have fetched a single record, with the search term exactly matching the ISIN_NUMBER field (see image below for reference). So there is no hassle of writing multiple search queries or complex if-else conditions to perform varied searches across fields of various types.

Multi field search query with Exact Match example

Using an Autocomplete Search API on a web page

A picture is worth a thousand words, so here is exactly what it looks like …

Autocomplete implementation example with OpenSearch
My Autocomplete control on a Web UI

I will skip the bulk of my client side autocomplete implementation, but the key part is that the client needs a server side REST API which, when sent a partial search term from the UI, returns autocompleted / matched results like the ones above. In particular, look at the REST API being called through an AJAX HTTP request in the code below.

Autocomplete HTML – Client code

<body>
<form autocomplete="off"><input type="text" name="q" id="q" onKeyUp="showResults(this.value)" />
<div id="result"></div>
</form>
</body>

Autocomplete JavaScript – Client Code

function showResults(val) {
    var request = new XMLHttpRequest();
    request.open('GET', 'https://**************.amazonaws.com/staging/security/' + val, true);
    request.send();
    // Callback function reads the search API's results and shows them in the innerHTML of the DIV
    request.onload = function () {
        var datastr = this.response;
        var data = JSON.parse(datastr);
        var resultsarr = [];
        if (request.status >= 200 && request.status < 400) {
            if (data.SECURITIES.length > 0) {
                resultsarr = new Array(data.SECURITIES.length);
            }
            for (var i = 0; i < data.SECURITIES.length; i++) {
                var security = data.SECURITIES[i];
                resultsarr[i] = security.SECURITY_NAME;
            }
        }
        var res = document.getElementById("result");
        res.innerHTML = '';
        let list = '';
        for (var i = 0; i < resultsarr.length; i++) {
            list += '<li>' + resultsarr[i] + '</li>';
        }
        res.innerHTML = '<ul>' + list + '</ul>';
    }
}

Now let me talk about the challenges faced in the implementation of my REST API 'https://**************.amazonaws.com/staging/security/{pp_stockstring}'

{pp_stockstring} being the search term passed as a path parameter to my API.

Implementing the Autocomplete Search API using AWS Lambda and API Gateway

I implemented my API as an AWS Lambda function fronted by a REST API hosted on AWS API Gateway. But some of the challenges mentioned here should apply to other technology platforms too.

Firstly, I needed to create a Lambda Layer containing the opensearch-py Python library. In a directory of choice, say <dir>, I created a folder named python. Then, from the command line, I moved to the <dir> location and executed the following pip command.

pip install opensearch-py --target python/.

Then I zipped this folder and used the zip to create a Lambda Layer from the AWS Management Console. My AWS Lambda function used this Layer and was thus able to access the opensearch-py library.

Here is my Lambda code for your reference; in particular, read the documentation comments in the code –

import json
from opensearchpy import OpenSearch
# URL to my AWS OpenSearchService
host = '*******************.es.amazonaws.com' 
port = 443
# For testing only. Don't store credentials in code.
auth = ('*********', '*********')

# Opening a connection to AWS OpenSearch Service
client = OpenSearch(
    hosts = [{'host': host, 'port': port}],
    http_compress = True, # enables gzip compression for request bodies
    http_auth = auth,
    use_ssl = True,
    verify_certs = True,
    ssl_assert_hostname = False,
    ssl_show_warn = False
)
index_name = 'smart-stocklist-index'
# Actual lambda function
def lambda_handler(event, context):
    # Fetching the search term from the path parameter
    pp_security_str = event["pathParameters"]["pp_stockstring"]
    # Decoding the space character
    pp_security_str = pp_security_str.replace('%20', ' ')
    # Actual Open Search Query string against index
    query = {
       "size": 10,
       "query": { 
         "multi_match": {
          "query": pp_security_str,
          "fields": ["NSE_SYMBOL", "SECURITY_NAME","ISIN_NUMBER","BSE_SECURITY_CODE","BSE_SYMBOL"],
          "operator": "and",
          "type": "most_fields"
        }
       }
    }
    
    response = client.search(
        body = query,
        index = index_name
    )
    # Forming the JSON results string from the OpenSearch query results
    results_json = '{ "SECURITIES" : ['
    counter = 0
    for hit in response['hits']['hits']:
        if counter > 0:
            results_json += ','
        results_json += '{'
        results_json += '"SECURITY_NAME" : "' + hit['_source']['SECURITY_NAME'] + '",'
        results_json += '"ISIN_NUMBER" : "' + hit['_source']['ISIN_NUMBER'] + '"'
        results_json += '}'
        counter += 1
    results_json += '] }'
    return {
        'statusCode': 200,
        # Headers added to enable CORS; otherwise an autocomplete UI hosted on a
        # different domain origin will not be able to call this API successfully
        'headers': {
            'Access-Control-Allow-Headers': 'Content-Type',
            'Access-Control-Allow-Origin': '*',
            'Access-Control-Allow-Methods': 'OPTIONS,POST,GET'
        },
        'body': results_json
    }
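One note on the code above: building JSON by string concatenation breaks if a SECURITY_NAME ever contains a double quote. A more robust sketch (hypothetical helper, not part of my deployed code) assembles a dict and lets json.dumps handle the escaping; `response` here has the same shape as the client.search() result above –

```python
import json

def format_results(response):
    # Collect the fields we expose, then serialise in one go so that
    # quotes and other special characters are escaped correctly
    securities = [
        {"SECURITY_NAME": hit["_source"]["SECURITY_NAME"],
         "ISIN_NUMBER": hit["_source"]["ISIN_NUMBER"]}
        for hit in response["hits"]["hits"]
    ]
    return json.dumps({"SECURITIES": securities})

# Tiny stand-in for a client.search() result
sample = {"hits": {"hits": [
    {"_source": {"SECURITY_NAME": "Bharat Electronics Limited",
                 "ISIN_NUMBER": "INE263A01024"}}
]}}
print(format_results(sample))
```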

Enabling Cross Origin Resource Sharing or CORS on API

Cross-origin resource sharing (CORS) is a browser security feature that restricts cross-origin HTTP requests initiated from scripts running in the browser. This is typically the situation in autocomplete implementations, because the HTML / web content is usually hosted on a different domain from the one where the search API is implemented. I faced this scenario in my implementation as well. To fix this issue you have to do two things –

  • First, enable CORS on your API Gateway endpoint.
  • Second, in your API implementation, return HTTP headers which allow CORS access for the relevant API methods and from the client domain. In my Lambda code above I have allowed all domain origins with 'Access-Control-Allow-Origin': '*', but ideally you should only do this for your specific web domains.
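A minimal sketch of that tighter approach: echo the origin back only when it is on an allow list (the domain below is hypothetical), and return an empty value otherwise so the browser blocks the response.

```python
# Hypothetical allow list; replace with your real web domain(s)
ALLOWED_ORIGINS = {"https://www.example-stocks-ui.com"}

def cors_headers(request_origin):
    # Only echo back origins we explicitly trust
    origin = request_origin if request_origin in ALLOWED_ORIGINS else ""
    return {
        "Access-Control-Allow-Headers": "Content-Type",
        "Access-Control-Allow-Origin": origin,
        "Access-Control-Allow-Methods": "OPTIONS,POST,GET",
    }
```

In the Lambda above, the request origin would come from the incoming event's headers.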

Refer to the following useful links for more details on this issue.

https://docs.aws.amazon.com/apigateway/latest/developerguide/how-to-cors.html

https://www.spektor.dev/cors-and-other-gotchas-with-aws-api-gateway-lambda/

Conclusion

In this article I showed you how to get a basic implementation of an autocomplete / multi field search API done, and addressed some of the common issues you would face along the way. In the next article we will take a detailed look at how to implement prefix styled and contains styled auto completions as well as auto corrections, and most importantly at the challenges of making all of these features work together sensibly 😄

Feel free to leave your comments and queries, link my article into yours, and get the word going. See you in the next article in the next few days. Bye!

Other Parts in this article series –

Part 1 – What is predictive search or auto complete search? – The Must Have Features

Part 3 – Prefix, Contains and Fuzzy match – Feature by feature Autocomplete implementation

