Vector search using Elasticsearch: index and search example using the Python requests library

data engineering

Publish Date: 2023-03-18

Vector search has become very useful in deep learning applications, where text, images, and other data are often represented as embedding vectors.

To search dense vectors in Elasticsearch 8.6, you can use the “dense_vector” data type, which was introduced in the Elasticsearch 7.x series. This data type allows you to store a dense vector as a single field in your documents, which can then be scored against a query vector using similarity measures such as cosine similarity or Euclidean distance.
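
To make the two similarity measures concrete, here is a minimal sketch in plain Python that computes them outside Elasticsearch. The function names cosine_similarity and euclidean_distance are illustrative helpers, not part of any Elasticsearch API.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between a and b; ranges from -1.0 to 1.0."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between a and b; 0.0 means identical vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

query = [0.1, 0.5, 0.3]
doc = [0.1, 0.7, 0.2]
print("cosine similarity:", cosine_similarity(query, doc))
print("euclidean distance:", euclidean_distance(query, doc))
```

A higher cosine similarity means the vectors are more alike, while a lower Euclidean distance means the same thing.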

Here’s an example of how to search for similar vectors using cosine similarity:

First, you need to create an index with a dense_vector field:

PUT my_index
{
  "mappings": {
    "properties": {
      "my_vector": {
        "type": "dense_vector",
        "dims": 3
      }
    }
  }
}

In this example, we create an index called “my_index” with a dense_vector field called “my_vector” with three dimensions.

Next, you can index some documents with vectors:

POST my_index/_doc/1
{
  "my_vector": [0.2, 0.3, 0.4]
}

POST my_index/_doc/2
{
  "my_vector": [0.1, 0.7, 0.2]
}

POST my_index/_doc/3
{
  "my_vector": [0.8, 0.2, 0.1]
}

In this example, we index three documents with dense vectors.
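
Elasticsearch rejects a document whose vector length does not match the dims value in the mapping, so it can help to check this on the client before sending anything. The sketch below is one way to do that; build_index_requests is a hypothetical helper, and it only prepares the request paths and bodies without contacting a cluster.

```python
import json

DIMS = 3  # assumed to match "dims" in the index mapping above

def build_index_requests(index, vectors, field="my_vector", dims=DIMS):
    """Return (path, json_body) pairs, one per document, after checking dimensions."""
    prepared = []
    for doc_id, vector in vectors.items():
        if len(vector) != dims:
            raise ValueError(
                f"document {doc_id} has {len(vector)} dimensions, expected {dims}"
            )
        path = f"{index}/_doc/{doc_id}"
        body = json.dumps({field: vector})
        prepared.append((path, body))
    return prepared

vectors = {
    1: [0.2, 0.3, 0.4],
    2: [0.1, 0.7, 0.2],
    3: [0.8, 0.2, 0.1],
}
for path, body in build_index_requests("my_index", vectors):
    print("POST", path, body)
```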

Finally, you can search for documents that are similar to a given vector:

GET my_index/_search
{
  "query": {
    "script_score": {
      "query": {
        "match_all": {}
      },
      "script": {
        "source": "cosineSimilarity(params.queryVector, 'my_vector') + 1.0",
        "params": {
          "queryVector": [0.1, 0.5, 0.3]
        }
      }
    }
  }
}
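
To see what ordering this query should produce, the same score can be reproduced locally. The following is a rough sketch that mimics the script (cosine similarity plus 1.0) in plain Python for the three example documents; rank_documents is an illustrative helper, not an Elasticsearch function.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_documents(query_vector, documents):
    """Return (doc_id, score) pairs sorted by cosine similarity + 1.0, best first."""
    scored = [
        (doc_id, cosine_similarity(query_vector, vector) + 1.0)
        for doc_id, vector in documents.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

documents = {
    "1": [0.2, 0.3, 0.4],
    "2": [0.1, 0.7, 0.2],
    "3": [0.8, 0.2, 0.1],
}
ranking = rank_documents([0.1, 0.5, 0.3], documents)
for doc_id, score in ranking:
    print(doc_id, round(score, 3))
```

With these vectors, document 2 ranks first, followed by document 1 and then document 3.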

Putting it all together, the corresponding Python code using the requests library is:

import requests

# Base URL of the index; adjust the host and port for your cluster
url = 'http://localhost:9200/my_index'
# The same headers are reused for every request; replace the token placeholder
headers = {'Content-Type': 'application/json', 'Authorization': 'Bearer YOUR_ACCESS_TOKEN'}

# Create an index with a dense_vector field
mapping = {
    "mappings": {
        "properties": {
            "my_vector": {
                "type": "dense_vector",
                "dims": 3
            }
        }
    }
}
response = requests.put(url, json=mapping, headers=headers)
print(response.json())

# Index some documents with vectors
vectors = {
    1: [0.2, 0.3, 0.4],
    2: [0.1, 0.7, 0.2],
    3: [0.8, 0.2, 0.1],
}
for doc_id, vector in vectors.items():
    # refresh=true makes the document visible to the search that follows
    response = requests.post(f"{url}/_doc/{doc_id}", json={"my_vector": vector},
                             params={"refresh": "true"}, headers=headers)
    print(response.json())

# Search for documents that are similar to a given vector
query = {
    "query": {
        "script_score": {
            "query": {"match_all": {}},
            "script": {
                "source": "cosineSimilarity(params.queryVector, 'my_vector') + 1.0",
                "params": {"queryVector": [0.1, 0.5, 0.3]}
            }
        }
    }
}
response = requests.get(url + '/_search', json=query, headers=headers)
print(response.json())
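
The search response is a nested JSON object, with the matching documents listed under hits.hits. As a small sketch of how to pull out the document IDs and scores, the helper below works on a hand-written sample response; extract_hits is a hypothetical name, and the sample scores are illustrative values rather than output captured from a real cluster.

```python
def extract_hits(response_json):
    """Return a list of (document id, score) pairs from a search response."""
    hits = response_json.get("hits", {}).get("hits", [])
    return [(hit["_id"], hit["_score"]) for hit in hits]

# Hand-written sample shaped like an Elasticsearch search response (illustrative values)
sample_response = {
    "took": 3,
    "timed_out": False,
    "hits": {
        "total": {"value": 3, "relation": "eq"},
        "max_score": 1.966,
        "hits": [
            {"_index": "my_index", "_id": "2", "_score": 1.966,
             "_source": {"my_vector": [0.1, 0.7, 0.2]}},
            {"_index": "my_index", "_id": "1", "_score": 1.910,
             "_source": {"my_vector": [0.2, 0.3, 0.4]}},
            {"_index": "my_index", "_id": "3", "_score": 1.427,
             "_source": {"my_vector": [0.8, 0.2, 0.1]}},
        ],
    },
}

for doc_id, score in extract_hits(sample_response):
    print(doc_id, score)
```

In the script above, the same helper could be applied to response.json() from the search request.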

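Notice that the script adds 1.0 to the result of cosineSimilarity. Cosine similarity ranges from -1 to 1, but Elasticsearch requires script_score to return a non-negative score, so without the offset a query can fail with an error like this: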
Error: {"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"script_score script returned an invalid score [-0.9679827] for doc [0]. Must be a non-negative score!"}]
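
A quick way to see why the offset is needed is to compute the cosine similarity of two vectors that point in opposite directions. This is a minimal plain-Python sketch with made-up vectors, not code that runs inside Elasticsearch.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Two vectors pointing in opposite directions (made up for illustration)
a = [1.0, 0.0, 0.0]
b = [-1.0, 0.0, 0.0]

raw = cosine_similarity(a, b)   # the lowest possible value, -1.0
shifted = raw + 1.0             # the offset moves the range from [-1, 1] to [0, 2]
print(raw, shifted)             # prints: -1.0 0.0
```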

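Another thing to notice is that we are using cosineSimilarity to calculate the score. However, if the vectors are normalized to unit length, or you simply want the dot product score of the vectors, you can switch to dotProduct. For unit-length vectors the dot product equals the cosine similarity, and dotProduct needs less computation because it skips the magnitude calculation, so it is potentially faster in Elasticsearch. Note that adding 1.0 only guarantees a non-negative score when the vectors are normalized. For example: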

data = {
    "query": {
        "script_score": {
            "query": {"match_all": {}},
            "script": {
                "source": "dotProduct(params.queryVector, 'my_vector') + 1.0",
                "params": {"queryVector": [0.1, 0.5, 0.3]}
            }
        }
    }
}
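
If your vectors are not already unit length, they can be normalized before indexing and before querying. The sketch below shows one way to do that in plain Python and checks that the dot product of the normalized vectors matches the cosine similarity of the originals; normalize and dot_product are illustrative helper names.

```python
import math

def normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]

def dot_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    return dot_product(a, b) / (norm_a * norm_b)

query = [0.1, 0.5, 0.3]
doc = [0.1, 0.7, 0.2]
unit_query = normalize(query)
unit_doc = normalize(doc)

# Both lines print the same value, up to floating-point rounding
print(dot_product(unit_query, unit_doc))
print(cosine_similarity(query, doc))
```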

One last thing: if you want to limit the number of records returned by the search, add size at the top level of the request body, next to query (not inside it):

data = {
    "query": {
        "script_score": {
            "query": {"match_all": {}},
            "script": {
                "source": "dotProduct(params.queryVector, 'my_vector') + 1.0",
                "params": {"queryVector": [0.1, 0.5, 0.3]}
            }
        }
    },
    "size": 5
}
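
Since the request body is just a Python dictionary, it can be convenient to build it with a small function so that the query vector, the scoring function, and size become parameters. This is only a sketch; build_search_body and its arguments are hypothetical names chosen for illustration.

```python
import json

ALLOWED_FUNCTIONS = ("cosineSimilarity", "dotProduct")

def build_search_body(query_vector, size=10, function="cosineSimilarity", field="my_vector"):
    """Build a script_score request body with size as a sibling of query."""
    if function not in ALLOWED_FUNCTIONS:
        raise ValueError(f"unsupported function: {function}")
    return {
        "query": {
            "script_score": {
                "query": {"match_all": {}},
                "script": {
                    "source": f"{function}(params.queryVector, '{field}') + 1.0",
                    "params": {"queryVector": list(query_vector)},
                },
            }
        },
        "size": size,
    }

body = build_search_body([0.1, 0.5, 0.3], size=5, function="dotProduct")
print(json.dumps(body, indent=2))
```

The resulting dictionary can be sent as the body of the _search request in the same way as the hand-written one.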

robot learner

https://datasciencebyexample.github.io/2023/03/18/elasticsearch-dense-vector-search/

Unless otherwise stated, all articles in this blog are licensed under CC BY 4.0. If you reproduce an article, please credit the source: robot learner.

elasticsearch
