FAISS index and normalization

Publish Date: 2023-01-18

Previously, we discussed how to implement real-time semantic search using sentence transformers and FAISS.

Here, we look more closely at indexing in FAISS.
The most popular indexes, and the ones to look at first, are also the simplest: flat indexes.

Flat indexes are ‘flat’ because we do not modify the vectors that we feed into them.

Because there is no approximation or clustering of our vectors, these indexes produce the most accurate results. That perfect search quality comes at a cost: every query is compared against every stored vector, so search time grows with the size of the dataset.
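To make the "no approximation" point concrete, here is a minimal NumPy sketch of what an exhaustive (flat) L2 search computes. It is an illustration only, not how FAISS is implemented internally; the helper name `flat_l2_search` and the random toy data are assumptions made up for this example.

```python
import numpy as np

def flat_l2_search(stored, queries, k):
    """Brute-force k-nearest-neighbour search by squared L2 distance.

    Hypothetical helper for illustration: every query is compared with
    every stored vector, which is why results are exact but slow.
    """
    # Pairwise squared distances, shape (n_queries, n_stored)
    diffs = queries[:, None, :] - stored[None, :, :]
    dists = np.sum(diffs ** 2, axis=2)
    # Indices of the k smallest distances per query, closest first
    ids = np.argsort(dists, axis=1)[:, :k]
    return np.take_along_axis(dists, ids, axis=1), ids

rng = np.random.default_rng(0)
stored = rng.random((1000, 8), dtype=np.float32)
queries = stored[:2].copy()  # query with the first two stored vectors
dists, ids = flat_l2_search(stored, queries, k=3)
print(ids[:, 0])  # each query's nearest neighbour is itself
```

The work done per query is proportional to the number of stored vectors, which is the search-time cost described above.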

Two flat indexes

Two common flat indexes:

IndexFlatL2, which uses Euclidean (L2) distance
IndexFlatIP, which uses the inner product (equal to cosine similarity when the vectors are normalized)
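The relationship between the inner product and cosine similarity can be checked with a few lines of NumPy. This is a small sketch on two made-up vectors; the variable names are chosen just for this example.

```python
import numpy as np

a = np.array([3.0, 4.0], dtype=np.float32)
b = np.array([6.0, 8.0], dtype=np.float32)  # same direction as a, twice as long

# The raw inner product depends on the lengths of the vectors
inner_product = float(a @ b)

# Cosine similarity divides the lengths out
cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Normalizing first makes the plain inner product match the cosine similarity
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
inner_product_unit = float(a_unit @ b_unit)

print(inner_product)  # 50.0
print(abs(inner_product_unit - cosine) < 1e-6)  # True
```

The two vectors point in the same direction, so their cosine similarity is 1, while the raw inner product (50) mostly reflects how long they are.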

Search speed is very similar for the two flat indexes, with IndexFlatIP slightly faster on larger datasets.
[Figure: query time vs. dataset size for IndexFlatL2 and IndexFlatIP]

How to normalize similarity metrics

If the vectors we index are not normalized, the similarity scores that FAISS returns are not normalized either.
Often we want cosine similarity instead, because its bounded range (-1 to 1) makes it easier to choose a meaningful threshold.

This is easy to do with FAISS: just make sure the vectors are normalized before indexing, and normalize the query vector before searching.

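Example code, at indexing time:

import faiss
import numpy as np
index = faiss.IndexIDMap(faiss.IndexFlatIP(768))  # 768 = embedding dimension
faiss.normalize_L2(encoded_data)  # normalizes in place; expects a float32 array
index.add_with_ids(encoded_data, np.arange(len(encoded_data), dtype=np.int64))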

At query time:

query_vector = model.encode([query])
k = 3
faiss.normalize_L2(query_vector)  # same in-place normalization as at indexing time
scores, ids = index.search(query_vector, k)  # scores are now cosine similarities

robot learner

https://datasciencebyexample.github.io/2023/01/18/fais-index-and-normalization/

Unless otherwise stated, all articles in this blog are licensed under CC BY 4.0. If you reproduce this article, please credit the source: robot learner.
