Real-time in-memory semantic search with deep learning embeddings and FAISS


We use a sentence transformer to encode short texts and then index the resulting embeddings with FAISS, an in-memory similarity search library.
Together, the two give real-time semantic search on a CPU alone; the example query below returns in about 0.04 seconds.

install packages if necessary

!pip install faiss-cpu
!pip install -U sentence-transformers

import libraries

import numpy as np
import torch
import os
import pandas as pd
import faiss
import time
from sklearn.datasets import fetch_20newsgroups
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('distilbert-base-nli-mean-tokens')
model.encode(['how are you'])[0].shape
(768,)

get data

data = fetch_20newsgroups()['data']
subjects = [item.split('\n')[1] for item in data]
subjects[:10]
['Subject: WHAT car is this!?',
 'Subject: SI Clock Poll - Final Call',
 'Subject: PB questions...',
 'Subject: Re: Weitek P9000 ?',
 'Subject: Re: Shuttle Launch Question',
 'Subject: Re: Rewording the Second Amendment (ideas)',
 'Subject: Brain Tumor Treatment (thanks)',
 'Subject: Re: IDE vs SCSI',
 'Subject: WIn 3.0 ICON HELP PLEASE!',
 'Subject: Re: Sigma Designs Double up??']
encoded_data = model.encode(subjects)
encoded_data.shape
(11314, 768)
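
The subject extraction above takes the second line of each message, which assumes the Subject header always sits there. The following is a minimal sketch of a more defensive variant that scans the header block for the first line starting with `Subject:`. The helper name `extract_subject` and the sample message are hypothetical, made up for illustration.

```python
def extract_subject(message: str, default: str = "") -> str:
    """Return the first 'Subject:' header line of a raw newsgroup message.

    Only the header block (everything before the first blank line) is scanned,
    so a 'Subject:' string that appears in the body is ignored.
    """
    for line in message.split("\n"):
        if line.strip() == "":      # a blank line ends the header block
            break
        if line.startswith("Subject:"):
            return line
    return default


# Hypothetical message, made up for illustration (not taken from the dataset)
sample = (
    "From: someone@example.com\n"
    "Subject: Re: IDE vs SCSI\n"
    "Lines: 3\n"
    "\n"
    "Subject: not a header\n"
)
print(extract_subject(sample))
```

With this helper, the list comprehension could become `[extract_subject(item) for item in data]`; messages without a Subject header would then yield the default value instead of an unrelated line.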

indexing the dataset

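# exact (brute-force) inner-product index over 768-dimensional vectors, wrapped so each vector carries an explicit id
index = faiss.IndexIDMap(faiss.IndexFlatIP(768))
# FAISS expects int64 ids; here the id is simply the row position in `subjects`
index.add_with_ids(encoded_data, np.arange(len(encoded_data), dtype=np.int64))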
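
`IndexFlatIP` ranks results by raw inner product, so vectors with a larger norm can score higher regardless of direction. If cosine similarity is preferred, one common option (not what this post does) is to scale every vector to unit length before indexing and before querying, since the inner product of unit vectors equals their cosine. The sketch below shows the idea with NumPy only; the helper `l2_normalize` and the toy vectors are assumptions made up for illustration.

```python
import numpy as np


def l2_normalize(vectors: np.ndarray) -> np.ndarray:
    """Return a float32 copy of `vectors` with every row scaled to unit length."""
    vectors = np.asarray(vectors, dtype=np.float32)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.maximum(norms, 1e-12)  # guard against all-zero rows


# Toy 2-d vectors, made up for illustration (not real embeddings)
docs = np.array([[3.0, 4.0], [1.0, 0.0], [0.6, 0.8]])
query = np.array([[0.3, 0.4]])

# Plain inner product: rows 0 and 2 point the same way but score differently
raw_scores = docs @ query.T
# Inner product of unit vectors: this is the cosine similarity
cos_scores = l2_normalize(docs) @ l2_normalize(query).T

print(raw_scores.ravel())
print(cos_scores.ravel())
```

To apply this to the index above, `encoded_data` would be normalized before `add_with_ids` and the query vector before `index.search`. FAISS also provides `faiss.normalize_L2`, which normalizes a float32 array in place.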

Serialize the index to disk. The serialized index can then be copied to any machine that hosts the search engine.

faiss.write_index(index, '20news')

Read the index back from disk for demo purposes (deserialization)

index = faiss.read_index('20news')

Now do the semantic search

def search(query):
    start = time.time()
    query_vector = model.encode([query])
    k = 5
    top_k = index.search(query_vector, k)
    print('spent time: {}'.format(time.time() - start))
    return [subjects[_id] for _id in top_k[1].tolist()[0]]

# type the query
# query=str(input())

query = "auto"
results = search(query)
print('results :')
for result in results:
    print('\t', result)
spent time: 0.035505056381225586
results :
     Subject: (w)rec.autos
     Subject: Re: DRIVE
     Subject: WHAT car is this!?
     Subject: Re: WHAT car is this!?
     Subject: Car AMP [Forsale]
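
A flat index such as `IndexFlatIP` performs an exhaustive search: it scores the query against every stored vector and keeps the `k` best. The NumPy sketch below reproduces that computation and returns `(scores, ids)` in the same order as `index.search`, which can serve as a sanity check on small data. The function name `brute_force_search` and the random stand-in data are assumptions made up for illustration; this is not how FAISS is implemented internally.

```python
import numpy as np


def brute_force_search(query_vectors: np.ndarray, data: np.ndarray, k: int = 5):
    """Exact top-k inner-product search.

    Returns (scores, ids), each of shape (n_queries, k), best match first.
    """
    scores = query_vectors @ data.T               # shape (n_queries, n_items)
    ids = np.argsort(-scores, axis=1)[:, :k]      # positions of the k largest scores
    return np.take_along_axis(scores, ids, axis=1), ids


# Random stand-in data, made up for illustration (not real embeddings)
rng = np.random.default_rng(0)
data = rng.standard_normal((1000, 768)).astype(np.float32)
query = data[42:43] * 2.0   # a scaled copy of item 42, so item 42 should rank first

scores, ids = brute_force_search(query, data, k=5)
print(ids[0])
```

The full sort is fine at this scale; for larger collections `np.argpartition` or FAISS itself would avoid sorting every score.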

Code link

github link


Author: robot learner
Reprint policy: Unless otherwise stated, all articles in this blog are licensed under CC BY 4.0. If you reproduce this article, please credit the source: robot learner.