We use a sentence transformer to encode short texts and then index the resulting vectors with FAISS, an in-memory similarity search library. Together they deliver real-time semantic search on CPU-only machines.
import numpy as np
import torch
import os
import pandas as pd
import faiss
import time
from sklearn.datasets import fetch_20newsgroups
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('distilbert-base-nli-mean-tokens')
model.encode(['how are you'])[0].shape
(768,)
get data
data = fetch_20newsgroups()['data']
subjects = [item.split('\n')[1] for item in data]   # the second line of each post is the Subject: header
subjects[:10]
['Subject: WHAT car is this!?',
'Subject: SI Clock Poll - Final Call',
'Subject: PB questions...',
'Subject: Re: Weitek P9000 ?',
'Subject: Re: Shuttle Launch Question',
'Subject: Re: Rewording the Second Amendment (ideas)',
'Subject: Brain Tumor Treatment (thanks)',
'Subject: Re: IDE vs SCSI',
'Subject: WIn 3.0 ICON HELP PLEASE!',
'Subject: Re: Sigma Designs Double up??']
encoded_data = model.encode(subjects)
encoded_data.shape
(11314, 768)
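Encoding all 11,314 subject lines takes a short while on CPU. If you want progress feedback or want to tune throughput, SentenceTransformer.encode also accepts batch_size and show_progress_bar arguments; the line below is simply an optional variant of the call above:

# optional variant of the encoding call above
encoded_data = model.encode(subjects, batch_size=64, show_progress_bar=True)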
indexing the dataset
index = faiss.IndexIDMap(faiss.IndexFlatIP(768))   # exact inner-product index with explicit integer ids
index.add_with_ids(encoded_data, np.array(range(0, len(encoded_data))))
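Note that IndexFlatIP ranks results by raw inner product, and the encoder's vectors are not unit length, so the scores are not exactly cosine similarities. If you prefer cosine ranking, a common variant (a sketch only, not used in the rest of this post) is to L2-normalize the vectors before adding them and before every search:

# optional sketch: a cosine-similarity variant of the index above
normalized = np.array(encoded_data, dtype='float32')        # work on a copy, keep the original untouched
faiss.normalize_L2(normalized)                              # in-place L2 normalization
cosine_index = faiss.IndexIDMap(faiss.IndexFlatIP(768))
cosine_index.add_with_ids(normalized, np.arange(len(normalized), dtype='int64'))
# query vectors would need the same normalization before calling cosine_index.search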
serializing the index to disk. The serialized index can then be copied to any machine that hosts the search engine
faiss.write_index(index, '20news')
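One detail to keep in mind when shipping the index to another machine: the index stores only the vectors and their integer ids, so the hosting side also needs the id-to-subject mapping. A minimal sketch (the file name 20news_subjects.txt is just an example) could be:

# persist the id -> subject mapping alongside the index file
with open('20news_subjects.txt', 'w') as f:
    f.write('\n'.join(subjects))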
read the index back from disk for demo purposes, the so-called deserialization
index = faiss.read_index('20news')
Now do the semantic search
def search(query):
    # encode the query, search the index, and map the returned ids back to subject lines
    start = time.time()
    query_vector = model.encode([query])
    k = 5
    top_k = index.search(query_vector, k)   # returns (distances, ids), each of shape (1, k)
    print('spent time: {}'.format(time.time() - start))
    return [subjects[_id] for _id in top_k[1].tolist()[0]]
# type the query
# query = str(input())

query = "auto"
results = search(query)
print('results :')
for result in results:
    print('\t', result)
spent time: 0.035505056381225586
results :
Subject: (w)rec.autos
Subject: Re: DRIVE
Subject: WHAT car is this!?
Subject: Re: WHAT car is this!?
Subject: Car AMP [Forsale]
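FAISS can also take a whole batch of query vectors in a single search call, which keeps per-query latency low when many queries arrive at once. Here is a rough sketch of a batched variant of the search function above (the extra query string is only an example):

# optional sketch: answer several queries with one index.search call
def batch_search(queries, k=5):
    query_vectors = model.encode(queries)             # shape: (num_queries, 768)
    distances, ids = index.search(query_vectors, k)   # each of shape (num_queries, k)
    return [[subjects[i] for i in row] for row in ids.tolist()]

queries = ["auto", "space shuttle"]
for q, hits in zip(queries, batch_search(queries)):
    print(q, '->', hits[0])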