Background
I have data in a MongoDB collection (not a lot, roughly 100 MB) made up of several documents. Using pymongo, I query the collection with some user input, and the query has to find and return the most appropriate document. The data will not be updated regularly.
I have decided to use Elasticsearch to search for and return the relevant document. My workflow, run on every query, is:
Retrieve all documents from MongoDB --> bulk insert them into Elasticsearch --> search with Elasticsearch --> return the top document.
I am using pymongo and elasticsearch-py.
Problem
Because I bulk insert on every query, the documents keep accumulating: the count variable grows each time by the number of documents in the collection. As I understand it, I need to bulk insert only once, but guarding the bulk call so that it runs only when count == 0 seems a bit hacky.
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk
from pymongo import MongoClient
es = Elasticsearch()
INDEX_NAME = "documents"
TYPE = "document"
stop_words = []
client = MongoClient("URI")
db = client["database"]
collection = db["collection"]
def make_documents():
    # one bulk action per MongoDB document
    for faq in collection.find():
        doc = {
            '_op_type': 'create',
            '_index': INDEX_NAME,
            '_type': TYPE,
            '_source': {'text': faq['answer']}
        }
        yield doc
# put documents in index in bulk
bulk(es, make_documents())
# count the matches
count = es.count(index=INDEX_NAME, doc_type=TYPE, body={"query": {"match_all": {}}})
# now we can do searches.
print("Ok. I've got an index of {0} documents. Let's do some searches...".format(count['count']))
while True:
    try:
        query = input("Enter a search: ")
        result = es.search(index=INDEX_NAME, doc_type=TYPE, body={"query": {"match": {"text": query.strip()}}})
        if result.get('hits') is not None and result['hits'].get('hits') is not None:
            print(result['hits']['hits'])
        else:
            print({})
    except KeyboardInterrupt:
        break
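One thing I noticed: make_documents() above sends 'create' actions without an '_id', so Elasticsearch assigns a fresh ID on every run, which would explain the accumulation. A minimal sketch of an alternative, assuming each MongoDB '_id' can serve as the Elasticsearch '_id' (the helper name make_actions, its parameters, and the sample data are made up for illustration, and I have not run this against a real cluster):

```python
def make_actions(faqs, index_name="documents", doc_type=None):
    """Yield bulk actions whose _id is derived from the MongoDB _id."""
    for faq in faqs:
        action = {
            "_op_type": "index",       # index overwrites an existing _id instead of adding a copy
            "_index": index_name,
            "_id": str(faq["_id"]),    # assumption: the MongoDB _id is a suitable stable key
            "_source": {"text": faq["answer"]},
        }
        if doc_type is not None:       # only needed on clusters that still use mapping types
            action["_type"] = doc_type
        yield action


# Stand-in for collection.find(); real documents would carry ObjectId values.
sample = [{"_id": "a1", "answer": "first"}, {"_id": "b2", "answer": "second"}]
actions = list(make_actions(sample))
```

Passing make_actions(collection.find(), INDEX_NAME, TYPE) to bulk() should then rewrite the same documents on each run rather than append new ones, though it still repeats the indexing work.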
I just need to load all the data into Elasticsearch once and then query it many times without updating Elasticsearch. My approach may be wildly incorrect, as I am really new to Elasticsearch, and I feel there has to be a better way.
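To make the "load once" idea concrete, here is a rough sketch of the kind of guard I have in mind, keyed on whether the index already exists rather than on the document count. The function name load_once and the bulk_fn parameter are hypothetical. It assumes the client exposes es.indices.exists(index=...) and that the bulk call creates the index on first insert, which I believe is the default but have not verified for every version:

```python
def load_once(es, index_name, actions, bulk_fn):
    """Bulk-load actions only when the index does not exist yet.

    Returns True if a load happened, False if the index was already there.
    bulk_fn is expected to behave like elasticsearch.helpers.bulk(es, actions).
    """
    if es.indices.exists(index=index_name):
        # Index already present: assume it was loaded earlier and skip the insert.
        return False
    bulk_fn(es, actions)
    return True
```

In my script this would be called as load_once(es, INDEX_NAME, make_documents(), bulk) before the search loop. One caveat: if a load fails halfway, the index exists but is incomplete, so it would have to be deleted by hand before retrying.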