How to bulk index data with Elasticsearch


Previously, we discussed how to index and query data using Elasticsearch in Python.

However, indexing large amounts of data in Elasticsearch can be a challenging task, especially if you need to index millions of documents or more. Fortunately, Elasticsearch provides a powerful API endpoint called _bulk that allows you to index multiple documents in a single request, which can greatly improve indexing performance.

In this article, we’ll explore how to use the _bulk API endpoint in Elasticsearch to index large amounts of data efficiently. We’ll start by discussing the _bulk API endpoint and its requirements, and then we’ll provide some examples of how to use it in Python using the requests library.

What is the _bulk API endpoint?

The _bulk API endpoint in Elasticsearch allows you to index, update, or delete multiple documents in a single request. This can be much more efficient than sending individual requests for each document, especially when dealing with large amounts of data.

The _bulk endpoint accepts a newline-delimited JSON (NDJSON) payload that specifies the operations to perform on each document. Each line in the payload represents a single operation, and each operation consists of a JSON object that specifies the index, update, or delete action to perform on a single document.

Here’s an example of what a _bulk payload might look like:

POST my_index/_bulk
{"index":{"_id":1}}
{"name":"John Doe","age":35,"city":"New York"}
{"index":{"_id":2}}
{"name":"Jane Doe","age":28,"city":"San Francisco"}
{"index":{"_id":3}}
{"name":"Bob Smith","age":42,"city":"Chicago"}

In this example, we're indexing three documents into the my_index index. Each action line uses the index action to specify the operation type and the _id of the document, and the following line contains the document source as a separate JSON object.

Note that each action and document line is separated by a newline character (\n), and the request body must end with a trailing newline; the payload is NDJSON, not a single wrapped JSON object. You can include multiple index, update, or delete actions in a single _bulk request, and Elasticsearch will process them all in one go.
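To make the format concrete, here is a small sketch that builds a mixed-action _bulk payload in Python (the document IDs and field values are made up for illustration). Index and update actions are each followed by a body line, while a delete action stands alone:

```python
import json

# Each entry pairs an action line with an optional body line.
# Index and update actions take a body; delete does not.
actions = [
    ({"index": {"_id": 1}}, {"name": "John Doe", "age": 35}),
    ({"update": {"_id": 2}}, {"doc": {"city": "Boston"}}),
    ({"delete": {"_id": 3}}, None),
]

lines = []
for action, body in actions:
    lines.append(json.dumps(action))
    if body is not None:
        lines.append(json.dumps(body))

# The payload must end with a trailing newline,
# or Elasticsearch will reject the request.
payload = "\n".join(lines) + "\n"
print(payload)
```

The resulting string can be sent as the request body of a POST to the _bulk endpoint.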

Using _bulk with Python and requests

Now that we understand the basics of the _bulk API endpoint, let’s look at how to use it in Python using the requests library.

Suppose we have a list of data that we want to index in Elasticsearch. Here’s an example of how we might loop through the list of data and call the _bulk API endpoint using requests:

import json
import requests

# Example list of data to index
data_list = [
{"name": "John Doe", "age": 35, "city": "New York"},
{"name": "Jane Doe", "age": 28, "city": "San Francisco"},
{"name": "Bob Smith", "age": 42, "city": "Chicago"}
]

# Elasticsearch settings
es_url = "http://localhost:9200"
es_index = "my_index"

# Bulk index the data in batches
bulk_data = ""
for i, data in enumerate(data_list):
    # Add the index action and ID for each document
    bulk_data += json.dumps({"index": {"_id": i + 1}}) + "\n"
    # Add the document source
    bulk_data += json.dumps(data) + "\n"

    # Send a bulk request every 1000 documents
    if (i + 1) % 1000 == 0:
        response = requests.post(
            f"{es_url}/{es_index}/_bulk",
            headers={"Content-Type": "application/x-ndjson"},
            data=bulk_data,
        )
        # Reset the buffer for the next batch
        bulk_data = ""

# Send the final bulk request, if any documents remain
if bulk_data:
    response = requests.post(
        f"{es_url}/{es_index}/_bulk",
        headers={"Content-Type": "application/x-ndjson"},
        data=bulk_data,
    )
    # Print the response content
    print(response.content)
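One thing to watch out for: a _bulk request can return HTTP 200 even when individual documents fail to index, so it's worth checking the "errors" flag and the per-item results in the response body. The helper below is a minimal sketch, and the sample response dict is illustrative rather than taken from a live cluster:

```python
def failed_items(bulk_response: dict) -> list:
    """Return the per-item results that contain an error."""
    # Fast path: Elasticsearch sets "errors" to False when everything succeeded.
    if not bulk_response.get("errors"):
        return []
    failures = []
    for item in bulk_response.get("items", []):
        # Each item is keyed by its action type ("index", "update", "delete").
        for action, result in item.items():
            if "error" in result:
                failures.append(result)
    return failures

# Illustrative example of a partial failure (not real cluster output)
sample_response = {
    "errors": True,
    "items": [
        {"index": {"_id": "1", "status": 201}},
        {"index": {"_id": "2", "status": 400,
                   "error": {"type": "mapper_parsing_exception"}}},
    ],
}

print(failed_items(sample_response))
```

In the real script above, you would call this with response.json() after each bulk request and decide whether to retry or log the failed documents.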



Author: robot learner