How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
I originally posted an issue about Ray Tune here: "Tune.run works but TuneGridSearchCV.fit does not work for me".
However, after talking to Matthew, I found that the issue is actually related to the local object store. Whenever I run something on the cluster, the Plasma % for the worker nodes always stays at 0%. To highlight the error, I ran the following script, which I modified from another issue I found online:
import numpy as np
from time import sleep

import ray


@ray.remote
def calc_similarity(sims, offset):
    # Fake some work for 100 ms.
    sleep(0.10)
    return True


if __name__ == "__main__":
    ray.shutdown()
    ray.init(address='auto')

    num_docs = 1000000
    num_dimensions = 300
    chunk_size = 128
    sim_pct = 0.82

    # Initialize the array.
    index = np.random.random((num_docs, num_dimensions)).astype(dtype=np.float32)
    index_array = np.arange(num_docs).reshape(1, num_docs)
    index_array_id = ray.put(index_array)

    calc_results = []
    for count, start_doc_no in enumerate(range(0, num_docs, chunk_size)):
        size = min(chunk_size, num_docs - start_doc_no + 1)
        # Get the query vector out of the index.
        query_vector = index[start_doc_no:start_doc_no + size]
        # Calculate the matrix multiplication.
        result_transformed = np.matmul(index, query_vector.T).T
        # Serialize the result matrix out for each client.
        result_id = ray.put(result_transformed)
        # Simulate multi-threading extracting the results of a cosine similarity calculation.
        for offset in range(chunk_size):
            calc_results.append(calc_similarity.remote(sims=result_id, offset=offset))
            # , index_array=index_array_id))
        res = ray.get(calc_results)
        calc_results.clear()
When I run this script, I get many errors like the following:
At least one of the input arguments for this task could not be computed:
ray.exceptions.OwnerDiedError: Failed to retrieve object 00ffffffffffffffffffffffffffffffffffffff0500000002000000. To see information about where this ObjectRef was created in Python, set the environment variable RAY_record_ref_creation_sites=1 during `ray start` and `ray.init()`.
The object's owner has exited. This is the Python worker that first created the ObjectRef via `.remote()` or `ray.put()`. Check cluster logs (`/tmp/ray/session_latest/logs/*05000000ffffffffffffffffffffffffffffffffffffffffffffffff*` at IP address 10.0.3.103) for more information about the Python worker failure.
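As the error message suggests, my next step is to re-run the reproduction with RAY_record_ref_creation_sites=1 so Ray records where the dead owner's ObjectRef was created. A minimal sketch of how I'd set it, assuming the variable only needs to be exported before ray start on each node and before launching the driver (repro.py is just a placeholder name for the script above):

# On every node, before starting Ray (head shown here; workers analogous):
export RAY_record_ref_creation_sites=1
ray start --head --port=6379 --dashboard-host=0.0.0.0 --dashboard-port=443

# On the driver side, with the same variable set:
RAY_record_ref_creation_sites=1 python repro.py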
I also noticed on the Ray Dashboard that the Plasma % for the head node increases, but the Plasma % for the worker nodes stays at 0%.
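If it is useful, the same object store information should also be visible from the command line; a sketch of what I can run on the head node, assuming the stock Ray CLI commands report this for every node in the cluster:

# Print a summary of cluster resources and their usage
ray status

# List the objects held in the cluster's object store, with owner and call-site info
ray memory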
Here is my configuration:
- The host operating system is Linux (Ubuntu 20.04).
- The containers are Linux containers (LXC).
- We’re developing this on the DNAnexus platform (https://documentation.dnanexus.com/). I’m spinning up a cluster of 5 identical EC2 instances. We opened up all the ports from 6500 to 65535 to see whether that would resolve the issue, but it didn’t help.
Reproduction script
Run these commands in the nodes’ containers. For the head node:
ray start --head --port=6379 --dashboard-host=0.0.0.0 --dashboard-port=443
and for the worker nodes:
ray start --address="$head_node_ip:6379" --node-ip-address="y.y.y.y"
A quirk of running inside LXC containers is that the head node’s IP address always appears as 10.0.3.103, which is the IP address of the container (setting the head node’s IP address manually to 0.0.0.0 with the --node-ip-address argument causes errors, because the Ray job still expects the head IP address to be 10.0.3.103). I have to set the worker nodes’ IP addresses manually with --node-ip-address because all of the nodes have the same host name and would otherwise default to the same IP address, 10.0.3.103.
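Since the worker traffic has to pass through the containers’ firewall rules, one thing I plan to try is pinning the ports Ray uses so they all fall inside the 6500 to 65535 range we opened. A sketch of the worker-side command, assuming the documented ray start port flags behave as described; the specific port numbers below are only illustrative:

ray start --address="$head_node_ip:6379" --node-ip-address="y.y.y.y" \
    --object-manager-port=8076 --node-manager-port=8077 \
    --min-worker-port=10002 --max-worker-port=19999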