Local object store on worker nodes not working, worker plasma stays at 0%

daquang · June 30, 2022, 7:30am

How severe does this issue affect your experience of using Ray?

High: It blocks me to complete my task.

I originally posted an issue about Ray Tune here: Tune.run works but TuneGridSearchCV.fit does not work for me - Ray Tune - Ray

However, after talking to Matthew, I found that the issue in actuality is related to the local object store. Whenever I run something on a cluster, I noticed that the plasma % for the worker nodes always stays at 0%. To highlight this error, I ran the following script which I modified from another issue I found online:

import numpy as np
from time import sleep
import ray

@ray.remote
def calc_similarity(sims, offset):
    # Fake some work for 100 ms.
    sleep(0.10)
    return True

if __name__ == "__main__":
    ray.shutdown()
    ray.init(address='auto')

    num_docs = 1000000
    num_dimensions = 300
    chunk_size = 128
    sim_pct = 0.82

    # Initialize the array
    index = np.random.random((num_docs, num_dimensions)).astype(dtype=np.float32)
    index_array = np.arange(num_docs).reshape(1, num_docs)
    index_array_id = ray.put(index_array)

    calc_results = []

    for count, start_doc_no in enumerate(range(0, num_docs, chunk_size)):
        size = min( chunk_size, num_docs - (start_doc_no) + 1 )
        # Get the query vector out of the index.
        query_vector = index[start_doc_no:start_doc_no+size]
        # Calculate the matrix multiplication.
        result_transformed = np.matmul(index, query_vector.T).T
        # Serialize the result matrix out for each client.
        result_id = ray.put(result_transformed)

        # Simulate multi-threading extracting the results of a cosine similarity calculation
        for offset in range(chunk_size):
            calc_results.append(calc_similarity.remote(sims=result_id, offset=offset ))
            # , index_array=index_array_id))
        res = ray.get(calc_results)
        calc_results.clear()

When I ran this script, I get many of the following errors:

At least one of the input arguments for this task could not be computed:
ray.exceptions.OwnerDiedError: Failed to retrieve object 00ffffffffffffffffffffffffffffffffffffff0500000002000000. To see information about where this ObjectRef was created in Python, set the environment variable RAY_record_ref_creation_sites=1 during `ray start` and `ray.init()`.

The object's owner has exited. This is the Python worker that first created the ObjectRef via `.remote()` or `ray.put()`. Check cluster logs (`/tmp/ray/session_latest/logs/*05000000ffffffffffffffffffffffffffffffffffffffffffffffff*` at IP address 10.0.3.103) for more information about the Python worker failure.

I also noticed on the Ray Dashboard that the Plasma % for the head node increases, but the Plasma % for the worker nodes stay at 0.

Here is my configuration:

The host operating system is Linux (Ubuntu 20.4)
The containers are Linux containers (LXC).

We’re developing this on the DNAnexus platform (https://documentation.dnanexus.com/ ). I’m spinning up a cluster of 5 identical EC2 instances. We opened up all the ports from 6500 to 65535 to see if that resolves the issue, but that didn’t help.

Reproduction script

Run these statements in the nodes containers.

ray start --head --port=6379 --dashboard-host=0.0.0.0 --dashboard-port=443

and for worker nodes

ray start --address=“$head_node_ip:6379” --node-ip-address=“y.y.y.y”

Unusually, due to the fact we’re running inside LXC containers, the head IP address always appears as 10.0.3.103, which is the IP address of the container (setting the head node’s IP address manually to 0.0.0.0 with the --node-ip-address argument causes errors because the Ray job still expects the head IP address to be 10.0.3.103). I absolutely must manually set the worker nodes’ IP address with --node-ip-address because all nodes have the same host name and default to the same ip address of 10.0.3.103.

Alex · June 30, 2022, 8:49pm

Which ray version are you using? This sounds very sinmilar to a bug we fixed in ray 1.13. [Core] Account for spilled objects when reporting object store memory usage by mwtian · Pull Request #23425 · ray-project/ray · GitHub

daquang · June 30, 2022, 10:33pm

@Alex I am using Python 3.8 and Ray 1.13.0

daquang · July 5, 2022, 7:36pm

Thanks to Matthew, I was able to get a script that reproduces that error much faster:

import ray
import numpy as np
ray.shutdown()
ray.init(address='auto')
@ray.remote
 def f():
     return np.arange(100_000)
 
ip_resource = "node:<WORKER_NODE_IP>"
result = ray.get(f.options(resources={ip_resource: 0.01}).remote())

When I ran this script (replacing <WORKER_NODE_IP> with the ip address of a worker node), I got the following error:

(f pid=10363) [2022-07-05 19:30:26,734 E 10363 10684] core_worker.h:1020: Mismatched WorkerID: ignoring RPC for previous worker 04000000ffffffffffffffffffffffffffffffffffffffffffffffff, current worker ID: fcf9cd5f60a0d8083861064bd01833344ea796960eca7e8195389c7e
(f pid=10363) [2022-07-05 19:30:27,035 E 10363 10684] core_worker.h:1020: Mismatched WorkerID: ignoring RPC for previous worker 04000000ffffffffffffffffffffffffffffffffffffffffffffffff, current worker ID: fcf9cd5f60a0d8083861064bd01833344ea796960eca7e8195389c7e

Stephanie_Wang · July 7, 2022, 7:01pm

I know you tried this on your earlier script, but what happens when you remove the ray.shutdown() line?

Also, I believe those messages are just warnings and may not actually interfere with the job itself. Can you describe what symptoms you’re seeing from the script? In particular, does the ray.get line hang or throw an exception?

daquang · July 13, 2022, 8:10am

Hu @Stephanie_Wang. These are great questions. As per your instructions I removed the ray.shutdown line and got the same error. The ray.get line hangs and does not throw an exception. Matthew theorized this may be an issue with ports, although I do not know how to diagnose this properly. Even after opening all ports I’m getting the same issue. My current working theory is there might be some latency issue between the EC2 instances or some sort of firewall is preventing data from being shared between nodes, but again I have no idea on how to test this.

Stephanie_Wang · July 13, 2022, 7:04pm

Thanks! A few things:

What happens if you do not include the --node-ip-address flag at all during ray start? It should be okay to start multiple Ray nodes with the same hostname.
Can you attach the Ray system logs after one of the script runs? You can do this by running ray cluster-dump --host localhost.
Can you also include the output of running ray.init(address="auto"); ray.nodes()?

daquang · July 15, 2022, 6:50am

What happens if you do not include the --node-ip-address flag at all during ray start? It should be okay to start multiple Ray nodes with the same hostname.

We found that we must set --node-ip-address for the worker nodes. Otherwise, all nodes, including the head node and worker nodes, will default to the same IP address of 10.0.3.103, which is just the IP address of the LXC container (note that each node is running its own LXC container and hence have the same container IP address in common). All the nodes would still show up on the dashboard if I omit --node-ip-address, but no workers would ever get distributed.

Can you attach the Ray system logs after one of the script runs? You can do this by running ray cluster-dump --host localhost.

I ran the script that hangs at ray.get and dumped the logs into this file: Dropbox - collected_logs_2022-07-15_06-04-27.tar.gz - Simplify your life

This was run in a 3 node cluster of P3 instances.

Notably, the python-core-driver-01000000ffffffffffffffffffffffffffffffffffffffffffffffff_8370.log file displays a lot of errors like this:

[2022-07-15 06:04:27,025 W 8370 8370] plasma_store_provider.cc:461: Objects c8ef45ccd0112571ffffffffffffffffffffffff0100000001000000 are still not local after 212s. If this message continues to print, ray.get() is likely hung. Please file an issue at https://github.com/ray-project/ray/issues/.

Can you also include the output of running ray.init(address="auto"); ray.nodes()?

Here’s the ray.init(address=“auto”) output:

RayContext(dashboard_url='10.0.3.103:443', python_version='3.8.10', ray_version='1.13.0', ray_commit='e4ce38d001dbbe09cd21c497fedd03d692b2be3e', address_info={'node_ip_address': '10.0.3.103', 'raylet_ip_address': '10.0.3.103', 'redis_address': None, 'object_store_address': '/tmp/ray/session_2022-07-15_05-43-03_077214_7641/sockets/plasma_store', 'raylet_socket_name': '/tmp/ray/session_2022-07-15_05-43-03_077214_7641/sockets/raylet', 'webui_url': '10.0.3.103:443', 'session_dir': '/tmp/ray/session_2022-07-15_05-43-03_077214_7641', 'metrics_export_port': 64107, 'gcs_address': '10.0.3.103:6379', 'address': '10.0.3.103:6379', 'node_id': 'b13e4141cacdaeb064e0f0d21cbb308b9ab8f00d249340d87ad230b2'})

Here’s the ray.nodes output:

[{'NodeID': 'b13e4141cacdaeb064e0f0d21cbb308b9ab8f00d249340d87ad230b2',
  'Alive': True,
  'NodeManagerAddress': '10.0.3.103',
  'NodeManagerHostname': 'job-GF8Pj8j0fbq53q6VB2x6J8fQ',
  'NodeManagerPort': 37453,
  'ObjectManagerPort': 40961,
  'ObjectStoreSocketName': '/tmp/ray/session_2022-07-15_05-43-03_077214_7641/sockets/plasma_store',
  'RayletSocketName': '/tmp/ray/session_2022-07-15_05-43-03_077214_7641/sockets/raylet',
  'MetricsExportPort': 64107,
  'NodeName': '10.0.3.103',
  'alive': True,
  'Resources': {'accelerator_type:V100': 1.0,
   'GPU': 1.0,
   'memory': 36180111360.0,
   'node:10.0.3.103': 1.0,
   'CPU': 8.0,
   'object_store_memory': 18090055680.0}},
 {'NodeID': 'cd5cc5695e8f750567e019171c6824a4f7ae24cd95d4dd40025ae5d3',
  'Alive': True,
  'NodeManagerAddress': 'ip-172-31-70-192.ec2.internal',
  'NodeManagerHostname': 'job-GF8Pj8j0fbq53q6VB2x6J8fQ',
  'NodeManagerPort': 34089,
  'ObjectManagerPort': 37295,
  'ObjectStoreSocketName': '/tmp/ray/session_2022-07-15_05-43-03_077214_7641/sockets/plasma_store',
  'RayletSocketName': '/tmp/ray/session_2022-07-15_05-43-03_077214_7641/sockets/raylet',
  'MetricsExportPort': 61739,
  'NodeName': 'ip-172-31-70-192.ec2.internal',
  'alive': True,
  'Resources': {'GPU': 1.0,
   'memory': 43026413159.0,
   'CPU': 8.0,
   'node:ip-172-31-70-192.ec2.internal': 1.0,
   'accelerator_type:V100': 1.0,
   'object_store_memory': 18439891353.0}},
 {'NodeID': '4e46726d4f9211dac999216c19879b01abb931a3a4a4d9fd9afe78d6',
  'Alive': True,
  'NodeManagerAddress': 'ip-172-31-68-115.ec2.internal',
  'NodeManagerHostname': 'job-GF8Pj8j0fbq53q6VB2x6J8fQ',
  'NodeManagerPort': 40297,
  'ObjectManagerPort': 42449,
  'ObjectStoreSocketName': '/tmp/ray/session_2022-07-15_05-43-03_077214_7641/sockets/plasma_store',
  'RayletSocketName': '/tmp/ray/session_2022-07-15_05-43-03_077214_7641/sockets/raylet',
  'MetricsExportPort': 60379,
  'NodeName': 'ip-172-31-68-115.ec2.internal',
  'alive': True,
  'Resources': {'GPU': 1.0,
   'memory': 43029905408.0,
   'object_store_memory': 18441388032.0,
   'accelerator_type:V100': 1.0,
   'node:ip-172-31-68-115.ec2.internal': 1.0,
   'CPU': 8.0}}]

daquang · August 31, 2022, 9:16am

Reposted from the previous thread:

Problem solved. Turns out it required me to set the --node-ip-address argument to the head node’s ip address for the ray start --head command and to change the ray.init(address=‘auto’) line to ray.init(address=‘auto’, _node_ip_address=“head node’s ip address”)

Topic		Replies	Views
Proxying call to GetObject failed Ray Client	5	744	November 11, 2021
Ray on slurm - different ip addresses of worker nodes Ray Clusters	14	3671	August 29, 2023
Cluster dosen't distribute workloads Ray Clusters	0	403	November 10, 2022
Big cluster job failing due to SIGBUS in plasma Ray Core	16	937	July 12, 2021
Running Ray on Slurm Cluster	11	1256	January 31, 2021

Local object store on worker nodes not working, worker plasma stays at 0%

Reproduction script

Related topics