Memory error in distributed multiprocessing

Hello, I created a head node and then added 2 worker nodes from the terminal. I ran a program that processes data and then writes the results to a CSV file. The program runs to the end, and the head node finishes its tasks first. After that, one worker node with 3.3 GB of RAM fails with a memory error.

Can you give me the output of ray memory before it fails?

Also, what version of Ray are you using?

Actually, I lost that code when my system crashed. I will share the error in a few days, once I am back at the same point. In the meantime, I have another issue where I need some help: when I use Ray's distributed multiprocessing.Pool with starmap, the cluster initially uses all the CPUs specified in the cluster, but later only one system's CPUs are used and the rest sit idle. How can I make use of them? Currently I am creating the cluster manually using 3 local systems.

Hmm, it is a bit hard to say why this happens without looking at it.

  • What do you mean by 3 local systems? (Do you run ray start 3 times on your machine?)
  • Also, there’s a command called ray status that shows the number of utilized CPUs.
  • One other possibility is that some of your tasks are small, so Ray doesn’t feel the need to schedule tasks on other nodes.
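If small tasks turn out to be the cause, one thing to try is passing an explicit chunksize to starmap so that each submitted unit of work is larger. Below is a minimal sketch, not a confirmed fix: process and pick_chunksize are hypothetical names, the chunks_per_cpu value of 4 is an arbitrary assumption, and ray_address="auto" assumes a cluster is already running and reachable from the machine where the script runs.

```python
import math


def pick_chunksize(num_items, num_cpus, chunks_per_cpu=4):
    """Hypothetical helper: split the work into roughly chunks_per_cpu chunks per CPU."""
    return max(1, math.ceil(num_items / (num_cpus * chunks_per_cpu)))


def process(a, b):
    # Stand-in for the real per-item computation.
    return a * b


def run_on_cluster(args, num_cpus):
    # Imported here so the helpers above can be used without Ray installed.
    from ray.util.multiprocessing import Pool

    # Assumption: a cluster was already started with `ray start` on each machine.
    pool = Pool(ray_address="auto")
    try:
        chunksize = pick_chunksize(len(args), num_cpus)
        return pool.starmap(process, args, chunksize=chunksize)
    finally:
        pool.close()
```

For example, run_on_cluster([(1, 2), (3, 4)], num_cpus=32) would be called from a script on the head node; the right chunk size depends on how long each item takes, so treat the helper as a starting point.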

Also, is your workload under memory pressure? What version of Ray and which OS are you using?

I am using Ray 0.8.6 on Ubuntu 18.04 and Ubuntu 20.04 LTS.
By 3 local systems, I mean one laptop as the head node and 2 computers as workers.

The computation is heavy and I am using 32 threads in total. Initially, the program uses all 32 of them for 30 minutes or so, and then the two worker computers stop getting work and all remaining processing is done only on the head node.

I am using the Ray dashboard web page to monitor the cores through localhost:port.

Please help me with this.

If you run ray status when the other 2 nodes stop processing tasks, what does the output look like?

Hi, ray status gives the following error:

Try 'ray --help' for help.

Error: No such command 'status'.

This is the dashboard status

What version of ray are you using? Can you try 1.2.0 or the latest master?
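On a version without the status command, roughly the same numbers can be read from Python. This is a minimal sketch under assumptions: used_resources and report_cluster_usage are hypothetical names, and address="auto" assumes the script runs on a machine that is already part of the running cluster. I have not checked how this behaves on 0.8.6 specifically.

```python
def used_resources(total, available):
    """Per-resource usage, computed as total minus currently available."""
    return {name: amount - available.get(name, 0.0) for name, amount in total.items()}


def report_cluster_usage():
    # Imported here so used_resources can be used without Ray installed.
    import ray

    # Assumption: attach to the already running cluster instead of starting a new one.
    ray.init(address="auto")
    total = ray.cluster_resources()
    available = ray.available_resources()
    for name, used in sorted(used_resources(total, available).items()):
        print(f"{used}/{total[name]} {name}")
```

Running report_cluster_usage() while the workers look idle should show whether the CPU count in use drops below the cluster total.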

I tried Ray 1.2.0.
The following is the output of ray status:

======== Cluster status: 2021-02-23 21:33:15.387934 ========
Node status

1 node(s) with resources: {'memory': 346.0, 'node:': 1.0, 'object_store_memory': 119.0, 'CPU': 12.0}
1 node(s) with resources: {'node:': 1.0, 'object_store_memory': 183.0, 'GPU': 2.0, 'CPU': 8.0, 'accelerator_type:RTX': 1.0, 'memory': 620.0}


0.00/47.168 GiB memory
20.0/20.0 CPU
0.00/14.746 GiB object_store_memory
0.0/2.0 GPU
0.0/1.0 accelerator_type:RTX

(no resource demands)

In this version, it works fine. I guess 0.8.6 had some batch input issue that was resolved in later versions.

Thanks for helping me out.


That’s awesome! Glad it is working. Lmk if you need any other help!