Memory error in distributed multiprocessing

Swadhin_Agrawal · February 19, 2021, 8:35am

Hello, I created a head node and then added 2 more worker nodes from the terminal. I ran a program that processes data and then writes to a CSV file. The program runs to the end, and the head node finishes its tasks first… After that, one worker node with 3.3 GB RAM fails with a memory error.

sangcho · February 19, 2021, 8:42am

Can you give me the output of ray memory before it fails?

Also, what version of Ray are you using?

Swadhin_Agrawal · February 19, 2021, 8:49am

Actually, I lost that code when my system crashed. I will share the error in a few days when I am back at the same point… In the meantime, I have another issue where I need some help. When I use distributed Ray multiprocessing.Pool starmap, the cluster initially uses all the CPUs specified in the cluster, but later only one system's CPUs are being used and the rest are sitting idle. How can I make use of those? Currently I am creating the cluster manually using 3 local systems.

sangcho · February 19, 2021, 8:52pm

Hmm, it is a bit hard to say why this happens without looking at it.

What do you mean by 3 local systems? (Do you run ray start 3 times on your machine?)
Also, there’s a command called ray status, and you can use it to see the number of utilized CPUs.
One other possibility is that some of your tasks are small, so Ray doesn’t see the need to schedule tasks on other nodes.

Also, is your workload under memory pressure? Which version of Ray and which OS are you using?
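
For reference, the point about small tasks can be sketched as follows. This is only an illustration, not a confirmed fix for this thread: `pick_chunksize`, `process`, and `run_on_cluster` are hypothetical names, and the "a few chunks per worker" heuristic is an assumption. It relies on `ray.util.multiprocessing.Pool` accepting `ray_address` and on `starmap` accepting `chunksize`, mirroring the standard `multiprocessing.Pool` API.

```python
import math


def pick_chunksize(n_items, n_workers, chunks_per_worker=4):
    """Hypothetical heuristic: aim for a few chunks per worker process."""
    if n_items <= 0 or n_workers <= 0:
        return 1
    return max(1, math.ceil(n_items / (n_workers * chunks_per_worker)))


def process(x, y):
    # Stand-in for the real per-item computation (assumption).
    return x * y


def run_on_cluster(arg_pairs, n_workers):
    # Imported here so the helper above can be used without Ray installed.
    from ray.util.multiprocessing import Pool

    # ray_address="auto" connects to an already running Ray cluster.
    pool = Pool(ray_address="auto")
    try:
        # Larger chunks batch many small items into each submitted task.
        return pool.starmap(
            process,
            arg_pairs,
            chunksize=pick_chunksize(len(arg_pairs), n_workers),
        )
    finally:
        pool.close()
        pool.join()
```

With a cluster already started via ray start, something like `run_on_cluster([(i, i + 1) for i in range(1000)], n_workers=32)` would submit the work in chunks of 8 items each.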

Swadhin_Agrawal · February 21, 2021, 2:40pm

Hey,
I am using Ray 0.8.6 on Ubuntu 18.04 and Ubuntu 20.04 LTS.
By 3 local systems I mean, one laptop as a head node and 2 computers as workers.

The computation is heavy and I am using 32 threads in total. Initially, the program uses all 32 of them for 30 minutes or so, and then the two worker computers stop getting work and all the processing is done on the head node only.

I am using the Ray web page to monitor the cores through localhost:port.

Please help me with this.

sangcho · February 22, 2021, 3:14am

If you run ray status when the other 2 nodes stop processing tasks, what does the output look like?

Swadhin_Agrawal · February 22, 2021, 7:52am

Hi, ray status gives the following error:

Usage: ray [OPTIONS] COMMAND [ARGS]...
Try 'ray --help' for help.

Error: No such command 'status'.

Swadhin_Agrawal · February 22, 2021, 9:37am

This is the dashboard status

sangcho · February 22, 2021, 5:49pm

What version of ray are you using? Can you try 1.2.0 or the latest master?

Swadhin_Agrawal · February 23, 2021, 4:05pm

I tried Ray 1.2.0.
The following is the output of ray status:

======== Cluster status: 2021-02-23 21:33:15.387934 ========
Node status

1 node(s) with resources: {'memory': 346.0, 'node:10.42.0.237': 1.0, 'object_store_memory': 119.0, 'CPU': 12.0}
1 node(s) with resources: {'node:10.42.0.231': 1.0, 'object_store_memory': 183.0, 'GPU': 2.0, 'CPU': 8.0, 'accelerator_type:RTX': 1.0, 'memory': 620.0}

Resources

Usage:
0.00/47.168 GiB memory
20.0/20.0 CPU
0.00/14.746 GiB object_store_memory
0.0/2.0 GPU
0.0/1.0 accelerator_type:RTX

Demands:
(no resource demands)
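
For anyone who wants to check the Usage block above programmatically, here is a minimal sketch that parses lines such as "20.0/20.0 CPU" into used and total values. The line format is assumed from the output pasted in this post and may differ in other Ray versions; `parse_usage` and `USAGE_LINE` are hypothetical names, not part of Ray.

```python
import re

# Matches lines such as "20.0/20.0 CPU" or "0.00/47.168 GiB memory".
# The optional "GiB" unit is skipped so the resource name is captured last.
USAGE_LINE = re.compile(r"^\s*([\d.]+)/([\d.]+)\s+(?:(GiB)\s+)?(\S+)\s*$")


def parse_usage(status_text):
    """Return {resource: (used, total)} from the Usage block of ray status."""
    usage = {}
    for line in status_text.splitlines():
        match = USAGE_LINE.match(line)
        if match:
            used, total, _unit, name = match.groups()
            usage[name] = (float(used), float(total))
    return usage


# Usage block copied from the post above.
sample = """Usage:
0.00/47.168 GiB memory
20.0/20.0 CPU
0.00/14.746 GiB object_store_memory
0.0/2.0 GPU
0.0/1.0 accelerator_type:RTX
"""

usage = parse_usage(sample)
cpu_used, cpu_total = usage["CPU"]
print(f"CPU utilization: {cpu_used / cpu_total:.0%}")
```

For the output in this post, all 20 CPUs across the two listed nodes are in use, which matches the report that the workers stay busy on 1.2.0.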

Swadhin_Agrawal · February 23, 2021, 4:07pm

In this version it works fine. I guess 0.8.6 had some batch input issue that was resolved in later versions.

Thanks for helping me out

sangcho · February 23, 2021, 5:47pm

That’s awesome! Glad it is working. Lmk if you need any other help!
