Problem node running low on memory

tonidep · April 7, 2023, 10:31am

“ERROR”, “error”: "OutOfMemoryError: Task was killed due to the node running low on memory.\nMemory on the node (IP: 172.31.6.213, ID: 9c9e42d5874264e5d92e6f911a30a27be1eac71a7764130e45c0da60) where the task (task ID: 6f583cb255c69937b122c0f49be9bbfab317ead402000000, name=train_test, pid=5809, memory used=0.13GB) was running was 7.14GB / 7.50GB (0.951585), which exceeds the memory usage threshold of 0.95. Ray killed this worker (ID: 99672e8a7b9979379a03eeac1f287acb22c074eab5696c44bd600c30) because it was the most recently scheduled task; to see more information about memory usage on this node, use ray logs raylet.out -ip 172.31.6.213.

Jules_Damji · April 7, 2023, 6:13pm

@tonidep this is related to the previous issue.

tonidep · April 11, 2023, 12:55pm

I’m remotely running a Python algorithm on a Ray cluster on AWS.
This algorithm is very computationally expensive.
We can also assume that multiple algorithms will run on the cluster at the same time.
How should I reason about sizing the following:

           1) core cpu and memory of head 
           2) number of workers, number of cores per worker, and memory per worker

Jules_Damji · April 11, 2023, 9:43pm

core cpu and memory of head

number of workers, number of cores per worker, and memory per worker

That varies from use case to use case. There is no rule of thumb or formula for how many cores and how much memory your head node must have. You may have to run your workload, observe the memory consumption of your compute tasks, and then choose a node with 1.5x to 2x that much memory as headroom.
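As a rough sketch of that headroom reasoning (not an official Ray formula), the snippet below scales an observed peak by a 1.5x to 2x factor and checks whether a node would stay under the 0.95 memory usage threshold mentioned in the error above. The function names are hypothetical, and the 7.14 GB figure is simply the node usage reported in the original error message, used here as an example input.

```python
def recommended_node_memory_gb(observed_peak_gb: float, headroom: float = 1.5) -> float:
    """Scale the observed peak memory by a headroom factor (1.5x to 2x suggested above)."""
    if observed_peak_gb <= 0:
        raise ValueError("observed_peak_gb must be positive")
    if headroom < 1.0:
        raise ValueError("headroom must be at least 1.0")
    return observed_peak_gb * headroom


def stays_under_threshold(used_gb: float, total_gb: float, threshold: float = 0.95) -> bool:
    """Return True if used/total does not exceed the memory usage threshold."""
    if total_gb <= 0:
        raise ValueError("total_gb must be positive")
    return used_gb / total_gb <= threshold


# Example input: node usage taken from the error message in this thread.
observed_peak_gb = 7.14

low = recommended_node_memory_gb(observed_peak_gb, headroom=1.5)
high = recommended_node_memory_gb(observed_peak_gb, headroom=2.0)
print(f"Suggested node memory: {low:.2f} GB to {high:.2f} GB")

# The original 7.50 GB node is over the threshold; the larger sizes are not.
print(stays_under_threshold(observed_peak_gb, 7.50))
print(stays_under_threshold(observed_peak_gb, low))
```

The observed peak should come from measuring your own workload, for example through the Ray dashboard or the raylet logs; the numbers here are only placeholders.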

Basically, if you want to maximize your parallelism, use 10 to 12 cores per node, reserving 10 cores for Ray tasks and an additional two cores for OS-related tasks on each node.
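One way to apply that split, sketched under the assumption that you cap Ray's CPU count with the `--num-cpus` option of `ray start` (or `num_cpus` in `ray.init`). The helper names are hypothetical, and `<head-address>` is a placeholder you would replace with your own head node address.

```python
def ray_cpus_for_node(total_cores: int, os_reserved_cores: int = 2) -> int:
    """Cores to advertise to Ray after setting some aside for the OS."""
    if total_cores <= os_reserved_cores:
        raise ValueError("node needs more cores than are reserved for the OS")
    return total_cores - os_reserved_cores


def ray_start_command(total_cores: int, os_reserved_cores: int = 2) -> str:
    """Build a worker start command that limits Ray to the non-reserved cores.

    <head-address> is a placeholder, not a real address.
    """
    num_cpus = ray_cpus_for_node(total_cores, os_reserved_cores)
    return f"ray start --address=<head-address> --num-cpus={num_cpus}"


# A 12-core node with 2 cores left for the OS gives Ray 10 cores.
print(ray_cpus_for_node(12))
print(ray_start_command(12))
```

Note that `num_cpus` is a logical resource used for scheduling; it limits how many tasks Ray places on the node rather than pinning processes to physical cores.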

As for memory-intensive and compute-intensive tasks, you can choose any of the recommended instances.
