“ERROR”, “error”: "OutOfMemoryError: Task was killed due to the node running low on memory.\nMemory on the node (IP: 172.31.6.213, ID: 9c9e42d5874264e5d92e6f911a30a27be1eac71a7764130e45c0da60) where the task (task ID: 6f583cb255c69937b122c0f49be9bbfab317ead402000000, name=train_test, pid=5809, memory used=0.13GB) was running was 7.14GB / 7.50GB (0.951585), which exceeds the memory usage threshold of 0.95. Ray killed this worker (ID: 99672e8a7b9979379a03eeac1f287acb22c074eab5696c44bd600c30) because it was the most recently scheduled task; to see more information about memory usage on this node, use ray logs raylet.out -ip 172.31.6.213.
@tonidep this is related to the previous issue.
I’m remotely running a Python algorithm on a Ray cluster on AWS.
This algorithm is very computationally expensive.
We can also assume the simultaneous execution of multiple algorithms on the cluster.
How should I reason about sizing the following:
- CPU cores and memory of the head node
- number of workers, number of cores per worker, and memory per worker
That depends on the use case. There is no magic rule of thumb or formula for how many cores and how much memory your head node must have. You may have to experiment and observe the memory consumption of your compute tasks, and based on that pick a node with 1.5x to 2x that memory as headroom.
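As a rough illustration of the headroom reasoning, here is a minimal sketch. The function names, the 1 GB base overhead, and the 2 GB peak per task are assumptions made up for the example, not measured values; the 0.95 threshold and the 7.50 GB node size come from the error message above.

```python
def required_node_memory_gb(peak_task_gb, concurrent_tasks,
                            headroom=1.5, base_overhead_gb=1.0):
    """Estimate node memory from observed per-task peak memory.

    headroom: safety factor (1.5x to 2x, as suggested above).
    base_overhead_gb: assumed allowance for the OS and Ray processes
    (hypothetical value; measure it on your own nodes).
    """
    if headroom < 1.0:
        raise ValueError("headroom should be >= 1.0")
    return peak_task_gb * concurrent_tasks * headroom + base_overhead_gb


def fits_under_threshold(required_gb, node_gb, threshold=0.95):
    """Check the estimate against the memory usage threshold
    reported in the OutOfMemoryError message (0.95 there)."""
    return required_gb <= node_gb * threshold


# Hypothetical workload: 4 concurrent tasks peaking at 2 GB each.
needed = required_node_memory_gb(peak_task_gb=2.0, concurrent_tasks=4)
print(f"estimated need: {needed:.1f} GB")
print("fits on 7.5 GB node:", fits_under_threshold(needed, 7.5))
print("fits on 16 GB node:", fits_under_threshold(needed, 16.0))
```

Replace the peak value with what you actually observe while the tasks run; the estimate is only as good as that measurement.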
Basically, if you want to maximize your parallelism, use 10 to 12 cores per node, reserving 10 cores for Ray tasks and an additional two cores for OS-related tasks on each node.
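The core split can be sketched the same way. The helper names below are hypothetical, and the two reserved OS cores simply follow the suggestion above; the resulting number is what you might pass as the CPU count when starting Ray on that node (for example via `ray start --num-cpus`).

```python
def ray_cpus_for_node(total_cores, os_reserved=2):
    """Cores to advertise to Ray after reserving some for the OS."""
    if total_cores <= os_reserved:
        raise ValueError("node has no cores left for Ray tasks")
    return total_cores - os_reserved


def max_parallel_tasks(ray_cpus, cpus_per_task=1):
    """How many tasks can run at once if each task requests
    cpus_per_task CPUs (whole CPUs only in this sketch)."""
    if cpus_per_task < 1:
        raise ValueError("cpus_per_task must be at least 1")
    return ray_cpus // cpus_per_task


ray_cpus = ray_cpus_for_node(total_cores=12)
print("CPUs for Ray:", ray_cpus)
print("parallel 1-CPU tasks:", max_parallel_tasks(ray_cpus))
print("parallel 2-CPU tasks:", max_parallel_tasks(ray_cpus, cpus_per_task=2))
```

Parallelism is bounded by memory as well as by cores, so the task count here should be checked against the per-task memory you observe.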
As for memory- and compute-intensive tasks, you can choose any of those recommended instances