Using Ray in AWS Batch in docker containers

arunppsg · June 23, 2022, 4:10pm

Hi all,

I am testing a simple application that uses Ray in AWS Batch. Here is what I did:

Created a dummy Ray application that reads a dataset from an S3 datastore,
repartitions it into n partitions, where n is the number of CPUs,
and runs a map_batches operation on the dataset.
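
A minimal sketch of an application with this shape is shown below. It is not the exact program: the CSV format, the `s3://my-bucket/data.csv` path, the `--num-cpus` and `--path` flag names, and the identity function passed to `map_batches` are all placeholders assumed for illustration.

```python
import argparse


def parse_args(argv=None):
    """Parse the arguments that would be passed via `docker run`."""
    parser = argparse.ArgumentParser(description="Dummy Ray Datasets job")
    # Flag names and defaults are hypothetical, chosen for this sketch.
    parser.add_argument("--num-cpus", type=int, default=2)
    parser.add_argument("--path", default="s3://my-bucket/data.csv")
    return parser.parse_args(argv)


def run(num_cpus, path):
    """Read a dataset, repartition it, and apply map_batches."""
    # Imported here so argument parsing can be checked without Ray installed.
    import ray

    ray.init(num_cpus=num_cpus)

    # Assumes the dataset is CSV; other readers exist for other formats.
    ds = ray.data.read_csv(path)

    # One partition per CPU, as described in the steps above.
    ds = ds.repartition(num_cpus)

    # Placeholder transformation: returns each batch unchanged.
    ds = ds.map_batches(lambda batch: batch)

    # Counting rows forces the pipeline to execute.
    return ds.count()


def main(argv=None):
    args = parse_args(argv)
    print(run(args.num_cpus, args.path))
```

Inside the container, calling `main()` from the entry-point script would pick up the `docker run` arguments.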

The application worked fine on my local machine, and I observed a speedup as num_cpus was increased. I then tried the same thing in an AWS Batch compute environment: I built a Docker image from the rayproject/ray:latest base image, added my program to it, and passed num_cpus as a docker run command argument. With this setup, I sometimes see the following errors:

WARNING worker.py:1404 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: f3645dd2343ba1628c0e33ed3df84ccac8a7a5ac01000000 Worker ID: fc4897cf9d29dfc602ce30728bd89c74dc925bfa966633df33b39f50 Node ID: 1b5112d22ad9948af73174d31feb75c5392ba4f8c24eae6a845c008e Worker IP address: x.x.x.x Worker port: -- Worker PID: 593
2022-06-23T14:01:07.269Z
next_line = file_info.file_handle.readline()
OSError: [Errno 12] Cannot allocate memory
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/log_monitor.py", line 451, in <module>
log_monitor.run()
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/log_monitor.py", line 377, in run
anything_published = self.check_log_files_and_publish_updates()
f"Error: Reading file: {file_info.filename}, "
AttributeError: 'LogFileInfo' object has no attribute 'file_info'

The error occurs mostly when num_cpus is 4 or higher (for example 4 or 8). So far, I have not seen it on smaller datasets with num_cpus set to 2. Any ideas on how to solve it? More broadly, is this the right approach to using Ray in AWS Batch? Any user experience with Ray on AWS Batch would be helpful.
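
Since the log shows `OSError: [Errno 12] Cannot allocate memory`, one thing that could be checked is how much memory the container is actually allowed to use. The sketch below is only a diagnostic idea, not a confirmed fix: it assumes the container exposes the standard Linux cgroup files, and the function name `container_memory_limit` is made up for this example.

```python
from pathlib import Path
from typing import Iterable, Optional

# Standard cgroup locations for the memory limit: v2 first, then v1.
CGROUP_LIMIT_FILES = (
    "/sys/fs/cgroup/memory.max",
    "/sys/fs/cgroup/memory/memory.limit_in_bytes",
)


def container_memory_limit(
    paths: Iterable[str] = CGROUP_LIMIT_FILES,
) -> Optional[int]:
    """Return the first numeric memory limit found, in bytes, or None."""
    for candidate in paths:
        try:
            raw = Path(candidate).read_text().strip()
        except OSError:
            # File missing or unreadable: try the next location.
            continue
        if raw.isdigit():
            # Note: cgroup v1 reports "unlimited" as a very large number.
            return int(raw)
        # cgroup v2 reports "max" when no limit is set; keep looking.
    return None


if __name__ == "__main__":
    limit = container_memory_limit()
    if limit is None:
        print("No numeric cgroup memory limit found")
    else:
        print(f"Container memory limit: {limit / 2**30:.2f} GiB")
```

Running this inside the Batch job for each num_cpus setting would show whether the failing runs coincide with a low memory limit.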
