Hi all,
I am testing a simple application using Ray in AWS Batch. Here are the steps I followed (a minimal sketch of the app follows the list):
- created a dummy Ray application that reads a dataset from an S3 datastore,
- repartitions it into n partitions, where n is the number of CPUs, and
- runs a map_batches operation on the dataset.
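For context, the application is essentially the sketch below (the S3 path and the per-batch transform are placeholders, and I'm assuming a Parquet dataset here; the real code uses the read_* call matching the actual format):

```python
import sys

import ray

# num_cpus arrives as a command-line argument (see the docker run note below)
num_cpus = int(sys.argv[1])

ray.init(num_cpus=num_cpus)

# Placeholder S3 path; the real dataset lives elsewhere
ds = ray.data.read_parquet("s3://my-bucket/my-dataset/")

# Repartition into n partitions, where n is the number of CPUs
ds = ds.repartition(num_cpus)

def transform(batch):
    # Placeholder for the real per-batch transformation
    return batch

ds = ds.map_batches(transform)
```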
The application worked fine on my local machine, and I observed a speed improvement as num_cpus was increased. I then tried the same thing in an AWS Batch compute environment: I built a Docker image from the rayproject/ray:latest base image, added my program to it, and pass num_cpus as a docker run command argument.
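The image is put together roughly like this (my_app.py is a placeholder for the actual script name):

```dockerfile
# Start from the official Ray image
FROM rayproject/ray:latest

# Add the application script (placeholder name)
COPY my_app.py /app/my_app.py

# Arguments passed to `docker run` are appended to this entrypoint,
# so the CPU count is supplied at run time, e.g. `docker run my-image 8`
ENTRYPOINT ["python", "/app/my_app.py"]
```

When the job runs, I sometimes see this weird error: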
```
WARNING worker.py:1404 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: f3645dd2343ba1628c0e33ed3df84ccac8a7a5ac01000000 Worker ID: fc4897cf9d29dfc602ce30728bd89c74dc925bfa966633df33b39f50 Node ID: 1b5112d22ad9948af73174d31feb75c5392ba4f8c24eae6a845c008e Worker IP address: x.x.x.x Worker port: -- Worker PID: 593
2022-06-23T14:01:07.269Z
    next_line = file_info.file_handle.readline()
OSError: [Errno 12] Cannot allocate memory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/log_monitor.py", line 451, in <module>
    log_monitor.run()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/log_monitor.py", line 377, in run
    anything_published = self.check_log_files_and_publish_updates()
    f"Error: Reading file: {file_info.filename}, "
AttributeError: 'LogFileInfo' object has no attribute 'file_info'
```
The error occurs mostly when num_cpus is 4 or greater (e.g., 4 or 8); so far, I have not noticed it on smaller datasets with num_cpus set to 2. Any ideas on how to solve this? More broadly, is this the right approach for using Ray in AWS Batch? Any user experience with Ray on AWS Batch would be helpful.