Setting ulimits on EC2 instances

Does anyone have experience successfully setting ulimits for open file descriptors for ray start when running on EC2? In my cluster YAML, head_start_ray_commands looks like this:

- ray stop
- ulimit -n 65536; ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml

But I’m not sure the 65536 limit is actually honored, because the default limit on EC2 instances is 8192, and running ulimit -n 65536 gives this error: bash: ulimit: open files: cannot modify limit: Operation not permitted. To verify that Ray workers indeed have a limit of 8192 (instead of 65536), I ran this snippet:

import resource
import ray

def get_limit():
    return resource.getrlimit(resource.RLIMIT_NOFILE)

f = ray.remote(get_limit)
result = ray.get(f.remote())
print(result)   # Soft, hard limit
# Result was (8192, 8192) on a r5.2xlarge instance
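
The "Operation not permitted" error is consistent with how soft and hard limits generally work on Linux: an unprivileged process may raise its soft limit only as far as the hard limit, and here both are 8192. A minimal sketch of that rule using only the standard library follows; raise_soft_nofile is a hypothetical helper name, not part of Ray.

```python
import resource


def raise_soft_nofile(target):
    """Raise the soft open-file limit toward `target`, capped at the hard limit.

    Hypothetical helper for illustration. Returns the (soft, hard) pair
    in effect after the attempt.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

    # Nothing to do if the soft limit is already unlimited.
    if soft == resource.RLIM_INFINITY:
        return soft, hard

    # Without privileges the soft limit cannot exceed the hard limit,
    # so cap the request there instead of failing.
    ceiling = target if hard == resource.RLIM_INFINITY else min(target, hard)

    if ceiling > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (ceiling, hard))

    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    # With a hard limit of 8192, asking for 65536 leaves the soft limit at 8192.
    print(raise_soft_nofile(65536))
```

This only shows why the request is rejected; it cannot lift the hard limit itself, which is why the limits.conf change is still needed.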

To increase the limit, I tried running sudo bash -c "echo $USER hard nofile 65536 >> /etc/security/limits.conf" as recommended in the docs. This increases the limit, but only after a sudo reboot of the EC2 instance (logging out and back in does not update it). I’m afraid adding the limit update to my setup scripts won’t help, since the instance is not restarted before the head_start_ray_commands are run.

Is there a way to reliably set the ulimit on EC2 instances in the autoscaler yaml?

After a helpful discussion with @sangcho and @Alex on the Ray Slack, here’s a summary of the problem and a workaround:

  1. ulimit -n 65535 fails if the limit in /etc/security/limits.conf is below 65535. It is set to 8192 in Amazon’s default AMIs.
  2. The fix is to raise the limits in limits.conf with sudo bash -c 'rm -rf /etc/security/limits.d; echo "* soft nofile 65535" >> /etc/security/limits.conf; echo "* hard nofile 65535" >> /etc/security/limits.conf;'. Add this line to setup_commands in your YAML. However, the changes to limits.conf take effect only after a restart (logging out and back in is not enough).
  3. Workaround: to simulate a restart, run ray up (this updates limits.conf by running setup_commands), then run ray down, then ray up again. The cluster is then ready to use.
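
To confirm that step 2 actually wrote the entries before doing the ray down / ray up cycle, something like the following could be run on the node. It is only a sketch: nofile_entries is a hypothetical helper, and it assumes the usual four-column limits.conf layout (domain, type, item, value).

```python
def nofile_entries(limits_conf_text):
    """Return (domain, type, value) tuples for each nofile line.

    Hypothetical helper for illustration. Expects text in the usual
    limits.conf layout: <domain> <type> <item> <value>, with '#' comments.
    """
    entries = []
    for raw_line in limits_conf_text.splitlines():
        # Drop comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        fields = line.split()
        # Keep only well-formed lines whose item column is "nofile".
        if len(fields) == 4 and fields[2] == "nofile":
            domain, limit_type, _item, value = fields
            entries.append((domain, limit_type, value))
    return entries
```

For example, passing in the contents of /etc/security/limits.conf after setup_commands have run should return entries for both the soft and hard limits, such as ("*", "soft", "65535") and ("*", "hard", "65535"). The values stay as strings because limits.conf also accepts non-numeric values.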