Ray log location

How severely does this issue affect your experience of using Ray?

  • Medium: It causes significant difficulty in completing my task, but I can work around it.
  • High: It blocks me from completing my task.

I can’t find the Ray logs in /tmp/ray on the head node any more. Where are they?

I’m running Ray on an AWS cluster, and since a couple of versions ago (I think starting with 2.3.1) I can’t find them any more. The docs still say they should be in /tmp/ray, but they’re not there. The point is that after stopping Ray, I still want to sift through the logs to see why some crashes might have occurred.

Also, in the Ray 2.4.0 console output I can’t find the “Memory usage on this node” line any more.

Also, I notice now that the logs are there while the job is running. Are they perhaps deleted when the job is stopped? Is there an option to keep them, so as to allow forensic research into the causes of failure?

Hmm, normally they should be in /tmp/ray/session_latest, unless you specify the temp dir explicitly.

Is it possible they get deleted on shutdown?
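On a node using the default temp dir, you can check with something like this (directory names are just examples; the actual session directory name will differ):

ls -l /tmp/ray/                   # session_<timestamp>_<pid> directories plus a session_latest symlink
ls /tmp/ray/session_latest/logs/  # raylet.out, gcs_server.out, python-core-worker-*.log, etc.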

Also, in the Ray 2.4.0 console output I can’t find the “Memory usage on this node” line any more.

What about this?

After setting the temp dir explicitly (by passing it to ray start within the AWS yaml config file) to a location inside the user home directory, the worker nodes (running on machines other than the one that runs the head node) don’t get initialized any more. Is there anywhere else I need to specify this? Perhaps in the yaml? (I didn’t find anything in the docs.)

LE: strangely, the dashboard reports all the nodes, but only the head node has a non-zero CPU load. ray monitor and the console output on the head node show only the cores of the machine running the head node.

console output I can’t find the “Memory usage on this node” any more

Not sure what you’re referring to. Any example?

I just tried it on my local laptop. I ran ray start --head and I could see the logs under /tmp/ray. I ran ray stop and the directory was still there.
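For reference, roughly these steps:

ray start --head
ls /tmp/ray     # the session directory and logs are created here
ray stop
ls /tmp/ray     # the directory (and the logs) are still there after stopping Ray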

BTW, did you try to ssh into your ec2 head node and look at the /tmp/ray?

BTW, did you try to ssh into your ec2 head node and look at the /tmp/ray?

Yes, that’s precisely why I started this thread. I can’t find them there any more. Now, I observed that the logs are in /tmp/ray while Ray is running, and even after Ray Tune finishes. However, if the machine is shut down, they’re gone when the machine is started again. I’m beginning to suspect that maybe it’s Amazon Linux that decides to delete some stuff in /tmp upon restart. Now, I tried to address this by changing the log location to a custom directory path inside the user home directory, but, as I said, the worker nodes don’t get initialized any more.

Hmm. It should work. Can you paste your yaml file?

Not sure if different nodes can have different root directories? cc: @sangcho

Hmm. It should work. Can you paste your yaml file?

Sure, here it is:

# aws.yaml
cluster_name: default
max_workers: 2
provider:
    type: aws
    region: eu-west-1
    availability_zone: eu-west-1c
    cache_stopped_nodes: True
    security_group:
        GroupName: ray-autoscaler-default
auth:
    ssh_user: ec2-user
    ssh_private_key: /home/user_name/.ssh/private_key.pem
available_node_types:
    ray.head.default:
        resources: {"CPU": 64, "GPU": 0, "object_store_memory": 5000000000}
        node_config:
            KeyName: key_name
            InstanceType: c7g.16xlarge  # 64 cpus
            ImageId: ami-09ca4fd95e59ee59a # this is arm64 amazon linux 2023 in eu-west-1 (Ireland)
            BlockDeviceMappings:
                - DeviceName: /dev/xvda
                  Ebs:
                      VolumeSize: 100
    ray.worker.default:
        min_workers: 2
        max_workers: 2
        resources: {"CPU": 64, "GPU": 0, "object_store_memory": 5000000000}
        node_config:
            KeyName: key_name
            InstanceType: c7g.16xlarge # 64 cpus
            ImageId: ami-09ca4fd95e59ee59a # this is arm64 amazon linux 2023 in eu-west-1 (Ireland)
            BlockDeviceMappings:
                - DeviceName: /dev/xvda
                  Ebs:
                      VolumeSize: 100
head_node_type: ray.head.default
file_mounts: {}
cluster_synced_files: []
file_mounts_sync_continuously: False
rsync_exclude:
    - "**/.git"
    - "**/.git/**"
    - "**/*.pyc"
    - "**/__pycache__/**"
rsync_filter:
    - ".gitignore"
initialization_commands:
    - touch ~/.sudo_as_admin_successful
    - sudo yum update -y
    - sudo yum -y install tmux screen iotop htop python3 python3-devel git gcc-c++
    - python3 -m venv pyvenv && source pyvenv/bin/activate && pip install --upgrade setuptools pip wheel && pip install numpy && pip install -r requirements.txt && pip install boto3
    - [...]
    - mkdir -p ray_temp_logs
setup_commands:
    - echo 'export PATH=$HOME/pyvenv/bin:$PATH' >> ~/.bashrc
head_setup_commands: []
worker_setup_commands: []
head_start_ray_commands:
    - ray stop
    - ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml --temp-dir=~/ray_temp_logs/
worker_start_ray_commands:
    - ray stop
    - ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076

It was based on this example config (I created this yaml some time ago, so maybe, in the meantime, the base example file has changed), which I found referenced in the docs here.

I’m using the following commands to launch the cluster:

ray up aws/ray_arm.yaml --yes
ray exec --tmux --verbose aws/ray_arm.yaml "python3 -m my_ray_module 2>&1 | tee /home/ec2-user/stdouterr.log"
ray attach aws/ray_arm.yaml --tmux

Then I also launch

(venv) user@laptop:~$ ray dashboard aws.yaml

and

(venv) user@laptop:~$ ray monitor aws.yaml

It’s within the my_ray_module.py script that I call ray.init() without any arguments; I also call flaml.tune.run(), which, under the hood, calls ray.tune.run().
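Roughly, the script is structured like this (a simplified sketch only; the search space, metric, and time budget below are placeholders, not my actual ones):

# my_ray_module.py - simplified sketch
import ray
from flaml import tune

def evaluate_config(config):
    # placeholder objective; the real one trains/evaluates a model
    return {"score": config["x"]}

if __name__ == "__main__":
    ray.init()  # no arguments; connects to the cluster started via `ray start`
                # (I suppose the temp dir could also be overridden here via ray.init(_temp_dir=...))
    tune.run(
        evaluate_config,
        config={"x": 1},          # placeholder search space
        metric="score",
        mode="max",
        time_budget_s=24 * 3600,  # placeholder time budget
        use_ray=True,             # makes flaml delegate to ray.tune.run() under the hood
    )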

Is there any other info that I neglected to send?

P.S.:

Not sure if different nodes can have different root directories? cc: @sangcho

As per the docs here:

--temp-dir <temp_dir>
manually specify the root temporary dir of the Ray process, only works when --head is specified

the --temp-dir option is only suited for the head node, which is why I haven’t passed it to the ray start command within the worker_start_ray_commands section of my aws.yaml file above.

Not sure what you’re referring to. Any example?

With my verbosity options:

  • I’m running ray with the config from my previous comment, so I pass the --verbose option to ray exec
  • ray.tune.run() is being passed verbose=2

every 5 seconds I get output to the console (which also gets saved to /home/ec2-user/stdouterr.log, as per the ray exec ... "python3 ... 2>&1 | tee /home/ec2-user/stdouterr.log" call in the post above) showing the current status of the cluster - something like this:

== Status ==
Current time: 2023-05-03 08:23:19 (running for 18:17:31.51)
Using FIFO scheduling algorithm.
Logical resource usage: 191.0/192 CPUs, 0/0 GPUs
Current best trial: af7b05ae with score=... and parameters={...}
Result logdir: /home/ec2-user/ray_results/evaluate_config_2023-05-02_14-05-48
Number of trials: 25569/infinite (1 PENDING, 191 RUNNING, 25377 TERMINATED)

However, up until some time ago (which roughly coincides with my upgrade to Ray 2.4.0 - hence my suspicion), when running similar tune sessions, the output used to look like this:

== Status ==
Current time: 2023-04-25 14:51:54 (running for 1 days, 00:13:57.95)
Memory usage on this node: 47.3/123.6 GiB 
Using AsyncHyperBand: num_stopped=29293
Bracket: Iter 8.000: 440.91518941131363 | Iter 4.000: 298.4077658137006 | Iter 2.000: 86.60210328240834 | Iter 1.000: 23.400558285708392
Resources requested: 0/192 CPUs, 0/0 GPUs, 0.0/287.74 GiB heap, 0.0/13.97 GiB objects
Current best trial: 808805c3 with score=... and parameters={...}
Result logdir: /home/ec2-user/ray_results/evaluate_config_2023-04-24_14-37-56
Number of trials: 29485/infinite (29485 TERMINATED)

Notice how there’s an extra line in there:

Memory usage on this node: 47.3/123.6 GiB 

Now, I don’t think this happens all the time under 2.4.0, but I hadn’t noticed the memory usage reports missing before. Also, it might be worth mentioning that the memory usage reports are also missing when using ASHA (the AsyncHyperBand lines in the status messages are generated by it).

In case you’re wondering why I care about these logs: I’m trying to use Ray to implement a cache based on this post, and whenever I use it, after a while my Ray Tune session crashes without telling me why and suggests I look into the logs:

2023-04-30 00:25:46,554	ERROR trial_runner.py:671 -- Trial evaluate_config_c4ac0141: Error stopping trial.
Traceback (most recent call last):
  File "/home/ec2-user/pyvenv/lib64/python3.9/site-packages/ray/tune/execution/trial_runner.py", line 665, in stop_trial
    self._callbacks.on_trial_complete(
  File "/home/ec2-user/pyvenv/lib64/python3.9/site-packages/ray/tune/callback.py", line 365, in on_trial_complete
    callback.on_trial_complete(**info)
  File "/home/ec2-user/pyvenv/lib64/python3.9/site-packages/ray/tune/syncer.py", line 817, in on_trial_complete
    self._sync_trial_dir(trial, force=True, wait=False)
  File "/home/ec2-user/pyvenv/lib64/python3.9/site-packages/ray/tune/syncer.py", line 766, in _sync_trial_dir
    sync_process.wait()
  File "/home/ec2-user/pyvenv/lib64/python3.9/site-packages/ray/tune/syncer.py", line 254, in wait
    raise exception
  File "/home/ec2-user/pyvenv/lib64/python3.9/site-packages/ray/tune/syncer.py", line 217, in entrypoint
    result = self._fn(*args, **kwargs)
  File "/home/ec2-user/pyvenv/lib64/python3.9/site-packages/ray/tune/utils/file_transfer.py", line 69, in sync_dir_between_nodes
    return _sync_dir_between_different_nodes(
  File "/home/ec2-user/pyvenv/lib64/python3.9/site-packages/ray/tune/utils/file_transfer.py", line 197, in _sync_dir_between_different_nodes
    return ray.get(unpack_future)
  File "/home/ec2-user/pyvenv/lib64/python3.9/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/ec2-user/pyvenv/lib64/python3.9/site-packages/ray/_private/worker.py", line 2523, in get
    raise value
ray.exceptions.WorkerCrashedError: The worker died unexpectedly while executing this task. Check python-core-worker-*.log files for more information.

The same traceback can be found in the error.txt file within the trial’s output directory.

Now, I had similar problems before, and back then it turned out that the machine was running out of memory. However, since I now can’t get any logs on that (neither in /tmp/ray, nor even in the “Memory usage on this node” console output), I can’t get any hint as to what makes my cache crash.
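While the machine is still up, I guess I could try something like this to check the out-of-memory theory (just a sketch; the paths assume the default temp dir):

# look for errors in the worker logs mentioned by the exception
grep -iE -B2 -A5 "error|fatal" /tmp/ray/session_latest/logs/python-core-worker-*.log
# check whether the Linux OOM killer terminated a process
dmesg -T | grep -iE "out of memory|killed process"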

Also, maybe it’s worth mentioning why the machine needs to be restarted. When the tuning session finishes, I’m usually out of the office, so someone else shuts down the virtual machines so we don’t have to pay for them. When I turn them back on, I’d like the logs to stick around in case tuning has crashed for some reason, so I can investigate the cause. I’m not sure, but I think at one point I managed to ssh into the machine before it got shut down, and even in that case, if Ray terminates abnormally, the logs seem to be gone. It’s pretty hard to get an exact reproduction because the crashes occur at rather arbitrary times.

By the way, is there a way to automatically stop all the nodes in the cluster when the tuning finishes its time budget? I mean, I’m starting Ray from my local laptop, but all the other commands are running on the head node. I couldn’t find any command in the docs that, executed from the head node, would shut down all the workers and then the head node itself.

LE: some time ago I tried tinkering with the --stop option of ray exec; I can’t remember why exactly, but it didn’t work as needed.
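One thing I might try, from the laptop rather than from the head node (not verified; it also requires the laptop to stay connected for the whole run, which is exactly what --tmux was avoiding):

ray exec aws/ray_arm.yaml "python3 -m my_ray_module 2>&1 | tee /home/ec2-user/stdouterr.log" && ray down aws/ray_arm.yaml --yes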

Got it. Can you create another post under Ray AIR for this and tag me? It’s a separate topic; I can follow up there.

I don’t think that it is supported. If you use KubeRay, there is a way to achieve it. Feel free to create a feature request on GitHub.

@bbudescu I tried to simplify the yaml and repro your issue. However, it worked fine for me. I attached the yaml file below. Could you try to narrow down the issue and provide a minimal repro for us?

cluster_name: 0503-8
max_workers: 2
docker:
    image: "rayproject/ray-ml:latest-cpu"
    container_name: "ray_container"
provider:
    type: aws
    region: us-west-2
    key_pair:
      key_name: my-ssh-key
auth:
    ssh_user: ubuntu
available_node_types:
    ray.head.default:
        node_config:
            InstanceType: m5.2xlarge
    ray.worker.default:
        min_workers: 2
        max_workers: 2
        node_config:
            InstanceType: m5.2xlarge
head_node_type: ray.head.default
head_start_ray_commands:
    - ray stop
    - ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml --dashboard-host=0.0.0.0 --temp-dir=~/ray_temp_logs/
worker_start_ray_commands:
    - ray stop
    - ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076

I can now confirm that the logs are deleted when the AWS virtual machine instances are stopped. I had another crash, but the Ray process kept running in tmux. I attached to it, saved the logs, and then terminated the process with Ctrl+C. The logs were still there. Then I stopped the VMs, started them again, and the logs were gone.

I’m not sure whether there were any other Ray processes still running after sending the SIGINT via Ctrl+C inside the terminal and then ending the tmux session (I didn’t check ps or htop), but I assume there shouldn’t be any, right?

So does this mean that, most probably, the OS removes stuff in /tmp/? I’m using Amazon’s Linux distribution - Amazon Linux 2023, update 2023-03-15 (release notes here).

If so, then we should move the logs out of /tmp, but I can’t do that, since the worker nodes then fail to initialize and run trials.
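In the meantime, as a stopgap, I could probably copy the logs off the head node before the VMs are stopped, something along these lines (the exact command spelling may differ between Ray versions; plain rsync/scp over ssh would work too):

ray rsync_down aws/ray_arm.yaml '/tmp/ray/session_latest/logs/' ./head_node_logs/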

https://systemd.io/TEMPORARY_DIRECTORIES/
Cleaning up /tmp upon reboot seems to just be a Linux feature.
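If you want to keep the default /tmp/ray location, you could probably tell systemd-tmpfiles to skip it, e.g. something like the untested sketch below (note this won’t help if /tmp is mounted as tmpfs on that AMI, since tmpfs contents are lost on reboot regardless):

# /etc/tmpfiles.d/ray.conf
# 'x' excludes the path (and its contents) from systemd-tmpfiles cleanup
x /tmp/ray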

You should try to narrow down why the worker nodes fail to initialize. You can look at the monitor.* log files; there should be logs about why the autoscaler cannot scale up the worker nodes.
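For example, something along these lines from your laptop (the path assumes the custom temp dir from your yaml; adjust as needed):

ray exec aws/ray_arm.yaml "tail -n 200 ~/ray_temp_logs/session_latest/logs/monitor*"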

Ok, I will do that. However, I hope you won’t mind me suggesting that the Ray team save the logs somewhere they don’t get deleted upon reboot. I mean, I ended up losing the logs exactly when I needed them most - when crashes occurred. Is there perhaps some other reason why /tmp/ was chosen? Would it be OK for me to open a GitHub issue about this?

Feel free to open one and see if someone knows the reason : )

I guess the main reason is not to waste disk space over time, because users probably won’t clean the logs up themselves and the amount of logs could grow indefinitely…
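For what it’s worth, log growth can also be bounded with Ray’s log rotation environment variables (names as I recall them from the logging docs; worth double-checking before relying on this):

# cap each log file at ~100 MB and keep 5 rotated backups (set before `ray start`)
export RAY_ROTATION_MAX_BYTES=$((100 * 1024 * 1024))
export RAY_ROTATION_BACKUP_COUNT=5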