Not sure what you’re referring to. Could you give an example?
With my verbosity options:
- I’m running Ray with the config from my previous comment, so I pass the `--verbose` option to `ray exec`
- `ray.tune.run()` is being passed `verbose=2`
Every 5 seconds I get output on the console (which also gets saved to /home/ec2-user/stdouterr.log, as per the call to ray exec in the above post: `ray exec ... "python3 ... 2>&1 | tee /home/ec2-user/stdouterr.log"`) showing the current status of the cluster, something like this (a simplified sketch of the full setup follows this output):
```
== Status ==
Current time: 2023-05-03 08:23:19 (running for 18:17:31.51)
Using FIFO scheduling algorithm.
Logical resource usage: 191.0/192 CPUs, 0/0 GPUs
Current best trial: af7b05ae with score=... and parameters={...}
Result logdir: /home/ec2-user/ray_results/evaluate_config_2023-05-02_14-05-48
Number of trials: 25569/infinite (1 PENDING, 191 RUNNING, 25377 TERMINATED)
```
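For reference, the driver side of this setup looks roughly like the sketch below. It is simplified: `evaluate_config` and the search space here are placeholders standing in for my actual code, and the file names in the `ray exec` comment are placeholders too.

```python
import ray
from ray import tune
from ray.air import session

# Launched on the cluster roughly as described above, e.g.:
#   ray exec <cluster-config>.yaml "python3 <script>.py 2>&1 | tee /home/ec2-user/stdouterr.log" --verbose

def evaluate_config(config):
    # Placeholder objective; the real trainable does the actual evaluation.
    score = config["x"] ** 2
    session.report({"score": score})

search_space = {"x": tune.uniform(0.0, 1.0)}

ray.init(address="auto")  # connect to the cluster started from the config in my previous comment

tune.run(
    evaluate_config,
    config=search_space,
    num_samples=-1,   # "infinite" trials, as in the status output above
    verbose=2,        # the verbosity setting mentioned above
)
```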
However, until some time ago (which is why I suspect it might have been caused by upgrading Ray to version 2.4.0), the output of similar tune sessions used to look like this:
```
== Status ==
Current time: 2023-04-25 14:51:54 (running for 1 days, 00:13:57.95)
Memory usage on this node: 47.3/123.6 GiB
Using AsyncHyperBand: num_stopped=29293
Bracket: Iter 8.000: 440.91518941131363 | Iter 4.000: 298.4077658137006 | Iter 2.000: 86.60210328240834 | Iter 1.000: 23.400558285708392
Resources requested: 0/192 CPUs, 0/0 GPUs, 0.0/287.74 GiB heap, 0.0/13.97 GiB objects
Current best trial: 808805c3 with score=... and parameters={...}
Result logdir: /home/ec2-user/ray_results/evaluate_config_2023-04-24_14-37-56
Number of trials: 29485/infinite (29485 TERMINATED)
```
Notice how there’s an extra line in there:
Memory usage on this node: 47.3/123.6 GiB 
Now, I don’t think the line is missing all the time under 2.4.0, but I hadn’t noticed memory usage reports missing before. It’s also worth mentioning that the memory usage reports are missing even when using ASHA (the AsyncHyperBand lines in the status messages above are generated by it).
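For completeness, the ASHA runs are set up roughly like this. This is a simplified sketch reusing `evaluate_config` and `search_space` from the snippet above; the bracket parameters are illustrative, chosen to match the Iter 1/2/4/8 brackets in the older output.

```python
from ray import tune
from ray.tune.schedulers import ASHAScheduler

# ASHA / AsyncHyperBand is what produces the "Using AsyncHyperBand" and
# "Bracket: Iter ..." lines in the older status output above.
scheduler = ASHAScheduler(
    metric="score",
    mode="max",
    max_t=8,              # illustrative; corresponds to the "Iter 8.000" bracket
    grace_period=1,
    reduction_factor=2,
)

tune.run(
    evaluate_config,
    config=search_space,
    scheduler=scheduler,
    num_samples=-1,
    verbose=2,
)
```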
In case you’re wondering why I care about these logs: I’m trying to use Ray to implement a cache based on this post (a simplified sketch follows the traceback below), and whenever I use it, after a while my Ray Tune session crashes without telling me why, only suggesting that I look into the logs:
```
2023-04-30 00:25:46,554	ERROR trial_runner.py:671 -- Trial evaluate_config_c4ac0141: Error stopping trial.
Traceback (most recent call last):
  File "/home/ec2-user/pyvenv/lib64/python3.9/site-packages/ray/tune/execution/trial_runner.py", line 665, in stop_trial
    self._callbacks.on_trial_complete(
  File "/home/ec2-user/pyvenv/lib64/python3.9/site-packages/ray/tune/callback.py", line 365, in on_trial_complete
    callback.on_trial_complete(**info)
  File "/home/ec2-user/pyvenv/lib64/python3.9/site-packages/ray/tune/syncer.py", line 817, in on_trial_complete
    self._sync_trial_dir(trial, force=True, wait=False)
  File "/home/ec2-user/pyvenv/lib64/python3.9/site-packages/ray/tune/syncer.py", line 766, in _sync_trial_dir
    sync_process.wait()
  File "/home/ec2-user/pyvenv/lib64/python3.9/site-packages/ray/tune/syncer.py", line 254, in wait
    raise exception
  File "/home/ec2-user/pyvenv/lib64/python3.9/site-packages/ray/tune/syncer.py", line 217, in entrypoint
    result = self._fn(*args, **kwargs)
  File "/home/ec2-user/pyvenv/lib64/python3.9/site-packages/ray/tune/utils/file_transfer.py", line 69, in sync_dir_between_nodes
    return _sync_dir_between_different_nodes(
  File "/home/ec2-user/pyvenv/lib64/python3.9/site-packages/ray/tune/utils/file_transfer.py", line 197, in _sync_dir_between_different_nodes
    return ray.get(unpack_future)
  File "/home/ec2-user/pyvenv/lib64/python3.9/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/ec2-user/pyvenv/lib64/python3.9/site-packages/ray/_private/worker.py", line 2523, in get
    raise value
ray.exceptions.WorkerCrashedError: The worker died unexpectedly while executing this task. Check python-core-worker-*.log files for more information.
```
The same traceback can be found in the error.txt file within the trial’s output directory.
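For context, the cache I mentioned above is roughly along the lines of the sketch below: a named, detached Ray actor that trials look up and query instead of recomputing results. This is a simplified illustration, not my exact code; the actor name and helper function are placeholders.

```python
import ray

@ray.remote
class Cache:
    """Very simplified version of the shared cache actor that the trials talk to."""

    def __init__(self):
        self._store = {}

    def get(self, key):
        return self._store.get(key)

    def put(self, key, value):
        self._store[key] = value

# Created once (e.g. by the driver) as a named, detached actor;
# assumes ray.init() has already been called.
cache = Cache.options(
    name="shared_cache", lifetime="detached", get_if_exists=True
).remote()

# Looked up from inside each trial:
def cached_evaluate(key, compute_fn):
    cache = ray.get_actor("shared_cache")
    hit = ray.get(cache.get.remote(key))
    if hit is not None:
        return hit
    value = compute_fn(key)
    cache.put.remote(key, value)
    return value
```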
Now, I had similar problems before, and back then it turned out the machine was running out of memory. However, since I now can’t get any logs on that (neither in /tmp/ray, nor even the “Memory usage on this node” console output), I have no hint as to what is making my cache crash.
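In the meantime, the only workaround I can think of for getting those numbers back is to report node memory myself from inside the trainable, roughly like the sketch below (assuming psutil is available on the workers; the metric names are just placeholders):

```python
import psutil
from ray.air import session

def evaluate_config(config):
    score = config["x"] ** 2  # placeholder for the real objective
    # Report node memory alongside the score, since the
    # "Memory usage on this node" line no longer shows up in the console.
    mem = psutil.virtual_memory()
    session.report({
        "score": score,
        "node_mem_used_gib": (mem.total - mem.available) / 1024 ** 3,
        "node_mem_total_gib": mem.total / 1024 ** 3,
    })
```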