I’m not sure what you’re referring to. Could you give an example?
As for my verbosity options:
- I’m running Ray with the config from my previous comment, so I pass the `--verbose` option to `ray exec`.
- `ray.tune.run()` is being passed `verbose=2` (a rough sketch of the Python call follows the status snippet below).

Every 5 seconds I get output to the console (which also gets saved to `/home/ec2-user/stdouterr.log`, as per the call to `ray exec` in the above post [`ray exec ... "python3 ...2>&1 | tee /home/ec2-user/stdouterr.log"`]) that shows the current status of the cluster - something like this:
== Status ==
Current time: 2023-05-03 08:23:19 (running for 18:17:31.51)
Using FIFO scheduling algorithm.
Logical resource usage: 191.0/192 CPUs, 0/0 GPUs
Current best trial: af7b05ae with score=... and parameters={...}
Result logdir: /home/ec2-user/ray_results/evaluate_config_2023-05-02_14-05-48
Number of trials: 25569/infinite (1 PENDING, 191 RUNNING, 25377 TERMINATED)
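For concreteness, the Python side of the setup above looks roughly like this; `evaluate_config` and the search space here are placeholders for my actual trainable and config, not the real code:

```python
from ray import tune
from ray.air import session


def evaluate_config(config):
    # Placeholder trainable -- my real evaluate_config() does the actual work.
    session.report({"score": config["x"] ** 2})


tune.run(
    evaluate_config,
    config={"x": tune.uniform(0.0, 1.0)},  # placeholder search space
    num_samples=-1,  # -1 == keep sampling trials ("infinite" in the status above)
    verbose=2,       # the verbosity level mentioned above
)
```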
However, until some time ago (which is why I suspect the change might have been caused by upgrading Ray to version 2.4.0), the output of similar tune sessions used to look like this:
== Status ==
Current time: 2023-04-25 14:51:54 (running for 1 days, 00:13:57.95)
Memory usage on this node: 47.3/123.6 GiB
Using AsyncHyperBand: num_stopped=29293
Bracket: Iter 8.000: 440.91518941131363 | Iter 4.000: 298.4077658137006 | Iter 2.000: 86.60210328240834 | Iter 1.000: 23.400558285708392
Resources requested: 0/192 CPUs, 0/0 GPUs, 0.0/287.74 GiB heap, 0.0/13.97 GiB objects
Current best trial: 808805c3 with score=... and parameters={...}
Result logdir: /home/ec2-user/ray_results/evaluate_config_2023-04-24_14-37-56
Number of trials: 29485/infinite (29485 TERMINATED)
Notice how there’s an extra line in there:
Memory usage on this node: 47.3/123.6 GiB
Now, I don’t think the reports go missing all the time under 2.4.0, but I hadn’t noticed memory usage reports go missing before. It might also be worth mentioning that the reports are missing even when I use ASHA (it’s ASHA that generates the AsyncHyperBand lines in the status messages).
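For reference, this is roughly how I attach ASHA to the run; the metric name, mode, and numbers below are placeholders rather than my exact settings:

```python
from ray.tune.schedulers import ASHAScheduler

# ASHA / AsyncHyperBand scheduler -- this is what produces the
# "Using AsyncHyperBand" and "Bracket: ..." lines in the status output.
scheduler = ASHAScheduler(
    metric="score",      # placeholder metric name
    mode="max",
    grace_period=1,
    reduction_factor=2,
)

# Passed to the same tune.run() call as above:
# tune.run(evaluate_config, ..., scheduler=scheduler, verbose=2)
```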
In case you’re wondering why I care about these logs: I’m trying to use Ray to implement a cache based on this post, and whenever I use it, after a while my Ray Tune session crashes without telling me why, only suggesting that I look into the logs:
2023-04-30 00:25:46,554 ERROR trial_runner.py:671 -- Trial evaluate_config_c4ac0141: Error stopping trial.
Traceback (most recent call last):
File "/home/ec2-user/pyvenv/lib64/python3.9/site-packages/ray/tune/execution/trial_runner.py", line 665, in stop_trial
self._callbacks.on_trial_complete(
File "/home/ec2-user/pyvenv/lib64/python3.9/site-packages/ray/tune/callback.py", line 365, in on_trial_complete
callback.on_trial_complete(**info)
File "/home/ec2-user/pyvenv/lib64/python3.9/site-packages/ray/tune/syncer.py", line 817, in on_trial_complete
self._sync_trial_dir(trial, force=True, wait=False)
File "/home/ec2-user/pyvenv/lib64/python3.9/site-packages/ray/tune/syncer.py", line 766, in _sync_trial_dir
sync_process.wait()
File "/home/ec2-user/pyvenv/lib64/python3.9/site-packages/ray/tune/syncer.py", line 254, in wait
raise exception
File "/home/ec2-user/pyvenv/lib64/python3.9/site-packages/ray/tune/syncer.py", line 217, in entrypoint
result = self._fn(*args, **kwargs)
File "/home/ec2-user/pyvenv/lib64/python3.9/site-packages/ray/tune/utils/file_transfer.py", line 69, in sync_dir_between_nodes
return _sync_dir_between_different_nodes(
File "/home/ec2-user/pyvenv/lib64/python3.9/site-packages/ray/tune/utils/file_transfer.py", line 197, in _sync_dir_between_different_nodes
return ray.get(unpack_future)
File "/home/ec2-user/pyvenv/lib64/python3.9/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
return func(*args, **kwargs)
File "/home/ec2-user/pyvenv/lib64/python3.9/site-packages/ray/_private/worker.py", line 2523, in get
raise value
ray.exceptions.WorkerCrashedError: The worker died unexpectedly while executing this task. Check python-core-worker-*.log files for more information.
The same traceback can be found in the `error.txt` file within the trial’s output directory.
Now, I’ve had similar problems before, and back then it turned out that the machine was running out of memory. However, since I now can’t get any logs on that (neither in `/tmp/ray`, nor even in the "Memory usage on this node" console outputs), I have no hint as to what is making my cache crash.
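As a stopgap, I’m thinking of logging memory myself from a Tune callback. This is only a rough sketch (the `MemoryLogger` name and the GiB formatting are mine, not anything Ray provides), assuming `psutil` is available on the head node:

```python
import psutil
from ray.tune import Callback


class MemoryLogger(Callback):
    """Hypothetical stand-in for the missing memory usage status line."""

    def on_trial_result(self, iteration, trials, trial, result, **info):
        # Note: this fires for every reported result, so with ~190 concurrent
        # trials it would probably need throttling in practice.
        vmem = psutil.virtual_memory()
        used_gib = (vmem.total - vmem.available) / 1024**3
        total_gib = vmem.total / 1024**3
        print(f"Memory usage on this node: {used_gib:.1f}/{total_gib:.1f} GiB")


# Hooked in via: tune.run(evaluate_config, ..., callbacks=[MemoryLogger()])
```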