After a worker errors out and I attempt to run the job again, print messages or logs fail to appear in the terminal.
related github issue.
Hey @Javier_Bosch do you have a repro script I can run?
Hey @rliaw, I tried to reproduce the behavior with another script that raises an error, but I could not.
In my current implementation I am making calls to Elasticsearch. After the function fails (likely a timeout or an overloaded queue inside Elasticsearch), I believe Ray retries the function and the retry also fails. When I rerun the job, it runs but does not log or print any output to the screen.
Also, I am running my Ray function inside kedro, which logs everything to the console, so I am not sure whether something else is at play here. When I rerun my pipeline, one node of which executes my Ray job, my main script's logs are output to the console, but the log/print statements from the Ray task function are not.
I have the same issue. The source is listed in /1/. The prints from the first call work, but not those from the second or the third. What's the follow-up?
/1/
import ray
import time

ray.init()

@ray.remote
def f():
    a = 3
    print("before print", flush=True)
    time.sleep(1)
    print("after print", flush=True)
    time.sleep(2)
    print("end", flush=True)
    return 1

f.remote()
f.remote()
f.remote()
@rliaw, I believe this is still an issue. @juhanishen seems to have a script that reproduces the behavior.
In my case the problem still occurs after a previous exception has been raised, though only intermittently. When I restart Ray, I can see my print messages again.
Hi @juhanishen,
What version of ray are you running? I just ran your script in ray 1.2.0, python 3.7 and this is what I get:
>>> import ray
>>> import time
>>>
>>> ray.init();
2021-04-12 21:14:47,819 INFO services.py:1174 -- View the Ray dashboard at http://127.0.0.1:8265
{'node_ip_address': '192.168.1.216', 'raylet_ip_address': '192.168.1.216', 'redis_address': '192.168.1.216:6379', 'object_store_address': '/tmp/ray/session_2021-04-12_21-14-47_321816_3083317/sockets/plasma_store', 'raylet_socket_name': '/tmp/ray/session_2021-04-12_21-14-47_321816_3083317/sockets/raylet', 'webui_url': '127.0.0.1:8265', 'session_dir': '/tmp/ray/session_2021-04-12_21-14-47_321816_3083317', 'metrics_export_port': 64664, 'node_id': '19df1654af900d5dcd0281a1097678fd2f9fab6169deae3a2fd9c6d8'}
>>>
>>> @ray.remote
... def f():
...     a = 3;
...     print("before print",flush=True);
...     time.sleep(1);
...     print("after print",flush=True);
...     time.sleep(2);
...     print("end",flush=True);
...     return 1;
...
>>> f.remote()
ObjectRef(a67dc375e60ddd1affffffffffffffffffffffff0100000001000000)
>>> f.remote()
ObjectRef(63964fa4841d4a2effffffffffffffffffffffff0100000001000000)
>>> f.remote()
ObjectRef(69a6825d641b4613ffffffffffffffffffffffff0100000001000000)
>>>
>>> (pid=3083627) before print
(pid=3083630) before print
(pid=3083631) before print
(pid=3083627) after print
(pid=3083630) after print
(pid=3083631) after print
(pid=3083627) end
(pid=3083630) end
(pid=3083631) end
>>>
Hi, I have Python 3.7 but Ray version 1.0.1post. I will try installing Ray 1.2 and see whether the error still reproduces. In any case, it is great that someone is following up on the issue.
Br
Juhani
I also tried the latest master with the repro script and could not reproduce the issue. So I assume the problem was resolved at some point between 1.0.1 and 1.2.0.
@juhanishen I think I know what is happening. I ran mine in an interpreter, whereas you ran yours as a script. The f.remote() calls are non-blocking, so after the third call the script terminates before any of the f tasks complete, and their output never reaches the terminal. If you put the handles you get back in a list and then call ray.get() on it, the script will wait until all three calls return before exiting.
It would look like this.
import ray
import time

ray.init()

@ray.remote
def f():
    a = 3
    print("before print", flush=True)
    time.sleep(1)
    print("after print", flush=True)
    time.sleep(2)
    print("end", flush=True)
    return 1

# Collect the object refs, then block until all three tasks have finished.
f_handles = [f.remote() for _ in range(3)]
ray.get(f_handles)
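For anyone who wants to see the "submit returns immediately, waiting blocks" behavior without a Ray cluster, here is a rough analogy using only the standard library's concurrent.futures. This is not Ray code: pool.submit() stands in for f.remote(), fut.result() stands in for ray.get(), and the task name, the 0.2 second sleep and the variable names are all made up for illustration. One difference to keep in mind is that a ThreadPoolExecutor waits for its tasks on exit, so this sketch only shows the timing, not the lost output.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def f(task_id):
    # Hypothetical stand-in for the remote task in the thread above.
    time.sleep(0.2)
    return f"end {task_id}"


with ThreadPoolExecutor(max_workers=3) as pool:
    start = time.perf_counter()

    # submit() hands back a future right away, much like f.remote()
    # hands back an ObjectRef without waiting for the task.
    futures = [pool.submit(f, i) for i in range(3)]
    submit_time = time.perf_counter() - start
    done_right_after_submit = [fut.done() for fut in futures]

    # result() blocks until the task has finished, much like ray.get().
    results = [fut.result() for fut in futures]
    total_time = time.perf_counter() - start

print(f"submitting took {submit_time:.4f}s, waiting took {total_time:.4f}s")
print("finished right after submit:", done_right_after_submit)
print("results:", results)
```

Submitting takes almost no time and none of the tasks are done at that point. The work only shows up once something waits on the handles, which is why the script version needs the ray.get() call.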
Yes, that fixed my case. Thanks a lot @mannyv!