[Core] Ray fails to log or print messages to console after worker previously dies

After a worker errors out and I attempt to run the job again, print messages or logs fail to appear in the terminal.

related github issue.

Hey @Javier_Bosch do you have a repro script I can run?

Hey @rliaw, I tried to reproduce the behavior with a separate script that raises an error, but I could not reproduce it.

In my current implementation I am making calls to elasticsearch. After the function fails (likely a timeout or an over-extended queue inside elastic), I believe ray retries the function and the retry also fails. When I rerun the job, it runs but does not log or print any output to the screen.

Also, I am running my ray function inside kedro, which logs everything to the console, so I am not sure whether something else is at play here. When I rerun my pipeline, in which one of the nodes executes my ray job, my main script's logs are output to the console, but the log and print statements inside the ray task function are not.

I have the same issue. The source is listed in /1/. The first print works, but the second and third do not. What's the follow-up?

/1/

import ray
import time

ray.init()

@ray.remote
def f():
    a = 3
    print("before print", flush=True)
    time.sleep(1)
    print("after print", flush=True)
    time.sleep(2)
    print("end", flush=True)
    return 1

f.remote()
f.remote()
f.remote()

@rliaw, I believe this is still an issue. @juhanishen seems to have a script to reproduce the behavior.

My error still occurs after a previous exception was raised, although it does not happen every time. When I restart ray, I can see my print messages again.

Hi @juhanishen,

What version of ray are you running? I just ran your script in ray 1.2.0, python 3.7 and this is what I get:

>>> import ray
>>> import time
>>> 
>>> ray.init();
2021-04-12 21:14:47,819 INFO services.py:1174 -- View the Ray dashboard at http://127.0.0.1:8265
{'node_ip_address': '192.168.1.216', 'raylet_ip_address': '192.168.1.216', 'redis_address': '192.168.1.216:6379', 'object_store_address': '/tmp/ray/session_2021-04-12_21-14-47_321816_3083317/sockets/plasma_store', 'raylet_socket_name': '/tmp/ray/session_2021-04-12_21-14-47_321816_3083317/sockets/raylet', 'webui_url': '127.0.0.1:8265', 'session_dir': '/tmp/ray/session_2021-04-12_21-14-47_321816_3083317', 'metrics_export_port': 64664, 'node_id': '19df1654af900d5dcd0281a1097678fd2f9fab6169deae3a2fd9c6d8'}
>>> 
>>> @ray.remote
... def f():
...     a = 3;
...     print("before print",flush=True);
...     time.sleep(1);
...     print("after print",flush=True);
...     time.sleep(2);
...     print("end",flush=True);
...     return 1;
... 
>>> f.remote()
ObjectRef(a67dc375e60ddd1affffffffffffffffffffffff0100000001000000)
>>> f.remote()
ObjectRef(63964fa4841d4a2effffffffffffffffffffffff0100000001000000)
>>> f.remote()
ObjectRef(69a6825d641b4613ffffffffffffffffffffffff0100000001000000)
>>> 
>>> (pid=3083627) before print
(pid=3083630) before print
(pid=3083631) before print
(pid=3083627) after print
(pid=3083630) after print
(pid=3083631) after print
(pid=3083627) end
(pid=3083630) end
(pid=3083631) end
>>> 

Hi, I have python version 3.7, but ray version 1.0.1post. I will try installing ray 1.2 and see whether the error still reproduces. Anyway, it is great that someone is following up on the issue.

Br

Juhani

I also tried the latest master with the repro script and could not reproduce the issue. So I assume the problem was resolved at some point between 1.0.1 and 1.2.0.

I can reproduce the bug on CentOS 8.3.

@juhanishen I think I know what is happening. I ran mine in an interpreter, whereas you ran yours as a script. The f.remote() calls are not blocking, so after the third call the script terminates before any of the f tasks complete. If you put the handles you get back in a list and then call ray.get() on it, the script will wait until all three calls return before exiting.

It would look like this.

import ray
import time

ray.init()

@ray.remote
def f():
    a = 3
    print("before print", flush=True)
    time.sleep(1)
    print("after print", flush=True)
    time.sleep(2)
    print("end", flush=True)
    return 1

# Keep the object refs so the driver can wait on them.
f_handles = [f.remote() for _ in range(3)]
# Block until all three tasks have finished before the script exits.
ray.get(f_handles)

Yes, that fixed my case. Thanks a lot @mannyv!!