I run a RayCluster (Ray 2.39.0) using KubeRay (1.2.2) and submit many jobs to it. I discovered that many zombie processes are left behind after the jobs finish.
These zombie processes cause some psutil methods to run very slowly.
Each submitted job leaves 2 zombie processes. In more detail: when I submit a job, the JobSupervisor starts on the head node to hold the job. The JobSupervisor (pid=152424) runs 2 subprocesses:
/bin/bash -c python numpy-cpu-job-actor.py, pid is 152834
/bin/bash -c while kill -s 0 152424; do sleep 1; done; kill -9 -152824, pid is 152836
When the job finishes, 152424 and 152834 exit, but 152836 and its subprocess are left as zombies:
1) [sh] <defunct>
2) [sleep] <defunct>
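For anyone who wants to check their own head node for the same symptom, here is a minimal sketch that lists defunct processes without going through psutil. It assumes a Linux node with /proc mounted; `list_zombies` is a hypothetical helper name used only for illustration and is not part of Ray or psutil.

```python
import os


def list_zombies(proc_root="/proc"):
    """Return (pid, ppid, name) for every process whose state is 'Z' (zombie).

    Assumption: Linux-style /proc/<pid>/stat files. Returns [] elsewhere.
    """
    zombies = []
    if not os.path.isdir(proc_root):
        return zombies
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "stat"), errors="replace") as f:
                stat = f.read()
        except OSError:
            continue  # the process disappeared while we were scanning
        # Layout: "pid (comm) state ppid ...". comm may itself contain spaces
        # or parentheses, so split on the first "(" and the last ")".
        lpar, rpar = stat.find("("), stat.rfind(")")
        if lpar < 0 or rpar < 0:
            continue
        fields = stat[rpar + 1:].split()
        if len(fields) < 2:
            continue
        state, ppid = fields[0], int(fields[1])
        if state == "Z":
            zombies.append((int(entry), ppid, stat[lpar + 1:rpar]))
    return zombies


if __name__ == "__main__":
    for pid, ppid, name in list_zombies():
        print("zombie pid=%d ppid=%d name=%s" % (pid, ppid, name))
```

The ppid column shows which process has not reaped the zombie, which may help tell whether the leftover entries still belong to a live parent on the head node.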
The code of numpy-cpu-job-actor.py is:

import ray
import numpy as np
import datetime

t0 = datetime.datetime.now()
formatted_time = t0.strftime("%Y-%m-%d %H:%M:%S,%f")[:-3]
print("Starting at ", formatted_time)

# Connect to the cluster the job was submitted to.
ray.init()
t1 = datetime.datetime.now()
formatted_time = t1.strftime("%Y-%m-%d %H:%M:%S,%f")[:-3]
print("Ray initialized at ", formatted_time)


# Must be a remote function, since it is called with .options().remote() below.
@ray.remote
def cpu_intensive_task():
    result = 0
    tt1 = datetime.datetime.now()
    print("Start at ", tt1.strftime("%Y-%m-%d %H:%M:%S,%f")[:-3])
    # Burn CPU by summing 5 million random numbers.
    for _ in range(int(5e6)):
        result += np.random.rand()
    tt2 = datetime.datetime.now()
    formatted_time1 = tt2.strftime("%Y-%m-%d %H:%M:%S,%f")[:-3]
    print("Finished at %s, cost %.2f second." % (formatted_time1, (tt2-tt1).total_seconds()))
    return result


try:
    t2 = datetime.datetime.now()
    formatted_time = t2.strftime("%Y-%m-%d %H:%M:%S,%f")[:-3]
    print("Placement group ready at ", formatted_time)
    t3 = datetime.datetime.now()
    formatted_time = t3.strftime("%Y-%m-%d %H:%M:%S,%f")[:-3]
    print("Actor scheduled at ", formatted_time)
    # Launch two tasks in parallel and record when they were scheduled.
    result_ids = [cpu_intensive_task.options().remote() for _ in range(2)]
    t4 = datetime.datetime.now()
    formatted_time = t4.strftime("%Y-%m-%d %H:%M:%S,%f")[:-3]
    print("Actor task scheduled at ", formatted_time)
    # Block until both tasks have finished.
    results = ray.get(result_ids)
    t5 = datetime.datetime.now()
    formatted_time = t5.strftime("%Y-%m-%d %H:%M:%S,%f")[:-3]
    print("Finished at ", formatted_time)
    print("Result: ")
    print("t0 - t1 - t2 - t3 - t4 - t5: %.2f - %.2f - %.2f - %.2f - %.2f" % ((t1-t0).total_seconds(), (t2-t1).total_seconds(), (t3-t2).total_seconds(), (t4-t3).total_seconds(), (t5-t4).total_seconds()))
except KeyboardInterrupt:
    print("Ray shutdown at ", formatted_time)
The submit command is:
ray job submit --working-dir . -- python numpy-cpu-job.py
I'm wondering whether I did something wrong to cause this, or whether this is a bug in Ray itself?