Ray k8s cluster, cannot run new task when previous task failed

Hi @yic, to be more clear: I log in to the head node, go to the project directory, and then run the code:

ssh head-node
cd /path/to/the/project
python main.py 

/path/to/the/project contains the code, and k8s mounts my code into these pods.

Hi @yic, I tried to create the runtime env like this:

    runtime_env = {
        "working_dir": "/home/me/app",
        "excludes": [
            "/home/me/app/.git/",
            "/home/me/app/epymarl/data/",
            "/home/me/app/ray_results/",
            "/home/me/app/third_party/",
        ],
    }
    ray.init("auto", runtime_env=runtime_env)

However, it seems Ray cannot actually exclude .git, and it reported the following error:

2022-06-15 15:24:00,301 INFO packaging.py:363 -- Creating a file package for local directory '/home/me/app'.
2022-06-15 15:24:00,656 WARNING packaging.py:259 -- File /home/me/app/.git/objects/pack/pack-363f95fdf8dc7f3144d8a4daa0695d4dd75ef07e.pack is very large (42.68MiB). Consider adding this file to the 'excludes' list to skip uploading it: `ray.init(..., runtime_env={'excludes': ['/home/me/app/.git/objects/pack/pack-363f95fdf8dc7f3144d8a4daa0695d4dd75ef07e.pack']})`
2022-06-15 15:24:01,745 WARNING packaging.py:259 -- File /home/me/app/.git/modules/third_party/ray/objects/pack/pack-de70ab7af10a6927b56eed9da619bcaad23c7814.pack is very large (158.96MiB). Consider adding this file to the 'excludes' list to skip uploading it: `ray.init(..., runtime_env={'excludes': ['/home/me/app/.git/modules/third_party/ray/objects/pack/pack-de70ab7af10a6927b56eed9da619bcaad23c7814.pack']})`
2022-06-15 15:24:02,050 WARNING packaging.py:259 -- File /home/me/app/.git/modules/third_party/meltingpot/objects/pack/pack-1c8ed26605bd47ade6c6d14b4311af921bbb6255.pack is very large (190.81MiB). Consider adding this file to the 'excludes' list to skip uploading it: `ray.init(..., runtime_env={'excludes': ['/home/me/app/.git/modules/third_party/meltingpot/objects/pack/pack-1c8ed26605bd47ade6c6d14b4311af921bbb6255.pack']})`
[ERROR 15:24:11] pymarl Failed after 0:00:21!
Traceback (most recent calls WITHOUT Sacred internals):
  File "/home/me/app/epymarl/src/main.py", line 65, in my_main
    run_train_meltingpot(_run, config, _log)
  File "/home/me/app/epymarl/src/run_meltingpot.py", line 60, in run
    run_sequential(args=args, logger=logger)
  File "/home/me/app/epymarl/src/run_meltingpot.py", line 104, in run_sequential
    ray.init("auto", runtime_env=runtime_env)
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/worker.py", line 977, in init
    connect(
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/worker.py", line 1517, in connect
    runtime_env = upload_working_dir_if_needed(
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/runtime_env/working_dir.py", line 64, in upload_working_dir_if_needed
    upload_package_if_needed(
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/runtime_env/packaging.py", line 411, in upload_package_if_needed
    upload_package_to_gcs(pkg_uri, package_file.read_bytes())
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/runtime_env/packaging.py", line 343, in upload_package_to_gcs
    _store_package_in_gcs(pkg_uri, pkg_bytes)
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/runtime_env/packaging.py", line 218, in _store_package_in_gcs
    raise RuntimeError(
RuntimeError: Package size (532.13MiB) exceeds the maximum size of 100.00MiB. You can exclude large files using the 'excludes' option to the runtime_env.

As you can see, excludes does not seem to work.

I also tried adding that big file to excludes, and it still reported the same error, so excludes really does not seem to take effect.

@architkulkarni could you take a look at this?

Oh, I think it’s related to this issue: [runtime env] `zip_directory` `excludes` parameter doesn't work with absolute paths · Issue #23473 · ray-project/ray · GitHub. I still need to follow up there. @GoingMyWay, can you try writing your excludes relative to the working_dir? So it would just be `['/.git/', '/epymarl/data', ...]`.
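In other words, something like this (just a sketch, keeping your working_dir as /home/me/app and writing the excludes as gitignore-style patterns relative to it):

    import ray

    runtime_env = {
        "working_dir": "/home/me/app",
        # gitignore-style patterns, interpreted relative to working_dir
        "excludes": [
            "/.git/",
            "/epymarl/data/",
            "/ray_results/",
            "/third_party/",
        ],
    }
    ray.init("auto", runtime_env=runtime_env)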

Also, if you think we should modify the default behavior, please feel free to leave a comment here or in the issue. It seems like we can’t support both absolute paths and also support gitignore syntax, because they have conflicting meanings for paths that start with /. So we need to pick a reasonable default, or find a compromise somehow…

Hi @architkulkarni, thanks. I think the runtime env is not what I want. I use Docker and created the cluster from a Docker image, so the environment already exists. The runtime env looks like a mechanism for setting up the environment dynamically, and I would need to configure many things to complete that setup, which duplicates what I have already done with Docker. Following your suggestion, it returns the following error:

(raylet, ip=172.24.56.163) [2022-06-17 18:32:17,992 E 73 73] (raylet) agent_manager.cc:136: Not all required Ray dependencies for the runtime_env feature were found. To install the required dependencies, please 
run `pip install "ray[default]"`.
[ERROR 18:32:18] pymarl Failed after 0:00:02!
Traceback (most recent calls WITHOUT Sacred internals):
  File "/home/me/app/epymarl/src/main.py", line 65, in my_main
    run_train_meltingpot(_run, config, _log)
  File "/home/me/app/epymarl/src/run_meltingpot.py", line 62, in run
    run_sequential(args=args, logger=logger)
  File "/home/me/app/epymarl/src/run_meltingpot.py", line 122, in run_sequential
    buffer, queue, buffer_queue, ray_ws = create_buffer(args, scheme, groups, env_info, preprocess, logger)
  File "/home/me/app/epymarl/src/run_meltingpot.py", line 423, in create_buffer
    assert ray.get(buffer.ready.remote())
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/worker.py", line 1765, in get
    raise value
ray.exceptions.RuntimeEnvSetupError: The runtime_env failed to be set up.

I think it was asking me to install the runtime_env requirements, but I should not need to do that since I am using Docker.

What I really want to figure out is why a k8s Ray pod cluster, once created, cannot be reused.

The following shows how I use the k8s ray pod cluster:

1. The admin created a new chart and deployed a Ray operator.

2. I use a YAML file to create a Ray pod cluster.

3. I log in to the head node and run the code (this works fine for debugging purposes).

4. I kill the current program and then re-run the code; however, the cluster cannot be reused.

5. I have to create a new cluster to run my job, which costs extra time and patience.

If you are already using Docker, it may be faster to bake all your dependencies into the Docker image. If you want to set up the dependencies dynamically at runtime, you can use runtime_env. If you use them together, my guess is that, due to the order of operations, the runtime_env specifications will override the ones in the Docker container.

I saw this: (raylet) agent_manager.cc:136: Not all required Ray dependencies for the runtime_env feature were found. To install the required dependencies, please
run `pip install "ray[default]"`

Is ray[default] installed on all nodes of the cluster?
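One rough way to check from the head node is to run a small probe task across the cluster. This is just a sketch (not an official Ray utility); it looks for a few packages that the ray[default] extra installs but the minimal ray wheel does not, such as aiohttp:

    import ray

    ray.init("auto")

    @ray.remote
    def check_default_extras():
        import importlib.util
        import socket
        # These come with `pip install "ray[default]"` but not with the minimal
        # `ray` wheel, so anything missing suggests a minimal install on that node.
        missing = [m for m in ("aiohttp", "aiohttp_cors", "opencensus")
                   if importlib.util.find_spec(m) is None]
        return socket.gethostname(), tuple(missing)

    # Launch more probes than nodes so they are likely (not guaranteed) to land
    # on every node, then de-duplicate by host.
    results = ray.get([check_default_extras.remote() for _ in range(20)])
    for host, missing in sorted(set(results)):
        print(host, "missing:", list(missing) or "nothing")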

If that still doesn’t work, to understand the RuntimeEnvSetupError, do you mind pasting the dashboard_agent.log file and sharing what Ray version you’re using? By default these logs are located at /tmp/ray/session_latest/logs on the head node of the cluster.

Yes, I installed this dependency in the Docker image.

Hi @architkulkarni, here is the output of dashboard_agent.log. BTW, why can't I reuse the cluster? Do you have any best practices? Trial and error really takes time, and I think there must be a smarter way to solve this problem.

2022-06-18 23:43:06,723	INFO agent.py:100 -- Parent pid is 115
2022-06-18 23:43:06,724	INFO agent.py:105 -- Dashboard agent grpc address: 0.0.0.0:44581
2022-06-18 23:43:06,725	INFO utils.py:79 -- Get all modules by type: DashboardAgentModule
2022-06-18 23:43:06,998	INFO agent.py:118 -- Loading DashboardAgentModule: <class 'ray.dashboard.modules.event.event_agent.EventAgent'>
2022-06-18 23:43:06,998	INFO event_agent.py:31 -- Event agent cache buffer size: 10240
2022-06-18 23:43:06,998	INFO agent.py:118 -- Loading DashboardAgentModule: <class 'ray.dashboard.modules.log.log_agent.LogAgent'>
2022-06-18 23:43:07,000	INFO agent.py:118 -- Loading DashboardAgentModule: <class 'ray.dashboard.modules.reporter.reporter_agent.ReporterAgent'>
2022-06-18 23:43:07,001	ERROR agent.py:436 -- [Errno -2] Name or service not known
Traceback (most recent call last):
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/dashboard/agent.py", line 391, in <module>
    loop.run_until_complete(agent.run())
  File "/home/me/miniconda3/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/dashboard/agent.py", line 178, in run
    modules = self._load_modules()
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/dashboard/agent.py", line 120, in _load_modules
    c = cls(self)
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/dashboard/modules/reporter/reporter_agent.py", line 161, in __init__
    self._metrics_agent = MetricsAgent(
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/metrics_agent.py", line 75, in __init__
    prometheus_exporter.new_stats_exporter(
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/prometheus_exporter.py", line 332, in new_stats_exporter
    exporter = PrometheusStatsExporter(
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/prometheus_exporter.py", line 265, in __init__
    self.serve_http()
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/prometheus_exporter.py", line 319, in serve_http
    start_http_server(
  File "/home/me/miniconda3/lib/python3.9/site-packages/prometheus_client/exposition.py", line 168, in start_wsgi_server
    TmpServer.address_family, addr = _get_best_family(addr, port)
  File "/home/me/miniconda3/lib/python3.9/site-packages/prometheus_client/exposition.py", line 157, in _get_best_family
    infos = socket.getaddrinfo(address, port)
  File "/home/me/miniconda3/lib/python3.9/socket.py", line 954, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known
2022-06-18 23:43:09,752	INFO agent.py:100 -- Parent pid is 115
2022-06-18 23:43:09,753	INFO agent.py:105 -- Dashboard agent grpc address: 0.0.0.0:44581
2022-06-18 23:43:09,754	INFO utils.py:79 -- Get all modules by type: DashboardAgentModule
2022-06-18 23:43:10,009	INFO agent.py:118 -- Loading DashboardAgentModule: <class 'ray.dashboard.modules.event.event_agent.EventAgent'>
2022-06-18 23:43:10,009	INFO event_agent.py:31 -- Event agent cache buffer size: 10240
2022-06-18 23:43:10,009	INFO agent.py:118 -- Loading DashboardAgentModule: <class 'ray.dashboard.modules.log.log_agent.LogAgent'>
2022-06-18 23:43:10,011	INFO agent.py:118 -- Loading DashboardAgentModule: <class 'ray.dashboard.modules.reporter.reporter_agent.ReporterAgent'>
2022-06-18 23:43:10,012	ERROR agent.py:436 -- [Errno -2] Name or service not known
Traceback (most recent call last):
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/dashboard/agent.py", line 391, in <module>
    loop.run_until_complete(agent.run())
  File "/home/me/miniconda3/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/dashboard/agent.py", line 178, in run
    modules = self._load_modules()
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/dashboard/agent.py", line 120, in _load_modules
    c = cls(self)
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/dashboard/modules/reporter/reporter_agent.py", line 161, in __init__
    self._metrics_agent = MetricsAgent(
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/metrics_agent.py", line 75, in __init__
    prometheus_exporter.new_stats_exporter(
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/prometheus_exporter.py", line 332, in new_stats_exporter
    exporter = PrometheusStatsExporter(
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/prometheus_exporter.py", line 265, in __init__
    self.serve_http()
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/prometheus_exporter.py", line 319, in serve_http
    start_http_server(
  File "/home/me/miniconda3/lib/python3.9/site-packages/prometheus_client/exposition.py", line 168, in start_wsgi_server
    TmpServer.address_family, addr = _get_best_family(addr, port)
  File "/home/me/miniconda3/lib/python3.9/site-packages/prometheus_client/exposition.py", line 157, in _get_best_family
    infos = socket.getaddrinfo(address, port)
  File "/home/me/miniconda3/lib/python3.9/socket.py", line 954, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known
2022-06-18 23:43:14,746	INFO agent.py:100 -- Parent pid is 115
2022-06-18 23:43:14,747	INFO agent.py:105 -- Dashboard agent grpc address: 0.0.0.0:44581
2022-06-18 23:43:14,748	INFO utils.py:79 -- Get all modules by type: DashboardAgentModule
2022-06-18 23:43:15,001	INFO agent.py:118 -- Loading DashboardAgentModule: <class 'ray.dashboard.modules.event.event_agent.EventAgent'>
2022-06-18 23:43:15,002	INFO event_agent.py:31 -- Event agent cache buffer size: 10240
2022-06-18 23:43:15,002	INFO agent.py:118 -- Loading DashboardAgentModule: <class 'ray.dashboard.modules.log.log_agent.LogAgent'>
2022-06-18 23:43:15,003	INFO agent.py:118 -- Loading DashboardAgentModule: <class 'ray.dashboard.modules.reporter.reporter_agent.ReporterAgent'>
2022-06-18 23:43:15,004	ERROR agent.py:436 -- [Errno -2] Name or service not known
Traceback (most recent call last):
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/dashboard/agent.py", line 391, in <module>
    loop.run_until_complete(agent.run())
  File "/home/me/miniconda3/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/dashboard/agent.py", line 178, in run
    modules = self._load_modules()
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/dashboard/agent.py", line 120, in _load_modules
    c = cls(self)
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/dashboard/modules/reporter/reporter_agent.py", line 161, in __init__
    self._metrics_agent = MetricsAgent(
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/metrics_agent.py", line 75, in __init__
    prometheus_exporter.new_stats_exporter(
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/prometheus_exporter.py", line 332, in new_stats_exporter
    exporter = PrometheusStatsExporter(
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/prometheus_exporter.py", line 265, in __init__
    self.serve_http()
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/prometheus_exporter.py", line 319, in serve_http
    start_http_server(
  File "/home/me/miniconda3/lib/python3.9/site-packages/prometheus_client/exposition.py", line 168, in start_wsgi_server
    TmpServer.address_family, addr = _get_best_family(addr, port)
  File "/home/me/miniconda3/lib/python3.9/site-packages/prometheus_client/exposition.py", line 157, in _get_best_family
    infos = socket.getaddrinfo(address, port)
  File "/home/me/miniconda3/lib/python3.9/socket.py", line 954, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known
2022-06-18 23:43:23,747	INFO agent.py:100 -- Parent pid is 115
2022-06-18 23:43:23,748	INFO agent.py:105 -- Dashboard agent grpc address: 0.0.0.0:44581
2022-06-18 23:43:23,749	INFO utils.py:79 -- Get all modules by type: DashboardAgentModule
2022-06-18 23:43:23,982	INFO agent.py:118 -- Loading DashboardAgentModule: <class 'ray.dashboard.modules.event.event_agent.EventAgent'>
2022-06-18 23:43:23,983	INFO event_agent.py:31 -- Event agent cache buffer size: 10240
2022-06-18 23:43:23,983	INFO agent.py:118 -- Loading DashboardAgentModule: <class 'ray.dashboard.modules.log.log_agent.LogAgent'>
2022-06-18 23:43:23,985	INFO agent.py:118 -- Loading DashboardAgentModule: <class 'ray.dashboard.modules.reporter.reporter_agent.ReporterAgent'>
2022-06-18 23:43:23,985	ERROR agent.py:436 -- [Errno -2] Name or service not known
Traceback (most recent call last):
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/dashboard/agent.py", line 391, in <module>
    loop.run_until_complete(agent.run())
  File "/home/me/miniconda3/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/dashboard/agent.py", line 178, in run
    modules = self._load_modules()
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/dashboard/agent.py", line 120, in _load_modules
    c = cls(self)
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/dashboard/modules/reporter/reporter_agent.py", line 161, in __init__
    self._metrics_agent = MetricsAgent(
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/metrics_agent.py", line 75, in __init__
    prometheus_exporter.new_stats_exporter(
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/prometheus_exporter.py", line 332, in new_stats_exporter
    exporter = PrometheusStatsExporter(
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/prometheus_exporter.py", line 265, in __init__
    self.serve_http()
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/prometheus_exporter.py", line 319, in serve_http
    start_http_server(
  File "/home/me/miniconda3/lib/python3.9/site-packages/prometheus_client/exposition.py", line 168, in start_wsgi_server
    TmpServer.address_family, addr = _get_best_family(addr, port)
  File "/home/me/miniconda3/lib/python3.9/site-packages/prometheus_client/exposition.py", line 157, in _get_best_family
    infos = socket.getaddrinfo(address, port)
  File "/home/me/miniconda3/lib/python3.9/socket.py", line 954, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known
2022-06-18 23:43:40,661	INFO agent.py:100 -- Parent pid is 115
2022-06-18 23:43:40,661	INFO agent.py:105 -- Dashboard agent grpc address: 0.0.0.0:44581
2022-06-18 23:43:40,663	INFO utils.py:79 -- Get all modules by type: DashboardAgentModule
2022-06-18 23:43:40,915	INFO agent.py:118 -- Loading DashboardAgentModule: <class 'ray.dashboard.modules.event.event_agent.EventAgent'>
2022-06-18 23:43:40,915	INFO event_agent.py:31 -- Event agent cache buffer size: 10240
2022-06-18 23:43:40,915	INFO agent.py:118 -- Loading DashboardAgentModule: <class 'ray.dashboard.modules.log.log_agent.LogAgent'>
2022-06-18 23:43:40,917	INFO agent.py:118 -- Loading DashboardAgentModule: <class 'ray.dashboard.modules.reporter.reporter_agent.ReporterAgent'>
2022-06-18 23:43:40,918	ERROR agent.py:436 -- [Errno -2] Name or service not known
Traceback (most recent call last):
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/dashboard/agent.py", line 391, in <module>
    loop.run_until_complete(agent.run())
  File "/home/me/miniconda3/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/dashboard/agent.py", line 178, in run
    modules = self._load_modules()
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/dashboard/agent.py", line 120, in _load_modules
    c = cls(self)
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/dashboard/modules/reporter/reporter_agent.py", line 161, in __init__
    self._metrics_agent = MetricsAgent(
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/metrics_agent.py", line 75, in __init__
    prometheus_exporter.new_stats_exporter(
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/prometheus_exporter.py", line 332, in new_stats_exporter
    exporter = PrometheusStatsExporter(
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/prometheus_exporter.py", line 265, in __init__
    self.serve_http()
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/prometheus_exporter.py", line 319, in serve_http
    start_http_server(
  File "/home/me/miniconda3/lib/python3.9/site-packages/prometheus_client/exposition.py", line 168, in start_wsgi_server
    TmpServer.address_family, addr = _get_best_family(addr, port)
  File "/home/me/miniconda3/lib/python3.9/site-packages/prometheus_client/exposition.py", line 157, in _get_best_family
    infos = socket.getaddrinfo(address, port)
  File "/home/me/miniconda3/lib/python3.9/socket.py", line 954, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known
2022-06-18 23:44:13,648	INFO agent.py:100 -- Parent pid is 115
2022-06-18 23:44:13,648	INFO agent.py:105 -- Dashboard agent grpc address: 0.0.0.0:44581
2022-06-18 23:44:13,649	INFO utils.py:79 -- Get all modules by type: DashboardAgentModule
2022-06-18 23:44:13,902	INFO agent.py:118 -- Loading DashboardAgentModule: <class 'ray.dashboard.modules.event.event_agent.EventAgent'>
2022-06-18 23:44:13,902	INFO event_agent.py:31 -- Event agent cache buffer size: 10240
2022-06-18 23:44:13,902	INFO agent.py:118 -- Loading DashboardAgentModule: <class 'ray.dashboard.modules.log.log_agent.LogAgent'>
2022-06-18 23:44:13,904	INFO agent.py:118 -- Loading DashboardAgentModule: <class 'ray.dashboard.modules.reporter.reporter_agent.ReporterAgent'>

Hi @GoingMyWay thanks for pasting the log. Sorry for the frustration with the trial-and-error, hopefully we can get it working soon. You should be able to reuse the cluster once we figure out this problem, but my guess is for this particular kind of failure the cluster unfortunately needs to be restarted.

I haven’t seen socket.gaierror: [Errno -2] Name or service not known before and I’m not sure how to debug it – it looks like it might be some kind of failure of cluster nodes to communicate with each other over the network. @sangcho or @GuyangSong have you seen this before or do you have any ideas on how to debug it?
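For reference, the call at the bottom of each traceback can be reproduced by hand. A minimal sketch (assuming, as the traceback suggests, that the metrics agent is handed a hostname or address that DNS inside the pod cannot resolve; the port here is arbitrary):

    import socket

    # prometheus_client calls socket.getaddrinfo(address, port) before binding its
    # HTTP server; it raises socket.gaierror: [Errno -2] if the name doesn't resolve.
    host = socket.gethostname()
    print(host, socket.getaddrinfo(host, 9090))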

Dear @architkulkarni, thanks for understanding. If you need more context, please let me know.

It seems the socket error is from metrics_agent, which is not in the critical path of tasks. I don’t think it is the root cause of the task failure.

Hey @GuyangSong, is there anything I can do to help you diagnose this?

(raylet, ip=172.24.56.163) [2022-06-17 18:32:17,992 E 73 73] (raylet) agent_manager.cc:136: Not all required Ray dependencies for the runtime_env feature were found. To install the required dependencies, please 
run `pip install "ray[default]"`.

Does this error message still appear in your case?

If it appears, can you paste the command line of the raylet process from `ps -ef | grep raylet`?

By the way, you should check the node shown in the prefix of the error log (raylet, ip=172.24.56.163), not the node that main.py runs on.

Hey @GuyangSong, there is no such error now, but I still cannot reuse the cluster.

Sorry, I cannot see it. Currently the error is the same as in my previous post: Ray k8s cluster, cannot run new task when previous task failed

Do you have any idea what is wrong with it?

Have you set the runtime_env? Is the components module located in your working_dir?
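For example, a quick check on the head node would be something like this (a sketch; it assumes the working_dir is still /home/me/app and that components is a top-level package there):

    import pathlib

    # The uploaded working_dir is what the workers unpack and import from, so a
    # top-level `components` package has to live directly under it.
    working_dir = pathlib.Path("/home/me/app")
    print((working_dir / "components" / "__init__.py").exists())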

@GuyangSong, for the first run I did not set it. Then I set it and ran the code:

(pid=gcs_server) [2022-06-23 21:02:30,624 E 60 60] (gcs_server) gcs_actor_scheduler.cc:320: The lease worker request from node 512e4dd976cf969e81ae8b479ad888a40cae2f8a7c89aa76a023f104 for actor 4b60b9fcc
bcd40a5601000000(_QueueActor.__init__) has been canceled, job id = 01000000, cancel type: SCHEDULING_CANCELLED_RUNTIME_ENV_SETUP_FAILED
(pid=gcs_server) [2022-06-23 21:02:30,633 E 60 60] (gcs_server) gcs_actor_scheduler.cc:320: The lease worker request from node b88ae7a900d5774a398def1c9792c5e6692c2446e85183e00c257b9f for actor 7bc55c1eecaa08f9
fa80dbd901000000(_QueueActor.__init__) has been canceled, job id = 01000000, cancel type: SCHEDULING_CANCELLED_RUNTIME_ENV_SETUP_FAILED
(pid=gcs_server) [2022-06-23 21:02:30,642 E 60 60] (gcs_server) gcs_actor_scheduler.cc:320: The lease worker request from node 4b68912ec6784ea01f26a9548bf68221a58faee9afb6ad0dea2dacaa for actor 5497df4a81fac901
e1be7ec401000000(_QueueActor.__init__) has been canceled, job id = 01000000, cancel type: SCHEDULING_CANCELLED_RUNTIME_ENV_SETUP_FAILED
(pid=gcs_server) [2022-06-23 21:02:30,652 E 60 60] (gcs_server) gcs_actor_scheduler.cc:320: The lease worker request from node 4b68912ec6784ea01f26a9548bf68221a58faee9afb6ad0dea2dacaa for actor e6aa9f3bfb8cea4d
b7d08b8401000000(_QueueActor.__init__) has been canceled, job id = 01000000, cancel type: SCHEDULING_CANCELLED_RUNTIME_ENV_SETUP_FAILED
(pid=gcs_server) [2022-06-23 21:02:30,668 E 60 60] (gcs_server) gcs_actor_scheduler.cc:320: The lease worker request from node b88ae7a900d5774a398def1c9792c5e6692c2446e85183e00c257b9f for actor 6e15381ff4d31a63
3c77974d01000000(_QueueActor.__init__) has been canceled, job id = 01000000, cancel type: SCHEDULING_CANCELLED_RUNTIME_ENV_SETUP_FAILED
(pid=gcs_server) [2022-06-23 21:02:30,684 E 60 60] (gcs_server) gcs_actor_scheduler.cc:320: The lease worker request from node 4b68912ec6784ea01f26a9548bf68221a58faee9afb6ad0dea2dacaa for actor bb6379f2a6cb30db
f408263901000000(_QueueActor.__init__) has been canceled, job id = 01000000, cancel type: SCHEDULING_CANCELLED_RUNTIME_ENV_SETUP_FAILED
[INFO 21:02:30] run_meltingpot Buffer size: 600
[ERROR 21:02:30] pymarl Failed after 0:00:03!
Traceback (most recent calls WITHOUT Sacred internals):
  File "/home/me/app/epymarl/src/main.py", line 66, in my_main
    run_train_meltingpot(_run, config, _log)
  File "/home/me/app/epymarl/src/run_meltingpot.py", line 62, in run
    run_sequential(args=args, logger=logger)
  File "/home/me/app/epymarl/src/run_meltingpot.py", line 122, in run_sequential
    buffer, queue, buffer_queue, ray_ws = create_buffer(args, scheme, groups, env_info, preprocess, logger)
  File "/home/me/app/epymarl/src/run_meltingpot.py", line 509, in create_buffer
    assert ray.get(buffer.ready.remote())
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/me/miniconda3/lib/python3.9/site-packages/ray/worker.py", line 1765, in get
    raise value
ray.exceptions.RuntimeEnvSetupError: The runtime_env failed to be set up.

(raylet) [2022-06-23 21:02:30,763 C 109 109] (raylet) dependency_manager.cc:208:  Check failed: task_entry != queued_task_requests_.end() Can't remove dependencies of tasks that are not queued.
(raylet) *** StackTrace Information ***
(raylet)     ray::SpdLogMessage::Flush()
(raylet)     ray::RayLog::~RayLog()
(raylet)     ray::raylet::DependencyManager::RemoveTaskDependencies()
(raylet)     ray::raylet::ClusterTaskManager::PoppedWorkerHandler()
(raylet)     std::_Function_handler<>::_M_invoke()
(raylet)     std::_Function_handler<>::_M_invoke()
(raylet)     std::_Function_handler<>::_M_invoke()
(raylet)     std::_Function_handler<>::_M_invoke()
(raylet)     boost::asio::detail::wait_handler<>::do_complete()
(raylet)     boost::asio::detail::scheduler::do_run_one()
(raylet)     boost::asio::detail::scheduler::run()
(raylet)     boost::asio::io_context::run()
(raylet)     main
(raylet)     __libc_start_main
(raylet) 
(pid=gcs_server) [2022-06-23 21:02:30,709 E 60 60] (gcs_server) gcs_actor_scheduler.cc:320: The lease worker request from node 4b68912ec6784ea01f26a9548bf68221a58faee9afb6ad0dea2dacaa for actor e23a3d4127997687
6bdce53201000000(_QueueActor.__init__) has been canceled, job id = 01000000, cancel type: SCHEDULING_CANCELLED_RUNTIME_ENV_SETUP_FAILED
(pid=gcs_server) [2022-06-23 21:02:30,734 E 60 60] (gcs_server) gcs_actor_scheduler.cc:320: The lease worker request from node 167498de1e2bf62a2035943e1f85515f74c77677c92fdddc217ae725 for actor 57c6f66434f69b96
3200f29d01000000(ReplayBufferwithQueue.__init__) has been canceled, job id = 01000000, cancel type: SCHEDULING_CANCELLED_RUNTIME_ENV_SETUP_FAILED

Then I also launched a new cluster and ran the code, and I got the same error.

components is a module in my project's code, and I use Docker to mount my code into the containers.

Hey @GuyangSong, you can see this post for more information: Ray k8s cluster, cannot run new task when previous task failed - #14 by GoingMyWay