Unknown error that does not appear in dashboard_agent.log

How severe does this issue affect your experience of using Ray?

  • High: It blocks me to complete my task.

Hi! I am trying to use ray.tune for hyperparameter optimisation, but I cannot run even a very simple model because every time I receive the same error: agent_manager.cc:135: The raylet exited immediately because the Ray agent failed. The raylet fate shares with the agent. This can happen because the Ray agent was unexpectedly killed or failed. See dashboard_agent.log for the root cause.

But when I open that file, I don’t see any error or anything else that indicates how to solve the issue. These are the last lines of the file (I’m not copying the whole thing here):

2023-01-24 21:56:36,091 INFO agent.py:160 – Loading DashboardAgentModule: <class ‘ray.dashboard.modules.runtime_env.runtime_env_agent.RuntimeEnvAgent’>
2023-01-24 21:56:36,092 INFO agent.py:160 – Loading DashboardAgentModule: <class ‘ray.dashboard.modules.serve.serve_agent.ServeAgent’>
2023-01-24 21:56:36,093 INFO agent.py:165 – Loaded 8 modules.
2023-01-24 21:56:36,099 INFO http_server_agent.py:74 – Dashboard agent http address: 0.0.0.0:52365
2023-01-24 21:56:36,099 INFO http_server_agent.py:81 – <ResourceRoute [GET] <PlainResource /api/local_raylet_healthz> → <function HealthzAgent.health_check at 0x7f4966b417e0>
2023-01-24 21:56:36,099 INFO http_server_agent.py:81 – <ResourceRoute [OPTIONS] <PlainResource /api/local_raylet_healthz> → <bound method _PreflightHandler._preflight_handler of <aiohttp_cors.cors_config._CorsConfigImpl object at 0x7f496693c640>>
2023-01-24 21:56:36,099 INFO http_server_agent.py:81 – <ResourceRoute [POST] <PlainResource /api/job_agent/jobs/> → <function JobAgent.submit_job at 0x7f4966b68dc0>
2023-01-24 21:56:36,099 INFO http_server_agent.py:81 – <ResourceRoute [OPTIONS] <PlainResource /api/job_agent/jobs/> → <bound method _PreflightHandler._preflight_handler of <aiohttp_cors.cors_config._CorsConfigImpl object at 0x7f496693c640>>
2023-01-24 21:56:36,099 INFO http_server_agent.py:81 – <ResourceRoute [POST] <DynamicResource /api/job_agent/jobs/{job_or_submission_id}/stop> → <function JobAgent.stop_job at 0x7f4966b68f70>
2023-01-24 21:56:36,100 INFO http_server_agent.py:81 – <ResourceRoute [OPTIONS] <DynamicResource /api/job_agent/jobs/{job_or_submission_id}/stop> → <bound method _PreflightHandler._preflight_handler of <aiohttp_cors.cors_config._CorsConfigImpl object at 0x7f496693c640>>
2023-01-24 21:56:36,100 INFO http_server_agent.py:81 – <ResourceRoute [DELETE] <DynamicResource /api/job_agent/jobs/{job_or_submission_id}> → <function JobAgent.delete_job at 0x7f4966b69120>
2023-01-24 21:56:36,100 INFO http_server_agent.py:81 – <ResourceRoute [OPTIONS] <DynamicResource /api/job_agent/jobs/{job_or_submission_id}> → <bound method _PreflightHandler._preflight_handler of <aiohttp_cors.cors_config._CorsConfigImpl object at 0x7f496693c640>>
2023-01-24 21:56:36,100 INFO http_server_agent.py:81 – <ResourceRoute [GET] <DynamicResource /api/job_agent/jobs/{job_or_submission_id}/logs> → <function JobAgent.get_job_logs at 0x7f4966b692d0>
2023-01-24 21:56:36,100 INFO http_server_agent.py:81 – <ResourceRoute [OPTIONS] <DynamicResource /api/job_agent/jobs/{job_or_submission_id}/logs> → <bound method _PreflightHandler._preflight_handler of <aiohttp_cors.cors_config._CorsConfigImpl object at 0x7f496693c640>>
2023-01-24 21:56:36,100 INFO http_server_agent.py:81 – <ResourceRoute [GET] <DynamicResource /api/job_agent/jobs/{job_or_submission_id}/logs/tail> → <function JobAgent.tail_job_logs at 0x7f4966b69480>
2023-01-24 21:56:36,100 INFO http_server_agent.py:81 – <ResourceRoute [OPTIONS] <DynamicResource /api/job_agent/jobs/{job_or_submission_id}/logs/tail> → <bound method _PreflightHandler._preflight_handler of <aiohttp_cors.cors_config._CorsConfigImpl object at 0x7f496693c640>>
2023-01-24 21:56:36,100 INFO http_server_agent.py:81 – <ResourceRoute [GET] <PlainResource /api/ray/version> → <function ServeAgent.get_version at 0x7f49668e2950>
2023-01-24 21:56:36,100 INFO http_server_agent.py:81 – <ResourceRoute [OPTIONS] <PlainResource /api/ray/version> → <bound method _PreflightHandler._preflight_handler of <aiohttp_cors.cors_config._CorsConfigImpl object at 0x7f496693c640>>
2023-01-24 21:56:36,100 INFO http_server_agent.py:81 – <ResourceRoute [GET] <PlainResource /api/serve/deployments/> → <function ServeAgent.get_all_deployments at 0x7f49668e29e0>
2023-01-24 21:56:36,100 INFO http_server_agent.py:81 – <ResourceRoute [OPTIONS] <PlainResource /api/serve/deployments/> → <bound method _PreflightHandler._preflight_handler of <aiohttp_cors.cors_config._CorsConfigImpl object at 0x7f496693c640>>
2023-01-24 21:56:36,100 INFO http_server_agent.py:81 – <ResourceRoute [GET] <PlainResource /api/serve/deployments/status> → <function ServeAgent.get_all_deployment_statuses at 0x7f49668e2b90>
2023-01-24 21:56:36,101 INFO http_server_agent.py:81 – <ResourceRoute [OPTIONS] <PlainResource /api/serve/deployments/status> → <bound method _PreflightHandler._preflight_handler of <aiohttp_cors.cors_config._CorsConfigImpl object at 0x7f496693c640>>
2023-01-24 21:56:36,101 INFO http_server_agent.py:81 – <ResourceRoute [DELETE] <PlainResource /api/serve/deployments/> → <function ServeAgent.delete_serve_application at 0x7f49668e2d40>
2023-01-24 21:56:36,101 INFO http_server_agent.py:81 – <ResourceRoute [PUT] <PlainResource /api/serve/deployments/> → <function ServeAgent.put_all_deployments at 0x7f49668e2ef0>
2023-01-24 21:56:36,101 INFO http_server_agent.py:81 – <ResourceRoute [OPTIONS] <PlainResource /api/serve/deployments/> → <bound method _PreflightHandler._preflight_handler of <aiohttp_cors.cors_config._CorsConfigImpl object at 0x7f496693c640>>
2023-01-24 21:56:36,101 INFO http_server_agent.py:81 – <ResourceRoute [GET] <StaticResource /logs → PosixPath(‘/tmp/ray/session_2023-01-24_21-56-32_050369_270787/logs’)> → <bound method StaticResource._handle of <StaticResource /logs → PosixPath(‘/tmp/ray/session_2023-01-24_21-56-32_050369_270787/logs’)>>
2023-01-24 21:56:36,101 INFO http_server_agent.py:81 – <ResourceRoute [OPTIONS] <StaticResource /logs → PosixPath(‘/tmp/ray/session_2023-01-24_21-56-32_050369_270787/logs’)> → <bound method _PreflightHandler._preflight_handler of <aiohttp_cors.cors_config._CorsConfigImpl object at 0x7f496693c640>>
2023-01-24 21:56:36,101 INFO http_server_agent.py:82 – Registered 23 routes.
2023-01-24 21:56:36,110 INFO event_agent.py:56 – Report events to 10.216.0.171:45473
2023-01-24 21:56:36,111 INFO event_utils.py:131 – Monitor events logs modified after 1674591995.9445176 on /tmp/ray/session_2023-01-24_21-56-32_050369_270787/logs/events, the source types are all.

Does anyone have an idea of what could be going on here? I’m completely blocked.

Thanks!
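
Since the tail of dashboard_agent.log above contains only INFO lines, one option is to search every file in the same session log directory for error markers. The following is a minimal sketch using only the Python standard library; the function name `find_error_lines` and the marker list are illustrative assumptions, not something Ray provides.

```python
from pathlib import Path

# Markers that often indicate a problem. This list is an assumption and can be extended.
DEFAULT_MARKERS = ("ERROR", "Traceback", "FATAL")


def find_error_lines(log_dir, markers=DEFAULT_MARKERS):
    """Return (file name, line number, line) for every line containing a marker."""
    hits = []
    for path in sorted(Path(log_dir).iterdir()):
        if not path.is_file():
            continue
        # errors="replace" keeps the scan going if a log contains non-UTF-8 bytes.
        with path.open("r", encoding="utf-8", errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if any(marker in line for marker in markers):
                    hits.append((path.name, number, line.rstrip("\n")))
    return hits
```

For the session shown in the log above, it could be called as `find_error_lines("/tmp/ray/session_2023-01-24_21-56-32_050369_270787/logs")`. Files other than dashboard_agent.log in that directory may hold the actual failure.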

@AleTL Hey, thanks for posting!

One question first, do you have a multi node cluster? Is this log from the agent that’s on the same node as the failed raylet?

@kai ccing library oncall as well for tune

This looks like an issue on the Ray cluster. Where are you running this code? Is this on a laptop, or a cluster started with the open source cluster launcher, or somewhere else? Are you using Kubernetes? Which Ray version are you using?
Lastly, does this only come up when you run the tune script or also when you run a random other Ray snippet?
If this only comes up with the tune snippet, can you post it?
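
To check whether the failure is specific to Tune, a small Ray program that does not touch Tune can be run on the same node. The sketch below is one such check, assuming Ray is installed in the active environment; `run_ray_smoke_test` and `square` are hypothetical names chosen for illustration.

```python
def run_ray_smoke_test(num_tasks=4):
    """Run a few trivial remote tasks and return their results."""
    # Imported here so the function can be defined even where Ray is missing.
    import ray

    # Start Ray or connect to an existing cluster, depending on the environment.
    ray.init()

    @ray.remote
    def square(x):
        return x * x

    try:
        # Submit the tasks and block until all results are available.
        return ray.get([square.remote(i) for i in range(num_tasks)])
    finally:
        # Always disconnect, even if a task or the raylet failed.
        ray.shutdown()
```

If `run_ray_smoke_test()` returns the squares of 0 through 3, basic task scheduling works and the problem is more likely tied to the Tune script. If the raylet dies here too, the cluster setup is the likelier culprit.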

Thanks for your answer!

It’s a multi node cluster, yes, and the log is from the same node. I’ve tried again with a different resource configuration, and now Ray’s own error output shows:

*** Error in `/miniconda3/myenv/bin/python': free(): corrupted unsorted chunks: 0x000056272463b3b0 ***

It is followed by a backtrace with entries such as /miniconda3/myenv/lib/python3.10/site-packages/grpc/_cython/cygrpc.cpython-310-x86_64-linux-gnu.so(+0x344ed)[0x7f57094214ed] or miniconda3/envs/myenv/bin/python(_PyEval_EvalFrameDefault+0x5d6f)[0x56272234618f], and then by a memory map.
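
A `free(): corrupted unsorted chunks` message comes from glibc and indicates heap corruption in native code, so the Python-level location of the crash is not obvious from it. One way to get more context is Python's standard `faulthandler` module, which writes the Python traceback of the crashing process when a fatal signal such as SIGABRT or SIGSEGV arrives. The sketch below is an assumption about how this could be wired in: `enable_crash_traces` and the output path are hypothetical, and the call would need to run inside the process that crashes (for example at the start of the trainer) to be useful.

```python
import faulthandler


def enable_crash_traces(path):
    """Send Python tracebacks for fatal signals to the file at `path`."""
    # The file must stay open for as long as faulthandler is enabled,
    # so the handle is returned to the caller instead of being closed here.
    handle = open(path, "w")
    faulthandler.enable(file=handle, all_threads=True)
    return handle
```

A call such as `trace_file = enable_crash_traces("/tmp/trainer_crash.log")` (the path is only an example) keeps the handle alive for the lifetime of the process. Setting the environment variable `PYTHONFAULTHANDLER=1` enables the same handler with output on stderr.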

Thanks for your answer!

I am running the code on a cluster, without Kubernetes. The Ray version is 2.2.0 and the ray-tune version is 2.1.0.
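
The two version numbers above differ, so it may be worth confirming what is actually installed in the environment, including grpcio, since the backtrace earlier in the thread points into grpc's cygrpc extension. The sketch below uses `importlib.metadata` from the standard library. The function name and the list of distribution names are assumptions, and a conda package such as ray-tune may not be visible under that name. Whether a version mismatch is related to the crash is not established here.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(names=("ray", "grpcio")):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The distribution is not installed under this name.
            found[name] = None
    return found
```

Running `installed_versions()` on every node and comparing the results is one way to spot environments that have drifted apart.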

import time

import ray
from ray import tune


def train(configuration):
    if configuration["trainer_config"]["use_tune"]:
        ###########################################
        # Use tune
        ###########################################

        # Initialize Ray with the CPU and GPU totals from the configuration.
        ray.init(num_cpus=configuration["total_resources"]["cpu"],
                 num_gpus=configuration["total_resources"]["gpu"])

        # Wait a few seconds before scheduling any trials.
        time_to_sleep = 5
        print("Sleeping for %d seconds" % time_to_sleep)
        time.sleep(time_to_sleep)
        print("Woke up.. Scheduling")

        # Launch the trials; the optional checkpoint settings default to None
        # when they are missing from the configuration.
        tune.run(
            configuration["trainer"],
            name=configuration["name"],
            config=configuration["trainer_config"],
            stop=configuration["stop"],
            resources_per_trial=configuration["resources_per_trial"],
            local_dir=configuration["summaries_dir"],
            checkpoint_freq=configuration.get("checkpoint_freq"),
            checkpoint_at_end=configuration.get("checkpoint_at_end"),
            checkpoint_score_attr=configuration.get("checkpoint_score_attr"),
            keep_checkpoints_num=configuration.get("keep_checkpoints_num"),
        )

This is where I launch the trainer with the hyperparameters specified in a config file, but this is not where it fails. Apparently the problem is somewhere else:

2023-01-25 08:07:35,789 ERROR trial_runner.py:1088 – Trial Trainer_f1786_00000: Error processing event.
ray.tune.error._TuneNoNextExecutorEventError: Traceback (most recent call last):
  File "/home/miniconda3/myenv/lib/python3.10/site-packages/ray/tune/execution/ray_trial_executor.py", line 1070, in get_next_executor_event
    future_result = ray.get(ready_future)
  File "/home/miniconda3/envs/myenv/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/miniconda3/envs/myenv/lib/python3.10/site-packages/ray/_private/worker.py", line 2311, in get
    raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
class_name: Trainer
actor_id: 51e0cd6bce3feb86454f0eb601000000
namespace: 2db8ac6c-e458-411d-8c33-9c756a80914a