Ray start --head Unable to connect to GCS

david.waterworth · March 31, 2022, 12:19am

How severe does this issue affect your experience of using Ray?

Medium: It contributes to significant difficulty to complete my task, but I can work around it.

I’m working through the ray[serve] tutorials but I’m unable to start and run them detached because ray start --head fails.

If I use method 1 from https://docs.ray.io/en/master/serve/deployment.html#deploying-on-a-single-node (i.e. call serve.start() and then loop to prevent the script from exiting), the examples work. However, I wanted to experiment with method 2, starting the head node from the console, and every time I run

$ ray start --head

I get the error

Local node IP: 10.0.102.40
2022-03-31 11:03:14,134	WARNING node.py:1501 -- Unable to connect to GCS at 10.0.102.40:6379. Check that (1) Ray GCS with matching version started successfully at the specified address, and (2) there is no firewall setting preventing access.

Via lsof -i :6379 I’ve confirmed the port isn’t in use.

The contents of gcs_server.out are below. Can anyone point me in the right direction? For now I’m just trying to start a single-node cluster.

[2022-03-31 11:03:09,119 I 11416 11416] (gcs_server) io_service_pool.cc:35: IOServicePool is running with 1 io_service.
[2022-03-31 11:03:09,119 I 11416 11416] (gcs_server) gcs_server.cc:51: GCS storage type is memory
[2022-03-31 11:03:09,119 I 11416 11416] (gcs_server) gcs_server.cc:52: gRPC based pubsub is enabled
[2022-03-31 11:03:09,120 I 11416 11416] (gcs_server) gcs_init_data.cc:42: Loading job table data.
[2022-03-31 11:03:09,120 I 11416 11416] (gcs_server) gcs_init_data.cc:54: Loading node table data.
[2022-03-31 11:03:09,120 I 11416 11416] (gcs_server) gcs_init_data.cc:66: Loading cluster resources table data.
[2022-03-31 11:03:09,120 I 11416 11416] (gcs_server) gcs_init_data.cc:93: Loading actor table data.
[2022-03-31 11:03:09,120 I 11416 11416] (gcs_server) gcs_init_data.cc:79: Loading placement group table data.
[2022-03-31 11:03:09,120 I 11416 11416] (gcs_server) gcs_init_data.cc:46: Finished loading job table data, size = 0
[2022-03-31 11:03:09,120 I 11416 11416] (gcs_server) gcs_init_data.cc:58: Finished loading node table data, size = 0
[2022-03-31 11:03:09,120 I 11416 11416] (gcs_server) gcs_init_data.cc:70: Finished loading cluster resources table data, size = 0
[2022-03-31 11:03:09,120 I 11416 11416] (gcs_server) gcs_init_data.cc:97: Finished loading actor table data, size = 0
[2022-03-31 11:03:09,120 I 11416 11416] (gcs_server) gcs_init_data.cc:84: Finished loading placement group table data, size = 0
[2022-03-31 11:03:09,120 I 11416 11416] (gcs_server) gcs_heartbeat_manager.cc:30: GcsHeartbeatManager start, num_heartbeats_timeout=30
[2022-03-31 11:03:09,143 C 11416 11416] (gcs_server) grpc_server.cc:93: Check failed: server_ Failed to start the grpc server. The specified port is 6379. This means that Ray’s core components will not be able to function correctly. If the server startup error message is Address already in use, it indicates the server fails to start because the port is already used by other processes (such as --node-manager-port, --object-manager-port, --gcs-server-port, and ports between --min-worker-port, --max-worker-port). Try running lsof -i :6379 to check if there are other processes listening to the port.
*** StackTrace Information ***
ray::SpdLogMessage::Flush()
ray::RayLog::~RayLog()
ray::rpc::GrpcServer::Run()
ray::gcs::GcsServer::DoStart()
std::_Function_handler<>::_M_invoke()
EventTracker::RecordExecution()
std::_Function_handler<>::_M_invoke()
boost::asio::detail::completion_handler<>::do_complete()
boost::asio::detail::scheduler::do_run_one()
boost::asio::detail::scheduler::run()
boost::asio::io_context::run()
main
__libc_start_main

zyc-bit · May 7, 2022, 3:57pm

I have the same problem: I cannot run ray start --head. It is also unable to connect to GCS.

Mingwei · May 7, 2022, 8:42pm

@zyc-bit, can you specify the Ray and Python version being used, and post the content of gcs_server.out?

@david.waterworth sorry for missing the issue earlier! From the log message, the issue is most likely related to registering port 6379, but as you confirmed, no process is currently using the port … Can you check whether starting another service, e.g. Redis, on the same port (6379) works?

david.waterworth · May 8, 2022, 10:29pm

@Mingwei thanks - my question was answered here - there actually was something using the port.

zyc-bit · May 9, 2022, 5:09am

@Mingwei, thank you very much for your reply.
When I run ray start --head, I get an error:

$ ray start --head
Usage stats collection will be enabled by default in the next release. See https://github.com/ray-project/ray/issues/20857 for more details.
Local node IP: 10.140.0.32
2022-05-08 22:32:27,173 WARNING utils.py:1249 -- Unable to connect to GCS at 10.140.0.32:6379. Check that (1) Ray GCS with matching version started successfully at the specified address, and (2) there is no firewall setting preventing access.
2022-05-08 22:32:27,173 WARNING utils.py:1249 -- Unable to connect to GCS at 10.140.0.32:6379. Check that (1) Ray GCS with matching version started successfully at the specified address, and (2) there is no firewall setting preventing access.

Sorry for not including gcs_server.out. I do not know how to get its contents.
Looking forward to your reply!

Mingwei · May 9, 2022, 5:24am

Hi @zyc-bit, gcs_server.out is most likely at /tmp/ray/session_latest/logs/gcs_server.out on Linux and macOS. Btw, can you also try lsof -i :6379 to see if there is a port conflict?
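
For anyone else hunting through this log, here is a minimal Python sketch that prints only the severe lines from gcs_server.out. It assumes the path quoted above and the line prefix seen in the logs in this thread (timestamp, a one-letter severity, then two numeric IDs). The function name severe_lines and the choice of E, C and F as "severe" letters are assumptions made for illustration, not part of Ray.

```python
import re
from pathlib import Path

# Path quoted in this thread for Linux and macOS; adjust if your setup differs.
GCS_LOG = Path("/tmp/ray/session_latest/logs/gcs_server.out")

# Matches the prefix seen in the logs in this thread, e.g.
# "[2022-03-31 11:03:09,143 C 11416 11416] (gcs_server) ..."
LOG_PREFIX = re.compile(
    r"^\[\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} ([A-Z]) \d+ \d+\]"
)


def severe_lines(log_text, levels=("E", "C", "F")):
    """Return the log lines whose severity letter is in `levels`.

    Assumption: E, C and F mark error-like entries; the thread only
    shows I (info) and C (the failed check).
    """
    hits = []
    for line in log_text.splitlines():
        match = LOG_PREFIX.match(line)
        if match and match.group(1) in levels:
            hits.append(line)
    return hits


if __name__ == "__main__":
    if GCS_LOG.exists():
        for line in severe_lines(GCS_LOG.read_text(errors="replace")):
            print(line)
    else:
        print(f"No log found at {GCS_LOG}")
```

Stack trace lines carry no prefix, so they are skipped; the "Check failed" line is usually enough to see which port the server could not start on.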

zyc-bit · May 9, 2022, 5:27am

I ran lsof -i :6379, and it returned nothing.
My Ray version is 1.12.0 and my Python version is 3.7.

zyc-bit · May 9, 2022, 5:33am

Hi @Mingwei, my gcs_server.out is shown below:

[2022-05-08 22:32:21,865 I 129461 129461] (gcs_server) io_service_pool.cc:35: IOServicePool is running with 1 io_service.
[2022-05-08 22:32:21,866 I 129461 129461] (gcs_server) gcs_server.cc:55: GCS storage type is memory
[2022-05-08 22:32:21,866 I 129461 129461] (gcs_server) gcs_server.cc:56: gRPC based pubsub is enabled
[2022-05-08 22:32:21,866 I 129461 129461] (gcs_server) gcs_init_data.cc:42: Loading job table data.
[2022-05-08 22:32:21,866 I 129461 129461] (gcs_server) gcs_init_data.cc:54: Loading node table data.
[2022-05-08 22:32:21,866 I 129461 129461] (gcs_server) gcs_init_data.cc:66: Loading cluster resources table data.
[2022-05-08 22:32:21,866 I 129461 129461] (gcs_server) gcs_init_data.cc:93: Loading actor table data.
[2022-05-08 22:32:21,866 I 129461 129461] (gcs_server) gcs_init_data.cc:79: Loading placement group table data.
[2022-05-08 22:32:21,866 I 129461 129461] (gcs_server) gcs_init_data.cc:46: Finished loading job table data, size = 0
[2022-05-08 22:32:21,866 I 129461 129461] (gcs_server) gcs_init_data.cc:58: Finished loading node table data, size = 0
[2022-05-08 22:32:21,866 I 129461 129461] (gcs_server) gcs_init_data.cc:70: Finished loading cluster resources table data, size = 0
[2022-05-08 22:32:21,866 I 129461 129461] (gcs_server) gcs_init_data.cc:97: Finished loading actor table data, size = 0
[2022-05-08 22:32:21,866 I 129461 129461] (gcs_server) gcs_init_data.cc:84: Finished loading placement group table data, size = 0
[2022-05-08 22:32:21,867 I 129461 129461] (gcs_server) gcs_heartbeat_manager.cc:31: GcsHeartbeatManager start, num_heartbeats_timeout=30
[2022-05-08 22:32:21,892 C 129461 129461] (gcs_server) grpc_server.cc:95:  Check failed: server_ Failed to start the grpc server. The specified port is 6379. This means that Ray's core components will not be able to function correctly. If the server startup error message is `Address already in use`, it indicates the server fails to start because the port is already used by other processes (such as --node-manager-port, --object-manager-port, --gcs-server-port, and ports between --min-worker-port, --max-worker-port). Try running lsof -i :6379 to check if there are other processes listening to the port.
*** StackTrace Information ***
    ray::SpdLogMessage::Flush()
    ray::RayLog::~RayLog()
    ray::rpc::GrpcServer::Run()
    ray::gcs::GcsServer::DoStart()
    ray::gcs::GcsTable<>::GetAll()::{lambda()#1}::operator()()
    EventTracker::RecordExecution()
    std::_Function_handler<>::_M_invoke()
    boost::asio::detail::completion_handler<>::do_complete()
    boost::asio::detail::scheduler::do_run_one()
    boost::asio::detail::scheduler::run()
    boost::asio::io_context::run()
    main
    __libc_start_main

Thank you very much for taking the time to answer my questions.

zyc-bit · May 9, 2022, 7:06am

Hi @Mingwei, I succeeded. The command lsof -i :6379 always returned nothing, so I thought there was no port conflict. In fact, my colleague told me he had just stopped his Ray process, and after that mine worked. (I still do not understand why lsof -i :6379 returned nothing before.) Thank you very much.

Mingwei · May 9, 2022, 6:49pm

Great that you found the issue. It seems that only sudo lsof -i :6379 shows processes owned by other users. We can update the log messages and documentation in Ray.
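
Since lsof without sudo can miss listeners owned by other users, one alternative is to try binding the port directly. The sketch below is only an illustration: can_bind is a hypothetical helper, not a Ray API. It reports whether a plain TCP socket can be bound to the port right now. A failed bind usually means another process holds the port, though it can also fail for other reasons (for example, insufficient permissions on low-numbered ports).

```python
import socket


def can_bind(port, host="0.0.0.0"):
    """Return True if a TCP socket can be bound to (host, port) right now.

    Hypothetical helper for illustration. Unlike an unprivileged lsof,
    the bind attempt fails no matter which user owns the conflicting
    listener.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
        return True


if __name__ == "__main__":
    port = 6379  # the default port reported in the logs in this thread
    state = "free" if can_bind(port) else "not available"
    print(f"Port {port} looks {state}")
```

The socket is closed as soon as the check finishes, so running this does not itself block ray start --head.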

xcy · June 19, 2022, 10:36pm

Hi! I am now facing the same problem, and port 6739 cannot be used.

Could you give me some advice about this problem? Thank you very much!

zyc-bit · June 21, 2022, 3:38am

Hi @xcy, are you using your PC instead of a cluster? Have you ever started Ray successfully before?
You can run ray stop first to make sure there is no other Ray process running.

xcy · June 21, 2022, 4:15am

Thank you for your reply. I don’t think I started Ray myself. This is code from my professor, and all I am supposed to do is run the training code with two parameters: one is gpus and the other is port. But after I did this, a warning message like the one above appeared, so I don’t know how to solve it.

zyc-bit · June 21, 2022, 5:02am

I understand. First, make sure the code your professor provided is correct, i.e. that it has run successfully on another machine before. Second, you need to start Ray by running ray start --head in your terminal. Try that to see whether Ray starts successfully.
