KeyError: 'stateSnapshot'

We updated Ray on our k8s cluster from 2.7.1 to 2.8.1 (tried 2.9.2 too).

Base image rayproject/ray:2.7.1-py38 → rayproject/ray:2.8.1-py38

KubeRay was updated from 1.0.0rc1 to 1.0.0.

After that, the node summary no longer shows up in the dashboard.
The endpoint returns the following response (the traceback embedded in "msg" is shown decoded below for readability):

{
    "result": false,
    "msg": "<traceback, decoded below>",
    "data": {}
}

Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/dashboard/optional_utils.py", line 224, in _update_cache
    response = task.result()
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/dashboard/modules/node/node_head.py", line 311, in get_all_nodes
    all_node_summary, nodes_logical_resources = await asyncio.gather(
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/dashboard/datacenter.py", line 173, in get_all_node_summary
    return [
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/dashboard/datacenter.py", line 174, in <listcomp>
    await DataOrganizer.get_node_info(node_id, get_summary=True)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/dashboard/datacenter.py", line 146, in get_node_info
    node_info["status"] = node["stateSnapshot"]["state"]
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/dashboard/utils.py", line 426, in __getitem__
    proxy = self._proxy[item] = make_immutable(self._dict[item])
KeyError: 'stateSnapshot'

Because of that, we can't see the cluster status.
Any hints on how to solve this, or on what could lead to such behavior?
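
For reference, this is roughly how we query the failing endpoint directly. A minimal sketch, assuming the dashboard is reachable on the default port 8265 (e.g. via kubectl port-forward) and that the node summary is served at /nodes?view=summary; adjust host, port, and route for your setup:

import requests

# Assumptions: the dashboard is reachable at localhost:8265 (e.g. via
# `kubectl port-forward svc/<head-svc> 8265:8265`) and the node summary is
# served at /nodes?view=summary -- substitute your own values if needed.
DASHBOARD = "http://localhost:8265"

resp = requests.get(f"{DASHBOARD}/nodes", params={"view": "summary"}, timeout=10)
payload = resp.json()

# A healthy cluster returns "result": true; in our case "result" is false
# and "msg" carries the KeyError traceback shown above.
print(payload["result"])
print(payload.get("msg", "")[:300])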

I'd be glad to provide additional info, though I'm not sure what else would be useful at this point.

Thanks in advance!

How severely does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty in completing my task, but I can work around it.

@rickyyx do you have ideas?


Also, I don't know whether this is helpful or not, but another thing we noticed is that our session ID has become constant, even though the documentation says it should be regenerated when the cluster restarts:

session_2023-10-23_04-54-52_321964_21
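
In case it helps others debug the same thing, this is roughly how we checked it on the head pod. A small sketch, assuming the default temp dir /tmp/ray (the path differs if --temp-dir is customized):

import os

# Assumption: Ray keeps one directory per session under /tmp/ray
# (session_<timestamp>_<pid>) plus a session_latest symlink.
SESSION_ROOT = "/tmp/ray"

sessions = sorted(d for d in os.listdir(SESSION_ROOT) if d.startswith("session_2"))
print("session dirs:", sessions)
print("session_latest ->", os.readlink(os.path.join(SESSION_ROOT, "session_latest")))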

If anyone else struggles with this issue, there is a solution:

Try completely wiping/recreating the Redis instance. There is a chance that some leftover state from the old cluster is still stored there.
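
For us that meant clearing the external Redis that GCS fault tolerance points at, with the Ray cluster stopped first. A minimal sketch using redis-py, where the host, port, and password are placeholders for your own connection details; deleting the Redis pod together with its persistent volume achieves the same thing:

import redis

# Assumptions: this is the external Redis used for Ray GCS fault tolerance;
# host, port, and password are placeholders -- substitute your own values.
# WARNING: flushall() wipes every key in the instance. Only run this while
# the Ray cluster is down and nothing else shares this Redis.
client = redis.Redis(host="redis.ray.svc.cluster.local", port=6379, password=None)

client.flushall()
print("keys remaining:", client.dbsize())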