Basic examples not working KeyError: 'getpwuid(): uid not found: 1234'

Hi! :slight_smile:

I had a similar problem before with everything else (see my previous post), which was fixed by the nightly build.

Now it is Ray Tune's turn: the getting-started examples give me multiple errors when I test them on the cluster. Perhaps I am missing something here…

Thus for this basic code:

from ray import tune

# 1. Define an objective function.
def objective(config):
    score = config["a"] ** 2 + config["b"]
    return {"score": score}


# 2. Define a search space.
search_space = {
    "a": tune.grid_search([0.001, 0.01, 0.1, 1.0]),
    "b": tune.choice([1, 2, 3]),
}

# 3. Start a Tune run and print the best result.
# Note: local_dir must be defined before this call; its definition is not shown here.
analysis = tune.run(objective, config=search_space, local_dir=local_dir)
print(analysis.get_best_config(metric="score", mode="min"))

It starts, runs one objective, and then I get this error (the uid is the one from the machine I execute the code from):

RayTaskError(TuneError): ray::run() (pid=15453, ip=10.99.11.75)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/tune/trial_runner.py", line 877, in _on_training_result
    self._process_trial_results(trial, result)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/tune/trial_runner.py", line 961, in _process_trial_results
    decision = self._process_trial_result(trial, result)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/tune/trial_runner.py", line 1016, in _process_trial_result
    self._callbacks.on_trial_result(
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/tune/callback.py", line 268, in on_trial_result
    callback.on_trial_result(**info)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/tune/syncer.py", line 577, in on_trial_result
    trial_syncer.sync_down_if_needed()
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/tune/syncer.py", line 352, in sync_down_if_needed
    return super(NodeSyncer, self).sync_down_if_needed(SYNC_PERIOD, exclude=exclude)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/tune/syncer.py", line 237, in sync_down_if_needed
    self.sync_down(exclude)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/tune/syncer.py", line 374, in sync_down
    logger.debug("Syncing from %s to %s", self._remote_path, self._local_dir)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/tune/syncer.py", line 379, in _remote_path
    ssh_user = get_ssh_user()
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/tune/cluster_info.py", line 19, in get_ssh_user
    return getpass.getuser()
  File "/home/ray/anaconda3/lib/python3.8/getpass.py", line 169, in getuser
    return pwd.getpwuid(os.getuid())[0]
KeyError: 'getpwuid(): uid not found: 12574'

During handling of the above exception, another exception occurred:

Many thanks for help :wink:

Hello!
How did you launch the cluster? How many nodes are there? It seems the error happens when files are synced from a worker node to the head node. I just want to reproduce your exact setup.

Currently I don’t see the issue you reported.

Hi,

I use the Domino Data Lab platform, where Ray is built into the ecosystem. I used 4 machines (1 head + 3 workers), each with 4 cores and 15 GB of RAM. As a user I don’t have access to the Ray cluster settings; only a few parameters are exposed through the Domino GUI.

Not sure if that helps.
By the way, I use the Ray nightly build (2.0.0.dev0 for py3.8).

Could it be that the way the cluster head and worker nodes are set up somehow breaks the communication between them?

Hello!
What do you get when you do “whoami”?
You may find this helpful: getpass — Portable password input — Python 3.10.4 documentation
You can try hardcoding some of the environment variables mentioned there.
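A minimal sketch of that suggestion, using only the standard library. `getpass.getuser()` looks at the environment variables `LOGNAME`, `USER`, `LNAME` and `USERNAME` first and only falls back to `pwd.getpwuid(os.getuid())` (the call that fails in the traceback above) when none of them is set. The user name `"ray"` is just a placeholder for illustration, not something taken from your setup.

```python
import getpass
import os

# getpass.getuser() checks LOGNAME, USER, LNAME and USERNAME (in that order)
# and only queries the password database (pwd.getpwuid) if none of them is set.
# "ray" is an assumed placeholder user name.
os.environ["LOGNAME"] = "ray"

# With LOGNAME set, the lookup never reaches pwd.getpwuid().
print(getpass.getuser())  # prints "ray"
```

Note that the variable has to be set in the environment of the process that runs the syncer (the traceback shows it running on the cluster side as `ray::run()`), not only on the client. Passing it through `ray.init(..., runtime_env={"env_vars": {"LOGNAME": "ray"}})` might be one way to do that, but I have not verified it on your setup.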

Not sure I follow you now. whoami returns ubuntu, as I run everything on Ubuntu. The pw_uid is exactly the same as the uid in the error, and it belongs to the machine I schedule jobs from (not the cluster).

I feel like this is related to the Ray cluster setup.
You mentioned there is a machine that you schedule jobs from (not the cluster). Are you running your job on the head node or using Ray client mode?
Could you go to the head node and print out the result of getpwuid as well?
The error happens when the head node is trying to pull files from worker nodes. Is 12574 meaningful on head?
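Something like the following could be used for that check. It is only a sketch: `describe_uid` is a hypothetical helper name, and it relies on the Unix-only `pwd` module. It reports the uid of the current process and whether the password database has an entry for it.

```python
import os
import pwd


def describe_uid(uid=None):
    """Return the uid and the matching user name, or None if there is no entry."""
    if uid is None:
        uid = os.getuid()
    try:
        user = pwd.getpwuid(uid).pw_name
    except KeyError:
        # Same condition that surfaces as "getpwuid(): uid not found" in the traceback.
        user = None
    return {"uid": uid, "user": user}


print(describe_uid())

# Assumed usage if you cannot log in to the nodes: once connected to the cluster,
# run the helper as a remote task so that it executes on a cluster node, e.g.
#   import ray
#   print(ray.get(ray.remote(describe_uid).remote()))
```

If `user` comes back as `None` on a cluster node, the process there runs under a uid that has no passwd entry, which would explain the `KeyError`.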

It is Ray client mode; I did use the client connection. I figured out that ray.init('ray://address_to_my_cluster:1234') has worked for a few versions now (instead of ray.util.connect(...)).

Anyway, the uid is from the local machine trying to run the script in “client mode”. I don't have the ability (unless there is a trick I don't know) to find the uid of the machines in the cluster, since it runs via the Domino Data Lab platform. As a platform user I don’t have access to anything apart from the Ray frontend and my workspace where I write the code. Also keep in mind that I work in a corporate environment where almost everything is locked down, to make it safer :wink: aaaand our life more difficult :wink:

The error happens when I try to run the analysis, tune.run(...); I don't attempt to pull anything in or out.

Ok.
To mitigate the issue and unblock you, you could just do tune.run(..., sync_config=tune.SyncConfig(syncer=None), ...). This disables the syncing between worker and head nodes.
Basically, the syncing mechanism is there to provide fault tolerance: if a worker node crashes, a trial can be recovered as long as its checkpoint file exists on the head node.
Disabling this option means that no fault tolerance is in place.

As for the getpwuid issue itself, it sounds like 12574 comes from the Ray client outside of the cluster. This is definitely not the correct behavior; the expected value is the uid of the head node inside the cluster. Is this a new script that you are running? Is this something you only recently observed on 1.12? How about previous versions of Ray?

Also, it may be worthwhile to check with the Domino platform team about their cluster setup.