Basic examples not working KeyError: 'getpwuid(): uid not found: 1234'

magic-dlg · April 20, 2022, 4:20pm

Hi!

I had a similar problem before with everything else (see my previous post), which was fixed by the nightly build.

So now it is Ray Tune’s turn: following the getting-started examples to test on the cluster gives me multiple errors. Perhaps I am missing something here.

Thus for this basic code:

from ray import tune

# 1. Define an objective function.
def objective(config):
    score = config["a"] ** 2 + config["b"]
    return {"score": score}


# 2. Define a search space.
search_space = {
    "a": tune.grid_search([0.001, 0.01, 0.1, 1.0]),
    "b": tune.choice([1, 2, 3]),
}

# 3. Start a Tune run and print the best result.
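# local_dir was not defined in the original snippet; the path below is a placeholder.
local_dir = "~/ray_results"
analysis = tune.run(objective, config=search_space,
                    local_dir=local_dir)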
print(analysis.get_best_config(metric="score", mode="min"))

It starts, runs one objective, and then I get this error (the uid is the one from the machine I execute the code from):

RayTaskError(TuneError): ray::run() (pid=15453, ip=10.99.11.75)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/tune/trial_runner.py", line 877, in _on_training_result
    self._process_trial_results(trial, result)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/tune/trial_runner.py", line 961, in _process_trial_results
    decision = self._process_trial_result(trial, result)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/tune/trial_runner.py", line 1016, in _process_trial_result
    self._callbacks.on_trial_result(
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/tune/callback.py", line 268, in on_trial_result
    callback.on_trial_result(**info)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/tune/syncer.py", line 577, in on_trial_result
    trial_syncer.sync_down_if_needed()
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/tune/syncer.py", line 352, in sync_down_if_needed
    return super(NodeSyncer, self).sync_down_if_needed(SYNC_PERIOD, exclude=exclude)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/tune/syncer.py", line 237, in sync_down_if_needed
    self.sync_down(exclude)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/tune/syncer.py", line 374, in sync_down
    logger.debug("Syncing from %s to %s", self._remote_path, self._local_dir)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/tune/syncer.py", line 379, in _remote_path
    ssh_user = get_ssh_user()
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/tune/cluster_info.py", line 19, in get_ssh_user
    return getpass.getuser()
  File "/home/ray/anaconda3/lib/python3.8/getpass.py", line 169, in getuser
    return pwd.getpwuid(os.getuid())[0]
KeyError: 'getpwuid(): uid not found: 12574'

During handling of the above exception, another exception occurred:

Many thanks for help

xwjiang2010 · April 20, 2022, 7:56pm

Hello!
How did you launch the cluster? How many nodes are there? It seems the error happens when a file is synced from a worker node to the head node. I just want to reproduce your exact setup.

Currently I don’t see the same issue as you reported.

magic-dlg · April 21, 2022, 8:29am

Hi,

I use the Domino Data Lab platform, where Ray is built into the ecosystem. I used 4 machines (1 head + 3 workers), each with 4 cores and 15 GB of RAM. As a user I don’t have access to the settings of the Ray cluster; only a few parameters are exposed through the Domino GUI.

Not sure if that helps.
Btw, I use the Ray nightly build (2.0.0.dev0 for py3.8).

Could it be that the way the cluster head and worker nodes are set up somehow breaks the communication?

xwjiang2010 · April 22, 2022, 3:12pm

Hello!
What do you get when you do “whoami”?
You may find this helpful: getpass — Portable password input — Python 3.10.4 documentation
You can try hardcoding one of the environment variables mentioned there.
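As a rough illustration of the environment-variable suggestion: `getpass.getuser()` checks `LOGNAME`, `USER`, `LNAME` and `USERNAME` in that order and only falls back to the password database (the `pwd.getpwuid` call seen in the traceback) when none of them is set. The sketch below uses a hypothetical helper, `getuser_with_override`, and the user name "ray" is only an assumed value. Where the variable would have to be set in a Domino-managed cluster is not established in this thread.

```python
import getpass
import os


def getuser_with_override(name):
    """Return getpass.getuser() while LOGNAME is temporarily set to `name`.

    Hypothetical helper for illustration only; it restores the previous
    value of LOGNAME before returning.
    """
    previous = os.environ.get("LOGNAME")
    os.environ["LOGNAME"] = name
    try:
        # With LOGNAME set, getuser() returns it without consulting pwd.
        return getpass.getuser()
    finally:
        if previous is None:
            del os.environ["LOGNAME"]
        else:
            os.environ["LOGNAME"] = previous


if __name__ == "__main__":
    # "ray" is an assumed user name, not one confirmed in the thread.
    print(getuser_with_override("ray"))
```

The override only affects the process in which the variable is set, so it would need to be in place wherever Tune's syncing code actually runs.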

magic-dlg · April 22, 2022, 4:19pm

I’m not sure I follow you now. whoami gives me ubuntu, as I run everything on Ubuntu. The pw_uid is exactly the same as on the machine I schedule jobs from (not the cluster).

xwjiang2010 · April 24, 2022, 9:05pm

I feel like this is related to the Ray cluster setup.
You mentioned there is a machine that you schedule jobs from (not the cluster). Are you running your job on the head node or using Ray client mode?
Could you go to the head node and print out the result of getpwuid as well?
The error happens when the head node is trying to pull files from worker nodes. Is 12574 meaningful on the head node?
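A minimal sketch of the check asked for above, assuming a POSIX machine: it reports the current uid and whether the password database has an entry for it. The function name `describe_uid` is hypothetical, and 12574 is simply the uid from the traceback.

```python
import os
import pwd  # POSIX only


def describe_uid(uid=None):
    """Report whether `uid` (default: this process's uid) has a passwd entry."""
    if uid is None:
        uid = os.getuid()
    try:
        name = pwd.getpwuid(uid).pw_name
    except KeyError:
        # Same failure mode as in the traceback: no passwd entry for this uid.
        name = None
    return {"uid": uid, "name": name, "known": name is not None}


if __name__ == "__main__":
    print(describe_uid())       # the user running this process
    print(describe_uid(12574))  # the uid reported in the error
```

Without shell access to the cluster, one option might be to submit the function as a task, for example `ray.get(ray.remote(describe_uid).remote())`. That runs on whichever node Ray schedules it on, not necessarily the head node.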

magic-dlg · April 25, 2022, 2:17pm

It is Ray client mode; I used the client connection. I figured out that ray.init('ray://address_to_my_cluster:1234') has worked for a few versions now (instead of ray.util.connect(...)).

Anyway, the uid was from the local machine trying to run the script in “client mode”. I don’t have the ability (unless there is a trick I don’t know) to find the uid of the machines in the cluster, since we have it running via the Domino Data Lab platform. As a platform user, I don’t have access to anything apart from the Ray frontend and my workspace where I write the code. Plus, mind that I work in a corporate environment where almost everything is locked down, to make it safer aaaand our life more difficult.

The error happens when I try to run the analysis with tune.run(...); I don’t attempt to pull anything in or out myself.

xwjiang2010 · April 25, 2022, 2:54pm

Ok.
To mitigate the issue and to unblock you, you could just do tune.run(..., sync_config=SyncConfig(syncer=None), ...). This will disable the syncing between worker and head nodes.
Basically, the syncing mechanism is there to provide fault tolerance. If a worker node crashes, a trial can be recovered as long as its checkpoint file exists on the head node.
Disabling this option means that no fault tolerance is in place.

As for the getpwuid issue itself, it sounds like 12574 comes from the Ray client outside of the cluster. This is definitely not the correct behavior. What is expected is the uid of the head node inside the cluster. Is this a new script that you are running? Is this something you only recently observed on 1.12? How about previous versions of Ray?

Also, it may be worthwhile to check with the Domino platform team about their cluster setup.
