"ModuleNotFoundError: No module named in" when connecting in client mode

Ray v2.0.0dev, Python 3.8, Windows 10. I’m trying to run the following example in client mode (launched from a unittest test):
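A minimal sketch of the kind of client-mode run involved (the connect address, config values, and tune.run arguments are assumptions on my part; only tests/tune_mnist_keras.py and its train_mnist function are taken from the traceback further down):

```python
import ray
from ray import tune

# Assumption: the test module lives at tests/tune_mnist_keras.py and exposes a
# train_mnist(config) function (the second traceback below points at that file).
from tests.tune_mnist_keras import train_mnist

# Ray Client connection; the address/port are placeholders, not from the post.
ray.util.connect("127.0.0.1:10001")

tune.run(
    train_mnist,
    config={"lr": tune.loguniform(1e-4, 1e-1)},  # hyperparameters are illustrative
    resources_per_trial={"cpu": 1, "gpu": 1},
    num_samples=1,
)
```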

This fails with the following error; full stack trace:

self = <ray.util.client.worker.Worker object at 0x000001E9731433D0>
task = name: "run"
payload_id: "\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\001\000\000.…ls": 0, "max_retries": 3, "resources": null, "accelerator_type": null, "num_returns": 1, "memory": null}"
}

def _call_schedule_for_task(
        self, task: ray_client_pb2.ClientTask) -> List[bytes]:
    logger.debug("Scheduling %s" % task)
    task.client_id = self._client_id
    try:
        ticket = self.server.Schedule(task, metadata=self.metadata)
    except grpc.RpcError as e:
        raise decode_exception(e.details)
    if not ticket.valid:
        try:
          raise cloudpickle.loads(ticket.error)

E ModuleNotFoundError: No module named 'tests.tune_mnist_keras'

C:\Users\dm57337\.conda\envs\py38tf\lib\site-packages\ray\util\client\worker.py:305: ModuleNotFoundError

When running with local_mode=True (ray.init(local_mode=True, num_cpus=1, num_gpus=1, include_dashboard=False)), everything works fine.
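The working local-mode call, written out; the comment on why it works is my reading of Ray's local mode, not something stated in the post:

```python
import ray

# Local mode: tasks and actors run serially inside the driver process itself,
# so any module importable from the driver (e.g. tests.tune_mnist_keras) is
# also importable where the trainable executes.
ray.init(local_mode=True, num_cpus=1, num_gpus=1, include_dashboard=False)
```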

Also, when connecting to the cluster (spun up on the local machine) the "normal" way (ray.init(address="auto", _redis_password=ray_constants.REDIS_DEFAULT_PASSWORD)), I get a different error:

2021-03-18 17:18:22,406 ERROR trial_runner.py:727 -- Trial train_mnist_20a57_00000: Error processing event.
Result for train_mnist_20a57_00000:
Traceback (most recent call last):
{}
  File "C:\Users\dm57337\.conda\envs\py38tf\lib\site-packages\ray\tune\trial_runner.py", line 697, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "C:\Users\dm57337\.conda\envs\py38tf\lib\site-packages\ray\tune\ray_trial_executor.py", line 678, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "C:\Users\dm57337\.conda\envs\py38tf\lib\site-packages\ray\_private\client_mode_hook.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "C:\Users\dm57337\.conda\envs\py38tf\lib\site-packages\ray\worker.py", line 1440, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(TuneError): ray::ImplicitFunc.train_buffered() (pid=7096, ip=10.240.194.92)
  File "python\ray\_raylet.pyx", line 501, in ray._raylet.execute_task
  File "python\ray\_raylet.pyx", line 444, in ray._raylet.execute_task.function_executor
  File "c:\users\dm57337\.conda\envs\py38tf\lib\site-packages\ray\_private\function_manager.py", line 556, in actor_method_executor
    return method(__ray_actor, *args, **kwargs)
  File "c:\users\dm57337\.conda\envs\py38tf\lib\site-packages\ray\tune\trainable.py", line 173, in train_buffered
    result = self.train()
  File "c:\users\dm57337\.conda\envs\py38tf\lib\site-packages\ray\tune\trainable.py", line 232, in train
    result = self.step()
  File "c:\users\dm57337\.conda\envs\py38tf\lib\site-packages\ray\tune\function_runner.py", line 366, in step
    self._report_thread_runner_error(block=True)
  File "c:\users\dm57337\.conda\envs\py38tf\lib\site-packages\ray\tune\function_runner.py", line 512, in _report_thread_runner_error
    raise TuneError(
ray.tune.error.TuneError: Trial raised an exception. Traceback:
ray::ImplicitFunc.train_buffered() (pid=7096, ip=10.240.194.92)
  File "c:\users\dm57337\.conda\envs\py38tf\lib\site-packages\ray\tune\function_runner.py", line 248, in run
    self._entrypoint()
  File "c:\users\dm57337\.conda\envs\py38tf\lib\site-packages\ray\tune\function_runner.py", line 315, in entrypoint
    return self._trainable_func(self.config, self._status_reporter,
  File "C:\Users\dm57337\.conda\envs\py38tf\lib\site-packages\ray\tune\function_runner.py", line 580, in _trainable_func
    output = fn()
  File "C:\dev\ray-cluster\tests\tune_mnist_keras.py", line 38, in train_mnist
    model.fit(
  File "c:\users\dm57337\.conda\envs\py38tf\lib\site-packages\tensorflow\python\keras\engine\training.py", line 1100, in fit
    tmp_logs = self.train_function(iterator)
  File "c:\users\dm57337\.conda\envs\py38tf\lib\site-packages\tensorflow\python\eager\def_function.py", line 828, in __call__
    result = self._call(*args, **kwds)
  File "c:\users\dm57337\.conda\envs\py38tf\lib\site-packages\tensorflow\python\eager\def_function.py", line 888, in _call
    return self._stateless_fn(*args, **kwds)
  File "c:\users\dm57337\.conda\envs\py38tf\lib\site-packages\tensorflow\python\eager\function.py", line 2942, in __call__
    return graph_function._call_flat(
  File "c:\users\dm57337\.conda\envs\py38tf\lib\site-packages\tensorflow\python\eager\function.py", line 1918, in _call_flat
    return self._build_call_outputs(self._inference_function.call(
  File "c:\users\dm57337\.conda\envs\py38tf\lib\site-packages\tensorflow\python\eager\function.py", line 555, in call
    outputs = execute.execute(
  File "c:\users\dm57337\.conda\envs\py38tf\lib\site-packages\tensorflow\python\eager\execute.py", line 59, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InternalError: Blas GEMM launch failed : a.shape=(128, 784), b.shape=(784, 479), m=128, n=479, k=784
  [[node sequential/dense/MatMul (defined at C:\dev\ray-cluster\tests\tune_mnist_keras.py:38) ]] [Op:__inference_train_function_506]

Function call stack:
train_function

Any ideas?

CC @ericl, any thoughts?

@ericl A friendly reminder re this issue - got any ideas?
I’m still stuck with this…

I got the same issue when using Kubernetes on Ubuntu 18.04.
My app works fine outside of Kubernetes, but as soon as I run the Ray cluster on Kubernetes I hit this error too.
Any ideas or suggestions?