Is this a bug in Ray? "super(type, obj): obj must be an instance ..." occurs only in a container on the cluster, not on the local machine

Hi everyone,

I have been struggling for quite a while with the following simplified code: it runs fine, without any errors, on my local machine, but it fails inside a container on a headless cluster managed with Slurm. I am using Ray 1.9.2, Python 3.8.12, and torch 1.11.

import ray
from torch.utils.data import Dataset

class MainDataset(Dataset):
    def __init__(self):
        super().__init__()
        self.num_frames = 1
 
    def __len__(self):
        return 0
        
    def __getitem__(self, idx):
        return 0

@ray.remote
class RemoteMainDataset(MainDataset):
    def __init__(self):
        super().__init__()

    def get_num_frames(self):
        return self.num_frames
    
if __name__ == '__main__':
    ray.init(logging_level=30, local_mode=False, log_to_driver=False)
    dataset = RemoteMainDataset.remote()
    total_frames = ray.get(dataset.get_num_frames.remote())
    print(total_frames)

This implementation prints 1 on my local machine, but inside an enroot (or Docker) container on the cluster it produces the following error:

2022-02-09 10:44:40,790	WARNING utils.py:534 -- Detecting docker specified CPUs. In previous versions of Ray, CPU detection in containers was incorrect. Please ensure that Ray has enough CPUs allocated. As a temporary workaround to revert to the prior behavior, set `RAY_USE_MULTIPROCESSING_CPU_COUNT=1` as an env var before starting Ray. Set the env var: `RAY_DISABLE_DOCKER_CPU_WARNING=1` to mute this warning.
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/netscratch/toosi/world_on_rails/rails/ray_example.py", line 26, in <module>
    total_frames = ray.get(dataset.get_num_frames.remote())
  File "/opt/conda/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/ray/worker.py", line 1715, in get
    raise value
ray.exceptions.RayActorError: The actor died because of an error raised in its creation task, ray::RemoteMainDataset.__init__() (pid=3217031, ip=192.168.33.210)
  File "/netscratch/toosi/world_on_rails/rails/ray_example.py", line 18, in __init__
    super().__init__()
TypeError: super(type, obj): obj must be an instance or subtype of type
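
My current suspicion (and it is only a suspicion) is that the actor class gets re-created at some point, for example when cloudpickle serializes it by value, while the hidden __class__ cell that zero-argument super() relies on still points at the original class object. The same TypeError can be triggered without Ray by rebuilding a class that way; the names A and B below are purely illustrative:

import types

class A:
    def __init__(self):
        # Zero-argument super() is compiled with a hidden __class__ cell
        # that refers to the class object A defined right here.
        super().__init__()

# Re-create a class with the same name and reuse A's __init__, roughly
# what happens when a class is rebuilt from a by-value pickle on a worker.
B = types.new_class("A", (object,))
B.__init__ = A.__init__

# Raises: TypeError: super(type, obj): obj must be an instance or subtype of type
B()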

Even after I change super().__init__() to super(RemoteMainDataset, self).__init__(), I get another error:

2022-02-09 10:51:43,057	WARNING utils.py:534 -- Detecting docker specified CPUs. In previous versions of Ray, CPU detection in containers was incorrect. Please ensure that Ray has enough CPUs allocated. As a temporary workaround to revert to the prior behavior, set `RAY_USE_MULTIPROCESSING_CPU_COUNT=1` as an env var before starting Ray. Set the env var: `RAY_DISABLE_DOCKER_CPU_WARNING=1` to mute this warning.
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/netscratch/toosi/world_on_rails/rails/ray_example.py", line 25, in <module>
    dataset = RemoteMainDataset.remote()
  File "/opt/conda/lib/python3.8/site-packages/ray/actor.py", line 451, in remote
    return self._remote(args=args, kwargs=kwargs)
  File "/opt/conda/lib/python3.8/site-packages/ray/util/tracing/tracing_helper.py", line 371, in _invocation_actor_class_remote_span
    return method(self, args, kwargs, *_args, **_kwargs)
  File "/opt/conda/lib/python3.8/site-packages/ray/actor.py", line 714, in _remote
    worker.function_actor_manager.export_actor_class(
  File "/opt/conda/lib/python3.8/site-packages/ray/_private/function_manager.py", line 397, in export_actor_class
    serialized_actor_class = pickle.dumps(Class)
  File "/opt/conda/lib/python3.8/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 73, in dumps
    cp.dump(obj)
  File "/opt/conda/lib/python3.8/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 620, in dump
    return Pickler.dump(self, obj)
_pickle.PicklingError: Can't pickle <functools._lru_cache_wrapper object at 0x7fc86634b550>: it's not the same object as typing.Generic.__class_getitem__

One interesting observation is that the error only occurs when I subclass torch.utils.data.Dataset. Subclassing torch.nn.Module, for instance, raises no errors and everything works fine.
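
In the meantime, wrapping the dataset inside the actor instead of subclassing it sidesteps the super() call in the actor class entirely. Below is a sketch of that composition approach; the name RemoteDatasetWrapper is just illustrative, and I am not certain it also avoids the second pickling error, since cloudpickle may still serialize MainDataset by value:

import ray
from torch.utils.data import Dataset

class MainDataset(Dataset):
    # Same dataset as above.
    def __init__(self):
        super().__init__()
        self.num_frames = 1

    def __len__(self):
        return 0

    def __getitem__(self, idx):
        return 0

@ray.remote
class RemoteDatasetWrapper:
    def __init__(self):
        # Hold a MainDataset instead of inheriting from it, so the actor
        # class itself contains no super() call.
        self.dataset = MainDataset()

    def get_num_frames(self):
        return self.dataset.num_frames

if __name__ == '__main__':
    ray.init(logging_level=30, local_mode=False, log_to_driver=False)
    wrapper = RemoteDatasetWrapper.remote()
    print(ray.get(wrapper.get_num_frames.remote()))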

It would be great if anyone who has had similar issues, or has any idea what might be going on, could help me with this problem. Thank you.

This seems to be a known issue. You can find more details here: Actors do not work properly with subclasses that call super. · Issue #449 · ray-project/ray · GitHub
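
One workaround that is sometimes suggested for this kind of problem is to avoid zero-argument super() in the actor class and call the base initializer explicitly. Here is a sketch, which I have not verified against your container setup:

@ray.remote
class RemoteMainDataset(MainDataset):
    def __init__(self):
        # Call the base class directly: the actor's __init__ then carries
        # no __class__ closure cell for zero-argument super() to trip on.
        MainDataset.__init__(self)

    def get_num_frames(self):
        return self.num_frames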