Multiple _DesignatedBlockOwner processes

Using Ray 1.9.2 on a 3-node k8s cluster. I noticed that every time a job submits a Ray task, it creates a new _DesignatedBlockOwner process, and the process does not get terminated after the job finishes. See the attached screenshot.
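For context, here is a minimal sketch of the kind of job that shows this for me (assuming a Ray Datasets workload submitted over Ray Client; the address and the dataset operations below are placeholders, not my exact code):

```python
# Minimal sketch (placeholders only): each run like this leaves behind a new
# _DesignatedBlockOwner process on the cluster, even after ray.shutdown().
import ray

ray.init(address="ray://<head-node>:10001")  # Ray Client address of the k8s cluster

ds = ray.data.range(1000)
print(ds.map(lambda x: x * 2).take(5))

ray.shutdown()  # the _DesignatedBlockOwner process is still running afterwards
```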

Is this a known issue? If not, how do I fix it? Thanks.

Looks like this is a known issue: https://github.com/ray-project/ray/issues/21999. Some fixes were recently merged to mitigate this, so you can try switching to a nightly build: Installing Ray — Ray v1.10.0

Thanks @ckw017. I will check it out and provide my feedback.

@ckw017: I tried ray-1.10.0 and still hit the same issue. I then switched to the ray-2.0.0 dev nightly and the _DesignatedBlockOwner issue is gone; however, I noticed two other issues:

  1. Four IDLE_SpillWorker processes are still running after the job completed; see the attached screenshots. Note: when I re-submit the job, it does not create additional IDLE_SpillWorker processes.
  2. When I try to load partitioned Parquet data, I get a pickling error; there is no such issue in Ray 1.9.2 and 1.10.0:
File "/usr/local/lib/python3.7/site-packages/ray/data/read_api.py", line 372, in read_parquet\n    **arrow_parquet_args,\n  File "/usr/local/lib/python3.7/site-packages/ray/data/read_api.py", line 221, in read_datasource\n    datasource, ctx, parallelism, _wrap_s3_filesystem_workaround(read_args)\n  File "/usr/local/lib/python3.7/site-packages/ray/remote_function.py", line 166, in _remote_proxy\n    return self._remote(args=args, kwargs=kwargs)\n  File "/usr/local/lib/python3.7/site-packages/ray/util/tracing/tracing_helper.py", line 303, in _invocation_remote_span\n    return method(self, args, kwargs, *_args, **_kwargs)\n  File "/usr/local/lib/python3.7/site-packages/ray/remote_function.py", line 314, in _remote\n    scheduling_strategy=scheduling_strategy,\n  File "/usr/local/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 177, in client_mode_convert_function\n    return client_func._remote(in_args, in_kwargs, **kwargs)\n  File "/usr/local/lib/python3.7/site-packages/ray/util/client/common.py", line 295, in _remote\n    return self.options(**option_args).remote(*args, **kwargs)\n  File "/usr/local/lib/python3.7/site-packages/ray/util/client/common.py", line 569, in remote\n    return return_refs(ray.call_remote(self, *args, **kwargs))\n  File "/usr/local/lib/python3.7/site-packages/ray/util/client/api.py", line 109, in call_remote\n    return self.worker.call_remote(instance, *args, **kwargs)\n  File "/usr/local/lib/python3.7/site-packages/ray/util/client/worker.py", line 482, in call_remote\n    pb_arg = convert_to_arg(arg, self._client_id)\n  File "/usr/local/lib/python3.7/site-packages/ray/util/client/client_pickler.py", line 184, in convert_to_arg\n    out.data = dumps_from_client(val, client_id)\n  File "/usr/local/lib/python3.7/site-packages/ray/util/client/client_pickler.py", line 166, in dumps_from_client\n    cp.dump(obj)\n  File "/usr/local/lib/python3.7/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 620, in dump\n    return Pickler.dump(self, obj)\n  File "stringsource", line 2, in pyarrow._dataset.HivePartitioning.__reduce_cython__\nTypeError: self.hive_partitioning,self.partitioning,self.wrapped cannot be converted to a Python object for pickling\n'

cc @sangcho: are lingering IDLE_SpillWorker processes expected here?

@mmuru can you put together a minimal reproduction script for the pickling issue?

@ckw017: Sure, here is the reproduction script: ray-dataset-partition-pickle-issue
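For quick reference, it is roughly along these lines; the bucket path, partition column, and the exact keyword plumbing for passing the partitioning are placeholders here, so please refer to the linked script for the precise call:

```python
# Rough sketch of the repro (placeholders throughout). The key ingredient is
# that a pyarrow HivePartitioning object ends up in the read arguments that
# Ray Client has to pickle, which is where the TypeError above is raised.
import pyarrow as pa
import pyarrow.dataset as pads
import ray

ray.init(address="ray://<head-node>:10001")

hive = pads.partitioning(pa.schema([("year", pa.int32())]), flavor="hive")
ds = ray.data.read_parquet(
    "s3://my-bucket/partitioned-table/",
    dataset_kwargs={"partitioning": hive},  # my guess at the keyword; see the linked script
)
print(ds.schema())
```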

Ping me if you need more information. Thanks.

The IDLE spill worker is expected to be there. It isn't tied to any particular job, and it is shared by the whole cluster right now. I think we can probably kill it after the job is terminated to avoid confusion, though. @ckw017 do you mind creating an issue and tagging me and Clark there?

Tracking the pickling error here: [Bug][Datasets] TypeError: self.hive_partitioning,self.partitioning,self.wrapped cannot be converted to a Python object for pickling · Issue #22453 · ray-project/ray · GitHub, thanks for the repro!

@mmuru Can you try adding os.environ["RAY_DATASET_FORCE_LOCAL_METADATA"] = "1" and see if that helps with the serialization issue? I also cc'd you on the GitHub issue if you'd like to follow along.
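Roughly like this, in case it helps (the address and path are placeholders; my understanding is that the flag makes the Parquet metadata fetch happen locally in the driver, so the pyarrow objects don't need to be pickled over Ray Client):

```python
# Workaround sketch (address and path are placeholders). Set the flag before
# any ray.data call so the metadata fetch runs locally instead of remotely.
import os
os.environ["RAY_DATASET_FORCE_LOCAL_METADATA"] = "1"

import ray

ray.init(address="ray://<head-node>:10001")
ds = ray.data.read_parquet("s3://my-bucket/partitioned-table/")
print(ds.schema())
```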

@ckw017: After setting the environment variable os.environ["RAY_DATASET_FORCE_LOCAL_METADATA"] = "1", there is no pickling error and the serialization works. Yes, I am following issue #22453. I will wait for the proper fix. Thanks for your help.