Reading logs on worker nodes

I’m attempting to use ray job submit from my local machine to launch a hyper-parameter search using Ray Tune on a remote cluster. My job is failing with multiple error messages. Is there a recommended procedure for finding the logs of a given worker in the remote cluster? A tail of the logs is printed in the terminal on my local machine, but I believe it combines the logs from all workers, and I’d like to dig deeper into just one worker. I can kubectl exec into the head node and find logs in /tmp/ray/session_latest/logs, but there are hundreds of log files and it’s not clear to me which one to look at. Any help is appreciated!

Thanks,
Shane

How severely does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

Hi @sbussmann,
What does your console output say about the error? I usually find that sufficient. If you can share it, I may be able to take a look.

Hi @xwjiang2010, thanks for the response. I’d like to attach the console output, but it exceeds the character limit of this forum. Is there a workaround? Tried to upload, but I think only images are allowed. Here is a selection of the primary error I’m facing:

== Status ==
Current time: 2022-03-18 07:54:44 (running for 00:00:12.76)
Memory usage on this node: 5.5/124.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 1.0/58 CPUs, 1.0/2 GPUs, 0.0/148.4 GiB heap, 0.0/63.49 GiB objects
Result logdir: /home/ray/ray_results/tune_deeplearn
Number of trials: 1/1 (1 RUNNING)
+--------------------------------+----------+--------------------+--------------+-----------------+-------------+--------------+
| Trial name                     | status   | loc                |   batch_size |   initial_depth |          lr |   max_levels |
|--------------------------------+----------+--------------------+--------------+-----------------+-------------+--------------|
| train_deeplearn_tune_5105c_00000 | RUNNING  | 100.125.228.77:471 |           32 |              40 | 0.000353205 |            5 |
+--------------------------------+----------+--------------------+--------------+-----------------+-------------+--------------+


(pid=runtime_env) 2022-03-18 07:54:44,443       INFO conda_utils.py:198 -- Installing collected packages: certifi, affine, zipp, typing-extensions, setuptools, pyparsing, numpy, attrs, snuggs, importlib-metadata, click, cligj, click-plugins, rasterio
(train_deeplearn_tune pid=471, ip=100.125.228.77) Using native 16bit precision.
(train_deeplearn_tune pid=471, ip=100.125.228.77) GPU available: True, used: True
(train_deeplearn_tune pid=471, ip=100.125.228.77) TPU available: False, using: 0 TPU cores
(train_deeplearn_tune pid=471, ip=100.125.228.77) IPU available: False, using: 0 IPUs
(train_deeplearn_tune pid=471, ip=100.125.228.77) LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
== Status ==
Current time: 2022-03-18 07:54:48 (running for 00:00:16.76)
Memory usage on this node: 5.2/124.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 1.0/58 CPUs, 1.0/2 GPUs, 0.0/148.4 GiB heap, 0.0/63.49 GiB objects
Result logdir: /home/ray/ray_results/tune_deeplearn
Number of trials: 1/1 (1 RUNNING)
+--------------------------------+----------+--------------------+--------------+-----------------+-------------+--------------+
| Trial name                     | status   | loc                |   batch_size |   initial_depth |          lr |   max_levels |
|--------------------------------+----------+--------------------+--------------+-----------------+-------------+--------------|
| train_deeplearn_tune_5105c_00000 | RUNNING  | 100.125.228.77:471 |           32 |              40 | 0.000353205 |            5 |
+--------------------------------+----------+--------------------+--------------+-----------------+-------------+--------------+


(pid=runtime_env) 2022-03-18 07:54:48,350       INFO conda_utils.py:198 -- ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
(pid=runtime_env) 2022-03-18 07:54:48,350       INFO conda_utils.py:198 -- raydp-nightly 2022.3.9.dev0 requires typing, which is not installed.
(pid=runtime_env) 2022-03-18 07:54:48,350       INFO conda_utils.py:198 -- xgboost-ray 0.1.4 requires numpy<1.20,>=1.16, but you have numpy 1.21.5 which is incompatible.
(pid=runtime_env) 2022-03-18 07:54:48,350       INFO conda_utils.py:198 -- tensorflow 2.6.0 requires numpy~=1.19.2, but you have numpy 1.21.5 which is incompatible.
(pid=runtime_env) 2022-03-18 07:54:48,350       INFO conda_utils.py:198 -- tensorflow 2.6.0 requires typing-extensions~=3.7.4, but you have typing-extensions 4.1.1 which is incompatible.
(pid=runtime_env) 2022-03-18 07:54:48,351       INFO conda_utils.py:198 -- fastapi 0.75.0 requires starlette==0.17.1, but you have starlette 0.16.0 which is incompatible.
(pid=runtime_env) 2022-03-18 07:54:48,351       INFO conda_utils.py:198 -- autogluon-core 0.1.0 requires numpy==1.19.5, but you have numpy 1.21.5 which is incompatible.
(pid=runtime_env) 2022-03-18 07:54:48,351       INFO conda_utils.py:198 -- aiobotocore 1.2.2 requires botocore<1.19.53,>=1.19.52, but you have botocore 1.24.16 which is incompatible.
(pid=runtime_env) 2022-03-18 07:54:48,351       INFO conda_utils.py:198 -- Successfully installed affine-2.3.0 attrs-21.4.0 certifi-2021.10.8 click-8.0.4 click-plugins-1.1.1 cligj-0.7.2 importlib-metadata-4.11.3 numpy-1.21.5 pyparsing-3.0.7 rasterio-1.2.10 setuptools-59.5.0 snuggs-1.4.7 typing-extensions-4.1.1 zipp-3.7.0
(pid=runtime_env) 2022-03-18 07:54:48,351       INFO conda_utils.py:198 -- WARNING: Target directory /tmp/ray/session_2022-03-18_07-01-59_616525_420/runtime_resources/pip/0e34da22171326a78f99e618ffb1059fcebc59d2/zipp.py already exists. Specify --upgrade to force replacement.
(pid=runtime_env) 2022-03-18 07:54:48,352       INFO conda_utils.py:198 -- WARNING: Target directory /tmp/ray/session_2022-03-18_07-01-59_616525_420/runtime_resources/pip/0e34da22171326a78f99e618ffb1059fcebc59d2/rasterio.libs already exists. Specify --upgrade to force replacement.
(pid=runtime_env) 2022-03-18 07:54:48,352       INFO conda_utils.py:198 -- WARNING: Target directory /tmp/ray/session_2022-03-18_07-01-59_616525_420/runtime_resources/pip/0e34da22171326a78f99e618ffb1059fcebc59d2/snuggs-1.4.7.dist-info already exists. Specify --upgrade to force replacement.
(pid=runtime_env) 2022-03-18 07:54:48,352       INFO conda_utils.py:198 -- WARNING: Target directory /tmp/ray/session_2022-03-18_07-01-59_616525_420/runtime_resources/pip/0e34da22171326a78f99e618ffb1059fcebc59d2/certifi-2021.10.8.dist-info already exists. Specify --upgrade to force replacement.
(pid=runtime_env) 2022-03-18 07:54:48,353       INFO conda_utils.py:198 -- WARNING: Target directory /tmp/ray/session_2022-03-18_07-01-59_616525_420/runtime_resources/pip/0e34da22171326a78f99e618ffb1059fcebc59d2/_distutils_hack already exists. Specify --upgrade to force replacement.
(pid=runtime_env) 2022-03-18 07:54:48,353       INFO conda_utils.py:198 -- WARNING: Target directory /tmp/ray/session_2022-03-18_07-01-59_616525_420/runtime_resources/pip/0e34da22171326a78f99e618ffb1059fcebc59d2/attrs already exists. Specify --upgrade to force replacement.
(pid=runtime_env) 2022-03-18 07:54:48,353       INFO conda_utils.py:198 -- WARNING: Target directory /tmp/ray/session_2022-03-18_07-01-59_616525_420/runtime_resources/pip/0e34da22171326a78f99e618ffb1059fcebc59d2/importlib_metadata already exists. Specify --upgrade to force replacement.
(pid=runtime_env) 2022-03-18 07:54:48,354       INFO conda_utils.py:198 -- WARNING: Target directory /tmp/ray/session_2022-03-18_07-01-59_616525_420/runtime_resources/pip/0e34da22171326a78f99e618ffb1059fcebc59d2/pyparsing-3.0.7.dist-info already exists. Specify --upgrade to force replacement.
(pid=runtime_env) 2022-03-18 07:54:48,354       INFO conda_utils.py:198 -- WARNING: Target directory /tmp/ray/session_2022-03-18_07-01-59_616525_420/runtime_resources/pip/0e34da22171326a78f99e618ffb1059fcebc59d2/certifi already exists. Specify --upgrade to force replacement.
(pid=runtime_env) 2022-03-18 07:54:48,354       INFO conda_utils.py:198 -- WARNING: Target directory /tmp/ray/session_2022-03-18_07-01-59_616525_420/runtime_resources/pip/0e34da22171326a78f99e618ffb1059fcebc59d2/snuggs already exists. Specify --upgrade to force replacement.
(pid=runtime_env) 2022-03-18 07:54:48,355       INFO conda_utils.py:198 -- WARNING: Target directory /tmp/ray/session_2022-03-18_07-01-59_616525_420/runtime_resources/pip/0e34da22171326a78f99e618ffb1059fcebc59d2/pyparsing already exists. Specify --upgrade to force replacement.
(pid=runtime_env) 2022-03-18 07:54:48,355       INFO conda_utils.py:198 -- WARNING: Target directory /tmp/ray/session_2022-03-18_07-01-59_616525_420/runtime_resources/pip/0e34da22171326a78f99e618ffb1059fcebc59d2/setuptools already exists. Specify --upgrade to force replacement.
(pid=runtime_env) 2022-03-18 07:54:48,355       INFO conda_utils.py:198 -- WARNING: Target directory /tmp/ray/session_2022-03-18_07-01-59_616525_420/runtime_resources/pip/0e34da22171326a78f99e618ffb1059fcebc59d2/zipp-3.7.0.dist-info already exists. Specify --upgrade to force replacement.
(pid=runtime_env) 2022-03-18 07:54:48,355       INFO conda_utils.py:198 -- WARNING: Target directory /tmp/ray/session_2022-03-18_07-01-59_616525_420/runtime_resources/pip/0e34da22171326a78f99e618ffb1059fcebc59d2/distutils-precedence.pth already exists. Specify --upgrade to force replacement.
(pid=runtime_env) 2022-03-18 07:54:48,356       INFO conda_utils.py:198 -- WARNING: Target directory /tmp/ray/session_2022-03-18_07-01-59_616525_420/runtime_resources/pip/0e34da22171326a78f99e618ffb1059fcebc59d2/cligj already exists. Specify --upgrade to force replacement.
(pid=runtime_env) 2022-03-18 07:54:48,356       INFO conda_utils.py:198 -- WARNING: Target directory /tmp/ray/session_2022-03-18_07-01-59_616525_420/runtime_resources/pip/0e34da22171326a78f99e618ffb1059fcebc59d2/pkg_resources already exists. Specify --upgrade to force replacement.
(pid=runtime_env) 2022-03-18 07:54:48,356       INFO conda_utils.py:198 -- WARNING: Target directory /tmp/ray/session_2022-03-18_07-01-59_616525_420/runtime_resources/pip/0e34da22171326a78f99e618ffb1059fcebc59d2/click already exists. Specify --upgrade to force replacement.
(pid=runtime_env) 2022-03-18 07:54:48,357       INFO conda_utils.py:198 -- WARNING: Target directory /tmp/ray/session_2022-03-18_07-01-59_616525_420/runtime_resources/pip/0e34da22171326a78f99e618ffb1059fcebc59d2/rasterio-1.2.10.dist-info already exists. Specify --upgrade to force replacement.
(pid=runtime_env) 2022-03-18 07:54:48,357       INFO conda_utils.py:198 -- WARNING: Target directory /tmp/ray/session_2022-03-18_07-01-59_616525_420/runtime_resources/pip/0e34da22171326a78f99e618ffb1059fcebc59d2/cligj-0.7.2.dist-info already exists. Specify --upgrade to force replacement.
(pid=runtime_env) 2022-03-18 07:54:48,357       INFO conda_utils.py:198 -- WARNING: Target directory /tmp/ray/session_2022-03-18_07-01-59_616525_420/runtime_resources/pip/0e34da22171326a78f99e618ffb1059fcebc59d2/typing_extensions.py already exists. Specify --upgrade to force replacement.
(pid=runtime_env) 2022-03-18 07:54:48,358       INFO conda_utils.py:198 -- WARNING: Target directory /tmp/ray/session_2022-03-18_07-01-59_616525_420/runtime_resources/pip/0e34da22171326a78f99e618ffb1059fcebc59d2/numpy already exists. Specify --upgrade to force replacement.
(pid=runtime_env) 2022-03-18 07:54:48,358       INFO conda_utils.py:198 -- WARNING: Target directory /tmp/ray/session_2022-03-18_07-01-59_616525_420/runtime_resources/pip/0e34da22171326a78f99e618ffb1059fcebc59d2/rasterio already exists. Specify --upgrade to force replacement.
(pid=runtime_env) 2022-03-18 07:54:48,358       INFO conda_utils.py:198 -- WARNING: Target directory /tmp/ray/session_2022-03-18_07-01-59_616525_420/runtime_resources/pip/0e34da22171326a78f99e618ffb1059fcebc59d2/setuptools-59.5.0.dist-info already exists. Specify --upgrade to force replacement.
(pid=runtime_env) 2022-03-18 07:54:48,359       INFO conda_utils.py:198 -- WARNING: Target directory /tmp/ray/session_2022-03-18_07-01-59_616525_420/runtime_resources/pip/0e34da22171326a78f99e618ffb1059fcebc59d2/click_plugins already exists. Specify --upgrade to force replacement.
(pid=runtime_env) 2022-03-18 07:54:48,359       INFO conda_utils.py:198 -- WARNING: Target directory /tmp/ray/session_2022-03-18_07-01-59_616525_420/runtime_resources/pip/0e34da22171326a78f99e618ffb1059fcebc59d2/click-8.0.4.dist-info already exists. Specify --upgrade to force replacement.
(pid=runtime_env) 2022-03-18 07:54:48,359       INFO conda_utils.py:198 -- WARNING: Target directory /tmp/ray/session_2022-03-18_07-01-59_616525_420/runtime_resources/pip/0e34da22171326a78f99e618ffb1059fcebc59d2/typing_extensions-4.1.1.dist-info already exists. Specify --upgrade to force replacement.
(pid=runtime_env) 2022-03-18 07:54:48,360       INFO conda_utils.py:198 -- WARNING: Target directory /tmp/ray/session_2022-03-18_07-01-59_616525_420/runtime_resources/pip/0e34da22171326a78f99e618ffb1059fcebc59d2/numpy.libs already exists. Specify --upgrade to force replacement.
(pid=runtime_env) 2022-03-18 07:54:48,360       INFO conda_utils.py:198 -- WARNING: Target directory /tmp/ray/session_2022-03-18_07-01-59_616525_420/runtime_resources/pip/0e34da22171326a78f99e618ffb1059fcebc59d2/click_plugins-1.1.1.dist-info already exists. Specify --upgrade to force replacement.
(pid=runtime_env) 2022-03-18 07:54:48,360       INFO conda_utils.py:198 -- WARNING: Target directory /tmp/ray/session_2022-03-18_07-01-59_616525_420/runtime_resources/pip/0e34da22171326a78f99e618ffb1059fcebc59d2/affine-2.3.0.dist-info already exists. Specify --upgrade to force replacement.
(pid=runtime_env) 2022-03-18 07:54:48,360       INFO conda_utils.py:198 -- WARNING: Target directory /tmp/ray/session_2022-03-18_07-01-59_616525_420/runtime_resources/pip/0e34da22171326a78f99e618ffb1059fcebc59d2/attrs-21.4.0.dist-info already exists. Specify --upgrade to force replacement.
(pid=runtime_env) 2022-03-18 07:54:48,361       INFO conda_utils.py:198 -- WARNING: Target directory /tmp/ray/session_2022-03-18_07-01-59_616525_420/runtime_resources/pip/0e34da22171326a78f99e618ffb1059fcebc59d2/numpy-1.21.5.dist-info already exists. Specify --upgrade to force replacement.
(pid=runtime_env) 2022-03-18 07:54:48,361       INFO conda_utils.py:198 -- WARNING: Target directory /tmp/ray/session_2022-03-18_07-01-59_616525_420/runtime_resources/pip/0e34da22171326a78f99e618ffb1059fcebc59d2/__pycache__ already exists. Specify --upgrade to force replacement.
(pid=runtime_env) 2022-03-18 07:54:48,361       INFO conda_utils.py:198 -- WARNING: Target directory /tmp/ray/session_2022-03-18_07-01-59_616525_420/runtime_resources/pip/0e34da22171326a78f99e618ffb1059fcebc59d2/attr already exists. Specify --upgrade to force replacement.
(pid=runtime_env) 2022-03-18 07:54:48,362       INFO conda_utils.py:198 -- WARNING: Target directory /tmp/ray/session_2022-03-18_07-01-59_616525_420/runtime_resources/pip/0e34da22171326a78f99e618ffb1059fcebc59d2/importlib_metadata-4.11.3.dist-info already exists. Specify --upgrade to force replacement.
(pid=runtime_env) 2022-03-18 07:54:48,362       INFO conda_utils.py:198 -- WARNING: Target directory /tmp/ray/session_2022-03-18_07-01-59_616525_420/runtime_resources/pip/0e34da22171326a78f99e618ffb1059fcebc59d2/affine already exists. Specify --upgrade to force replacement.
(pid=runtime_env) 2022-03-18 07:54:48,362       INFO conda_utils.py:198 -- WARNING: Target directory /tmp/ray/session_2022-03-18_07-01-59_616525_420/runtime_resources/pip/0e34da22171326a78f99e618ffb1059fcebc59d2/bin already exists. Specify --upgrade to force replacement.
(pid=runtime_env) 2022-03-18 07:54:48,658       INFO working_dir.py:98 -- Setup working dir for gcs://_ray_pkg_b83624756c441550.zip
(train_deeplearn_tune pid=471, ip=100.125.228.77)
(train_deeplearn_tune pid=471, ip=100.125.228.77)   | Name      | Type      | Params
(train_deeplearn_tune pid=471, ip=100.125.228.77) ----------------------------------------
(train_deeplearn_tune pid=471, ip=100.125.228.77) 0 | train_acc | Accuracy  | 0
(train_deeplearn_tune pid=471, ip=100.125.228.77) 1 | valid_acc | Accuracy  | 0
(train_deeplearn_tune pid=471, ip=100.125.228.77) 2 | model     | NewModel | 12.1 M
(train_deeplearn_tune pid=471, ip=100.125.228.77) ----------------------------------------
(train_deeplearn_tune pid=471, ip=100.125.228.77) 12.1 M    Trainable params
(train_deeplearn_tune pid=471, ip=100.125.228.77) 0         Non-trainable params
(train_deeplearn_tune pid=471, ip=100.125.228.77) 12.1 M    Total params
(train_deeplearn_tune pid=471, ip=100.125.228.77) 48.551    Total estimated model params size (MB)
== Status ==
Current time: 2022-03-18 07:54:53 (running for 00:00:21.77)
Memory usage on this node: 5.3/124.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 1.0/58 CPUs, 1.0/2 GPUs, 0.0/148.4 GiB heap, 0.0/63.49 GiB objects
Result logdir: /home/ray/ray_results/tune_deeplearn
Number of trials: 1/1 (1 RUNNING)
+--------------------------------+----------+--------------------+--------------+-----------------+-------------+--------------+
| Trial name                     | status   | loc                |   batch_size |   initial_depth |          lr |   max_levels |
|--------------------------------+----------+--------------------+--------------+-----------------+-------------+--------------|
| train_deeplearn_tune_5105c_00000 | RUNNING  | 100.125.228.77:471 |           32 |              40 | 0.000353205 |            5 |
+--------------------------------+----------+--------------------+--------------+-----------------+-------------+--------------+


== Status ==
Current time: 2022-03-18 07:54:58 (running for 00:00:26.77)
Memory usage on this node: 5.4/124.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 1.0/58 CPUs, 1.0/2 GPUs, 0.0/148.4 GiB heap, 0.0/63.49 GiB objects
Result logdir: /home/ray/ray_results/tune_deeplearn
Number of trials: 1/1 (1 RUNNING)
+--------------------------------+----------+--------------------+--------------+-----------------+-------------+--------------+
| Trial name                     | status   | loc                |   batch_size |   initial_depth |          lr |   max_levels |
|--------------------------------+----------+--------------------+--------------+-----------------+-------------+--------------|
| train_deeplearn_tune_5105c_00000 | RUNNING  | 100.125.228.77:471 |           32 |              40 | 0.000353205 |            5 |
+--------------------------------+----------+--------------------+--------------+-----------------+-------------+--------------+


2022-03-18 07:55:00,904 ERROR trial_runner.py:920 -- Trial train_deeplearn_tune_5105c_00000: Error processing event.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 886, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 675, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1763, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(TuneError): ray::ImplicitFunc.train() (pid=471, ip=100.125.228.77, repr=train_deeplearn_tune)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trainable.py", line 319, in train
    result = self.step()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/function_runner.py", line 381, in step
    self._report_thread_runner_error(block=True)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/function_runner.py", line 532, in _report_thread_runner_error
    ("Trial raised an exception. Traceback:\n{}".format(err_tb_str)
ray.tune.error.TuneError: Trial raised an exception. Traceback:
ray::ImplicitFunc.train() (pid=471, ip=100.125.228.77, repr=train_deeplearn_tune)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/function_runner.py", line 262, in run
    self._entrypoint()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/function_runner.py", line 331, in entrypoint
    self._status_reporter.get_checkpoint())
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/function_runner.py", line 600, in _trainable_func
    output = fn()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/utils/trainable.py", line 371, in inner
    trainable(config, **fn_kwargs)
  File "deeplearn/scripts/ray/tune_deeplearn.py", line 63, in train_deeplearn_tune
    trainer.fit(deeplearn_model, datamodule=deeplearn_datamodule)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/mlflow/utils/autologging_utils/safety.py", line 532, in safe_patch_function
    patch_function(call_original, *args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/mlflow/utils/autologging_utils/safety.py", line 242, in patch_with_managed_run
    result = patch_function(original, *args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/mlflow/pytorch/_pytorch_autolog.py", line 293, in patched_fit
    result = original(self, *args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/mlflow/utils/autologging_utils/safety.py", line 513, in call_original
    return call_original_fn_with_event_logging(_original_fn, og_args, og_kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/mlflow/utils/autologging_utils/safety.py", line 456, in call_original_fn_with_event_logging
    original_fn_result = original_fn(*og_args, **og_kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/mlflow/utils/autologging_utils/safety.py", line 510, in _original_fn
    original_result = original(*_og_args, **_og_kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 552, in fit
    self._run(model)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 922, in _run
    self._dispatch()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 990, in _dispatch
    self.accelerator.start_training(self)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 92, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 161, in start_training
    self._results = trainer.run_stage()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1000, in run_stage
    return self._run_train()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1049, in _run_train
    self.fit_loop.run()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 111, in run
    self.advance(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/pytorch_lightning/loops/fit_loop.py", line 200, in advance
    epoch_output = self.epoch_loop.run(train_dataloader)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 111, in run
    self.advance(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 118, in advance
    _, (batch, is_last) = next(dataloader_iter)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/pytorch_lightning/profiler/base.py", line 104, in profile_iterable
    value = next(iterator)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/supporters.py", line 672, in prefetch_iterator
    for val in it:
  File "/home/ray/anaconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/supporters.py", line 589, in __next__
    return self.request_next_batch(self.loader_iters)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/supporters.py", line 617, in request_next_batch
    return apply_to_collection(loader_iters, Iterator, next_fn)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/pytorch_lightning/utilities/apply_func.py", line 96, in apply_to_collection
    return function(data, *args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/supporters.py", line 604, in next_fn
    batch = next(iterator)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
    data = self._next_data()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1183, in _next_data
    return self._process_data(data)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1229, in _process_data
    data.reraise()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/torch/_utils.py", line 425, in reraise
    raise self.exc_type(msg)
TypeError: __init__() takes 1 positional argument but 2 were given

I think something is going wrong either at the end of loading the first batch of data or at the start of loading the second batch. I’m using a PyTorch Lightning LightningDataModule to load data. It streams data from S3 using io.BytesIO.

I was able to dig in further by reducing the batch size, limiting the dataset size via limit_train_batches, using only one worker in the LightningDataModule data loader, and adding progress_bar_refresh_rate=1. I noticed that the error (TypeError: __init__() takes 1 positional argument but 2 were given) occurs when the validation loop starts in parallel with the training loop. Maybe I need to use some kind of FileLock? I’m confused as to why this doesn’t happen when I train using this same model and data on a dedicated EC2 instance.
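One possible reading of the traceback above, not confirmed anywhere in this thread: the last frame is `raise self.exc_type(msg)`, which rebuilds the exception that occurred in the data-loading worker by calling its class with a single positional string. If the original exception class only accepts keyword arguments (I believe some client-library error classes, botocore's among them, are defined that way), that call itself fails with exactly this TypeError and hides the real error. The sketch below reproduces the effect in plain Python; `KeywordOnlyError`, `reraise_like_wrapper`, and `load_item_safely` are hypothetical names made up for illustration.

```python
class KeywordOnlyError(Exception):
    """Hypothetical stand-in for a library error that accepts only keyword arguments."""

    def __init__(self, **kwargs):
        super().__init__("failed: {}".format(kwargs))


def reraise_like_wrapper(exc_type, msg):
    """Mimic the final traceback frame: rebuild the exception from one positional string."""
    raise exc_type(msg)


def masked_error_message():
    """Show that the rebuild step raises TypeError instead of the original error."""
    try:
        reraise_like_wrapper(KeywordOnlyError, "Caught KeywordOnlyError in a worker.")
    except TypeError as err:
        return str(err)


def load_item_safely(load_fn, index):
    """Run load_fn(index); convert any failure into a RuntimeError with a plain message."""
    try:
        return load_fn(index)
    except Exception as err:
        # RuntimeError accepts a single string, so it survives being rebuilt elsewhere.
        raise RuntimeError("loading item {} failed: {!r}".format(index, err)) from err
```

If this reading applies, wrapping the body of the dataset's `__getitem__` in something like `load_item_safely` (or temporarily setting `num_workers=0` so the exception is raised in the main process) should surface the underlying error, which may well differ between the cluster and the EC2 instance, for example in credentials or network access.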

Do you know which function raises the error (is there a stack trace)? If you can share some code, that would be helpful as well.
Also, which Ray version are you using?
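On the original question of which log file to open: the console output tags each line with the worker's PID and node IP (pid=471, ip=100.125.228.77 above). A minimal sketch for narrowing down the files, assuming the worker log file names end with the worker's PID (names of the form worker-<worker_id>-<job_id>-<pid>.out and .err, which is my understanding of Ray's convention and may differ between versions). `find_worker_logs` is a hypothetical helper, not part of Ray.

```python
from pathlib import Path


def find_worker_logs(log_dir, pid):
    """Return files in log_dir whose name (minus extension) ends with -<pid> or _<pid>."""
    matches = []
    for path in sorted(Path(log_dir).iterdir()):
        if not path.is_file():
            continue
        stem = path.stem  # file name without its final extension
        if stem.endswith("-{}".format(pid)) or stem.endswith("_{}".format(pid)):
            matches.append(path)
    return matches


if __name__ == "__main__":
    # Assumed default location of the current session's logs on a Ray node.
    log_dir = Path("/tmp/ray/session_latest/logs")
    if log_dir.is_dir():
        for log_file in find_worker_logs(log_dir, 471):
            print(log_file)
```

Each node keeps its own log directory, so this would need to run on the pod whose IP matches the one in the console output rather than on the head node.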