This is a bug report (I think?)
This line (data/ at 4ea88d1fb4d279def9213a23b054b4e7d46d5b3d · pytorch/data · GitHub) times out when using the TorchTrainer.
This happens when using a DataLoader built from torchdata DataPipes, with the fullsync DataPipe at the end.
Hey @Vedant_Roy, thanks for reporting this issue.
Could you provide us with a reproducible example so that we can debug this?
Sure thing, will do when I get a moment.
Edit: The bug happens intermittently, e.g., it won’t happen on the first run and then it will happen on the second run.
Here’s a reproduction (@bveeramani):
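import ray
import ray.train.torch as ray_torch
import torchdata.datapipes.iter as pipes
from ray.air import ScalingConfig
from torch.utils.data import DataLoader


def loader():
    # 2000 integers, grouped into batches of 5, then synchronized across ranks.
    pipe = pipes.IterableWrapper(list(range(2000)))
    pipe = pipe.batch(5)
    pipe = pipe.fullsync()
    # batch_size=None because the pipe already yields batches.
    return DataLoader(pipe, batch_size=None, num_workers=5)


def train_loop():
    dl = loader()
    x = next(iter(dl))  # fetch a single batch


# The original post is cut off after scaling_config; the train_loop_per_worker
# argument, the closing parenthesis, and fit() are the presumed completion.
trainer = ray_torch.TorchTrainer(
    train_loop_per_worker=train_loop,
    scaling_config=ScalingConfig(num_workers=2, use_gpu=False),
)
trainer.fit()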
Dockerfile: rayproject/ray:6f5f1e-py38-cu116
Note, the bug is flaky, so I would recommend starting a local cluster & running this script. If it succeeds, then run it again, and it should fail the second time.
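Since the failure only shows up on some runs, the "run it again" step could be automated with a small wrapper. This is only a rough sketch using the standard library: `run_until_failure` and its parameters are illustrative names, not part of Ray or torchdata, and it assumes the reproduction above has been saved to a file (called `repro.py` in the note below, which is a hypothetical name).

```python
import subprocess
import sys


def run_until_failure(cmd, max_runs=5, timeout_s=600):
    """Run cmd repeatedly.

    Returns the 1-based index of the first run that times out or exits
    with a non-zero code, or None if all max_runs runs succeed.
    """
    for attempt in range(1, max_runs + 1):
        try:
            # subprocess.run kills the child process if the timeout expires.
            result = subprocess.run(cmd, timeout=timeout_s)
        except subprocess.TimeoutExpired:
            print(f"run {attempt}: timed out after {timeout_s}s")
            return attempt
        if result.returncode != 0:
            print(f"run {attempt}: exited with code {result.returncode}")
            return attempt
        print(f"run {attempt}: ok")
    return None
```

For example, `run_until_failure([sys.executable, "repro.py"])` would rerun the script against the already-running local cluster until one run hangs past the timeout or exits with an error. The 600-second default is an arbitrary choice and may need to be longer than the timeout used inside the fullsync DataPipe for the hang to surface as an error rather than a wrapper timeout.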
Package Version Editable project location
-------------------------------------- ------------------------ -------------------------
adal 1.2.7
aiofiles 22.1.0
aiohttp 3.8.3
aiohttp-cors 0.7.0
aiorwlock 1.3.0
aiosignal 1.2.0
anyio 3.6.2
applicationinsights 0.11.10
argcomplete 1.12.3
asttokens 2.1.0
async-timeout 4.0.2
attrs 22.1.0
av 9.2.0
azure-cli-core 2.40.0
azure-cli-telemetry 1.0.8
azure-common 1.1.28
azure-core 1.26.0
azure-identity 1.10.0
azure-mgmt-compute 23.1.0
azure-mgmt-core 1.3.2
azure-mgmt-network 19.0.0
azure-mgmt-resource 20.0.0
backcall 0.2.0
backoff 1.10.0
bcrypt 4.0.1
black 22.10.0
blessed 1.19.1
boto3 1.26.8
boto3-stubs 1.25.5
botocore 1.29.8
botocore-stubs 1.28.5
brotlipy 0.7.0
cachetools 5.2.0
certifi 2022.9.24
cffi 1.15.1
charset-normalizer 2.0.4
click 8.0.4
cloudpickle 2.2.0
colorful 0.5.4
commonmark 0.9.1
conda 22.9.0
conda-package-handling 1.9.0
contourpy 1.0.5
cryptography 38.0.1
cursor 1.3.5
cycler 0.11.0
Cython 0.29.26
debugpy 1.5.1
decorator 5.1.1
dill 0.3.6
distlib 0.3.6
distributed-ml 0.0.0 /app
dm-tree 0.1.7
docker-pycreds 0.4.0
docutils 0.19
einops 0.6.0
entrypoints 0.4
executing 1.2.0
fastapi 0.85.1
filelock 3.8.0
flash-attn 0.1
flatbuffers 22.9.24
fonttools 4.37.4
frozenlist 1.3.1
fsspec 2022.10.0
gitdb 4.0.9
GitPython 3.1.29
google-api-core 2.10.2
google-api-python-client 1.7.8
google-auth 2.13.0
google-auth-httplib2 0.1.0
google-oauth 1.0.1
googleapis-common-protos 1.56.4
gpustat 1.0.0
grpcio 1.50.0
gym 0.23.1
gym-notices 0.0.8
h11 0.14.0
halo 0.0.29
httplib2 0.20.4
humanfriendly 10.0
humanize 4.4.0
idna 3.4
imageio 2.22.2
importlib-metadata 5.0.0
importlib-resources 5.10.0
ipykernel 6.15.2
ipython 8.6.0
iso8601 1.1.0
isodate 0.6.1
jedi 0.18.1
jmespath 0.10.0
jsonschema 4.16.0
jupyter_client 7.3.5
jupyter_core 4.11.2
kiwisolver 1.4.4
knack 0.10.0
kopf 1.35.6
kubernetes 24.2.0
libcst 0.4.9
log-symbols 0.0.14
logfmt 0.4
lz4 4.0.2
matplotlib 3.6.1
matplotlib-inline 0.1.6
moreorless 0.4.0
mpmath 1.2.1
msal 1.18.0b1
msal-extensions 1.0.0
msgpack 1.0.4
msrest 0.7.1
msrestazure 0.6.4
multidict 6.0.2
mypy-boto3-cloudformation 1.25.4
mypy-boto3-dynamodb 1.25.0
mypy-boto3-ec2 1.25.5
mypy-boto3-lambda 1.25.0
mypy-boto3-rds 1.25.1
mypy-boto3-s3 1.25.0
mypy-boto3-sqs 1.25.0
mypy-extensions 0.4.3
nest-asyncio 1.5.6
networkx 2.8.7
numpy 1.23.4
nvidia-ml-py 11.495.46
oauthlib 3.2.2
opencensus 0.11.0
opencensus-context 0.1.3
opentelemetry-api 1.1.0
opentelemetry-exporter-otlp 1.1.0
opentelemetry-exporter-otlp-proto-grpc 1.1.0
opentelemetry-proto 1.1.0
opentelemetry-sdk 1.1.0
opentelemetry-semantic-conventions 0.20b0
packaging 21.3
pandas 1.5.1
paramiko 2.11.0
parquet-tools 0.2.11
parso 0.8.3
pathspec 0.10.2
pathtools 0.1.2
pexpect 4.8.0
pickleshare 0.7.5
pillow 9.0.0
Pillow-SIMD 9.0.0.post1
pip 22.2.2
pkginfo 1.8.3
pkgutil_resolve_name 1.3.10
platformdirs 2.5.2
portalocker 2.6.0
prometheus-client 0.13.1
promise 2.3
prompt-toolkit 3.0.32
protobuf 3.19.6
psutil 5.9.3
psycopg2-binary 2.9.5
ptyprocess 0.7.0
pure-eval 0.2.2
py-spy 0.3.14
pyarrow 6.0.1
pyasn1 0.4.8
pyasn1-modules 0.2.8
pycosat 0.6.4
pycparser 2.21
pydantic 1.10.2
pydash 5.1.1
Pygments 2.13.0
PyJWT 2.6.0
PyNaCl 1.5.0
pyOpenSSL 22.0.0
pyparsing 3.0.9
pyrsistent 0.18.1
PySocks 1.7.1
python-dateutil 2.8.2
python-dotenv 0.21.0
python-json-logger 2.0.4
pytz 2022.5
PyWavelets 1.4.1
PyYAML 6.0
pyzmq 23.2.0
ray 3.0.0.dev0
redis 3.5.3
requests 2.28.1
requests-oauthlib 1.3.1
rich 12.6.0
rsa 4.9
ruamel-yaml-conda 0.15.100
s3transfer 0.6.0
scikit-image 0.19.3
scipy 1.9.3
sentry-sdk 1.10.1
setproctitle 1.3.2
setuptools 65.5.0
shortuuid 1.0.11
six 1.16.0
smart-open 6.2.0
smmap 5.0.0
sniffio 1.3.0
spinners 0.0.24
stack-data 0.6.1
starlette 0.20.4
stdlibs 2022.10.9
structlog 22.1.0
sympy 1.11.1
tabulate 0.8.10
tensorboardX 2.5.1
termcolor 2.1.0
thrift 0.13.0
tifffile 2022.10.10
tokenize-rt 5.0.0
toml 0.10.2
tomli 2.0.1
toolz 0.12.0
torch 1.14.0.dev20221027+cu116
torchdata 0.6.0.dev20221027
torchsnapshot-nightly 2022.10.29
torchvision 0.15.0a0+edb3a80
tornado 6.2
tqdm 4.64.1
trailrunner 1.2.1
traitlets 5.5.0
typer 0.6.1
types-awscrt 0.15.3
types-s3transfer 0.6.0.post4
typing_extensions 4.4.0
typing-inspect 0.8.0
uritemplate 3.0.1
urllib3 1.26.12
usort 1.0.5
uvicorn 0.19.0
virtualenv 20.16.5
wandb 0.13.4
wcwidth 0.2.5
websocket-client 1.4.1
wheel 0.37.1
yarl 1.8.1
zipp 3.9.0
December 20, 2022, 4:32am
Just closing the loop here: this seems to be an issue with the torchdata DataPipe rather than with Ray.
See the full thread here: Allow passing in init_process_group kwargs to fullsync datapipe · Issue #868 · pytorch/data · GitHub