Horovod install/build failing with latest ray nightly

I'm trying to update my Ray images to the latest nightly, and the horovod install is failing. The root error appears to be:

cmake: symbol lookup error: cmake: undefined symbol: archive_write_add_filter_zstd

Unless something in my base image changed (which seems unlikely, but not impossible), the only other thing that changed is the Ray nightly. All my Python library versions are pinned.

Can you tell me more about this issue? We recently updated our docker images for ray-ml to include the CUDA dev toolkit.

Well, effectively what I'm doing is:
pipenv install -r requirements.txt

where requirements.txt is:

ray[default] @ https://ray-wheels.s3-us-west-2.amazonaws.com/master/dde7cbd2885765a9fc61b05daca4f6f9973aed10/ray-2.0.0.dev0-cp36-cp36m-manylinux2014_x86_64.whl
cython==0.29.0
flatbuffers==1.12
dataclasses; python_version < '3.7'
tensorflow==2.4.1
#horovod[ray]==0.21.3
keras==2.4.3
scikit-learn==0.24.1
dask[arrays,dataframes]==2021.2.0
pandas==1.1.5
scipy==1.5.4

Technically, a first image build installs the Ray nightly, cython, flatbuffers, and dataclasses, and a subsequent build installs the remainder (the ML packages) on top.

Here’s my current state. I build using the build.sh script at the top. One should be able to replace podman build ... directly with docker build ...

BTW, maybe this is related to the most recent horovod release, which came out a couple days ago?

It’s unlikely that Ray has any effect on the Horovod build.

Looks like CMake was built to use a certain library that is missing at runtime. Can you run ldd /usr/bin/cmake and share the result?
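If it helps, the same check can be scripted. This is only a rough sketch: it assumes the missing symbol comes from libarchive (`archive_write_add_filter_zstd` is a libarchive function) and that cmake in your image links libarchive dynamically. The helper names `parse_ldd`, `has_symbol`, and `check_cmake` are made up for illustration.

```python
import ctypes
import subprocess
from typing import Dict, Optional


def parse_ldd(ldd_output: str) -> Dict[str, Optional[str]]:
    """Map each shared library named in `ldd` output to its resolved path.

    Libraries the loader could not find map to None.
    """
    resolved: Dict[str, Optional[str]] = {}
    for line in ldd_output.splitlines():
        if "=>" not in line:
            continue  # e.g. linux-vdso or the dynamic loader itself
        name, target = (part.strip() for part in line.split("=>", 1))
        if target.startswith("not found"):
            resolved[name] = None
        else:
            # Drop the trailing load address, e.g. " (0x00007f...)".
            resolved[name] = target.split(" (", 1)[0]
    return resolved


def has_symbol(library_path: str, symbol: str) -> bool:
    """Return True if the shared library loads and exports `symbol`."""
    try:
        library = ctypes.CDLL(library_path)
    except OSError:
        return False  # library missing or not loadable
    return hasattr(library, symbol)


def check_cmake(cmake_path: str = "/usr/bin/cmake",
                symbol: str = "archive_write_add_filter_zstd") -> None:
    """Print unresolved libraries and whether libarchive exports `symbol`.

    Assumption: the symbol is expected to come from libarchive.
    """
    output = subprocess.run(
        ["ldd", cmake_path], capture_output=True, text=True, check=True
    ).stdout
    for name, path in parse_ldd(output).items():
        if path is None:
            print(f"{name}: not found")
        elif name.startswith("libarchive"):
            print(f"{name} -> {path}: exports {symbol}? {has_symbol(path, symbol)}")
```

Calling `check_cmake()` inside the image should show which libarchive cmake resolves to and whether that copy has the zstd filter. If it does not, the cmake binary and the installed libarchive are probably out of sync, which would point at the base image rather than Ray or Horovod.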

By the way, might be worth checking out the Horovod on Ray Docker images we’re publishing as a point of comparison:

The fact that these are building correctly suggests it’s not an issue specific to Horovod or Ray or how they interact, but rather an environment issue.


My local repro test is polluted; I will have to retest this later.

@tgaddair you were right about cmake, so it WAS a base image problem. I revved to UBI-8.4 and it builds fine. Thanks!
