ImportError: cannot import name 'ParamSpec' from 'typing_extensions' when creating a cluster on Azure

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

Hello.
I’m having a bit of trouble getting a cluster to work on Azure. The cluster is created successfully, and I can attach to it and see the dashboard without any problems. However, when I try to check the status or submit some code, I always get the following error:

(base) ray@ray-default-head-bd9890040:~$ ray status
No cluster status.
The autoscaler failed with the following error:
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/monitor.py", line 483, in run
    self._initialize_autoscaler()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/monitor.py", line 236, in _initialize_autoscaler
    prom_metrics=self.prom_metrics,
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 212, in __init__
    self.reset(errors_fatal=True)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 973, in reset
    raise e
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 918, in reset
    self.config["provider"], self.config["cluster_name"]
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/providers.py", line 241, in _get_node_provider
    provider_cls = _get_node_provider_cls(provider_config)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/providers.py", line 217, in _get_node_provider_cls
    return importer(provider_config)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/providers.py", line 40, in _import_azure
    from ray.autoscaler._private._azure.node_provider import AzureNodeProvider
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/_azure/node_provider.py", line 9, in <module>
    from azure.mgmt.network import NetworkManagementClient
  File "/home/ray/anaconda3/lib/python3.7/site-packages/azure/mgmt/network/__init__.py", line 9, in <module>
    from ._network_management_client import NetworkManagementClient
  File "/home/ray/anaconda3/lib/python3.7/site-packages/azure/mgmt/network/_network_management_client.py", line 20, in <module>
    from ._operations_mixin import NetworkManagementClientOperationsMixin
  File "/home/ray/anaconda3/lib/python3.7/site-packages/azure/mgmt/network/_operations_mixin.py", line 19, in <module>
    from azure.core.polling import LROPoller, NoPolling, PollingMethod
  File "/home/ray/anaconda3/lib/python3.7/site-packages/azure/core/polling/__init__.py", line 28, in <module>
    from ._poller import LROPoller, NoPolling, PollingMethod
  File "/home/ray/anaconda3/lib/python3.7/site-packages/azure/core/polling/_poller.py", line 37, in <module>
    from azure.core.tracing.decorator import distributed_trace
  File "/home/ray/anaconda3/lib/python3.7/site-packages/azure/core/tracing/decorator.py", line 31, in <module>
    from typing_extensions import ParamSpec
ImportError: cannot import name 'ParamSpec' from 'typing_extensions' (/home/ray/anaconda3/lib/python3.7/site-packages/typing_extensions.py)

And

(base) ray@ray-default-head-bd9890040:~$ python -c 'import ray; ray.init(address="auto")'
2022-07-18 05:39:39,390 WARNING worker.py:1382 -- The autoscaler failed with the following error:
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/monitor.py", line 483, in run
    self._initialize_autoscaler()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/monitor.py", line 236, in _initialize_autoscaler
    prom_metrics=self.prom_metrics,
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 212, in __init__
    self.reset(errors_fatal=True)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 973, in reset
    raise e
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 918, in reset
    self.config["provider"], self.config["cluster_name"]
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/providers.py", line 241, in _get_node_provider
    provider_cls = _get_node_provider_cls(provider_config)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/providers.py", line 217, in _get_node_provider_cls
    return importer(provider_config)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/providers.py", line 40, in _import_azure
    from ray.autoscaler._private._azure.node_provider import AzureNodeProvider
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/_azure/node_provider.py", line 9, in <module>
    from azure.mgmt.network import NetworkManagementClient
  File "/home/ray/anaconda3/lib/python3.7/site-packages/azure/mgmt/network/__init__.py", line 9, in <module>
    from ._network_management_client import NetworkManagementClient
  File "/home/ray/anaconda3/lib/python3.7/site-packages/azure/mgmt/network/_network_management_client.py", line 20, in <module>
    from ._operations_mixin import NetworkManagementClientOperationsMixin
  File "/home/ray/anaconda3/lib/python3.7/site-packages/azure/mgmt/network/_operations_mixin.py", line 19, in <module>
    from azure.core.polling import LROPoller, NoPolling, PollingMethod
  File "/home/ray/anaconda3/lib/python3.7/site-packages/azure/core/polling/__init__.py", line 28, in <module>
    from ._poller import LROPoller, NoPolling, PollingMethod
  File "/home/ray/anaconda3/lib/python3.7/site-packages/azure/core/polling/_poller.py", line 37, in <module>
    from azure.core.tracing.decorator import distributed_trace
  File "/home/ray/anaconda3/lib/python3.7/site-packages/azure/core/tracing/decorator.py", line 31, in <module>
    from typing_extensions import ParamSpec
ImportError: cannot import name 'ParamSpec' from 'typing_extensions' (/home/ray/anaconda3/lib/python3.7/site-packages/typing_extensions.py)

This is my config file:

# An unique identifier for the head node and workers of this cluster.
cluster_name: default

# The maximum number of workers nodes to launch in addition to the head
# node.
max_workers: 2

# The autoscaler will scale up the cluster faster with higher upscaling speed.
# E.g., if the task requires adding more nodes then autoscaler will gradually
# scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
# This number should be > 0.
upscaling_speed: 1.0

# This executes all commands on all nodes in the docker container,
# and opens all the necessary ports to support the Ray cluster.
# Empty object means disabled.
docker:
    image: "rayproject/ray-ml:latest-gpu" # You can change this to latest-cpu if you don't need GPU support and want a faster startup
    # image: rayproject/ray:latest-gpu   # use this one if you don't need ML dependencies, it's faster to pull
    container_name: "ray_container"
    # If true, pulls latest version of image. Otherwise, `docker run` will only pull the image
    # if no cached version is present.
    pull_before_run: True
    run_options:   # Extra options to pass into "docker run"
        - --ulimit nofile=65536:65536

    # Example of running a GPU head with CPU workers
    # head_image: "rayproject/ray-ml:latest-gpu"
    # Allow Ray to automatically detect GPUs

    # worker_image: "rayproject/ray-ml:latest-cpu"
    # worker_run_options: []

# If a node is idle for this many minutes, it will be removed.
idle_timeout_minutes: 5

# Cloud-provider specific configuration.
provider:
    type: azure
    # https://azure.microsoft.com/en-us/global-infrastructure/locations
    location: eastus2
    resource_group: RayCluster03
    # set subscription id otherwise the default from az cli will be used
    subscription_id: 00000000-0000-0000-0000-000000000000

# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: ubuntu
    # you must specify paths to matching private and public key pair files
    # use `ssh-keygen -t rsa -b 4096` to generate a new ssh key pair
    ssh_private_key: ~/.ssh/id_rsa
    # changes to this should match what is specified in file_mounts
    ssh_public_key: ~/.ssh/id_rsa.pub

# More specific customization to node configurations can be made using the ARM template azure-vm-template.json file
# See documentation here: https://docs.microsoft.com/en-us/azure/templates/microsoft.compute/2019-03-01/virtualmachines
# Changes to the local file will be used during deployment of the head node, however worker nodes deployment occurs
# on the head node, so changes to the template must be included in the wheel file used in setup_commands section below

# Tell the autoscaler the allowed node types and the resources they provide.
# The key is the name of the node type, which is just for debugging purposes.
# The node config specifies the launch config and physical instance type.
available_node_types:
    ray.head.default:
        # The resources provided by this node type.
        resources: {"CPU": 2}
        # Provider-specific config, e.g. instance type.
        node_config:
            azure_arm_parameters:
                vmSize: Standard_D2s_v3
                # List images https://docs.microsoft.com/en-us/azure/virtual-machines/linux/cli-ps-findimage
                imagePublisher: microsoft-dsvm
                imageOffer: ubuntu-1804
                imageSku: 1804-gen2
                imageVersion: latest
                priority: Low

    ray.worker.default:
        # The minimum number of worker nodes of this type to launch.
        # This number should be >= 0.
        min_workers: 2
        # The maximum number of worker nodes of this type to launch.
        # This takes precedence over min_workers.
        max_workers: 4
        # The resources provided by this node type.
        resources: {"CPU": 2}
        # Provider-specific config, e.g. instance type.
        node_config:
            azure_arm_parameters:
                vmSize: Standard_D2s_v3
                # List images https://docs.microsoft.com/en-us/azure/virtual-machines/linux/cli-ps-findimage
                imagePublisher: microsoft-dsvm
                imageOffer: ubuntu-1804
                imageSku: 1804-gen2
                imageVersion: latest
                # optionally set priority to use Spot instances
                priority: Low
                # set a maximum price for spot instances if desired
                # billingProfile:
                #     maxPrice: -1

# Specify the node type of the head node (as configured above).
head_node_type: ray.head.default

# Files or directories to copy to the head and worker nodes. The format is a
# dictionary from REMOTE_PATH: LOCAL_PATH, e.g.
file_mounts: {
#    "/path1/on/remote/machine": "/path1/on/local/machine",
#    "/path2/on/remote/machine": "/path2/on/local/machine",
     "~/.ssh/id_rsa.pub": "~/.ssh/id_rsa.pub"
}

# Files or directories to copy from the head node to the worker nodes. The format is a
# list of paths. The same path on the head node will be copied to the worker node.
# This behavior is a subset of the file_mounts behavior. In the vast majority of cases
# you should just use file_mounts. Only use this if you know what you're doing!
cluster_synced_files: []

# Whether changes to directories in file_mounts or cluster_synced_files in the head node
# should sync to the worker node continuously
file_mounts_sync_continuously: False

# Patterns for files to exclude when running rsync up or rsync down
rsync_exclude:
    - "**/.git"
    - "**/.git/**"

# Pattern files to use for filtering out files when running rsync up or rsync down. The file is searched for
# in the source directory and recursively through all subdirectories. For example, if .gitignore is provided
# as a value, the behavior will match git's behavior for finding and using .gitignore files.
rsync_filter:
    - ".gitignore"

# List of commands that will be run before `setup_commands`. If docker is
# enabled, these commands will run outside the container and before docker
# is setup.
initialization_commands:
    # enable docker setup
    - sudo usermod -aG docker $USER || true
    - sleep 10  # delay to avoid docker permission denied errors
    # get rid of annoying Ubuntu message
    - touch ~/.sudo_as_admin_successful

# List of shell commands to run to set up nodes.
# NOTE: rayproject/ray-ml:latest has ray latest bundled
setup_commands: []
    # Note: if you're developing Ray, you probably want to create a Docker image that
    # has your Ray repo pre-cloned. Then, you can replace the pip installs
    # below with a git checkout <your_sha> (and possibly a recompile).
    # To run the nightly version of ray (as opposed to the latest), either use a rayproject docker image
    # that has the "nightly" (e.g. "rayproject/ray-ml:nightly-gpu") or uncomment the following line:
    # - pip install -U "ray[default] @ https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp38-cp38-manylinux2014_x86_64.whl"

# Custom commands that will be run on the head node after common setup.
# NOTE: rayproject/ray-ml:latest has azure packages bundled
head_setup_commands: []
    # - pip install -U azure-cli-core==2.22.0 azure-mgmt-compute==14.0.0 azure-mgmt-msi==1.0.0 azure-mgmt-network==10.2.0 azure-mgmt-resource==13.0.0

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076

head_node: {}
worker_nodes: {}

Any ideas on how I can fix this?

Edit: I’m using ray version 1.13.0

Hi @dani. This looks like a version incompatibility with the typing_extensions dependency. I see a similar issue with the Azure SDK and system-installed typing_extensions here.

Could you run pip show typing-extensions and share the output?
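In the meantime, a quick way to check whether the installed typing_extensions is the culprit is a minimal probe on the head node. This is just a sketch of the same import azure-core performs; ParamSpec was only added to typing_extensions in release 3.10.0.0 (the PEP 612 backport), so anything older fails the same way:

```python
# Probe for ParamSpec, the symbol azure-core's tracing decorator imports.
# typing_extensions added ParamSpec in 3.10.0.0, so the 3.7.4.3 release
# bundled in some environments predates it and raises ImportError.
try:
    from typing_extensions import ParamSpec  # noqa: F401
    print("ParamSpec available")
except ImportError:
    print("ParamSpec missing; typing_extensions is too old")
```

If this prints the "missing" line, upgrading typing_extensions (or the packages that pin it) should make the autoscaler error go away.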

cc @gramhagen who knows more about the Azure stack

Was this installed from PyPI or from local source?

Hi @cade,
This is the output from pip show typing-extensions.

(base) ray@ray-default-head-bd9890040:~$ pip show typing-extensions
Name: typing-extensions
Version: 3.7.4.3
Summary: Backported and Experimental Type Hints for Python 3.5+
Home-page: https://github.com/python/typing/blob/master/typing_extensions/README.rst
Author: Guido van Rossum, Jukka Lehtosalo, Lukasz Langa, Michael Lee
Author-email: levkivskyi@gmail.com
License: PSF
Location: /home/ray/anaconda3/lib/python3.7/site-packages
Requires: 
Required-by: aiohttp, aioitertools, anyio, argon2-cffi, asgiref, async-timeout, azure-core, bokeh, cmd2, GitPython, h11, huggingface-hub, importlib-metadata, jax, jsonschema, kiwisolver, kopf, nevergrad, onnx, pydantic, pytorch-lightning, starlette, tensorflow, torch, torchmetrics, uvicorn, yarl

Hi @gramhagen,
I used PyPI to install the Ray package.

More specifically, I know Dani used:

pip install ray==1.13.0 
pip install ray[tune]==1.13.0
pip install ray[rllib]==1.13.0

to install Ray.

Hmm, it seems like something could be out of sync with the conda env in the rayproject/ray-ml:latest-gpu image.

If I use these values in my YAML file, I do not see the errors. Does this work for you? I’m not sure whether reinstalling the Azure packages makes the difference, whether it’s the Ray wheel, or whether both are needed.

# List of shell commands to run to set up nodes.
# NOTE: rayproject/ray-ml:latest has ray latest bundled
setup_commands:
    # Note: if you're developing Ray, you probably want to create a Docker image that
    # has your Ray repo pre-cloned. Then, you can replace the pip installs
    # below with a git checkout <your_sha> (and possibly a recompile).
    # To run the nightly version of ray (as opposed to the latest), either use a rayproject docker image
    # that has the "nightly" (e.g. "rayproject/ray-ml:nightly-gpu") or uncomment the following line:
    - pip install -U "ray[default] @ https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-2.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl"

# Custom commands that will be run on the head node after common setup.
# NOTE: rayproject/ray-ml:latest has azure packages bundled
head_setup_commands: 
    - pip install -U azure-cli-core==2.22.0 azure-mgmt-compute==14.0.0 azure-mgmt-msi==1.0.0 azure-mgmt-network==10.2.0 azure-mgmt-resource==13.0.0

@gramhagen what could cause the conda env in the rayproject/ray-ml:latest-gpu image to be out of sync?

cc @amogkam

Yes, the error is gone after adding these lines to the YAML file. Thank you!

It seems like the versions of the Azure packages in the Docker image are more recent and could be leading to this issue with the typing_extensions package.
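For anyone hitting this later: an alternative, untested sketch would be to upgrade typing_extensions itself in setup_commands so that azure-core’s import succeeds, rather than (or in addition to) pinning the older Azure SDK versions:

```yaml
# Untested alternative: upgrade typing_extensions on every node so that
# `from typing_extensions import ParamSpec` (used by azure-core) succeeds.
setup_commands:
    - pip install -U "typing_extensions>=4"
```

Whether this is safe depends on which other packages in the image pin typing_extensions, so verify the environment still resolves after the upgrade.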