Autoscaling issue on Google cloud

Param_Jeet · July 5, 2021, 5:25am

I am trying to execute the train.py from this github link.

I have to train a custom model on much smaller dataset so I changed tpu_size = 8 in [configs/6B_roto_256.json]
and running the following command in command prompt
python3.8 train.py --tpu gptj6b-tpu1 --tpu_region us-central1-c --config configs/6B_roto_256.json --preemptible --new

When I execute this and gets the following error:
{‘name’: ‘projects/hellogenius-poc/locations/us-central1-c/nodes/gptj6b-tpu1’, ‘acceleratorType’: ‘v2-8’, ‘state’:
‘READY’, ‘runtimeVersion’: ‘v2-alpha’, ‘cidrBlock’: ‘10.128.0.0/20’, ‘createTime’: ‘2021-07-04T04:54:27.658431621Z’
, ‘schedulingConfig’: {‘preemptible’: True}, ‘networkEndpoints’: [{‘ipAddress’: ‘10.128.0.28’, ‘port’: 8470, ‘acces
sConfig’: {‘externalIp’: ‘34.135.201.151’}}], ‘health’: ‘HEALTHY’, ‘id’: ‘4392319268867105180’, ‘networkConfig’: {’
network’: ‘projects/hellogenius-poc/global/networks/default’, ‘subnetwork’: ‘projects/hellogenius-poc/regions/us-ce
ntral1/subnetworks/default’, ‘enableExternalIps’: True}, ‘serviceAccount’: {‘email’: ‘889652764764-compute@develope
r.gserviceaccount.com’, ‘scope’: []}, ‘apiVersion’: ‘V2_ALPHA1’}
2021-07-04 07:28:44,371 WARNING worker.py:1107 – The actor or task with ID ffffffffffffffff13878e716d2e442eb9333d0
501000000 cannot be scheduled right now. It requires {CPU: 1.000000}, {tpu: 1.000000} for placement, however the cl
uster currently cannot provide the requested resources. The required resources may be added as autoscaling takes pl
ace or placement groups are scheduled. Otherwise, consider reducing the resource requirements of the task.
(pid=14384, ip=10.128.0.28) 2021-07-04 07:28:44.753523: F external/org_tensorflow/tensorflow/core/tpu/tpu_executor_
init_fns.inc:110] TpuTransferManager_ReadDynamicShapes not available in this library.
Traceback (most recent call last):
File “train.py”, line 75, in
t = build_model(params, tpu_name, region, preemptible, version=args.version)
File “/home/param_jeet/content-intelligence/Mesh-Transformer/mesh_transformer/build_model.py”, line 58, in build_
model
t = TPUCluster((tpu_size // cores_per_replica, cores_per_replica), len(conns), model_fn)
File “/home/param_jeet/.local/lib/python3.8/site-packages/func_timeout/dafunc.py”, line 185, in
return wraps(func)(lambda *args, **kwargs : func_timeout(defaultTimeout, func, args=args, kwargs=kwargs))
File “/home/param_jeet/.local/lib/python3.8/site-packages/func_timeout/dafunc.py”, line 108, in func_timeout
raise_exception(exception)
File “/home/param_jeet/.local/lib/python3.8/site-packages/func_timeout/py3_raise.py”, line 7, in raise_exception
2021-07-04 07:28:44,930 WARNING worker.py:1107 – A worker died or was killed while executing task ffffffffffffffff
13878e716d2e442eb9333d0501000000.
raise exception[0] from None
File “/home/param_jeet/content-intelligence/Mesh-Transformer/mesh_transformer/TPU_cluster.py”, line 39, in __init
__
self.param_count = ray.get(params)[0]
File “/home/param_jeet/.local/lib/python3.8/site-packages/ray/_private/client_mode_hook.py”, line 47, in wrapper
return func(args, **kwargs)
File “/home/param_jeet/.local/lib/python3.8/site-packages/ray/worker.py”, line 1458, in get
raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task. Check python-core-worker-.lo
g files for more information.

Topic		Replies	Views
Autoscaling RayServe Pods in k8s keeps terminating and restarting pods Ray Serve	4	729	November 20, 2023
Autoscaler scales cluster up and down all the time RLlib	6	458	May 12, 2021
Tune not autoscaling on Kubernetes Ray Tune	2	454	February 22, 2021
Cuda Error: invalid device ordinal during training on GCP cluster	0	199	September 11, 2024
GCP Cluster Worker Nodes fail to Initialize Ray Clusters	5	509	October 10, 2024

Autoscaling issue on Google cloud

Related topics