I am loading data from a GCS bucket path that contains a large number of Parquet files. I can see thousands of tasks called “_map_task” in the PENDING_NODE_ASSIGNMENT state and no tasks in the RUNNING state, yet most of the workers’ CPU cores are more than 80% busy.
My code looks like this:
import gcsfs
import ray
data = ray.data.read_parquet("path", filesystem=gcsfs.GCSFileSystem()) # there are thousands of parquet files in the path
data = data.select_columns(cols=[...])
data.take(10)
How do I find out what the cluster is doing?