transform_pyarrow.concat(tables) is very slow

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

I am reading a large TFRecords dataset (about 17 TB), and the logs stay stuck at the following step for a very long time:

(_execute_read_task_split pid=47722) /opt/conda/envs/abc/lib/python3.10/site-packages/ray/data/_internal/arrow_block.py:150: FutureWarning: promote has been superseded by mode='default'.
(_execute_read_task_split pid=47722)   return transform_pyarrow.concat(tables)

My questions are:

  • What does transform_pyarrow.concat(tables) do in this step?
  • How can I speed up this part?