Ray submission server + tune

Hi,
Is it possible to send multiple Tune jobs to the new Ray submission server? If so, could a description of this be added to the documentation?

Related to #18851 and #21329.
Best,
Thorsten

cc @matthewdeng can you address the question?

Looks like this was discussed on GitHub (#18851, #21329) and Slack.

Summarizing:

  1. Tune jobs running on the same cluster may lead to undesirable behavior because placement groups are not isolated between jobs. The TUNE_PLACEMENT_GROUP_CLEANUP_DISABLED flag is intended to address this but is not well tested.
  2. Job submission does not provide any particular resource isolation. Jobs submitted to the same Ray cluster will be run in separate subprocesses, but contend for the same cluster resources.
  3. One way to run parallel Tune jobs with placement group isolation is to spin up an individual Ray cluster per Tune job. The jobs can optionally be run using job submission.
  4. Another point to be aware of: if multiple jobs are run with the same local checkpoint directory (I believe this would also apply to cloud checkpointing), the jobs may overwrite each other's data, with undesirable consequences.
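
To make points 3 and 4 concrete, here is a minimal sketch rather than a tested recipe. It assumes one Ray cluster per Tune job, each reachable at its own dashboard address, and a user script called `tune_script.py` that accepts a `--local-dir` argument. The script name, the flag, the addresses, and the helper names `build_job_specs` and `submit_all` are all hypothetical. The import path `ray.job_submission` is the one used by recent Ray releases and may differ in older versions.

```python
from pathlib import PurePosixPath


def build_job_specs(cluster_addresses, script="tune_script.py",
                    base_dir="/tmp/tune_results"):
    """Return one submission spec per cluster, each with its own results directory."""
    specs = []
    for index, address in enumerate(cluster_addresses):
        # A distinct directory per job, so jobs cannot overwrite each other's checkpoints.
        local_dir = PurePosixPath(base_dir) / f"job_{index}"
        specs.append({
            "address": address,
            # --local-dir is a hypothetical flag that tune_script.py would have to parse.
            "entrypoint": f"python {script} --local-dir {local_dir}",
        })
    return specs


def submit_all(specs):
    """Submit each spec to its own cluster through the job submission SDK."""
    # Imported lazily so that building the specs does not require Ray.
    from ray.job_submission import JobSubmissionClient

    job_ids = []
    for spec in specs:
        client = JobSubmissionClient(spec["address"])
        job_ids.append(
            client.submit_job(
                entrypoint=spec["entrypoint"],
                # Uploads the current directory so the script is available on the cluster.
                runtime_env={"working_dir": "./"},
            )
        )
    return job_ids
```

Calling `submit_all(build_job_specs(["http://10.0.0.1:8265", "http://10.0.0.2:8265"]))` would then send one Tune job to each cluster. If the jobs have to share a single cluster instead, the same per-job directory idea still applies, but the placement group caveat from point 1 remains.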

@kai please let me know if I’ve missed anything!