How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
Setting: I have created a hierarchical multi-agent environment based on the Python MultiAgentEnv class. The environment uses GPU resources to run complex calculations implemented in a C++ library.
What is working: When I run the env on its own with simulated action dicts, it runs fine.
Problem: When I train with Tune, the C++ library no longer detects my CUDA device.
Question: How does Ray RLlib/Tune affect GPU availability when a cluster is started? Would using an ExternalEnv solve the problem?
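For anyone debugging the same thing, a minimal check like this (just a sketch, meant to be dropped into the env's `__init__`, which runs inside the RLlib rollout-worker process) shows what Ray actually exposes to that process:

```python
import os
import ray

# Add inside the env's __init__. CUDA_VISIBLE_DEVICES is rewritten per worker
# by Ray based on the resources that worker requested, so it can differ from
# what you see in the driver process.
print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES"))
# ray.get_gpu_ids() only works inside a Ray worker; it lists the GPU ids
# Ray has assigned to this worker.
print("GPU ids assigned by Ray:", ray.get_gpu_ids())
```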
I'm having a similar issue; I opened a question too. It seems that RLlib sets CUDA_VISIBLE_DEVICES = 0 when the env is initialized. This makes sense, as only the requested resources are exposed to the worker.
However (it sounds like this is your case, too), there seems to be a problem when specifying fractional resources, which should be the natural solution. Curious to hear if you've resolved this!
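For concreteness, this is the kind of fractional request I mean (classic dict-style RLlib config; the algorithm, env name, and fractions are illustrative placeholders, not a verified fix):

```python
from ray import tune

tune.run(
    "PPO",
    config={
        "env": "my_hierarchical_env",  # placeholder for your registered env
        # Fraction of a GPU reserved for the trainer process.
        "num_gpus": 0.5,
        # Fraction of a GPU per rollout worker, so the env process that
        # constructs the C++ library still has a visible CUDA device.
        "num_gpus_per_worker": 0.5,
    },
)
```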
@Aidan_McLaughlin I'm not completely sure what solved the issue in the end, but with the newer Ray versions I didn't have any issues with this. We had a ray tracer built with NVIDIA OptiX in our env. We have since moved to a different solution, but our env still uses the GPU. We are currently on 3.0.0.dev0 and everything works fine. What version are you using?