How severely does this issue affect your experience of using Ray?
- Medium: It contributes to significant difficulty to complete my task, but I can work around it.
I am currently using xgboost_ray to train models. The memory usage shown on the dashboard is confusing to me. For example, the total memory used on the node is around 54.91GB. However, when you expand the node, the per-process breakdown adds up to less than 3GB. I logged on to the server and checked the memory usage of the Ray processes, and 3GB seems more reasonable. So where does the 54.91GB come from?
I am currently using Ray 2.9.3.
It might be due to the object store memory. Has the object store memory been used?
Yes, the object store memory was used while jobs were running. It was around 10GB per node, which is still way below 54.91GB. Also, below is a snapshot with no job running. There is almost no object store memory being used, but memory usage is still 51.91GB. This is consistent with the previous snapshot, which showed ~3GB for running jobs and 54.91GB in total.
As the tooltip says, once the object store has been used, it will hold on to the memory. Even though the current object store memory usage is almost 0, it is still holding the memory.
If you check the memory usage on this host, can you see the shared memory?
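One way to check this on a Linux host is to look at how full `/dev/shm` is, since that is where Ray's object store normally lives on Linux. The snippet below is only a minimal sketch using the standard library; the helper name `shm_usage_gib` is made up for illustration, and it assumes the object store has not been redirected to another directory.

```python
import os
import shutil


def shm_usage_gib(path="/dev/shm"):
    """Return (used_gib, total_gib) for the filesystem at `path`, or None if it is absent."""
    if not os.path.isdir(path):
        return None
    usage = shutil.disk_usage(path)
    gib = 1024 ** 3
    return usage.used / gib, usage.total / gib


if __name__ == "__main__":
    result = shm_usage_gib()
    if result is None:
        print("/dev/shm not found on this host")
    else:
        used, total = result
        print(f"/dev/shm: {used:.2f} GiB used of {total:.2f} GiB")
```

If the used figure is close to the gap between the dashboard total and the per-process breakdown, that would suggest the gap is shared memory held by the object store.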
The snapshot shows the memory usage on the host. When you say the shared memory, do you mean all SHR of processes related to raylet and python3.9…?
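For the per-process view, the SHR column in `top` corresponds to the shared field in `/proc/<pid>/statm`. Here is a rough sketch that lists resident and shared memory for processes whose name matches a few keywords. The function names and the keyword list are assumptions for illustration, not anything Ray provides, and it only works on Linux. Note that shared pages are counted once per process that maps them, so summing SHR across processes can overstate the real usage.

```python
import os


def parse_statm(text, page_size):
    """Convert a /proc/<pid>/statm line into resident and shared byte counts."""
    fields = text.split()
    # statm fields are in pages: size, resident, shared, text, lib, data, dt
    return {"rss": int(fields[1]) * page_size, "shared": int(fields[2]) * page_size}


def ray_process_memory(keywords=("raylet", "ray::", "python")):
    """Yield (pid, name, rss_bytes, shared_bytes) for matching processes (Linux only)."""
    if not os.path.isdir("/proc"):
        return
    page_size = os.sysconf("SC_PAGE_SIZE")
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/comm") as f:
                name = f.read().strip()
            with open(f"/proc/{pid}/statm") as f:
                mem = parse_statm(f.read(), page_size)
        except (OSError, ValueError, IndexError):
            continue  # process exited, is unreadable, or has no memory info
        if any(k in name for k in keywords):
            yield int(pid), name, mem["rss"], mem["shared"]


if __name__ == "__main__":
    mib = 1024 ** 2
    for pid, name, rss, shared in ray_process_memory():
        print(f"{pid:>8} {name:<16} RSS={rss / mib:.1f} MiB SHR={shared / mib:.1f} MiB")
```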
At this moment, there is roughly 24 GB occupied showing on the dashboard.
A follow-up question: how do I free that memory, since it is not used by any jobs?
cc @sangcho for how to free them up
If all jobs are done, why does Ray keep holding the object store memory? It does not seem very memory efficient. Are there any configurations to avoid this pattern?
I checked with other Ray engineers. I think this is possible, but it's a very hard memory recycling problem, especially when non-contiguous regions of the memory space are still occupied by some objects.
Unfortunately, I don’t think this is something that can be fixed in the short term.
So what's the recycling policy for Ray? Would the object store memory be kept forever?
I believe that once the memory has been used, it stays reserved for the object store until the node is shut down.
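Since the memory is not handed back until the node stops, one thing that may help is capping the object store size up front, so the amount that can end up reserved is bounded. The sketch below uses the `object_store_memory` argument of `ray.init` (the CLI equivalent is `ray start --object-store-memory`), which takes a size in bytes. The helper names and the 10 GiB figure are only examples, and this limits the reservation rather than releasing memory that is already held.

```python
def object_store_cap_bytes(gib):
    """Convert a size in GiB to the byte count expected by object_store_memory."""
    return int(gib * 1024 ** 3)


def start_ray_with_cap(gib):
    """Start a local Ray instance with a bounded object store (requires ray installed)."""
    import ray  # imported here so the helper above works without Ray

    return ray.init(object_store_memory=object_store_cap_bytes(gib))


if __name__ == "__main__":
    # Example value only: roughly the per-node object store usage reported earlier.
    print(object_store_cap_bytes(10))
```

On a multi-node cluster the cap would need to be set when each node is started, not only in the driver.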