1. Severity of the issue: (select one): MEDIUM
2. Environment:
- Ray version: 2.10
- Python version: 3.11
- OS: Ubuntu 18
- Cloud/Infrastructure: Native K8s on AWS EC2
3. What happened vs. what you expected:
- Expected:
- When running in THP madvise mode, Ray should be able to use hugepages for its large memory regions (plasma object store, Arrow/mmapped buffers) and get the same system CPU and page-fault reductions observed in always mode.
- This would allow us to run Ray jobs safely in multi-tenant environments, keeping the system default (madvise) with no performance penalty.
- Actual:
- With THP in madvise mode, less than 1% of the plasma object store and Arrow/mmapped buffers was backed by huge pages.
4. My ask
Please add MADV_HUGEPAGE advice (i.e., madvise(addr, length, MADV_HUGEPAGE)) to the plasma object store and other large mmap regions in Ray/Arrow; a minimal sketch of the call is shown below.
- This change would make Ray memory regions hugepage-eligible under THP madvise, letting us use production-safe THP settings and still get substantial system CPU and performance wins.
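For illustration, here is a minimal sketch of the kind of hint being requested. This is not Ray's or Arrow's actual allocation code; the region size, flags, and names below are my own assumptions, and the point is only the madvise(MADV_HUGEPAGE) call on a large mmap-backed region.

```cpp
// Hypothetical sketch: mark a large anonymous mmap region (a stand-in for a
// plasma object-store arena) as hugepage-eligible so THP=madvise can back it
// with 2 MiB pages. Region size and error handling are illustrative only.
#include <sys/mman.h>
#include <cstdio>

int main() {
  const size_t kRegionSize = 1ULL << 30;  // 1 GiB, illustrative object-store size

  // Anonymous private mapping, similar in spirit to how an object store
  // reserves its memory region.
  void* region = mmap(nullptr, kRegionSize, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (region == MAP_FAILED) {
    perror("mmap");
    return 1;
  }

  // The requested change: without this hint, THP=madvise leaves the region on
  // 4 KiB pages; with it, khugepaged can collapse the range into huge pages.
  if (madvise(region, kRegionSize, MADV_HUGEPAGE) != 0) {
    perror("madvise(MADV_HUGEPAGE)");
  }

  // ... object allocations would happen inside this region ...

  munmap(region, kRegionSize);
  return 0;
}
```

Applying the same hint where Ray/Arrow create their large mmap regions would make those regions eligible for huge pages under madvise, without requiring THP=always system-wide.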
Supporting results:
- With THP=always: object store system CPU dropped ~50% (e.g. 8 vCPUs to 4 vCPUs), page faults fell from ~2M/sec to ~100K/sec, and AnonHugePages reached 150,000,000 kB (over 150 GB) for plasma.
- With THP=madvise: system CPU dropped only ~5% (P95: 65% to 60%), faults only fell to ~1.5M/sec, and AnonHugePages was 18,432 kB (18 MB), i.e. almost no plasma coverage.
- Root cause: Ray and PyArrow do not call madvise(MADV_HUGEPAGE). Only heap allocations (not the plasma/Arrow mmap regions) benefit from hugepages in madvise mode; a sketch of how this AnonHugePages coverage can be checked follows this list.
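One way to check per-process AnonHugePages coverage is sketched below. It assumes a Linux kernel new enough to expose /proc/&lt;pid&gt;/smaps_rollup, and is offered as a way to reproduce the numbers above rather than the exact tooling originally used.

```cpp
// Sketch: print the AnonHugePages line from /proc/<pid>/smaps_rollup for a
// given pid (or the current process), showing how much of that process's
// anonymous memory is currently backed by transparent huge pages.
#include <fstream>
#include <iostream>
#include <string>

int main(int argc, char** argv) {
  // Default to the current process if no pid is given on the command line.
  std::string pid = (argc > 1) ? argv[1] : "self";
  std::ifstream rollup("/proc/" + pid + "/smaps_rollup");
  if (!rollup) {
    std::cerr << "could not open smaps_rollup for pid " << pid << "\n";
    return 1;
  }
  std::string line;
  while (std::getline(rollup, line)) {
    // Expected format: "AnonHugePages:      18432 kB"
    if (line.rfind("AnonHugePages:", 0) == 0) {
      std::cout << line << "\n";
    }
  }
  return 0;
}
```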
This fix would:
- Let us run Ray (even for 100 GB–200 GB object store jobs) in production using madvise, with large CPU and cost savings, and without risky global THP toggling or complex node isolation.
- Align Ray behavior with other data frameworks and emerging best practices for memory-mapped workloads in Linux.