Request for Ray to support hugepages in its memory regions (e.g. plasma object store) by using MADV_HUGEPAGE

1. Severity of the issue: (select one): MEDIUM

2. Environment:

  • Ray version: 2.10
  • Python version: 3.11
  • OS: Ubuntu 18
  • Cloud/Infrastructure: Native K8s on AWS EC2

3. What happened vs. what you expected:

  • Expected:
    • When running with THP in madvise mode, Ray should be able to use hugepages for its large memory regions (plasma object store, Arrow/mmapped buffers) and get the same vCPU and page-fault benefits we see with THP=always.
    • This would allow us to run Ray jobs safely in multi-tenant environments using the system default (madvise) with no performance penalty.
  • Actual:
    • With THP in madvise mode, less than 1% of the plasma object store and Arrow/mmapped buffers was backed by huge pages.

4. My ask

Please have Ray/Arrow call madvise(..., MADV_HUGEPAGE) on the plasma object store and other large mmap regions (a minimal sketch follows below).

  • This change would make Ray's memory regions hugepage-eligible under THP madvise, letting us keep production-safe THP settings while still getting the large system-CPU and page-fault reductions shown in the results below.
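For illustration, here is a minimal sketch of the kind of change being requested, assuming a plain anonymous mapping. The function name `AllocatePlasmaRegion` is hypothetical and not Ray's actual allocation path; the real plasma store mmaps its store file, so the exact call site and flags would differ, but the madvise call is the same.

```cpp
// Hypothetical sketch (not Ray's actual code): mark the large mapping that
// backs the object store as hugepage-eligible so THP=madvise will use it.
#include <sys/mman.h>
#include <cstddef>
#include <cstdio>

void* AllocatePlasmaRegion(size_t size) {
  // Large anonymous mapping, standing in for the plasma store's mmap.
  void* addr = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, /*fd=*/-1, /*offset=*/0);
  if (addr == MAP_FAILED) {
    perror("mmap");
    return nullptr;
  }
#ifdef MADV_HUGEPAGE
  // Advisory only: a no-op when THP is disabled and harmless under
  // THP=always, so it could be added unconditionally or behind a config flag.
  if (madvise(addr, size, MADV_HUGEPAGE) != 0) {
    perror("madvise(MADV_HUGEPAGE)");  // non-fatal: fall back to 4 KiB pages
  }
#endif
  return addr;
}
```

Because MADV_HUGEPAGE is purely advisory, gating it behind an object-store config option (defaulting to on) would keep current behavior available for anyone who needs it.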

Supporting results:

  • With THP=always: object store system CPU dropped ~50% (e.g. 8 vCPUs to 4 vCPUs), page faults fell from ~2M/sec to ~100K/sec, and AnonHugePages reached 150,000,000 kB (over 150 GB) for plasma.

  • With THP=madvise: system CPU dropped only ~5% (P95: 65% to 60%), page faults only fell to ~1.5M/sec, and AnonHugePages was just 18,432 kB (18 MB), i.e. almost no plasma coverage.

  • Root cause: Ray and PyArrow never call madvise(MADV_HUGEPAGE) on their mappings, so in madvise mode only heap allocations (not the plasma/Arrow mmap regions) end up backed by hugepages.
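For reference, the AnonHugePages figures above can be gathered by summing the AnonHugePages fields in /proc/&lt;pid&gt;/smaps for the object store process. The helper below is illustrative, not part of Ray; where available, /proc/&lt;pid&gt;/smaps_rollup reports the same total in a single line.

```cpp
// Sum the "AnonHugePages:" entries in /proc/<pid>/smaps (values are in kB).
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

long long AnonHugePagesKb(int pid) {
  std::ifstream smaps("/proc/" + std::to_string(pid) + "/smaps");
  std::string line;
  long long total_kb = 0;
  while (std::getline(smaps, line)) {
    if (line.rfind("AnonHugePages:", 0) == 0) {    // line starts with the key
      std::istringstream fields(line.substr(14));  // skip "AnonHugePages:"
      long long kb = 0;
      fields >> kb;
      total_kb += kb;
    }
  }
  return total_kb;
}

int main(int argc, char** argv) {
  if (argc < 2) {
    std::cerr << "usage: anon_huge <pid>\n";
    return 1;
  }
  std::cout << "AnonHugePages total: " << AnonHugePagesKb(std::stoi(argv[1]))
            << " kB\n";
  return 0;
}
```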

This fix would:

  • Let us run Ray (even for 100GB–200GB object store jobs) in production using madvise, with large CPU and cost savings, and without risky global THP toggling or complex node isolation.

  • Align Ray behavior with other data frameworks and emerging best practices for memory-mapped workloads in Linux.