GCS Flushing Implementation in Early Ray Versions

Hi! I’m a student at Stanford. My teammates and I are currently working on a research project for our CS 244 class, where our assignment is to reimplement and recreate the result of a paper. We chose Ray because we really admire it, and are working on a miniaturized “Baby Ray” for us to try to recreate some of the figures in the paper!

We are particularly interested in the GCS flushing feature mentioned in the paper. Specifically, we would love to hear from the authors about how GCS flushing to disk was implemented in the release of Ray that pertains to this paper (2018) Section 5.1 GCS Flushing, which precedes the Ray v1 architecture (2020). Our main questions are:

  1. How was GCS flushing to disk implemented?: Did you use MySQL/SQLite? Did you use an early equivalent of object spilling whereby each individual object was stored as a file to disk? Or something else?
  2. How did the original implementation address cache misses when GCS flushing was implemented?

We noticed that the v1 white paper does not mention flushing, and we are curious about how GCS flushing was handled in the original Ray paper. Additionally, the v1 white paper discusses object spilling, which was introduced in v1.3, but we want to understand how these concepts were managed in the earlier implementations.

Any insights or advice on this matter would be greatly appreciated! Thank you!

Thanks for the question!

One thing to clarify first is that the GCS has only ever been used for object metadata not for data. In general, Ray avoids putting large data that is accessed frequently in the GCS. So the flushing referred to in the 2018 paper is only for object metadata such as its task lineage. Therefore, GCS flushing was never a high-priority feature since we assumed the metadata was relatively small.

Indeed, today’s GCS does not support flushing; metadata can be persisted via Redis but we do not swap. In today’s Ray, we also put even less metadata in the GCS. Per-object metadata like task lineage is stored directly in Python worker processes (see the more recent ownership paper for details).

  1. The early GCS was based on Redis with some custom extensions. I don’t remember the exact details here but I believe we just used Redis snapshotting.
  2. Unfortunately, I doubt that the original implementation addressed this correctly. :melting_face:

If you want to implement something similar for your project, I’d suggest implementing object (data) spilling, which has been a much more impactful feature in Ray. This paper is a good reference.

Thank you very much! This is very helpful!