Question on scalability

I’m looking at using Ray as a possible solution to a massive HPC problem, and would welcome feedback and experiences.

My problem involves fragmenting molecules and generating parent-child relationships. For each molecule I need to compute its children (this is relatively fast) and store that information so it can be output at the end of the process. Each child also needs to be processed, but it may already have been processed, so we first need to check for that. The molecules can easily be split into a large number of shards. Conceptually Ray looks well suited to this, maybe with one actor holding the child info for each molecule, and one actor per shard holding the object references to the fragmented molecules; a rough sketch of the per-shard actor idea is below.
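To make the per-shard actor idea concrete, here is a minimal sketch using Ray's actor API. Everything specific to the problem is assumed: `ShardActor`, `shard_of`, the idea of keying molecules by some canonical string, and `NUM_SHARDS` are all hypothetical names, not anything from Ray itself.

```python
import ray

ray.init()

NUM_SHARDS = 1024  # hypothetical shard count


@ray.remote
class ShardActor:
    """Owns the molecules whose keys hash into this shard."""

    def __init__(self):
        # molecule key -> list of child molecule keys
        self.children = {}

    def already_processed(self, key):
        return key in self.children

    def record(self, key, child_keys):
        # Store the parent-child relationship for output at the end.
        self.children.setdefault(key, child_keys)

    def dump(self):
        return self.children


def shard_of(key, shards):
    # Hypothetical routing: hash the molecule key to pick its owning shard.
    return shards[hash(key) % len(shards)]


shards = [ShardActor.remote() for _ in range(NUM_SHARDS)]
```

The point of routing by key is that the deduplication check ("has this child already been processed?") and the parent-child storage both live on the same actor, so the driver never has to hold a global dictionary of billions of molecules.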

I do have access to significant computing power (1,000 CPU cores and several TB of RAM distributed across multiple servers), but I’m nervous about how this would perform at scale. I will end up with tens of billions of distinct molecules and hundreds of billions of total molecules (including duplicates).
Does anyone have experience operating at this sort of scale?

You can think of Ray’s scalability in terms of both hardware and the complexity of the workload. For hardware, 1,000 cores is well within the supported scale: ray/README.md at master · ray-project/ray · GitHub

For the workload, it depends on how you shard the molecules. Having one object per molecule would create billions of objects, which is probably too many. But Ray should work fine with objects numbering in the low millions, given enough memory.
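One way to stay in that range is to batch molecules so that each task returns one object per batch rather than one per molecule. A minimal sketch, with `BATCH_SIZE`, `process_batch`, and the placeholder `fragment_one` all assumed names (the real fragmentation code would go where the placeholder is):

```python
import ray

ray.init()

# Hypothetical: ~10^5 molecules per batch keeps tens of billions of
# molecules in roughly 10^5-10^6 Ray objects instead of billions.
BATCH_SIZE = 100_000


def fragment_one(mol):
    # Placeholder for the real fragmentation chemistry.
    return []


@ray.remote
def process_batch(molecules):
    # One task, and one result object, per batch of molecules.
    return {mol: fragment_one(mol) for mol in molecules}


def batched(seq, size):
    for i in range(0, len(seq), size):
        yield seq[i:i + size]


# Usage sketch: refs holds one ObjectRef per batch, not per molecule.
# refs = [process_batch.remote(list(b)) for b in batched(all_molecules, BATCH_SIZE)]
# results = ray.get(refs)
```

Tuning the batch size is a trade-off between object count (and scheduler overhead) on one side and per-object memory footprint on the other.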