For a bit of broad context: we store all of our data in Parquet, and users want to be able to access it quickly (several million rows returned in under 10 s). It’s not really a big data problem. We’re experimenting with Ray as our data access layer: it reads in and transforms partitions, then returns the result to the user.
We’ve set up an ECS cluster that runs Ray (the default Ray cluster won’t quite serve our needs, for a few reasons). We’ve had good success with Ray so far, but we want to test what this will look like under more realistic conditions (hundreds of users requesting data simultaneously). So we’ve created a job that opens connections via the client and executes jobs.
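To make the shape of that load test concrete, here is a minimal sketch of what such a job could look like. It is not our actual code: `fetch_rows` is a hypothetical placeholder for one user's request through the client (here it only sleeps and returns a made-up row count), and `run_load_test`, `num_users`, and `max_workers` are names chosen purely for illustration. It uses only the Python standard library.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_rows(user_id: int) -> int:
    """Hypothetical placeholder for one user's request through the client.

    A real job would connect to the cluster and run the query here; this
    stub only sleeps briefly and returns a made-up row count.
    """
    time.sleep(0.01)
    return 1_000


def timed_request(user_id: int) -> tuple[int, float]:
    """Run one request and return (row_count, latency_in_seconds)."""
    start = time.perf_counter()
    rows = fetch_rows(user_id)
    return rows, time.perf_counter() - start


def run_load_test(num_users: int, max_workers: int) -> dict:
    """Issue num_users requests concurrently and summarize their latency."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(timed_request, range(num_users)))
    latencies = sorted(latency for _, latency in results)
    return {
        "requests": len(results),
        "total_rows": sum(rows for rows, _ in results),
        "p50_s": statistics.median(latencies),
        # Simple index-based approximation of the 95th percentile.
        "p95_s": latencies[int(0.95 * (len(latencies) - 1))],
        "max_s": latencies[-1],
    }


if __name__ == "__main__":
    print(run_load_test(num_users=200, max_workers=50))
```

Recording per-request latency (rather than only total wall time) is what lets us see whether the slowdown grows with the number of simultaneous users.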
Slightly unrelated, but the bottleneck we’re now seeing is at the Ray head. It seems slow to return results when working at scale. We’re quite sure that this isn’t a matter of resource constraints. Are you aware of any parameters or configurations that might be worth exploring?