I think the issue ended up being a race condition caused by issuing remote calls in a loop. The fix was to use ray.get as a synchronization barrier, blocking until the outstanding calls had finished before continuing.
I have a feeling there might be something more fundamental going on that the core team should investigate, but unfortunately I cannot point to anything more specific.