How does Ray/Anyscale handle indexing for search?

Hi. I am new to the product and have only watched a few demo videos. None of them go into much detail about the indexing part. Say I've trained an ML model and run batch processing over 1 million images. To search among those 1 million images, I need to index them, right? Which part of Ray/Anyscale does this? Is there any specific documentation or are there examples? (I only see a finetuned_bert_faiss_index library used in the demo.)

Thanks!

In the demo, we just use an off-the-shelf library (faiss) to handle the indexing and searching. Both Ray and Anyscale are general purpose, and Ray Serve in particular can run arbitrary Python code, so we simply call libraries for that. The code for indexing and searching is shown below.

import json

import faiss
import numpy as np


class CoverColorIndex:
    def __init__(self, color_cursor):
        self.index = faiss.IndexIDMap(faiss.IndexFlatL2(3 * 6))

        # Query all the cover image palette
        self.id_to_arr = {
            row[0]: np.array(json.loads(row[1])).flatten()
            for row in color_cursor
        }

        # Build the index
        arr = np.stack(list(self.id_to_arr.values())).astype('float32')
        # faiss expects 64-bit integer ids, so make the dtype explicit
        ids = np.array(list(self.id_to_arr.keys())).astype('int64')
        self.index.add_with_ids(arr, ids)

    def search(self, request):
        liked_id = request.args["liked_id"]
        num_returns = int(request.args.get("count", 20))
        movies_shown = set(request.args.get("movies_shown", [liked_id]))

        # Perform nearest neighbor search
        source_color = self.id_to_arr[liked_id]
        source_color = np.expand_dims(source_color, 0).astype('float32')
        scores, ids = self.index.search(source_color,
                                        num_returns + len(movies_shown))
        neighbors = ids.flatten().tolist()[1:]

        return [str(n) for n in neighbors if str(n) not in movies_shown]
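
To make it clearer what the flat index above is doing, here is a minimal pure-Python sketch of an exact (brute-force) L2 search keyed by caller-supplied ids. The class name FlatL2IdIndex and the toy vectors are hypothetical and only for illustration; this is not the faiss implementation, just the same idea of comparing the query against every stored vector and returning the k smallest squared distances.

```python
import heapq


class FlatL2IdIndex:
    """Hypothetical stand-in for a flat L2 index with an id map."""

    def __init__(self, dim):
        self.dim = dim
        self.ids = []
        self.vectors = []

    def add_with_ids(self, vectors, ids):
        # Store every vector as-is; a flat index does no clustering or compression.
        for vec, vec_id in zip(vectors, ids):
            if len(vec) != self.dim:
                raise ValueError("expected a vector of length %d" % self.dim)
            self.vectors.append(tuple(float(x) for x in vec))
            self.ids.append(vec_id)

    def search(self, query, k):
        # Compare the query against every stored vector (squared L2 distance)
        # and keep the k closest as (distance, id) pairs, nearest first.
        scored = (
            (sum((a - b) ** 2 for a, b in zip(vec, query)), vec_id)
            for vec, vec_id in zip(self.vectors, self.ids)
        )
        return heapq.nsmallest(k, scored)


index = FlatL2IdIndex(dim=3)
index.add_with_ids(
    [(0, 0, 0), (1, 0, 0), (0, 3, 0), (5, 5, 5)],
    [10, 11, 12, 13],
)

# Query with a vector that is already in the index, then drop the first hit
# (the query itself), mirroring the [1:] slice in the demo code.
hits = index.search((0, 0, 0), k=3)
neighbor_ids = [vec_id for _, vec_id in hits][1:]
```

Because every query scans all stored vectors, cost grows linearly with the number of items. That is fine for a demo-sized catalog; for something like 1 million image embeddings you would probably want to look at faiss's approximate index types instead.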

@simon-mo thanks!
A follow-up question: does that mean the faiss index will have access to the memory of all the machines in the cluster as well (if I use Anyscale)?

Also, is there anywhere I can access the code for the examples in the demo?

In Ray, you can call ray.put to place an object into its shared-memory object store, and it will then be available across all machines via ray.get. This is a Ray feature, not an Anyscale one.

This repo, https://github.com/anyscale/ray-summit-demo-2020, is the only one we are open-sourcing right now :smiley: