Out-of-memory error when creating embeddings for each PDF

We want to create embeddings for each PDF separately.

`pdf_documents` (a list containing multiple PDF file paths)

```python
for document in pdf_documents:
    # Ray tuning knobs (these are environment variables and only take
    # effect if set before ray.init)
    os.environ["RAY_SCHEDULER_EVENTS"] = "0"
    os.environ["RAY_memory_monitor_refresh_ms"] = "0"
    os.environ["RAY_memory_usage_threshold"] = "0.7"
    file_name = document.split('/')[-1]

    loader = PyPDFLoader(document)
    data = loader.load()
    futures = process_shard.remote(data)
    results = ray.get(futures)
    results.save_local(FAISS_INDEX_PATH / file_name.replace('.pdf', ''))
```

So we ran this loop, calling ray.get(futures) on each iteration, but it gives an out-of-memory error.
I checked and learned that we can't run ray.get() inside the loop, as that defeats the purpose of parallel processing. Is there any way to achieve this?

@Smitraj_Raut One of the anti-patterns for Ray is using ray.get(object) inside a for loop. You may
want to fetch all the results after they have materialized, i.e. call ray.get() outside the for loop.

  1. Anti-pattern: Calling ray.get unnecessarily harms performance — Ray 2.6.1
  2. Anti-pattern: Calling ray.get in a loop harms parallelism — Ray 2.6.1
  3. Use ray.wait and process only the finished tasks: Pattern: Using ray.wait to limit the number of pending tasks — Ray 2.6.1
  4. Fetching too many objects: Anti-pattern: Fetching too many objects at once with ray.get causes failure — Ray 2.6.1
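To make the first two anti-patterns concrete: submit every task first, and fetch results only once everything is in flight. Since Ray may not be installed everywhere, here is the same submit-all-then-gather pattern sketched with the stdlib `concurrent.futures` as a stand-in; `process_shard` here is a dummy placeholder for the real embedding task.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the per-document work (loading a PDF and embedding it);
# in the real code this would be the process_shard Ray task.
def process_shard(data):
    return data.upper()

documents = ["doc_a", "doc_b", "doc_c"]

with ThreadPoolExecutor(max_workers=3) as pool:
    # 1. Submit every task first -- nothing blocks here, so all
    #    documents are processed in parallel.
    futures = [pool.submit(process_shard, d) for d in documents]
    # 2. Gather all results only after everything has been submitted
    #    (the equivalent of a single ray.get(futures) after the loop).
    results = [f.result() for f in futures]

print(results)  # ['DOC_A', 'DOC_B', 'DOC_C']
```

In Ray terms, the loop would only append `process_shard.remote(data)` to `futures`, and `ray.get(futures)` would be called once, after the loop.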

Hope this helps

Thanks for your reply, Jules. We have already gone through the links you shared, but let me clarify my doubt again:

```python
for document in pdf_documents:
    # Ray tuning knobs (these are environment variables and only take
    # effect if set before ray.init)
    os.environ["RAY_SCHEDULER_EVENTS"] = "0"
    os.environ["RAY_memory_monitor_refresh_ms"] = "0"
    os.environ["RAY_memory_usage_threshold"] = "0.7"
    file_name = document.split('/')[-1]

    loader = PyPDFLoader(document)
    data = loader.load()
    futures = process_shard.remote(data)
    results = ray.get(futures)
    results.save_local(FAISS_INDEX_PATH / file_name.replace('.pdf', ''))
```

In the above code we are trying to create multiple DB files inside the FAISS folder, one for each PDF document, e.g. FAISS/abc/abc.pkl, FAISS/123/123.pkl, and so on.

In this case we tried using ray.wait and then passing the returned list elements into ray.get(), but it still gives an out-of-memory error.

Is there a way to run the Ray embedding for multiple files one by one and create a respective DB folder containing the vector (.pkl) files for each file, as mentioned above?
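One way to keep memory bounded while still saving one index per file is the "limit pending tasks" pattern from the docs: cap how many results are in flight, and save and discard each result as soon as it finishes instead of holding them all. A runnable sketch of that idea, using stdlib `concurrent.futures` as a stand-in for `ray.wait`; `process_shard` and `save_index` are dummy placeholders for the real embedding task and the FAISS `save_local` call.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

# Placeholder for the real embedding task (a Ray remote function).
def process_shard(doc):
    return f"index-for-{doc}"

saved = []

# Placeholder for results.save_local(FAISS_INDEX_PATH / ...).
def save_index(result, doc):
    saved.append((doc, result))

documents = [f"file_{i}.pdf" for i in range(8)]
MAX_PENDING = 2  # cap on in-flight results so memory stays bounded

with ThreadPoolExecutor(max_workers=4) as pool:
    pending = {}  # future -> document it belongs to
    for doc in documents:
        if len(pending) >= MAX_PENDING:
            # Block until at least one task finishes (ray.wait(...,
            # num_returns=1) in Ray), save its index, and drop the
            # result so its memory can be reclaimed.
            done, _ = wait(pending, return_when=FIRST_COMPLETED)
            for f in done:
                save_index(f.result(), pending.pop(f))
        pending[pool.submit(process_shard, doc)] = doc
    # Drain whatever is still in flight.
    for f, doc in list(pending.items()):
        save_index(f.result(), doc)
```

Because each result is written to disk and released before more than `MAX_PENDING` tasks are outstanding, peak memory no longer grows with the number of PDFs. In Ray, the same shape uses `ray.wait` on the list of object refs and `ray.get` only on the refs it returns as done.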

@Smitraj_Raut Perhaps an example would help: code that shows how we did something akin to what you are trying to do with LlamaIndex, and how we used Ray to create embeddings in a distributed manner. Let me know if that helps.

cc: @amogkam