Out-of-memory error when creating embeddings for each PDF

We want to create embeddings for each PDF separately.

`pdf_documents` (a list containing multiple PDF file paths)

```python
for document in pdf_documents:
    # Ray tuning knobs (these are environment variables and only take
    # effect if set before ray.init)
    os.environ["RAY_SCHEDULER_EVENTS"] = "0"
    os.environ["RAY_memory_monitor_refresh_ms"] = "0"
    os.environ["RAY_memory_usage_threshold"] = "0.7"
    file_name = document.split('/')[-1]

    loader = PyPDFLoader(document)
    data = loader.load()
    futures = process_shard.remote(data)
    results = ray.get(futures)
    results.save_local(FAISS_INDEX_PATH / file_name.replace('.pdf', ''))
```

So we ran this loop, calling ray.get(futures) on each iteration, but it gives an out-of-memory error.
I checked and learned that we can't run ray.get() inside the loop, as that defeats the purpose of parallel processing. Is there any way to achieve this?

@Smitraj_Raut One of the anti-patterns for Ray is using ray.get(object) inside a for loop. You may
want to fetch all the results after they have materialized, i.e. call ray.get() outside the for loop.

  1. Anti-pattern: Calling ray.get unnecessarily harms performance — Ray 2.6.1
  2. Anti-pattern: Calling ray.get in a loop harms parallelism — Ray 2.6.1
  3. Use ray.wait and process only the finished tasks: Pattern: Using ray.wait to limit the number of pending tasks — Ray 2.6.1
  4. Fetching too many objects: Anti-pattern: Fetching too many objects at once with ray.get causes failure — Ray 2.6.1
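To make the first two anti-patterns concrete: submit every task first, and fetch results only once everything is in flight. Since Ray may not be installed everywhere, here is the same submit-all-then-gather pattern sketched with the stdlib `concurrent.futures` as a stand-in; `process_shard` here is a dummy placeholder for the real embedding task.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the per-document work (loading a PDF and embedding it);
# in the real code this would be the process_shard Ray task.
def process_shard(data):
    return data.upper()

documents = ["doc_a", "doc_b", "doc_c"]

with ThreadPoolExecutor(max_workers=3) as pool:
    # 1. Submit every task first -- nothing blocks here, so all
    #    documents are processed in parallel.
    futures = [pool.submit(process_shard, d) for d in documents]
    # 2. Gather all results only after everything has been submitted
    #    (the equivalent of a single ray.get(futures) after the loop).
    results = [f.result() for f in futures]

print(results)  # ['DOC_A', 'DOC_B', 'DOC_C']
```

In Ray terms, the loop would only append `process_shard.remote(data)` to `futures`, and `ray.get(futures)` would be called once, after the loop.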

Hope this helps

Thanks for your reply, Jules. We have already gone through the links you shared, but let me clarify my doubt again:

```python
for document in pdf_documents:
    # Ray tuning knobs (these are environment variables and only take
    # effect if set before ray.init)
    os.environ["RAY_SCHEDULER_EVENTS"] = "0"
    os.environ["RAY_memory_monitor_refresh_ms"] = "0"
    os.environ["RAY_memory_usage_threshold"] = "0.7"
    file_name = document.split('/')[-1]

    loader = PyPDFLoader(document)
    data = loader.load()
    futures = process_shard.remote(data)
    results = ray.get(futures)
    results.save_local(FAISS_INDEX_PATH / file_name.replace('.pdf', ''))
```

In the above code we are trying to create multiple DB files inside the FAISS folder, one for each PDF document, e.g. FAISS/abc/abc.pkl, FAISS/123/123.pkl, and so on.

In this case we tried using ray.wait and then passing the returned list elements into ray.get(), but it still gives an out-of-memory error.

Is there a way to run the Ray embedding for multiple files one by one and create a respective DB folder containing the vector (.pkl) files for each file, as mentioned above?
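One way to keep memory bounded while still saving one index per file is the "limit pending tasks" pattern from the docs: cap how many results are in flight, and save and discard each result as soon as it finishes instead of holding them all. A runnable sketch of that idea, using stdlib `concurrent.futures` as a stand-in for `ray.wait`; `process_shard` and `save_index` are dummy placeholders for the real embedding task and the FAISS `save_local` call.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

# Placeholder for the real embedding task (a Ray remote function).
def process_shard(doc):
    return f"index-for-{doc}"

saved = []

# Placeholder for results.save_local(FAISS_INDEX_PATH / ...).
def save_index(result, doc):
    saved.append((doc, result))

documents = [f"file_{i}.pdf" for i in range(8)]
MAX_PENDING = 2  # cap on in-flight results so memory stays bounded

with ThreadPoolExecutor(max_workers=4) as pool:
    pending = {}  # future -> document it belongs to
    for doc in documents:
        if len(pending) >= MAX_PENDING:
            # Block until at least one task finishes (ray.wait(...,
            # num_returns=1) in Ray), save its index, and drop the
            # result so its memory can be reclaimed.
            done, _ = wait(pending, return_when=FIRST_COMPLETED)
            for f in done:
                save_index(f.result(), pending.pop(f))
        pending[pool.submit(process_shard, doc)] = doc
    # Drain whatever is still in flight.
    for f, doc in list(pending.items()):
        save_index(f.result(), doc)
```

Because each result is written to disk and released before more than `MAX_PENDING` tasks are outstanding, peak memory no longer grows with the number of PDFs. In Ray, the same shape uses `ray.wait` on the list of object refs and `ray.get` only on the refs it returns as done.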

@Smitraj_Raut Perhaps an example would help: code that shows how we did something akin to what you are trying to do with LlamaIndex, and how we used Ray to create embeddings in a distributed manner. Let me know if that helps.

cc: @amogkam