We want to create embeddings for each PDF separately.
```python
# pdf_documents is a list of paths to multiple PDF files
for document in pdf_documents:
    # NOTE: assigning these names here only creates local Python variables;
    # to take effect they must be set as environment variables (os.environ)
    # before ray.init(), not inside the loop
    RAY_SCHEDULER_EVENTS = 0
    RAY_memory_monitor_refresh_ms = 0
    RAY_memory_usage_threshold = 0.7

    file_name = document.split('/')[-1]
    loader = PyPDFLoader(document)
    data = loader.load()
    futures = process_shard.remote(data)
    results = ray.get(futures)  # blocks on every iteration
    results.save_local(FAISS_INDEX_PATH / file_name.replace('.pdf', ''))
```
So we tried to run a loop and call ray.get(futures), but it raises an out-of-memory error.
I checked and learned that we shouldn't call ray.get() inside the loop, since that defeats the purpose of parallel processing. Is there any way to achieve this?
Thanks for your reply, Jules. We have already gone through the links that you shared, but let me clarify my doubt again:
```python
for document in pdf_documents:
    RAY_SCHEDULER_EVENTS = 0
    RAY_memory_monitor_refresh_ms = 0
    RAY_memory_usage_threshold = 0.7

    file_name = document.split('/')[-1]
    loader = PyPDFLoader(document)
    data = loader.load()
    futures = process_shard.remote(data)
    results = ray.get(futures)
    results.save_local(FAISS_INDEX_PATH / file_name.replace('.pdf', ''))
```
In the above code we are trying to create multiple DB files inside the FAISS folder, one for each PDF document, e.g. FAISS/abc/abc.pkl, FAISS/123/123.pkl, and so on.
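For what it's worth, the per-document output directory from the snippet can be derived a little more robustly with pathlib instead of split/replace. A small sketch, assuming FAISS_INDEX_PATH is a pathlib.Path as in the code above (the sample `document` path is made up for illustration):

```python
from pathlib import Path

FAISS_INDEX_PATH = Path("FAISS")
document = "/data/pdfs/abc.pdf"  # hypothetical input path

stem = Path(document).stem         # "abc" (file name without extension)
out_dir = FAISS_INDEX_PATH / stem  # FAISS/abc
# index.save_local(out_dir) would then write the .pkl (and .faiss) files there
```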
In this case we tried using ray.wait and then passing the returned list elements into ray.get(), but it still gives an out-of-memory error.
Is there a way to run the Ray embedding for multiple files one by one and create a respective DB folder containing the vector (.pkl) files for each file, as mentioned above?
@Smitraj_Raut Perhaps an example that shows how we did something akin to what you're trying to do with LlamaIndex, and how we use Ray to create embeddings in a distributed manner, would help. Let me know if that helps.
cc: @amogkam