Refacto/file #2544

Open
wants to merge 4 commits into
base: main

Conversation

AmineDiro

Description

Hey,

Here's a breakdown of what I've done:

  • Reducing the number of open file descriptors and the memory footprint: Previously, for each uploaded file, we opened a new NamedTemporaryFile to write the content already read from Supabase. Because we depend on the langchain loader classes, we can't use memory buffers for the loaders. With this change, we open a single temporary file per process_file_and_notify call, which cuts down on file opens, read syscalls, and memory buffer usage. The previous behavior could cause stability issues when ingesting and processing large volumes of documents. Some code paths still reopen the temporary file, but this can be improved in later work.
  • Removing the UploadFile class from File: UploadFile (a FastAPI abstraction over a SpooledTemporaryFile for multipart uploads) was redundant in our File setup, since we already download the file from remote storage, read it into memory, and write it to a temp file. Removing this abstraction simplifies the code.
  • async function adjustments: I've removed the async keyword from functions that weren't truly asynchronous. For instance, calling filter_file to process files isn't genuinely async, because async file reading isn't actually asynchronous: it uses a threadpool to read the file. Since we already rely on celery for parallelism (one worker per core), reading and processing should happen in the same thread, or at least thread spawning should be minimized. Additionally, the rest of the code isn't inherently asynchronous, and our bottleneck is CPU-bound work, which async doesn't help with.
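
The single-temp-file idea from the first point can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `process_with_single_tempfile`, `count_bytes`, `first_line`, and `remember_path` are hypothetical names, and the processing steps stand in for path-based loaders.

```python
import os
import tempfile
from typing import Callable, List


def process_with_single_tempfile(
    content: bytes, suffix: str, steps: List[Callable[[str], object]]
) -> list:
    """Write the downloaded bytes to ONE temp file and pass its path to every step."""
    # delete=False so that steps can reopen the file by path after we close it;
    # we remove it ourselves in the finally block.
    tmp = tempfile.NamedTemporaryFile(suffix=suffix, delete=False)
    try:
        tmp.write(content)
        tmp.close()  # flushes the data to disk
        return [step(tmp.name) for step in steps]
    finally:
        tmp.close()  # harmless if already closed
        os.remove(tmp.name)


# Hypothetical steps standing in for path-based loaders.
def count_bytes(path: str) -> int:
    return os.path.getsize(path)


def first_line(path: str) -> bytes:
    with open(path, "rb") as f:
        return f.readline()


def remember_path(path: str) -> str:
    return path


results = process_with_single_tempfile(
    b"hello\nworld\n", ".txt", [count_bytes, first_line, remember_path]
)
```

Each step still reopens the file by path, as noted above, but only one temporary file is created and cleaned up per call.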

These changes aim to improve performance and streamline our codebase.
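
As a rough illustration of the async point, assuming a CPU-bound step such as hashing (`process_document_async` and `process_document` are hypothetical names, not functions from this codebase): a coroutine that never awaits anything runs synchronously anyway, so the plain function gives the same result without an event loop.

```python
import asyncio
import hashlib


async def process_document_async(content: bytes) -> str:
    # Declared async, but there is no await inside: the body runs
    # start to finish on the calling thread, like a normal function.
    return hashlib.sha256(content).hexdigest()


def process_document(content: bytes) -> str:
    # Same CPU-bound work as a plain function; callable directly
    # from a worker without creating an event loop.
    return hashlib.sha256(content).hexdigest()


data = b"example document"
via_event_loop = asyncio.run(process_document_async(data))
direct = process_document(data)
```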
Let me know if you have any questions or suggestions for further improvements!

Checklist before requesting a review

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have ideally added tests that prove my fix is effective or that my feature works

@dosubot dosubot bot added the size:L This PR changes 100-499 lines, ignoring generated files. label May 4, 2024

vercel bot commented May 4, 2024

Someone is attempting to deploy a commit to the Quivr-app Team on Vercel.

A member of the Team first needs to authorize it.

@dosubot dosubot bot added the area: backend Related to backend functionality or under the /backend directory label May 4, 2024
@StanGirard
Collaborator

Thanks a lot! I'll review it and let you know if there is anything.

@StanGirard
Collaborator

Thanks a lot! It works great except for when you upload URLs ;) I'll fix that

Labels
area: backend Related to backend functionality or under the /backend directory size:L This PR changes 100-499 lines, ignoring generated files.

2 participants