Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generic Multimodal Support #1021

Merged
merged 15 commits into from
May 15, 2024

Conversation

Saghen
Copy link
Collaborator

@Saghen Saghen commented Apr 16, 2024

Adds support for multimodal with Anthropic by increasing the maximum file size, adjusting the message.files type to support mime and removing the assumptions around TGI.

  • Changed from base64 or hash string[] to { type: 'hash' | 'base64', value: string, mime: string }
  • Moved TGI specific image resizing and markdown ![](base64) prompting to TGI endpoint code
  • Changed maximum file size from 2mb -> 10mb
    • Thinking of reducing this and adding back the in-browser image resizing
    • Likely should be configurable

I'd like to move the file upload logic out of the UI code and begin uploading immediately upon selecting a file, but that's outside the scope of this PR. However, that should allow for processing files earlier, which could be particularly useful for non-images (i.e. making embeddings for PDFs).

  • Test the TGI endpoints
  • Ensure clients receive a useful error message when their files are incompatible (with respect to mime types)

image

@flexchar
Copy link
Contributor

This is amazing. When this is merged, please ping me. I would like to adapt it for OpenAI + Gemini 1.5 Pro. ✌️

@Saghen Saghen marked this pull request as ready for review April 24, 2024 20:53
@Ichigo3766
Copy link

Ichigo3766 commented Apr 26, 2024

This is amazing! Would be nice to extend this to openai api as well if possible.

@nsarrazin nsarrazin self-requested a review April 29, 2024 14:04
@Extremys
Copy link

Extremys commented May 1, 2024

Yes amazing! It would be so great to have also OpenAI-like API compatibility, so many Open sources multimodal models are available like Idefics2, Llava, llama-3-vision, ... :)

@nsarrazin
Copy link
Collaborator

Hey @Saghen, PR looking great from my local testing!

We changed a few things last week since we switched our docker image to a new build process. That probably introduced some conflicts but I don't mind fixing them for you since I created them 😅 If you're ok giving me write-access on the PR then I can just do the merge commit directly.

@Saghen
Copy link
Collaborator Author

Saghen commented May 5, 2024

@nsarrazin that'd be great, thanks! granted you permission

Copy link
Collaborator

@nsarrazin nsarrazin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks pretty good! Left some comments, let me know what you think.

src/routes/conversation/[id]/+page.svelte Show resolved Hide resolved
Comment on lines 77 to 83

const module = await import("browser-image-resizer");
// currently, only IDEFICS is supported by TGI
// the size of images is hardcoded to 224x224 in TGI
// this will need to be configurable when support for more models is added
const resizedImages = await Promise.all(
files.map(async (file) => {
return await module
.readAndCompressImage(file, {
maxHeight: 224,
maxWidth: 224,
quality: 1,
})
.then(async (el) => await file2base64(el as File));
})
const base64Files = await Promise.all(
(files ?? []).map((file) =>
file2base64(file).then((value) => ({ type: "base64" as const, value, mime: file.type }))
)
);

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be nice to be able to do the resizing in browser but it should be configurable indeed. Would save quite some bandwidth I think at huggingchat scale

Maybe model.multimodal could be true | { maxSize?: number, preferredMimeType?: string } ? That would give us a place to store multimodal specific settings, and the client has access to the model so we could do the resizing there.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I created a more generic image processor in c8814f4 and put the multimodal options at endspoints[*].multimodal. What do you think?

Copy link
Collaborator Author

@Saghen Saghen May 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WRT to bandwidth, it might make sense to do some image processing on the server on upload to convert to a more suitable format (i.e. put everything in AVIF/WEBP?) but I found the existing upload limit to get in the way

src/routes/conversation/[id]/+server.ts Outdated Show resolved Hide resolved
@nsarrazin
Copy link
Collaborator

And thanks for exposing the mime type in files 🔥 that's gonna be super handy down the road as we support more modalities

@Saghen
Copy link
Collaborator Author

Saghen commented May 13, 2024

@Ichigo3766 @Extremys @flexchar heads up that it was trivial so I added support for OpenAI in this PR as well

@Saghen Saghen changed the title Multimodal Anthropic Claude 3 Support Generic Multimodal Support May 13, 2024
@gary149 gary149 requested a review from mishig25 May 14, 2024 09:32
@mishig25
Copy link
Collaborator

@Saghen I will review it soon. Could you merge/rebase with the main so that the merge conflicts are gone ❤️

README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Show resolved Hide resolved
src/lib/types/Message.ts Outdated Show resolved Hide resolved
Copy link
Collaborator

@mishig25 mishig25 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

left some nits. I think we are close to merge 🚀

Copy link
Collaborator

@mishig25 mishig25 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

besides the last two nits I've left, LGTM 🚀

Co-authored-by: Mishig <mishig.davaadorj@coloradocollege.edu>
Copy link
Collaborator

@mishig25 mishig25 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! testing it one more time before merge

@mishig25 mishig25 merged commit 57f8934 into huggingface:main May 15, 2024
3 checks passed
mishig25 added a commit that referenced this pull request May 15, 2024
mishig25 added a commit that referenced this pull request May 15, 2024
Revert "Generic Multimodal Support (#1021)"

This reverts commit 57f8934.
@nsarrazin nsarrazin added the enhancement New feature or request label May 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants