[Draft] Feature: automatic document translation #6386
Conversation
Hello @bjesus, Thank you very much for submitting this PR to us! This is what will happen next:
You'll be hearing from us soon, and thank you again for contributing to our project.
Thanks for the PR! Before we get to the implementation questions, I think the first question really needs to be answered. That said, I’m not sure what the definitive answer is (or whether there is one). I think there are two philosophical issues with using a commercial service for translation. The first is that I have come to realize that pngx users are extremely privacy-focused (for example, go read the thread about integrating ChatGPT). The second is that it would be supporting a closed-source commercial service, which is sort of against the ethos of the project. On the other hand, of course, if DeepL is head-and-shoulders above the rest then maybe the question is moot. I have almost no knowledge of the translation ecosystem (FOSS or otherwise). Of course this is opt-in, which is good, but my gut tells me that if there’s a viable FOSS alternative out there we should use it, or at least have it as an option.
At the very least, I think this should be "pluggable", with initial support for at least one FOSS, local service. That makes it easier to expand to new providers if users want them.
To be honest, adding support for multiple providers isn't difficult at all; it's mostly a question of changing the request scheme a little. LibreTranslate, or even an LLM (which could be open source), all work over HTTP. I'll update the PR to include support for more providers and report back here!
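A minimal sketch of what "pluggable" could look like, assuming a small backend interface and a name-based registry. Every name here (TranslationBackend, register_backend, get_backend, and the "upper" demo backend) is hypothetical and not taken from the PR. A real backend would send an HTTP request to LibreTranslate, DeepL, or a local model inside translate().

```python
from typing import Callable, Dict, Protocol


class TranslationBackend(Protocol):
    """Hypothetical interface that every provider would implement."""

    def translate(self, text: str, source: str, target: str) -> str: ...


# Registry mapping a configured backend name to a factory (here: the class).
_BACKENDS: Dict[str, Callable[[], TranslationBackend]] = {}


def register_backend(name: str):
    """Class decorator that registers a backend under the given name."""

    def decorator(cls):
        _BACKENDS[name] = cls
        return cls

    return decorator


@register_backend("upper")
class UppercaseBackend:
    """Stand-in backend for illustration only: it does not translate anything."""

    def translate(self, text: str, source: str, target: str) -> str:
        return text.upper()


def get_backend(name: str) -> TranslationBackend:
    """Instantiate the backend registered under `name`."""
    try:
        return _BACKENDS[name]()
    except KeyError:
        raise ValueError(f"Unknown translation backend: {name!r}") from None
```

With this shape, adding a provider means adding one class, and the calling code only ever sees get_backend(configured_name).translate(...).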
It's also not currently opt-in. It will always run when updating the archive file, which is undesirable. It should be controlled by some setting(s) and only run if the user has configured it.
Oh yeah, I hadn't looked that closely; I thought it would be optional in terms of requiring the settings (
Yes, I will definitely make it opt-in!
Hey, I've made two important changes now:
Overall, I think that since this runs locally, the risk of users objecting to this feature is quite a bit lower now. There are still two things I'd love to add:
Both of these features generally require some understanding of what language the document is in. I'm trying to figure out the best way to do that.
Auto-detecting the language seems to work pretty well with https://github.com/pemistahl/lingua-py. I'll add that soon, and then we can support multiple languages and skip the translation when it isn't needed. Please let me know what other concerns or ideas you might have! @hendrik1120 I noticed you voted the feature down - I'd be happy to hear what objections you have to it 🙏
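A rough sketch of the "don't translate if it isn't needed" check, assuming the detector is passed in as a plain callable that returns a language code or None. The function names and the toy detector are hypothetical; in practice the callable would wrap whatever lingua-py reports for the document text.

```python
from typing import Callable, Optional


def needs_translation(
    text: str,
    target_language: str,
    detect: Callable[[str], Optional[str]],
) -> bool:
    """Return True only when the text is detected as a different language."""
    if not text.strip():
        return False  # nothing to translate
    detected = detect(text)
    if detected is None:
        return False  # detector is unsure: skip rather than guess
    return detected.lower() != target_language.lower()


def fake_detect(text: str) -> Optional[str]:
    """Toy detector for this example only; not a real language detector."""
    return "nl" if "belasting" in text.lower() else "en"
```

Skipping when the detector returns None is one possible policy; translating anyway in that case would be equally easy to implement, at the cost of some wasted work.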
One note in the meantime: I’d be careful about package bloat for what is an optional feature at the moment. I haven’t looked at either of the two mentioned ones yet; they might be fine. Also, I don’t think bergamot is properly added at the moment, e.g. it is not in the Pipfile/lock. Finally, we already have
Great, I'll try using
@bjesus sure, my response is almost identical to the one from shamoon. From the recent changes, I see that all of my concerns seem to have been mitigated already.
I still have some things I think need to be addressed before really reviewing:
I think this would be an excellent feature, and I'd definitely use it if it runs locally. About the workflow: paperless-ngx already has the PAPERLESS_OCR_LANGUAGES and PAPERLESS_OCR_LANGUAGE environment variables at hand. The former can be used as the default source languages, and the latter as the default target language. On top of these defaults, you could introduce PAPERLESS_TRA_LANGUAGES, PAPERLESS_TRA_LANGUAGE and PAPERLESS_TRA_BACKEND environment variables, of which the first two would default to their OCR_ equivalents, and the last would default to None or to a translation backend, depending on what the community wants. I'd also like to see a new tab in the document edit page, named '$PAPERLESS_TRA_LANGUAGE', that shows the translated content. Great idea, I'd really like to see it happen. TL;DR: I guess additional variables per backend should be defined as well.
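The fallback scheme described above could be sketched roughly as follows. The PAPERLESS_TRA_* variables are only the names proposed in this comment, not existing paperless-ngx settings; TranslationSettings and load_translation_settings are hypothetical helpers, and splitting the source languages on whitespace is an assumption.

```python
import os
from dataclasses import dataclass
from typing import List, Mapping, Optional


@dataclass(frozen=True)
class TranslationSettings:
    source_languages: List[str]
    target_language: Optional[str]
    backend: Optional[str]

    @property
    def enabled(self) -> bool:
        # Opt-in: nothing runs unless a backend is explicitly configured.
        return self.backend is not None


def load_translation_settings(
    env: Mapping[str, str] = os.environ,
) -> TranslationSettings:
    """Read the proposed PAPERLESS_TRA_* variables, falling back to OCR_ ones."""
    sources = env.get("PAPERLESS_TRA_LANGUAGES") or env.get(
        "PAPERLESS_OCR_LANGUAGES", ""
    )
    target = env.get("PAPERLESS_TRA_LANGUAGE") or env.get("PAPERLESS_OCR_LANGUAGE")
    backend = env.get("PAPERLESS_TRA_BACKEND") or None  # unset or empty: disabled
    # Assumption: source languages are given as a whitespace-separated list.
    return TranslationSettings(sources.split(), target or None, backend)
```

Leaving the backend unset by default keeps the feature off unless the user configures it, which matches the opt-in request earlier in the thread.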
Proposed change
This PR adds support for automatically translating imported documents to whatever destination language is configured. The translated content can then be used when searching, just like the normal content field. I think it is very useful when you have a lot of content in a language you aren't fluent in. For example, I live in the Netherlands, and searching for "tax" is much easier for me than searching for "belastingdienst". The PR currently uses DeepL for translation. It is at the POC stage: I wanted to check that you'd be interested in such a feature to begin with, and to ask some questions about your preferred implementation method.
Closes #269
Type of change
Checklist:
The PR is a WIP so I haven't added tests yet.
pre-commit hooks, see documentation.
My questions
translation field to the Document model?
update_document_archive_file but it probably isn't the best place.
translate_content live inside documents/tasks.py, or should I separate this even further?
Thank you very much for Paperless-ngx!