Enhancing Paperless-ngx with ChatGPT API Integration for Improved OCR and Document Classification #2900
35 comments · 51 replies
-
Would this involve sending possibly confidential/private documents (in image or text form) to some remote server?
-
I really don't want OpenAI receiving my banking or medical documents. If this happens, make it an opt-in configuration option, or it's a no-go.
-
Maybe just an empty ChatGPT-style model from OpenAI. It's open source, and you could build it in; it might learn better than the current AI?
-
I think this really needs to be implemented. There seem to be many ways to run models locally. I am no expert when it comes to AI models, but I saw some models on Hugging Face that are able to analyse documents, which is what I believe should be implemented (in the long run).
I believe that sooner or later there will be DMSs that incorporate such features, and I would really love to see them on the Paperless-ngx roadmap.
-
You could build on https://github.com/hwchase17/langchain, which uses LLMs and, from what I understood, vector representations of the data (i.e. PDF documents), which could also be stored locally. And then use one of the LLMs that are out there (not necessarily OpenAI's GPT). Useful for querying documents in the paperless-ngx interface.
-
Really, just plugging in your choice of LLM would be good; bring your own API key, essentially. For example, Azure or Amazon cognitive services provide much, MUCH better OCR, entity detection, document classification, etc. right out of the box than what we see with Paperless today. PII detection and redaction are also very possible, even for images of text, which opens up the ability to store PII securely, separately from other OCR'd data. So many things. They have pretty bog-standard outputs as well, so translation layers between providers and Paperless can be reasonably light and straightforward.
-
Possibly "GPT4All": https://gpt4all.io/index.html
-
This could be a great feature!
-
Leaving this here for consideration:
-
Just my personal opinion: I would never use this feature. My reasons:
Just my 2 cents, to balance the discussion :-)
-
Nobody says they want to entrust their private documents to the cloud.
-
Is someone working on this?
-
I'm working on something that would run in a separate container and access the Paperless API to do the things mentioned in this thread. I have not played around with changing the models much, but I haven't had great luck getting good results for things like extracting dates, correspondents, etc., from one of the models in privateGPT. If I can get it to work decently, I will make the repo public. Has anyone gotten good results from any models in particular?
-
For Azure OpenAI, you can fill out this form to have them disable the abuse monitoring for your account, which makes it so they don't store your data. I have an exemption, and it went through pretty fast. That said, I'd rather run a local model, but the stuff I'm doing doesn't work well with the local models that I've tried.
Basically, I'm feeding it unstructured text blobs scraped from websites, having it extract data into a JSON template that I provide as part of the prompt, and then using the results to call a REST API to insert the data into an application. It works really, really well, and I'm just using GPT-3.5-turbo.
That said, I came here to suggest this exact same feature. Basically, just take the extracted text from the searchable PDF and feed it to the AI model, along with a prompt that says to extract the correspondent, document type, creation date, and relevant tags. Then have some logic that checks for the correspondent, document type, and tags, creates them if they don't exist, and assigns them to the document.
Maybe it's something that doesn't happen automatically: you have to click a button for each document so you can review what it came up with and the proposed changes (creating correspondents, document types, etc.). And then you can set it to run automatically once you are confident that it's not on drugs.
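The extract-into-a-JSON-template flow described above can be sketched roughly as below. The field names, prompt wording, and validation logic are illustrative assumptions, not Paperless-ngx's actual API schema:

```python
import json

# A JSON template the model is asked to fill in; the exact fields and
# wording here are illustrative, not a fixed schema.
TEMPLATE = '{"correspondent": "", "document_type": "", "created_date": "YYYY-MM-DD", "tags": []}'

def build_prompt(document_text: str) -> str:
    """Assemble the extraction prompt: instructions, the JSON template,
    then the raw OCR'd text of the document."""
    return (
        "Extract the correspondent, document type, creation date, and "
        "relevant tags from the document below. Reply with JSON only, "
        "matching this template:\n" + TEMPLATE +
        "\n\nDocument:\n" + document_text
    )

def parse_reply(reply: str) -> dict:
    """Validate the model's JSON reply before touching the document."""
    data = json.loads(reply)
    missing = [k for k in ("correspondent", "document_type",
                           "created_date", "tags") if k not in data]
    if missing:
        raise ValueError(f"model reply missing fields: {missing}")
    return data
```

The validated dict would then drive REST calls that create any missing correspondents or tags and patch the document; the review-before-apply button suggested above would slot in between validation and the REST call.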
-
It's a no from me. The privacy trade-off if it goes off-site, or the performance/compute requirements if on-site, would render the benefits not worthwhile. This would also be pretty complex to implement, when the dev time could likely be spent on even more quality-of-life improvements.
-
Models like LLaVA, which will describe what is in the picture and even give the scan a name based on metadata, OCR, and actual image content, could really improve Paperless substantially. Ollama now has easy template integration, and even OpenAI API syntax. Even if just used to rename the files, this could be great.
-
Absolutely agree 👍
-
Regardless of whether we use OpenAI, an LLM integration would be fun, but I think people are thinking too big; what I'd rather see is a little feature that enhances auto-tagging based on a document's contents. One more note: nowadays APIs like Anthropic's are superior anyway, and local models are insanely good.
-
Hi folks, looks like there's a lot of back and forth about this. I wrote a post-consumption script which sends the document to ChatGPT for OCR and then classifies the doc based on tags, correspondents, and doc types that already exist. It does a back-off retry when tokens are rate-limited, as vision processing takes a lot of tokens. It also does not update the content (but does update other fields) if ChatGPT refuses to OCR the document; tax docs, for example, will be refused by ChatGPT. It does not respect any tags or correspondents already applied to the doc, which is why it's best used as a post-consumption script.
To improve the classification, it sends both the original doc and the OCR'd text to ChatGPT. This, however, consumes a bunch of extra tokens, and you can remove the original doc if you are sensitive to that. Removing the original doc would also allow you to use something other than the GPT vision model, which has some limitations. You can also define the endpoint for the LLM so you can use something locally hosted; I personally recommend https://github.com/go-skynet/LocalAI, which is what I use.
There's still some work that I need to do to clean it up, as it's very rough, but it works.
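The back-off retry mentioned above can be sketched as a small wrapper. `RateLimitError` here is a stand-in for whatever exception your API client actually raises on rate limiting, and the delay schedule is an assumption:

```python
import time

class RateLimitError(Exception):
    """Stand-in for the API client's rate-limit exception."""

def with_backoff(call, max_retries=5, base_delay=2.0, sleep=time.sleep):
    """Run call(), retrying with exponentially growing delays when the
    API reports rate limiting (vision requests burn a lot of tokens)."""
    for attempt in range(max_retries):
        try:
            return call()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # give up after the final retry
            sleep(base_delay * 2 ** attempt)
```

Injecting `sleep` as a parameter keeps the wrapper testable without actually waiting.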
-
I wrote a small shell script to rename PDFs and summarize PDFs using llama2 and Ollama running on localhost, via a curl request. It should be fairly easy for someone with more knowledge to put this into a pull request; unfortunately, I don't see where I could place this script effectively (neither pre- nor post-processing seems to fit the bill?). Script here: #6193
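For reference, the core of such a rename step in Python rather than shell might look like the sketch below. The prompt wording and filename cleanup are my own assumptions; only the endpoint shape (non-streaming `/api/generate` returns one JSON object whose `response` field holds the reply) comes from Ollama's API:

```python
import json
import re

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_rename_request(pdf_text: str, model: str = "llama2") -> bytes:
    """Request body for /api/generate; stream=False makes Ollama answer
    with a single JSON object instead of a token stream."""
    return json.dumps({
        "model": model,
        "prompt": ("Suggest a short, descriptive filename (no extension) "
                   "for this document:\n" + pdf_text),
        "stream": False,
    }).encode("utf-8")

def to_safe_filename(suggestion: str) -> str:
    """Reduce the model's suggestion to a filesystem-safe PDF name."""
    cleaned = re.sub(r"[^\w\- ]", "", suggestion).strip()
    return re.sub(r"\s+", "_", cleaned)[:80] + ".pdf"
```

POSTing the body to `OLLAMA_URL` (e.g. with `urllib.request`) and feeding the reply's `response` field into `to_safe_filename` yields the new name.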
-
There are interesting projects like LiteLLM that provide a single library handling dozens of APIs, including "local" APIs if you're running Ollama or LocalAI yourself. Given the pace of development and the compression of LLMs, by the end of the year we can expect current consumer hardware to be completely up to the task for many of the things done in paperless-ngx: better tagging, better search via embeddings, and, more importantly, better date matching (no more thinking the date is your birthdate :) )
-
From my own experiments: about a month ago, I went paperless with over 2000 documents (invoices, contracts, notes, everything I had), plus everything on paper went through OCR and also made it into my Paperless system. It was costly. Trying to figure out how to cut costs, I came up with this solution:
And now the trick: 'gpt-3.5-turbo-1106' starts losing the thread and begins writing nonsense after about 10-15 inquiries, so every 10 documents I start a new session for it, and it works smoothly. Excluding tests, it cost me about $30, and I couldn't have done it better myself (I wouldn't even want to try). I am very happy with the results.
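The "new session every 10 documents" trick boils down to simple batching; a minimal sketch, with the chat handling itself left abstract:

```python
def fresh_sessions(documents, batch_size=10):
    """Group documents so each batch gets a brand-new chat session,
    resetting before the model starts losing the thread."""
    for i in range(0, len(documents), batch_size):
        yield documents[i:i + batch_size]
```

Each yielded batch would be processed in its own conversation, then the conversation is discarded.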
-
Yeah, but within the same chat it sends all the previous tokens with every message, so once you reach the 10th doc it sends the entire previous back-and-forth with it. That's madly inefficient for cost and for context.
-
I'm really surprised no one has mentioned https://github.com/B-urb/doclytics. I'm using it for tagging documents; extracting, for example, the price from an invoice works like a charm. I'm running mixtral:instruct in RAM on CPU. It is much cheaper than ChatGPT. However, I'm missing a nice GUI.
-
I just started messing around with LocalAI on my gaming PC, and it has a lot of potential. I feel like I need to get the prompting right and determine the correct model. If this works, I'll probably end up throwing a GPU in my server just for local LLMs. The above-mentioned doclytics looks like it might be great to use once I deploy this.
-
Yeah, it's pretty nice! Feels like endless possibilities on our own hardware. Are you testing it in general or specifically for Paperless?
-
Ah, you got this confused, I guess. Whisper is OpenAI's state-of-the-art speech-to-text model. You can get a glimpse of using it locally here: https://github.com/openai/whisper
But I'm sure there are better implementations of it depending on your needs :) It would be interesting to test whether it performs better than Piper in your cases.
-
Hi everyone, I'm intrigued by the ongoing discussion about enhancing Paperless-ngx with the ChatGPT API (or better, a local LLaMA) for improved OCR and document classification. I understand the concerns about privacy and the interest in local processing. Could someone kindly share a possible solution or guide on how to test these features using a Docker installation? I believe a step-by-step guide or any tips on setting this up would be greatly beneficial for the community. Thank you in advance for your support and collaboration!
-
Hey, one thing I stumbled upon during my own experimentation with local LLMs and Paperless is that there currently seems to be no way to store longer custom text in a searchable way. I don't want to touch the content field, notes are not searchable AFAIK, and a custom field of type text is limited to 128 characters.
I wanted to use AI for generating a clean summary of my documents. The output of the OCR is very erratic, especially for documents that have info-boxes and a mixture of horizontal and vertical layouts.
-
TL;DR: I think something like this could be opt-in, useful, and private. An OpenAI-compatible API endpoint could be generated locally on another computer, not on the Paperless server, through a variety of methods, e.g. LM Studio. Phi-3 is fast and useful for this task on modest hardware. I'm happy to help but don't know anything; everything I say comes from excited ignorance.
Hello all! I just wanted to communicate my interest in some sort of "opt-in" option as well. I, like many others, have no interest in sharing my documents with another entity (OpenAI or anyone else). However, I also see great value in running a local LLM to generate titles, correspondents, document types, tags, and created dates based upon my already OCR'd content.
Maybe I am alone in this problem, but I have almost 3000 documents that need to be tagged. I have no interest in doing this by hand (I have started and failed), but I'd be happy to leave my gaming machine on for a few days to have it done for me (even if imperfectly).
On the LLM side of things, I have been playing with LM Studio, specifically Microsoft's recently released Phi-3 model. I have a modest machine (Ryzen 5 3600, 16 GB RAM, 5700 XT) and it still generates ~13 tok/s and takes ~5 seconds per document to generate the tags. That could turn a tedious task into something done in 5-ish hours overnight.
Inspired by Signal15 above, the system prompt I've found to have the best success with is:
You are a document review and classification bot. Your purpose is to review documents and to output the following items, in this order, in .json format:
"TITLE" should be in this format: Correspondent + DOCUMENT_TYPE + CREATED_DATE. For example: Home Depot_Receipt_04/04/24
"CORRESPONDENT" is the provider of the document. For example, with a receipt it might be "Home Depot".
"DOCUMENT_TYPE" can only be one of the following: Application, Bill, Business Cards, Chat Record, Check or Check Stub, Contract, ID Card, Invoice, Letter, Manual, Medical Record, Note, Policy, Quote, Receipt, Statement, Tracking Number.
"TAGS" could be one or more of the following tags: [Fill in your tags]
"CREATED_DATE" will always be in MM/DD/YYYY format.
"AMOUNT" should be populated only if the document is an invoice, receipt, or anything similar, and must only include the final total in this format: $XXXX.XX. If it is a refund, show it as a negative amount.
I have found that I consistently get quick and accurate results with the temperature turned to 0.2 and top-P sampling turned to 0.5. That said, I am continuing to experiment. Since LM Studio can create an OpenAI API endpoint, it would seem a theoretical connection could be made as suggested above? Maybe something similar to what I understand Doclytics does (https://github.com/B-urb/doclytics).
I should note: I am not a developer, I don't know how to code, and I don't really know what the heck I'm doing. But I'd be really happy to contribute whatever I could toward making my 3000 documents go from a never-completed task to one that takes 4-5 hours.
Finally, I just want to say that I really appreciate Paperless-ngx and the folks who keep working on it!
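Since LM Studio exposes an OpenAI-compatible chat-completions endpoint, a request carrying the settings reported above (temperature 0.2, top-p 0.5) might be assembled like this; the model name is an assumption:

```python
def build_chat_request(system_prompt: str, document_text: str,
                       model: str = "phi-3",      # assumed local model name
                       temperature: float = 0.2,  # settings reported to work well
                       top_p: float = 0.5) -> dict:
    """OpenAI-style chat-completion payload: the classification rules go
    in the system message, the OCR'd document text in the user message."""
    return {
        "model": model,
        "temperature": temperature,
        "top_p": top_p,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": document_text},
        ],
    }
```

The same payload shape works against any OpenAI-compatible server (LM Studio, LocalAI, Ollama's compatibility layer), which is what makes the bring-your-own-endpoint idea in this thread plausible.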
-
As a Paperless-ngx user, I would like to suggest integrating the ChatGPT API into the document management system to enhance its functionality. By leveraging ChatGPT's OCR capabilities or, in the future, GPT-4's image recognition technology, I believe we can significantly improve the accuracy and efficiency of document recognition and classification within Paperless-ngx. This integration would allow users like myself to better manage our documents by enabling ChatGPT to quickly understand the context of a document and assign it to the appropriate category.
Proposed Features:
1. Enhanced OCR:
   - Utilize ChatGPT's OCR capabilities to accurately extract text from scanned documents.
   - Improve the overall OCR performance in Paperless-ngx by reducing errors and providing more precise text recognition for users.
2. Document Classification:
   - Leverage ChatGPT's natural language understanding to analyze and classify documents based on their content.
   - Automatically assign documents to relevant categories, tags, or user-defined labels, streamlining the organization process for Paperless-ngx users.
3. Image Recognition (Future Development):
   - Integrate GPT-4's image recognition technology, once available, to identify and classify documents based on visual elements.
   - Enhance the classification process by recognizing logos, images, or specific formatting styles within documents.
4. Search Functionality:
   - Improve Paperless-ngx's search capabilities by incorporating ChatGPT's contextual understanding of documents.
   - Allow users to search for documents using natural language queries, making it easier for us to locate specific files.
5. Summary and Metadata Extraction:
   - Utilize ChatGPT to automatically generate summaries for documents, providing users with a quick overview of the content.
   - Extract metadata from documents to make it easier for users to manage and organize their files in Paperless-ngx.
I believe that integrating the ChatGPT API into Paperless-ngx can significantly improve the user experience and make managing digital documents more efficient and accurate. I hope the Paperless-ngx development team considers implementing this idea.