Enhancing Paperless-ngx with ChatGPT API Integration for Improved OCR and Document Classification #2900
35 comments · 51 replies
-
Would this involve sending possibly confidential/private documents (in image or text form) to some remote server?
-
I really don't want OpenAI receiving my banking or medical documents. If this happens, make it an opt-in configuration option, or it's a no-go.
-
Maybe just an empty ChatGPT-style model from OpenAI. It's open source, and you could build it in; it might learn better than the current AI?
-
I think this really needs to be implemented. There seem to be many ways to run models locally. I am no expert when it comes to AI models, but I saw some models on Hugging Face that are able to analyse documents, which is what I believe should be implemented (in the long run).
I believe that sooner or later there will be DMSs that incorporate such features, and I would really love to see them on the Paperless-ngx roadmap.
-
You could build on https://github.com/hwchase17/langchain, which uses LLMs and, from what I understood, vector representations of the data (i.e. PDF documents), which could also be stored locally. And then use one of the LLMs that are out there (not necessarily OpenAI's GPT). Useful for querying documents in the paperless-ngx interface.
-
Really, just plugging in your choice of LLM would be good; bring your own API key, essentially. For example, Azure or Amazon cognitive services provide much, MUCH better OCR, entity detection, document classification, etc. right out of the box than what we see with Paperless today. PII detection and redaction are also very possible, even for images of text, which opens up the ability to store PII securely, separately from other OCR'd data. So many things. They have pretty bog-standard outputs as well, so translation layers between providers and Paperless can be reasonably light and straightforward.
-
Possibly "GPT4All": https://gpt4all.io/index.html
-
This could be a great feature!
-
Leaving this here for consideration:
-
Just my personal opinion: I would never use this feature. My reasons:
Just my 2 cents, to balance the discussion :-)
-
Nobody says they want to entrust their private documents to the cloud.
-
Is someone working on this?
-
I'm working on something that would run in a separate container and access the Paperless API to do the things mentioned in this thread. I have not played around with changing the models much, but I haven't had great luck getting good results for things like extracting dates, correspondents, etc., from one of the models in privateGPT. If I can get it to work decently, I will make the repo public. Has anyone gotten good results from any models in particular?
-
For Azure OpenAI, you can fill out this form to have them disable the abuse monitoring for your account, which makes it so they don't store your data. I have an exemption, and it went through pretty fast. That said, I'd rather run a local model, but the stuff I'm doing doesn't work well with the local models that I've tried.
Basically, I'm feeding it unstructured text blobs scraped from websites, having it extract data into a JSON template that I provide as part of the prompt, and then using the results to call a REST API to insert the data into an application. It works really, really well, and I'm just using GPT-3.5-turbo.
That said, I came here to suggest this exact same feature. Basically, just take the extracted text from the searchable PDF and feed it to the AI model, along with a prompt that says to extract the correspondent, document type, creation date, and relevant tags. Then have some logic that checks for the correspondent, document type, and tags, creates them if they don't exist, and assigns them to the document.
Maybe it's something that doesn't happen automatically: you have to click a button for each document so you can review what it came up with and the proposed changes (creating correspondents, document types, etc.). And then you can set it to run automatically once you are confident that it's not on drugs.
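The extract-into-a-JSON-template flow described above can be sketched roughly as below. The field names, prompt wording, and validation logic are illustrative assumptions, not Paperless-ngx's actual API schema:

```python
import json

# A JSON template the model is asked to fill in; the exact fields and
# wording here are illustrative, not a fixed schema.
TEMPLATE = '{"correspondent": "", "document_type": "", "created_date": "YYYY-MM-DD", "tags": []}'

def build_prompt(document_text: str) -> str:
    """Assemble the extraction prompt: instructions, the JSON template,
    then the raw OCR'd text of the document."""
    return (
        "Extract the correspondent, document type, creation date, and "
        "relevant tags from the document below. Reply with JSON only, "
        "matching this template:\n" + TEMPLATE +
        "\n\nDocument:\n" + document_text
    )

def parse_reply(reply: str) -> dict:
    """Validate the model's JSON reply before touching the document."""
    data = json.loads(reply)
    missing = [k for k in ("correspondent", "document_type",
                           "created_date", "tags") if k not in data]
    if missing:
        raise ValueError(f"model reply missing fields: {missing}")
    return data
```

The validated dict would then drive REST calls that create any missing correspondents or tags and patch the document; the review-before-apply button suggested above would slot in between validation and the REST call.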
-
It's a no from me. The privacy trade-off if it goes off-site, or the performance/compute requirements if on-site, would render the benefits not worthwhile. This would also be pretty complex to implement, when the dev time could likely be spent on even more quality-of-life improvements.
-
Models like LLaVA, which will describe what is in the picture and even give the scan a name based on metadata, OCR, and actual image content, could really improve Paperless substantially. Ollama now has easy template integration, and even OpenAI API syntax. Even if just used to rename the files, this could be great.
-
Absolutely agree 👍
-
Regardless of whether we use OpenAI, an LLM integration would be fun, but I think people are thinking too big; what I'd rather see is a little feature that enhances auto-tagging based on a document's contents. One more note: nowadays APIs like Anthropic's are superior anyway, and local models are insanely good.
-
Hi folks, looks like there's a lot of back and forth about this. I wrote a post-consumption script which sends the document to ChatGPT for OCR and then classifies the doc based on tags, correspondents, and doc types that already exist. It does a back-off retry when tokens are rate-limited, as vision processing takes a lot of tokens. It also does not update the content (but does update other fields) if ChatGPT refuses to OCR the document; tax docs, for example, will be refused by ChatGPT. It does not respect any tags or correspondents already applied to the doc, which is why it's best used as a post-consumption script.
To improve the classification, it sends both the original doc and the OCR'd text to ChatGPT. This, however, consumes a bunch of extra tokens, and you can remove the original doc if you are sensitive to that. Removing the original doc would also allow you to use something other than the GPT vision model, which has some limitations. You can also define the endpoint for the LLM so you can use something locally hosted; I personally recommend https://github.com/go-skynet/LocalAI, which is what I use.
There's still some work that I need to do to clean it up, as it's very rough, but it works.
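The back-off retry mentioned above can be sketched as a small wrapper. `RateLimitError` here is a stand-in for whatever exception your API client actually raises on rate limiting, and the delay schedule is an assumption:

```python
import time

class RateLimitError(Exception):
    """Stand-in for the API client's rate-limit exception."""

def with_backoff(call, max_retries=5, base_delay=2.0, sleep=time.sleep):
    """Run call(), retrying with exponentially growing delays when the
    API reports rate limiting (vision requests burn a lot of tokens)."""
    for attempt in range(max_retries):
        try:
            return call()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # give up after the final retry
            sleep(base_delay * 2 ** attempt)
```

Injecting `sleep` as a parameter keeps the wrapper testable without actually waiting.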
-
I wrote a small shell script to rename PDFs and summarize PDFs using llama2 and Ollama running on localhost, via a curl request. It should be fairly easy for someone with more knowledge to put this into a pull request; unfortunately, I don't see where I could place this script effectively (neither pre- nor post-processing seems to fit the bill?). Script here: #6193
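For reference, the core of such a rename step in Python rather than shell might look like the sketch below. The prompt wording and filename cleanup are my own assumptions; only the endpoint shape (non-streaming `/api/generate` returns one JSON object whose `response` field holds the reply) comes from Ollama's API:

```python
import json
import re

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_rename_request(pdf_text: str, model: str = "llama2") -> bytes:
    """Request body for /api/generate; stream=False makes Ollama answer
    with a single JSON object instead of a token stream."""
    return json.dumps({
        "model": model,
        "prompt": ("Suggest a short, descriptive filename (no extension) "
                   "for this document:\n" + pdf_text),
        "stream": False,
    }).encode("utf-8")

def to_safe_filename(suggestion: str) -> str:
    """Reduce the model's suggestion to a filesystem-safe PDF name."""
    cleaned = re.sub(r"[^\w\- ]", "", suggestion).strip()
    return re.sub(r"\s+", "_", cleaned)[:80] + ".pdf"
```

POSTing the body to `OLLAMA_URL` (e.g. with `urllib.request`) and feeding the reply's `response` field into `to_safe_filename` yields the new name.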
-
There are interesting projects like LiteLLM that provide a single library handling dozens of APIs, including "local" APIs if you're running Ollama or LocalAI yourself. Given the pace of development and the compression of LLMs, by the end of the year we can expect current consumer hardware to be completely up to the task for many of the things done in paperless-ngx: better tagging, better search via embeddings, and, more importantly, better date matching (no more thinking the date is your birthdate :) )
-
From my own experiments: about a month ago, I went paperless with over 2000 documents (invoices, contracts, notes, everything I had), plus everything on paper went through OCR and also made it into my Paperless system. It was costly. Trying to figure out how to cut costs, I came up with this solution:
And now the trick: 'gpt-3.5-turbo-1106' starts losing the thread and begins writing nonsense after about 10-15 inquiries, so every 10 documents I start a new session for it, and it works smoothly. Excluding tests, it cost me about $30, and I couldn't have done it better myself (I wouldn't even want to try). I am very happy with the results.
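The "new session every 10 documents" trick boils down to simple batching; a minimal sketch, with the chat handling itself left abstract:

```python
def fresh_sessions(documents, batch_size=10):
    """Group documents so each batch gets a brand-new chat session,
    resetting before the model starts losing the thread."""
    for i in range(0, len(documents), batch_size):
        yield documents[i:i + batch_size]
```

Each yielded batch would be processed in its own conversation, then the conversation is discarded.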
-
Yeah, but within the same chat it sends all the previous tokens with every message, so once you reach the 10th doc it sends the entire previous back-and-forth with it. That's madly inefficient for cost and for context.
-
I'm really surprised no one has mentioned https://github.com/B-urb/doclytics. I'm using it for tagging documents; extracting, for example, the price from an invoice works like a charm. I'm running mixtral:instruct in RAM on CPU. It is much cheaper than ChatGPT. However, I'm missing a nice GUI.
-
I just started messing around with LocalAI on my gaming PC, and it has a lot of potential. I feel like I need to get the prompting right and determine the correct model. If this works, I'll probably end up throwing a GPU in my server just for local LLMs. The above-mentioned doclytics looks like it might be great to use once I deploy this.
-
Yeah, it's pretty nice! Feels like endless possibilities on our own hardware. Are you testing it in general or specifically for Paperless?
-
Ah, you got this confused, I guess. Whisper is OpenAI's state-of-the-art speech-to-text model. You can get a glimpse of using it locally here: https://github.com/openai/whisper
But I'm sure there are better implementations of it depending on your needs :) It would be interesting to test whether it performs better than Piper in your cases.
-
Hi everyone, I'm intrigued by the ongoing discussion about enhancing Paperless-ngx with the ChatGPT API (or better, a local LLaMA) for improved OCR and document classification. I understand the concerns about privacy and the interest in local processing. Could someone kindly share a possible solution or guide on how to test these features using a Docker installation? I believe a step-by-step guide or any tips on setting this up would be greatly beneficial for the community. Thank you in advance for your support and collaboration!
-
Hey, one thing I stumbled upon during my own experimentation with local LLMs and Paperless is that there currently seems to be no way to store longer custom text in a searchable way. I don't want to touch the content field, notes are not searchable AFAIK, and a custom field of type text is limited to 128 characters.
I wanted to use AI for generating a clean summary of my documents. The output of the OCR is very erratic, especially for documents that have info-boxes and a mixture of horizontal and vertical layouts.
-
TL;DR: I think something like this could be opt-in, useful, and private. An OpenAI-compatible API endpoint could be generated locally on another computer, not on the Paperless server, through a variety of methods, e.g. LM Studio. Phi-3 is fast and useful for this task on modest hardware. I'm happy to help but don't know anything; everything I say comes from excited ignorance.
Hello all! I just wanted to communicate my interest in some sort of "opt-in" option as well. I, like many others, have no interest in sharing my documents with another entity (OpenAI or anyone else). However, I also see great value in running a local LLM to generate titles, correspondents, document types, tags, and created dates based upon my already OCR'd content.
Maybe I am alone in this problem, but I have almost 3000 documents that need to be tagged. I have no interest in doing this by hand (I have started and failed), but I'd be happy to leave my gaming machine on for a few days to have it done for me (even if imperfectly).
On the LLM side of things, I have been playing with LM Studio, specifically Microsoft's recently released Phi-3 model. I have a modest machine (Ryzen 5 3600, 16 GB RAM, 5700 XT) and it still generates ~13 tok/s and takes ~5 seconds per document to generate the tags. That could turn a tedious task into something done in 5-ish hours overnight.
Inspired by Signal15 above, the system prompt I've found to have the best success with is:
You are a document review and classification bot. Your purpose is to review documents and to output the following items, in this order, in .json format:
"TITLE" should be in this format: Correspondent + DOCUMENT_TYPE + CREATED_DATE. For example: Home Depot_Receipt_04/04/24
"CORRESPONDENT" is the provider of the document. For example, with a receipt it might be "Home Depot".
"DOCUMENT_TYPE" can only be one of the following: Application, Bill, Business Cards, Chat Record, Check or Check Stub, Contract, ID Card, Invoice, Letter, Manual, Medical Record, Note, Policy, Quote, Receipt, Statement, Tracking Number.
"TAGS" could be one or more of the following tags: [Fill in your tags]
"CREATED_DATE" will always be in MM/DD/YYYY format.
"AMOUNT" should be populated only if the document is an invoice, receipt, or anything similar, and must only include the final total in this format: $XXXX.XX. If it is a refund, show it as a negative amount.
I have found that I consistently get quick and accurate results with the temperature turned to 0.2 and top-P sampling turned to 0.5. That said, I am continuing to experiment. Since LM Studio can create an OpenAI API endpoint, it would seem a theoretical connection could be made as suggested above? Maybe something similar to what I understand Doclytics does (https://github.com/B-urb/doclytics).
I should note: I am not a developer, I don't know how to code, and I don't really know what the heck I'm doing. But I'd be really happy to contribute whatever I could toward making my 3000 documents go from a never-completed task to one that takes 4-5 hours.
Finally, I just want to say that I really appreciate Paperless-ngx and the folks who keep working on it!
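Since LM Studio exposes an OpenAI-compatible chat-completions endpoint, a request carrying the settings reported above (temperature 0.2, top-p 0.5) might be assembled like this; the model name is an assumption:

```python
def build_chat_request(system_prompt: str, document_text: str,
                       model: str = "phi-3",      # assumed local model name
                       temperature: float = 0.2,  # settings reported to work well
                       top_p: float = 0.5) -> dict:
    """OpenAI-style chat-completion payload: the classification rules go
    in the system message, the OCR'd document text in the user message."""
    return {
        "model": model,
        "temperature": temperature,
        "top_p": top_p,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": document_text},
        ],
    }
```

The same payload shape works against any OpenAI-compatible server (LM Studio, LocalAI, Ollama's compatibility layer), which is what makes the bring-your-own-endpoint idea in this thread plausible.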
-
As a Paperless-ngx user, I would like to suggest integrating the ChatGPT API into the document management system to enhance its functionality. By leveraging ChatGPT's OCR capabilities or, in the future, GPT-4's image recognition technology, I believe we can significantly improve the accuracy and efficiency of document recognition and classification within Paperless-ngx. This integration would allow users like myself to better manage our documents by enabling ChatGPT to quickly understand the context of a document and assign it to the appropriate category.
Proposed Features:
1. Enhanced OCR:
   - Utilize ChatGPT's OCR capabilities to accurately extract text from scanned documents.
   - Improve the overall OCR performance in Paperless-ngx by reducing errors and providing more precise text recognition for users.
2. Document Classification:
   - Leverage ChatGPT's natural language understanding to analyze and classify documents based on their content.
   - Automatically assign documents to relevant categories, tags, or user-defined labels, streamlining the organization process for Paperless-ngx users.
3. Image Recognition (Future Development):
   - Integrate GPT-4's image recognition technology, once available, to identify and classify documents based on visual elements.
   - Enhance the classification process by recognizing logos, images, or specific formatting styles within documents.
4. Search Functionality:
   - Improve Paperless-ngx's search capabilities by incorporating ChatGPT's contextual understanding of documents.
   - Allow users to search for documents using natural language queries, making it easier for us to locate specific files.
5. Summary and Metadata Extraction:
   - Utilize ChatGPT to automatically generate summaries for documents, providing users with a quick overview of the content.
   - Extract metadata from documents to make it easier for users to manage and organize their files in Paperless-ngx.
I believe that integrating the ChatGPT API into Paperless-ngx can significantly improve the user experience and make managing digital documents more efficient and accurate. I hope the Paperless-ngx development team considers implementing this idea.