Problems with reading Arabic #9

Open
muratulashozturk opened this issue Oct 14, 2023 · 2 comments
Labels
bug Something isn't working

Comments

@muratulashozturk

Description

When reading a PDF that contains Arabic text, the library cannot extract the text correctly. It outputs garbled text such as ͯ̀௛̜̺͙ ͳ̮   /

Library version

3.0.2

Node version

v18.17.1

Typescript version (if you are using it)

No response

@muratulashozturk muratulashozturk added the bug Something isn't working label Oct 14, 2023
@muratulashozturk
Author

I've noticed that there are problems with the PDF itself too. When I copy Arabic text into a PDF created in Acrobat, the text is extracted, but the order is mixed up.

@gamemaker1
Owner

Hi,

Sorry for the late reply.

This library uses pdf-parse to parse PDF text content. You could open an issue on its repo, or try a different PDF parsing library (maybe pdfreader?) with a custom extractor:

import { type Buffer } from 'node:buffer'
import { TextExtractor, type TextExtractionMethod } from 'office-text-extractor'
import { PdfReader } from 'pdfreader'

class PdfExtractor implements TextExtractionMethod {
  mimes = ['application/pdf']
  apply = async (input: Buffer): Promise<string> => {
    const text = await new Promise<string>((resolve, reject) => {
      // Collect every text item; pdfreader invokes the callback once per item.
      const chunks: string[] = []
      new PdfReader().parseBuffer(input, (error, item) => {
        if (error) reject(error)
        // A missing item signals the end of the file.
        else if (!item) resolve(chunks.join(' ') || 'blank pdf')
        else if (item.text) chunks.push(item.text)
      })
    })

    return text
  }
}

const extractor = new TextExtractor()
extractor.addMethod(new PdfExtractor())

const text = await extractor.extractText({ input: '...', type: '...' })
console.log(text)
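
To check whether a replacement extractor returns real Arabic text rather than garbled glyphs, a rough sanity check along these lines might help. This is only a sketch: `arabicRatio` and `ARABIC_RANGES` are hypothetical names, not part of office-text-extractor or pdfreader, and the check only looks at which Unicode blocks the characters fall in. It says nothing about whether the reading order is correct.

```typescript
// Hypothetical helper: not part of office-text-extractor or pdfreader.
// Code-unit ranges of the main Arabic Unicode blocks (all in the BMP).
const ARABIC_RANGES: Array<[number, number]> = [
  [0x0600, 0x06ff], // Arabic
  [0x0750, 0x077f], // Arabic Supplement
  [0x08a0, 0x08ff], // Arabic Extended-A
  [0xfb50, 0xfdff], // Arabic Presentation Forms-A
  [0xfe70, 0xfeff], // Arabic Presentation Forms-B
]

// Fraction of non-whitespace characters that fall in an Arabic block.
// Returns 0 for empty or whitespace-only input.
function arabicRatio(text: string): number {
  let total = 0
  let arabic = 0
  for (let i = 0; i < text.length; i++) {
    if (/\s/.test(text.charAt(i))) continue
    total++
    const code = text.charCodeAt(i)
    if (ARABIC_RANGES.some(([lo, hi]) => code >= lo && code <= hi)) arabic++
  }
  return total === 0 ? 0 : arabic / total
}

console.log(arabicRatio('مرحبا بالعالم')) // 1
console.log(arabicRatio('hello')) // 0
```

A ratio near 0 for a PDF known to contain Arabic would suggest the extractor is still returning the wrong characters, which usually points at the font encoding in the PDF rather than at the calling code.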
