Handling embedded fonts in PDFs #2618

Woody1193 · 2024-05-02T11:09:42Z

I have a fairly generic script that extracts text from PDFs I receive. Recently, however, two PDFs failed this process. When I debugged it, none of the text returned by page.extract_text() matched what I could actually see in the document itself.

Digging deeper, I discovered three base fonts, which were also embedded:

{
  140012660527056: '/GEAKCA+Arial',
  140012660532816: '/GEAKDA+俵俽俹僑僔僢僋',
  140012660454864: '/GEAKBP+俵俽僑僔僢僋'
}

These fonts are different from the ones I've had to work with before, and I assume they are what caused the errors. However, I'm unable to find any references to these fonts online, and I'm not sure how to deal with this issue as I'm not particularly well-versed in the PDF standard. What steps should I take to debug this further, and how could I resolve this issue?

stefan6419846 · 2024-05-02T11:18:39Z

Collaborator

It is rather unlikely that we will be able to help you with this unless you provide a corresponding PDF file with an offending page for further analysis.

4 replies

Woody1193 May 2, 2024
Author

Sorry, I'd love to provide it, but it's confidential, so I'm afraid I cannot. However, I also tried selecting the text and copying it to Notepad and noticed the same issue, so I think it's a problem with my machine missing the font itself rather than an issue with the PDF's encoding.

pubpub-zz May 2, 2024
Collaborator

The fonts are embedded in the PDF. The issue is more likely due to a missing translation table (in the '/ToUnicode' field of the font).
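
One way to check for a missing '/ToUnicode' table is to list the fonts on each page and see which ones lack that entry. The following is a minimal sketch, assuming pypdf is installed. The helper names `fonts_missing_tounicode` and `report` are made up for illustration, and the sketch only inspects fonts in a page's own '/Resources' (not inherited resources or fonts used inside form XObjects). A missing '/ToUnicode' is a hint rather than proof, since simple fonts with a standard '/Encoding' can often be decoded without it.

```python
def _resolve(obj):
    """Follow an indirect reference if the object supports it (pypdf objects do)."""
    getter = getattr(obj, "get_object", None)
    return getter() if callable(getter) else obj


def fonts_missing_tounicode(font_dict):
    """Return the /BaseFont names of fonts that have no /ToUnicode entry.

    `font_dict` is the mapping stored under /Resources -> /Font on a page.
    """
    missing = []
    for resource_name, font in _resolve(font_dict).items():
        font = _resolve(font)
        if "/ToUnicode" not in font:
            # Fall back to the resource name (e.g. /F1) if /BaseFont is absent.
            missing.append(str(_resolve(font.get("/BaseFont", resource_name))))
    return missing


def report(path):
    """Print, per page, the fonts without /ToUnicode in the PDF at `path`."""
    from pypdf import PdfReader  # imported here so the helpers work without pypdf

    reader = PdfReader(path)
    for number, page in enumerate(reader.pages, start=1):
        resources = _resolve(page.get("/Resources", {}))
        fonts = resources.get("/Font", {})
        print(number, fonts_missing_tounicode(fonts))
```

Calling `report("some.pdf")` (a placeholder path) prints one line per page. Fonts that show up in that list are the first candidates to investigate when extract_text() returns text that does not match the visible page.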

pubpub-zz May 2, 2024
Collaborator

Isn't there a front page or index you could provide that would not contain confidential data?

stefan6419846 May 2, 2024
Collaborator

When copying fails, the cause is either an unexpectedly missing translation table, as already pointed out, a limitation of the application (different applications might handle this differently), or an intentional obfuscation technique.
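
On the font names in the original question: the two non-Latin names look like mojibake rather than real font names. One plausible explanation, stated here as an assumption and not confirmed anywhere in this thread, is that the names are stored as Shift-JIS bytes and were decoded as GBK. The sketch below reverses that guess; `undo_mojibake` is a hypothetical helper name.

```python
def undo_mojibake(name, wrong="gbk", right="shift_jis"):
    """Re-encode `name` with the codec it was wrongly decoded with, then decode it again."""
    return name.encode(wrong).decode(right)


# Font names exactly as reported in the question.
reported = ["/GEAKDA+俵俽俹僑僔僢僋", "/GEAKBP+俵俽僑僔僢僋"]

# Assumption: Shift-JIS bytes were misread as GBK.
decoded = [undo_mojibake(font_name) for font_name in reported]
```

If the guess holds, the names come out as ＭＳＰゴシック and ＭＳゴシック (MS PGothic and MS Gothic), which are common Japanese system fonts. That would explain why the names cannot be found online, but it does not by itself fix extraction, which still depends on how each font maps character codes to Unicode.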
