partition_msg is unable to process attachments #3006

Closed
MthwRobinson opened this issue May 13, 2024 · 4 comments · Fixed by #3142
Labels
needs follow up pptx Related to Microsoft PowerPoint (.pptx) file format

Comments

@MthwRobinson
Contributor

MthwRobinson commented May 13, 2024

To reproduce

import traceback

from unstructured.partition.auto import partition
from unstructured.partition.msg import partition_msg
from unstructured.staging.base import elements_to_json

filename = "example-docs/fake-email-multiple-attachments.msg"
elements = []  # stays empty if partitioning fails, so the JSON step below still runs
try:
    elements = partition_msg(
        filename=filename, process_attachments=True, attachment_partitioner=partition
    )
except RuntimeError as e:
    print(e)
except Exception:
    # The ValueError shown below lands here.
    print("ERROR about attachments")
    traceback.print_exc()

output_filename = "msg_mul_attach.json"
elements_to_json(elements, filename=output_filename)

The error that arises is:
ValueError: Invalid file /tmp/tmpo99fe8l4/Engineering Onboarding.pptx. The FileType.ZIP file type is not supported in partition.

@MthwRobinson MthwRobinson added bug Something isn't working pptx Related to Microsoft PowerPoint (.pptx) file format awaiting-response labels May 13, 2024
@MthwRobinson
Contributor Author

OS: Windows 11 on my PC, but I'm running the code in Databricks Notebooks or Google Colab, with the same result.
Python Version: Python 3.10.6
Unstructured Version: latest (0.13.7), since I ran %pip install -q --upgrade unstructured, but I don't know the command to check the version directly.
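
One way to check the installed version from inside a notebook is the standard-library `importlib.metadata` module. This is a minimal sketch; `installed_version` is a hypothetical helper name introduced here for illustration, not part of unstructured.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed.

    Hypothetical helper for illustration; it only wraps importlib.metadata.version.
    """
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the version string of the installed distribution, or None if absent.
print(installed_version("unstructured"))
```

Running `%pip show unstructured` in a notebook cell reports the same information.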

@scanny
Collaborator

scanny commented May 13, 2024

@MthwRobinson This appears to be either a corrupted message or a defect or limitation in how msg_parser extracts attachments. I'd say the next step is either to chalk this up to a fluke and wait for a recurrence, or to try another .msg file that is "known good" (shown to work with Outlook). In any case, this probably shouldn't be a file in example-docs/ unless it's there to demonstrate how we handle this sort of failure and is named accordingly.


Diagnostics

I'm able to reproduce this error.

I was able to detach the attachment using msg_parser, as partition_msg() does. However, the attachment is not a valid PPTX file and cannot be opened with PowerPoint.

On an attempt to open it, PowerPoint signals a "repair error":

PowerPoint found a problem with content in Engineering Onboarding.pptx.
PowerPoint can attempt to repair the presentation.
If you trust the source of this presentation, click Repair.

When clicking "Repair" it states:

Sorry, PowerPoint can't read Engineering Onboarding.pptx.

On inspection, the attachment binary appears to be a zip archive (the first two bytes of the file are "PK"). However, it cannot be unzipped and fails with this message:

$ unzip Engineering\ Onboarding.pptx
Archive:  Engineering Onboarding.pptx
  End-of-central-directory signature not found.  Either this file is not
  a zipfile, or it constitutes one disk of a multi-part archive.  In the
  latter case the central directory and zipfile comment will be found on
  the last disk(s) of this archive.
unzip:  cannot find zipfile directory in one of Engineering Onboarding.pptx or
        Engineering Onboarding.pptx.zip, and cannot find Engineering Onboarding.pptx.ZIP, period.
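
The same two checks (the "PK" prefix and the missing end-of-central-directory record) can be sketched in Python with the standard-library `zipfile` module. The helper names `diagnose_zip` and `make_zip_bytes` are hypothetical. The damage is simulated here by truncating a valid archive, which is an assumption made for illustration; the diagnostics above do not establish how the extracted bytes were actually damaged.

```python
import io
import zipfile


def diagnose_zip(data: bytes) -> dict:
    """Report whether bytes start like a zip and whether the archive is readable."""
    return {
        # Zip local file headers begin with the two bytes "PK".
        "starts_with_pk": data[:2] == b"PK",
        # is_zipfile() looks for the end-of-central-directory record,
        # the structure `unzip` reported as missing.
        "has_central_directory": zipfile.is_zipfile(io.BytesIO(data)),
    }


def make_zip_bytes() -> bytes:
    """Build a small valid archive in memory (a stand-in for a real .pptx)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("slide1.xml", "<p>hello</p>")
    return buf.getvalue()


good = make_zip_bytes()
# Assumption: simulate a damaged attachment by cutting the archive short,
# which drops the central directory stored at the end of the file.
truncated = good[:40]

print(diagnose_zip(good))
print(diagnose_zip(truncated))
```

To run the check on the detached attachment, pass the file's contents, for example `diagnose_zip(open(path, "rb").read())`.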

@MthwRobinson
Contributor Author

Sounds good, I'll update this issue to reflect removing that document from example-docs, and we can keep an eye out for other examples of corrupted files.

@MthwRobinson MthwRobinson changed the title PowerPoint (PPTX) attachment is detected as a ZIP file Remove fake-email-multiple-attachments.msg from example-docs May 13, 2024
@MthwRobinson MthwRobinson added needs follow up and removed awaiting-response bug Something isn't working labels May 13, 2024
@MthwRobinson MthwRobinson changed the title Remove fake-email-multiple-attachments.msg from example-docs partition_msg is unable to process attachments May 15, 2024
@scanny
Collaborator

scanny commented Jun 4, 2024

This turned out to be a defect in msg_parser. When it is replaced with python-oxmsg for parsing MSG files, this and other attachments are extracted fine.

github-merge-queue bot pushed a commit that referenced this issue Jun 5, 2024
**Summary**
`partition_msg()` previously used the `msg_parser` library for parsing
Outlook MSG email files (.msg files). The `msg_parser` library is
unmaintained and has several major shortcomings such as not being able
to parse MSG files with 8-bit encoded strings and not reliably
extracting attachments.

Use the new and permissively licenced `python-oxmsg` library instead.

**Additional Context**
For reviewability purposes, this PR temporarily places the new
`partition_msg()` implementation in `new_msg.py` and references that
implementation from `msg.py`. `new_msg.py` will be renamed to `msg.py`
in a closely following PR. This avoids a very messy interleaving of
hunks in a diff between the old and re-written `partition_msg()`
implementation.

Fixes #2481 
Fixes #3006