Add support for large OOXML files #108327

Open
masseyke opened this issue May 6, 2024 · 1 comment
Labels
:Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP >enhancement Team:Data Management Meta label for data/management team team-discuss

Comments

masseyke commented May 6, 2024

Description

When running in streaming mode, as we do in the attachment processor (as opposed to reading from a file), Tika detects an OOXML file by looking for a zip entry named [Content_Types].xml. That entry has to appear within the first 16 MB read from the zip stream. As a result, an OOXML file larger than 16 MB that is sent to the attachment processor can effectively be ignored, because we never detect its type and so never parse it as an OOXML file.
It would be possible to raise the limit above 16 MB, and we could potentially expose a new config property for this. For example, the following (hack) code in TikaImpl sets the limit to 30 MB:

((DefaultZipContainerDetector)((DefaultDetector)TIKA_INSTANCE.getDetector()).getDetectors().get(2)).setMarkLimit(30 * 1024 * 1024);
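To illustrate why an entry beyond the look-ahead window goes undetected, here is a minimal, standard-library-only sketch. It is not Tika's implementation: the class `MarkLimitSketch`, the helpers `buildZip` and `looksLikeOoxml`, and the padding entry name are all hypothetical, and the sizes are scaled down from 16 MB so the example runs quickly. It assumes only that detection scans zip entries in stream order and gives up once the limit is reached.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical illustration only; names and sizes are assumptions, not Tika code.
class MarkLimitSketch {

    static final String CONTENT_TYPES = "[Content_Types].xml";

    /** Builds a zip with a large padding entry first and [Content_Types].xml last. */
    public static byte[] buildZip(int paddingBytes) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            // Disable compression so the padding really occupies paddingBytes in the stream.
            zip.setLevel(Deflater.NO_COMPRESSION);
            zip.putNextEntry(new ZipEntry("word/media/padding.bin"));
            zip.write(new byte[paddingBytes]);
            zip.closeEntry();
            zip.putNextEntry(new ZipEntry(CONTENT_TYPES));
            zip.write("<Types/>".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    /** Scans entry names, but only within the first markLimit bytes of the data. */
    public static boolean looksLikeOoxml(byte[] data, int markLimit) {
        int visible = Math.min(data.length, markLimit);
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(data, 0, visible))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (CONTENT_TYPES.equals(entry.getName())) {
                    return true;
                }
            }
        } catch (IOException e) {
            // The window ended in the middle of an entry: treat the type as undetected.
        }
        return false;
    }

    public static void main(String[] args) {
        byte[] zip = buildZip(64 * 1024);            // stands in for a file larger than the limit
        System.out.println(looksLikeOoxml(zip, 16 * 1024));   // window too small
        System.out.println(looksLikeOoxml(zip, 1024 * 1024)); // window covers the whole file
    }
}
```

In this sketch, raising the limit so that it covers the position of [Content_Types].xml is what makes detection succeed, which mirrors the effect of the setMarkLimit call above.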
@elasticsearchmachine
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)
