Add support for large OOXML files #108327
Labels
:Data Management/Ingest Node
Execution or management of Ingest Pipelines including GeoIP
>enhancement
Team:Data Management
Meta label for data/management team
team-discuss
Description
When running in streaming mode, as we do in the attachment processor (as opposed to reading from a file), Tika detects the type of an OOXML file by looking for a zip entry named `[Content_Types].xml`. That zip entry has to be within the first 16 MB read from the zip file, though. So an OOXML file larger than 16 MB that is sent to the attachment processor may effectively be ignored, because we do not detect its type and therefore do not parse it as an OOXML file.

It would be possible to change the limit to something larger than 16 MB. We could potentially expose a new config property to set this higher. The following (hack) code in TikaImpl, for example, sets the limit to 30 MB:
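To make the failure mode concrete, here is a minimal sketch of why the position of `[Content_Types].xml` matters under a read limit. It is not Tika's implementation and not the TikaImpl change mentioned above. It uses only the JDK's `java.util.zip` classes, and the names `OoxmlMarkLimitSketch`, `BoundedStream`, `looksLikeOoxml`, and `buildZip` are hypothetical. The sizes are scaled down from megabytes to kilobytes so the example runs quickly.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.zip.Deflater;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Illustration only: mimics "look for [Content_Types].xml within the first N bytes".
final class OoxmlMarkLimitSketch {
    static final String CONTENT_TYPES = "[Content_Types].xml";

    // Wraps a stream and reports end-of-stream once `limit` bytes have been read.
    static final class BoundedStream extends InputStream {
        private final InputStream in;
        private long remaining;

        BoundedStream(InputStream in, long limit) {
            this.in = in;
            this.remaining = limit;
        }

        @Override
        public int read() throws IOException {
            if (remaining <= 0) return -1;
            int b = in.read();
            if (b >= 0) remaining--;
            return b;
        }

        @Override
        public int read(byte[] buf, int off, int len) throws IOException {
            if (remaining <= 0) return -1;
            int n = in.read(buf, off, (int) Math.min(len, remaining));
            if (n > 0) remaining -= n;
            return n;
        }

        @Override
        public void close() throws IOException {
            in.close();
        }
    }

    // True only if the marker entry is reached before `limit` bytes are consumed.
    static boolean looksLikeOoxml(InputStream in, long limit) {
        try (ZipInputStream zip = new ZipInputStream(new BoundedStream(in, limit))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (CONTENT_TYPES.equals(entry.getName())) return true;
            }
        } catch (IOException e) {
            // The limit cut the zip off mid-entry, so detection gives up.
        }
        return false;
    }

    // Builds a test zip with one large uncompressed entry and the marker entry
    // placed either before or after it.
    static byte[] buildZip(boolean contentTypesFirst, int paddingBytes) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            zip.setLevel(Deflater.NO_COMPRESSION); // keep the padding at full size
            if (contentTypesFirst) writeEntry(zip, CONTENT_TYPES, 16);
            writeEntry(zip, "word/document.xml", paddingBytes);
            if (!contentTypesFirst) writeEntry(zip, CONTENT_TYPES, 16);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    private static void writeEntry(ZipOutputStream zip, String name, int size) throws IOException {
        zip.putNextEntry(new ZipEntry(name));
        zip.write(new byte[size]);
        zip.closeEntry();
    }
}
```

With a zip built by `buildZip(false, 64 * 1024)`, `looksLikeOoxml` under a 16 KB limit never reaches the marker entry, while a limit larger than the file does. That is the same shape as the 16 MB case described here: raising the limit, or making it configurable, only helps files whose `[Content_Types].xml` entry falls inside the new limit.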