Cannot read .doc file #10

abedshaaban · 2023-11-26T09:37:53Z

Description

An error occurred when reading a .doc file.

Error: text-extractor: could not find a method to handle application/x-cfb

I looked into the code: application/x-cfb is not included in the MimeType type declaration, and doc is not included in FileExtension.
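
To illustrate how an error like this can arise, here is a minimal sketch of MIME-keyed dispatch. It is an assumption about the general mechanism, not the library's actual code: ExtractionMethod and findMethod are hypothetical names, loosely modelled on the error message above.

```typescript
// Hypothetical shape of an extraction method: the MIME types it accepts and
// the function that turns the raw bytes into text.
interface ExtractionMethod {
	mimes: string[]
	apply: (input: Uint8Array) => Promise<string>
}

// Hypothetical dispatcher: pick the first method that lists the detected MIME
// type, and fail loudly when no method lists it.
const findMethod = (
	methods: ExtractionMethod[],
	mime: string,
): ExtractionMethod => {
	const method = methods.find((candidate) => candidate.mimes.includes(mime))
	if (!method) {
		throw new Error(`text-extractor: could not find a method to handle ${mime}`)
	}

	return method
}
```

Under this assumption, a file detected as application/x-cfb fails at the lookup step, before any parser runs, unless some method lists that type in its mimes array.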

Library version

3.0.2

Node version

20.9.0

Typescript version (if you are using it)

No response

gamemaker1 · 2023-11-29T07:48:40Z

Hi,

office-text-extractor uses mammoth under the hood to parse MS Word files.

mammoth does not support extracting text from .doc files, only from .docx files.

I tried to write an extractor for .doc files myself, but I was not able to extract the XML contents from the .doc file. Here is the code, if you want to play with it:

// source/parsers/docx.ts
// The text extractor for DOCX/DOC files.

import { type Buffer } from 'node:buffer'
import { extractRawText as parseWordFile } from 'mammoth'
import { unzip } from 'fflate'
import { parseStringPromise as xmlToJson } from 'xml2js'

import type { TextExtractionMethod } from '../lib.js'

export class DocExtractor implements TextExtractionMethod {
	/**
	 * The type(s) of input acceptable to this method.
	 */
	mimes = [
		'application/x-cfb',
		'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
	]

	/**
	 * Extract text from a DOCX/DOC file if possible.
	 *
	 * @param input The buffer containing the file.
	 * @returns The text extracted from the input.
	 */
	apply = async (input: Buffer): Promise<string> => {
		try {
			// Convert the DOCX to text and return the text.
			const parsedDocx = await parseWordFile({ buffer: input })
			return parsedDocx.value
		} catch (caughtError: unknown) {
			// If the file is a DOC file, then JSZIP will fail to unzip it.
			const error = caughtError as Error
			if (error.message?.includes('Corrupted zip or bug')) {
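				// xml2js expects a string, so decode the raw bytes as UTF-8 first.
				const contents = await unzipBuffer(input)
				const json = await xmlToJson(new TextDecoder().decode(contents))
				const lines = await parseDocSection(json)

				// Fall back to an empty string if no text was collected.
				return (lines ?? []).join('\n')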
			} else {
				// If it is not a DOC file, let the error propagate.
				throw caughtError
			}
		}
	}
}

/**
 * Unzip a DOC file, and return the XML in it.
 *
 * @param input The buffer containing the file.
 *
 * @returns The raw bytes of the XML.
 */
const unzipBuffer = async (input: Buffer): Promise<Uint8Array> => {
	// View only the bytes that belong to this buffer. Small Node buffers can
	// share a larger pooled ArrayBuffer, so `input.buffer` alone may include
	// unrelated data.
	const zipBuffer = new Uint8Array(
		input.buffer,
		input.byteOffset,
		input.byteLength,
	)
	const doc = await new Promise<Record<string, Uint8Array>>(
		(resolve, reject) => {
			unzip(zipBuffer, (error, result) => {
				if (error) reject(error)
				else resolve(result)
			})
		},
	)

	const file = doc['word/document.xml']
	if (!file) throw new Error('Invalid .doc file, could not find document.xml.')

	return file
}

/**
 * Extracts text from a section of the document, recursively.
 *
 * @param docSection The section of the doc, converted to JSON from XML.
 * @param collectedText The lines of text parsed from the document so far.
 *
 * @returns The lines of text in the document.
 */
const parseDocSection = async (
	docSection: any,
	collectedText?: string[],
): Promise<string[] | undefined> => {
	// Keep track of the text being collected.
	const beingCollectedText = collectedText ?? []

	// Parse the section according to what type it is.
	if (Array.isArray(docSection)) {
		// If it is an array, loop through its elements.
		for (const element of docSection) {
			// Collect all the pieces of text from the array.
			if (typeof element === 'string' && element !== '') {
				beingCollectedText.push(element)
			} else {
				// However, if it is an object or another array, call this function
				// again to parse that.
				await parseDocSection(element, beingCollectedText)
			}
		}

		// Finally, return the collected text.
		return beingCollectedText
	}

	// If the section is an object, loop through its properties.
	if (typeof docSection === 'object' && docSection !== null) {
		for (const property of Object.keys(docSection)) {
			// Get the value of the property.
			const value = docSection[property]

			// The `docx` format stores the actual text inside the `w:t` or `_`
			// properties, so extract text from those properties.

			// Check if it is a string or array that contains a string. If it is
			// either, then collect the text content.
			if (typeof value === 'string') {
				if ((property === 'w:t' || property === '_') && value !== '') {
					beingCollectedText.push(value)
				}
			} else if (Array.isArray(value) && typeof value[0] === 'string') {
				if ((property === 'w:t' || property === '_') && value[0] !== '') {
					beingCollectedText.push(value[0])
				}
			} else {
				// However, if it is an object or another array, call this function
				// again to parse that.
				await parseDocSection(value, beingCollectedText)
			}
		}

		// Finally, return the collected text.
		return beingCollectedText
	}
}

The unzip library, fflate, throws the following error:

Error {
  code: 14,
  message: 'unknown compression type 2346',
}

If you can fix it or work around it in any way, please do let me know!
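
The fflate error is consistent with the file not being a ZIP archive at all. Legacy .doc files use the Compound File Binary (CFB) container, which is what the application/x-cfb type in the original error refers to, whereas .docx files are ZIP archives. A minimal sketch of checking the leading bytes before choosing a parser follows; detectContainer and startsWith are hypothetical names, not part of office-text-extractor, mammoth, or fflate.

```typescript
// Leading bytes that identify the two container formats discussed here.
const CFB_SIGNATURE = [0xd0, 0xcf, 0x11, 0xe0, 0xa1, 0xb1, 0x1a, 0xe1]
const ZIP_SIGNATURE = [0x50, 0x4b, 0x03, 0x04] // "PK\x03\x04", a local file header

type ContainerKind = 'cfb' | 'zip' | 'unknown'

// Hypothetical helper: check whether the input begins with the given bytes.
const startsWith = (bytes: Uint8Array, signature: number[]): boolean =>
	bytes.length >= signature.length &&
	signature.every((byte, index) => bytes[index] === byte)

// Hypothetical helper: classify the input by its leading bytes only. It does
// not validate the rest of the file.
const detectContainer = (bytes: Uint8Array): ContainerKind => {
	if (startsWith(bytes, CFB_SIGNATURE)) return 'cfb'
	if (startsWith(bytes, ZIP_SIGNATURE)) return 'zip'
	return 'unknown'
}
```

With a check like this, the fallback could report an unsupported-format error for CFB input instead of attempting to unzip it. Actually reading text out of a CFB file would still need a dedicated parser for that format.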

abedshaaban added the bug label Nov 26, 2023

gamemaker1 closed this as completed Nov 29, 2023

gamemaker1 reopened this Nov 29, 2023
