Cannot read .doc file #10

Open
abedshaaban opened this issue Nov 26, 2023 · 1 comment
Labels
bug Something isn't working

Comments

@abedshaaban

Description

An error occurred when reading a .doc file.

Error: text-extractor: could not find a method to handle application/x-cfb

I looked into the code, and the application/x-cfb type is not declared in MimeType, nor is doc declared in FileExtension.

Library version

3.0.2

Node version

20.9.0

TypeScript version (if you are using it)

No response

abedshaaban added the bug label on Nov 26, 2023
@gamemaker1
Owner

Hi,

office-text-extractor uses mammoth under the hood to parse MS Word files.

mammoth does not support extracting text from .doc files; it only handles .docx.

I tried to write an extractor for .doc files myself, but I was not able to extract the XML contents from the .doc file. Here is the code, if you want to play with it:

// source/parsers/docx.ts
// The text extractor for DOCX/DOC files.

import { type Buffer } from 'node:buffer'
import { extractRawText as parseWordFile } from 'mammoth'
import { unzip } from 'fflate'
import { parseStringPromise as xmlToJson } from 'xml2js'

import type { TextExtractionMethod } from '../lib.js'

export class DocExtractor implements TextExtractionMethod {
	/**
	 * The type(s) of input acceptable to this method.
	 */
	mimes = [
		'application/x-cfb',
		'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
	]

	/**
	 * Extract text from a DOCX/DOC file if possible.
	 *
	 * @param input The buffer containing the file.
	 * @returns The text extracted from the input.
	 */
	apply = async (input: Buffer): Promise<string> => {
		try {
			// Convert the DOCX to text and return the text.
			const parsedDocx = await parseWordFile({ buffer: input })
			return parsedDocx.value
		} catch (caughtError: unknown) {
			// If the file is a DOC file, then JSZIP will fail to unzip it.
			const error = caughtError as Error
			if (error.message?.includes('Corrupted zip or bug')) {
				const contents = await unzipBuffer(input)
				const json = await xmlToJson(contents)
				const lines = await parseDocSection(json)

				// Join the collected lines, falling back to an empty string.
				const formattedText = (lines ?? []).join('\n')
				return formattedText
			} else {
				// If it is not a DOC file, let the error propagate.
				throw caughtError
			}
		}
	}
}

/**
 * Unzip a DOC file, and return the XML in it.
 *
 * @param input The buffer containing the file.
 *
 * @returns The contents of `word/document.xml`, decoded as UTF-8.
 */
const unzipBuffer = async (input: Buffer): Promise<string> => {
	// View only this buffer's bytes. `input.buffer` may be a larger, shared
	// ArrayBuffer, so the offset and length must be passed explicitly.
	const zipBuffer = new Uint8Array(
		input.buffer,
		input.byteOffset,
		input.byteLength,
	)
	const doc = await new Promise<Record<string, Uint8Array>>(
		(resolve, reject) => {
			unzip(zipBuffer, (error, result) => {
				if (error) reject(error)
				else resolve(result)
			})
		},
	)

	const file = doc['word/document.xml']
	if (!file) throw new Error('Invalid .doc file, could not find document.xml.')

	// Decode the raw bytes so that the XML parser receives a string.
	return new TextDecoder().decode(file)
}

/**
 * Extracts text from a section of the document, recursively.
 *
 * @param docSection The section of the doc, converted to JSON from XML.
 * @param collectedText The lines of text parsed from the document so far.
 *
 * @returns The lines of text in the document.
 */
const parseDocSection = async (
	docSection: any,
	collectedText?: string[],
): Promise<string[] | undefined> => {
	// Keep track of the text being collected.
	const beingCollectedText = collectedText ?? []

	// Parse the section according to what type it is.
	if (Array.isArray(docSection)) {
		// If it is an array, loop through its elements.
		for (const element of docSection) {
			// Collect all the pieces of text from the array.
			if (typeof element === 'string' && element !== '') {
				beingCollectedText.push(element)
			} else {
				// However, if it is an object or another array, call this function
				// again to parse that.
				await parseDocSection(element, beingCollectedText)
			}
		}

		// Finally, return the collected text.
		return beingCollectedText
	}

	// If the section is an object, loop through its properties.
	if (docSection !== null && typeof docSection === 'object') {
		for (const property of Object.keys(docSection)) {
			// Get the value of the property.
			const value = docSection[property]

			// The `docx` format stores the actual text inside the `w:t` or `_`
			// properties, so extract text from those properties.

			// Check if it is a string or array that contains a string. If it is
			// either, then collect the text content.
			if (typeof value === 'string') {
				if ((property === 'w:t' || property === '_') && value !== '') {
					beingCollectedText.push(value)
				}
			} else if (Array.isArray(value) && typeof value[0] === 'string') {
				if ((property === 'w:t' || property === '_') && value[0] !== '') {
					beingCollectedText.push(value[0])
				}
			} else {
				// However, if it is an object or another array, call this function
				// again to parse that.
				await parseDocSection(value, beingCollectedText)
			}
		}

		// Finally, return the collected text.
		return beingCollectedText
	}
}

The unzip library, fflate, throws the following error:

Error {
  code: 14,
  message: 'unknown compression type 2346',
}
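
A possible explanation, offered as an assumption rather than a confirmed diagnosis: application/x-cfb refers to the Compound File Binary container used by legacy .doc files, which is not a ZIP archive, so an unzip library would be expected to reject it. The sketch below tells the two containers apart by their leading signature bytes. The detectContainer helper and the Container type are hypothetical names, not part of office-text-extractor, mammoth, or fflate.

```typescript
// Hypothetical helper, not part of office-text-extractor, mammoth, or fflate.
type Container = 'cfb' | 'zip' | 'unknown'

// Signature of a Compound File Binary container (legacy .doc, .xls, .ppt).
const CFB_SIGNATURE = [0xd0, 0xcf, 0x11, 0xe0, 0xa1, 0xb1, 0x1a, 0xe1]
// Signature of a ZIP local file header (.docx, .xlsx, .pptx).
const ZIP_SIGNATURE = [0x50, 0x4b, 0x03, 0x04]

// Check whether `bytes` begins with every byte of `signature`.
const startsWith = (bytes: Uint8Array, signature: number[]): boolean =>
	bytes.length >= signature.length &&
	signature.every((byte, index) => bytes[index] === byte)

// Classify a file by its first few bytes.
const detectContainer = (bytes: Uint8Array): Container => {
	if (startsWith(bytes, CFB_SIGNATURE)) return 'cfb'
	if (startsWith(bytes, ZIP_SIGNATURE)) return 'zip'
	return 'unknown'
}
```

Since a Node Buffer is also a Uint8Array, the input passed to apply could be checked this way before any unzip attempt. A 'cfb' result would suggest that the file needs a binary .doc parser rather than an unzip step.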

If you can fix it or work around it in any way, please do let me know!
