pull sheet names as part of text from xlsx files? #14

Open
chazzmoney opened this issue Mar 18, 2024 · 5 comments
Labels
enhancement New feature or request

Comments

@chazzmoney

chazzmoney commented Mar 18, 2024

Description

When pulling text from a spreadsheet, the current extractor does not return the sheet names in the text. It would be GREAT if there was an option to preface the sheet text with the sheet name.

Why

Often, important contextual information is included in sheet names.

It would be easy to implement: the office-text-extractor code already has the sheet names, since the sheet data is accessed via the sheet name. A simple boolean flag controlling whether the sheet names are written into the === separator that marks new sheet text could be a solution. It could default to false for backward compatibility.
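As a rough sketch of what such a flag might look like: the types, the `includeSheetNames` name, and the exact placement of the name on the === line are all hypothetical and not taken from the office-text-extractor source. The row rendering is a simplified stand-in for the YAML output.

```typescript
// Hypothetical shapes for illustration; not the library's actual types.
type Row = Record<string, string | number>;
type Sheet = { name: string; rows: Row[] };

// Render one row as "header: value" lines (a simplified stand-in for YAML).
function renderRow(row: Row): string {
  return Object.entries(row)
    .map(([header, value]) => `${header}: ${value}`)
    .join("\n");
}

// Rows are joined with "---" and sheets with "===".
// With includeSheetNames set, each sheet is introduced by "=== <name>" instead.
function renderWorkbook(sheets: Sheet[], includeSheetNames = false): string {
  const bodies = sheets.map((s) => s.rows.map(renderRow).join("\n---\n"));
  if (!includeSheetNames) {
    // Assumed current behaviour: a bare separator between sheets.
    return bodies.join("\n===\n");
  }
  return sheets.map((s, i) => `=== ${s.name}\n${bodies[i]}`).join("\n");
}
```

With the flag left at its default, the output keeps the bare === line between sheets, so existing consumers would see no change under these assumptions.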

Alternatives

I mean, I love the all-in-one nature of office-text-extractor, but I could process the files myself instead.

@chazzmoney chazzmoney added the enhancement New feature or request label Mar 18, 2024
@chazzmoney
Author

I'd be happy to create a pull request for this, but I'm not sure where you would prefer to place such a boolean. If you let me know, I'll put one together.

@gamemaker1
Owner

Hi,

Thanks for opening this issue!

I would definitely like this to be the default behaviour of the library; I'm not sure why I hadn't done this in the first place. A PR that appends the sheet name near the === separator (on the same line? or the next line? let me know what would be better) sounds good.

I suppose we could add a boolean option to configure this, in the constructor of the ExcelExtractor class. But I don't think it is needed.

Regards,
Vedant

@chazzmoney
Author

chazzmoney commented Mar 19, 2024

I know this could be a breaking change, which was the intent of the boolean. Not sure how many users you have that need no format changes.

Speaking of formats, what format is this? I see the row by row conversion to YAML and the row / sheet separators. I know '---' is the document header syntax, but I'm not familiar with '==='. Also, is there a reason you picked YAML instead of, say, CSV?

I'm not being critical here - I'm just curious what you had in mind and the use cases. Want to make sure that whatever I put in aligns with the plans.

@gamemaker1
Owner

> Not sure how many users you have that need no format changes.

I have no idea either, but you're right - it is a breaking change, and to be safe we should hide it behind a boolean flag that is false by default.
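A minimal sketch of how a constructor option that defaults to false could look. The class name `ExcelExtractorSketch`, the `includeSheetNames` option, and the `sheetSeparator` helper are hypothetical and only illustrate the idea; the real `ExcelExtractor` constructor may look different.

```typescript
// Hypothetical option name; the real ExcelExtractor constructor may differ.
interface SheetNameOptions {
  includeSheetNames?: boolean;
}

class ExcelExtractorSketch {
  private readonly includeSheetNames: boolean;

  constructor(options: SheetNameOptions = {}) {
    // Default to false so existing output stays unchanged.
    this.includeSheetNames = options.includeSheetNames ?? false;
  }

  // Build the separator line that introduces a sheet's text.
  sheetSeparator(sheetName: string): string {
    return this.includeSheetNames ? `=== ${sheetName}` : "===";
  }
}
```

Callers who never pass the option would keep getting the bare === line, and only those who opt in would see the sheet name.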

> Speaking of formats, what format is this? I see the row by row conversion to YAML and the row / sheet separators. I know '---' is the document header syntax, but I'm not familiar with '==='. Also, is there a reason you picked YAML instead of, say, CSV?

I did not follow a format; I made my own 😅 Using --- to separate rows and === to separate sheets is completely arbitrary.

I chose YAML because it maintained a text-sense of structure instead of a grid-sense of structure, i.e., col-header:value instead of value,value,value.

The 'text-based structure' was actually useful to me in the project I wrote this package for, where I was extracting text from files to identify 'topics', primarily based on position and frequency of words.
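To illustrate the difference between the two shapes, here is a small sketch that renders the same row both ways. The function names are hypothetical, the key-value rendering is a simplified stand-in for real YAML, and the CSV line does no quoting or escaping.

```typescript
// A spreadsheet row keyed by column header (hypothetical shape).
type Row = Record<string, string | number>;

// Text-sense of structure: one "col-header: value" line per cell.
function toKeyValueText(row: Row): string {
  return Object.entries(row)
    .map(([header, value]) => `${header}: ${value}`)
    .join("\n");
}

// Grid-sense of structure: values only, separated by commas (no escaping).
function toCsvLine(row: Row): string {
  return Object.values(row).join(",");
}

const example: Row = { name: "Ada", role: "engineer" };
console.log(toKeyValueText(example));
console.log(toCsvLine(example));
```

In the key-value form every value sits next to its column header, which is what keeps each word's context available to position- and frequency-based analysis.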

@gamemaker1
Owner

That said, I am open to adding more options that configure the format of the output.

2 participants