Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multiline Cells do not have normalized line endings #140

Open
shuebner opened this issue Aug 28, 2021 · 3 comments
Open

Multiline Cells do not have normalized line endings #140

shuebner opened this issue Aug 28, 2021 · 3 comments

Comments

@shuebner
Copy link

Hi all,
when mapping multiline cells to a string property, the string does not have the usual line endings of a .NET string in the runtime environment, but will always have LF line endings. This is because NPOI returns it that way. I assume this stems from NPOI's XML processor replacing all line endings in the XLSX's internal XML files with LF on reading (as it should be according to the XML Spec, see https://www.w3.org/TR/REC-xml/#sec-line-ends).

(Interestingly enough, multiline cells from the old XLS format will also be returned by NPOI with LF line endings. Not sure why that is, but hey, at least it is consistent across formats.)

This leads to .NET users of ExcelMapper who want to get the lines of a multiline cell having to be aware of this implementation detail of XLSX. They cannot just split the string with Environment.NewLine as they would with pretty much every other string they encounter, but have to split with \n. Even worse, people would not even notice on Linux, but get subtle bugs on Windows.

I already tried to address this issue at NPOI, but they dismissed it. To my great surprise their goal apparently is not to abstract away how the XLSX format works under the hood, but still require users to know intricate details such as this.

Maybe ExcelMapper is the right layer of abstraction to get rid of this pesky subtlety.
If it is, I suggest adding a default mapping mechanism to a string[] property that will map multiline string cells.
I also suggest introducing the ability to have line endings normalized to Environment.NewLine when a multiline cell is mapped to a string property. As this is a breaking change, it could be enabled by configuration.
To make this even more general (and probably simpler to code), there could be a default mapping to an array of any type by parsing the individual lines of a multiline string cell. This would also work well with custom value parsers.

What do you think?

@mganss
Copy link
Owner

mganss commented Aug 29, 2021

Your suggestions sound very good to me. Perhaps this could be extended to all property types that implement ICollection, not only arrays. From the top of my hat this shouldn't be hard to implement, perhaps even just a few additional lines here:

case CellType.String:
default:
if (targetColumn.Json)
return JsonSerializer.Deserialize(cell.StringCellValue, targetColumn.PropertyType);
else
return cell.StringCellValue;

Would you like to take a crack at it?

BTW the newline behavior is the same in other parser libraries, e.g. AngleSharp or Linq to XML:

var xml = "<x>12\r\n34</x>";
var e = XElement.Parse(xml);
var s = e.Value; // -> "12\n34";

@shuebner
Copy link
Author

I glad to hear it. Yes, I would like a crack at it. But as you can tell from the crawling pace on the other issue over at XmlSchemaClassGenerator, I do not have much time. This will probably change for the next two weeks, though.

Regarding the XML parser behavior, I take no issue with something that just parses XML (without even deserializing into .NET objects) leaving the strings in the state in which the XML processor left them as per the XML spec. The fact that some parsers use .NET strings at all instead of, say a UTF-8 byte[], is probably for convenience or an artifact of their own helper libraries.
But something that performs mapping between the XML world and the .NET world should arguably normalize the strings into the expected .NET line endings. This is one layer up from XML parsing and should be about .NET semantics (like environment-dependent line ending) and not about XML syntax anymore. This goes even more for libraries providing access to the content of Excel files where the fact that XLSX happens to be XML-based should IMHO be of no concern to the user at all.

@mganss
Copy link
Owner

mganss commented Aug 30, 2021

No worries, we don't have any deadlines 😄

It's an interesting topic which I hadn't given much thought in this context (git's autocrlf options have been driving me mad every once in a while, though 🙄). Regarding serializers, both XmlSerializer and JsonSerializer leave the newline characters alone, i.e. they pass through what's in the original XML or JSON. This is probably the best approach as it's the only way to allow roundtripping. Unfortunately, it's out of the question here due to NPOI's behavior.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants