Bug description
When working with a UTF-8 flat file that contains extended characters, the line is tokenized incorrectly because String.substring is used instead of working with byte arrays. This happens because a character such as "è" is made up of two bytes (so two "positions" in the file), but in a string it counts as one position, so the wrong token is extracted.
Environment
All versions
Steps to reproduce
Use a fixed-length file with some fields and put a text such as "aleè" in one of the fields.
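To make the character/byte difference in the report concrete, here is a minimal sketch using only the JDK. The class and method names (Utf8WidthDemo, charCount, utf8ByteCount) are made up for illustration and are not part of Spring Batch. It only shows that "aleè" occupies four character positions but five bytes in UTF-8; whether that matters depends on whether the file's column layout is defined in characters or in bytes.

```java
import java.nio.charset.StandardCharsets;

public class Utf8WidthDemo {

    // Number of character positions as seen by String.substring (UTF-16 code units).
    static int charCount(String s) {
        return s.length();
    }

    // Number of bytes the same text occupies in a UTF-8 encoded file.
    static int utf8ByteCount(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // "ale\u00e8" is "aleè"; the escape avoids depending on the source file encoding.
        String field = "ale\u00e8";
        System.out.println(charCount(field) + " chars, " + utf8ByteCount(field) + " bytes");
        // prints "4 chars, 5 bytes"
    }
}
```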
Hi, I thought I had the same issue with the FixedLengthTokenizer, but it turned out that the encoding on the FlatFileItemReader had to be explicitly set to UTF-8 in order for the line to be read correctly from the UTF-8 encoded file I was reading. After this, the FixedLengthTokenizer worked fine for me.
So, flatFileItemReader.setEncoding("UTF-8"); did the trick for me.
When working with a flat file with UTF-8 that contains extended characters the line is tokenized in a wrong way
In this case, you need to explicitly set the encoding of the reader to UTF-8, because as of v4 the default is the JVM's default encoding, which could be different from UTF-8 (and I guess that's what is happening in your case).
The default encoding has been changed to UTF-8 in FlatFileItemReader as of v5 (see df8dac1), which should fix this issue without any code change. But as mentioned by @kunivan, you can and should set the encoding of the reader to match the encoding of your input file.
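For anyone who wants to see the effect of an encoding mismatch without setting up a job, here is a rough JDK-only sketch. It does not use Spring Batch at all: the class EncodingMismatchDemo, the split helper, and the two-field layout (name in columns 1-5, code in columns 6-9) are assumptions made up for this example. It decodes the same UTF-8 bytes once as UTF-8 and once as ISO-8859-1 (standing in for a non-UTF-8 JVM default), then cuts fixed-length fields with substring.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingMismatchDemo {

    // Hypothetical layout: name = first 5 characters, code = next 4 characters.
    static String[] split(byte[] rawLine, Charset charset) {
        String line = new String(rawLine, charset);
        return new String[] { line.substring(0, 5), line.substring(5, 9) };
    }

    public static void main(String[] args) {
        // The line as stored on disk in UTF-8: "aleè 0042" (10 bytes, 9 characters).
        byte[] raw = "ale\u00e8 0042".getBytes(StandardCharsets.UTF_8);

        // Decoded with the right charset, the columns line up.
        String[] ok = split(raw, StandardCharsets.UTF_8);
        System.out.println("[" + ok[1] + "]");   // prints "[0042]"

        // Decoded as ISO-8859-1, "è" becomes two characters and every later column shifts by one.
        String[] bad = split(raw, StandardCharsets.ISO_8859_1);
        System.out.println("[" + bad[1] + "]");  // prints "[ 004]"
    }
}
```

The shifted second field is the same symptom as in the report, which is consistent with the explanation that the reader was decoding the file with a charset other than UTF-8.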