FixedLengthTokenizer wrong tokenization with utf-8 extended characters #3714

jtremiel · 2020-05-18T10:10:51Z

Bug description
When working with a UTF-8 flat file that contains extended characters, the line is tokenized incorrectly because String.substring is used instead of working with byte arrays. This happens because a character such as "è" is made up of two bytes (so two "positions" in the file), but in a string it counts as one position, so the resulting token is wrong.
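
The difference between character positions and byte positions described here can be checked with plain Java. This is only a minimal sketch using the standard library; the class and method names (`Utf8Widths`, `charCount`, `utf8ByteCount`) are made up for illustration and are not part of Spring Batch.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only (not a Spring Batch class).
class Utf8Widths {

    // Number of UTF-16 code units: the unit that String.substring indexes by.
    static int charCount(String s) {
        return s.length();
    }

    // Number of bytes the same text occupies in a UTF-8 encoded file.
    static int utf8ByteCount(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // "ale\u00e8" is "aleè"; the escape avoids depending on the source file encoding.
        String field = "ale\u00e8";
        System.out.println(charCount(field));     // prints 4
        System.out.println(utf8ByteCount(field)); // prints 5
    }
}
```

So a layout defined in bytes and a layout defined in characters disagree by one position for every two-byte character that precedes a column.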

Environment
All versions

Steps to reproduce
Use a fixed-length file with some fields and put a text like "aleè" in the data of one field.

kunivan · 2021-05-27T10:59:47Z

Hi, I thought I had the same issue with the FixedLengthTokenizer, but it turned out that the encoding on the FlatFileItemReader had to be explicitly set to UTF-8 in order for the line to be read correctly from the UTF-8 encoded file I was reading. After this, the FixedLengthTokenizer worked fine for me.
So, flatFileItemReader.setEncoding("UTF-8"); did the trick for me.

fmbenhassine · 2022-05-17T20:32:47Z

@jtremiel

> When working with a flat file with UTF-8 that contains extended characters the line is tokenized in a wrong way

In this case, you need to explicitly set the encoding of the reader to UTF-8, because as of v4 the default is the JVM's default encoding, which could be different from UTF-8 (and I guess that's what is happening in your case).
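
To illustrate why the reader's encoding matters, here is a rough standard-library sketch of what happens when UTF-8 bytes are decoded with a different charset before fixed-length columns are cut. ISO-8859-1 is only a stand-in for "some non-UTF-8 JVM default" (the actual default depends on the platform), and `FixedLengthDecodingDemo` and `column` are hypothetical names that use 0-based `substring` indexes rather than Spring Batch's `Range` objects.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo, for illustration only (not Spring Batch code).
class FixedLengthDecodingDemo {

    // Decode the raw line with the given charset, then cut the column
    // [start, end) by character index, as a substring-based tokenizer would.
    static String column(byte[] line, Charset charset, int start, int end) {
        return new String(line, charset).substring(start, end);
    }

    public static void main(String[] args) {
        // A line with two 4-character fields: "aleè" and "CODE", stored as UTF-8.
        byte[] line = "ale\u00e8CODE".getBytes(StandardCharsets.UTF_8);

        // Decoded as UTF-8, both columns come out as expected.
        System.out.println(column(line, StandardCharsets.UTF_8, 0, 4)); // aleè
        System.out.println(column(line, StandardCharsets.UTF_8, 4, 8)); // CODE

        // Decoded as ISO-8859-1, "è" becomes two characters and every
        // following column is shifted by one position.
        System.out.println(column(line, StandardCharsets.ISO_8859_1, 0, 4)); // aleÃ
        System.out.println(column(line, StandardCharsets.ISO_8859_1, 4, 8)); // ¨COD
    }
}
```

With the matching charset the string has one position per character again, which is consistent with the tokenizer working once the reader encoding is set to UTF-8.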

The default encoding has been changed to UTF-8 in FlatFileItemReader as of v5 (see df8dac1), which should fix this issue without any code change. But as mentioned by @kunivan, you can and should set the encoding of the reader to be the same as the encoding of your input file.

jtremiel added the status: waiting-for-triage (Issues that we did not analyse yet) and type: bug labels May 18, 2020

fmbenhassine added this to the 5.0.0 milestone May 5, 2022

fmbenhassine removed the status: waiting-for-triage (Issues that we did not analyse yet) label May 5, 2022

fmbenhassine closed this as completed May 17, 2022

fmbenhassine added the status: superseded (Issues that are superseded by other issues) label May 17, 2022

fmbenhassine modified the milestones: 5.0.0, 5.0.0-M3 May 17, 2022
