FixedLengthTokenizer wrong tokenization with utf-8 extended characters #3714

Closed
jtremiel opened this issue May 18, 2020 · 2 comments
Labels: status: superseded, type: bug

@jtremiel

Bug description
When working with a UTF-8 flat file that contains extended characters, the line is tokenized incorrectly because String.substring is used instead of working with byte arrays. This happens because a character such as "è" is made up of two bytes (so two "positions" in the file), but when working with a string it counts as one position, which yields the wrong token.
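To make the character/byte distinction described above concrete, here is a minimal Java sketch that uses only the standard library. The class and method names are made up for illustration and are not part of Spring Batch; the sketch only shows that "aleè" is four characters once decoded but five bytes when encoded as UTF-8.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only (not a Spring Batch type).
class CharVsByteLength {

    // Number of UTF-16 code units in the decoded string (what substring indexes by).
    static int charLength(String s) {
        return s.length();
    }

    // Number of bytes the same text occupies when encoded as UTF-8.
    static int utf8ByteLength(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // "\u00e8" is "è"; the escape avoids any dependency on the source file encoding.
        String field = "ale\u00e8";
        // prints "4 chars, 5 bytes"
        System.out.println(charLength(field) + " chars, " + utf8ByteLength(field) + " bytes");
    }
}
```

A String that was decoded with the correct charset already counts "è" as a single position, so character-based column ranges only drift when the decoding step itself uses the wrong charset.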

Environment
All versions

Steps to reproduce
Use a fixed-length file with some fields and put a text like "aleè" in the data of one field.

@jtremiel jtremiel added status: waiting-for-triage Issues that we did not analyse yet type: bug labels May 18, 2020
@kunivan

kunivan commented May 27, 2021

Hi, I thought I had the same issue with the FixedLengthTokenizer, but it turned out that the encoding on the FlatFileItemReader had to be explicitly set to UTF-8 in order for the line to be read correctly from the UTF-8 encoded file I was reading. After this, the FixedLengthTokenizer worked fine for me.
So, flatFileItemReader.setEncoding("UTF-8"); did the trick for me.

@fmbenhassine fmbenhassine added this to the 5.0.0 milestone May 5, 2022
@fmbenhassine fmbenhassine removed the status: waiting-for-triage Issues that we did not analyse yet label May 5, 2022
@fmbenhassine
Contributor

fmbenhassine commented May 17, 2022

@jtremiel

When working with a UTF-8 flat file that contains extended characters, the line is tokenized incorrectly

In this case, you need to explicitly set the encoding of the reader to UTF-8, because the default as of v4 is the JVM's default encoding, which could be different from UTF-8 (and I guess that's what is happening in your case).
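As a rough illustration of why a mismatched reader encoding shifts fixed-length fields, the sketch below decodes the same UTF-8 bytes twice with plain JDK classes: once as UTF-8 and once as ISO-8859-1, which stands in here for a non-UTF-8 JVM default. The two-field layout, the `tokenize` helper, and the class name are assumptions invented for this example; it does not use Spring Batch or reproduce the FixedLengthTokenizer code.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, for illustration only (not a Spring Batch type).
class EncodingMismatchDemo {

    // Assumed layout: characters 1-4 are a name, characters 5-7 are a code.
    static String[] tokenize(String line) {
        return new String[] { line.substring(0, 4), line.substring(4, 7) };
    }

    // Decodes raw file bytes with the given charset, as a reader would.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // "\u00e8" is "è"; the line "aleè123" takes 8 bytes in a UTF-8 file.
        byte[] raw = "ale\u00e8123".getBytes(StandardCharsets.UTF_8);

        // Correct charset: 7 characters, so the fields are "aleè" and "123".
        String[] ok = tokenize(decode(raw, StandardCharsets.UTF_8));

        // Wrong charset: the two bytes of "è" become two characters,
        // the line grows to 8 characters and the second field is shifted.
        String[] shifted = tokenize(decode(raw, StandardCharsets.ISO_8859_1));

        System.out.println(Arrays.toString(ok));
        System.out.println(Arrays.toString(shifted));
    }
}
```

With the wrong charset the second field no longer reads "123", which matches the symptom reported above; choosing the charset that matches the file, as `flatFileItemReader.setEncoding("UTF-8")` does for the reader, keeps one character per column.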

The default encoding has been changed to UTF-8 in FlatFileItemReader as of v5 (see df8dac1), which should fix this issue without any code change. But as mentioned by @kunivan, you can and should set the encoding of the reader to match the encoding of your input file.

@fmbenhassine fmbenhassine added the status: superseded Issues that are superseded by other issues label May 17, 2022
@fmbenhassine fmbenhassine modified the milestones: 5.0.0, 5.0.0-M3 May 17, 2022