Encoding mismatch between writing and reading result files #1075

Open
pyZerrenner opened this issue Mar 22, 2024 · 1 comment
@pyZerrenner
Contributor

pyZerrenner commented Mar 22, 2024

When using special characters in my procedure's DATA_COLUMNS names (e.g. when writing '°C'), the result file is written correctly, but PyMeasure can fail to read it back in, and no curve is displayed in the plotter window.

When writing result files, PyMeasure uses the built-in open function (in Results.__init__, for the header and column titles) and logging.FileHandler (in Recorder.__init__), both of which default to the system's standard encoding (locale.getencoding()). On my German Windows 10 machine, this is 'cp1252'. Reading the data back, however, uses pandas' read_csv function, which defaults to encoding='utf-8'. For special characters such as '°' this raises a decoding error, which PyMeasure silently swallows, returning an empty DataFrame instead. Special characters in the header are not a problem: read_csv skips the commented lines, and Results.header() uses encode("unicode_escape") to replace tricky characters (e.g. 'µL' is written as '\xb5L').
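The mismatch can be reproduced outside PyMeasure in a few lines (a minimal sketch; the temporary file path is arbitrary):

```python
import os
import tempfile

import pandas as pd

path = os.path.join(tempfile.mkdtemp(), "results.csv")

# Simulate a result file written on a system whose locale encoding is
# cp1252, as open() and logging.FileHandler would do on a German Windows box.
with open(path, "w", encoding="cp1252") as f:
    f.write("Temperature (°C),Voltage (V)\n")
    f.write("25.0,1.23\n")

# pandas defaults to encoding='utf-8'; the cp1252 byte 0xB0 for '°'
# is not valid UTF-8, so reading fails.
decode_failed = False
try:
    pd.read_csv(path)
except UnicodeDecodeError:
    decode_failed = True

# Reading with the same encoding that was used for writing works fine.
df = pd.read_csv(path, encoding="cp1252")
print(decode_failed, df.columns.tolist())
```

In PyMeasure the equivalent of that UnicodeDecodeError is caught internally, which is why the user only sees an empty plot rather than a traceback.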

Solution 1: Avoid special characters

Of course, I can simply write degC and uL (which also helps compatibility with other software), but that is a workaround rather than a real solution.

Solution 2: Set the global default encoding to UTF-8

I have not checked how this can be done in the operating system itself, but in Python the UTF-8 mode can be used to switch most encoding defaults to 'utf-8', e.g. by setting the environment variable PYTHONUTF8=1. Again, this is a user-side workaround, not a fix.
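Whether UTF-8 mode is active can be checked from inside Python. Note that sys.flags.utf8_mode is fixed at interpreter startup, so PYTHONUTF8=1 must already be in the environment (or `python -X utf8` used) when launching:

```python
import locale
import sys

# sys.flags.utf8_mode is 1 when Python runs in UTF-8 mode, 0 otherwise.
print("UTF-8 mode:", sys.flags.utf8_mode)

# In UTF-8 mode this reports 'utf-8' regardless of the system locale;
# it is the default that open() and logging.FileHandler fall back to.
print("Default encoding:", locale.getpreferredencoding(False))
```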

Solution 3: Use the default encoding in Results

Wherever pandas is used to read the results file, add the argument encoding=locale.getencoding(). I found two occurrences of pd.read_csv, in Results.reload and Results.data, but there may be other places. Adding the argument in both places fixed the issue for me. (And of course, import locale has to be added as well.)
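The change could look like the following. read_results_csv is a hypothetical helper for illustration, not PyMeasure's actual code, and note that locale.getencoding() only exists since Python 3.11; locale.getpreferredencoding(False) is the portable equivalent for older versions:

```python
import locale

import pandas as pd


def read_results_csv(path, **kwargs):
    """Read a results file with the locale encoding that open() and
    logging.FileHandler used when writing it, instead of pandas'
    utf-8 default. An explicit encoding= argument still wins."""
    kwargs.setdefault("encoding", locale.getpreferredencoding(False))
    # comment="#" skips the header lines, as described above.
    return pd.read_csv(path, comment="#", **kwargs)
```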

Solution 4: Explicitly specify the encoding

Set an encoding when creating a Results instance (either as a class attribute or as an argument) and pass it down to every function that interacts with the file. This would probably be the most robust solution, as it does not implicitly rely on every function using the default encoding (which, as we have seen here, already fails with pandas). But I cannot say in how many places this would have to be implemented, so I would rather not attempt it myself.
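A sketch of the idea, using a deliberately simplified stand-in rather than PyMeasure's actual Results class:

```python
import pandas as pd


class Results:
    """Simplified stand-in for PyMeasure's Results class: the encoding is
    fixed once at construction and used for every file operation, so
    writing and reading can never disagree."""

    ENCODING = "utf-8"  # class-level default; instances may override it

    def __init__(self, data_filename, encoding=None):
        self.data_filename = data_filename
        self.encoding = encoding or self.ENCODING

    def write_header(self, columns):
        # The same encoding is used here as for reading below.
        with open(self.data_filename, "w", encoding=self.encoding) as f:
            f.write(",".join(columns) + "\n")

    @property
    def data(self):
        # pandas is told explicitly which encoding the file was written with.
        return pd.read_csv(self.data_filename, encoding=self.encoding)
```

With a UTF-8 class default this would also address the portability concern below, while still letting users opt into the locale encoding for old files.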

Thanks for your support.

@BenediktBurger
Member

Thanks for finding that bug.

I personally prefer to use UTF-8 everywhere, as the files will then be portable between operating systems.
Using the locale encoding for reading and writing can be an issue if you read and write on different machines.
However, backward compatibility requires the locale encoding.

@BenediktBurger BenediktBurger added bug automation Procedures, Experiment, and other automated things labels Mar 23, 2024