Encoding mismatch between writing and reading result files #1075

Open
pyZerrenner opened this issue Mar 22, 2024 · 1 comment
@pyZerrenner
Contributor

pyZerrenner commented Mar 22, 2024

When using special characters in my procedure's DATA_COLUMNS names (e.g. when writing '°C'), the result file is written correctly, but PyMeasure can fail to read it back in, and no curve is displayed in the plotter window.

When writing result files, PyMeasure uses the built-in open function (in Results.__init__, for the header and column titles) and logging.FileHandler (in Recorder.__init__), both of which default to the system's standard encoding (locale.getencoding()). On my German Windows 10 machine, this is 'cp1252'. Reading the data back, however, uses pandas' read_csv function, which defaults to encoding='utf-8'. For special characters such as '°' this raises a decoding error, which PyMeasure silently swallows, returning an empty DataFrame instead. Special characters in the header are not a problem: read_csv skips the commented lines, and Results.header() uses encode("unicode_escape") to replace tricky characters (e.g. 'µL' is written as '\xb5L').
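The mismatch can be reproduced outside PyMeasure in a few lines (a minimal sketch; the temporary file path is arbitrary):

```python
import os
import tempfile

import pandas as pd

path = os.path.join(tempfile.mkdtemp(), "results.csv")

# Simulate a result file written on a system whose locale encoding is
# cp1252, as open() and logging.FileHandler would do on a German Windows box.
with open(path, "w", encoding="cp1252") as f:
    f.write("Temperature (°C),Voltage (V)\n")
    f.write("25.0,1.23\n")

# pandas defaults to encoding='utf-8'; the cp1252 byte 0xB0 for '°'
# is not valid UTF-8, so reading fails.
decode_failed = False
try:
    pd.read_csv(path)
except UnicodeDecodeError:
    decode_failed = True

# Reading with the same encoding that was used for writing works fine.
df = pd.read_csv(path, encoding="cp1252")
print(decode_failed, df.columns.tolist())
```

In PyMeasure the equivalent of that UnicodeDecodeError is caught internally, which is why the user only sees an empty plot rather than a traceback.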

Solution 1: Avoid special characters

Of course, I can simply write degC and uL (which also helps compatibility with other software), but that is a workaround rather than a real solution.

Solution 2: Set the global default encoding to UTF-8

I have not checked how this can be done in the operating system itself, but in Python the UTF-8 mode can be used to switch most encoding defaults to 'utf-8', e.g. by setting the environment variable PYTHONUTF8=1. Again, this is a user-side workaround, not a fix.
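Whether UTF-8 mode is active can be checked from inside Python. Note that sys.flags.utf8_mode is fixed at interpreter startup, so PYTHONUTF8=1 must already be in the environment (or `python -X utf8` used) when launching:

```python
import locale
import sys

# sys.flags.utf8_mode is 1 when Python runs in UTF-8 mode, 0 otherwise.
print("UTF-8 mode:", sys.flags.utf8_mode)

# In UTF-8 mode this reports 'utf-8' regardless of the system locale;
# it is the default that open() and logging.FileHandler fall back to.
print("Default encoding:", locale.getpreferredencoding(False))
```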

Solution 3: Use the default encoding in Results

Wherever pandas is used to read the results file, add the argument encoding=locale.getencoding(). I found two occurrences of pd.read_csv, in Results.reload and Results.data, but there may be other places. Adding the argument in both places fixed the issue for me. (And of course, import locale has to be added as well.)
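The change could look like the following. read_results_csv is a hypothetical helper for illustration, not PyMeasure's actual code, and note that locale.getencoding() only exists since Python 3.11; locale.getpreferredencoding(False) is the portable equivalent for older versions:

```python
import locale

import pandas as pd


def read_results_csv(path, **kwargs):
    """Read a results file with the locale encoding that open() and
    logging.FileHandler used when writing it, instead of pandas'
    utf-8 default. An explicit encoding= argument still wins."""
    kwargs.setdefault("encoding", locale.getpreferredencoding(False))
    # comment="#" skips the header lines, as described above.
    return pd.read_csv(path, comment="#", **kwargs)
```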

Solution 4: Explicitly specify the encoding

Set an encoding when creating a Results instance (either as a class attribute or as an argument) and pass it down to every function that interacts with the file. This would probably be the most robust solution, as it does not implicitly rely on every function using the default encoding (which, as we have seen here, already fails with pandas). But I cannot say in how many places this would have to be implemented, so I would rather not attempt it myself.
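A sketch of the idea, using a deliberately simplified stand-in rather than PyMeasure's actual Results class:

```python
import pandas as pd


class Results:
    """Simplified stand-in for PyMeasure's Results class: the encoding is
    fixed once at construction and used for every file operation, so
    writing and reading can never disagree."""

    ENCODING = "utf-8"  # class-level default; instances may override it

    def __init__(self, data_filename, encoding=None):
        self.data_filename = data_filename
        self.encoding = encoding or self.ENCODING

    def write_header(self, columns):
        # The same encoding is used here as for reading below.
        with open(self.data_filename, "w", encoding=self.encoding) as f:
            f.write(",".join(columns) + "\n")

    @property
    def data(self):
        # pandas is told explicitly which encoding the file was written with.
        return pd.read_csv(self.data_filename, encoding=self.encoding)
```

With a UTF-8 class default this would also address the portability concern below, while still letting users opt into the locale encoding for old files.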

Thanks for your support.

@BenediktBurger
Member

Thanks for finding that bug.

I personally prefer to use UTF-8 everywhere, as the files will then be portable between operating systems.
Using the locale encoding for reading and writing can be an issue if you read and write on different machines.
However, backward compatibility requires the locale encoding.

@BenediktBurger BenediktBurger added bug automation Procedures, Experiment, and other automated things labels Mar 23, 2024