Provide parity for both escC and unescC to support more awkward filenames #220

Open
Earnestly opened this issue Aug 28, 2023 · 4 comments

Comments

@Earnestly

Earnestly commented Aug 28, 2023

I have been experimenting with awkward filenames, using -@ and #[CSTR] to support them. When my filename contains \b and \f, for example, and I attempt to store it under the xmp-dc:source tag while creating a MIE file, ExifTool converts the backspace and formfeed into dots (.), making it impossible to recover the original filename.

Looking at the codebase I notice that there is support for a much wider range of typical C-style escapes when unescaping them, but few for escaping:

# lookup for C-style escape sequences
my %escC = ( "\n" => '\n', "\r" => '\r', "\t" => '\t', '\\' => '\\\\');
my %unescC = ( a => "\a", b => "\b", f => "\f", n => "\n", r => "\r",
               t => "\t", 0 => "\0", '\\' => '\\' );

Is there a reason why escC couldn't be at parity with unescC (or potentially one derived from the other)?

I.e.

# lookup for C-style escape sequences
my %unescC = ( a => "\a", b => "\b", f => "\f", n => "\n", r => "\r",
               t => "\t", 0 => "\0", '\\' => '\\' );
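# prefix each key with a backslash so "\n" maps to '\n', not to a bare 'n'
my %escC = map { $unescC{$_} => "\\$_" } keys %unescC;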

However, just adding more mappings to escC or editing unescapeChar doesn't seem to be enough, as my local tests show the same result, with unrecognised escapes still being turned into dots (.).
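To make the parity idea concrete, here is a minimal standalone sketch of deriving the escape table from the unescape table and checking that the two round-trip. The helper names esc_c and unesc_c are hypothetical and are not ExifTool functions; this only illustrates the table derivation and says nothing about where ExifTool replaces control characters with dots.

```perl
use strict;
use warnings;

# Decoding table, copied from the snippet quoted above.
my %unescC = ( a => "\a", b => "\b", f => "\f", n => "\n", r => "\r",
               t => "\t", 0 => "\0", '\\' => '\\' );

# Derive the encoding table from it so the two cannot drift apart.
# The backslash is added explicitly because the keys of %unescC are bare letters.
my %escC = map { $unescC{$_} => "\\$_" } keys %unescC;

# Hypothetical helper: replace each known control character (and the
# backslash itself) with its two-character C-style escape.
sub esc_c {
    my ($str) = @_;
    $str =~ s/([\x00\x07\x08\x09\x0a\x0c\x0d\\])/$escC{$1}/g;
    return $str;
}

# Hypothetical helper: the inverse of esc_c.
sub unesc_c {
    my ($str) = @_;
    $str =~ s/\\([abfnrt0\\])/$unescC{$1}/g;
    return $str;
}
```

With both tables built from one source, unesc_c(esc_c($name)) returns the original string for any name made up of the characters the table covers. Bytes outside the table (such as \e or \v) pass through unchanged in this sketch.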

@StarGeekSpaceNerd
Collaborator

Try the -b (-binary) option. From the docs:

This option is mainly used for extracting embedded images or other binary data, but it may also be useful for some text strings since control characters (such as newlines) are not replaced by '.' as they are in the default output.

Anything else will require Phil's attention, but he is currently away until mid-September.

@Earnestly
Author

Earnestly commented Aug 28, 2023

Seems like -b doesn't help with this. A simple demonstration (using bash for $'dollar quote' feature):

$ printf content\\n > $'\a\f\n\b\r\t\e\v\\\"'

$ exiftool -tagsfromfile $'\a\f\n\b\r\t\e\v\\\"' -o test.mie -xmp-dc:source'<${filename}'
    1 image files created

$ exiftool -p '${source}' test.mie | od -An -tc
   .   .  \n   .  \r  \t   .   .   \   "  \n

The expected result would have been more like:

   \a  \f  \n  \b  \r  \t 033  \v   \   "  \n

The \033 (or \e) escape was thrown in to consider how it might approach arbitrary bytes, as a filename can contain any byte except NUL and /.
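One way arbitrary bytes could be approached is a hex fallback: named escapes where a short form exists, \xHH for every other control or non-ASCII byte. The sketch below is only an illustration under that assumption. The function names esc_bytes and unesc_bytes and the choice of \xHH are mine, not anything ExifTool currently implements, and \v comes out as \x0b rather than as a named escape.

```perl
use strict;
use warnings;

# Named escapes for the characters that have a conventional short form.
my %named = ( "\a" => '\a', "\x08" => '\b', "\f" => '\f', "\n" => '\n',
              "\r" => '\r', "\t" => '\t', "\\" => '\\\\' );

# Hypothetical helper: escape control bytes, high bytes and backslashes.
# Bytes without a named escape fall back to a two-digit \xHH form.
sub esc_bytes {
    my ($bytes) = @_;
    $bytes =~ s{([\x00-\x1f\x7f-\xff\\])}{ $named{$1} // sprintf('\\x%02x', ord($1)) }ge;
    return $bytes;
}

# Hypothetical helper: the inverse of esc_bytes.
sub unesc_bytes {
    my ($text) = @_;
    my %plain = reverse %named;    # '\a' => "\a", '\\\\' => "\\", ...
    $text =~ s{\\x([0-9a-f]{2})|(\\[abfnrt\\])}{ defined($1) ? chr(hex($1)) : $plain{$2} }ge;
    return $text;
}
```

Because the backslash itself is always escaped, a filename that literally contains the four characters \x41 is written as \\x41 and decodes back to the same four characters rather than to "A".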

PS: I'll look into applying https://exiftool.org/faq.html#Q21 to see if that helps.

@boardhead
Contributor

This topic must be handled carefully because code injection from malicious file names is a real possibility. At the moment, ExifTool doesn't do more than necessary because this is the safest way to proceed. I would have to dedicate a good block of time to expanding this to cover all possible characters/escapes, and without a real-life use case I don't know if this would be a worthwhile way to spend my time. Your tests seem to be theoretical -- have you seen file names like this in the wild?

@Earnestly
Author

Earnestly commented Sep 19, 2023

Rarely, but I do try to write software that handles the datatypes as they are (at least on Unix, filenames are defined as any sequence of bytes except / and NUL). Currently I include a check which excludes filenames containing most of these problem characters as a compromise.

For some prior art, ImageMagick also interprets filenames but provides the -define filename:literal=true option to disable the feature. ExifTool differs here in that it doesn't seem to store the bytes literally?

(I don't really mind if it can't print them nicely using C-escape encoding, but I would want the bytes that go in to be the same as the bytes that come out, even if that has to go through an encode/decode layer, e.g. to and from XML entities. As much as possible anyway.)
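As an illustration of the encode/decode layer idea, the sketch below wraps the raw filename bytes in Base64 on the caller's side before they would be handed to any tag, and unwraps them afterwards. This is an assumption about a possible workaround in the calling script, not ExifTool behaviour. The helper names pack_name and unpack_name are hypothetical, and Base64 is swapped in for XML entities only because the core MIME::Base64 module handles every byte value.

```perl
use strict;
use warnings;
use MIME::Base64 qw(encode_base64 decode_base64);

# Hypothetical helper: turn arbitrary filename bytes into printable ASCII.
# The empty second argument suppresses the line breaks encode_base64 adds.
sub pack_name {
    my ($bytes) = @_;
    return encode_base64($bytes, '');
}

# Hypothetical helper: recover the exact original bytes.
sub unpack_name {
    my ($text) = @_;
    return decode_base64($text);
}
```

The stored value is no longer human-readable, which is the trade-off for getting the same bytes out as went in, and any consumer of the tag would need to know about the wrapping.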
