Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for "data" URIs in ImageConverter.guess_mimetypes() #10073

Closed
mgeier opened this issue Jan 9, 2022 · 7 comments · Fixed by #10099
Closed

Support for "data" URIs in ImageConverter.guess_mimetypes() #10073

mgeier opened this issue Jan 9, 2022 · 7 comments · Fixed by #10099
Labels
type:enhancement enhance or introduce a new feature
Milestone

Comments

@mgeier
Copy link
Contributor

mgeier commented Jan 9, 2022

Is your feature request related to a problem? Please describe.

This problem came up when using the sphinxcontrib.rsvgconverter extension, which is supposed to automatically convert SVG images to PDF when using the LaTeX builder, which it does with the help of the external program rsvg-convert.

When using the HTML builder, the extension is supposed to do nothing.
Previously, there was a problem with this (missinglinkelectronics/sphinxcontrib-svg2pdfconverter#8), which was fixed in #8649.

However, there is still one error case remaining:

When using data URIs instead of image files, this is still broken.
The data type is not recognized in this case and therefore it is handled as unsupported by the builder, even if the builder actually supports it.

Example reST code:

.. image:: 

Here is a test repo that shows the problem: https://github.com/mgeier/rsvgconverter-bug.

Describe the solution you'd like

The MIME type of data URIs should be checked and correctly recognized so that sphinxcontrib.rsvgconverter (and probably other extensions) knows if a supported data type is used.

Describe alternatives you've considered

One could just ignore this problem, but then the program rsvg-convert has to be installed and it is executed even if it is absolutely not necessary (e.g. for the HTML builder).

Additional context

This is where the MIME type is guessed:

def guess_mimetypes(self, node: nodes.image) -> List[str]:
if '?' in node['candidates']:
return []
elif '*' in node['candidates']:
return [guess_mimetype(node['uri'])]
else:
return node['candidates'].keys()

The value of node['candidates'] is {'?': 'data:image/png;base64,...'} in my case.
I'm not sure what the question mark means exactly.

There is already a helper function parse_data_uri(), which can probably be used to implement this:

def parse_data_uri(uri: str) -> Optional[DataURI]:
if not uri.startswith('data:'):
return None
# data:[<MIME-type>][;charset=<encoding>][;base64],<data>
mimetype = 'text/plain'
charset = 'US-ASCII'
properties, data = uri[5:].split(',', 1)
for prop in properties.split(';'):
if prop == 'base64':
pass # skip
elif prop.startswith('charset='):
charset = prop[8:]
elif prop:
mimetype = prop
image_data = base64.b64decode(data)
return DataURI(mimetype, charset, image_data)

@mgeier
Copy link
Contributor Author

mgeier commented Jan 9, 2022

I've created a possible fix: #10074.

@tk0miya
Copy link
Member

tk0miya commented Jan 10, 2022

If my understanding is correct, data URIs are extracted to temporary files and detected its mimetype by DataURIExtractor before a process of image converters.

class DataURIExtractor(BaseImageConverter):
default_priority = 150
def match(self, node: nodes.image) -> bool:
if self.app.builder.supported_remote_images == []:
return False
elif self.app.builder.supported_data_uri_images is True:
return False
else:
return node['uri'].startswith('data:')
def handle(self, node: nodes.image) -> None:
image = parse_data_uri(node['uri'])
ext = get_image_extension(image.mimetype)
if ext is None:
logger.warning(__('Unknown image format: %s...'), node['uri'][:32],
location=node)
return
ensuredir(os.path.join(self.imagedir, 'embeded'))
digest = sha1(image.data).hexdigest()
path = os.path.join(self.imagedir, 'embeded', digest + ext)
self.app.env.original_image_uri[path] = node['uri']
with open(path, 'wb') as f:
f.write(image.data)
node['candidates'].pop('?')
node['candidates'][image.mimetype] = path
node['uri'] = path
self.app.env.images.add_file(self.env.docname, path)

Therefore, no additional code is needed to process them, AFAIK.

For example, I succeeded to embed SVG image into PDF document using sphinx.ext.imgconverter without any changes.

.. image:: 

@mgeier
Copy link
Contributor Author

mgeier commented Jan 10, 2022

Thanks for looking into this @tk0miya!

The DataURIExtractor is only used when data: URIs are not supported by the builder (via supported_data_uri_images).

In my case it is not used, because the HTML builder can handle data: URIs.

I succeeded to embed SVG image into PDF document using sphinx.ext.imgconverter without any changes.

I know that it works for unsupported types (e.g. SVG) in the LaTeX builder.

I probably wasn't clear in my description, but the the problem is with supported types (in my case PNG) in the HTML builder.

In this case, the converter is supposed to do nothing, but it is called nevertheless, which is wrong.
This is causing my original problem (in case you are interested: spatialaudio/nbsphinx#559).

@tk0miya
Copy link
Member

tk0miya commented Jan 12, 2022

I think the expected behavior is that rsvgconverter will not be invoked for the embedded PNG image, right? If so, no reason to "guess mimetypes". How about skipping without guessing?

@mgeier
Copy link
Contributor Author

mgeier commented Jan 12, 2022

the expected behavior is that rsvgconverter will not be invoked for the embedded PNG image, right?

Yes.

If so, no reason to "guess mimetypes". How about skipping without guessing?

Don't we have to guess first to be able to check whether the guessed MIME type is supported or not?

elif set(self.guess_mimetypes(node)) & set(self.app.builder.supported_image_types):
# builder supports the image; no need to convert
return False

That's what I'm trying to implement in #10074.

@tk0miya
Copy link
Member

tk0miya commented Jan 13, 2022

I think all embedded images should be skipped. So it's better to skip them all:

if node["uri"].startswith("data:"):
    return False

Is any reason to obtain the mime-type from the image? What do you do if the builder does not support the image?

@mgeier
Copy link
Contributor Author

mgeier commented Jan 13, 2022

Yes, thanks, I think you are right!

Nobody would use an unsupported MIME type in a "data" URI anyway.

I've created an alternative fix based on your suggestion in #10099.

@tk0miya tk0miya added this to the 4.4.0 milestone Jan 15, 2022
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 15, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
type:enhancement enhance or introduce a new feature
Projects
None yet
2 participants