Direct export into more formats #9

Open
danijar opened this issue Aug 5, 2019 · 12 comments

@danijar
Owner

danijar commented Aug 5, 2019

The output format is currently HTML, which can then be printed as PDF from the browser. It could be nice to directly export to PDF, ipynb and other formats.

@danijar danijar added the feature label Aug 5, 2019
@epogrebnyak
Contributor

epogrebnyak commented Aug 6, 2019

As for PDF I think we need a little roadmap here as there are tradeoffs. I tried weasyprint, xhtml2pdf and wkhtmltopdf (or pdfkit, which is the wrapper for the same engine).

  • weasyprint and xhtml2pdf are easily installable via pip, so they can be a dependency of the Python project, but they mangle the output with the current CSS. Some dark magic would need to be applied to the HTML styling for the current HTML to render well in these engines.

  • wkhtmltopdf/pdfkit needs a binary to be installed (or an apt-get call), but the pdf looks decent:

[image]

  • There are different routes with pandoc, but it is also a hard dependency.

  • There may be other options I did not try.

We can also put a little section "what you can do next with html" in README or docs and let the pdf export wait until a better choice emerges.

Pagination on export may be a barrier; the Handout class might need a method for inserting a page break.
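
A minimal sketch of what such a page-break method could look like. The `MiniHandout` class and the `add_page_break()` name are hypothetical and not part of the library; the only real ingredient is the standard CSS print property `page-break-after`. How each PDF engine treats it would still need to be checked.

```python
class MiniHandout:
    """Hypothetical stand-in for Handout; it only collects HTML fragments."""

    def __init__(self):
        self._fragments = []

    def add_html(self, string):
        self._fragments.append(string)

    def add_page_break(self):
        # Assumed method name. An empty div that asks the print engine
        # to start a new page after it.
        self.add_html('<div style="page-break-after: always"></div>')

    def render(self):
        return "\n".join(self._fragments)


doc = MiniHandout()
doc.add_html("<p>Page one</p>")
doc.add_page_break()
doc.add_html("<p>Page two</p>")
html_output = doc.render()
```

In the browser the div is invisible; it only matters when the HTML is printed or converted to PDF.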

For a discussion of the tools I referred to manubot. They have a fallback procedure for generating the PDF based on the tools that are available on the system (athena is there if there is Docker).

The relation of tools to engines is as below:

  • weasyprint - own?
  • xhtml2pdf - ReportLab
  • wkhtmltopdf/pdfkit - WebKit
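
The fallback idea could be sketched roughly as below. This is not manubot's actual procedure: the function name `pick_pdf_backend`, the order of preference, and the returned labels are all assumptions. It uses only the standard library calls `importlib.util.find_spec` and `shutil.which` to detect what is installed.

```python
import importlib.util
import shutil


def pick_pdf_backend(which=shutil.which, find_spec=importlib.util.find_spec):
    """Return a label for the first usable HTML-to-PDF tool, or None.

    The lookup functions are parameters so the logic can be tested
    without installing anything.
    """
    # Pure-Python option, installable via pip.
    if find_spec("weasyprint") is not None:
        return "weasyprint"
    # Needs the wkhtmltopdf binary on PATH.
    if which("wkhtmltopdf") is not None:
        return "wkhtmltopdf"
    # Assumed last resort: run a PDF tool inside a Docker container.
    if which("docker") is not None:
        return "docker"
    return None


backend = pick_pdf_backend()  # depends on the local machine
```

If nothing is found, the caller could simply keep the HTML output and tell the user to print it from the browser.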

This was referenced Aug 7, 2019
@danijar
Owner Author

danijar commented Aug 9, 2019

I thought a bit about exporting into more formats. There will be some formats that HTML can easily be converted to and some that can't.

Converting from HTML won't work if the user is interested in the exported source rather than just the rendered result. This is the case for e.g. LaTeX which you may want to include in a larger LaTeX document and keep editing. Another example could be exporting to Jupyter notebooks.

For this, we need to support multiple exporters. HTML, LaTeX, and ipynb come to mind for which this would be useful. Display formats like PDF are easy to generate from any of the three. To convert to LaTeX, we'll need a Markdown to LaTeX converter to format multi-line comments.

I would like to keep all base features available for all output formats, e.g. we should robustly embed videos into LaTeX. Besides this, there will be doc.add_html(), doc.add_latex(), etc., which will just be no-ops when exporting to a different format.

It will take a little bit of planning to decide how to structure the code for this. For example, we may want to make the blocks (blocks.py) independent of the output format and then have one class per output type with methods that "visit" the blocks in the document.

class Exporter:
  def __init__(self, directory): ...
  def visit_comment(self, block): ...  # e.g. block.text
  def visit_text(self, block): ...  # e.g. block.text
  def visit_image(self, block): ...  # e.g. block.filename, block.width
  def visit_video(self, block): ...  # e.g. block.filename, block.width
  def save(self): ...
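
To make the visiting idea concrete, here is a small sketch of how dispatch could work. The `visit()` helper, the name-based lookup, and the `LaTeX` block are assumptions for illustration, and the `directory` argument and `save()` are left out to keep it short. A block type the exporter has no method for is skipped, which gives the no-op behavior for format-specific blocks.

```python
from collections import namedtuple

Text = namedtuple("Text", "text")
Image = namedtuple("Image", "filename width")
LaTeX = namedtuple("LaTeX", "text")  # raw LaTeX, meaningless in HTML


class HTMLExporter:
    def __init__(self):
        self.lines = []

    def visit(self, block):
        # Look up visit_<classname>; blocks without a method are no-ops.
        name = "visit_" + type(block).__name__.lower()
        method = getattr(self, name, None)
        if method is not None:
            method(block)

    def visit_text(self, block):
        self.lines.append("<p>" + block.text + "</p>")

    def visit_image(self, block):
        self.lines.append(
            '<img src="%s" width="%s">' % (block.filename, block.width))


exporter = HTMLExporter()
for block in [Text("hello"), LaTeX(r"\newpage"), Image("plot.png", 300)]:
    exporter.visit(block)
```

After the loop, `exporter.lines` holds one entry for the text and one for the image; the LaTeX block leaves no trace.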

What do you think? Do you have other ideas how the code should be structured?

@epogrebnyak
Contributor

epogrebnyak commented Aug 10, 2019

I agree with the sequence of formats: there are 'immediate output' formats like html, ipynb and latex, and display formats like pdf, which are produced by processing either html or latex.

I do not fully understand the Exporter class above yet, so let me elaborate in prose on the program structure:

  1. we have some source of the report, which consists of the .py script body and calls to add_something() functions
  2. currently the add_something() calls create html on the fly, as specified by the blocks.py classes
  3. we want to support different export formats (html, latex, ipynb, maybe varieties of markdown)
  4. we may want to allow 'scriptless'/'interactive' use of the Handout class, as in "Support interactive environment in PyCharm where there is no source file" #25
  5. source parts may need different preprocessing depending on export format (a lot of that happens in jupyter).

Source script processing and add_x() calls should result in a list of blocks holding Message, Text, Image, Video, Code instances. These blocks would hold just the content such as text, filename, display parameters.

Then there is a render_html(), render_latex(), render_markdown() function or method that converts each block type into a new format.

Finally, there is functionality that assembles the converted blocks into an html, latex, ipynb or markdown document.

# Handout class is exposed to the user: 
# - the user inits a handout in a script to display script comments and code in output
# - the user adds elements as images and video to the output inside a script
# - alternatively, the user plays with instance in interactive session, 
#   just using the add_x() methods 

class Handout:
    def __init__(self, directory, title, interactive=False):
        pass

# blocks represent the content units of a report
# they hold values and display configurations
# maybe blocks can be dataclasses, to make the constructors cleaner
 
class Block:
    pass

class Message(Block):
    pass

class Text(Block):  # this is for multiline comments
    pass

class Image(Block):
    pass

class Video(Block):
    pass

class Code(Block):
    pass

# something is done to produce internal representation of the document
# as a list of blocks. this is what Handout class does now, but it is tightly 
# bundled with html output

from typing import List

class Document:
    title: str
    blocks: List[Block]

# document can be exported to different formats

def to_html(doc: Document) -> str:
    pass

def to_latex(doc: Document) -> str:
    pass

def to_notebook(doc: Document) -> str:
    pass

def to_markdown(doc: Document) -> str:
    pass
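
As an illustration of this function-per-format layout, here is a sketch of one converter. The field names (`string`, `filename`, `width`) and the exact Markdown produced are assumptions, and only two block types are handled.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Text:
    string: str


@dataclass
class Image:
    filename: str
    width: int


@dataclass
class Document:
    title: str
    blocks: List[object] = field(default_factory=list)


def to_markdown(doc: Document) -> str:
    """Convert each block, then assemble the pieces into one document."""
    parts = ["# " + doc.title]
    for block in doc.blocks:
        if isinstance(block, Text):
            parts.append(block.string)
        elif isinstance(block, Image):
            # Plain Markdown has no width attribute, so width is dropped.
            parts.append("![](%s)" % block.filename)
        else:
            raise TypeError("unsupported block: %r" % (block,))
    return "\n\n".join(parts) + "\n"


doc = Document("Report", [Text("Some text"), Image("plot.png", 300)])
markdown = to_markdown(doc)
```

Each converter is a pure function from `Document` to `str`, so it can be tested without touching the file system.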

@epogrebnyak
Contributor

epogrebnyak commented Aug 10, 2019

As a side note, the role of markdown is still to be discussed:

  • in some workflows people might want the markdown output to be embedded in larger markdown documents, e.g. to make it part of a README file
  • in some cases, exporting to Sweave/Pweave/jupytext markdown reports may be desired (raised in "Comparison to Pweave" #12, for example).

We can start with the simplest type of markdown.

@danijar
Owner Author

danijar commented Aug 10, 2019

Thanks for your example. What I had in mind is the visitor pattern, which seems like a better solution to me. What do you think?

# The user API gets a new constructor argument:
class Handout:
  def __init__(self, directory, title='Handout', format='html', source=True):
    self._blocks = []
  def add_text(self, string): ...  # Add Text block.
  def add_image(self, tensor, width=None, format='png'): ...  # Save to disk and add Image block.
  def add_html(self, string): ...  # Add HTML block.
  def show(self): ...  # Iterate over blocks and call the corresponding exporter methods.
  def _find_source(self): ...  # Find user's Python source; can be extracted out some day.

# Blocks are independent of output format:
from collections import namedtuple
Text = namedtuple('Text', 'string')
Image = namedtuple('Image', 'filename, width')
HTML = namedtuple('HTML', 'string')

# Exporters are visitors:
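class HTMLExporter:
  def __init__(self, directory, source, title):
    self.lines = []
  def visit_text(self, text): ...  # Add lines to self.lines.
  def visit_image(self, image): ...
  def visit_html(self, html): ...
  def save(self): ...  # Save lines to index.html.

class LaTeXExporter:
  def __init__(self, directory, source, title): ...
  def visit_text(self, text): ...
  def visit_image(self, image): ...
  def visit_html(self, html): ...  # No-op.
  def save(self): ...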

One question is where to specify the export type. It shouldn't be in show() since that is often called many times. It could be in the Handout constructor, but that means you have to run the script multiple times to export into multiple formats. However, I think this might be fine. The constructor could accept a list of output formats if this is really a use case.

By the way, do you have a preference for how to name the exporters? I can think of exporter, output, target, backend.
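
A sketch of the constructor accepting either one format or a list of formats. `RecordingExporter` is a hypothetical placeholder for real exporters, and rebuilding every exporter's state on each `show()` call is just one assumed way to keep repeated calls safe.

```python
class RecordingExporter:
    """Hypothetical exporter that only records the blocks it visits."""

    def __init__(self, name):
        self.name = name
        self.blocks = []

    def visit(self, block):
        self.blocks.append(block)


class Handout:
    def __init__(self, directory, title="Handout", format="html"):
        # Accept a single format name or a list of names.
        formats = [format] if isinstance(format, str) else list(format)
        self._exporters = [RecordingExporter(name) for name in formats]
        self._blocks = []

    def add_text(self, string):
        self._blocks.append(string)

    def show(self):
        # show() may be called many times, so start each exporter fresh
        # and replay all blocks collected so far.
        for exporter in self._exporters:
            exporter.blocks = []
            for block in self._blocks:
                exporter.visit(block)


doc = Handout("out", format=["html", "latex"])
doc.add_text("first")
doc.show()
doc.add_text("second")
doc.show()
```

With this shape a single run of the script feeds all requested formats, at the cost of regenerating every output on each `show()`.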

@epogrebnyak
Contributor

epogrebnyak commented Aug 10, 2019

The visit_something() methods seem very redundant to me. For testability, one would apparently need to do quite a few things in this setting just to see if the program converts a block type correctly from the source.

from dataclasses import dataclass

class Block:
    pass

@dataclass
class Message(Block):
    string: str

    def html(self):
        return '<pre class="message">' + self.string + '</pre>'

    def markdown(self):
        return self.string

    def latex(self):
        pass

assert Message('Some text').html() == '<pre class="message">Some text</pre>'

This way we keep the data and the block conversion functions closer together, which makes testing much easier.

Later in the code you can have a visitor class that assembles the full html document, the body of a latex file, or an ipynb file.

class LaTeXExporter:
  def __init__(self, directory, blocks, title): ...
  def render(self): ...
  def save(self): ...
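
A sketch of how such an exporter could assemble a document from blocks that convert themselves. The `verbatim` rendering, the output file name `handout.tex`, and the unescaped title are assumptions made to keep the example short.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Message:
    string: str

    def latex(self):
        # verbatim avoids escaping special characters; this assumes the
        # text never contains "\end{verbatim}" itself.
        return "\\begin{verbatim}\n" + self.string + "\n\\end{verbatim}"


class LaTeXExporter:
    def __init__(self, directory, blocks, title):
        self.directory = directory
        self.blocks = blocks
        self.title = title  # assumed to contain no LaTeX special characters

    def render(self):
        # Each block converts itself; the exporter only assembles.
        body = "\n\n".join(block.latex() for block in self.blocks)
        return (
            "\\documentclass{article}\n"
            "\\title{" + self.title + "}\n"
            "\\begin{document}\n"
            "\\maketitle\n\n"
            + body
            + "\n\n\\end{document}\n"
        )

    def save(self):
        directory = Path(self.directory)
        directory.mkdir(parents=True, exist_ok=True)
        (directory / "handout.tex").write_text(self.render())


tex = LaTeXExporter("out", [Message("Some text")], "Report").render()
```

Because `render()` returns a string and only `save()` touches the disk, block conversion and document assembly can both be checked with plain asserts.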

@epogrebnyak
Contributor

epogrebnyak commented Aug 10, 2019

class Handout:
  def __init__(self, directory, title='Handout', format='html', source=True): ...

What does source=True mean? It would be better if it were a more descriptive flag.

@epogrebnyak
Contributor

As for show(): we are considering this a fixed part of the API, right? I remember there was a discussion about, or a change between, .save() and .show(). Once show() is fixed, format="html" is fine for the constructor.

As an extra feature, we can add .save_html(), .save_latex(), etc. methods to the Handout class.

@epogrebnyak
Contributor

The constructor could accept a list of output formats if this is really a use case.

If we provide a save_x() family of methods, the same Handout instance can be used several times, if that fits the workflow, when the user wants both an html and a latex file, for example.
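
Roughly, the save_x() idea could look like the sketch below. The `render_*` helpers, the explicit `path` arguments, and the very simple output markup are assumptions; only text blocks are handled.

```python
class Handout:
    def __init__(self, title="Handout"):
        self.title = title
        self._texts = []

    def add_text(self, string):
        self._texts.append(string)

    def render_html(self):
        paragraphs = "".join("<p>%s</p>" % text for text in self._texts)
        return "<h1>%s</h1>%s" % (self.title, paragraphs)

    def render_latex(self):
        return "\n\n".join(["\\section*{%s}" % self.title] + self._texts)

    def save_html(self, path):
        with open(path, "w") as f:
            f.write(self.render_html())

    def save_latex(self, path):
        with open(path, "w") as f:
            f.write(self.render_latex())


doc = Handout("Report")
doc.add_text("Some text")
html_output = doc.render_html()
latex_output = doc.render_latex()
```

The same instance serves both formats, so the script only needs to run once.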

@epogrebnyak
Contributor

By the way, do you have a preference for how to name the exporters? I can think of exporter, output, target, backend

'Exporter' seems quite natural; I think it stresses that we are doing a one-directional conversion.
'Backend' is closer to a render-only option without saving a file. 'Output' and 'target' are too generic, I think.

@danijar
Owner Author

danijar commented Aug 24, 2019

In addition to LaTeX export, it would make sense to export to Markdown. This might also be easier for users to further convert into other formats downstream.

@epogrebnyak
Contributor

@danijar, what is inside the visit_html method? Does it have an advantage over using a block's own html() method?
