Support compound file extensions. Fix #2780 #2816

holden-nelson · 2020-10-29T18:35:13Z

Before this commit, the file extension matcher could not match compound
file extensions; e.g., '.md.html' would match .html

Pull Request Checklist

Resolves: #2780

Ensured tests pass and (if applicable) updated functional test output
Conformed to code style guidelines by running appropriate linting tools
Added tests for changed code
Updated documentation for changed code

Before this commit, the file extension matcher could not match compound file extensions; e.g., '.md.html' would match `.html`

avaris · 2020-10-29T19:17:09Z

Some problems:

Pretty sure this will fail to include some.file.with.dots.rst
You don't actually handle reader selection logic (inside readers.py:Readers.read_file). foo.md.html would already be selected with previous code, if a reader that handles .html was enabled. The issue is that it was triggering HTMLReader instead of a reader specific to md.html. And same still happens.

holden-nelson · 2020-10-30T17:24:33Z

I see.

I'll admit, and this is probably obvious, but I'm kind of new to open source submissions and all that, so I probably jumped the gun by making this change without a better holistic view of how Pelican works. But I'm very determined to contribute, so I will spend more time getting acquainted, maybe write a plugin, and then revisit this topic.

Thanks!

avaris · 2020-10-30T19:06:51Z

That's all right and we appreciate the contribution :). If you need any help along the way, feel free to ask.

justinmayer · 2021-02-18T14:49:15Z

Hi @holden-nelson. Just wanted to check in regarding the status of this PR. Do you intend to keep working on it?

holden-nelson · 2021-02-19T19:06:50Z

Hello @justinmayer. I had kind of put this on the backburner as other projects and real life happened, but ultimately I would like to do this. I'm finally wrapping up some of that other stuff, but I wouldn't expect to see any progress on this, on my end, for another couple of months. Let me know how that works on your end, I understand if you'd like to close it and move on.

Thanks!

holden-nelson · 2021-12-23T22:43:29Z

Hello @justinmayer and @avaris. I'm ready to make this contribution, but I need to know how to handle the situation regarding a filename with multiple dots, like some.file.with.dots.rst.

Right now, extensions are stripped out of file names with os.path.splitext(basename)[1][1:] which will regard everything after the last dot, if there is a dot, as the extension. My thought was to use Pathlib.Path.suffixes() to grab everything that appears after the first dot and join it together to get an extension, like in my first pull request. But then a file with dots in its name would have the wrong extension and wouldn't be properly read.

I imagine it would come down to either

Closing the issue and not allowing compound file extensions.
Having a setting ALLOW_COMPOUND_FILE_EXTENSIONS=False with a big warning that if you set it equal to True you can't have file names with dots anywhere in your article path or it will break. And then in the code we can account for both scenarios.

Or is there some other way to allow arbitrary compound file extensions and dots in file names? What are your thoughts on this?

avaris · 2021-12-28T10:51:31Z

File with dots can be handled if we use a_string.endswith(extensions). The real challenge is selecting the appropriate reader. e.g. foo.md.html should trigger a reader with .md.html extension rather than .html one. I'm not sure how to handle that nicely (yet).

holden-nelson · 2022-01-02T05:35:14Z

Forgive me if this comment shows ignorance of the Pelican software.

I see how a_string.endswith() could be used to check for a particular file extension, but I don't see how it's useful for an arbitrary compound one if we still allow dots in file names. I guess what I'm saying is, the way I see it,

if we're only trying to add support for md.html files specifically, we could add a check in Generators.generate_context() for f.endswith('.md.html') . If True, we call Readers.read_file() on it with the fmt=md.html parameter. Otherwise we call it without the fmt parameter. We'd need a quick check in Generators._include_path() well. I don't know that this is the most clean way, especially if down the road we wanted to add support for more compound extensions, but conceptually something like this.
If we want to add support for arbitrary compound extensions and dropped support for filenames with dots we should be able to just switch to Pathlib.Path.suffixes() and be good.
I don't understand how it's possible to allow both arbitrary compound extensions and dots in filenames. Like beyond Python specifically I don't understand how it's logically possible.

avaris · 2022-01-02T16:09:27Z

If we are going to support this, it should be generic and not pinned down to a specific extension. Also I don't see the particular issue with filenames with dots. It's a pattern matching problem. We are trying to match the longest possible extension that matches available extensions. So a file like file.with.dots.md.html can be considered as one of the following possibilities:

file.with.dots.md + .html
file.with.dots + .md.html
file.with + .dots.md.html
file + .with.dots.md.html

Now the problem is which of these combinations is compatible with available readers. And the challenging part would be selecting the longest matching extension from available readers. One possible (and simple) option is "ranking the extensions and their readers" (say, based on the number of dots) and going them one by one starting from the highest until a match is found and that would be the preferred reader. But this would generally be slower than current approach, which is extract the last extension and check if it is in a dict...

Another approach is building a "tree of readers" based on the extensions and each level of compound extension makes it one more level deep. Something like

readers = {
    '.html': {
        '': HTMLReader,
        '.md': MDHTMLReader,
    },
    # ...
}

So, once .html is matched, it looks for the next extension and so on until it can't match anymore. You can think of it like working backwards from pathlib.Path.suffixes. This is quite a bit more involved but would perform a lot better.

holden-nelson · 2022-01-03T22:03:43Z

Ahh. Okay, now I see how that would work. Thanks for taking the time to explain.

I'll continue working on this feature and submit a PR soon.

Created a new class ReaderTree that is an infinitely nested defaultdict containing components of the extension. See comments on PR getpelican#2816.

Support compound file extensions. Fix #2780

d44d27f

Before this commit, the file extension matcher could not match compound file extensions; e.g., '.md.html' would match `.html`

justinmayer added the needs more work label Oct 5, 2021

holden-nelson mentioned this pull request Jan 7, 2022

Add support for compound file extensions. #2969

Open

4 tasks

justinmayer added awaiting review and removed needs more work labels Jun 24, 2023

justinmayer requested a review from avaris June 24, 2023 09:45

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support compound file extensions. Fix #2780 #2816

Support compound file extensions. Fix #2780 #2816

holden-nelson commented Oct 29, 2020 •

edited

avaris commented Oct 29, 2020 •

edited

holden-nelson commented Oct 30, 2020

avaris commented Oct 30, 2020

justinmayer commented Feb 18, 2021

holden-nelson commented Feb 19, 2021

holden-nelson commented Dec 23, 2021

avaris commented Dec 28, 2021

holden-nelson commented Jan 2, 2022

avaris commented Jan 2, 2022

holden-nelson commented Jan 3, 2022

Support compound file extensions. Fix #2780 #2816

Are you sure you want to change the base?

Support compound file extensions. Fix #2780 #2816

Conversation

holden-nelson commented Oct 29, 2020 • edited

Pull Request Checklist

avaris commented Oct 29, 2020 • edited

holden-nelson commented Oct 30, 2020

avaris commented Oct 30, 2020

justinmayer commented Feb 18, 2021

holden-nelson commented Feb 19, 2021

holden-nelson commented Dec 23, 2021

avaris commented Dec 28, 2021

holden-nelson commented Jan 2, 2022

avaris commented Jan 2, 2022

holden-nelson commented Jan 3, 2022

holden-nelson commented Oct 29, 2020 •

edited

avaris commented Oct 29, 2020 •

edited