Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unwanted unescaping of http url strings #355

Open
NicoHood opened this issue Jun 12, 2022 · 7 comments · May be fixed by #356
Open

Unwanted unescaping of http url strings #355

NicoHood opened this issue Jun 12, 2022 · 7 comments · May be fixed by #356

Comments

@NicoHood
Copy link
Contributor

Hi!
I have an url in my ical event description that is already html encoded. Here is an example:

https://www.facebook.com/events/756119502186737/?acontext=%7B%22source%22%3A5%2C%22action_history%22%3A[%7B%22surface%22%3A%22page%22%2C%22mechanism%22%3A%22main_list%22%2C%22extra_data%22%3A%22%5C%22[]%5C%22%22%7D]%2C%22has_source%22%3Atrue%7D

This is how it looks in the ical file:

DESCRIPTION:https://www.facebook.com/events/7561195021867
 37/?acontext=%7B%22source%22%3A5%2C%22action_history%22%3A[%7B%22surface%22
 %3A%22page%22%2C%22mechanism%22%3A%22main_list%22%2C%22extra_data%22%3A%22%
 5C%22[]%5C%22%22%7D]%2C%22has_source%22%3Atrue%7D

But some characters now get unescaped for some unknown reason:

https://www.facebook.com/events/756119502186737/?acontext=%7B%22source%22:5,%22action_history%22:[%7B%22surface%22:%22page%22,%22mechanism%22:%22main_list%22,%22extra_data%22:%22\%22[]\%22%22%7D],%22has_source%22:true%7D

It turns out, that this code causes the issue:
https://github.com/collective/icalendar/blob/master/src/icalendar/parser.py#L273

It converts %3A to : which in my case is NOT wanted. The url is broken then.

Why was this html unescape introduced and how can we fix that?

@NicoHood
Copy link
Contributor Author

Wouldn't it make more sense to escape them with the character function, with backslashes instead of html codes?
https://github.com/collective/icalendar/blob/master/src/icalendar/parser.py#L25

@NicoHood
Copy link
Contributor Author

NicoHood commented Jun 12, 2022

I see what the issue is: escape_string is used as a workaround to hide \: characters and only search for "real" : splitting characters. However, what nobody thought of was, that this will modify the final output, if there already were escaped characters. This solution is quite of wonky and error prone and should be fixed. The splitting should be parsed more smart, and not by escaping/unescaping which will always lead to errors, if implemented like that. Also it makes it very hard to understand, as the html escape characters are absolutely missplaced and have nothing to do with the code itself.

@NicoHood
Copy link
Contributor Author

NicoHood commented Jun 12, 2022

This is my tested recoding of this function. It completely removed the escaping functions, as they are not required anymore.

What I am not sure about is, if inside quotes, backslashes should also quote the quote itself. Example:
"this text is quote and it even contains a \" quote mark"

I can implement that, but it was not implemented before and I am not sure if the spec even allows that. This document states, that is is not implemented very often, so we could just stick to that?
https://tools.ietf.org/id/draft-daboo-ical-vcard-parameter-encoding-02.html#rfc.appendix.Appendix%20A

def parts(self):
    """
    Split the content line up into (name, parameters, values) parts.
    
    Example with parameter:
    DESCRIPTION;ALTREP="cid:part1.0001@example.org":The Fall'98 Wild

    Example without parameters:
    DESCRIPTION:The Fall'98 Wild
    
    https://icalendar.org/iCalendar-RFC-5545/3-2-property-parameters.html
    """
    try:
        st = self
        name_split = None
        value_split = None
        in_quotes = False
        # Any character can be escaped using a backslash, e.g.: "test\:test"
        quote_character = False
        for i, ch in enumerate(st):
            # We can also quote using quotation marks. This ignores any output, until another quote appears.
            if ch == '"':
                in_quotes = not in_quotes
                continue
                
            # Ignore input, as we are currently in quotation mark quotes
            if in_quotes:
                continue
            
            # Skip quoted character
            if quote_character:
                quote_character = False
                continue

            # The next character should be ignored
            if ch == '\\':
                quote_character = True
                continue

            # The name ends either after the parameter or value delimiter
            if ch in ':;' and not name_split:
                name_split = i

            # The value starts after the value delimiter
            if ch == ':' and not value_split:
                value_split = i

        # Get name
        name = st[:name_split]
        if not name:
            raise ValueError('Key name is required')
        validate_token(name)

        # Check if parameters are empty
        if not name_split or name_split + 1 == value_split:
            raise ValueError('Invalid content line')

        # Get parameters (text between ; and :)
        params = Parameters.from_ical(st[name_split + 1: value_split],
                                      strict=self.strict)

        # Get the value after the :
        values = st[value_split + 1:]
        return (name, params, values)
    except ValueError as exc:
        raise ValueError(
            "Content line could not be parsed into parts: '%s': %s"
            % (self, exc)
        )

@NicoHood
Copy link
Contributor Author

Here is a VEVENT to test with:

BEGIN:VEVENT
DTSTART:20220305T200000Z
DTSTAMP:20220612T093000Z
UID:6co62d1l6cs3eb9lcgp3cb9k6ssm6b9ochim8b9g71hjedb4c8pj6p9pc4@google.com
CREATED:20220223T074954Z
DESCRIPTION:<html-blob>Feier Deine Jugend! <a href="https://www.facebook.co
 m/events/1213722619037860?acontext=%7B%22event_action_history%22%3A[%7B%22s
 urface%22%3A%22page%22%7D]%7D">https://www.facebook.com/events/121372261903
 7860?acontext=%7B%22event_action_history%22%3A[%7B%22surface%22%3A%22page%2
 2%7D]%7D</a></html-blob>
LAST-MODIFIED:20220225T121837Z
LOCATION:Removed
SEQUENCE:1
STATUS:CONFIRMED
SUMMARY:BRAVO HITS Party
TRANSP:OPAQUE
END:VEVENT

@NicoHood
Copy link
Contributor Author

And this is my code as monket patch, so people coming here via google can hotfix their library right now:

# Monkey patch icalendar bug
# https://medium.com/@chipiga86/python-monkey-patching-like-a-boss-87d7ddb8098e
# https://github.com/collective/icalendar/issues/355
from icalendar.parser import validate_token
from icalendar.parser import Parameters

def parts_patched(self):
    """
    Split the content line up into (name, parameters, values) parts.

    Example with parameter:
    DESCRIPTION;ALTREP="cid:part1.0001@example.org":The Fall'98 Wild

    Example without parameters:
    DESCRIPTION:The Fall'98 Wild

    https://icalendar.org/iCalendar-RFC-5545/3-2-property-parameters.html
    """
    try:
        st = self
        name_split = None
        value_split = None
        in_quotes = False
        # Any character can be escaped using a backslash, e.g.: "test\:test"
        quote_character = False
        for i, ch in enumerate(st):
            # We can also quote using quotation marks. This ignores any output, until another quote appears.
            if ch == '"':
                in_quotes = not in_quotes
                continue

            # Ignore input, as we are currently in quotation mark quotes
            if in_quotes:
                continue

            # Skip quoted character
            if quote_character:
                quote_character = False
                continue

            # The next character should be ignored
            if ch == '\\':
                quote_character = True
                continue

            # The name ends either after the parameter or value delimiter
            if ch in ':;' and not name_split:
                name_split = i

            # The value starts after the value delimiter
            if ch == ':' and not value_split:
                value_split = i

        # Get name
        name = st[:name_split]
        if not name:
            raise ValueError('Key name is required')
        validate_token(name)

        # Check if parameters are empty
        if not name_split or name_split + 1 == value_split:
            raise ValueError('Invalid content line')

        # Get parameters (text between ; and :)
        params = Parameters.from_ical(st[name_split + 1: value_split],
                                      strict=self.strict)

        # Get the value after the :
        values = st[value_split + 1:]
        return (name, params, values)
    except ValueError as exc:
        raise ValueError(
            "Content line could not be parsed into parts: '%s': %s"
            % (self, exc)
        )

from icalendar import parser
parser.Contentline.parts = parts_patched
# End of monkey patch

@niccokunzmann
Copy link
Member

@NicoHood Would it be ok for you to create a pull request for this? It could be just the code. I am not a contributor to this project but willing to look at it.

NicoHood added a commit to NicoHood/icalendar that referenced this issue Jun 15, 2022
@NicoHood NicoHood linked a pull request Jun 15, 2022 that will close this issue
@NicoHood
Copy link
Contributor Author

Sure. #356

@niccokunzmann niccokunzmann changed the title Unwanted unescaping if http url strings Unwanted unescaping of http url strings Aug 21, 2022
@jacadzaca jacadzaca linked a pull request Sep 5, 2022 that will close this issue
niccokunzmann added a commit that referenced this issue Oct 3, 2022
niccokunzmann pushed a commit that referenced this issue Oct 3, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants