RSS XML can contain invalid characters #3268

Closed
horgh opened this issue Apr 3, 2017 · 12 comments · Fixed by #11738

Comments

@horgh

horgh commented Apr 3, 2017

I found that RSS feeds that Hugo generates can contain characters that are invalid for XML.

The XML 1.0 spec defines valid characters: https://www.w3.org/TR/2006/REC-xml-20060816/#charsets

One I encountered in the wild in a blog using Hugo is U+000b (\v, vertical tab). (It was this blog, if you're interested: https://blog.hypriot.com/)
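As a rough illustration of what that section of the spec allows, here is a minimal sketch of the XML 1.0 `Char` production as a Go predicate. The function name `isValidXMLChar` is made up for this example and is not part of Hugo or the standard library.

```go
package main

import "fmt"

// isValidXMLChar is a hypothetical helper (not a Hugo or stdlib function).
// It reports whether r is permitted by the XML 1.0 Char production:
// #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
func isValidXMLChar(r rune) bool {
	return r == 0x09 || r == 0x0A || r == 0x0D ||
		(r >= 0x20 && r <= 0xD7FF) ||
		(r >= 0xE000 && r <= 0xFFFD) ||
		(r >= 0x10000 && r <= 0x10FFFF)
}

func main() {
	// Tab and newline are allowed; vertical tab (U+000B) is not.
	for _, r := range []rune{'\t', '\n', '\v', 'a'} {
		fmt.Printf("%U valid=%t\n", r, isValidXMLChar(r))
	}
}
```

Under this reading, tab, line feed and carriage return are the only control characters below U+0020 that may appear in a document, which is why the vertical tab is rejected.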

Trying to parse such XML raises an error with Go's decoder (which is how I noticed this in the first place):

XML syntax error on line 10: illegal character code U+000B

My environment:

  • Hugo Static Site Generator v0.20-DEV linux/amd64 BuildDate: 2017-04-02T17:42:16-07:00
  • Debian Linux (testing) 64bit

Here are two small sample Go programs to help demonstrate the problem:

Create a post with an invalid character:

package main

import "fmt"

func main() {
	post := `+++
date = "2017-04-02T16:11:58+05:30"
draft = false
title = "New post"

+++

Hi there
`

	post += "\u000bsudo apt-get update\u000b"

	fmt.Println(post)
}

Use like this: $ ./create-problem-post > ~/t/bookshelf/content/post/newpost.md

Then re-generate the site: $ hugo

Then try to decode the RSS feed with this program:

package main

import (
	"encoding/xml"
	"io/ioutil"
	"log"
	"os"
)

func main() {
	buf, err := ioutil.ReadAll(os.Stdin)
	if err != nil {
		log.Fatalf("Reading from stdin: %s", err)
	}

	type TestStruct struct {
		Blah string
	}

	t := TestStruct{}

	if err := xml.Unmarshal(buf, &t); err != nil {
		log.Fatalf("Unmarshal XML: %s", err)
	}
}

Like so:

$ cat ~/t/bookshelf/public/index.xml | ./read-problem-post 
2017/04/02 21:27:43 Unmarshal XML: XML syntax error on line 22: illegal character code U+000B

Thank you!

@stale

stale bot commented Dec 6, 2017

This issue has been automatically marked as stale because it has not had recent activity. The resources of the Hugo team are limited, and so we are asking for your help.
If this is a bug and you can still reproduce this error on the master branch, please reply with all of the information you have about it in order to keep the issue open.
If this is a feature request, and you feel that it is still relevant and valuable, please tell us why.
This issue will automatically be closed in the near future if no further activity occurs. Thank you for all your contributions.

@stale stale bot added the Stale label Dec 6, 2017
@horgh
Author

horgh commented Dec 10, 2017

The problem still exists. I just tested with the latest master branch:

$ hugo version
Hugo Static Site Generator v0.32-DEV linux/amd64 BuildDate: 2017-12-10T12:38:16-08:00

I had to change the sample program I provided that creates the problem post slightly to account for front matter changes:

$ cat create-problem-post/main.go 
package main

import "fmt"

func main() {
	post := `---
title: "New post"
date: "2017-04-02T16:11:58+05:30"
draft: false
---

Hi there
`

	post += "\u000bsudo apt-get update\u000b"

	fmt.Println(post)
}

@stale stale bot removed the Stale label Dec 10, 2017
@DarwinJS
Contributor

DarwinJS commented Feb 9, 2018

It is also generating &ldquo and &rdquo which no browser or validator accepts as valid.

@stale stale bot added the Stale label Jun 9, 2018
@horgh
Author

horgh commented Jun 9, 2018

The problem still exists with master as of this moment:

will@snorri:~/t/hugosite$ hugo version
Hugo Static Site Generator v0.42-DEV linux/amd64
will@snorri:~/t/hugosite$ ~/go/src/github.com/horgh/hugo-rss-test/read-problem-post/read-problem-post < public/index.xml
2018/06/09 09:01:41 Unmarshal XML: XML syntax error on line 20: illegal character code U+000B

@stale stale bot removed the Stale label Jun 9, 2018
@stale stale bot added the Stale label Oct 7, 2018
@horgh
Author

horgh commented Oct 7, 2018

This is still a problem with current master:

will@snorri:~/t/hugosite$ hugo version
Hugo Static Site Generator v0.50-DEV linux/amd64 BuildDate: unknown
will@snorri:~/t/hugosite$ rm public/index.xml
will@snorri:~/t/hugosite$ hugo
[snip]
will@snorri:~/t/hugosite$ ~/go/src/github.com/horgh/hugo-rss-test/read-problem-post/read-problem-post < public/index.xml
2018/10/07 10:28:55 Unmarshal XML: XML syntax error on line 20: illegal character code U+000B

@moorereason
Contributor

Notes from a short investigation:

I attempted to use <![CDATA[ ... ]]> around the <description> contents, but that didn't fix the issue. The illegal character within the CDATA block still violates the XML spec. (Additionally, we use html/template for the RSS feed, so the <![CDATA[ gets escaped if we try to use that in the RSS template, anyway.)
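To illustrate the CDATA point, here is a small sketch that wraps a payload in a CDATA section and feeds it to Go's `encoding/xml` decoder. The `description` type and `decodeCDATA` helper are invented for this example and do not come from Hugo.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// description is a hypothetical type that captures an element's character data.
type description struct {
	Text string `xml:",chardata"`
}

// decodeCDATA is a hypothetical helper: it wraps payload in a CDATA section
// inside a <description> element and tries to decode the result.
func decodeCDATA(payload string) (string, error) {
	doc := "<description><![CDATA[" + payload + "]]></description>"
	var d description
	if err := xml.Unmarshal([]byte(doc), &d); err != nil {
		return "", err
	}
	return d.Text, nil
}

func main() {
	// Markup characters are fine inside CDATA.
	text, err := decodeCDATA("a < b")
	fmt.Printf("%q %v\n", text, err)

	// A vertical tab is still an illegal character, CDATA or not.
	_, err = decodeCDATA("sudo\u000bapt-get update")
	fmt.Println(err)
}
```

CDATA only changes how markup characters such as `<` and `&` are interpreted. It does not widen the set of characters that may appear in the document, so the decoder should still reject the second payload.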

I then added a transform.XMLEscape template function that essentially calls xml.EscapeText. That doesn't work by itself because xml.EscapeText will bail out when it finds an illegal character.

So, it looks like we'd need to add a sanitizeXML function to strip illegal characters prior to escaping.
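A minimal sketch of what such a function could look like, assuming the goal is simply to drop every rune outside the XML 1.0 `Char` range before escaping. The names `sanitizeXML` and `escapeForXML` are placeholders for illustration; this is not the code that ended up in Hugo.

```go
package main

import (
	"bytes"
	"encoding/xml"
	"fmt"
	"strings"
)

// sanitizeXML is a hypothetical helper that drops runes not allowed by the
// XML 1.0 Char production. Invalid UTF-8 bytes are seen by strings.Map as
// U+FFFD, which is inside the allowed range, so they come out as U+FFFD.
func sanitizeXML(s string) string {
	return strings.Map(func(r rune) rune {
		if r == 0x09 || r == 0x0A || r == 0x0D ||
			(r >= 0x20 && r <= 0xD7FF) ||
			(r >= 0xE000 && r <= 0xFFFD) ||
			(r >= 0x10000 && r <= 0x10FFFF) {
			return r
		}
		return -1 // a negative return value tells strings.Map to drop the rune
	}, s)
}

// escapeForXML is a hypothetical helper that sanitizes first, then escapes.
func escapeForXML(s string) (string, error) {
	var buf bytes.Buffer
	if err := xml.EscapeText(&buf, []byte(sanitizeXML(s))); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	out, err := escapeForXML("\u000bsudo apt-get update\u000b <b>")
	fmt.Printf("%q %v\n", out, err)
}
```

Stripping is only one possible policy. Replacing the offending rune with a space or with U+FFFD would preserve word boundaries, at the cost of leaving a visible marker in the feed.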

@horgh
Author

horgh commented Oct 8, 2018

Thanks for looking at this! I was actually going to take a stab at fixing it too.

Your XMLEscape idea seems great! Regarding the error from xml.EscapeText: I tested with U+000B and it output the Unicode replacement character for it rather than erroring. What character did you test with that caused an error?

A sanitizeXML function seems okay too if there are indeed errors.
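For reference, the behaviour described above can be checked with a few lines of Go. The `escapeText` wrapper is just a convenience for this sketch and is not an existing Hugo template function.

```go
package main

import (
	"bytes"
	"encoding/xml"
	"fmt"
)

// escapeText is a hypothetical wrapper around xml.EscapeText that returns
// the escaped form of s as a string.
func escapeText(s string) (string, error) {
	var buf bytes.Buffer
	if err := xml.EscapeText(&buf, []byte(s)); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	out, err := escapeText("sudo\u000bapt-get <update>")
	// %q makes the substituted rune visible in the printed string.
	fmt.Printf("%q %v\n", out, err)
}
```

With the `encoding/xml` versions I am aware of, the vertical tab comes back as U+FFFD (the Unicode replacement character) and the error is nil, so the escaped output stays well-formed even though the original character is lost.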

@moorereason
Contributor

I looked at the xml.EscapeText result rather quickly at the end, so you may be right. I'll take another look.

horgh added a commit to horgh/hugo that referenced this issue Oct 8, 2018
This is to avoid including characters invalid for XML.

Fixes gohugoio#3268
@horgh
Author

horgh commented Oct 8, 2018

I have a branch that uses the EscapeText template method: https://github.com/gohugoio/hugo/compare/master...horgh:horgh/rss-invalid-chars?expand=1

I had some trouble with the tests. For some reason, in the tests the vertical tab disappears altogether. If I build and run a test against a hugo directory it works fine though. Any ideas? Or maybe you're making a branch anyway!

Edit: And I don't understand that Travis failure!

pwindle pushed a commit to pwindle/hugo that referenced this issue Aug 31, 2019
This is to avoid including characters invalid for XML.

Fixes gohugoio#3268
jmooring added a commit to jmooring/hugo that referenced this issue Nov 25, 2023
This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 19, 2023