Issue with whitespace #6

freshyill · 2017-06-18T19:26:49Z

I'm having an issue with whitespace and I'm wondering if Camaro is handling it as-designed, or if I should look to another package to help with this.

Given this chunk of XML (truncated, but you get the idea)

<body>
 … 
he conducted research in immunology and rheumatology.</p>
</sec>
</sec>
<sec disp-level="1">
<title>Eye on 45</title>
<sec disp-level="2">
<title>Protests take shape</title>
<p>As U.S. President …

Using this to construct my template…

body: "article/body",

I get this result…

he conducted research in immunology and rheumatology.Eye on 45Protests take shapeAs U.S. President

I do want to take the entire text of the body as just text, without any tags preserved. Should I expect to see a space character between where tags were stripped, or should it be concatenated like this?

tuananh · 2017-06-19T01:44:54Z

Since the markup in your example is valid XML, it gets parsed along with the rest of the document. So when you query article/body, you don't get a node with plain string content; you get a node with child nodes inside it. Taking the string value of that node strips all the tags inside it.
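To illustrate why the text ends up glued together, here is a minimal sketch of how an element's string value is formed: the text of all descendant nodes is concatenated in document order, with nothing inserted where tags used to be. The node shape and the `stringValue` helper are hypothetical and only for illustration; they are not camaro's internals. The sketch also assumes that whitespace-only text between tags has already been dropped by the parser, which is what the reported output suggests.

```javascript
// Hypothetical mini node model: an element is { name, children },
// and a text node is a plain string. This is NOT camaro's API.
function stringValue(node) {
  // A text node contributes its own text.
  if (typeof node === 'string') return node
  // An element contributes the text of its descendants, joined with no separator.
  return node.children.map(stringValue).join('')
}

const body = {
  name: 'body',
  children: [
    { name: 'p', children: ['rheumatology.'] },
    { name: 'title', children: ['Eye on 45'] },
    { name: 'title', children: ['Protests take shape'] }
  ]
}

console.log(stringValue(body))
// → rheumatology.Eye on 45Protests take shape
```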

The proper way of putting data like this in XML is to wrap it in CDATA, like this:

const transform = require('camaro')

const xml = `
<xml>
    <html>
        <![CDATA[
        <body>
            <p>
                ...he conducted research in immunology and rheumatology
            </p>
            <sec disp-level="1" />
            <title>Eye on 45</title>
            <sec disp-level="2" />
            <title>Protests take shape</title>
        </body>
        ]]>
    </html>
</xml>
`
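// Map the output key `html` to the XPath `xml/html`. Because the markup sits
// inside CDATA, it is treated as literal text rather than parsed as child nodes.
const result = transform(xml, {
    html: 'xml/html'
})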

console.log(JSON.stringify(result, null, 2))

freshyill · 2017-06-20T03:51:08Z

The XML I'm working with is as proper as it's going to get. This example uses JATS, which is a highly structured and quite strict DTD used in scholarly publishing.

It's possible my example wasn't entirely clear. I do want to strip all tags. In this case, I'm only interested in the text.

he conducted research in immunology and rheumatology.Eye on 45Protests take shapeAs U.S. President
                                                    ^         ^                  ^

I've marked where removed tags resulted in text being concatenated. Would you consider having a single space character be placed between removed tags instead of concatenating the text, maybe as an option?

tuananh · 2017-06-20T04:34:13Z

I see. You only want a space character in place of the removed tags. For now that isn't possible, because I don't check whether the path points to a leaf node or to a node that contains child nodes.
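In the meantime, a rough workaround outside camaro might be to strip the tags yourself from the raw XML fragment, putting a space where each tag was. The sketch below is only an assumption about what would be acceptable here: `textWithSpaces` is a hypothetical helper, the fragment is an abbreviated stand-in for the real document, and the regex assumes there are no comments, CDATA sections, or `>` characters inside attribute values. It also does not decode entities such as `&amp;`.

```javascript
// Hypothetical helper, not part of camaro: replace every tag with a single
// space, collapse runs of whitespace, and trim the ends.
function textWithSpaces(xmlFragment) {
  return xmlFragment
    .replace(/<[^>]+>/g, ' ') // each tag becomes one space
    .replace(/\s+/g, ' ')     // collapse newlines and repeated spaces
    .trim()
}

// Abbreviated sample modeled on the XML in this issue.
const fragment = `
<sec>
<p>he conducted research in immunology and rheumatology.</p>
</sec>
<sec disp-level="1">
<title>Eye on 45</title>
<sec disp-level="2">
<title>Protests take shape</title>
<p>As U.S. President</p>
</sec>
</sec>`

console.log(textWithSpaces(fragment))
// → he conducted research in immunology and rheumatology. Eye on 45 Protests take shape As U.S. President
```

One caveat with this approach: it also inserts a space where an inline tag sits in the middle of a word, so it suits block-level structure like the example above better than heavily inline markup.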

freshyill · 2017-06-20T15:55:19Z

OK, thank you for considering!

tuananh added the enhancement label Jun 20, 2017
