Issue with whitespace #6

Open
freshyill opened this issue Jun 18, 2017 · 4 comments

@freshyill

freshyill commented Jun 18, 2017

I'm having an issue with whitespace, and I'm wondering whether Camaro is handling it as designed or whether I should look to another package to help with this.

Given this chunk of XML (truncated, but you get the idea)

<body>
 … 
he conducted research in immunology and rheumatology.</p>
</sec>
</sec>
<sec disp-level="1">
<title>Eye on 45</title>
<sec disp-level="2">
<title>Protests take shape</title>
<p>As U.S. President …

Using this to construct my template…

body: "article/body",

I get this result…

he conducted research in immunology and rheumatology.Eye on 45Protests take shapeAs U.S. President 

I do want to take the entire text of the body as just text, without any tags preserved. Should I expect to see a space character between where tags were stripped, or should it be concatenated like this?

@tuananh
Owner

tuananh commented Jun 19, 2017

Since the HTML data in your example looks like valid XML, it gets parsed as well. So when you query article/body, instead of getting a node with string content inside, you get a node with child nodes inside. Getting the string value of that node strips all the tags inside it.
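To illustrate why the text ends up concatenated: in XPath 1.0, the string value of an element is the concatenation of all its descendant text nodes in document order, with nothing inserted between them. The sketch below uses a toy node model to show this. `stringValue` and the `body` object are hypothetical and are not camaro's internals, and it assumes the parser drops whitespace-only text between tags, as the output in this thread suggests.

```javascript
// Toy model (hypothetical, not camaro's internals): an element is
// { name, children }, and a text node is a plain string.
function stringValue(node) {
  if (typeof node === 'string') return node;
  // Concatenate descendant text in document order, with no separator.
  return node.children.map(stringValue).join('');
}

// Hand-built tree mirroring the shape of the truncated example above.
const body = {
  name: 'body',
  children: [
    { name: 'p', children: ['he conducted research in immunology and rheumatology.'] },
    {
      name: 'sec',
      children: [
        { name: 'title', children: ['Eye on 45'] },
        {
          name: 'sec',
          children: [
            { name: 'title', children: ['Protests take shape'] },
            { name: 'p', children: ['As U.S. President'] },
          ],
        },
      ],
    },
  ],
};

console.log(stringValue(body));
// he conducted research in immunology and rheumatology.Eye on 45Protests take shapeAs U.S. President
```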

The proper way of putting data like this in XML is to wrap it inside CDATA, like this:

const transform = require('camaro')

const xml = `
<xml>
    <html>
        <![CDATA[
        <body>
            <p>
                ...he conducted research in immunology and rheumatology
            </p>
            <sec disp-level="1" />
            <title>Eye on 45</title>
            <sec disp-level="2" />
            <title>Protests take shape</title>
        </body>
        ]]>
    </html>
</xml>
`
// The CDATA content is not parsed, so `html` comes back as the raw markup string.
const result = transform(xml, {
    html: 'xml/html'
})

console.log(JSON.stringify(result, null, 2))

@freshyill
Author

The XML I'm working with is as proper as it's going to get. This example uses JATS, which is a highly structured and quite strict DTD used in scholarly publishing.

It's possible my example wasn't entirely clear. I do want to strip all tags. In this case, I'm only interested in the text.

he conducted research in immunology and rheumatology.Eye on 45Protests take shapeAs U.S. President
                                                    ^         ^                  ^

I've marked where removed tags resulted in text being concatenated. Would you consider having a single space character be placed between removed tags instead of concatenating the text, maybe as an option?

@tuananh
Owner

tuananh commented Jun 20, 2017

I see. You only want a space character placed where those tags were removed. For now, that's not possible because I don't check whether the path points to a leaf node or to a node that contains child nodes.
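As a possible workaround outside camaro, the raw markup can be post-processed in plain JavaScript before or instead of querying it. This is a minimal sketch, not a camaro feature: `textWithSpaces` is a hypothetical helper that replaces every tag with a space and then collapses runs of whitespace. It is regex-based, so it assumes no `>` inside attribute values and does not handle CDATA sections, comments, or entity decoding.

```javascript
// Hypothetical helper, not part of camaro: swap each tag for a space,
// then collapse whitespace runs so only single spaces remain.
function textWithSpaces(xmlFragment) {
  return xmlFragment
    .replace(/<[^>]*>/g, ' ') // every opening, closing, or self-closing tag becomes a space
    .replace(/\s+/g, ' ')     // collapse newlines and repeated spaces
    .trim();
}

// The truncated fragment from the original report, elisions left out.
const fragment = `
he conducted research in immunology and rheumatology.</p>
</sec>
</sec>
<sec disp-level="1">
<title>Eye on 45</title>
<sec disp-level="2">
<title>Protests take shape</title>
<p>As U.S. President`;

console.log(textWithSpaces(fragment));
// he conducted research in immunology and rheumatology. Eye on 45 Protests take shape As U.S. President
```

One caveat with this approach: it also inserts spaces around inline markup, so something like `H<sub>2</sub>O` would come out as `H 2 O`. If that matters, the replacement would need to be limited to block-level tags.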

@freshyill
Author

OK, thank you for considering!
