Inner text of node? #101

Open
no-realm opened this issue Apr 15, 2017 · 19 comments

Comments

@no-realm

no-realm commented Apr 15, 2017

Hi,
I am trying to get the inner text of a node.

<a href="http://example-com">Link Name</a>

I tried different means to get the 'Link Name' part, but I always get NULL back.

myhtml_node_text(); // Returns NULL
myhtml_node_string(); // Returns an object with length == 0
myhtml_token_node_text(); // Returns NULL
myhtml_token_node_string(); // Returns an object with length == 0
@no-realm
Author

no-realm commented Apr 15, 2017

Ah, never mind.
I had to first get the child node and then get the text with myhtml_node_text().
I am basing my program on some C# code which is why I thought that the node with the tag contained the link name.

But myhtml works a bit differently, I guess 😄
A C++ wrapper would be nice... just saying.

@lexborisov
Owner

lexborisov commented Apr 15, 2017

@Randshot
Yea,

<a href="http://example-com">Link Name</a>

creates this tree:

<a href="http://example-com">
    -text: Link Name

To get the text from the <a> node, use myhtml_node_child and myhtml_node_text,
or use a collection:

myhtml_collection_t *nodes = myhtml_get_nodes_by_tag_id(tree, NULL, MyHTML_TAG_A, NULL);
myhtml_node_text(myhtml_node_child(nodes->list[0]), NULL);

Or use the serialization functions, which are similar to innerText in JS:

myhtml_serialization_tree_callback(a_node->child, callback, NULL);
// or buffer
mycore_string_raw_t str = {0};
myhtml_serialization_tree_buffer(a_node->child, &str);

see example

or get all the text nodes at once

myhtml_collection_t *text_nodes = myhtml_get_nodes_by_tag_id(tree, NULL, MyHTML_TAG__TEXT, NULL);
myhtml_node_text(text_nodes->list[0], NULL);

Use Modest to search for nodes by CSS selectors (see example); it's much easier than walking the tree by hand.

P.S.: Yes, a C++ wrapper is needed. Who would write it?!
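
The tree shape described above (an element node whose text lives in a separate child text node) can be illustrated with a small stand-alone model. This is only a sketch: `toy_node_t`, `toy_node_child`, `toy_node_text` and `toy_first_child_text` are hypothetical names invented for the illustration and are not part of myhtml. It shows why asking the `<a>` node itself for text yields NULL while asking its first child works.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: these are NOT myhtml's real types or functions. */
typedef enum { TOY_TAG_TEXT, TOY_TAG_A } toy_tag_t;

typedef struct toy_node {
    toy_tag_t tag;
    const char *text;        /* set only on text nodes */
    struct toy_node *child;  /* first child, or NULL */
    struct toy_node *next;   /* next sibling, or NULL */
} toy_node_t;

/* Returns the first child, or NULL when there is none. */
toy_node_t *toy_node_child(toy_node_t *node)
{
    return node ? node->child : NULL;
}

/* Returns the text of a text node; element nodes carry no text, so NULL. */
const char *toy_node_text(toy_node_t *node)
{
    return (node && node->tag == TOY_TAG_TEXT) ? node->text : NULL;
}

/* Reads the text held by the element's first child, or NULL if absent. */
const char *toy_first_child_text(toy_node_t *element)
{
    assert(element != NULL);
    return toy_node_text(toy_node_child(element));
}
```

With a text node holding "Link Name" attached as the child of an `<a>` node, `toy_node_text` on the `<a>` node returns NULL, and `toy_first_child_text` returns the string.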

@no-realm
Author

no-realm commented Apr 15, 2017

I have started working on one.
My C++ skills aren't the best but it should be sufficient in most cases.
For more intensive usage, the C API should be used.

@lexborisov
Owner

Thanks!
When it's done, could you send me a link to your wrapper?

@no-realm
Author

no-realm commented Apr 15, 2017

@lexborisov
Yeah sure.
I plan to implement it as a single header wrapper which has various classes for myhtml.
I am still unsure about some design aspects though.

For example, I have a Node class which contains a protected pointer to the myhtml node struct and various methods for reading and modifying the node.
Should I read all node properties when the Node object is initialized, or only fetch a property on demand through the provided functions (such as myhtml_node_text)?

@lexborisov
Owner

@Randshot
You do not need to store data in the class. It may become stale, which can cause confusion later.
I think it should look like this, for example:

node->next();
/* class node... */
next() {
    return node->next; /* read from the C structure, or call myhtml_node_next(node) */
}
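
The advice above (keep only the pointer and read everything on demand) can be sketched in plain C, matching the other snippets in this thread; a C++ class would hold the same pointer as a member. Everything here is hypothetical: `toy_node_t` and `toy_node_next` stand in for the library's node and accessor, and `node_handle_t`, `handle_wrap` and `handle_next` are names made up for the illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the C library's node; NOT myhtml's real struct. */
typedef struct toy_node {
    struct toy_node *next;
} toy_node_t;

/* Stand-in for an accessor such as myhtml_node_next(). */
toy_node_t *toy_node_next(toy_node_t *node)
{
    return node ? node->next : NULL;
}

/* Hypothetical wrapper handle: holds only the pointer, no copied fields. */
typedef struct {
    toy_node_t *raw;
} node_handle_t;

node_handle_t handle_wrap(toy_node_t *raw)
{
    node_handle_t h = { raw };
    return h;
}

/* Reads the sibling at call time, so later tree edits are always visible. */
node_handle_t handle_next(node_handle_t h)
{
    assert(h.raw != NULL);
    return handle_wrap(toy_node_next(h.raw));
}
```

Because the handle copies nothing, relinking the underlying node after the handle was created is reflected by the next `handle_next` call; a handle that cached the sibling at construction would keep returning the old one.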

@hbakhtiyor

@Randshot any updates on your wrapper?

@no-realm
Author

no-realm commented May 3, 2017

@hbakhtiyor I haven't had any time for it lately. I will update you when I have some progress.

@fariouche

Hi,
I have a similar issue: I cannot extract text from a <script> tag.
The page I'm testing is google.com.
I'm calling get_child_node() on the <script> node, and it returns NULL (it works fine with a <title> node).
Did I miss something?

@lexborisov
Owner

Hi,
Can you show me the HTML page (the HTML code)?

@fariouche

dump.log
This is the google page I've got, exactly what I've pushed to myhtml_parse.
myhtml_parse(pCtx->tree, MyENCODING_UTF_8, (char*)html_buffer, html_buffer_size);
No error returned.
Thanks

@lexborisov
Owner

lexborisov commented Jan 12, 2018

Works fine.
Code:

    myhtml_parse(tree, MyENCODING_UTF_8, res.html, res.size);
    myhtml_collection_t *collection = myhtml_get_nodes_by_tag_id(tree, NULL, MyHTML_TAG_SCRIPT, NULL);
    
    for (size_t i = 0; i < collection->length; i++) {
        mycore_string_raw_t str = {0};
        if(collection->list[i]->child == NULL) {
            printf("Oh, God! This not work, I can't believe this is not working\n");
            exit(1);
        }
        
        myhtml_serialization_tree_buffer(collection->list[i]->child, &str);
        
        printf("%s\n", str.data);
        
        mycore_string_raw_destroy(&str, false);
    }

@lexborisov
Owner

Also, there is no get_child_node() function; the function is myhtml_node_child().

@fariouche

Thanks...

Yes, I meant myhtml_node_child(), not get_child_node() (typo).
Strange... I'm not using a collection, and tokenizer_colorize_high_level() seems to work.
I just do the following:
myhtml_parse();
node = myhtml_node_child();
/* verify that the tag is TAG_HTML */
node = myhtml_node_child(node);
/* verify that the tag is TAG_HEAD */
node = myhtml_node_child(node);
while (node) {
    parse_node(node);
    node = myhtml_node_next(node);
}

At some point, parse_node() reaches a TAG_SCRIPT node, and that is where myhtml_node_child(node) returns NULL.

@fariouche

This may be linked to:
myhtml_tree_parse_flags_set(tree,
    MyHTML_TREE_PARSE_FLAGS_SKIP_WHITESPACE_TOKEN |
    MyHTML_TREE_PARSE_FLAGS_WITHOUT_DOCTYPE_IN_TREE);

I just tried the parse_without_whitespace example, and I see that <script> is empty.

@fariouche

I can confirm that this is caused by MyHTML_TREE_PARSE_FLAGS_SKIP_WHITESPACE_TOKEN.

Does a script count as whitespace?

@lexborisov
Owner

I think there's a bug with the MyHTML_TREE_PARSE_FLAGS_SKIP_WHITESPACE_TOKEN flag.
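
For reference, here is a minimal sketch of what "whitespace-only" means for a text token, assuming the flag's name describes its intent: every character is one of the HTML standard's ASCII whitespace characters (TAB, LF, FF, CR, SPACE). This is not myhtml's implementation; `is_html_whitespace` and `is_whitespace_only` are hypothetical helpers. A script body that contains code fails this check, which is consistent with treating the empty <script> as a bug.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* ASCII whitespace as listed by the HTML standard: TAB, LF, FF, CR, SPACE. */
bool is_html_whitespace(char ch)
{
    return ch == '\t' || ch == '\n' || ch == '\f' || ch == '\r' || ch == ' ';
}

/* True when every byte in data[0..length) is whitespace.
   An empty token (length 0) is treated as whitespace-only. */
bool is_whitespace_only(const char *data, size_t length)
{
    assert(data != NULL || length == 0);
    for (size_t i = 0; i < length; i++) {
        if (!is_html_whitespace(data[i])) {
            return false;
        }
    }
    return true;
}
```

Under this definition, the indentation between tags is whitespace-only, while text such as `var a = 1;` inside a <script> is not.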

@donglu

donglu commented Apr 4, 2018

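myhtml_collection_t *text = myhtml_get_nodes_by_tag_id_in_scope(tree, NULL, classname_list->list[i]->child, MyHTML_TAG__TEXT, NULL);

/* guard against an empty result before indexing the list */
if (text != NULL && text->length > 0) {
    const char *title = myhtml_node_text(text->list[0], NULL);
    printf("%s\n", title);
}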

@Azq2
Contributor

Azq2 commented May 23, 2018

If you want a "true" analog of innerText (which is not the same as textContent), I have an example written to follow the spec: https://github.com/Azq2/perl-html5-dom/blob/f57c11343a3c8ab77a5162083791560de7d6746b/DOM.xs#L282

If you want the simpler textContent, see: https://github.com/Azq2/perl-html5-dom/blob/f57c11343a3c8ab77a5162083791560de7d6746b/DOM.xs#L252
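
To make the difference concrete: textContent of an element is the concatenation of all descendant text nodes in tree order, with no regard to rendering, whereas innerText also takes styling and line breaks into account. Below is a minimal sketch of the textContent half only, on a toy tree. It is not the code behind the links above; `toy_node_t`, `toy_append` and `toy_collect_text` are hypothetical names for the illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy model only: NOT myhtml's real node structure. */
typedef struct toy_node {
    const char *text;        /* non-NULL only on text nodes */
    struct toy_node *child;  /* first child, or NULL */
    struct toy_node *next;   /* next sibling, or NULL */
} toy_node_t;

/* Appends src to the NUL-terminated dst, never writing past cap bytes. */
void toy_append(char *dst, size_t cap, const char *src)
{
    size_t used = strlen(dst);
    if (used + 1 >= cap) {
        return; /* buffer already full */
    }
    strncat(dst, src, cap - used - 1);
}

/* Appends the text of every descendant text node of `node`, in tree order.
   `out` must be a NUL-terminated buffer of `cap` bytes (cap > 0). */
void toy_collect_text(const toy_node_t *node, char *out, size_t cap)
{
    assert(node != NULL && out != NULL && cap > 0);
    for (const toy_node_t *c = node->child; c != NULL; c = c->next) {
        if (c->text != NULL) {
            toy_append(out, cap, c->text);
        } else {
            toy_collect_text(c, out, cap);
        }
    }
}
```

For a tree modeling `<a>Link <b>bold</b></a>`, calling `toy_collect_text` on the `<a>` node with an empty buffer fills it with the text of both text nodes joined together; output that is too long for the buffer is truncated rather than overflowing it.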
