[From Scrapy] Selector.extract_links and links #26

Open
nramirezuy opened this issue Jan 27, 2016 · 2 comments · May be fixed by #216

Comments

@nramirezuy

This was originally on Scrapy. Moving it to Parsel to continue discussion.

scrapy/scrapy#331

@nramirezuy nramirezuy changed the title [Moved] Selector.extract_links and links [From Scrapy] Selector.extract_links and links Jan 27, 2016
@nramirezuy
Author

I think this can be implemented as two different methods: one returning text and the other returning the actual Selector instances.

@Gallaecio
Member

Gallaecio commented Apr 1, 2019

What about this?

  • We implement uri(), which returns the first URI found in a selector or selector list, as a string, resolved to an absolute URI when possible.
  • We implement uris(), which iterates through all the URIs found in a selector or selector list, as strings, resolved to absolute URIs when possible.
  • When a selector matches an attribute or element text, the matched string is assumed to be a URI and is resolved to an absolute URI when possible. This allows using uri() and uris() to extract resolved URIs from arbitrary element strings and attributes, including custom data-* attributes.
  • When a selector matches an element, we look for all known HTML URI attributes in the matched element and all child elements.
    • If users only want URIs from attributes of the specified element, and not of its children, they must perform further filtering with xpath or css beforehand, pointing to the desired attributes.
    • If users only want specific URI attributes, they must perform further filtering with xpath or css beforehand.
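
To make the rules above concrete, here is a rough sketch of the intended behaviour. It is not Parsel code: it uses only the standard library, it stands in xml.etree elements and plain strings for selector matches, and the function signatures, the base_url parameter, and the short URI_ATTRIBUTES list are all assumptions made for illustration.

```python
from urllib.parse import urljoin
import xml.etree.ElementTree as ET

# Hypothetical and incomplete list of HTML attributes that hold URIs.
URI_ATTRIBUTES = ("href", "src", "action", "cite", "poster")


def uris(matches, base_url):
    """Yield absolute URIs from a list of matches (strings or elements)."""
    for match in matches:
        if isinstance(match, str):
            # Attribute value or element text: assumed to be a URI.
            yield urljoin(base_url, match.strip())
        else:
            # Element: scan the element itself and all of its descendants.
            for element in match.iter():
                for name in URI_ATTRIBUTES:
                    value = element.get(name)
                    if value is not None:
                        yield urljoin(base_url, value.strip())


def uri(matches, base_url):
    """Return the first URI found, or None if there is none."""
    return next(uris(matches, base_url), None)


# Well-formed sample markup, so that the standard library parser accepts it.
root = ET.fromstring(
    '<div data-next="/page/2">'
    '<a href="a.html">A</a>'
    '<p><img src="/img/b.png"/></p>'
    '</div>'
)
base = "https://example.com/docs/"

all_uris = list(uris([root], base))                  # element match
first_uri = uri([root], base)                        # first URI only
custom = list(uris([root.get("data-next")], base))   # string match (data-* attribute)
```

In this sketch the element match picks up href and src from nested elements, while the data-next value is only resolved because it is passed in as a string, which mirrors the "filter with xpath or css beforehand" rule for custom attributes.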

As for a version that returns Selector instances, what’s the use case? Maybe providing some predefined, importable XPath expressions (e.g. from parsel.xpath import URI_XPATH) would be a better approach.
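
For illustration only, such a predefined expression could be built from a list of attribute names. The URI_XPATH name comes from the example above; the parsel.xpath module does not exist as far as this thread shows, and the attribute list here is a hypothetical, incomplete one.

```python
# Hypothetical and incomplete list of HTML attributes that hold URIs.
URI_ATTRIBUTES = ("href", "src", "action", "cite", "poster")

# ".//@name" selects the attribute on the context element and on all of its
# descendants; "|" is the XPath union operator.
URI_XPATH = " | ".join(f".//@{name}" for name in URI_ATTRIBUTES)
```

Users who need Selector instances rather than strings could then pass the expression to the existing xpath() method, for example selector.xpath(URI_XPATH), and call getall() only when they want text.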

@Gallaecio Gallaecio linked a pull request May 4, 2021 that will close this issue