Auto scraper? #132

Open
walking-octopus opened this issue Dec 14, 2023 · 0 comments

There's a neat little package, autoscraper, that lets you quickly build no-code web extractors.

  • You take a page with known content.
  • Say what text from it you need and what alias to bind it to. For example, { "name": " Apple Mac Mini (256GB SSD, M1, 8GB)", "current_bid": "US $130.50", "end_of_bid": "Saturday, 11:32 PM" }
  • Fit the model to your known page and known data.
  • It then tries to find which DOM selectors yield the desired data most accurately and saves them in a model object you can pickle. You probably should, since the known page may die long before the DOM changes, so it's best to keep model creation somewhere in a notebook.
  • Now you can just predict that data from new URLs/DOMs.
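To make the fit/predict flow above concrete, here is a minimal standard-library sketch of the general idea. It is not autoscraper's actual algorithm or API: the names `fit`, `predict`, and `text_paths`, the use of a (tag, class) path as the "selector", and the sample HTML strings are all assumptions made for illustration.

```python
from html.parser import HTMLParser

# Tags that never get a closing tag, so they must not sit on the path stack.
VOID_TAGS = {"br", "img", "input", "meta", "link", "hr"}


class _TextPaths(HTMLParser):
    """Collect (path, text) pairs; a path is a tuple of (tag, class) steps."""

    def __init__(self):
        super().__init__()
        self._stack = []
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self._stack.append((tag, dict(attrs).get("class") or ""))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; tolerates sloppy HTML.
        for i in range(len(self._stack) - 1, -1, -1):
            if self._stack[i][0] == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.found.append((tuple(self._stack), text))


def text_paths(html):
    parser = _TextPaths()
    parser.feed(html)
    parser.close()
    return parser.found


def fit(html, wanted):
    """Map each alias to the path of the first node whose text matches."""
    paths = text_paths(html)
    model = {}
    for alias, text in wanted.items():
        for path, found in paths:
            if found == text.strip():
                model[alias] = path
                break
    return model


def predict(html, model):
    """Return the first text found at each learned path, or None if absent."""
    by_path = {}
    for path, text in text_paths(html):
        by_path.setdefault(path, text)
    return {alias: by_path.get(path) for alias, path in model.items()}


# Hypothetical pages: same layout, different content.
KNOWN = ('<div class="item"><h1 class="title">Apple Mac Mini</h1>'
         '<span class="bid">US $130.50</span></div>')
NEW = ('<div class="item"><h1 class="title">ThinkPad X220</h1>'
       '<span class="bid">US $45.00</span></div>')

model = fit(KNOWN, {"name": "Apple Mac Mini", "current_bid": "US $130.50"})
result = predict(NEW, model)
print(result)
```

The learned `model` here is just a dict of tuples, so it can be pickled and reused after the known page is gone. A real tool would presumably need fuzzier matching and some ranking of candidate selectors, which this sketch leaves out.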

I actually wonder whether the idea can be extended to also use data from the heap to try to get the text out, especially given that digging through the heap is a lot messier than hunting for a selector.

This could be prototyped as another CLI on top of the heap, HTML, and image exporting here.
