Auto scraper? #132

walking-octopus · 2023-12-14T19:57:38Z

There's a neat little package, autoscraper, that lets you quickly build no-code web extractors.

You take a page with known content.
Specify which text from it you need and what alias to bind each value to. For example, { "name": " Apple Mac Mini (256GB SSD, M1, 8GB)", "current_bid": "US $130.50", "end_of_bid": "Saturday, 11:32 PM" }
Fit the model to your known page and known data.
It then looks for the DOM selectors that yield the desired data most accurately and saves them in a model object you can pickle. You probably should pickle it, since the known page may die long before the DOM changes, so it's best to keep model creation somewhere in a notebook.
Now you can just predict that data from new URLs/DOMs.
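
To make the fit/predict idea concrete, here is a toy sketch using only the Python standard library. It is not autoscraper's actual algorithm or API: the `fit`, `predict`, and `text_paths` helpers, the "path of (tag, class) steps" stand-in for a selector, and both HTML snippets are made up for illustration.

```python
import pickle
from html.parser import HTMLParser

# Tags that never get a closing tag, so they must not go on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class _TextPaths(HTMLParser):
    """Collect (path, text) pairs; a path is a tuple of (tag, class) steps."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append((tag, dict(attrs).get("class") or ""))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating sloppy markup.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.found.append((tuple(self.stack), text))


def text_paths(html):
    parser = _TextPaths()
    parser.feed(html)
    parser.close()
    return parser.found


def fit(html, wanted):
    """Map each alias to the path of the first text node equal to its known value."""
    model = {}
    for alias, value in wanted.items():
        for path, text in text_paths(html):
            if text == value.strip():
                model[alias] = path
                break
    return model


def predict(model, html):
    """Read the text found at each learned path in a new document."""
    found = text_paths(html)
    return {
        alias: next((text for p, text in found if p == path), None)
        for alias, path in model.items()
    }


# Hypothetical known page and a later page with the same layout.
known_html = (
    '<div class="item"><h1 class="title"> Apple Mac Mini (256GB SSD, M1, 8GB)</h1>'
    '<span class="bid">US $130.50</span><span class="ends">Saturday, 11:32 PM</span></div>'
)
new_html = (
    '<div class="item"><h1 class="title">ThinkPad X220</h1>'
    '<span class="bid">US $45.00</span><span class="ends">Sunday, 9:05 AM</span></div>'
)

wanted = {
    "name": " Apple Mac Mini (256GB SSD, M1, 8GB)",
    "current_bid": "US $130.50",
    "end_of_bid": "Saturday, 11:32 PM",
}

model = fit(known_html, wanted)
# The model is plain data, so it survives pickling after the known page is gone.
restored = pickle.loads(pickle.dumps(model))
result = predict(restored, new_html)
print(result)
# {'name': 'ThinkPad X220', 'current_bid': 'US $45.00', 'end_of_bid': 'Sunday, 9:05 AM'}
```

A real tool would presumably rank several candidate selectors per alias rather than keep the first exact match, but the shape is the same: learn from one known page, store the model, reuse it on new pages.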

I actually wonder whether the idea can be extended to also use data from the heap to try to get the text out, especially given that digging through the heap is a lot messier than hunting for a selector.

This could be prototyped as another CLI on top of the heap, HTML, and image exporting here.
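
A rough sketch of what the heap variant might look like, assuming (and this is purely an assumption, not this project's actual export format) that the heap export can be loaded as nested dicts and lists. The `find_paths`, `fit_heap`, `follow`, and `predict_heap` names and the sample data are hypothetical.

```python
def find_paths(node, value, path=()):
    """Yield every key/index path in a nested dict/list that leads to `value`."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from find_paths(child, value, path + (key,))
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from find_paths(child, value, path + (index,))
    elif node == value:
        yield path


def fit_heap(heap, wanted):
    """Map each alias to the first path whose leaf equals the known value."""
    return {alias: next(find_paths(heap, value), None) for alias, value in wanted.items()}


def follow(node, path):
    """Walk a learned path through a new heap; return None if it no longer exists."""
    for step in path:
        try:
            node = node[step]
        except (KeyError, IndexError, TypeError):
            return None
    return node


def predict_heap(model, heap):
    return {
        alias: follow(heap, path) if path is not None else None
        for alias, path in model.items()
    }


# Hypothetical heap exports for a known page and a new page.
known_heap = {
    "listing": {
        "title": "Apple Mac Mini (256GB SSD, M1, 8GB)",
        "bids": [{"amount": "US $130.50"}],
    }
}
new_heap = {
    "listing": {
        "title": "ThinkPad X220",
        "bids": [{"amount": "US $45.00"}],
    }
}

heap_model = fit_heap(
    known_heap,
    {"name": "Apple Mac Mini (256GB SSD, M1, 8GB)", "current_bid": "US $130.50"},
)
print(heap_model)
# {'name': ('listing', 'title'), 'current_bid': ('listing', 'bids', 0, 'amount')}
print(predict_heap(heap_model, new_heap))
# {'name': 'ThinkPad X220', 'current_bid': 'US $45.00'}
```

Exact-value matching is the weak point here: heap values are often stored unformatted (a number rather than "US $130.50"), so a real prototype would likely need some fuzzy matching between the displayed text and the heap values.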
