Zé The Scraper

Install

Install Berkeley DB

Limitations

Articles (article) are listed by collection date (dateCreated). However, an article may be considered updated and collected again, which causes its collection date and its publication date (datePublished) to diverge.
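The effect can be illustrated with a minimal sketch. The records below are hypothetical; only the field names dateCreated and datePublished come from the description above.

```python
from datetime import datetime

# Hypothetical records, not real scraper output.
articles = [
    {"name": "A", "datePublished": datetime(2018, 1, 1), "dateCreated": datetime(2018, 1, 1)},
    {"name": "B", "datePublished": datetime(2018, 1, 2), "dateCreated": datetime(2018, 1, 2)},
]

# Article A is treated as updated and collected again a few days later,
# so its collection date moves forward while its publication date does not.
articles[0]["dateCreated"] = datetime(2018, 1, 5)

by_collection = [a["name"] for a in sorted(articles, key=lambda a: a["dateCreated"])]
by_publication = [a["name"] for a in sorted(articles, key=lambda a: a["datePublished"])]

print(by_collection)   # ['B', 'A']
print(by_publication)  # ['A', 'B']
```

Anything that needs chronological order by publication should therefore sort on datePublished explicitly rather than rely on the listing order.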

Usage

Crawling using a single spider and a single URL

scrapy crawl <spider_name> -a url='http(s)://someurl.com?query1=a&query2=b'
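A URL with a query string contains characters such as & and ? that most shells treat specially, so the argument is safest when quoted. The sketch below uses only the Python standard library to show the quoted form and the query parameters it carries; the URL is the placeholder from the command above, with https chosen for illustration.

```python
import shlex
from urllib.parse import parse_qs, urlsplit

url = "https://someurl.com?query1=a&query2=b"

# Split the URL and decode its query string.
parts = urlsplit(url)
params = parse_qs(parts.query)

# shlex.quote wraps the whole NAME=VALUE pair so the shell passes it
# to scrapy as a single argument.
argument = shlex.quote("url=" + url)

print(parts.netloc)
print(params)
print("scrapy crawl <spider_name> -a " + argument)
```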

Crawling using a single spider with URLs extracted from Google

scrapy crawl <spider_name> -a search='{ \
  "query": "Enem OR \"Exame Nacional * Ensino Médio\"", \
  "regex": "(?i)Enem|Exame.{0,}Nacional.{0,}Ensino.{0,}Mé?e?dio", \
  "engine": "google", \
  "dateRestrict": "d1",\
  "results_per_page": 50,\
  "pages": 2 \
}'
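To get a feel for what the regex field accepts, the payload can be loaded as JSON and its pattern tried against a few strings. This is only a sketch: the titles are invented, and applying the pattern to result titles is an assumption about how the spiders use it.

```python
import json
import re

# The search payload written as plain JSON (raw string keeps \" intact).
payload = r'''{
  "query": "Enem OR \"Exame Nacional * Ensino Médio\"",
  "regex": "(?i)Enem|Exame.{0,}Nacional.{0,}Ensino.{0,}Mé?e?dio",
  "engine": "google",
  "dateRestrict": "d1",
  "results_per_page": 50,
  "pages": 2
}'''

search = json.loads(payload)
pattern = re.compile(search["regex"])

# Hypothetical titles, used only to exercise the pattern.
titles = [
    "Resultado do ENEM sai hoje",
    "Exame Nacional do Ensino Medio tem novas datas",
    "Vestibular da Fuvest abre inscrições",
]

matching = [title for title in titles if pattern.search(title)]
print(matching)
```

The (?i) prefix makes the match case-insensitive, and Mé?e?dio accepts both the accented and unaccented spelling.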

Crawling using all spiders with URLs extracted from Google

scrapy crawl all -a search='{ \
  "query": "Enem OR \"Exame Nacional * Ensino Médio\"", \
  "regex": "(?i)Enem|Exame.{0,}Nacional.{0,}Ensino.{0,}Mé?e?dio", \
  "engine": "google", \
  "dateRestrict": "d1", \
  "results_per_page": 50, \
  "pages": 2 \
}'

scrapy crawl all \
-a search=google \
-a query="Enem OR \"Exame Nacional * Ensino Médio\"" \
-a regex="(?i)Enem|Exame.{0,}Nacional.{0,}Ensino.{0,}Mé?e?dio" \
-a dateRestrict=d1
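Hand-written JSON on a command line is easy to get wrong, since one missing comma or quote invalidates the whole payload. A possible alternative is to build the command in Python. The sketch below assumes the spider parses the search argument as standard JSON; it only prints the command and does not run it.

```python
import json
import shlex

search = {
    "query": 'Enem OR "Exame Nacional * Ensino Médio"',
    "regex": "(?i)Enem|Exame.{0,}Nacional.{0,}Ensino.{0,}Mé?e?dio",
    "engine": "google",
    "dateRestrict": "d1",
    "results_per_page": 50,
    "pages": 2,
}

# ensure_ascii=False keeps "Médio" readable instead of a \u00e9 escape.
payload = json.dumps(search, ensure_ascii=False)

# Quote the whole NAME=VALUE pair so the shell passes it as one argument.
command = "scrapy crawl all -a " + shlex.quote("search=" + payload)
print(command)
```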

References

http://xpo6.com/list-of-english-stop-words/
Scrapy - Docs | Jobs: pausing and resuming crawls
scrapy.extensions.memusage (https://github.com/scrapy/scrapy/blob/master/scrapy/extensions/memusage.py): good code to extend. Override the _send_report function to send reports to services other than email.

TODO:

Implement the DeltaFetch middleware
Decompose the .n--noticia__newsletter class in the estadao spider
Use https://github.com/codelucas/newspaper
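For the DeltaFetch item, a settings fragment along these lines might be a starting point. It assumes the scrapy-deltafetch plugin is the intended implementation and is installed; the priority value 100 is the one suggested in that plugin's documentation, and any middlewares the project already registers are not shown.

```python
# settings.py (fragment), assuming the scrapy-deltafetch plugin
SPIDER_MIDDLEWARES = {
    "scrapy_deltafetch.DeltaFetch": 100,
}

# The middleware stays inactive unless it is explicitly enabled.
DELTAFETCH_ENABLED = True
```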

Ideas

Relational DB Schema

https://cloud.google.com/bigtable/docs/schema-design

Use this:

lambda

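from enum import Enum

# Newer Scrapy releases ship TakeFirst in the itemloaders package;
# older ones expose it as scrapy.loader.processors.TakeFirst.
from itemloaders.processors import TakeFirst


class AVRO_FIELD_TYPE(Enum):
    # Maps Python type names to schema field types.
    str = 'STRING'
    list = 'RECORD'
    int = 'INTEGER'
    bool = 'BOOLEAN'


# Builds the 'avro' schema entry; fd defaults to a fresh list on every call.
f_avro = lambda ft, md='NULLABLE', fd=None: { 'avro': {
    'field_type': ft.upper(),
    'mode': md,
    'fields': fd if fd is not None else [] } }

@property
def identifier(self):
    # Fall back to TakeFirst() when no output processor is configured.
    self['output_processor'] = self.get('output_processor') or TakeFirst()
    if 'schemas' not in self:
        self['schemas'] = f_avro('STRING', 'NULLABLE', [])

    return self

@identifier.setter
def identifier(self, value):
    # The assigned value is ignored; only the default processor is ensured.
    self['output_processor'] = self.get('output_processor') or TakeFirst()
    return self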
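One detail worth checking in a helper like f_avro is the default for the nested-fields list: a list written as a default value is created once and shared by every call that omits the argument. The sketch below contrasts the two behaviours; the names f_avro_shared and f_avro_safe are hypothetical and exist only for this comparison.

```python
def f_avro_shared(ft, md="NULLABLE", fd=[]):
    # The list default is created once, when the function is defined.
    return {"avro": {"field_type": ft.upper(), "mode": md, "fields": fd}}


def f_avro_safe(ft, md="NULLABLE", fd=None):
    # A None default lets each call build its own list.
    return {"avro": {"field_type": ft.upper(), "mode": md,
                     "fields": [] if fd is None else fd}}


a = f_avro_shared("string")
b = f_avro_shared("record")
a["avro"]["fields"].append("nested")
leaked = b["avro"]["fields"]       # the same list object as a's fields

c = f_avro_safe("string")
d = f_avro_safe("record")
c["avro"]["fields"].append("nested")
isolated = d["avro"]["fields"]     # a separate, still empty list

print(leaked)    # ['nested']
print(isolated)  # []
```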
