Skip to content

labic/ze-the-scraper

Repository files navigation

Zé The Scraper

Build Status

Install

Limitações

  • Os artigos article são listados por ordem da data de coleta dateCreated porem os artigos podem ser considerados com atualizados e serem coletados novamente causado que a data de coleta e data de publicação datePublished divirjam

Usage

Crawlling using a single spider an single url

scrapy crawl <spider_name> -a url=http(s):someurl.com?query1=a&query2=b

Crawlling using a single spider with urls extrected from Google

scrapy crawl <spider_name> -a search='{ \
  "query": "Enem OR \"Exame Nacional * Ensino Médio\"", \
  "regex": "(?i)Enem|Exame.{0,}Nacional.{0,}Ensino.{0,}Mé?e?dio" \
  "engine": "google", \
  "dateRestrict": "d1",\
  "results_per_page": 50,\
  "pages": 2 \
}' 

Crawlling using all spiders with urls extrected from Google

scrapy crawl all -a search='{ \
  "query": "Enem OR \"Exame Nacional * Ensino Médio\"", \
  "regex": "(?i)Enem|Exame.{0,}Nacional.{0,}Ensino.{0,}Mé?e?dio"
  "engine": "google", \
  "dateRestrict": "d1", \
  "results_per_page": 50, \
  "pages": 2 \
}'

scrapy crawl all \
-a search=google \
-a query="Enem OR \"Exame Nacional * Ensino Médio\"" \
-a regex="(?i)Enem|Exame.{0,}Nacional.{0,}Ensino.{0,}Mé?e?dio" \
-a dateRestrict=d1

References

TODO:

Ideas

Relation DB Schema

https://cloud.google.com/bigtable/docs/schema-design

| Row key | Column data | | INEP | NEWS:EDUCACAO (V1 03/01/15):558.40 |

Use this:

lambda

class AVRO_FIELD_TYPE(Enum):
    str = 'STRING'
    list = 'RECORD'
    int = 'INTERGE'
    bool = 'BOOLEAN'

f_avro = lambda ft, md='NULLABLE', fd=[]: { 'avro': { 
    # 'field_type': ft.uppe() if ft else AVRO_FIELD_TYPE[type(ft)], 
    'field_type': ft.uppe(), 
    'mode': md, 
    'fields': fd } }

@property
def identifier(self):
    self['output_processor'] = self.get('output_processor') if self.get('output_processor') \
                                else TakeFirst()
    if not hasattr(self, 'schemas'):
        self['schemas'] = self.f_avro('STRING', 'NULLABLE', [])
    
    return self 

@identifier.setter
def identifier(self, value):
    self['output_processor'] if self.get('output_processor') else TakeFirst()
    return self