What is it?

crappyspider is a generic Scrapy spider:

  • crawls a website
  • logs the visited URLs and their HTTP status codes to standard output
  • generates a log file with the visited URLs (more data coming soon, such as HTTP status codes ;) )

Features

  • a login step
  • restrict crawling through URL patterns (see ‘patterns’ in the ‘config’ doc page)
  • exclude URL patterns (see ‘excluded_patterns’ in the ‘config’ doc page)
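
A config file might look like the sketch below. The ‘patterns’ and ‘excluded_patterns’ keys come from the feature list above; the other keys and all of the values are illustrative assumptions — the ‘config’ doc page is the authoritative reference.

```json
{
  "start_urls": ["http://example.com"],
  "allowed_domains": ["example.com"],
  "patterns": ["/blog/", "/docs/"],
  "excluded_patterns": ["/admin/", "/logout"]
}
```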

Usage

scrapy crawl crappyspider -a start_urls=http://example.com -a allowed_domains=example.com
scrapy crawl crappyspider -a conf=conf.json -a output=outputfile -a output_format=json

Use cases

After deployment:

You can scan the standard output to pick up any HTTP error, such as 500s (pink highlight <3 )

scrapy crawl crappyspider -a start_urls=http://deployed-site.com -a allowed_domains=deployed-site.com
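
If you capture that output, a short script can pull out the failing fetches for you. This sketch assumes Scrapy's standard log format (lines like "Crawled (200) <GET http://example.com/>"); the regex is an assumption, not part of crappyspider itself.

```python
import re

# Scrapy logs each fetch as e.g. "Crawled (200) <GET http://example.com/>".
# This pattern is an assumption about the log format -- adjust it if yours differs.
CRAWLED = re.compile(r"Crawled \((\d{3})\) <GET ([^>]+)>")

def find_errors(log_text):
    """Return (status, url) pairs for every non-2xx fetch found in the log."""
    return [(int(code), url)
            for code, url in CRAWLED.findall(log_text)
            if not 200 <= int(code) < 300]
```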

After code change:

Feed the crawler's output file to any functional testing script.

scrapy crawl crappyspider -a config=config.json -a output_filename=log.json
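
A functional testing script could then start from the output file like this. The sketch assumes the JSON output is a flat array of visited URL strings — the actual schema may differ, so adjust the parsing to match your file.

```python
import json

def load_visited_urls(path):
    """Load the crawler's output file.

    Assumes the file is a JSON array of visited URLs (strings);
    the real schema may differ -- adjust the parsing if it does.
    """
    with open(path) as fh:
        return json.load(fh)

def urls_under(urls, prefix):
    """Keep only the URLs a given functional test cares about."""
    return [url for url in urls if url.startswith(prefix)]
```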

Acknowledgements

Thanks to Peopledoc for open-sourcing the project.