Welcome to Crappyspider’s documentation!

source: https://github.com/novapost/crappyspider
ticketing: https://github.com/novapost/crappyspider/issues
documentation: http://crappyspider.readthedocs.org/en/latest/

What is it?

crappyspider is a generic Scrapy spider:

it crawls a website
it logs the visited URLs and their HTTP status codes to the standard output
it generates a log file with the visited URLs (more data, such as HTTP codes, coming soon ;) )

Features

a login step
restrict crawling to URLs matching given patterns (see ‘patterns’ in the ‘config’ doc page)
exclude URLs matching given patterns (see ‘excluded_patterns’ in the ‘config’ doc page)
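
To make the two pattern features concrete, here is a minimal sketch of include/exclude URL filtering with regular expressions. It is not crappyspider's actual implementation: the function name `should_crawl`, the choice of `re.search` for matching, and the rule that an empty `patterns` list allows everything are all assumptions made for illustration. Only the option names `patterns` and `excluded_patterns` come from the list above.

```python
import re


def should_crawl(url, patterns=(), excluded_patterns=()):
    """Return True if url matches an allowed pattern and no excluded pattern.

    Hypothetical helper, not part of crappyspider.
    Assumption: an empty `patterns` sequence means "allow every URL".
    """
    allowed = not patterns or any(re.search(p, url) for p in patterns)
    excluded = any(re.search(p, url) for p in excluded_patterns)
    return allowed and not excluded


urls = [
    "http://example.com/blog/post-1",
    "http://example.com/blog/logout",
    "http://example.com/admin/",
]

# Keep only blog URLs, and never follow logout links.
kept = [
    u for u in urls
    if should_crawl(u, patterns=[r"/blog/"], excluded_patterns=[r"logout"])
]
print(kept)
```

Excluding a logout URL is a typical reason to combine both options when the login step is used, since following that link would end the authenticated session mid-crawl.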

Usage

scrapy crawl crappyspider -a start_urls=http://example.com -a allowed_domains=example.com
scrapy crawl crappyspider -a conf=conf.json -a output=outputfile -a output_format=json

Use cases

After deployment:

You can scan the standard output to spot any HTTP errors, such as 500s (highlighted in pink <3 ).

scrapy crawl crappyspider -a start_urls=http://deployed-site.com -a allowed_domains=deployed-site.com

After code change:

Feed the crawler’s output file to any functional testing script.

scrapy crawl crappyspider -a config=config.json -a output_filename=log.json
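
As a rough sketch of the consuming side, a functional test script could load the visited URLs from the output file and iterate over them. The schema of that file is not described on this page, so the code below assumes a JSON list of objects with a `url` key. That layout and the helper name `load_visited_urls` are hypothetical, and the demo writes its own stand-in file rather than using real crawler output. Check an actual output file before relying on it.

```python
import json
import os
import tempfile


def load_visited_urls(path):
    """Return the URLs recorded in a crawler output file.

    ASSUMPTION: the file is a JSON list of objects with a "url" key.
    The real crappyspider output format may differ.
    """
    with open(path, encoding="utf-8") as handle:
        records = json.load(handle)
    return [record["url"] for record in records]


# Demo on a hand-written stand-in file (not real crawler output).
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "log.json")
    with open(sample_path, "w", encoding="utf-8") as handle:
        json.dump(
            [{"url": "http://example.com/"}, {"url": "http://example.com/about"}],
            handle,
        )
    visited = load_visited_urls(sample_path)

print(visited)
```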

Acknowledgements

Thanks to Peopledoc for releasing the project as open source.