Initialize a project

scrapy startproject <project-name>
cd <project-name>
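startproject generates roughly the following layout (the exact files can vary slightly between Scrapy versions); the settings.py and pipelines.py referenced in the snippets below live inside the inner package:

<project-name>/
    scrapy.cfg            # deploy/project configuration
    <project-name>/
        items.py          # item definitions
        middlewares.py    # downloader / spider middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings (edited below)
        spiders/          # spider modules created by genspider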

Create and run a spider

scrapy genspider <spider-name> <spider-domain>
scrapy crawl <spider-name>
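genspider drops a skeleton module under spiders/; it looks roughly like the sketch below (spider name quotes and domain example.com are only illustrative). scrapy crawl looks the spider up by its name attribute.

import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'                      # used by `scrapy crawl quotes`
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    def parse(self, response):
        # extraction logic goes here
        pass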

Extra

  • logging
# settings.py
LOG_LEVEL = 'WARNING'
LOG_FILE = './log.log'
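With these settings only WARNING and above end up in ./log.log. A quick way to verify from spider code (a sketch; the messages are illustrative):

def parse(self, response):
    self.logger.debug('suppressed by LOG_LEVEL = WARNING')
    self.logger.warning('written to ./log.log: %s', response.url)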
  • headers
# settings.py
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Referer': 'https://www.baidu.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36',
}
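DEFAULT_REQUEST_HEADERS applies to every request; an individual request can still add or override headers via the headers argument (a minimal sketch, URLs are illustrative):

yield scrapy.Request(
    'https://example.com/api',
    headers={'Referer': 'https://example.com/list'},  # takes precedence over DEFAULT_REQUEST_HEADERS for this request
    callback=self.parse,
)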
  • pipelines
# settings.py
ITEM_PIPELINES = {
    'scrapy_spider.pipelines.ScrapySpiderPipeline': 300,
}
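The number (300) is the pipeline's priority; lower values run first. The referenced class lives in pipelines.py and only needs a process_item method (a minimal sketch; ScrapySpiderPipeline is the name Scrapy generates for a project called scrapy_spider):

# pipelines.py
class ScrapySpiderPipeline:
    def open_spider(self, spider):
        # optional: set up resources (files, DB connections) when the spider starts
        self.items = []

    def process_item(self, item, spider):
        # called for every item the spider yields; must return the item
        # (or raise scrapy.exceptions.DropItem to discard it)
        self.items.append(item)
        return item

    def close_spider(self, spider):
        # optional: flush/close resources when the spider finishes
        spider.logger.info('collected %d items', len(self.items))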
  • detail pages
# in parse(): follow the detail link and pass the partially-filled item along via meta
yield scrapy.Request(
    href,
    callback=self.parse_detail,
    meta={'item': item}
)

def parse_detail(self, response):
    # retrieve the item that was attached to the request
    item = response.meta['item']
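Put together, parse typically builds the item from the list page, hands it to the detail callback, and the detail callback yields the finished item so the pipelines see it complete (a sketch; the selectors are illustrative):

def parse(self, response):
    for row in response.css('li.entry'):                      # illustrative selector
        item = {'title': row.css('a::text').get()}
        href = response.urljoin(row.css('a::attr(href)').get())
        yield scrapy.Request(href, callback=self.parse_detail, meta={'item': item})

def parse_detail(self, response):
    item = response.meta['item']
    item['content'] = response.css('div.content::text').get()  # illustrative selector
    yield item                                                  # pipelines receive the full item here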
  • next page
# in parse(): re-enter the same callback with the next page's URL
yield scrapy.Request(
    next_url,
    callback=self.parse
)
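next_url usually comes from the "next" link on the current page; a sketch that also guards against the last page (the selector is illustrative):

next_href = response.css('a.next::attr(href)').get()  # illustrative selector
if next_href:
    next_url = response.urljoin(next_href)
    yield scrapy.Request(next_url, callback=self.parse)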