# Running several spiders at once from a script
Directory structure: [figure omitted]
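1. With the project already running correctly from the command line, create two spiders, for example:
TestSpider
Test2Spider
2. In the same directory as items.py, create a run.py file. There are three ways to write it; pick any one. The code for each is shown below: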
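Approach 1: run several spiders at once with CrawlerProcess
Source of run_by_CrawlerProcess.py:
# Run several spiders at once with CrawlerProcess
from scrapy.crawler import CrawlerProcess
# Helper that loads the project settings
from scrapy.utils.project import get_project_settings
# Import the spider classes (the spiders you created)
from spiders.test import TestSpider
from spiders.test2 import Test2Spider

# get_project_settings() is required: without it the project's settings.py is not applied,
# and requests can fail with "HTTP status code is not handled or not allowed"
process = CrawlerProcess(get_project_settings())
process.crawl(TestSpider)  # the spider class must be imported above
# process.crawl(Test2Spider)  # uncomment to schedule the second spider as well
process.start()  # blocks here until all scheduled crawls have finished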
Approach 2: run several spiders at once with CrawlerRunner
Source of run_by_CrawlerRunner.py:
# Run several spiders at once with CrawlerRunner
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
# Helper that loads the project settings
from scrapy.utils.project import get_project_settings
# Import the spider classes (the spiders you created)
from spiders.test import TestSpider
from spiders.test2 import Test2Spider

configure_logging()
# get_project_settings() is required: without it the project's settings.py is not applied,
# and requests can fail with "HTTP status code is not handled or not allowed"
runner = CrawlerRunner(get_project_settings())
runner.crawl(TestSpider)
# runner.crawl(Test2Spider)  # uncomment to schedule the second spider as well
d = runner.join()
d.addBoth(lambda _: reactor.stop())
reactor.run()  # the script will block here until all crawling jobs are finished
Approach 3: run several spiders one after another with CrawlerRunner and chained deferreds
Source of run_by_CrawlerRunner_and_Deferred.py:
# Run several spiders sequentially with CrawlerRunner by chaining deferreds
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
# Helper that loads the project settings
from scrapy.utils.project import get_project_settings
# Import the spider classes (the spiders you created)
from spiders.test import TestSpider
from spiders.test2 import Test2Spider

configure_logging()
# get_project_settings() is required: without it the project's settings.py is not applied,
# and requests can fail with "HTTP status code is not handled or not allowed"
runner = CrawlerRunner(get_project_settings())

@defer.inlineCallbacks
def crawl():
    yield runner.crawl(TestSpider)
    # yield runner.crawl(Test2Spider)  # uncomment to run the second spider after the first finishes
    reactor.stop()

crawl()
reactor.run()  # the script will block here until the last crawl call is finished
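The second and third approaches differ only in when the second crawl starts: runner.join() waits on crawls that were all scheduled up front, while yielding each runner.crawl() in turn starts the next spider only after the previous one has finished. The sketch below illustrates that ordering with the standard library's asyncio instead of Twisted or Scrapy. fake_crawl, the delays, and the log list are stand-ins invented for this illustration; they are not part of either library.

```python
import asyncio


async def fake_crawl(name, delay, log):
    """Stand-in for a crawl (not Scrapy): record start, wait, record finish."""
    log.append(f"{name} start")
    await asyncio.sleep(delay)
    log.append(f"{name} done")


async def run_concurrently():
    """Analogue of scheduling every crawl first and then waiting for all of them."""
    log = []
    await asyncio.gather(
        fake_crawl("TestSpider", 0.2, log),
        fake_crawl("Test2Spider", 0.01, log),
    )
    return log


async def run_sequentially():
    """Analogue of chaining: the second crawl starts only after the first ends."""
    log = []
    await fake_crawl("TestSpider", 0.2, log)
    await fake_crawl("Test2Spider", 0.01, log)
    return log


if __name__ == "__main__":
    print(asyncio.run(run_concurrently()))
    print(asyncio.run(run_sequentially()))
```

In the concurrent run both "start" entries are logged before either "done" entry; in the sequential run each spider's "done" entry comes before the next spider's "start".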
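3. In both spider files, change how items and external helper classes (such as HeadersHelper.py) are imported, so that the imports resolve relative to the directory containing run.py.
Original imports:
from ..items import ScrapydoubanmovieItem
from .HeadersHelper import HeadersHelper
Note: these imports work when the spider is started from the command line with scrapy crawl Test, because the spider is then loaded as part of the project package.
Change them to:
from items import ScrapydoubanmovieItem
from .HeadersHelper import HeadersHelper
Note: after this change, scrapy crawl Test fails on the command line, but running run.py works and can run both spiders at once.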
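The behaviour described in step 3 comes from Python's import rules, not from Scrapy itself. The following sketch reproduces it with the standard library only, under some assumptions: myproject, DemoItem, and the generated files are hypothetical stand-ins for a real project, and the `-c "import myproject.spiders.test"` call only approximates how scrapy crawl loads a spider as a submodule of the project package.

```python
import os
import subprocess
import sys
import tempfile
from pathlib import Path


def build_project(root, item_import):
    """Create a tiny stand-in project whose spider module contains `item_import`."""
    pkg = Path(root) / "myproject"  # hypothetical package holding items.py and run.py
    (pkg / "spiders").mkdir(parents=True)
    (pkg / "__init__.py").write_text("")
    (pkg / "spiders" / "__init__.py").write_text("")
    (pkg / "items.py").write_text("class DemoItem(dict):\n    pass\n")
    (pkg / "spiders" / "test.py").write_text(item_import + "\n")
    (pkg / "run.py").write_text("import spiders.test\n")
    return pkg


def run(args, root):
    """Run a child interpreter with `root` importable; return (exit code, stderr)."""
    env = dict(os.environ, PYTHONPATH=str(root))
    proc = subprocess.run([sys.executable, *args], cwd=root, env=env,
                          capture_output=True, text=True)
    return proc.returncode, proc.stderr


def try_both_modes(item_import):
    """Return (result via run.py, result via package import) for one import style."""
    with tempfile.TemporaryDirectory() as root:
        pkg = build_project(root, item_import)
        # Like `python run.py`: the script's directory becomes the import root,
        # so `spiders` is a top-level package.
        as_script = run([str(pkg / "run.py")], root)
        # Roughly like `scrapy crawl`: the spider is a submodule of the project package.
        as_package = run(["-c", "import myproject.spiders.test"], root)
        return as_script, as_package


if __name__ == "__main__":
    for style in ("from ..items import DemoItem", "from items import DemoItem"):
        (script_rc, _), (package_rc, _) = try_both_modes(style)
        print(f"{style!r}: run.py exit={script_rc}, package import exit={package_rc}")
```

With the relative import, the run.py case fails because `..` would reach above the top-level spiders package. With the absolute import, the package-style case fails because no top-level module named items is on the import path.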
4. Run run.py the same way as any other Python file to get the results.