Python Crawler in Practice: Crawling Completed Novels from Qidian with Scrapy
The goal of this post is to use Scrapy to crawl the completed novels on the Qidian novel site. The environment is Ubuntu; as for installing Scrapy, you can look that up yourself.
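For reference, on Ubuntu a typical installation is simply via pip (a minimal example; depending on your system you may first need build dependencies such as a compiler and the libxml2 headers):

pip install scrapy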
scrapy startproject name — open a terminal, change into the directory where you want the project to live, and run the command above to create it; name is the project name.
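For example, running it with the project name tutorial (the name used throughout this post) generates roughly the following layout; the files edited in the rest of the post all live inside it:

tutorial/
    scrapy.cfg            # deploy configuration file
    tutorial/             # the project's Python module
        __init__.py
        items.py          # item definitions (edited below)
        pipelines.py      # item pipelines (edited below)
        settings.py       # project settings (edited below)
        spiders/          # the spider code goes here
            __init__.py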
In the item defined here, title is used to store the book title and desc is used to store the book's content.
import scrapy

class TutorialItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    desc = scrapy.Field()
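A scrapy.Item behaves much like a dict, which is how the spider below uses it: fields are written with subscript assignment and read back with get(). A minimal illustration (not part of the project files):

item = TutorialItem()
item['title'] = u'some book'
item.get('desc')    # returns None until desc has been set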
In pipelines.py you decide how the scraped data is stored; here each book is saved as its own .txt file.
import codecs

# Store every book as a .txt file
class TutorialPipeline(object):
    def process_item(self, item, spider):
        # Create a file named after the book; item.get('title') returns the book title
        with codecs.open(item.get('title') + '.txt', 'w', encoding='utf-8') as f:
            f.write(item.get('desc') + "\n")
        return item
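One note on the design: the spider below accumulates the whole readable text of a book in item['desc'] and yields the item once at the end, so opening the file in 'w' mode is enough. If you preferred to yield one item per chapter instead, a sketch along these lines (my own hypothetical variant, not part of the original project) would append each chapter to the book's file:

class ChapterAppendPipeline(object):
    # Hypothetical variant: each item carries a single chapter in desc
    def process_item(self, item, spider):
        with codecs.open(item.get('title') + '.txt', 'a', encoding='utf-8') as f:
            f.write(item.get('desc') + "\n")
        return item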
The following goes in settings.py; you only need to replace tutorial with your own project's name.
BOT_NAME = 'tutorial'

SPIDER_MODULES = ['tutorial.spiders']
NEWSPIDER_MODULE = 'tutorial.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.54 Safari/536.5'

# Register the pipeline so items get written out as .txt files
ITEM_PIPELINES = {
    'tutorial.pipelines.TutorialPipeline': 300,
}
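Depending on your Scrapy version, a newly generated settings.py may also contain ROBOTSTXT_OBEY = True, in which case requests to Qidian can be filtered out by its robots.txt; for this experiment you would then also set the following (and, to crawl politely, a small DOWNLOAD_DELAY does not hurt):

ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 1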
# -*- coding: utf-8 -*-
import scrapy
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request

from tutorial.items import TutorialItem


class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["qidian.com"]
    # I am downloading the completed novels in Qidian's sports category, so the URL of every
    # listing page is built in a loop: the pages differ only in the page parameter, and the
    # number of pages comes from the total number of books divided by the books per page.
    start_urls = [
        "http://fin.qidian.com/?size=-1&sign=-1&tag=-1&chanId=8&subCateId=-1&orderId=&update=-1&page="
        + str(page) + "&month=-1&style=1&vip=-1"
        for page in range(1, 292 // 20)  # 292 books in total, 20 books per listing page
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # Get the URL of every book on the listing page
        book = hxs.select('//div[@class="book-mid-info"]/h4/a//@href').extract()
        for bookurl in book:
            # Follow each book URL to that book's detail page
            yield Request("http:" + bookurl, self.parseBook, dont_filter=True)

    def parseBook(self, response):
        hxs = HtmlXPathSelector(response)
        # Get the "free reading" URL
        charterurl = hxs.select('//div[@class="book-info "]//a[@class="red-btn J-getJumpUrl "]/@href').extract()
        # Create one item per book
        item = TutorialItem()
        for url in charterurl:
            # The free-reading URL leads to the first chapter of the book
            yield Request("http:" + url, meta={'item': item}, callback=self.parseCharter, dont_filter=True)

    def parseCharter(self, response):
        hxs = HtmlXPathSelector(response)
        # Get the book title
        names = hxs.select('//div[@class="info fl"]/a[1]/text()').extract()
        # Get the item passed along in the request meta
        item = response.meta['item']
        for name in names:
            # Store the book title in the item's title field (only set on the first chapter)
            if item.get('title') is None:
                item['title'] = name
        # Get the chapter title
        biaoti = hxs.select('//h3[@class="j_chapterName"]/text()').extract()
        content = ''
        for biaot in biaoti:
            content = content + biaot + "\n"
        # Get the text of this chapter
        s = hxs.select('//div[@class="read-content j_readContent"]//p/text()').extract()
        for srt in s:
            # Concatenate the chapter title and text so they can be stored in the item's desc field
            content = content + srt
        desc = item.get('desc')
        if desc is None:
            item['desc'] = content
        else:
            item['desc'] = desc + content
        if content == '':
            # Nothing readable on this page any more, so hand the finished item to the pipeline
            yield item
        # Get the URL of the next chapter
        chapters = hxs.select('//div[@class="chapter-control dib-wrap"]/a[@id="j_chapterNext"]//@href').extract()
        for chapter in chapters:
            yield Request("http:" + chapter, meta={'item': item}, callback=self.parseCharter, dont_filter=True)
Although the code above can fetch every book's content, Qidian has a VIP restriction: you must log in with a Qidian VIP account to read the full text of a completed novel. That is a bit of a pity here, since I do not have a Qidian membership.