
2016-2020Chinese-Weather-Analysis (Part 2)

Posted: 2021-04-20 22:53:00


Scraping Data from Tianqihoubao (with the Scrapy Framework)

 

1. Creating the Tianqihoubao Spider

  Before writing any code, we first analyze the Tianqihoubao site against the project requirements. The goal is to extract the daily temperature, weather condition, and wind force/direction for every city from 2016 through 2020. Start at the site's historical-weather index (http://www.tianqihoubao.com/lishi/), shown in Figure 1.

  

  

              Figure 1

  The list shows the cities under each province. Taking Beijing as an example, clicking it leads to the second-level page.

              Figure 2

  Taking the Beijing weather for January 2011 as an example, the third-level (detail) page contains the information we need: date, weather condition, temperature, and wind force/direction.

              Figure 3

  This completes the analysis of the crawl workflow, so coding can begin. First, switch on the command line to the directory where the project will live, then run the following commands to create the project and the spider module:

scrapy startproject tqhbCrawl
cd tqhbCrawl
scrapy genspider -t crawl tqhb_spider tianqihoubao.com
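After these two commands, the generated project layout should look roughly like this (standard Scrapy scaffolding; tqhb_spider.py is the template that genspider produced):

tqhbCrawl/
├── scrapy.cfg
└── tqhbCrawl/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── tqhb_spider.py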

 

2. Defining the Item

With the project created, the first thing to do is define the Item, i.e. the structured data we want to extract. The code is as follows:

import scrapy

class TqhbItem(scrapy.Item):
    # city name
    city_name = scrapy.Field()
    # date
    date = scrapy.Field()
    # weather condition
    state = scrapy.Field()
    # wind force and direction
    wind = scrapy.Field()
    # temperature
    temp = scrapy.Field()
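For reference, a scrapy.Item supports dict-style access, so a populated item can be inspected like a dictionary (a quick sketch with made-up field values):

item = TqhbItem(city_name='北京', date='2016年01月01日',
                state='晴/晴', temp='2℃/-8℃', wind='北风3-4级/北风3-4级')
print(item['city_name'])  # dict-style field access -> '北京'
print(dict(item))         # convert to a plain dict for further processing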

 

3. Writing the Spider Module

The genspider command has already created a spider template based on the CrawlSpider class, named TqhbSpiderSpider. Page parsing is handled by two methods. The detail_url method parses the list page shown in Figure 2 and extracts the links to the third-level (detail) pages. The parse_detail method extracts the weather data shown in Figure 3. The second-level links are extracted by the rule defined in rules (a rule can only extract matching links from the pages it is applied to, which is why the detail_url method is needed to construct the third-level links). Note that a CrawlSpider must not override parse, because CrawlSpider uses that method internally to apply its rules; the detail callback is therefore named parse_detail. The complete code for TqhbSpiderSpider is as follows:

import re

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

from tqhbCrawl.items import TqhbItem


class TqhbSpiderSpider(CrawlSpider):
    name = 'tqhb_spider'
    allowed_domains = ['tianqihoubao.com']
    start_urls = ['http://tianqihoubao.com/lishi']

    rules = (
        Rule(LinkExtractor(allow=r'.+lishi.+html'), callback='detail_url', follow=False),
    )

    def detail_url(self, response):
        # Build absolute URLs for the monthly detail pages (Figure 3)
        base_url = "http://tianqihoubao.com"
        divs = response.xpath("//div[@class='box pcity']")[5:9]
        detail_urls = divs.xpath(".//a/@href").getall()
        for detail_url in detail_urls:
            yield scrapy.Request(base_url + detail_url, callback=self.parse_detail)

    def parse_detail(self, response):
        # Extract the city, date, weather condition, temperature, and wind data
        city_name = response.xpath('//div[@id="s-calder"]/h2/text()').get()
        # Drop all digits, then trim the trailing 9 characters to leave only the city name
        city_name = ''.join(re.findall(r'[^0-9]', city_name))[:-9]
        trs = response.xpath("//tr")[1:]  # skip the table header row
        for tr in trs:
            tds = tr.xpath(".//td")
            # Join each cell's text nodes and collapse all whitespace
            date = "".join(''.join(tds[0].xpath(".//text()").getall()).split())
            state = "".join(''.join(tds[1].xpath(".//text()").getall()).split())
            temp = "".join(''.join(tds[2].xpath(".//text()").getall()).split())
            wind = "".join(''.join(tds[3].xpath(".//text()").getall()).split())
            item = TqhbItem(
                city_name=city_name,
                date=date,
                state=state,
                temp=temp,
                wind=wind)
            yield item
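The repeated "".join(''.join(...).split()) idiom in parse_detail collapses the newlines and padding that fill the site's table cells; a quick sketch with made-up text nodes:

# made-up example of the raw text nodes inside one <td>
raw = ['\r\n', ' 晴 /', '\r\n ', '晴 ']
joined = ''.join(raw)            # '\r\n 晴 /\r\n 晴 '
clean = "".join(joined.split())  # '晴/晴' -- all whitespace removed
print(clean)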

 

4. Pipeline

Next comes the Pipeline, which handles exporting each Item to a CSV file.

from scrapy.exporters import CsvItemExporter


class TqhbPipeline(object):
    def __init__(self):
        self.fp = open("tqhb.csv", 'wb')
        self.exporter = CsvItemExporter(
            self.fp, encoding='utf-8')

    def open_spider(self, spider):
        print("Spider started....")
        self.exporter.start_exporting()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.fp.close()
        print("Spider finished....")

  Finally, uncomment the corresponding lines in settings.py:

ITEM_PIPELINES = {
   'tqhbCrawl.pipelines.TqhbPipeline': 300,
}
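As an aside, for a plain CSV dump Scrapy's built-in feed exports could replace this hand-written pipeline entirely; running the spider with the standard -o flag writes the same fields out automatically:

scrapy crawl tqhb_spider -o tqhb.csv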

 

5. Dealing with Anti-Scraping Mechanisms

  To avoid being detected by the site's anti-scraping mechanisms, three measures are taken: faking a random 'User-Agent', auto-throttling, and disabling robots.txt compliance.

  1. Fake a random User-Agent by editing middlewares.py:

import random


class TqhbDownloaderMiddleware(object):
    # Pool of real-looking Chrome/Edge User-Agent strings to rotate through
    user_agents = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.23 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.27 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.87 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.20 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36 Edg/87.0.664.75",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36 Edg/89.0.774.57",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36 Edg/89.0.774.54",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36 Edg/89.0.774.50",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36 Edg/90.0.818.6",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36 Edg/90.0.818.8",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36 Edg/90.0.818.14"
    ]

    def process_request(self, request, spider):
        # Attach a randomly chosen User-Agent to every outgoing request
        user_agent = random.choice(self.user_agents)
        request.headers["User-Agent"] = user_agent

  Then enable this middleware and set DEFAULT_REQUEST_HEADERS in settings.py:

DOWNLOADER_MIDDLEWARES = {
   'tqhbCrawl.middlewares.TqhbDownloaderMiddleware': 543,
}

DEFAULT_REQUEST_HEADERS = {
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  'Accept-Language': 'en',
}

  2. Auto-throttle settings:

# Base delay between requests; AutoThrottle adjusts around it dynamically
DOWNLOAD_DELAY = 1
AUTOTHROTTLE_ENABLED = True
# Initial download delay and the ceiling applied under high latency
AUTOTHROTTLE_START_DELAY = 5
AUTOTHROTTLE_MAX_DELAY = 60
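Two related AutoThrottle settings worth knowing (standard Scrapy options, not used in the original project): the target concurrency, and a debug flag that logs the throttling decision for every response:

AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
AUTOTHROTTLE_DEBUG = True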

  3. Disable robots.txt compliance:

ROBOTSTXT_OBEY = False

 

6. Running the Project

  Create a start.py in the project directory with the following code:

from scrapy import cmdline

cmdline.execute("scrapy crawl tqhb_spider".split())
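Equivalently, you can run scrapy crawl tqhb_spider directly from the project root. Another option, if you prefer to drive the crawl from Python, is Scrapy's documented CrawlerProcess API (a minimal sketch):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Load the project's settings.py and run the spider in-process
process = CrawlerProcess(get_project_settings())
process.crawl("tqhb_spider")
process.start()  # blocks until the crawl finishes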

 

                   Figure: the stored CSV results

 

All of the code above can be downloaded from my GitHub account: https://github.com/chyhoo/2016-2020Chinese-Weather-Analysis

 

 

 
