How to Crawl CSDN's Site-Wide Hot-List Title Keywords with Python and the Scrapy Framework
In this post I'll share how to crawl the hot words from CSDN's site-wide hot-list titles with Python and the Scrapy framework. I hope you get something out of it; let's dig in.
Environment setup
Installing Scrapy
pip install scrapy -i https://pypi.douban.com/simple
Installing Selenium
pip install selenium -i https://pypi.douban.com/simple
Installing jieba
pip install jieba -i https://pypi.douban.com/simple
IDE: PyCharm
ChromeDriver: download the version that matches your Chrome browser (from the ChromeDriver download page).
Check your browser version first, then download the corresponding driver.
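As a quick smoke test that the driver and browser line up, you can start a session and print the reported versions. This is a minimal sketch: the driver path mirrors the one used later in the spider, and the capability field names can vary slightly across Selenium releases.

from selenium import webdriver

# If the versions mismatch, Chrome refuses to start a session
# (SessionNotCreatedException) and names both versions in the error.
browser = webdriver.Chrome(executable_path="E:\\chromedriver_win32\\chromedriver.exe")
print(browser.capabilities['browserVersion'])                 # Chrome version
print(browser.capabilities['chrome']['chromedriverVersion'])  # driver version
browser.quit()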
Implementation
Let's get started.
Creating the project
Create the project with the scrapy command:
scrapy startproject csdn_hot_words
The generated project structure matches the official layout; a sketch of the full tree is shown below.
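For reference, here is roughly what the tree looks like once the tools package and main.py from the later steps are in place. This is a sketch based on this article's imports; the exact location of main.py is up to you, as long as you run it from the directory containing scrapy.cfg.

csdn_hot_words/
├── scrapy.cfg
├── main.py                      # launcher added at the end of this article
└── csdn_hot_words/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    ├── tools/
    │   ├── __init__.py
    │   └── analyse_sentence.py  # keyword extraction helper
    └── spiders/
        ├── __init__.py
        └── csdn.py              # the spider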
Defining the Item
As planned earlier, the item's main field is a dict mapping each title keyword to its occurrence count. The code:
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class CsdnHotWordsItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    words = scrapy.Field()
Keyword extraction tool
Keywords are extracted from each title with jieba:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time : 2021/11/5 23:47
# @Author : 至尊宝
# @Site :
# @File : analyse_sentence.py

import jieba.analyse


def get_key_word(sentence):
    # Extract the top-3 keywords from a title and count their occurrences.
    result_dic = {}
    words_lis = jieba.analyse.extract_tags(
        sentence, topK=3, withWeight=True, allowPOS=())
    for word, weight in words_lis:  # extract_tags yields (word, weight) pairs
        if word in result_dic:
            result_dic[word] += 1
        else:
            result_dic[word] = 1
    return result_dic
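Before wiring the helper into the spider, you can check its output standalone. The sample title below is illustrative; which keywords jieba actually picks depends on its built-in IDF dictionary.

from csdn_hot_words.tools.analyse_sentence import get_key_word

if __name__ == '__main__':
    # An illustrative title; jieba returns its top-3 weighted terms.
    title = 'Python如何通过Scrapy框架实现爬取CSDN全站热榜标题热词'
    print(get_key_word(title))  # a dict such as {'Scrapy': 1, '热榜': 1, 'CSDN': 1}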
Building the spider
The spider's constructor initializes a headless browser so the dynamically loaded page can be fully rendered before parsing.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time : 2021/11/5 23:47
# @Author : 至尊宝
# @Site :
# @File : csdn.py

import scrapy
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

from csdn_hot_words.items import CsdnHotWordsItem
from csdn_hot_words.tools.analyse_sentence import get_key_word


class CsdnSpider(scrapy.Spider):
    name = 'csdn'
    # allowed_domains = ['blog.csdn.net']
    start_urls = ['https://blog.csdn.net/rank/list']

    def __init__(self):
        chrome_options = Options()
        chrome_options.add_argument('--headless')  # headless mode: no visible browser window
        chrome_options.add_argument('--disable-gpu')
        chrome_options.add_argument('--no-sandbox')
        # Note: Selenium 4.10+ removed executable_path; pass a Service object there instead.
        self.browser = webdriver.Chrome(chrome_options=chrome_options,
                                        executable_path="E:\\chromedriver_win32\\chromedriver.exe")
        self.browser.set_page_load_timeout(30)

    def parse(self, response, **kwargs):
        titles = response.xpath("//div[@class='hosetitem-title']/a/text()")
        for x in titles:
            item = CsdnHotWordsItem()
            item['words'] = get_key_word(x.get())
            yield item
Code notes
1. Chrome runs in headless mode, so no browser window needs to open; everything runs in the background.
2. executable_path must point to your local chromedriver binary.
3. In parse, the XPath (covered in an earlier article of mine) extracts the titles, and each title is passed through the keyword extractor to build an item. A way to sanity-check the XPath offline is sketched below.
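Since the hot list is rendered by JavaScript, the XPath cannot be verified with a plain scrapy shell fetch. One offline workaround is to dump the rendered page_source from the headless browser and test the selector with parsel, which ships with Scrapy; the filename rank.html here is hypothetical.

from parsel import Selector

# rank.html: page_source saved from the headless browser after scrolling.
with open('rank.html', encoding='utf-8') as f:
    sel = Selector(text=f.read())

# Prints the first few hot-list titles if the class name still matches.
print(sel.xpath("//div[@class='hosetitem-title']/a/text()").getall()[:5])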
Building the middleware
The downloader middleware is where the JavaScript runs: it loads the page in the browser, executes a scrolling script so the lazily loaded list fully renders, and hands the rendered HTML back to Scrapy. The complete middleware code:
# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals
from scrapy.http import HtmlResponse
from selenium.common.exceptions import TimeoutException
import time

# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter


class CsdnHotWordsSpiderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.
        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.
        # Must return an iterable of Request, or item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.
        # Should return either None or an iterable of Request or item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.
        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class CsdnHotWordsDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Scroll down 500px every 0.5s (for up to 20s) so the lazily
        # loaded hot-list entries all get rendered.
        js = '''
        let height = 0
        let interval = setInterval(() => {
            window.scrollTo({
                top: height,
                behavior: "smooth"
            });
            height += 500
        }, 500);
        setTimeout(() => {
            clearInterval(interval)
        }, 20000);
        '''
        try:
            spider.browser.get(request.url)
            spider.browser.execute_script(js)
            time.sleep(20)
            return HtmlResponse(url=spider.browser.current_url,
                                body=spider.browser.page_source,
                                encoding="utf-8", request=request)
        except TimeoutException as e:
            print('Timeout exception: {}'.format(e))
            spider.browser.execute_script('window.stop()')
        finally:
            spider.browser.close()

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.
        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.
        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
Making a custom pipeline
The pipeline merges the per-title keyword counts into an overall word-frequency ranking and writes the result to a file:
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class CsdnHotWordsPipeline:
    def __init__(self):
        self.file = open('result.txt', 'w', encoding='utf-8')
        self.all_words = []

    def process_item(self, item, spider):
        self.all_words.append(item)
        return item

    def close_spider(self, spider):
        key_word_dic = {}
        for y in self.all_words:
            print(y)
            for k, v in y['words'].items():
                if k.lower() in key_word_dic:
                    key_word_dic[k.lower()] += v
                else:
                    key_word_dic[k.lower()] = v
        word_count_sort = sorted(key_word_dic.items(),
                                 key=lambda x: x[1], reverse=True)
        for word in word_count_sort:
            self.file.write('{},{}\n'.format(word[0], word[1]))
        self.file.close()
Configuring settings
A few adjustments are needed: register the middlewares and the pipeline, disable robots.txt compliance and cookies, and add a download delay.
# Scrapy settings for csdn_hot_words project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'csdn_hot_words'

SPIDER_MODULES = ['csdn_hot_words.spiders']
NEWSPIDER_MODULE = 'csdn_hot_words.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
# USER_AGENT = 'csdn_hot_words (+http://www.yourdomain.com)'
USER_AGENT = 'Mozilla/5.0'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
# CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 30
# The download delay setting will honor only one of:
# CONCURRENT_REQUESTS_PER_DOMAIN = 16
# CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
# TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.94 Safari/537.36'
}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
SPIDER_MIDDLEWARES = {
    'csdn_hot_words.middlewares.CsdnHotWordsSpiderMiddleware': 543,
}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'csdn_hot_words.middlewares.CsdnHotWordsDownloaderMiddleware': 543,
}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
# EXTENSIONS = {
#     'scrapy.extensions.telnet.TelnetConsole': None,
# }

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'csdn_hot_words.pipelines.CsdnHotWordsPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
# AUTOTHROTTLE_ENABLED = True
# The initial download delay
# AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
# AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
# AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
# AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
# HTTPCACHE_ENABLED = True
# HTTPCACHE_EXPIRATION_SECS = 0
# HTTPCACHE_DIR = 'httpcache'
# HTTPCACHE_IGNORE_HTTP_CODES = []
# HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
Running the main program
You could launch the crawl with the scrapy CLI directly, but a small main program makes the logs easier to follow:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time : 2021/11/5 22:41
# @Author : 至尊宝
# @Site :
# @File : main.py
from scrapy import cmdline

cmdline.execute('scrapy crawl csdn'.split())
Results
The run prints a partial execution log and produces the result.txt output; a quick way to inspect the top entries is shown below.
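A small helper of my own (assuming the word,count lines the pipeline writes) to peek at the ten hottest words:

# result.txt lines look like "<word>,<count>", sorted by count descending.
with open('result.txt', encoding='utf-8') as f:
    for line in list(f)[:10]:
        word, count = line.rstrip('\n').rsplit(',', 1)  # rsplit in case a word contains a comma
        print('{}: {}'.format(word, count))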
Having read this far, you should now have a good sense of how to crawl CSDN's site-wide hot-list title hot words with Python and Scrapy. Thanks for reading!