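Scraping Data with Python

Introduction

In today's internet era, data is produced and transmitted at an ever-increasing pace. From early, simple HTML pages to today's rich multimedia content, the demand for data grows year by year. Data scraping emerged to meet that demand and has become steadily more important as the technology has matured. Python, with its mature networking libraries and HTML parsers, is well suited to the task. This article walks through the main steps of scraping data with Python: sending requests, parsing HTML, processing the data, and storing it.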

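Environment Setup

Before scraping data with Python, set up a Python environment; installers are available from the official site: https://www.python.org/downloads/. Then install the third-party packages this article relies on, for example with pip install requests beautifulsoup4 lxml pandas, and import them as follows.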

import requests
from bs4 import BeautifulSoup
import pandas as pd
from requests.exceptions import RequestException

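These imports cover three libraries: requests for network requests, BeautifulSoup (from the bs4 package) for HTML parsing, and pandas for data processing. RequestException is not a separate library; it is the base exception class of requests and is used to catch request errors.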
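Before running the examples, it can help to confirm that the packages are importable. The following is a minimal sketch using only the standard library; the names REQUIRED_MODULES and missing_modules are illustrative and not part of any library, and lxml is listed on the assumption that the 'lxml' parser is used for the parsing step later in the article.

```python
from importlib.util import find_spec

# Import names (not pip package names) assumed by the examples in this article.
REQUIRED_MODULES = ["requests", "bs4", "lxml", "pandas"]


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are available.")
```

Note that the import name can differ from the pip package name: bs4 is installed as beautifulsoup4.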

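Network Requests

Sending a request to the target site and retrieving its HTML source is the first step of data scraping. The requests library makes this simple and quick. The following example sends a request with requests.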

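def get_html(url, headers=None):
    """Return the page's HTML text, or None if the request fails."""
    try:
        # A timeout keeps the call from hanging on an unresponsive server
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code == 200:
            return response.text
        return None  # non-200 responses are treated as failures
    except RequestException:
        return None  # connection errors, timeouts, invalid URLs, etc.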

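Here url is the address of the target page, and headers is a dictionary of HTTP request headers, commonly used to make the request look like it comes from a browser (for example by setting User-Agent).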
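As an illustration of the headers argument, the sketch below builds a headers dictionary that could be passed to get_html. The helper build_headers, the DEFAULT_HEADERS values, and the User-Agent string are all hypothetical placeholders chosen for this example, not values required by requests or by any particular site.

```python
from typing import Dict, Optional

# Placeholder defaults for illustration; replace the User-Agent with a real
# browser string or one that identifies your scraper.
DEFAULT_HEADERS: Dict[str, str] = {
    "User-Agent": "Mozilla/5.0 (compatible; example-scraper/0.1)",
    "Accept-Language": "en-US,en;q=0.9",
}


def build_headers(extra: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the default headers, updated with any extra entries."""
    headers = dict(DEFAULT_HEADERS)  # copy so the defaults are never mutated
    if extra:
        headers.update(extra)  # caller-supplied values override the defaults
    return headers
```

A call such as get_html(url, headers=build_headers({"Referer": "https://example.com/"})) would then send the defaults plus the extra header. The merge is a plain dictionary update, so keys are matched case-sensitively even though HTTP header names are not case-sensitive.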

HTML Parsing

BeautifulSoup is a Python library for parsing HTML and XML; it gives access to the content of each node in a parsed document. The following example parses HTML with BeautifulSoup.

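html = get_html(url)  # url: address of the target page; get_html returns None on failure
soup = BeautifulSoup(html or '', 'lxml')  # parse an empty document if the request failed
title = soup.title.string if soup.title else None  # None when the page has no <title> tag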

First, get_html fetches the HTML source of the target page, which is then passed to the BeautifulSoup constructor to produce the soup object. soup.title refers to the page's <title> tag, and its string attribute holds the text inside that tag, that is, the page title.

Data Processing

Once the HTML source has been retrieved, the data still needs to be extracted and put into a usable shape. pandas is a popular tool for this. The following example uses pandas to process the data.

from io import StringIO

table = soup.find('table', {'class': 'table'})  # first <table class="table">, or None if absent
tables = pd.read_html(StringIO(str(table)))     # read_html returns a list of DataFrames
df = tables[0]                                  # the single table that was passed in
print(df)

The find method first locates the <table> tag whose class attribute is "table". The tag is then converted to a string and passed to read_html, which returns a list of DataFrames, one per table it finds. After taking the first element, the data can be processed with the usual pandas operations.

Data Storage

After processing, the data should be saved to the local file system for later use. The following example saves it to a CSV file with pandas.

df.to_csv('data.csv', encoding='utf_8_sig')

The first argument is the output file name and the second is the output encoding. Common choices include 'utf_8' and 'gbk'; the 'utf_8_sig' used here is UTF-8 with a byte order mark, which helps spreadsheet programs such as Excel detect the encoding.

Things to Keep in Mind When Scraping

Comply with Laws and Regulations

Scraping must comply with the laws and regulations of the relevant jurisdictions. This article only discusses the underlying techniques; do not use them for malicious scraping.

Responding to Anti-Scraping Measures

Many sites deploy anti-scraping measures. Common techniques for dealing with them include keeping a session alive with requests.Session, adjusting the request headers, and routing requests through proxy IPs.

Ethical Considerations

Scraping also raises ethical questions. Try to avoid harming others through it, for example by overloading a site with requests.

Summary

Python is a powerful language with excellent libraries for scraping, processing, and storing data. As more and more information moves online, scraping techniques will face tougher challenges; only continued study and practice make it possible to work fluently in this field and uncover more of the value hidden in data.

Original article by BNYI. If you repost it, please credit the source: https://www.506064.com/n/131124.html