1. Web crawlers
    1. Definition: also called web spiders or web robots -- programs that fetch data from the web.
    2. In short: use a Python program to imitate a person visiting a website; the more faithful the imitation, the better.
    3. Purpose: analyse large amounts of valid data to understand market trends and support company decisions.
2. How companies obtain data
    1. The company's own data.
    2. Bought from third-party data platforms, e.g. Datatang or the Guiyang Big Data Exchange.
    3. Crawled by a crawler program: when the data is not on the market, or is too expensive, crawl it yourself.
3. Why Python is a good fit for crawlers
    1. Python: rich, mature request and parsing modules.
    2. PHP: weak support for multithreading and asynchronous I/O.
    3. Java: heavyweight, verbose code.
    4. C/C++: efficient at run time, but slow to develop.
4. Types of crawlers
    1. General-purpose crawlers (used by search engines; they must obey the robots protocol -- see the robots.txt sketch below).
        How does a search engine learn the URL of a new website?
        1. The site owner submits it (e.g. via the Baidu Webmaster Platform).
        2. Through DNS providers (e.g. net.cn), new sites get indexed quickly.
    2. Focused crawlers (crawl only what you need).
        The crawlers we write ourselves: topic-oriented, requirement-oriented crawlers.
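A minimal sketch (not from the original notes) of checking the robots protocol before crawling, using the standard-library urllib.robotparser; the Baidu URL is only an illustration:

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url('http://www.baidu.com/robots.txt')
rp.read()   # download and parse robots.txt
# may this user-agent fetch this URL according to robots.txt?
print(rp.can_fetch('*', 'http://www.baidu.com/s?wd=python'))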
5. Steps for crawling data
    1. Decide on the URL to crawl.
    2. Fetch the HTML response over HTTP/HTTPS.
    3. Extract the useful data from the HTML page:
        1. save the data you need;
        2. for the other URLs found on the page, repeat from step 2 (a sketch of this loop follows).
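The three steps above form a loop: fetch a page, keep what you need, and queue any new URLs found on it. A rough sketch of that loop with urllib and re (the seed URL, the link pattern and the early-stop limit are illustrative assumptions, not part of the original notes):

import re
import urllib.request

crawl_queue = ['http://www.baidu.com/']   # step 1: the starting URL
crawled = set()

while crawl_queue:
    url = crawl_queue.pop(0)
    if url in crawled:
        continue
    # step 2: fetch the HTML response over HTTP/HTTPS
    html = urllib.request.urlopen(url).read().decode('utf-8', 'ignore')
    crawled.add(url)
    # step 3.1: process / save the data you need (here we only print the size)
    print(url, len(html))
    # step 3.2: collect other URLs on the page and repeat from step 2
    for link in re.findall(r'href="(http[^"]+)"', html):
        if link not in crawled:
            crawl_queue.append(link)
    if len(crawled) >= 5:   # stop early -- this is only a demonstration
        break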
6. Chrome browser plugins
    1. Installation steps
        1. Top-right menu -> More tools -> Extensions.
        2. Turn on Developer mode.
        3. Drag the plugin file onto the browser window.
    2. Useful plugins
        1. Proxy SwitchyOmega: proxy switching.
        2. XPath Helper: parse data out of web pages.
        3. JSON View: pretty-print JSON data.
7. Fiddler packet-capture tool
    1. Capture setup
        1. Configure Fiddler itself.
        2. Configure the browser proxy:
            Proxy SwitchyOmega -> Options -> New profile -> HTTP 127.0.0.1 8888
    2. Common Fiddler menus
        1. Inspector: view the details of a captured packet.
        2. Commonly used tabs
            1. Headers: the headers the client sent to the server, including web-client info, cookies and transfer state.
            2. WebForms: shows the POSTed form data.
            3. Raw: shows the whole request as plain text.
1. urllib.request
    1. urllib.request.urlopen(url): send the request and return the response object.

import urllib.request

url = 'http://www.baidu.com/'
# send the request and get the response object
response = urllib.request.urlopen(url)
# read() returns the response body as bytes; decode() turns bytes into str
html = response.read().decode('utf-8')
print(html)
    2. urllib.request.Request(url, headers={})
        1. Rebuilding the User-Agent is the first step in the fight between crawlers and anti-crawling measures.
        2. Usage
            1. Build the request object: request = Request(url, headers=headers)
            2. Get the response object: response = urlopen(request)
            3. Read the content: response.read().decode('utf-8')
# -*- coding: utf-8 -*-
import urllib.request

url = 'http://www.baidu.com/'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50'}
# 1. build the request object
request = urllib.request.Request(url, headers=headers)
# 2. get the response object
response = urllib.request.urlopen(request)
# 3. read the response content
html = response.read().decode('utf-8')
print(html)
    3. Methods of the request object
        1. add_header(): add or replace a header (e.g. the User-Agent).
        2. get_header('User-agent'): return the value of an existing HTTP header; note that only the U is upper-case.
import urllib.request

url = 'http://www.baidu.com/'
ua = 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50'
request = urllib.request.Request(url)
# add_header() adds the User-Agent to the request
request.add_header('User-Agent', ua)
# get the response object
response = urllib.request.urlopen(request)
# get_header() reads the User-Agent back; note the spelling: only the U is upper-case
print(request.get_header('User-agent'))
# response status code
print(response.getcode())
# response headers, returned as a dict-like object
print(response.info())
html = response.read().decode('utf-8')
print(html)
    4. Methods of the response object
        1. read(): read the body returned by the server.
        2. getcode(): return the HTTP status code.
            200: success
            4XX: client-side error -- the request reached the server but was rejected (bad request, forbidden, not found, ...)
            5XX: server-side error -- the server failed while handling the request
            (a short sketch of catching these with urllib.error follows)
        3. info(): return the server's response headers.
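The notes list the status-code ranges but do not show how to catch them; a minimal sketch using the standard-library urllib.error (the missing-page URL is only an illustration):

import urllib.request
import urllib.error

try:
    res = urllib.request.urlopen('http://www.baidu.com/some-missing-page')
    print(res.getcode())                 # 200 on success
except urllib.error.HTTPError as e:      # the server answered with 4XX / 5XX
    print('HTTP error:', e.code)
except urllib.error.URLError as e:       # the server could not be reached at all
    print('URL error:', e.reason)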
2. urllib.parse
    1. quote('中文字符串'): percent-encode a string.
    2. urlencode(dict): encode a dict of query parameters into a query string.
    3. unquote('encoded string'): decode a percent-encoded string (see the round-trip sketch below).
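quote() and urlencode() are demonstrated in the two Baidu-search examples that follow; unquote() is not, so here is a small round-trip sketch:

import urllib.parse

s = urllib.parse.quote('美女')
print(s)                                       # %E7%BE%8E%E5%A5%B3
print(urllib.parse.unquote(s))                 # 美女
print(urllib.parse.urlencode({'wd': '美女'}))  # wd=%E7%BE%8E%E5%A5%B3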
import urllib.request
import urllib.parse

url = 'http://www.baidu.com/s?wd='
key = input('请输入要搜索的内容')
# percent-encode the keyword and build the full URL
key = urllib.parse.quote(key)
fullurl = url + key
print(fullurl)   # e.g. http://www.baidu.com/s?wd=%E7%BE%8E%E5%A5%B3

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50'}
request = urllib.request.Request(fullurl, headers=headers)
resp = urllib.request.urlopen(request)
html = resp.read().decode('utf-8')
print(html)
import urllib.request
import urllib.parse

baseurl = 'http://www.baidu.com/s?'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50'}
key = input('请输入要搜索的内容')
# urlencode() takes a dict of query parameters
d = urllib.parse.urlencode({'wd': key})
url = baseurl + d
req = urllib.request.Request(url, headers=headers)
resp = urllib.request.urlopen(req)
html = resp.read().decode('utf-8')
print(html)
Exercise: crawl Baidu Tieba
1. Simple version
# -*- coding: utf-8 -*-
"""
Baidu Tieba scraper.
Requirements:
    1. ask for the name of the tieba (forum);
    2. ask for the start page and the end page;
    3. save every page locally as 第1页.html, 第2页.html, ...
Example URL:
    http://tieba.baidu.com/f?kw=%E6%B2%B3%E5%8D%97%E5%A4%A7%E5%AD%A6&ie=utf-8&pn=0
"""
import urllib.request
import urllib.parse

baseurl = 'http://tieba.baidu.com/f?'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50'}
title = input('请输入要查找的贴吧')
begin_page = int(input('请输入起始页'))
end_page = int(input('请输入终止页'))
# URL-encode the kw parameter
kw = urllib.parse.urlencode({'kw': title})

# loop over the pages: build the URL, send the request, save the page locally
for page in range(begin_page, end_page + 1):
    pn = (page - 1) * 50          # 50 posts per page
    # build the URL
    url = baseurl + kw + '&pn=' + str(pn)
    # send the request and read the response
    req = urllib.request.Request(url, headers=headers)
    res = urllib.request.urlopen(req)
    html = res.read().decode('utf-8')
    # write the page to a local file
    filename = '第' + str(page) + '页.html'
    with open(filename, 'w', encoding='utf-8') as f:
        print('正在下载第%d页' % page)
        f.write(html)
        print('第%d页下载成功' % page)
2. Function version
import urllib.request
import urllib.parse

def getPage(url):
    '''send the request, get the response, return the html'''
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50'}
    req = urllib.request.Request(url, headers=headers)
    res = urllib.request.urlopen(req)
    html = res.read().decode('utf-8')
    return html

def writePage(filename, html):
    '''save the html to a local file'''
    with open(filename, 'w', encoding='utf-8') as f:
        f.write(html)

def workOn():
    '''main routine'''
    name = input('请输入贴吧名')
    begin = int(input('请输入起始页'))
    end = int(input('请输入终止页'))
    baseurl = 'http://tieba.baidu.com/f?'
    kw = urllib.parse.urlencode({'kw': name})
    for page in range(begin, end + 1):
        pn = (page - 1) * 50
        url = baseurl + kw + '&pn=' + str(pn)
        html = getPage(url)
        filename = '第' + str(page) + '页.html'
        writePage(filename, html)

if __name__ == '__main__':
    workOn()
3. Class version
import urllib.request
import urllib.parse

class BaiduSpider:
    def __init__(self):
        self.baseurl = 'http://tieba.baidu.com/f?'
        self.headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50'}

    def getPage(self, url):
        '''send the request, get the response, return the html'''
        req = urllib.request.Request(url, headers=self.headers)
        res = urllib.request.urlopen(req)
        html = res.read().decode('utf-8')
        return html

    def writePage(self, filename, html):
        '''save the html to a local file'''
        with open(filename, 'w', encoding='utf-8') as f:
            f.write(html)

    def workOn(self):
        '''main routine'''
        name = input('请输入贴吧名')
        begin = int(input('请输入起始页'))
        end = int(input('请输入终止页'))
        kw = urllib.parse.urlencode({'kw': name})
        for page in range(begin, end + 1):
            pn = (page - 1) * 50
            url = self.baseurl + kw + '&pn=' + str(pn)
            html = self.getPage(url)
            filename = '第' + str(page) + '页.html'
            self.writePage(filename, html)

if __name__ == '__main__':
    # create the spider object and run it
    baiduSpider = BaiduSpider()
    baiduSpider.workOn()
1. Parsing
    1. Types of data
        1. Structured data
            characteristics: fixed format, e.g. HTML, XML, JSON.
        2. Unstructured data
            examples: images, audio, video -- usually stored as binary.
    2. Regular expressions (the re module)
        1. Workflow
            1. Compile a pattern object: p = re.compile(r"\d")
            2. Match it against a string: result = p.match('123ABC')
            3. Read the result: print(result.group())
        2. Common methods (a short demo of match()/search()/group() follows the findall() grouping examples below)
            1. match(s): match only at the start of the string; returns a match object.
            2. search(s): scan forward and return the first match; returns a match object.
            3. group(): read the matched text out of the object returned by match() or search().
            4. findall(s): return every match as a list.
        3. Pattern syntax
            .      any character except \n
            [...]  any one character from the set
            \d     digit
            \w     letter, digit or underscore
            \s     whitespace character
            \S     non-whitespace character
            *      previous item 0 or more times
            ?      previous item 0 or 1 time
            +      previous item 1 or more times
            {m}    previous item exactly m times
            Greedy matching: within an overall successful match, consume as many characters as possible.
            Non-greedy matching: within an overall successful match, consume as few characters as possible.
        4. Example:
import re

# NOTE: the HTML tags in the original example were lost during extraction;
# <div><p>...</p></div> is used here as a representative stand-in.
s = """<div><p>仰天大笑出门去,我辈岂是蓬蒿人</p></div>
<div><p>天生我材必有用,千金散尽还复来</p></div>"""

# greedy match: .* consumes as much as possible, so both blocks come back
# as one single list element
p = re.compile('<div><p>.*</p></div>', re.S)
print(p.findall(s))

# non-greedy match: .*? stops at the first closing tag, so each block is
# returned as its own list element
p1 = re.compile('<div><p>.*?</p></div>', re.S)
print(p1.findall(s))
        5. Grouping with findall()
            Explanation: the pattern is first matched as a whole, then only the content of the () groups is returned; with two or more groups each match comes back as a tuple.

import re

s = 'A B C D'

p1 = re.compile(r'\w+\s+\w+')
print(p1.findall(s))   # ['A B', 'C D']

# 1. the whole pattern matches ['A B', 'C D']
# 2. only the group content is returned: ['A', 'C']
p2 = re.compile(r'(\w+)\s+\w+')
print(p2.findall(s))   # ['A', 'C']

# 1. the whole pattern matches ['A B', 'C D']
# 2. with two or more groups each match is returned as a tuple
p3 = re.compile(r'(\w+)\s+(\w+)')
print(p3.findall(s))   # [('A', 'B'), ('C', 'D')]
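The grouping examples above only use findall(); for comparison, a quick sketch of match(), search() and group() as listed in the method summary:

import re

p = re.compile(r'\d+')
print(p.match('123ABC456').group())      # '123'  -- match() only looks at the start
print(p.match('ABC123'))                 # None   -- the string does not start with digits
print(p.search('ABC123DEF456').group())  # '123'  -- search() returns the first match
print(p.findall('ABC123DEF456'))         # ['123', '456']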
        6. Exercise: Maoyan top-100 movie chart

# -*- coding: utf-8 -*-
"""
Crawl the Maoyan top-100 movie chart.
    1. On start, crawl the first page right away.
    2. Ask whether to continue (y/n); y: crawl the next page,
       n: stop.
    3. Save every page locally.
       page 1: http://maoyan.com/board/4?offset=0
       page 2: http://maoyan.com/board/4?offset=10
    4. Parse out the movie title, the starring actors and the release date.
"""
import urllib.request
import re

class MaoyanSpider:
    '''crawl the Maoyan top-100 movie chart'''
    def __init__(self):
        self.baseurl = 'http://maoyan.com/board/4?offset='
        self.headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50'}

    def getPage(self, url):
        '''fetch the html page'''
        # build the request object
        req = urllib.request.Request(url, headers=self.headers)
        # send the request
        res = urllib.request.urlopen(req)
        # read the response
        html = res.read().decode('utf-8')
        return html

    def writePage(self, filename, html):
        '''parse the page and append the results to a local file'''
        content_list = self.match_contents(html)
        for content_tuple in content_list:
            movie_title = content_tuple[0].strip()
            movie_actors = content_tuple[1].strip()[3:]    # drop the leading "主演:"
            releasetime = content_tuple[2].strip()[5:15]   # drop "上映时间:" and keep the date
            with open(filename, 'a', encoding='utf-8') as f:
                f.write(movie_title + '|' + movie_actors + '|' + releasetime + '\n')

    def match_contents(self, html):
        '''match the movie title, actors and release date'''
        # NOTE: the original regular expression was lost during extraction; the
        # pattern below, written against the board page's movie-item-info blocks,
        # is a reconstruction and may need adjusting to the current page markup.
        regex = r'<div class="movie-item-info">.*?title="(.*?)".*?<p class="star">(.*?)</p>.*?<p class="releasetime">(.*?)</p>'
        p = re.compile(regex, re.S)
        return p.findall(html)

    def workOn(self):
        '''main routine: crawl page by page, asking whether to continue'''
        for page in range(0, 10):
            # build the URL of this page (10 movies per page)
            url = self.baseurl + str(page * 10)
            filename = '猫眼/第' + str(page + 1) + '页.txt'
            print('正在爬取%s页' % (page + 1))
            html = self.getPage(url)
            self.writePage(filename, html)
            # ask whether to keep going
            flag = False
            while True:
                msg = input('是否继续爬取(y/n)')
                if msg == 'y':
                    flag = True
                elif msg == 'n':
                    print('爬取结束,谢谢使用')
                    flag = False
                else:
                    print('您输入的命令无效')
                    continue
                if flag:
                    break
                else:
                    return None
        print('所有内容爬取完成')

if __name__ == '__main__':
    spider = MaoyanSpider()
    spider.workOn()

Maoyan top-100 crawl
    3. XPath (covered in detail later)
    4. BeautifulSoup (covered in detail later)

2. Request methods
    1. GET: the query parameters are visible in the URL.
    2. POST
        1. Characteristics: the query parameters are carried in the form body.
        2. Usage:
            urllib.request.urlopen(url, data=data, headers=headers)
            data: the form data; it must be submitted as bytes, not as a dict.
    3. Case study: Youdao translation
        1. Use Fiddler to capture the form data under WebForms.
        2. Encode the POST data and convert it to bytes.
        3. Send the request and read the response.
from urllib import request, parse
import json

# 1. build the form data: put the captured form fields into a dict,
#    then encode them
word = input('请输入要翻译的内容:')
data = {
    'i': word,
    'from': 'AUTO',
    'to': 'AUTO',
    'smartresult': 'dict',
    'client': 'fanyideskweb',
    'salt': '1536648367302',
    'sign': 'f7f6b53876957660bf69994389fd0014',
    'doctype': 'json',
    'version': '2.1',
    'keyfrom': 'fanyi.web',
    'action': 'FY_BY_REALTIME',
    'typoResult': 'false',
}
# 2. urlencode the dict and convert it to bytes
data = parse.urlencode(data).encode('utf-8')

# 3. send the request and get the response
# the URL below is the POST URL captured with the packet-capture tool
url = 'http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50'}
req = request.Request(url, data=data, headers=headers)
res = request.urlopen(req)
result = res.read().decode('utf-8')print(type(result))# print(result)#result为json格式的字符串'''{\"type\":\"ZH_CN2EN\ \"errorCode\":0, \"elapsedTime\":1, \"translateResult\":[ [{\"src\":\"你好\ \"tgt\":\"hello\" }] ]}''' #把json格式的字符串转换为Python字典# dic = json.loads(result) print(dic[\"translateResult\"][0][0][\"tgt\"]) 4、json模块 json.loads('json格式的字符串') 作⽤:把json格式的字符串转换为Python字典 3、Cookie模拟登陆 1、Cookie 和 Session cookie:通过在客户端记录的信息确定⽤户⾝份 session:通过在服务器端记录的信息确定⽤户⾝份 2、案例:使⽤cookie模拟登陆⼈⼈⽹ 1、获取到登录信息的cookie(登录⼀次抓包) 2、发送请求得到响应 from urllib import request url = \"http://www.renren.com/967982493/profile\"headers = { 'Host': 'www.renren.com', 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:62.0) Gecko/20100101 Firefox/62.0', 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2', #Accept-Encoding: gzip, deflate 'Referer': 'http://www.renren.com/SysHome.do', 'Cookie': 'anonymid=jlxfkyrx-jh2vcz; depovince=SC; _r01_=1; jebe_key=6aac48eb-05fb-4569-8b0d-5d71a4a7a3e4%7C911ac4448a97a17c4d3447cbdae800e4%7C1536714317279%7C1%7C1536714319337; jebecookies=a70e405c-c17a-4877 'Connection': 'keep-alive', 'Upgrade-Insecure-Requests': '1',} req = request.Request(url,headers = headers)res = request.urlopen(req) html = res.read().decode('utf-8')print(html) 3、requests模块 1、安装(Conda prompt终端) 1、(base) ->conda install requests 2、常⽤⽅法 1、get():向⽹站发送请求,并获取响应对象 1、⽤法:resopnse = requests.get(url,headers = headers) 2、response的属性 1、response.text:获取响应内容(字符串) 说明:⼀般返回字符编码为ISO-8859-1,可以通过⼿动指定:response.encoding='utf-8' 2、response.content:获取响应内容(bytes) 1、应⽤场景:爬取图⽚,⾳频等⾮结构化数据 2、⽰例:爬取图⽚ 3、response.status_code:返回服务器的响应码 import requests url = \"http://www.baidu.com/\" headers = {\"User-Agent\":\"Mozilla5.0/\"}#发送请求获取响应对象 response = requests.get(url,headers)#改变编码⽅式 response.encoding = 'utf-8'#获取响应内容,text返回字符串print(response.text)#content返回bytesprint(response.content) print(response.status_code)#200 3、get():查询参数 params(字典格式) 1、没有查询参数 res = requests.get(url,headers=headers) 2、有查询参数 params= {\"wd\":\"python\ res = requuests.get(url,params=params,headers=headers) 2、post():参数名data 1、data={} #data参数为字典,不⽤转为bytes数据类型 2、⽰例: import requestsimport json #1、处理表单数据 word = input('请输⼊要翻译的内容:')data = {\"i\":word, \"from\":\"AUTO\", \"to\":\"AUTO\", \"smartresult\":\"dict\", \"client\":\"fanyideskweb\", \"salt\":\"1536648367302\", \"sign\":\"f7f6b53876957660bf69994389fd0014\", \"doctype\":\"json\", \"version\":\"2.1\", \"keyfrom\":\"fanyi.web\", \"action\":\"FY_BY_REALTIME\", \"typoResult\":\"false\"} #此处德 URL为抓包⼯具抓到的POST的URL url = \"http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule\" headers = {'User-Agent':'User-Agent:Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50'} response = requests.post(url,data=data,headers=headers)response.encoding = 'utf-8'result = response.text print(type(result))# print(result)#result为json格式的字符串'''{\"type\":\"ZH_CN2EN\ \"errorCode\":0, \"elapsedTime\":1, \"translateResult\":[ [{\"src\":\"你好\ \"tgt\":\"hello\" }] ]}''' #把json格式的字符串转换为Python字典dic = json.loads(result) print(dic[\"translateResult\"][0][0][\"tgt\"]) 3、代理:proxies 1、爬⾍和反爬⾍⽃争的第⼆步 获取代理IP的⽹站 1、西刺代理 2、快代理 3、全国代理 2、普通代理:proxies={\"协议\":\"IP地址:端⼝号\ proxies = {'HTTP':\"123.161.237.114:45327\ import requests url = \"http://www.taobao.com\" proxies = {\"HTTP\":\"123.161.237.114:45327\"}headers = {\"User-Agent\":\"Mozilla5.0/\"} response = 
requests.get(url,proxies=proxies,headers=headers)response.encoding = 'utf-8'print(response.text) 3、私密代理:proxies={\"协议\":\"http://⽤户名:密码@IP地址:端⼝号\ proxies={'HTTP':'http://309435365:szayclhp@114.67.228.126:16819'} import requests url = \"http://www.taobao.com/\" proxies = {'HTTP':'http://309435365:szayclhp@114.67.228.126:16819'}headers = {\"User-Agent\":\"Mozilla5.0/\"} response = requests.get(url,proxies=proxies,headers=headers)response.encoding = 'utf-8'print(response.text) 4、案例:爬取链家地产⼆⼿房信息 1、存⼊mysql数据库 import pymysql db = pymysql.connect(\"localhost\",\"root\",\"123456\",charset='utf8')cusor = db.cursor() cursor.execute(\"create database if not exists testspider;\")cursor.execute(\"use testspider;\") cursor.execute(\"create table if not exists t1(id int);\")cursor.execute(\"insert into t1 values(100);\")db.commit()cursor.close()db.close() 2、存⼊MongoDB数据库 import pymongo #链接mongoDB数据库 conn = pymongo.MongoClient('localhost',27017)#创建数据库并得到数据库对象db = conn.testpymongo#创建集合并得到集合对象myset = db.t1 #向集合中插⼊⼀个数据 myset.insert({\"name\":\"Tom\"}) \"\"\" 爬取链家地产⼆⼿房信息(⽤私密代理实现)⽬标:爬取⼩区名称,总价步骤: 1、获取url https://cd.lianjia.com/ershoufang/pg1/ https://cd.lianjia.com/ershoufang/pg2/ 2、正则匹配 3、写⼊到本地⽂件\"\"\" import requestsimport re import multiprocessing as mp BASE_URL = \"https://cd.lianjia.com/ershoufang/pg\" proxies = {'HTTP':'http://309435365:szayclhp@114.67.228.126:16819'}headers = {\"User-Agent\":\"Mozilla5.0/\"} regex = ' res = requests.get(url,proxies=proxies,headers=headers) res.encoding = 'utf-8' html = res.text return html def saveFile(page,regex=regex): html = getText(BASE_URL,proxies,headers,page) p = re.compile(regex,re.S) content_list = p.findall(html) for content_tuple in content_list: cell = content_tuple[0].strip() price = content_tuple[1].strip() with open('链家.txt','a') as f: f.write(cell+\" \"+price+\"\\n\") if __name__ == \"__main__\": pool = mp.Pool(processes = 10) pool.map(saveFile,[page for page in range(1,101)]) 链家⼆⼿房产 import requestsimport re import multiprocessing as mp import pymysqlimport warnings BASE_URL = \"https://cd.lianjia.com/ershoufang/pg\" proxies = {'HTTP':'http://309435365:szayclhp@114.67.228.126:16819'}headers = {\"User-Agent\":\"Mozilla5.0/\"} regex = ' c_tab = \"create table if not exists lianjia(id int primary key auto_increment,\\ name varchar(30),\\ price decimal(20,2))charset=utf8;\" db = pymysql.connect(\"localhost\",\"root\",'123456',charset=\"utf8\")cursor = db.cursor() warnings.filterwarnings(\"error\")try: cursor.execute(c_db)except Warning: pass cursor.execute(u_db)try: cursor.execute(c_tab)except Warning: pass def getText(BASE_URL,proxies,headers,page): url = BASE_URL+str(page) res = requests.get(url,proxies=proxies,headers=headers) res.encoding = 'utf-8' html = res.text return html def writeToMySQL(page,regex=regex): html = getText(BASE_URL,proxies,headers,page) p = re.compile(regex,re.S) content_list = p.findall(html) for content_tuple in content_list: cell = content_tuple[0].strip() price = float(content_tuple[1].strip())*10000 s_insert = \"insert into lianjia(name,price) values('%s','%s');\"%(cell,price) cursor.execute(s_insert) db.commit() if __name__ == \"__main__\": pool = mp.Pool(processes = 20) pool.map(writeToMySQL,[page for page in range(1,101)]) 存⼊mysql数据库 import requestsimport re import multiprocessing as mpimport pymongo BASE_URL = \"https://cd.lianjia.com/ershoufang/pg\" proxies = {'HTTP':'http://309435365:szayclhp@114.67.228.126:16819'}headers = {\"User-Agent\":\"Mozilla5.0/\"} regex = ' conn = pymongo.MongoClient('localhost',27017)#创建数据库并得到数据库对象db 
= conn.spider; #创建集合并得到集合对象myset = db.lianjia def getText(BASE_URL,proxies,headers,page): url = BASE_URL+str(page) res = requests.get(url,proxies=proxies,headers=headers) res.encoding = 'utf-8' html = res.text return html def writeToMongoDB(page,regex=regex): html = getText(BASE_URL,proxies,headers,page) p = re.compile(regex,re.S) content_list = p.findall(html) for content_tuple in content_list: cell = content_tuple[0].strip() price = float(content_tuple[1].strip())*10000 d = {\"houseName\":cell,\"housePrice\":price} #向集合中插⼊⼀个数据 myset.insert(d) if __name__ == \"__main__\": pool = mp.Pool(processes = 20) pool.map(writeToMongoDB,[page for page in range(1,101)]) 存⼊MongoDB 4、WEB客户端验证(有些⽹站需要先登录才可以访问):auth 1、auth = (\"⽤户名\密码\"),是⼀个元组 import requestsimport re regex = r' self.headers = {\"User-Agent\":\"Mozilla5.0/\"} #auth参数为元组 self.auth = (\"tarenacode\",\"code_2013\") self.url = \"http://code.tarena.com.cn/\" def getParsePage(self): res = requests.get(self.url,auth=self.auth, headers=self.headers) res.encoding = \"utf-8\" html = res.text p = re.compile(regex,re.S) r_list = p.findall(html) #调⽤writePage()⽅法 self.writePage(r_list) def writePage(self,r_list): print(\"开始写⼊\") for r_str in r_list: with open('笔记.txt','a') as f: f.write(r_str + \"\\n\") print(\"写⼊完成\") if __name__==\"__main__\": obj = NoteSpider() obj.getParsePage() 5、SSL证书认证:verify 1、verify=True:默认,做SSL证书认证 2、verify=False: 忽略证书认证 import requests url = \"http://www.12306.cn/mormhweb/\"headers = {\"User-Agent\":\"Mozilla5.0/\"} res = requests.get(url,verify=False,headers=headers) res.encoding = \"utf-8\"print(res.text) 4、Handler处理器(urllib.request,了解) 1、定义 ⾃定义的urlopen()⽅法,urlopen⽅法是⼀个特殊的opener 2、常⽤⽅法 1、build_opener(Handler处理器对象) 2、opener.open(url),相当于执⾏了urlopen 3、使⽤流程 1、创建相关Handler处理器对象 http_handler = urllib.request.HTTPHandler() 2、创建⾃定义opener对象 opener = urllib.request.build_opener(http_handler) 3、利⽤opener对象的open⽅法发送请求 4、Handler处理器分类 1、HTTPHandler() import urllib.request url = \"http://www.baidu.com/\" #1、创建HTTPHandler处理器对象 http_handler = urllib.request.HTTPHandler()#2、创建⾃定义的opener对象 opener = urllib.request.build_opener(http_handler)#3、利⽤opener对象的open⽅法发送请求req = urllib.request.Request(url)res = opener.open(req) print(res.read().decode(\"utf-8\")) 2、ProxyHandler(代理IP):普通代理 import urllib.requesturl = \"http://www.baidu.com\" #1、创建handler proxy_handler = urllib.request.ProxyHandler({\"HTTP\":\"123.161.237.114:45327\"})#2、创建⾃定义opener opener = urllib.request.build_opener(proxy_handler)#3、利⽤opener的open⽅法发送请求req = urllib.request.Request(url)res = opener.open(req) print(res.read().decode(\"utf-8\")) 3、ProxyBasicAuthHandler(密码管理器对象):私密代理 1、密码管理器使⽤流程 1、创建密码管理器对象 pwd = urllib.request.HTTPPasswordMgrWithDefaultRealm() 2、添加私密代理⽤户名,密码,IP地址,端⼝号 pwd.add_password(None,\"IP:端⼝\",\"⽤户名\密码\") 2、urllib.request.ProxyBasicAuthHandler(密码管理器对象)1、CSV模块使⽤流程 1、Python语句打开CSV⽂件: with open('test.csv','a',newline='',encoding='utf-8') as f: pass 2、初始化写⼊对象使⽤writer(⽅法: writer = csv.writer(f) 3、写⼊数据使⽤writerow()⽅法 writer.writerow([\"霸王别姬\ 4、⽰例: import csv #打开csv⽂件,如果不写newline=‘’,则每⼀条数据中间会出现⼀条空⾏with open(\"test.csv\",'a',newline='') as f: #初始化写⼊对象 writer = csv.writer(f) #写⼊数据 writer.writerow(['id','name','age']) writer.writerow([1,'Lucy',20]) writer.writerow([2,'Tom',25]) import csv with open(\"猫眼/第⼀页.csv\",'w',newline=\"\") as f: writer = csv.writer(f) writer.writerow(['电影名','主演','上映时间']) ''' 如果使⽤utf-8会出现['\霸王别姬', '张国荣,张丰毅,巩俐', '1993-01-01'] 使⽤utf-8-sig['霸王别姬', '张国荣,张丰毅,巩俐', '1993-01-01'] 两者的区别: UTF-8以字节为编码单元,它的字节顺序在所有系统中都是⼀様的,没有字节序的问题, 也因此它实际上并不需要BOM(“ByteOrder Mark”)。 
但是UTF-8 with BOM即utf-8-sig需要提供BOM。 ''' with open(\"猫眼/第1页.txt\",'r',encoding=\"utf-8-sig\") as file: while True: data_list = file.readline().strip().split(\"|\") print(data_list) writer.writerow(data_list) if data_list[0]=='': break 2、Xpath⼯具(解析HTML) 1、Xpath 在XML⽂档中查找信息的语⾔,同样适⽤于HTML⽂档的检索 2、Xpath辅助⼯具 1、Chrome插件:Xpath Helper 打开/关闭:Ctrl + Shift + ⼤写X 2、FireFox插件:XPath checker 3、Xpath表达式编辑⼯具:XML Quire 3、Xpath匹配规则 1、匹配演⽰ 1、查找bookstore下⾯的所有节点:/bookstore 2、查找所有的book节点://book 3、查找所有book节点下title节点中,lang属性为‘en’的节点://book/title[@lang='en'] 2、选取节点 /:从根节点开始选取 /bookstore,表⽰“/‘前⾯的节点的⼦节点 //:从整个⽂档中查找某个节点 //price,表⽰“//”前⾯节点的所有后代节点 @:选取某个节点的属性 //title[@lang=\"en\"] 3、@使⽤ 1、选取1个节点://title[@lang='en'] 2、选取N个节点://title[@lang] 3、选取节点属性值://title/@lang 4、匹配多路径 1、符号: | 2、⽰例: 获取所有book节点下的title节点和price节点 //book/title|//book/price 5、函数 contains():匹配⼀个属性值中包含某些字符串的节点 //title[contains(@lang,'e')] text():获取⽂本 last():获取最后⼀个元素 //ul[@class='pagination']/li[last()] not():取反 //*[@id=\"content\"]/div[2]//p[not(@class='otitle')] 6、可以通过解析出来的标签对象继续调⽤xpath函数往下寻找标签 语法:获取的标签对象.xpath(“./div/span”) \"\"\" 糗事百科https://www.qiushibaike.com/8hr/page/1/匹配内容 1、⽤户昵称,div/div/a/h2.text 2、内容,div/a/div/span.text 3、点赞数,div/div/span/i.text 4、评论数,div/div/span/a/i.text\"\"\" import requests from lxml import etree url = \"https://www.qiushibaike.com/8hr/page/1/\"headers = {'User-Agent':\"Mozilla5.0/\"}res = requests.get(url,headers=headers)res.encoding = \"utf-8\"html = res.text #先获取所有段⼦的div列表parseHtml = etree.HTML(html) div_list = parseHtml.xpath(\"//div[contains(@id,'qiushi_tag_')]\")print(len(div_list))#遍历列表 for div in div_list: #获取⽤户昵称 username = div.xpath('./div/a/h2')[0].text print(username) #获取内容 content = div.xpath('.//div[@class=\"content\"]/span')[0].text print(content) #获取点赞 laughNum = div.xpath('./div/span/i')[0].text print(laughNum) #获取评论数 pingNum = div.xpath('./div/span/a/i')[0].text print(pingNum) 3、解析HTML源码 1、lxml库:HTML/XML解析库 1、安装 conda install lxml pip install lxml 2、使⽤流程 1、利⽤lxml库的etree模块构建解析对象 2、解析对象调⽤xpath⼯具定位节点信息 3、使⽤ 1、导⼊模块from lxml import etree 2、创建解析对象:parseHtml = etree.HTML(html) 3、调⽤xpath进⾏解析:r_list = parseHtml.xpath(\"//title[@lang='en']\") 说明:只要调⽤了xpath,则结果⼀定是列表 from lxml import etree html = \"\"\" #1、创建解析对象 parseHtml = etree.HTML(html)#2、利⽤解析对象调⽤xpath⼯具,#获取a标签中href的值s1 = \"//a/@href\"#获取单独的/ s2 = \"//a[@id='channel']/@href\"#获取后⾯的a标签中href的值s3 = \"//li/a/@href\" s3 = \"//ul[@id='nav']/li/a/@href\"#更准确 #获取所有a标签的内容,1、⾸相获取标签对象,2、遍历对象列表,在通过对象.text属性获取⽂本值s4 = \"//a\" #获取新浪社会 s5 = \"//a[@id='channel']\"#获取国内,国际,.......s6 = \"//ul[@id='nav']//a\"r_list = parseHtml.xpath(s6)print(r_list)for i in r_list: print(i.text) 4、案例:抓取百度贴吧帖⼦⾥⾯的图⽚ 1、⽬标:抓取贴吧中帖⼦图⽚ 2、思路 1、先获取贴吧主页的URL:河南⼤学,下⼀页的URL规律 2、获取河南⼤学吧中每个帖⼦的URL 3、对每个帖⼦发送请求,获取帖⼦⾥⾯所有图⽚的URL 4、对图⽚URL发送请求,以wb的范式写⼊本地⽂件 \"\"\"步骤 1、获取贴吧主页的URL http://tieba.baidu.com/f?kw=河南⼤学&pn=0 http://tieba.baidu.com/f?kw=河南⼤学&pn=50 2、获取每个帖⼦的URL,//div[@class='t_con cleafix']/div/div/div/a/@href https://tieba.baidu.com/p/5878699216 3、打开每个帖⼦,找到图⽚的URL,//img[@class='BDE_Image']/@src http://imgsrc.baidu.com/forum/w%3D580/sign=da37aaca6fd9f2d3201124e799ed8a53/27985266d01609240adb3730d90735fae7cd3480.jpg 4、保存到本地 \"\"\" import requests from lxml import etreeclass TiebaPicture: def __init__(self): self.baseurl = \"http://tieba.baidu.com\" self.pageurl = \"http://tieba.baidu.com/f\" self.headers = {'User-Agent':\"Mozilla5.0/\"} def getPageUrl(self,url,params): '''获取每个帖⼦的URL''' res = requests.get(url,params=params,headers = self.headers) res.encoding = 'utf-8' html = res.text #从HTML页⾯获取每个帖⼦的URL parseHtml = etree.HTML(html) 
t_list = parseHtml.xpath(\"//div[@class='t_con cleafix']/div/div/div/a/@href\") print(t_list) for t in t_list: t_url = self.baseurl + t self.getImgUrl(t_url) def getImgUrl(self,t_url): '''获取帖⼦中所有图⽚的URL''' res = requests.get(t_url,headers=self.headers) res.encoding = \"utf-8\" html = res.text parseHtml = etree.HTML(html) img_url_list = parseHtml.xpath(\"//img[@class='BDE_Image']/@src\") for img_url in img_url_list: self.writeImg(img_url) def writeImg(self,img_url): '''将图⽚保存如⽂件''' res = requests.get(img_url,headers=self.headers) html = res.content #保存到本地,将图⽚的URL的后10位作为⽂件名 filename = img_url[-10:] with open(filename,'wb') as f: print(\"%s正在下载\"%filename) f.write(html) print(\"%s下载完成\"%filename) def workOn(self): '''主函数''' kw = input(\"请输⼊你要爬取的贴吧名\") begin = int(input(\"请输⼊起始页\")) end = int(input(\"请输⼊终⽌页\")) for page in range(begin,end+1): pn = (page-1)*50 #拼接某个贴吧的URl params = {\"kw\":kw,\"pn\":pn} self.getPageUrl(self.pageurl,params=params) if __name__ == \"__main__\": spider = TiebaPicture() spider.workOn() 爬取百度贴吧图⽚ 1、动态⽹站数据抓取 - Ajax 1、Ajax动态加载 1、特点:动态加载(滚动⿏标滑轮时加载) 2、抓包⼯具:查询参数在WebForms -> QueryString 2、案例:⾖瓣电影top100榜单 import requestsimport jsonimport csv url = \"https://movie.douban.com/j/chart/top_list\"headers = {'User-Agent':\"Mozilla5.0/\"} params = {\"type\":\"11\", \"interval_id\":\"100:90\", \"action\":\"\", \"start\":\"0\", \"limit\":\"100\"} res = requests.get(url,params=params,headers=headers)res.encoding=\"utf-8\" #得到json格式的数组[]html = res.text #把json格式的数组转为python的列表ls = json.loads(html) with open(\"⾖瓣100.csv\",'a',newline=\"\") as f: writer = csv.writer(f) writer.writerow([\"name\",\"score\"]) for dic in ls: name = dic['title'] score = dic['rating'][1] writer.writerow([name,score]) 2、json模块 1、作⽤:json格式类型 和 Python数据类型相互转换 2、常⽤⽅法 1、json.loads():json格式 --> Python数据类型 json python 对象 字典 数组 列表 2、json.dumps(): 3、selenium + phantomjs 强⼤的⽹络爬⾍ 1、selenium 1、定义:WEB⾃动化测试⼯具,应⽤于WEB⾃动化测试 2、特点: 1、可运⾏在浏览器上,根据指令操作浏览器,让浏览器⾃动加载页⾯ 2、只是⼀个⼯具,不⽀持浏览器功能,只能与第三⽅浏览器结合使⽤ 3、安装 conda install selenium pip install selenium 2、phantomjs 1、Windowds 1、定义:⽆界⾯浏览器(⽆头浏览器) 2、特点: 1、把⽹站加载到内存执⾏页⾯加载 2、运⾏⾼效 3、安装 1、把安装包拷贝到Python安装路径Script... 2、Ubuntu 1、下载phantomjs安装包放到⼀个路径下 2、⽤户主⽬录:vi .bashrc export PHANTOM_JS = /home/.../phantomjs-... 
export PATH=$PHANTOM_JS/bin:$PATH 3、source .bashrc 4、终端:phantomjs 3、⽰例代码 #导⼊selenium库中的⽂本driverfrom selenium import webdriver#创建打开phantomjs的对象driver = webdriver.PhantomJS()#访问百度 driver.get(\"http://www.baidu.com/\")#获取⽹页截图 driver.save_screenshot(\"百度.png\") 4、常⽤⽅法 1、driver.get(url) 2、driver.page_source.find(\"内容\"): 作⽤:从html源码中搜索字符串,搜索成功返回⾮-1,搜索失败返回-1 from selenium import webdriverdriver = webdriver.PhantomJS()driver.get(\"http://www.baidu.com/\")r1 = driver.page_source.find(\"kw\")r2 = driver.page_source.find(\"aaaa\")print(r1,r2)#1053 -1 3、driver.find_element_by_id(\"id值\").text 4、driver.find_element_by_name(\"属性值\") 5、driver.find_element_by_class_name(\"属性值\") 6、对象名.send_keys(\"内容\") 7、对象名.click() 8、driver.quit() 5、案例:登录⾖瓣⽹站4、BeautifulSoup 1、定义:HTML或XML的解析,依赖于lxml库 2、安装并导⼊ 安装: pip install beautifulsoup4 conda install beautifulsoup4 导⼊模块:from bs4 import BeautifulSoup as bs 3、⽰例 4、BeautifulSoup⽀持的解析库 1、lxml HTML解析器, 'lxml'速度快,⽂档容错能⼒强 2、Python标准库 'html.parser',速度⼀般 3、lxml XML解析器 'xml':速度快 from selenium import webdriver from bs4 import BeautifulSoup as bsimport time driver = webdriver.PhantomJS() driver.get(\"https://www.douyu.com/directory/all\")while True: html = driver.page_source #创建解析对象 soup = bs(html,'lxml') #直接调⽤⽅法去查找元素 #存放所有主播的元素对象 names = soup.find_all(\"span\",{\"class\":\"dy-name ellipsis fl\"}) numbers = soup.find_all(\"span\",{\"class\":\"dy-num fr\"}) #name ,number 都是对象,有get_text() for name , number in zip(names,numbers): print(\"观众⼈数:\",number.get_text(),\"主播\",name.get_text()) if html.find(\"shark-pager-disable-next\") ==-1: driver.find_element_by_class_name(\"shark-pager-next\").click() time.sleep(4) else: break 使⽤pytesseract识别验证码 1、安装 sudo pip3 install pytesseract 2、使⽤步骤: 1、打开验证码图⽚:Image.open(‘验证码图⽚路径’) 2、使⽤pytesseract模块中的image_to_string()⽅法进⾏识别 from PIL import Imagefrom pytesseract import *#1、加载图⽚ image = Image.open('t1.png')#2、识别过程 text = image_to_string(image)print(text) 使⽤captcha模块⽣成验证码 1、安装 sudo pip3 install captcha import random from PIL import Imageimport numpy as np from captcha.image import ImageCaptcha digit = ['0','1','2','3','4','5','6','7','8','9'] alphabet = [chr(i) for i in range(97,123)]+[chr(i) for i in range(65,91)]char_set = digit + alphabet#print(char_set) def random_captcha_text(char_set=char_set,captcha_size=4): '''默认获取⼀个随机的含有四个元素的列表''' captcha_text = [] for i in range(captcha_size): ele = random.choice(char_set) captcha_text.append(ele) return captcha_text def gen_captcha_text_and_inage(): '''默认随机得到⼀个包含四个字符的图⽚验证码并返回字符集''' image = ImageCaptcha() captcha_text = random_captcha_text() #将列表转为字符串 captcha_text = ''.join(captcha_text) captchaInfo = image.generate(captcha_text) #⽣成验证码图⽚ captcha_imge = Image.open(captchaInfo) captcha_imge = np.array(captcha_imge) im = Image.fromarray(captcha_imge) im.save('captcha.png') return captcha_text if __name__ == '__main__': gen_captcha_text_and_inage() 去重 1、去重分为两个步骤,创建两个队列(列表) 1、⼀个队列存放已经爬取过了url,存放之前先判断这个url是否已经存在于已爬队列中,通过这样的⽅式去重 2、另外⼀个队列存放待爬取的url,如果该url不在已爬队列中则放⼊到带爬取队列中 使⽤去重和⼴度优先遍历爬取⾖瓣⽹ import re from bs4 import BeautifulSoupimport basicspiderimport hashlibHelper def get_html(url): \"\"\" 获取⼀页的⽹页源码信息 \"\"\" headers = [(\"User-Agent\",\"Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36\")] html = basicspider.downloadHtml(url, headers=headers) return html def get_movie_all(html): \"\"\" 获取当前页⾯中所有的电影的列表信息 \"\"\" soup = BeautifulSoup(html, \"html.parser\") movie_list = soup.find_all('div', class_='bd doulist-subject') #print(movie_list) return movie_list def 
get_movie_one(movie): \"\"\" 获取⼀部电影的精细信息,最终拼成⼀个⼤的字符串 \"\"\" result = \"\" soup = BeautifulSoup(str(movie),\"html.parser\") title = soup.find_all('div', class_=\"title\") soup_title = BeautifulSoup(str(title[0]), \"html.parser\") for line in soup_title.stripped_strings: result += line try: score = soup.find_all('span', class_='rating_nums') score_ = BeautifulSoup(str(score[0]), \"html.parser\") for line in score_.stripped_strings: result += \"|| 评分:\" result += line except: result += \"|| 评分:5.0\" abstract = soup.find_all('div', class_='abstract') abstract_info = BeautifulSoup(str(abstract[0]), \"html.parser\") for line in abstract_info.stripped_strings: result += \"|| \" result += line result += '\\n' print(result) return result def save_file(movieInfo): \"\"\" 写⽂件的操作,这⾥使⽤的追加的⽅式来写⽂件 \"\"\" with open(\"doubanMovie.txt\",\"ab\") as f: #lock.acquire() f.write(movieInfo.encode(\"utf-8\")) #lock.release() crawl_queue = []#待爬取队列crawled_queue = []#已爬取队列 def crawlMovieInfo(url): '''抓取⼀页数据''' 'https://www.douban.com/doulist/3516235/' global crawl_queue global crawled_queue html = get_html(url) regex = r'https://www\\.douban\\.com/doulist/3516235/\\?start=\\d+&sort=seq&playable=0&sub_type=' p = re.compile(regex,re.S) itemUrls = p.findall(html) #两步去重过程 for item in itemUrls: #将item进⾏hash然后判断是否已经在已爬队列中 hash_irem = hashlibHelper.hashStr(item) if hash_irem not in crawled_queue:#已爬队列去重 crawl_queue.append(item) crawl_queue = list(set(crawl_queue))#将待爬队列去重 #处理当前页⾯ movie_list = get_movie_all(html) for movie in movie_list: save_file(get_movie_one(movie)) #将url转为hash值并存⼊已爬队列中 hash_url = hashlibHelper.hashStr(url) crawled_queue.append(hash_url) if __name__ == \"__main__\": #⼴度优先遍历 seed_url = 'https://www.douban.com/doulist/3516235/?start=0&sort=seq&playable=0&sub_type=' crawl_queue.append(seed_url) while crawl_queue: url = crawl_queue.pop(0) crawlMovieInfo(url) print(crawled_queue) print(len(crawled_queue))import hashlib def hashStr(strInfo): '''对字符串进⾏hash''' hashObj = hashlib.sha256() hashObj.update(strInfo.encode('utf-8')) return hashObj.hexdigest() def hashFile(fileName): '''对⽂件进⾏hash''' hashObj = hashlib.md5() with open(fileName,'rb') as f: while True: #不要⼀次性全部读取出来,如果⽂件太⼤,内存不够 data = f.read(2048) if not data: break hashObj.update(data) return hashObj.hexdigest() if __name__ == \"__main__\": print(hashStr(\"hello\")) print(hashFile('猫眼电影.txt')) hashlibHelper.py from urllib import requestfrom urllib import parsefrom urllib import errorimport randomimport time def downloadHtml(url,headers=[()],proxy={},timeout=None,decodeInfo='utf-8',num_tries=10,useProxyRatio=11): ''' ⽀持user-agent等Http,Request,Headers ⽀持proxy 超时的考虑 编码的问题,如果不是UTF-8编码怎么办 服务器错误返回5XX怎么办 客户端错误返回4XX怎么办 考虑延时的问题 ''' time.sleep(random.randint(1,2))#控制访问,不要太快 #通过useProxyRatio设置是否使⽤代理 if random.randint(1,10) >useProxyRatio: proxy = None #创建ProxuHandler proxy_support = request.ProxyHandler(proxy) #创建opener opener = request.build_opener(proxy_support) #设置user-agent opener.addheaders = headers #安装opener request.install_opener(opener) html = None try: #这⾥可能出现很多异常 #可能会出现编码异常 #可能会出现⽹络下载异常:客户端的异常404,403 # 服务器的异常5XX res = request.urlopen(url) html = res.read().decode(decodeInfo) except UnicodeDecodeError: print(\"UnicodeDecodeError\") except error.URLError or error.HTTPError as e: #客户端的异常404,403(可能被反爬了) if hasattr(e,'code') and 400 <= e.code < 500: print(\"Client Error\"+e.code) elif hasattr(e,'code') and 500 <= e.code < 600: if num_tries > 0: time.sleep(random.randint(1,3))#设置等待的时间 downloadHtml(url,headers,proxy,timeout,decodeInfo,num_tries-1) return html if 
__name__ == \"__main__\": url = \"http://maoyan.com/board/4?offset=0\" headers = [(\"User-Agent\",\"User-Agent:Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50\")] print(downloadHtml(url,headers=headers)) basicspider.py Scrapy框架 在终端直接输⼊scrapy查看可以使⽤的命令 bench Run quick benchmark test fetch Fetch a URL using the Scrapy downloader genspider Generate new spider using pre-defined templates runspider Run a self-contained spider (without creating a project) settings Get settings values shell Interactive scraping console startproject Create new project version Print Scrapy version view Open URL in browser, as seen by Scrapy 使⽤步骤: 1、创建⼀个项⽬:scrapy startproject 项⽬名称 scrapy startproject tencentSpider 2、进⼊到项⽬中,创建⼀个爬⾍ cd tencentSpider scrapy genspider tencent hr.tencent.com #tencent表⽰创建爬⾍的名字,hr.tencent.com表⽰⼊⼝,要爬取的数据必须在这个域名之下 3、修改程序的逻辑 1、settings.py 1、设置ua 2、关闭robots协议 3、关闭cookie 4、打开ItemPipelines # -*- coding: utf-8 -*-# Scrapy settings for tencentSpider project# # For simplicity, this file contains only settings considered important or # commonly used. You can find more settings consulting the documentation:# # https://doc.scrapy.org/en/latest/topics/settings.html # https://doc.scrapy.org/en/latest/topics/downloader-middleware.html# https://doc.scrapy.org/en/latest/topics/spider-middleware.htmlBOT_NAME = 'tencentSpider' SPIDER_MODULES = ['tencentSpider.spiders']NEWSPIDER_MODULE = 'tencentSpider.spiders' # Crawl responsibly by identifying yourself (and your website) on the user-agent#USER_AGENT = 'tencentSpider (+http://www.yourdomain.com)' USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:62.0) Gecko/20100101 Firefox/62.0'# Obey robots.txt rules ROBOTSTXT_OBEY = False #是否遵循robots协议 # Configure maximum concurrent requests performed by Scrapy (default: 16)#CONCURRENT_REQUESTS = 32 # Configure a delay for requests for the same website (default: 0) # See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay# See also autothrottle settings and docs#DOWNLOAD_DELAY = 3 # The download delay setting will honor only one of:#CONCURRENT_REQUESTS_PER_DOMAIN = 16#CONCURRENT_REQUESTS_PER_IP = 16# Disable cookies (enabled by default)COOKIES_ENABLED = False # Disable Telnet Console (enabled by default)#TELNETCONSOLE_ENABLED = False # Override the default request headers:#DEFAULT_REQUEST_HEADERS = { # 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',# 'Accept-Language': 'en',#} # Enable or disable spider middlewares # See https://doc.scrapy.org/en/latest/topics/spider-middleware.html#SPIDER_MIDDLEWARES = { # 'tencentSpider.middlewares.TencentspiderSpiderMiddleware': 543,#} # Enable or disable downloader middlewares # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#DOWNLOADER_MIDDLEWARES = { # 'tencentSpider.middlewares.TencentspiderDownloaderMiddleware': 543,#} # Enable or disable extensions # See https://doc.scrapy.org/en/latest/topics/extensions.html#EXTENSIONS = { # 'scrapy.extensions.telnet.TelnetConsole': None,#} # Configure item pipelines # See https://doc.scrapy.org/en/latest/topics/item-pipeline.htmlITEM_PIPELINES = { 'tencentSpider.pipelines.TencentspiderPipeline': 300,#值表⽰优先级 } # Enable and configure the AutoThrottle extension (disabled by default)# See https://doc.scrapy.org/en/latest/topics/autothrottle.html#AUTOTHROTTLE_ENABLED = True# The initial download delay #AUTOTHROTTLE_START_DELAY = 5 # The maximum download delay to be set in case of high 
latencies#AUTOTHROTTLE_MAX_DELAY = 60 # The average number of requests Scrapy should be sending in parallel to# each remote server #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0 # Enable showing throttling stats for every response received:#AUTOTHROTTLE_DEBUG = False # Enable and configure HTTP caching (disabled by default) # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings#HTTPCACHE_ENABLED = True #HTTPCACHE_EXPIRATION_SECS = 0#HTTPCACHE_DIR = 'httpcache' #HTTPCACHE_IGNORE_HTTP_CODES = [] #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage' settings.py 2、items.py:ORM import scrapy class TencentspiderItem(scrapy.Item): # define the fields for your item here like: # name = scrapy.Field() #抓取招聘的职位,连接,岗位类型 positionName = scrapy.Field() positionLink = scrapy.Field() positionType = scrapy.Field() 3、pipelines.py:保存数据的逻辑 import json class TencentspiderPipeline(object): def process_item(self, item, spider): with open('tencent.json','ab') as f: text = json.dumps(dict(item),ensure_ascii=False)+'\\n' f.write(text.encode('utf-8')) return item 4、spiders/tencent.py:主体的逻辑 import scrapy from tencentSpider.items import TencentspiderItem class TencentSpider(scrapy.Spider): name = 'tencent' allowed_domains = ['hr.tencent.com'] #start_urls = ['http://hr.tencent.com/']# start_urls = [] # for i in range(0,530,10): # url = \"https://hr.tencent.com/position.php?keywords=python&start=\"# url += str(i)+\"#a\" # start_urls.append(url) url = \"https://hr.tencent.com/position.php?keywords=python&start=\" offset = 0 start_urls = [url + str(offset)+\"#a\"] def parse(self, response): for each in response.xpath('//tr[@class=\"even\"]|//tr[@class=\"odd\"]'): item = TencentspiderItem()#item是⼀个空字典 item['positionName'] = each.xpath('./td[1]/a/text()').extract()[0] item['positionLink'] = \"https://hr.tencent.com/\"+each.xpath('./td[1]/a/@href').extract()[0] item['positionType'] = each.xpath('./td[2]/text()').extract()[0] yield item #提取链接 if self.offset < 530: self.offset += 10 nextPageUrl = self.url+str(self.offset)+\"#a\" else: return #对下⼀页发起请求 yield scrapy.Request(nextPageUrl,callback = self.parse) 4、运⾏爬⾍ scrapy crawl tencent 5、运⾏爬⾍ 并将数据保存到指定⽂件中 scrapy crawl tencent -o ⽂件名 如何在scrapy框架中设置代理服务器 1、可以在middlewares.py⽂件中的DownloaderMiddleware类中的process_request()⽅法中,来完成代理服务器的设置 2、然后将代理服务器的池放在setting.py⽂件中定义⼀个proxyList = [.....] 3、process_request()⽅法⾥⾯通过random.choice(proxyList)随机选⼀个代理服务器 注意: 1、这⾥的代理服务器如果是私密的,有⽤户名和密码时,需要做⼀层简单的加密处理Base64 2、在scrapy⽣成⼀个基础爬⾍时使⽤:scrapy genspider tencent hr.tencent.com,如果要想⽣成⼀个⾼级的爬⾍CrawlSpider scrapy genspider -t crawl tencent2 hr.tencent.com CrawSpider这个爬⾍可以更加灵活的提取URL等信息,需要了解URL,LinkExtractor Scrapy-Redis搭建分布式爬⾍ Redis是⼀种内存数据库(提供了接⼝将数据保存到磁盘数据库中); 因篇幅问题不能全部显示,请点此查看更多更全内容