
Python Web Crawlers


1. Web crawlers
   1. Definition: also called spiders or bots; programs that fetch data from the web
   2. In short: a Python program that imitates a human visiting a website; the more faithful the imitation, the better
   3. Purpose: analyze market trends and support company decisions with large amounts of useful data

2. How companies obtain data
   1. In-house data
   2. Purchased from third-party data platforms (e.g. Datatang, the Guiyang Big Data Exchange)
   3. Crawled with a spider, when the data is not on the market or is too expensive

3. Why Python for crawlers
   1. Python: mature, rich request and parsing modules
   2. PHP: weak multithreading and async support
   3. Java: heavyweight, verbose code
   4. C/C++: efficient, but slow to develop in

4. Types of crawlers
   1. General-purpose crawlers (used by search engines; must obey the robots protocol)
      How a search engine learns the URL of a new site:
      1. The site submits it to the engine (e.g. Baidu Webmaster Platform)
      2. Through DNS providers (e.g. wanwang), for fast indexing of new sites
   2. Focused crawlers (crawl only what you need)
      The crawlers we write ourselves: topic-oriented, requirement-oriented

5. Steps of a crawl
   1. Determine the URL to crawl
   2. Fetch the HTML response over HTTP/HTTPS
   3. Extract the useful data from the HTML page
      1. Save the data you need
      2. Feed other URLs found on the page back into step 2

6. Chrome browser plugins
   1. Installation
      1. Top-right menu -> More tools -> Extensions
      2. Enable Developer mode
      3. Drag the plugin onto the browser window
   2. Plugins
      1. Proxy SwitchyOmega: proxy switching
      2. XPath Helper: page data extraction
      3. JSON View: pretty-prints JSON data

7. Fiddler packet-capture tool
   1. Capture setup
      1. Configure Fiddler for capturing
      2. Set the browser proxy: Proxy SwitchyOmega -> Options -> New profile -> HTTP 127.0.0.1 8888
   2. Common Fiddler menus
      1. Inspector: view the details of a captured packet
      2. Useful tabs
         1. Headers: the headers the client sent to the server, including client info and cookies
         2. WebForms: the POST form data of the request
         3. Raw: the whole request as plain text

1. urllib.request.urlopen(url): send a request and return a response object

import urllib.request

url = 'http://www.baidu.com/'
# send the request and get the response object
response = urllib.request.urlopen(url)
# read() returns the response body as bytes; decode() turns the bytes into a string
html = response.read().decode('utf-8')
print(html)

2. urllib.request.Request(url, headers={})
   1. Rewriting the User-Agent is step one in the fight between crawlers and anti-crawling measures
   2. Usage
      1. Build the request object: request = Request(url, headers=...)
      2. Get the response object: response = urlopen(request)
      3. Read it: response.read().decode('utf-8')

# -*- coding: utf-8 -*-
import urllib.request

url = 'http://www.baidu.com/'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50'}

# 1. build the request object
request = urllib.request.Request(url, headers=headers)
# 2. get the response object
response = urllib.request.urlopen(request)
# 3. read the response body
html = response.read().decode('utf-8')
print(html)

3. Methods of the request object
   1. add_header(): add or modify a header (e.g. User-Agent)
   2. get_header('User-agent'): return the value of an existing header; note that only the U is capitalized

import urllib.request

url = 'http://www.baidu.com/'
ua = 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50'

request = urllib.request.Request(url)
# add_header() sets the User-Agent on the request
request.add_header("User-Agent", ua)
# get the response object
response = urllib.request.urlopen(request)

# get_header() reads the User-Agent back; note the spelling: only the U is upper-case
print(request.get_header('User-agent'))
# response status code
print(response.getcode())
# response header information
print(response.info())

html = response.read().decode('utf-8')
print(html)

4. Methods of the response object
   1. read(): read the body of the server's response
   2. getcode(): return the HTTP status code
      200: success
      4XX: client/request error (the server was reached, but the page is wrong or missing)
      5XX: server-side error
   3. info(): return the server's response headers

2. urllib.parse
   1. quote('Chinese string'): percent-encode a string
   2. urlencode(dict): encode a dict of query parameters
   3. unquote("encoded string"): decode
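A quick sketch of the three helpers; the encoded value matches the wd= parameter shown in the Baidu example below:

from urllib import parse

print(parse.quote('美女'))                  # %E7%BE%8E%E5%A5%B3
print(parse.urlencode({'wd': '美女'}))      # wd=%E7%BE%8E%E5%A5%B3
print(parse.unquote('%E7%BE%8E%E5%A5%B3'))  # 美女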

import urllib.request
import urllib.parse

url = 'http://www.baidu.com/s?wd='
key = input('Enter the search term: ')
# percent-encode the keyword and build the full URL
key = urllib.parse.quote(key)
fullurl = url + key
print(fullurl)   # e.g. http://www.baidu.com/s?wd=%E7%BE%8E%E5%A5%B3

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50'}
request = urllib.request.Request(fullurl, headers=headers)
resp = urllib.request.urlopen(request)
html = resp.read().decode('utf-8')
print(html)

import urllib.request
import urllib.parse

baseurl = "http://www.baidu.com/s?"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50'}

key = input("Enter the search term: ")
# urlencode() takes a dict of query parameters
d = {"wd": key}
d = urllib.parse.urlencode(d)
url = baseurl + d

req = urllib.request.Request(url, headers=headers)
resp = urllib.request.urlopen(req)
html = resp.read().decode('utf-8')
print(html)

Exercise: crawling Baidu Tieba
1. Simple version

# -*- coding: utf-8 -*-
"""
Baidu Tieba scraper. Requirements:
1. Input the name of the tieba (forum)
2. Input the start page and end page to crawl
3. Save each page locally as 第1页.html, 第2页.html, ...
Example URL: http://tieba.baidu.com/f?kw=%E6%B2%B3%E5%8D%97%E5%A4%A7%E5%AD%A6&ie=utf-8&pn=0
"""
import urllib.request
import urllib.parse

baseurl = "http://tieba.baidu.com/f?"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50'}

title = input("Enter the tieba to crawl: ")
begin_page = int(input("Enter the start page: "))
end_page = int(input("Enter the end page: "))

# URL-encode the kw parameter
kw = urllib.parse.urlencode({"kw": title})

# loop over the pages: build the URL, send the request, save the HTML locally
for page in range(begin_page, end_page + 1):
    pn = (page - 1) * 50
    url = baseurl + kw + "&pn=" + str(pn)
    req = urllib.request.Request(url, headers=headers)
    res = urllib.request.urlopen(req)
    html = res.read().decode("utf-8")

    filename = "第" + str(page) + "页.html"
    with open(filename, 'w', encoding='utf-8') as f:
        print("Downloading page %d" % page)
        f.write(html)
    print("Page %d saved" % page)

2. Function version

import urllib.request
import urllib.parse

def getPage(url):
    '''Send the request and return the HTML.'''
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50'}
    req = urllib.request.Request(url, headers=headers)
    res = urllib.request.urlopen(req)
    return res.read().decode("utf-8")

def writePage(filename, html):
    '''Save the HTML to a local file.'''
    with open(filename, 'w', encoding="utf-8") as f:
        f.write(html)

def workOn():
    '''Main entry point.'''
    name = input("Enter the tieba name: ")
    begin = int(input("Enter the start page: "))
    end = int(input("Enter the end page: "))
    baseurl = "http://tieba.baidu.com/f?"
    kw = urllib.parse.urlencode({"kw": name})
    for page in range(begin, end + 1):
        pn = (page - 1) * 50
        url = baseurl + kw + "&pn=" + str(pn)
        html = getPage(url)
        filename = "第" + str(page) + "页.html"
        writePage(filename, html)

if __name__ == "__main__":
    workOn()

3. Class version

import urllib.request
import urllib.parse

class BaiduSpider:
    def __init__(self):
        self.baseurl = "http://tieba.baidu.com/f?"
        self.headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50'}

    def getPage(self, url):
        '''Send the request and return the HTML.'''
        req = urllib.request.Request(url, headers=self.headers)
        res = urllib.request.urlopen(req)
        return res.read().decode("utf-8")

    def writePage(self, filename, html):
        '''Save the HTML to a local file.'''
        with open(filename, 'w', encoding="utf-8") as f:
            f.write(html)

    def workOn(self):
        '''Main entry point.'''
        name = input("Enter the tieba name: ")
        begin = int(input("Enter the start page: "))
        end = int(input("Enter the end page: "))
        kw = urllib.parse.urlencode({"kw": name})
        for page in range(begin, end + 1):
            pn = (page - 1) * 50
            url = self.baseurl + kw + "&pn=" + str(pn)
            html = self.getPage(url)
            filename = "第" + str(page) + "页.html"
            self.writePage(filename, html)

if __name__ == "__main__":
    # create the spider object and call its main method
    baiduSpider = BaiduSpider()
    baiduSpider.workOn()

1. Parsing
   1. Kinds of data
      1. Structured data
         Has a fixed format: HTML, XML, JSON, etc.
      2. Unstructured data
         Examples: images, audio, video; usually stored as binary
   2. Regular expressions (the re module)
      1. Workflow
         1. Create a compiled pattern: p = re.compile(r"\d")
         2. Match it against a string: result = p.match('123ABC')
         3. Get the match: print(result.group())
      2. Common methods
         1. match(s): matches only at the start of the string, returns a match object
         2. search(s): scans forward and returns the first match as an object
         3. group(): extracts the matched text from a match/search result
         4. findall(s): returns all matches as a list
      3. Syntax
         . : any character (does not match \n)
         [...] : any one character from the set
         \d : digit
         \w : letter, digit or underscore
         \s : whitespace character
         \S : non-whitespace character
         * : previous token 0 or more times
         ? : 0 or 1 time
         + : 1 or more times
         {m} : previous token exactly m times
         Greedy matching: match as much as possible while the whole pattern still succeeds
         Non-greedy matching: match as little as possible while the whole pattern still succeeds
      4. Example:

import re

# sample text: each line of the poem is wrapped in a <p> tag
s = """<p>
仰天大笑出门去,我辈岂是篷篙人
</p>
<p>
天生我材必有用,千金散尽还复来
</p>"""

# greedy match: .* grabs as much as it can, so both <p> blocks come back as one match
p = re.compile("<p>.*</p>", re.S)
result = p.findall(s)
print(result)
# ['<p>\n仰天大笑出门去,我辈岂是篷篙人\n</p>\n<p>\n天生我材必有用,千金散尽还复来\n</p>']

# non-greedy match: .*? stops at the first closing tag, so each block is a separate match
p1 = re.compile("<p>.*?</p>", re.S)
result1 = p1.findall(s)
print(result1)
# ['<p>\n仰天大笑出门去,我辈岂是篷篙人\n</p>', '<p>\n天生我材必有用,千金散尽还复来\n</p>']

5. Groups with findall()
   findall() first matches the whole pattern, then returns only the contents of the () groups; with two or more groups each match is returned as a tuple.

import re

s = 'A B C D'

p1 = re.compile(r"\w+\s+\w+")
print(p1.findall(s))   # no groups: the whole matches -> ['A B', 'C D']

# one group: the pattern still matches 'A B' and 'C D' as wholes,
# but only the group content is returned -> ['A', 'C']
p2 = re.compile(r"(\w+)\s+\w+")
print(p2.findall(s))

# two groups: each match becomes a tuple of the group contents -> [('A', 'B'), ('C', 'D')]
p3 = re.compile(r"(\w+)\s+(\w+)")
print(p3.findall(s))

6. Exercise: the Maoyan top-100 movie board

# -*- coding: utf-8 -*-
"""
Crawl the Maoyan top-100 movie board.
1. On start, crawl page 1 immediately
2. Ask whether to continue (y/n): y crawls the next page, n ends the crawl
3. Save each page's results to a local file
   page 1: http://maoyan.com/board/4?offset=0
   page 2: http://maoyan.com/board/4?offset=10
4. Parse out: movie title, actors, release date
"""

import urllib.request
import re

class MaoyanSpider:
    '''Crawl the Maoyan top-100 movie board.'''
    def __init__(self):
        self.baseurl = "http://maoyan.com/board/4?offset="
        self.headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50'}

    def getPage(self, url):
        '''Fetch one HTML page.'''
        req = urllib.request.Request(url, headers=self.headers)
        rep = urllib.request.urlopen(req)
        return rep.read().decode("utf-8")

    def writePage(self, filename, html):
        '''Parse the page and append title|actors|release date lines to a local file.'''
        content_list = self.match_contents(html)
        for content_tuple in content_list:
            movie_title = content_tuple[0].strip()
            movie_actors = content_tuple[1].strip()[3:]     # drop the leading "主演:"
            releasetime = content_tuple[2].strip()[5:15]    # drop "上映时间:" and keep the date
            with open(filename, 'a', encoding='utf-8') as f:
                f.write(movie_title + "|" + movie_actors + "|" + releasetime + '\n')

    def match_contents(self, html):
        '''Match movie title, actors and release date.
        The board markup looks roughly like:
            <p class="name"><a ...>霸王别姬</a></p>
            <p class="star">主演:张国荣,张丰毅,巩俐</p>
            <p class="releasetime">上映时间:1993-01-01(中国香港)</p>
        '''
        regex = r'<p class="name"><a.*?>(.*?)</a></p>.*?<p class="star">(.*?)</p>.*?<p class="releasetime">(.*?)</p>'
        p = re.compile(regex, re.S)
        return p.findall(html)

    def workOn(self):
        '''Main entry point.'''
        for page in range(0, 10):
            # build the URL for this page
            url = self.baseurl + str(page * 10)
            # filename = '猫眼/第' + str(page + 1) + "页.html"
            filename = '猫眼/第' + str(page + 1) + "页.txt"
            print("Crawling page %s" % (page + 1))
            html = self.getPage(url)
            self.writePage(filename, html)

            # record whether the user wants to keep going
            flag = False
            while True:
                msg = input("Continue crawling? (y/n) ")
                if msg == "y":
                    flag = True
                elif msg == "n":
                    print("Crawl finished, thanks for using")
                    flag = False
                else:
                    print("Invalid command")
                    continue
                break
            if not flag:
                return None
        print("All pages crawled")

if __name__ == "__main__":
    spider = MaoyanSpider()
    spider.workOn()

Maoyan top-100 crawler

   3. XPath (see below)
   4. BeautifulSoup (see below)

2. Request methods
   1. GET (query parameters are visible in the URL)
   2. POST
      1. Characteristics: query parameters are carried in the form body
      2. Usage: urllib.request.urlopen(url, data=data, headers=headers)
         data: the form data; it must be submitted as bytes, not as a dict
      3. Case study: Youdao translation
         1. Use Fiddler to capture the form data under WebForms
         2. Convert the POST data to bytes
         3. Send the request and read the response

from urllib import request, parse
import json

# 1. build the form data
# put the form fields in a dict, then encode the dict
word = input('Enter the text to translate: ')
data = {"i": word,
        "from": "AUTO",
        "to": "AUTO",
        "smartresult": "dict",
        "client": "fanyideskweb",
        "salt": "1536648367302",
        "sign": "f7f6b53876957660bf69994389fd0014",
        "doctype": "json",
        "version": "2.1",
        "keyfrom": "fanyi.web",
        "action": "FY_BY_REALTIME",
        "typoResult": "false"}

# 2. convert data to bytes
data = parse.urlencode(data).encode('utf-8')

# 3. send the request and read the response
# the URL is the POST URL captured with the packet sniffer
url = "http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50'}

req = request.Request(url, data=data, headers=headers)
res = request.urlopen(req)
result = res.read().decode('utf-8')
print(type(result))   # <class 'str'>

# result is a JSON-formatted string, e.g.
# {"type":"ZH_CN2EN","errorCode":0,"elapsedTime":1,
#  "translateResult":[[{"src":"你好","tgt":"hello"}]]}
print(result)

# convert the JSON string into a Python dict
dic = json.loads(result)
print(dic["translateResult"][0][0]["tgt"])

4. The json module
   json.loads('JSON-formatted string')
   Purpose: convert a JSON-formatted string into a Python dict

3. Simulating a login with cookies
   1. Cookie and Session
      cookie: identifies the user through information stored on the client
      session: identifies the user through information stored on the server
   2. Case study: logging in to renren.com with a cookie
      1. Log in once and capture the cookie from the request
      2. Send a request carrying that cookie and read the response

from urllib import request

url = "http://www.renren.com/967982493/profile"
headers = {
    'Host': 'www.renren.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:62.0) Gecko/20100101 Firefox/62.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
    # 'Accept-Encoding': 'gzip, deflate',
    'Referer': 'http://www.renren.com/SysHome.do',
    # cookie captured after logging in once (value truncated here)
    'Cookie': 'anonymid=jlxfkyrx-jh2vcz; depovince=SC; _r01_=1; jebe_key=6aac48eb-05fb-4569-8b0d-5d71a4a7a3e4%7C911ac4448a97a17c4d3447cbdae800e4%7C1536714317279%7C1%7C1536714319337; jebecookies=a70e405c-c17a-4877',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
}

req = request.Request(url, headers=headers)
res = request.urlopen(req)
html = res.read().decode('utf-8')
print(html)

3. The requests module
   1. Installation (Conda prompt / terminal)
      (base) -> conda install requests
   2. Common methods
      1. get(): send a request and return a response object
         1. Usage: response = requests.get(url, headers=headers)
         2. Attributes of the response
            1. response.text: the response body as a string
               Note: the guessed encoding is often ISO-8859-1; set it manually with response.encoding = 'utf-8'
            2. response.content: the response body as bytes
               Used for unstructured data such as images and audio (see the image-download example later)
            3. response.status_code: the server's status code

import requests

url = "http://www.baidu.com/"
headers = {"User-Agent": "Mozilla5.0/"}

# send the request and get the response object
response = requests.get(url, headers=headers)
# set the encoding by hand
response.encoding = 'utf-8'
# text returns a string
print(response.text)
# content returns bytes
print(response.content)
print(response.status_code)   # 200

3. get() with query parameters: params (a dict)
   1. Without query parameters:
      res = requests.get(url, headers=headers)
   2. With query parameters:
      params = {"wd": "python"}
      res = requests.get(url, params=params, headers=headers)
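A minimal runnable sketch of the params form, reusing the Baidu search endpoint from the urllib examples above (the wd value is just an illustration):

import requests

url = "http://www.baidu.com/s"
headers = {"User-Agent": "Mozilla5.0/"}
params = {"wd": "python"}

# requests builds the query string for us: http://www.baidu.com/s?wd=python
res = requests.get(url, params=params, headers=headers)
res.encoding = 'utf-8'
print(res.url)           # the final URL, including the encoded query string
print(res.status_code)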

2. post(): the data parameter
   1. data = {}: data is a dict; unlike urllib, it does not need to be converted to bytes
   2. Example:

import requests
import json

# 1. build the form data
word = input('Enter the text to translate: ')
data = {"i": word,
        "from": "AUTO",
        "to": "AUTO",
        "smartresult": "dict",
        "client": "fanyideskweb",
        "salt": "1536648367302",
        "sign": "f7f6b53876957660bf69994389fd0014",
        "doctype": "json",
        "version": "2.1",
        "keyfrom": "fanyi.web",
        "action": "FY_BY_REALTIME",
        "typoResult": "false"}

# the URL is the POST URL captured with the packet sniffer
url = "http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50'}

response = requests.post(url, data=data, headers=headers)
response.encoding = 'utf-8'
result = response.text
print(type(result))   # <class 'str'>

# result is a JSON-formatted string, e.g.
# {"type":"ZH_CN2EN","errorCode":0,"elapsedTime":1,
#  "translateResult":[[{"src":"你好","tgt":"hello"}]]}
print(result)

# convert the JSON string into a Python dict
dic = json.loads(result)
print(dic["translateResult"][0][0]["tgt"])

3. Proxies: proxies
   1. Step two in the fight against anti-crawling measures. Sites that list proxy IPs: Xici proxy, Kuaidaili, Quanguo proxy
   2. Plain proxy: proxies = {"protocol": "IP:port"}
      e.g. proxies = {'http': "123.161.237.114:45327"}

import requests

url = "http://www.taobao.com/"
proxies = {"http": "123.161.237.114:45327"}
headers = {"User-Agent": "Mozilla5.0/"}

response = requests.get(url, proxies=proxies, headers=headers)
response.encoding = 'utf-8'
print(response.text)

3. Authenticated (private) proxy: proxies = {"protocol": "http://user:password@IP:port"}
   e.g. proxies = {'http': 'http://309435365:szayclhp@114.67.228.126:16819'}

import requests

url = "http://www.taobao.com/"
proxies = {'http': 'http://309435365:szayclhp@114.67.228.126:16819'}
headers = {"User-Agent": "Mozilla5.0/"}

response = requests.get(url, proxies=proxies, headers=headers)
response.encoding = 'utf-8'
print(response.text)

4. Case study: crawling Lianjia second-hand housing listings
   1. Storing into a MySQL database

import pymysql

db = pymysql.connect("localhost", "root", "123456", charset='utf8')
cursor = db.cursor()

cursor.execute("create database if not exists testspider;")
cursor.execute("use testspider;")
cursor.execute("create table if not exists t1(id int);")
cursor.execute("insert into t1 values(100);")
db.commit()
cursor.close()
db.close()

   2. Storing into a MongoDB database

import pymongo

# connect to MongoDB
conn = pymongo.MongoClient('localhost', 27017)
# create/get the database object
db = conn.testpymongo
# create/get the collection object
myset = db.t1
# insert one document into the collection
myset.insert({"name": "Tom"})

\"\"\"

爬取链家地产⼆⼿房信息(⽤私密代理实现)⽬标:爬取⼩区名称,总价步骤:

1、获取url

https://cd.lianjia.com/ershoufang/pg1/ https://cd.lianjia.com/ershoufang/pg2/ 2、正则匹配

3、写⼊到本地⽂件\"\"\"

import requests
import re
import multiprocessing as mp

BASE_URL = "https://cd.lianjia.com/ershoufang/pg"
proxies = {'http': 'http://309435365:szayclhp@114.67.228.126:16819'}
headers = {"User-Agent": "Mozilla5.0/"}

# regular expression capturing two groups: the community name and the total price (万)
regex = r'<div class="houseInfo">.*?target="_blank">(.*?)</a>.*?<div class="totalPrice">.*?<span>(.*?)</span>'

def getText(BASE_URL, proxies, headers, page):
    url = BASE_URL + str(page)
    res = requests.get(url, proxies=proxies, headers=headers)
    res.encoding = 'utf-8'
    return res.text

def saveFile(page, regex=regex):
    html = getText(BASE_URL, proxies, headers, page)
    p = re.compile(regex, re.S)
    content_list = p.findall(html)
    for content_tuple in content_list:
        cell = content_tuple[0].strip()
        price = content_tuple[1].strip()
        with open('链家.txt', 'a') as f:
            f.write(cell + " " + price + "\n")

if __name__ == "__main__":
    # crawl pages 1-100 with a pool of 10 worker processes
    pool = mp.Pool(processes=10)
    pool.map(saveFile, [page for page in range(1, 101)])

Lianjia second-hand housing: write to a file

import requests
import re
import multiprocessing as mp
import pymysql
import warnings

BASE_URL = "https://cd.lianjia.com/ershoufang/pg"
proxies = {'http': 'http://309435365:szayclhp@114.67.228.126:16819'}
headers = {"User-Agent": "Mozilla5.0/"}

# regular expression capturing two groups: the community name and the total price (万)
regex = r'<div class="houseInfo">.*?target="_blank">(.*?)</a>.*?<div class="totalPrice">.*?<span>(.*?)</span>'

c_db = "create database if not exists spider;"
u_db = "use spider;"
c_tab = "create table if not exists lianjia(id int primary key auto_increment," \
        "name varchar(30)," \
        "price decimal(20,2))charset=utf8;"

db = pymysql.connect("localhost", "root", '123456', charset="utf8")
cursor = db.cursor()

# turn MySQL warnings into exceptions so the 'already exists' notices can be swallowed
warnings.filterwarnings("error")
try:
    cursor.execute(c_db)
except Warning:
    pass
cursor.execute(u_db)
try:
    cursor.execute(c_tab)
except Warning:
    pass

def getText(BASE_URL, proxies, headers, page):
    url = BASE_URL + str(page)
    res = requests.get(url, proxies=proxies, headers=headers)
    res.encoding = 'utf-8'
    return res.text

def writeToMySQL(page, regex=regex):
    html = getText(BASE_URL, proxies, headers, page)
    p = re.compile(regex, re.S)
    content_list = p.findall(html)
    for content_tuple in content_list:
        cell = content_tuple[0].strip()
        price = float(content_tuple[1].strip()) * 10000
        s_insert = "insert into lianjia(name,price) values('%s','%s');" % (cell, price)
        cursor.execute(s_insert)
        db.commit()

if __name__ == "__main__":
    pool = mp.Pool(processes=20)
    pool.map(writeToMySQL, [page for page in range(1, 101)])

Lianjia: write to MySQL

import requests
import re
import multiprocessing as mp
import pymongo

BASE_URL = "https://cd.lianjia.com/ershoufang/pg"
proxies = {'http': 'http://309435365:szayclhp@114.67.228.126:16819'}
headers = {"User-Agent": "Mozilla5.0/"}

# regular expression capturing two groups: the community name and the total price (万)
regex = r'<div class="houseInfo">.*?target="_blank">(.*?)</a>.*?<div class="totalPrice">.*?<span>(.*?)</span>'

# connect to MongoDB and get the database and collection objects
conn = pymongo.MongoClient('localhost', 27017)
db = conn.spider
myset = db.lianjia

def getText(BASE_URL, proxies, headers, page):
    url = BASE_URL + str(page)
    res = requests.get(url, proxies=proxies, headers=headers)
    res.encoding = 'utf-8'
    return res.text

def writeToMongoDB(page, regex=regex):
    html = getText(BASE_URL, proxies, headers, page)
    p = re.compile(regex, re.S)
    content_list = p.findall(html)
    for content_tuple in content_list:
        cell = content_tuple[0].strip()
        price = float(content_tuple[1].strip()) * 10000
        d = {"houseName": cell, "housePrice": price}
        # insert one document into the collection
        myset.insert(d)

if __name__ == "__main__":
    pool = mp.Pool(processes=20)
    pool.map(writeToMongoDB, [page for page in range(1, 101)])

Lianjia: write to MongoDB

4. Web client authentication (some sites require a login before access): auth
   1. auth = ("username", "password"), a tuple

import requests
import re

# capture the text of each link listed on the index page
regex = r'<a href=".*?">(.*?)</a>'

class NoteSpider:
    def __init__(self):
        self.headers = {"User-Agent": "Mozilla5.0/"}
        # the auth parameter is a tuple
        self.auth = ("tarenacode", "code_2013")
        self.url = "http://code.tarena.com.cn/"

    def getParsePage(self):
        res = requests.get(self.url, auth=self.auth, headers=self.headers)
        res.encoding = "utf-8"
        html = res.text
        p = re.compile(regex, re.S)
        r_list = p.findall(html)
        # hand the result to writePage()
        self.writePage(r_list)

    def writePage(self, r_list):
        print("Writing results ...")
        for r_str in r_list:
            with open('笔记.txt', 'a') as f:
                f.write(r_str + "\n")
        print("Done")

if __name__ == "__main__":
    obj = NoteSpider()
    obj.getParsePage()

5. SSL certificate verification: verify
   1. verify=True: the default; the SSL certificate is verified
   2. verify=False: skip certificate verification

import requests

url = "http://www.12306.cn/mormhweb/"
headers = {"User-Agent": "Mozilla5.0/"}

res = requests.get(url, verify=False, headers=headers)
res.encoding = "utf-8"
print(res.text)

4. Handler processors (urllib.request, for reference)
   1. Definition: a way to build a customized urlopen(); urlopen itself is just a pre-built opener
   2. Common methods
      1. build_opener(handler object)
      2. opener.open(url), equivalent to calling urlopen
   3. Workflow
      1. Create the handler object: http_handler = urllib.request.HTTPHandler()
      2. Create the custom opener: opener = urllib.request.build_opener(http_handler)
      3. Send the request with the opener's open() method
   4. Kinds of handlers
      1. HTTPHandler()

import urllib.request

url = "http://www.baidu.com/"

# 1. create the HTTPHandler object
http_handler = urllib.request.HTTPHandler()
# 2. create the custom opener
opener = urllib.request.build_opener(http_handler)
# 3. send the request with the opener's open() method
req = urllib.request.Request(url)
res = opener.open(req)
print(res.read().decode("utf-8"))

   2. ProxyHandler (proxy IP): plain proxy

import urllib.request

url = "http://www.baidu.com/"

# 1. create the handler
proxy_handler = urllib.request.ProxyHandler({"http": "123.161.237.114:45327"})
# 2. create the custom opener
opener = urllib.request.build_opener(proxy_handler)
# 3. send the request with the opener's open() method
req = urllib.request.Request(url)
res = opener.open(req)
print(res.read().decode("utf-8"))

   3. ProxyBasicAuthHandler (password manager object): private proxy
      1. Password manager workflow
         1. Create the password manager object:
            pwd = urllib.request.HTTPPasswordMgrWithDefaultRealm()
         2. Add the private proxy's username, password, IP and port:
            pwd.add_password(None, "IP:port", "username", "password")
      2. Pass it to urllib.request.ProxyBasicAuthHandler(password manager object)
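The notes stop short of wiring these pieces together; here is a minimal sketch, where the proxy address and credentials are the placeholder values used elsewhere in these notes:

import urllib.request

proxy_addr = "114.67.228.126:16819"   # placeholder proxy from the earlier examples
pwd = urllib.request.HTTPPasswordMgrWithDefaultRealm()
pwd.add_password(None, proxy_addr, "309435365", "szayclhp")

# the auth handler supplies the credentials, the proxy handler routes traffic through the proxy
proxy_handler = urllib.request.ProxyHandler({"http": proxy_addr})
auth_handler = urllib.request.ProxyBasicAuthHandler(pwd)
opener = urllib.request.build_opener(proxy_handler, auth_handler)

res = opener.open("http://www.baidu.com/")
print(res.getcode())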

1. CSV module workflow
   1. Open the CSV file:
      with open('test.csv', 'a', newline='', encoding='utf-8') as f:
   2. Create the writer object with writer():
      writer = csv.writer(f)
   3. Write rows with writerow():
      writer.writerow(["霸王别姬", "张国荣,张丰毅,巩俐", "1993-01-01"])
   4. Example:

import csv

# open the csv file; without newline='' every written row is followed by a blank line
with open("test.csv", 'a', newline='') as f:
    # create the writer object
    writer = csv.writer(f)
    # write the rows
    writer.writerow(['id', 'name', 'age'])
    writer.writerow([1, 'Lucy', 20])
    writer.writerow([2, 'Tom', 25])

import csv

with open("猫眼/第一页.csv", 'w', newline="") as f:
    writer = csv.writer(f)
    writer.writerow(['电影名', '主演', '上映时间'])
    '''
    Reading the source file as utf-8 gives ['\ufeff霸王别姬', '张国荣,张丰毅,巩俐', '1993-01-01'];
    reading it as utf-8-sig gives ['霸王别姬', '张国荣,张丰毅,巩俐', '1993-01-01'].
    The difference: UTF-8 uses bytes as its code unit and has the same byte order on every
    system, so it does not need a BOM ("Byte Order Mark"); UTF-8 with BOM, i.e. utf-8-sig,
    carries the BOM, and decoding with utf-8-sig strips it.
    '''
    with open("猫眼/第1页.txt", 'r', encoding="utf-8-sig") as file:
        while True:
            data_list = file.readline().strip().split("|")
            print(data_list)
            if data_list[0] == '':
                break
            writer.writerow(data_list)

2. XPath (parsing HTML)
   1. XPath: a language for finding information in XML documents; it works just as well for querying HTML
   2. XPath helper tools
      1. Chrome plugin: XPath Helper (toggle with Ctrl + Shift + X)
      2. Firefox plugin: XPath Checker
      3. XPath expression editor: XML Quire
   3. XPath matching rules, using this sample document:

<bookstore>
  <book>
    <title lang="en">Harry Potter</title>
    <author>J K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
  </book>
  <book>
    <title lang="en">Python</title>
    <author>Joe</author>
    <year>2018</year>
    <price>49.99</price>
  </book>
</bookstore>

1. Matching demo
   1. All child nodes of bookstore: /bookstore
   2. All book nodes: //book
   3. All title nodes under book whose lang attribute is 'en': //book/title[@lang='en']
2. Selecting nodes
   / : select from the root; /bookstore selects the children of the node before the '/'
   // : search the whole document; //price selects all descendants named price
   @ : select an attribute, e.g. //title[@lang="en"]
3. Using @
   1. Select one node: //title[@lang='en']
   2. Select N nodes: //title[@lang]
   3. Select an attribute value: //title/@lang
4. Matching multiple paths
   1. Operator: |
   2. Example: all title and price nodes under book: //book/title | //book/price
5. Functions
   contains(): nodes whose attribute value contains a substring, e.g. //title[contains(@lang,'e')]
   text(): get the text content
   last(): get the last element, e.g. //ul[@class='pagination']/li[last()]
   not(): negation, e.g. //*[@id="content"]/div[2]//p[not(@class='otitle')]
6. You can keep calling xpath() on an element returned by a previous query to search below it, as in the sketch that follows.
   Syntax: element.xpath("./div/span")
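A small runnable sketch of these rules with lxml, run against the bookstore sample above (the variable names are just for illustration):

from lxml import etree

bookstore = """
<bookstore>
  <book><title lang="en">Harry Potter</title><author>J K. Rowling</author>
        <year>2005</year><price>29.99</price></book>
  <book><title lang="en">Python</title><author>Joe</author>
        <year>2018</year><price>49.99</price></book>
</bookstore>"""

root = etree.HTML(bookstore)

# attribute values: ['en', 'en']
print(root.xpath("//title/@lang"))
# multiple paths with |: the title and price elements of every book
print(root.xpath("//book/title | //book/price"))
# last(): the last book's title text
print(root.xpath("//book[last()]/title/text()"))   # ['Python']
# keep searching below an element object
first_book = root.xpath("//book")[0]
print(first_book.xpath("./title/text()"))          # ['Harry Potter']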

\"\"\"

糗事百科https://www.qiushibaike.com/8hr/page/1/匹配内容

1、⽤户昵称,div/div/a/h2.text 2、内容,div/a/div/span.text 3、点赞数,div/div/span/i.text 4、评论数,div/div/span/a/i.text\"\"\"

import requests

from lxml import etree

url = \"https://www.qiushibaike.com/8hr/page/1/\"headers = {'User-Agent':\"Mozilla5.0/\"}res = requests.get(url,headers=headers)res.encoding = \"utf-8\"html = res.text

#先获取所有段⼦的div列表parseHtml = etree.HTML(html)

div_list = parseHtml.xpath(\"//div[contains(@id,'qiushi_tag_')]\")print(len(div_list))#遍历列表

for div in div_list: #获取⽤户昵称

username = div.xpath('./div/a/h2')[0].text print(username) #获取内容

content = div.xpath('.//div[@class=\"content\"]/span')[0].text print(content) #获取点赞

laughNum = div.xpath('./div/span/i')[0].text print(laughNum) #获取评论数

pingNum = div.xpath('./div/span/a/i')[0].text print(pingNum)

3. Parsing HTML source
   1. The lxml library: an HTML/XML parser
      1. Installation: conda install lxml  or  pip install lxml
      2. Workflow
         1. Build a parse object with lxml's etree module
         2. Call xpath() on the parse object to locate node information
      3. Usage
         1. Import the module: from lxml import etree
         2. Create the parse object: parseHtml = etree.HTML(html)
         3. Call xpath() to parse: r_list = parseHtml.xpath("//title[@lang='en']")
            Note: xpath() always returns a list

from lxml import etree

# sample nav markup consistent with the queries below
html = """
<div class="head">
  <a id="channel" href="/society/">新浪社会</a>
  <ul id="nav">
    <li><a href="/china/">国内</a></li>
    <li><a href="/world/">国际</a></li>
  </ul>
</div>
"""

# 1. create the parse object
parseHtml = etree.HTML(html)

# 2. call xpath() on the parse object
# href values of all a tags
s1 = "//a/@href"
# href of the single a tag with id='channel'
s2 = "//a[@id='channel']/@href"
# href values of the a tags inside li tags
s3 = "//li/a/@href"
s3 = "//ul[@id='nav']/li/a/@href"   # more precise

# all a elements: first get the element objects, then read each object's .text
s4 = "//a"
# the 新浪社会 link element
s5 = "//a[@id='channel']"
# the 国内, 国际, ... link elements
s6 = "//ul[@id='nav']//a"

r_list = parseHtml.xpath(s6)
print(r_list)
for i in r_list:
    print(i.text)

4. Case study: downloading the images in Baidu Tieba posts
   1. Goal: grab the images inside the posts of a tieba
   2. Approach
      1. Get the tieba index URL (e.g. 河南大学) and work out the next-page URL pattern
      2. Collect the URL of every post in the tieba
      3. Request each post and collect the URLs of all images in it
      4. Request each image URL and write the bytes to a local file ('wb')

"""
Steps
1. Tieba index URLs
   http://tieba.baidu.com/f?kw=河南大学&pn=0
   http://tieba.baidu.com/f?kw=河南大学&pn=50
2. URL of each post: //div[@class='t_con cleafix']/div/div/div/a/@href
   e.g. https://tieba.baidu.com/p/5878699216
3. Inside each post, image URLs: //img[@class='BDE_Image']/@src
   e.g. http://imgsrc.baidu.com/forum/w%3D580/sign=da37aaca6fd9f2d3201124e799ed8a53/27985266d01609240adb3730d90735fae7cd3480.jpg
4. Save locally
"""

import requests
from lxml import etree

class TiebaPicture:
    def __init__(self):
        self.baseurl = "http://tieba.baidu.com"
        self.pageurl = "http://tieba.baidu.com/f"
        self.headers = {'User-Agent': "Mozilla5.0/"}

    def getPageUrl(self, url, params):
        '''Collect the URL of every post on one index page.'''
        res = requests.get(url, params=params, headers=self.headers)
        res.encoding = 'utf-8'
        html = res.text
        # extract each post's URL from the index page
        parseHtml = etree.HTML(html)
        t_list = parseHtml.xpath("//div[@class='t_con cleafix']/div/div/div/a/@href")
        print(t_list)
        for t in t_list:
            t_url = self.baseurl + t
            self.getImgUrl(t_url)

    def getImgUrl(self, t_url):
        '''Collect the URL of every image in one post.'''
        res = requests.get(t_url, headers=self.headers)
        res.encoding = "utf-8"
        html = res.text
        parseHtml = etree.HTML(html)
        img_url_list = parseHtml.xpath("//img[@class='BDE_Image']/@src")
        for img_url in img_url_list:
            self.writeImg(img_url)

    def writeImg(self, img_url):
        '''Save one image to a local file.'''
        res = requests.get(img_url, headers=self.headers)
        html = res.content
        # use the last 10 characters of the image URL as the file name
        filename = img_url[-10:]
        with open(filename, 'wb') as f:
            print("Downloading %s" % filename)
            f.write(html)
        print("%s saved" % filename)

    def workOn(self):
        '''Main entry point.'''
        kw = input("Enter the tieba to crawl: ")
        begin = int(input("Enter the start page: "))
        end = int(input("Enter the end page: "))
        for page in range(begin, end + 1):
            pn = (page - 1) * 50
            # query parameters for one index page of the tieba
            params = {"kw": kw, "pn": pn}
            self.getPageUrl(self.pageurl, params=params)

if __name__ == "__main__":
    spider = TiebaPicture()
    spider.workOn()

Baidu Tieba image crawler

1. Scraping dynamic sites - Ajax
   1. Ajax dynamic loading
      1. Characteristic: content is loaded dynamically (e.g. as you scroll)
      2. Packet capture: the query parameters show up under WebForms -> QueryString
   2. Case study: the Douban movie top-100 chart

import requests
import json
import csv

url = "https://movie.douban.com/j/chart/top_list"
headers = {'User-Agent': "Mozilla5.0/"}
params = {"type": "11",
          "interval_id": "100:90",
          "action": "",
          "start": "0",
          "limit": "100"}

res = requests.get(url, params=params, headers=headers)
res.encoding = "utf-8"

# the response is a JSON array
html = res.text
# convert the JSON array into a Python list
ls = json.loads(html)

with open("豆瓣100.csv", 'a', newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["name", "score"])
    for dic in ls:
        name = dic['title']
        score = dic['rating'][1]
        writer.writerow([name, score])

2. The json module
   1. Purpose: convert between JSON and Python data types
   2. Common methods
      1. json.loads(): JSON --> Python (JSON object -> dict, JSON array -> list)
      2. json.dumps(): Python --> JSON-formatted string
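A short round-trip sketch of the two calls; ensure_ascii=False keeps Chinese characters readable, as the Scrapy pipeline example later in these notes also does:

import json

movie = {"name": "霸王别姬", "score": "9.6"}

s = json.dumps(movie, ensure_ascii=False)   # dict -> JSON string
print(s)                                    # {"name": "霸王别姬", "score": "9.6"}

d = json.loads(s)                           # JSON string -> dict
print(d["score"])                           # 9.6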

3. selenium + PhantomJS: a powerful crawler combination
   1. selenium
      1. Definition: a web automation testing tool
      2. Characteristics
         1. Drives a browser according to instructions, letting the browser load pages automatically
         2. It is only a driver: it has no browser of its own and must be paired with a third-party browser
      3. Installation: conda install selenium  or  pip install selenium
   2. PhantomJS
      1. Windows
         1. Definition: a headless (no-GUI) browser
         2. Characteristics
            1. Loads pages entirely in memory
            2. Runs efficiently
         3. Installation: copy the executable into the Scripts... directory of your Python installation
      2. Ubuntu
         1. Download the PhantomJS package and unpack it somewhere
         2. In your home directory: vi .bashrc
            export PHANTOM_JS=/home/.../phantomjs-...
            export PATH=$PHANTOM_JS/bin:$PATH
         3. source .bashrc
         4. Test in a terminal: phantomjs
   3. Example code

# import the webdriver from the selenium package
from selenium import webdriver

# create the PhantomJS driver object
driver = webdriver.PhantomJS()
# visit Baidu
driver.get("http://www.baidu.com/")
# take a screenshot of the page
driver.save_screenshot("百度.png")

4. Common methods
   1. driver.get(url)
   2. driver.page_source.find("text")
      Searches the HTML source for a string; returns an index (not -1) on success and -1 on failure.

from selenium import webdriver

driver = webdriver.PhantomJS()
driver.get("http://www.baidu.com/")
r1 = driver.page_source.find("kw")
r2 = driver.page_source.find("aaaa")
print(r1, r2)   # e.g. 1053 -1

   3. driver.find_element_by_id("id").text
   4. driver.find_element_by_name("value")
   5. driver.find_element_by_class_name("value")
   6. element.send_keys("text")
   7. element.click()
   8. driver.quit()
5. Case study: logging in to douban.com (see the sketch below)
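The notes name the Douban login case but give no code; here is a minimal sketch using only the methods listed above. The element ids, class name and account values are placeholders, and Douban's real login page may differ:

from selenium import webdriver
import time

driver = webdriver.PhantomJS()
driver.get("https://www.douban.com/accounts/login")

# fill in the form fields and submit (ids and class name are placeholders)
driver.find_element_by_id("email").send_keys("your_account")
driver.find_element_by_id("password").send_keys("your_password")
driver.find_element_by_class_name("btn-submit").click()

time.sleep(2)                             # give the page time to load
print(driver.page_source.find("退出"))    # not -1 once logged in
driver.quit()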

4. BeautifulSoup
   1. Definition: an HTML/XML parser; it can use lxml underneath
   2. Installation and import
      Install: pip install beautifulsoup4  or  conda install beautifulsoup4
      Import: from bs4 import BeautifulSoup as bs
   3. Example: see the Douyu scraper below
   4. Parsers BeautifulSoup supports
      1. lxml HTML parser, 'lxml': fast and tolerant of broken documents
      2. Python standard library, 'html.parser': average speed
      3. lxml XML parser, 'xml': fast

from selenium import webdriver
from bs4 import BeautifulSoup as bs
import time

driver = webdriver.PhantomJS()
driver.get("https://www.douyu.com/directory/all")

while True:
    html = driver.page_source
    # create the parse object
    soup = bs(html, 'lxml')

    # query elements directly: all streamer names and viewer counts on the page
    names = soup.find_all("span", {"class": "dy-name ellipsis fl"})
    numbers = soup.find_all("span", {"class": "dy-num fr"})
    # names and numbers hold element objects; use get_text() to read their text
    for name, number in zip(names, numbers):
        print("viewers:", number.get_text(), "streamer:", name.get_text())

    # keep clicking "next page" until the button is disabled
    if html.find("shark-pager-disable-next") == -1:
        driver.find_element_by_class_name("shark-pager-next").click()
        time.sleep(4)
    else:
        break

Recognizing captchas with pytesseract
  1. Installation: sudo pip3 install pytesseract
  2. Usage
     1. Open the captcha image: Image.open('path to the captcha image')
     2. Recognize it with pytesseract's image_to_string() method

from PIL import Image
from pytesseract import *

# 1. load the image
image = Image.open('t1.png')
# 2. run the recognition
text = image_to_string(image)
print(text)

Generating captchas with the captcha module
  1. Installation: sudo pip3 install captcha

import random
from PIL import Image
import numpy as np
from captcha.image import ImageCaptcha

digit = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
alphabet = [chr(i) for i in range(97, 123)] + [chr(i) for i in range(65, 91)]
char_set = digit + alphabet

def random_captcha_text(char_set=char_set, captcha_size=4):
    '''Return a random list of four characters by default.'''
    captcha_text = []
    for i in range(captcha_size):
        ele = random.choice(char_set)
        captcha_text.append(ele)
    return captcha_text

def gen_captcha_text_and_image():
    '''Generate a four-character captcha image and return its text.'''
    image = ImageCaptcha()
    captcha_text = random_captcha_text()
    # join the list into a string
    captcha_text = ''.join(captcha_text)
    captchaInfo = image.generate(captcha_text)
    # build the captcha image and save it
    captcha_image = Image.open(captchaInfo)
    captcha_image = np.array(captcha_image)
    im = Image.fromarray(captcha_image)
    im.save('captcha.png')
    return captcha_text

if __name__ == '__main__':
    gen_captcha_text_and_image()

Deduplication
  1. Deduplication uses two queues (lists)
     1. One queue holds the URLs that have already been crawled; before storing a URL, check whether it is already in this queue
     2. The other queue holds the URLs waiting to be crawled; a URL goes in only if it is not in the crawled queue
  Crawling Douban with deduplication and breadth-first traversal

import re
from bs4 import BeautifulSoup
import basicspider
import hashlibHelper

def get_html(url):
    """Fetch the source of one page."""
    headers = [("User-Agent", "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36")]
    html = basicspider.downloadHtml(url, headers=headers)
    return html

def get_movie_all(html):
    """Get the list elements of all movies on the current page."""
    soup = BeautifulSoup(html, "html.parser")
    movie_list = soup.find_all('div', class_='bd doulist-subject')
    # print(movie_list)
    return movie_list

def get_movie_one(movie):
    """Collect the details of one movie into a single string."""
    result = ""
    soup = BeautifulSoup(str(movie), "html.parser")
    title = soup.find_all('div', class_="title")
    soup_title = BeautifulSoup(str(title[0]), "html.parser")
    for line in soup_title.stripped_strings:
        result += line
    try:
        score = soup.find_all('span', class_='rating_nums')
        score_ = BeautifulSoup(str(score[0]), "html.parser")
        for line in score_.stripped_strings:
            result += "|| 评分:"
            result += line
    except Exception:
        result += "|| 评分:5.0"

    abstract = soup.find_all('div', class_='abstract')
    abstract_info = BeautifulSoup(str(abstract[0]), "html.parser")
    for line in abstract_info.stripped_strings:
        result += "|| "
        result += line
    result += '\n'
    print(result)
    return result

def save_file(movieInfo):
    """Append one movie record to the output file."""
    with open("doubanMovie.txt", "ab") as f:
        # lock.acquire()
        f.write(movieInfo.encode("utf-8"))
        # lock.release()

crawl_queue = []     # queue of URLs waiting to be crawled
crawled_queue = []   # queue of URLs already crawled (stored as hashes)

def crawlMovieInfo(url):
    '''Crawl one page, e.g. https://www.douban.com/doulist/3516235/'''
    global crawl_queue
    global crawled_queue
    html = get_html(url)

    # collect the pagination URLs of the doulist
    regex = r'https://www\.douban\.com/doulist/3516235/\?start=\d+&sort=seq&playable=0&sub_type='
    p = re.compile(regex, re.S)
    itemUrls = p.findall(html)

    # two-step deduplication
    for item in itemUrls:
        # hash the URL and check it against the crawled queue
        hash_item = hashlibHelper.hashStr(item)
        if hash_item not in crawled_queue:      # dedup against the crawled queue
            crawl_queue.append(item)
    crawl_queue = list(set(crawl_queue))        # dedup the waiting queue itself

    # process the current page
    movie_list = get_movie_all(html)
    for movie in movie_list:
        save_file(get_movie_one(movie))

    # hash this URL and record it in the crawled queue
    hash_url = hashlibHelper.hashStr(url)
    crawled_queue.append(hash_url)

if __name__ == "__main__":
    # breadth-first traversal starting from the seed URL
    seed_url = 'https://www.douban.com/doulist/3516235/?start=0&sort=seq&playable=0&sub_type='
    crawl_queue.append(seed_url)
    while crawl_queue:
        url = crawl_queue.pop(0)
        crawlMovieInfo(url)
    print(crawled_queue)
    print(len(crawled_queue))

import hashlib

def hashStr(strInfo):
    '''Hash a string.'''
    hashObj = hashlib.sha256()
    hashObj.update(strInfo.encode('utf-8'))
    return hashObj.hexdigest()

def hashFile(fileName):
    '''Hash a file.'''
    hashObj = hashlib.md5()
    with open(fileName, 'rb') as f:
        while True:
            # read in chunks; reading everything at once may not fit in memory for large files
            data = f.read(2048)
            if not data:
                break
            hashObj.update(data)
    return hashObj.hexdigest()

if __name__ == "__main__":
    print(hashStr("hello"))
    print(hashFile('猫眼电影.txt'))

hashlibHelper.py

from urllib import request
from urllib import parse
from urllib import error
import random
import time

def downloadHtml(url, headers=[()], proxy={}, timeout=None, decodeInfo='utf-8',
                 num_tries=10, useProxyRatio=11):
    '''
    A download helper that:
    - supports User-Agent and other request headers
    - supports proxies
    - considers timeouts
    - handles non-UTF-8 encodings, 5XX server errors and 4XX client errors
    - adds a delay between requests
    '''
    time.sleep(random.randint(1, 2))   # throttle the crawl, do not hit the site too fast

    # useProxyRatio controls how often the proxy is actually used
    if random.randint(1, 10) > useProxyRatio:
        proxy = None

    # create the ProxyHandler, build the opener, set the headers and install the opener
    proxy_support = request.ProxyHandler(proxy)
    opener = request.build_opener(proxy_support)
    opener.addheaders = headers
    request.install_opener(opener)

    html = None
    try:
        # several things can go wrong here: decoding errors,
        # client errors (404, 403) and server errors (5XX)
        res = request.urlopen(url)
        html = res.read().decode(decodeInfo)
    except UnicodeDecodeError:
        print("UnicodeDecodeError")
    except (error.URLError, error.HTTPError) as e:
        # client error 404/403 (possibly blocked by anti-crawling measures)
        if hasattr(e, 'code') and 400 <= e.code < 500:
            print("Client Error " + str(e.code))
        elif hasattr(e, 'code') and 500 <= e.code < 600:
            # server error: wait a moment and retry
            if num_tries > 0:
                time.sleep(random.randint(1, 3))
                html = downloadHtml(url, headers, proxy, timeout, decodeInfo, num_tries - 1)
    return html

if __name__ == "__main__":
    url = "http://maoyan.com/board/4?offset=0"
    headers = [("User-Agent", "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50")]
    print(downloadHtml(url, headers=headers))

basicspider.py

The Scrapy framework

Run scrapy with no arguments in a terminal to list the available commands:
  bench         Run quick benchmark test
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

Workflow:
1. Create a project: scrapy startproject <project name>
   scrapy startproject tencentSpider
2. Enter the project and create a spider:
   cd tencentSpider
   scrapy genspider tencent hr.tencent.com
   # tencent is the spider name; hr.tencent.com is the entry domain, and the data to crawl must live under it
3. Adjust the program logic
   1. settings.py
      1. Set the User-Agent
      2. Turn off the robots protocol
      3. Turn off cookies
      4. Enable the item pipelines

# -*- coding: utf-8 -*-

# Scrapy settings for tencentSpider project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'tencentSpider'

SPIDER_MODULES = ['tencentSpider.spiders']
NEWSPIDER_MODULE = 'tencentSpider.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'tencentSpider (+http://www.yourdomain.com)'
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:62.0) Gecko/20100101 Firefox/62.0'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False   # do not obey the robots protocol

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'tencentSpider.middlewares.TencentspiderSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'tencentSpider.middlewares.TencentspiderDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'tencentSpider.pipelines.TencentspiderPipeline': 300,   # the value is the priority
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

settings.py

2. items.py: the ORM-like item definition

import scrapy

class TencentspiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # fields to scrape: job title, link, job category
    positionName = scrapy.Field()
    positionLink = scrapy.Field()
    positionType = scrapy.Field()

3. pipelines.py: the logic that saves the data

import json

class TencentspiderPipeline(object):
    def process_item(self, item, spider):
        with open('tencent.json', 'ab') as f:
            text = json.dumps(dict(item), ensure_ascii=False) + '\n'
            f.write(text.encode('utf-8'))
        return item

4. spiders/tencent.py: the main spider logic

import scrapy
from tencentSpider.items import TencentspiderItem

class TencentSpider(scrapy.Spider):
    name = 'tencent'
    allowed_domains = ['hr.tencent.com']
    #start_urls = ['http://hr.tencent.com/']

    # the start URLs could also be generated up front:
    # start_urls = []
    # for i in range(0, 530, 10):
    #     url = "https://hr.tencent.com/position.php?keywords=python&start=" + str(i) + "#a"
    #     start_urls.append(url)

    url = "https://hr.tencent.com/position.php?keywords=python&start="
    offset = 0
    start_urls = [url + str(offset) + "#a"]

    def parse(self, response):
        for each in response.xpath('//tr[@class="even"]|//tr[@class="odd"]'):
            item = TencentspiderItem()   # the item behaves like an empty dict
            item['positionName'] = each.xpath('./td[1]/a/text()').extract()[0]
            item['positionLink'] = "https://hr.tencent.com/" + each.xpath('./td[1]/a/@href').extract()[0]
            item['positionType'] = each.xpath('./td[2]/text()').extract()[0]
            yield item

        # build the next-page URL
        if self.offset < 530:
            self.offset += 10
            nextPageUrl = self.url + str(self.offset) + "#a"
        else:
            return

        # request the next page, parsing it with the same callback
        yield scrapy.Request(nextPageUrl, callback=self.parse)

4. Run the spider
   scrapy crawl tencent
5. Run the spider and save the output to a file
   scrapy crawl tencent -o <filename>

How to set a proxy server in Scrapy

1. Set the proxy in the process_request() method of the downloader-middleware class in middlewares.py (a sketch follows below)
2. Put the proxy pool in settings.py as a list, e.g. proxyList = [.....]
3. In process_request(), pick a proxy at random with random.choice(proxyList)
Notes:
1. If the proxy is private (has a username and password), the credentials need a simple Base64 encoding step
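A minimal sketch of such a middleware, assuming a proxyList of 'http://user:pass@ip:port' style entries defined in settings.py; the class and setting names here are illustrative, not part of the original notes:

# middlewares.py (sketch)
import base64
import random

from scrapy.utils.project import get_project_settings

class RandomProxyMiddleware(object):
    def __init__(self):
        self.proxy_list = get_project_settings().get('proxyList', [])

    def process_request(self, request, spider):
        if not self.proxy_list:
            return
        proxy = random.choice(self.proxy_list)   # e.g. 'http://user:pass@1.2.3.4:16819'
        if '@' in proxy:
            # split off the credentials and send them as a Proxy-Authorization header
            scheme, rest = proxy.split('://', 1)
            auth, hostport = rest.rsplit('@', 1)
            token = base64.b64encode(auth.encode('utf-8')).decode('utf-8')
            request.headers['Proxy-Authorization'] = 'Basic ' + token
            request.meta['proxy'] = scheme + '://' + hostport
        else:
            request.meta['proxy'] = proxy

The middleware would still need to be enabled under DOWNLOADER_MIDDLEWARES in settings.py.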

2. scrapy genspider tencent hr.tencent.com generates a basic spider; to generate the more advanced CrawlSpider, use:
   scrapy genspider -t crawl tencent2 hr.tencent.com
   CrawlSpider extracts URLs more flexibly; it is worth learning how its URL rules (Rule) and LinkExtractor work.

Building a distributed crawler with Scrapy-Redis
Redis is an in-memory database (with interfaces for persisting data to an on-disk database).
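The original notes break off here. As a hedged sketch, a typical scrapy-redis setup replaces the scheduler and dupefilter in settings.py so that several spider processes share one Redis queue; the setting names below come from the scrapy-redis project, and the Redis address is a placeholder:

# settings.py additions for a scrapy-redis based distributed crawl (sketch)
SCHEDULER = "scrapy_redis.scheduler.Scheduler"              # requests are queued in Redis
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"  # dedup fingerprints live in Redis too
SCHEDULER_PERSIST = True                                    # keep the queue between runs
REDIS_URL = "redis://127.0.0.1:6379"                        # placeholder Redis address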
