Overview of Crawler Technology

Jiawen Zhang Mar 22, 2019

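This post is a rough record of the overall ideas behind web crawlers and some commonly used libraries, kept for reference in later projects. Most of the content is compiled from Cui Qingcai's crawler tutorial.

## Basic crawler workflow

1. Send a request: use an HTTP library to send a request to the target site. The request can carry extra information such as headers; then wait for the server to respond.
2. Get the response: if the server responds normally, you receive a response whose body is the page content you want. It may be HTML, a JSON string, or binary data such as images and video.
3. Parse the content: HTML can be parsed with regular expressions or an HTML parsing library; JSON can be converted directly into a JSON object; binary data can be saved or processed further.
4. Save the data: the data can be stored in many forms, such as plain text, a database, or a file in a specific format.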
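The four steps can be sketched with the standard library alone. This is a minimal illustration, not a complete crawler: the function names (`fetch`, `parse_title`, `save`, `crawl`), the `demo-crawler/0.1` User-Agent string, and the choice to extract only the page title are assumptions made for the example.

```python
import json
import re
import urllib.request


def fetch(url, timeout=10):
    """Steps 1 and 2: send the request and return the decoded response body."""
    # The User-Agent value is a made-up placeholder for this sketch.
    request = urllib.request.Request(url, headers={"User-Agent": "demo-crawler/0.1"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


def parse_title(html):
    """Step 3: extract the <title> text with a regular expression."""
    match = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None


def save(record, path):
    """Step 4: store the extracted data as a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(record, f, ensure_ascii=False)


def crawl(url, path):
    """Run the whole pipeline for a single page."""
    html = fetch(url)
    save({"url": url, "title": parse_title(html)}, path)
```

Calling `crawl('http://python.org', 'page.json')` would run all four steps against a live site; the parsing and saving functions can also be tried on a local HTML string without any network access.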

Request and Response

Request

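Request method: the main types are GET and POST; others include HEAD, PUT, DELETE, and OPTIONS. The difference between GET and POST: a GET request is what the browser sends when you type a URL and press Enter, and its parameters are part of the URL; a POST request is typically sent by submitting a form, and its parameters travel in the request body rather than in the URL.
Request URL: URL stands for Uniform Resource Locator. A web page, an image, or a video can each be uniquely identified by a URL.
> While rendering a page, the browser sends a new request for each image URL it finds and then renders the image.
Request headers: header information sent with the request, such as User-Agent, Host, and Cookies.
The server uses this information to identify the client's configuration.
Request body: extra data carried by the request, such as the form data of a form submission.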
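To make the GET/POST difference concrete, the sketch below builds both kinds of request with `urllib.request.Request` and inspects them without sending anything. The httpbin.org URLs and the `word=hello` parameter are sample values.

```python
from urllib.parse import urlencode
from urllib.request import Request

params = {"word": "hello"}

# GET: the parameters are appended to the URL as a query string.
get_request = Request("http://httpbin.org/get?" + urlencode(params))

# POST: the same parameters are encoded to bytes and carried in the request body.
post_request = Request("http://httpbin.org/post", data=urlencode(params).encode("utf-8"))

for req in (get_request, post_request):
    # get_method() reports POST as soon as a body is attached, GET otherwise.
    print(req.get_method(), req.full_url, req.data)
```

The GET request shows the parameters in `full_url` and has no body, while the POST request keeps the URL clean and carries `b'word=hello'` in `data`.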

Response

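Response status: indicates the outcome of the request.
e.g. status codes:
200 OK (success)
301 Moved Permanently (redirect)
404 Not Found (the page does not exist)
500 Internal Server Error
Response headers: information such as content type, content length, server information, and Set-Cookie.
Response body: the most important part, containing the content of the requested resource, such as the page HTML or the binary data of an image.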
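The standard reason phrase for a status code can be looked up with `http.HTTPStatus` from the standard library. The helper name `describe_status` and the category labels are assumptions for this sketch; the category follows the usual convention that the first digit of the code gives its class.

```python
from http import HTTPStatus

# First digit of the status code -> class of response.
CATEGORIES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def describe_status(code):
    """Return the code, its standard reason phrase, and its class."""
    status = HTTPStatus(code)  # raises ValueError for unknown codes
    return f"{code} {status.phrase} ({CATEGORIES[code // 100]})"


for code in (200, 301, 404, 500):
    print(describe_status(code))
```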

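What data can be scraped

Web page text: HTML documents, JSON text, and so on.
Images: fetched as binary data and saved in an image format.
Video: also binary data; save it in a video format.
Other: anything that can be requested can be scraped.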

Parsing methods

Direct processing
JSON parsing
Regular expressions
BeautifulSoup
PyQuery
XPath
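Two of these methods need nothing beyond the standard library. The sketch below parses a JSON string with `json.loads` and pulls links out of an HTML fragment with a regular expression; both input strings are made-up sample data. BeautifulSoup, PyQuery, and XPath libraries are third-party packages and are not shown here.

```python
import json
import re

# JSON parsing: the response body is converted directly into Python objects.
json_text = '{"name": "germey", "age": 22}'
record = json.loads(json_text)
print(record["name"], record["age"])

# Regular expressions: capture the href and the link text of every <a> tag.
html = '<ul><li><a href="/a.html">First</a></li><li><a href="/b.html">Second</a></li></ul>'
links = re.findall(r'<a href="(.*?)">(.*?)</a>', html)
print(links)
```

Regular expressions work for small, regular fragments like this one; for real pages with nested or irregular markup, a dedicated parsing library is usually less fragile.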

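How to save the data

Text: plain text, JSON, XML, and so on.
Relational databases: MySQL, Oracle, SQL Server, and others that store data in structured tables.
Non-relational databases: MongoDB, Redis, and others that store data in key-value or document form.
Binary files: images, video, audio, and so on can be saved directly in their own formats.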
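As a rough illustration of the text and relational options, the sketch below writes the same records to a JSON file and to a database table. SQLite (the `sqlite3` module) is used here only as a stand-in for a relational database such as MySQL because it ships with Python; the table name `people`, the helper names, and the sample records are assumptions.

```python
import json
import os
import sqlite3
import tempfile

# Sample records standing in for scraped data.
records = [{"name": "germey", "age": 22}, {"name": "bob", "age": 30}]


def save_json(rows, path):
    """Save the records as a JSON text file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(rows, f, ensure_ascii=False, indent=2)


def save_sqlite(rows, path):
    """Save the records into a structured table."""
    conn = sqlite3.connect(path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("CREATE TABLE IF NOT EXISTS people (name TEXT, age INTEGER)")
            conn.executemany("INSERT INTO people (name, age) VALUES (:name, :age)", rows)
    finally:
        conn.close()


def load_sqlite(path):
    """Read the rows back, ordered by name."""
    conn = sqlite3.connect(path)
    try:
        return conn.execute("SELECT name, age FROM people ORDER BY name").fetchall()
    finally:
        conn.close()


with tempfile.TemporaryDirectory() as tmp:
    save_json(records, os.path.join(tmp, "people.json"))
    db_path = os.path.join(tmp, "people.db")
    save_sqlite(records, db_path)
    print(load_sqlite(db_path))
```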

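Basic usage of the urllib library

What is urllib

Python's built-in HTTP request library
urllib.request: request module
urllib.error: exception handling module
urllib.parse: URL parsing module (utility module)
urllib.robotparser: robots.txt parsing module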

Usage

Request

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

import urllib.request

response = urllib.request.urlopen('http://python.org')
print(response.read().decode('utf-8'))

# Send hello as POST data; without the data argument the request is sent as a GET
import urllib.request
import urllib.parse

data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf8')
response = urllib.request.urlopen('http://httpbin.org/post',data=data)
print(response.read())

# If no response arrives within the timeout, an exception is raised; http://httpbin.org/get returns some of the request parameters as JSON
import urllib.request

response = urllib.request.urlopen('http://httpbin.org/get',timeout=1)
print(response.read())

# Set the timeout to 0.1 seconds and catch the resulting exception
import urllib.request
import socket
import urllib.error

try:
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)
except urllib.error.URLError as e:
    if isinstance(e.reason, socket.timeout):  # check the reason for the error
        print('TIME OUT')

Response

Response type

import urllib.request

response = urllib.request.urlopen('http://www.python.org')
print(type(response))

Status code and response headers

These are important indicators of whether the response succeeded.

import urllib.request

response = urllib.request.urlopen('http://www.python.org')
print(response.status)
print(response.getheaders())
print(response.getheader('Server'))    # get one specific response header (Server)

import urllib.request

response = urllib.request.urlopen('http://www.python.org')
print(response.read().decode('utf-8'))    # read() returns the content of the response body

To send a more complex request

# Build a Request object so that extra headers can be attached
import urllib.request

request = urllib.request.Request('http://www.python.org')
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))
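The Request object becomes useful once headers, a body, or an explicit method are attached. The following is a sketch under assumed values: the `demo-crawler/0.1` User-Agent, the form field, and the `send` helper are illustrative, and nothing is sent until `send` is called.

```python
from urllib import parse, request

url = "http://httpbin.org/post"
headers = {
    "User-Agent": "demo-crawler/0.1",  # placeholder value for this sketch
    "Host": "httpbin.org",
}
form = parse.urlencode({"name": "germey"}).encode("utf-8")

# Headers, body, and method are all passed to the constructor...
req = request.Request(url, data=form, headers=headers, method="POST")
# ...and more headers can be added afterwards.
req.add_header("Accept", "application/json")

print(req.get_method())
print(req.header_items())


def send(req, timeout=10):
    """Send the prepared request and return the decoded body (needs network access)."""
    with request.urlopen(req, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Note that Request stores header names in capitalized form, so the User-Agent header is looked up as `req.get_header('User-agent')`.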

Handler

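Handlers act as auxiliary tools that help with other operations.

#### Setting a proxy

A proxy changes the IP address the request appears to come from, which helps avoid being blocked.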

import urllib.request

proxy_handler = urllib.request.ProxyHandler({
    'http':'http://127.0.0.1:9743',
    'https':'https://127.0.0.1:9743',
})
opener = urllib.request.build_opener(proxy_handler)
response = opener.open('http://www.baidu.com')
print(response.read())

#### Cookie

Cookies are used to maintain login state.

import http.cookiejar,urllib.request

cookie = http.cookiejar.CookieJar()  # create the cookie jar object
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
for item in cookie:  # iterate over the cookies in the jar
    print(item.name+"="+item.value)

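Cookies can be saved to a text file. As long as they have not expired, they can be read back from the file and attached to later requests to keep the login state.

# Save the cookies to a txt file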
import http.cookiejar,urllib.request

filename = "cookie.txt"
cookie = http.cookiejar.MozillaCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True,ignore_expires=True)

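There is another storage format, LWPCookieJar. A cookie file must be loaded with the same class that saved it.

# Read the cookies back from the text file and attach them to the request, so the response is the page content that is only visible after logging in.
# cookie.txt must have been saved with LWPCookieJar; use MozillaCookieJar here if the file was written by MozillaCookieJar.
import http.cookiejar, urllib.request

filename = "cookie.txt"
cookie = http.cookiejar.LWPCookieJar()
cookie.load(filename, ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
print(response.read().decode('utf-8'))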
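The save and load steps can be tried without a network connection. The sketch below builds a cookie by hand, writes it with LWPCookieJar, reads it back, and then tries to read the same file with MozillaCookieJar to show that the two formats are not interchangeable. The `make_cookie` helper, the `example.com` domain, and the cookie values are assumptions; in a real crawler the cookies come from server responses rather than from a manual constructor call.

```python
import http.cookiejar
import os
import tempfile


def make_cookie(name, value, domain="example.com"):
    """Build a session cookie by hand so the example needs no network access."""
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=False, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cookie.txt")

    # Save in LWP format. ignore_discard=True keeps session cookies in the file.
    saved = http.cookiejar.LWPCookieJar(path)
    saved.set_cookie(make_cookie("session", "abc123"))
    saved.save(ignore_discard=True, ignore_expires=True)

    # Load with the same class that wrote the file.
    loaded = http.cookiejar.LWPCookieJar()
    loaded.load(path, ignore_discard=True, ignore_expires=True)
    restored = {c.name: c.value for c in loaded}

    # Loading with a different jar class fails because the file header differs.
    try:
        http.cookiejar.MozillaCookieJar().load(path, ignore_discard=True, ignore_expires=True)
        mismatch_error = None
    except http.cookiejar.LoadError as exc:
        mismatch_error = exc

print(restored)
print(type(mismatch_error).__name__)
```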

### Exception handling

# Request a page that does not exist
from urllib import request, error
try:
    response = request.urlopen('http://cuiqingcai.com/index.htm')
except error.URLError as e:
    print(e.reason)  # URLError is what gets caught here

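For the exceptions that can be caught, see: https://docs.python.org/3/library/urllib.html

# Request a page that does not exist; catch HTTPError first, then the more general URLError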
from urllib import request,error
try:
    response = request.urlopen('http://cuiqingcai.com/index.htm')
except error.HTTPError as e:
    print(e.reason,e.code,e.headers,sep='\n') 
except error.URLError as e:
    print(e.reason) 
else:
    print('Request Successfully')

from urllib import request,error
import socket

try:
    response = request.urlopen('http://cuiqingcai.com/index.htm',timeout = 0.01)
except error.URLError as e:
    print(type(e.reason))
    if isinstance(e.reason,socket.timeout):
        print('TIME OUT')
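The order of the except clauses matters because HTTPError is a subclass of URLError. The sketch below applies the same checks to exception objects constructed by hand, so it runs without a network connection; the `describe_error` helper and the sample URL and messages are assumptions for illustration.

```python
import socket
from urllib import error


def describe_error(exc):
    """Map an exception of the kind raised by urlopen to a short message."""
    # HTTPError is a subclass of URLError, so it has to be tested first.
    if isinstance(exc, error.HTTPError):
        return f"HTTP {exc.code}: {exc.reason}"
    if isinstance(exc, error.URLError):
        # For connection problems, reason may itself be an exception object.
        if isinstance(exc.reason, socket.timeout):
            return "TIME OUT"
        return f"URL error: {exc.reason}"
    return f"unexpected: {exc!r}"


samples = [
    error.HTTPError("http://example.com/missing", 404, "Not Found", {}, None),
    error.URLError(socket.timeout("timed out")),
    error.URLError("no host given"),
]
for exc in samples:
    print(describe_error(exc))
```

If the URLError check came first, the 404 case would be reported as a generic URL error and its status code would be lost.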

### URL parsing

#### urlparse

Takes a URL, splits it into several parts, and assigns each part to a field in order.
Every URL can be divided according to a standard structure.

urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)

from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html')
print(type(result),result)

If the URL has no scheme, the value of the scheme argument is used as the default (https here).

from urllib.parse import urlparse

result = urlparse('www.baidu.com/index.html', scheme="https")
print(result)
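The six fields of the parse result, and the way the scheme default behaves, can be checked on a URL that uses every component. The URL with `;user?id=5#comment` is a sample value chosen for this sketch.

```python
from urllib.parse import urlparse

# A URL that exercises all six components.
result = urlparse("http://www.baidu.com/index.html;user?id=5#comment")
print(result.scheme, result.netloc, result.path,
      result.params, result.query, result.fragment)

# Without "//", nothing is recognized as the host: netloc stays empty and
# the whole string lands in path. The scheme argument fills the gap.
no_scheme = urlparse("www.baidu.com/index.html", scheme="https")
print(no_scheme.scheme, repr(no_scheme.netloc), no_scheme.path)

# The scheme argument is only a default; a scheme in the URL itself wins.
explicit = urlparse("http://www.baidu.com/index.html", scheme="https")
print(explicit.scheme)
```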

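If allow_fragments=False, the fragment is not split off and stays attached to the preceding component (path, params, or query).

#### urlunparse

The inverse of urlparse: it assembles a URL from its components.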

from urllib.parse import urlunparse

data = ['http','www.baidu.com','index.html','user','a=6','comment']
print(urlunparse(data))

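#### urljoin

Joins URLs. Where the second URL specifies a component that differs from the first, the second URL's component overrides the first's.

#### urlencode

Converts a dictionary into URL query parameters.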

from urllib.parse import urlencode

params = {
    'name':'germey',
    'age':22
}
base_url = 'http://www.baidu.com?'
url = base_url + urlencode(params)
print(url)
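The override behavior of urljoin and the escaping done by urlencode can be checked with a few calls. The URLs and parameter values are sample data, and `parse_qs` is included only to show the reverse direction.

```python
from urllib.parse import parse_qs, urlencode, urljoin

# A relative second argument is resolved against the base URL.
print(urljoin("http://www.baidu.com", "FAQ.html"))

# A complete second URL overrides every component of the first.
print(urljoin("http://www.baidu.com/about.html", "https://cuiqingcai.com/FAQ.html"))

# A second argument with only a query keeps the base path and replaces the query.
print(urljoin("http://www.baidu.com/a/b.html", "?category=2"))

# urlencode escapes characters that are not safe in a query string.
query = urlencode({"name": "germey", "q": "a b&c"})
print(query)

# parse_qs turns a query string back into a dict of lists.
print(parse_qs(query))
```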

### robotparser

Parses robots.txt files.
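A minimal sketch of `urllib.robotparser.RobotFileParser`: the rules are fed in as a list of lines so that no network request is needed. The rule set, the `demo-crawler` user agent, and the example.com URLs are assumptions; against a real site you would normally call `set_url()` and `read()` to download its robots.txt instead.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt, given as lines instead of being downloaded.
rules = """User-agent: *
Disallow: /private/
Allow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# can_fetch(useragent, url) reports whether the rules allow crawling the URL.
print(parser.can_fetch("demo-crawler", "http://example.com/index.html"))
print(parser.can_fetch("demo-crawler", "http://example.com/private/data.html"))
```

Checking `can_fetch` before each request is a simple way for a crawler to respect the site's stated rules.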