python urllib, urlparse, urllib2, cookielib
Published: 2019-06-09


1. The urllib module

1.urllib.urlopen(url[,data[,proxies]])

Opens a URL and returns a file-like object that supports the usual file operations. The example below tries to open Google:

import urllib
f = urllib.urlopen('http://www.google.com.hk/')
firstLine = f.readline()   # read the first line of the HTML page

The object returned by urlopen provides these methods:

- read([bytes]): read everything, or up to bytes bytes
- readline(): read a single line
- readlines(): read all lines
- fileno(): return the file descriptor
- close(): close the URL connection
- info(): return an httplib.HTTPMessage object containing the headers sent by the remote server
- getcode(): return the HTTP status code (for HTTP requests, 200 means the request completed successfully, 404 means the page was not found)
- geturl(): return the URL that was requested
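In Python 3 these same methods live on the object returned by urllib.request.urlopen. As a minimal sketch that avoids the network, a file:// URL exercises the same file-like interface (the temporary file and its contents here are invented purely for illustration):

```python
import pathlib
import tempfile
import urllib.request

# Write a small file, then open it through urlopen via a file:// URL.
tmp = pathlib.Path(tempfile.mkdtemp()) / "page.html"
tmp.write_text("<html>hello</html>\n")

f = urllib.request.urlopen(tmp.as_uri())
first_line = f.readline()  # returns bytes, like the http case
print(first_line)          # b'<html>hello</html>\n'
f.close()
```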

2.urllib.urlretrieve(url[,filename[,reporthook[,data]]])

urlretrieve downloads the resource at url to a file on your local disk. If filename is not given, the data is saved to a temporary file.

urlretrieve() returns a two-tuple (filename, mime_headers).

Saving to a temporary file:

filename = urllib.urlretrieve('http://www.google.com.hk/')
type(filename)
print filename[0]
print filename[1]

Output:

'/tmp/tmp8eVLjq'

Saving to a named local file:

filename = urllib.urlretrieve('http://www.baidu.com/', filename='/home/dzhwen/python文件/Homework/urllib/google.html')
print type(filename)
print filename[0]
print filename[1]

Output:

'/home/dzhwen/python\xe6\x96\x87\xe4\xbb\xb6/Homework/urllib/google.html'

The reporthook parameter is used like this:

def process(blk, blk_size, total_size):
    print('%d/%d - %.02f%%' % (blk * blk_size, total_size, float(blk * blk_size) / total_size * 100))

def download():
    filename, fileinfo = urllib.urlretrieve('http://cnblogs.com', 'index.html', reporthook=process)

Output:

0/46164 - 0.00%
8192/46164 - 17.75%
16384/46164 - 35.49%
24576/46164 - 53.24%
32768/46164 - 70.98%
40960/46164 - 88.73%
49152/46164 - 106.47%

Note that blk * blk_size can exceed total_size (as in the last line above), so the function can be rewritten as:

def process(blk, blk_size, total_size):
    if total_size == -1:
        print "can't determine the file size, now retrieved", blk * blk_size
    else:
        percentage = int((blk * blk_size * 100.0) / total_size)
        if percentage >= 100:
            print('%d/%d - %d%%' % (total_size, total_size, 100))
        else:
            print('%d/%d - %d%%' % (blk * blk_size, total_size, percentage))

Output:

0/46238 - 0%
8192/46238 - 17%
16384/46238 - 35%
24576/46238 - 53%
32768/46238 - 70%
40960/46238 - 88%
46238/46238 - 100%
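The clamping logic above reduces to a small pure function that can be checked offline. Here is a Python 3 sketch (progress_line is a hypothetical helper name, but the (blk, blk_size, total_size) signature is exactly what urllib passes to a reporthook):

```python
def progress_line(blk, blk_size, total_size):
    """Format one progress line, clamping the byte count at total_size."""
    received = blk * blk_size
    if total_size == -1:
        return "can't determine the file size, now retrieved %d" % received
    percentage = int(received * 100.0 / total_size)
    if percentage >= 100:
        return '%d/%d - %d%%' % (total_size, total_size, 100)
    return '%d/%d - %d%%' % (received, total_size, percentage)

print(progress_line(0, 8192, 46238))  # 0/46238 - 0%
print(progress_line(6, 8192, 46238))  # 46238/46238 - 100% (49152 clamped)
```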

 

3.urllib.urlcleanup()

Clears the cache left behind by urllib.urlretrieve().

4. urllib.quote(url) and urllib.quote_plus(url)

Percent-encodes a string so it can safely appear inside a URL — printable and acceptable to web servers.

urllib.quote('http://www.baidu.com')

Result:

'http%3A//www.baidu.com'

urllib.quote_plus('http://www.baidu.com')

Result:

'http%3A%2F%2Fwww.baidu.com'
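The same behaviour is easy to verify in Python 3, where these functions moved to urllib.parse. The difference between the two: quote leaves '/' unescaped by default, while quote_plus escapes it and also turns spaces into '+':

```python
from urllib.parse import quote, quote_plus

print(quote('http://www.baidu.com'))       # http%3A//www.baidu.com
print(quote_plus('http://www.baidu.com'))  # http%3A%2F%2Fwww.baidu.com
print(quote_plus('a b'))                   # a+b
```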

5. urllib.unquote(url) and urllib.unquote_plus(url)

The inverses of the functions in section 4: they decode percent-encoded strings.
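A quick round trip in Python 3 (again via urllib.parse) shows that unquote_plus also undoes the '+' encoding of spaces, while plain unquote leaves '+' alone:

```python
from urllib.parse import quote_plus, unquote, unquote_plus

encoded = quote_plus('http://www.baidu.com/s?wd=hello world')
print(unquote_plus(encoded))  # http://www.baidu.com/s?wd=hello world
print(unquote('a+b%20c'))     # a+b c
```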

6.urllib.urlencode(query)

Encodes a dict of key/value pairs into a query string, with pairs joined by &.

Combined with urlopen, this gives you GET and POST requests:

GET:

import urllib
params = urllib.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0})
f = urllib.urlopen("http://python.org/query?%s" % params)
print f.read()

POST:

import urllib
params = urllib.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0})
f = urllib.urlopen("http://python.org/query", params)
f.read()
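The encoding step itself is pure and can be tried offline. In Python 3 the function lives in urllib.parse (dicts preserve insertion order there, so the pair order is predictable):

```python
from urllib.parse import urlencode

params = urlencode({'spam': 1, 'eggs': 2, 'bacon': 0})
print(params)  # spam=1&eggs=2&bacon=0
```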

2. The urlparse module

1.urlparse

Purpose: split a URL into its components.

def parse_html():
    url = 'https://www.baidu.com/s?wd=python&rsv_spt=1&rsv_iqid=0xad2dc5550032146a&issp=1&f=8&rsv_bp=0&rsv_idx=2&ie=utf-8&tn=baiduhome_pg&rsv_enter=1&rsv_sug3=7&rsv_sug1=5&rsv_sug7=100&rsv_sug2=0&inputT=22&rsv_sug4=4980'
    result = urlparse.urlparse(url)
    # params = urlparse.parse_qs(result.query)
    print result
    # print params

Output:

ParseResult(scheme='https', netloc='www.baidu.com', path='/s', params='', query='wd=python&rsv_spt=1&rsv_iqid=0xad2dc5550032146a&issp=1&f=8&rsv_bp=0&rsv_idx=2&ie=utf-8&tn=baiduhome_pg&rsv_enter=1&rsv_sug3=7&rsv_sug1=5&rsv_sug7=100&rsv_sug2=0&inputT=22&rsv_sug4=4980', fragment='')

As shown, urlparse returns a ParseResult object containing the scheme, network location, path, params, and query string.
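The same split can be checked offline. In Python 3 the function lives in urllib.parse; a shorter URL is used here purely for illustration:

```python
from urllib.parse import urlparse

result = urlparse('https://www.baidu.com/s?wd=python&ie=utf-8')
print(result.scheme)  # https
print(result.netloc)  # www.baidu.com
print(result.path)    # /s
print(result.query)   # wd=python&ie=utf-8
```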

2.parse_qs

import urllib
import urlparse

def parse_html():
    url = 'https://www.baidu.com/s?wd=python&rsv_spt=1&rsv_iqid=0xad2dc5550032146a&issp=1&f=8&rsv_bp=0&rsv_idx=2&ie=utf-8&tn=baiduhome_pg&rsv_enter=1&rsv_sug3=7&rsv_sug1=5&rsv_sug7=100&rsv_sug2=0&inputT=22&rsv_sug4=4980'
    result = urlparse.urlparse(url)
    params = urlparse.parse_qs(result.query)
    # print result
    print params

if __name__ == '__main__':
    parse_html()

Output:

{'wd': ['python'], 'rsv_spt': ['1'], 'rsv_iqid': ['0xad2dc5550032146a'], 'inputT': ['22'], 'f': ['8'], 'rsv_enter': ['1'], 'rsv_bp': ['0'], 'rsv_idx': ['2'], 'tn': ['baiduhome_pg'], 'rsv_sug4': ['4980'], 'rsv_sug7': ['100'], 'rsv_sug1': ['5'], 'issp': ['1'], 'rsv_sug3': ['7'], 'rsv_sug2': ['0'], 'ie': ['utf-8']}
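parse_qs behaves the same way in Python 3 (urllib.parse.parse_qs). Every value comes back as a list, because a key may repeat in a query string:

```python
from urllib.parse import parse_qs

params = parse_qs('wd=python&ie=utf-8&tag=a&tag=b')
print(params['wd'])   # ['python']
print(params['tag'])  # ['a', 'b']
```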

 

3. The urllib2 module

urllib2 offers more powerful features, such as cookie handling, but it cannot fully replace urllib: for example, urllib.urlencode has no counterpart in urllib2.

3.1 urllib2.urlopen()

Purpose: open a URL.

Parameters:

  • url
  • data = None
  • timeout = <object>
import urllib
import urllib2

def demo():
    url = 'http://www.cnblogs.com/hester'
    try:
        s = urllib2.urlopen(url, timeout=3)
    except urllib2.HTTPError, e:
        print e
    else:
        print s.read(100)

if __name__ == '__main__':
    demo()

Output:

”温故而知新“

If the url is changed to a nonexistent address:

url = 'http://www.cnblogs.com/hester/asdfas'

Output:

HTTP Error 404: Not Found
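The message printed above is just HTTPError's string form. In Python 3 the class lives in urllib.error, and constructing one by hand (purely for illustration, no network involved) shows the same formatting:

```python
from urllib.error import HTTPError

# Build the error by hand instead of hitting the network.
e = HTTPError('http://www.cnblogs.com/hester/asdfas', 404, 'Not Found', None, None)
print(e)       # HTTP Error 404: Not Found
print(e.code)  # 404
```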

3.2 urllib2.Request()

Purpose: add or modify HTTP headers.

Parameters:

  • url
  • data
  • headers
import urllib
import urllib2

def demo():
    url = 'http://www.cnblogs.com/hester'
    headers = {'User-Agent': 'Mozilla/5.0', 'x-my-hester': 'my value'}
    req = urllib2.Request(url, headers=headers)
    s = urllib2.urlopen(req)
    print s.read(100)
    print req.headers
    s.close()

if __name__ == '__main__':
    demo()

Output:

”温故而知新“ {'X-my-hester': 'my value', 'User-agent': 'Mozilla/5.0'}
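Notice how the header names came back capitalized ('User-agent', 'X-my-hester'): Request normalizes each name with str.capitalize when storing it. This can be seen offline; in Python 3 the class is urllib.request.Request:

```python
from urllib.request import Request

req = Request('http://www.cnblogs.com/hester',
              headers={'User-Agent': 'Mozilla/5.0', 'x-my-hester': 'my value'})
print(req.headers)  # {'User-agent': 'Mozilla/5.0', 'X-my-hester': 'my value'}
```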

3.3 urllib2.build_opener()

Purpose: create an opener.

Parameters:

  • a list of handlers, such as:
  1. ProxyHandler
  2. UnknownHandler
  3. HTTPHandler
  4. HTTPDefaultHandler
  5. HTTPRedirectHandler
  6. FTPHandler
  7. FileHandler
  8. HTTPErrorHandler
  9. HTTPSHandler

Returns:

  • OpenerDirector
import urllib
import urllib2

def request_post_debug():
    data = {'username': 'hester_ge', 'password': 'xxxxxxx'}
    headers = {'User-Agent': 'Mozilla/5.0', 'x-my-hester': 'my value'}
    req = urllib2.Request('http://www.cnblogs.com/hester', data=urllib.urlencode(data), headers=headers)
    opener = urllib2.build_opener(urllib2.HTTPHandler(debuglevel=1))
    s = opener.open(req)
    print s.read(100)
    s.close()

if __name__ == '__main__':
    request_post_debug()

Output:

send: 'POST /hester HTTP/1.1\r\nAccept-Encoding: identity\r\nContent-Length: 35\r\nHost: www.cnblogs.com\r\nX-My-Hester: my value\r\nUser-Agent: Mozilla/5.0\r\nConnection: close\r\nContent-Type: application/x-www-form-urlencoded\r\n\r\nusername=hester_ge&password=xxxxxxx'
reply: 'HTTP/1.1 200 OK\r\n'
header: Date: Sun, 03 Jul 2016 08:28:37 GMT
header: Content-Type: text/html; charset=utf-8
header: Content-Length: 14096
header: Connection: close
header: Vary: Accept-Encoding
header: Cache-Control: private, max-age=10
header: Expires: Sun, 03 Jul 2016 08:28:45 GMT
header: Last-Modified: Sun, 03 Jul 2016 08:28:35 GMT
header: X-UA-Compatible: IE=10
”温故而知新“

3.4 urllib2.install_opener

Purpose: install the opener you created as the global default, so that later urllib2.urlopen calls use it.

import urllib
import urllib2

def demo():
    url = 'http://www.cnblogs.com/hester'
    headers = {'User-Agent': 'Mozilla/5.0', 'x-my-hester': 'my value'}
    req = urllib2.Request(url, headers=headers)
    s = urllib2.urlopen(req)
    print s.read(100)
    print req.headers
    s.close()

def install_opener():
    opener = urllib2.build_opener(urllib2.HTTPHandler(debuglevel=1),
                                  urllib2.HTTPSHandler(debuglevel=1))
    urllib2.install_opener(opener)

if __name__ == '__main__':
    demo()

Output:

”温故而知新“ {'X-my-hester': 'my value', 'User-agent': 'Mozilla/5.0'}

If the main block above is changed to:

if __name__ == '__main__':
    install_opener()
    demo()

Output:

send: 'GET /hester HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: www.cnblogs.com\r\nConnection: close\r\nX-My-Hester: my value\r\nUser-Agent: Mozilla/5.0\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Date: Sun, 03 Jul 2016 08:39:31 GMT
header: Content-Type: text/html; charset=utf-8
header: Content-Length: 14096
header: Connection: close
header: Vary: Accept-Encoding
header: Cache-Control: private, max-age=10
header: Expires: Sun, 03 Jul 2016 08:39:41 GMT
header: Last-Modified: Sun, 03 Jul 2016 08:39:31 GMT
header: X-UA-Compatible: IE=10
”温故而知新“ {'X-my-hester': 'my value', 'User-agent': 'Mozilla/5.0'}

4. The cookielib module

HTTP is stateless: the server cannot tell whether two requests come from the same machine, so cookies are used to identify clients.

The client's browser first sends a request to the server; the server parses it and returns a response whose Set-Cookie header, if present, is stored by the browser.

Two pieces are used here:

cookielib.CookieJar provides the interface for parsing and storing cookies.

urllib2.HTTPCookieProcessor automatically handles cookies for you.

# encoding=utf8
import urllib2
import cookielib

def handler_cookie():
    cookiejar = cookielib.CookieJar()
    handler = urllib2.HTTPCookieProcessor(cookiejar=cookiejar)
    opener = urllib2.build_opener(handler, urllib2.HTTPHandler(debuglevel=1))
    s = opener.open('http://www.douban.com/')
    print s.read(100)
    s.close()
    print '=' * 20
    print cookiejar._cookies
    print '=' * 20
    # the cookie is carried automatically on the second request
    s2 = opener.open('http://www.douban.com/')
    print s2.read(100)
    s2.close()

if __name__ == '__main__':
    handler_cookie()

Output:

send: 'GET / HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: www.douban.com\r\nConnection: close\r\nUser-Agent: Python-urllib/2.7\r\n\r\n'
reply: 'HTTP/1.1 301 Moved Permanently\r\n'
header: Date: Sun, 03 Jul 2016 10:01:41 GMT
header: Content-Type: text/html
header: Content-Length: 178
header: Connection: close
header: Location: https://www.douban.com/
header: Server: dae
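The wiring itself can be sketched without touching the network. In Python 3 the modules are http.cookiejar and urllib.request; the jar simply starts out empty until a response sets a cookie:

```python
import http.cookiejar
import urllib.request

cookiejar = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookiejar)
opener = urllib.request.build_opener(handler)

print(len(cookiejar))  # 0 — no cookies until a response arrives
print(isinstance(opener, urllib.request.OpenerDirector))  # True
```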


Reposted from: https://www.cnblogs.com/hester/p/5420696.html

查看>>