1. Fetch the page source and set the HTTP request headers:
    import requests
    from bs4 import BeautifulSoup
    import os
    headers = {'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"}
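2. Parse the page with the BeautifulSoup module (start_html here is the response object returned by the request for the page):
    Soup = BeautifulSoup(start_html.content, 'lxml')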
3. After parsing, get the relevant tags and create a folder for each one:
    all_a = Soup.find('div', class_='all').find_all('a')
    for a in all_a:
4. Get the link to each individual page from its tag:
    href = a['href']
5. Fetch the source of each individual page and parse it, in the same way as steps 1 and 2.
6. Add a Referer entry to the request headers:
    headers = {'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1", 'Referer': ...}
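
The steps above can be put together roughly as follows. This is only a sketch under some assumptions: the function names (fetch_soup, list_albums, make_folder, crawl), the folder-name cleanup rule, the 10-second timeout, and the choice of the index page as the Referer value are all illustrative and not taken from the original answer. It also uses the built-in 'html.parser' instead of 'lxml' so that it runs without the extra dependency; switch back to 'lxml' if it is installed.

```python
import os
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

# Step 1: the User-Agent string from the answer.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 "
                  "(KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"
}


def fetch_soup(url, referer=None):
    """Download a page and return it parsed (steps 1, 2 and 5)."""
    headers = dict(HEADERS)
    if referer:
        # Step 6: the Referer value is an assumption; the answer does not give one.
        headers["Referer"] = referer
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()
    # 'html.parser' is swapped in for the answer's 'lxml' (see the note above).
    return BeautifulSoup(response.content, "html.parser")


def list_albums(soup):
    """Return (title, href) pairs for the links inside <div class="all"> (steps 3 and 4)."""
    container = soup.find("div", class_="all")
    if container is None:
        return []
    return [(a.get_text(strip=True), a["href"])
            for a in container.find_all("a", href=True)]


def make_folder(base_dir, title):
    """Create one folder per link; the character filter is a hypothetical cleanup rule."""
    safe = "".join(c for c in title if c not in '\\/:*?"<>|').strip() or "untitled"
    path = os.path.join(base_dir, safe)
    os.makedirs(path, exist_ok=True)
    return path


def crawl(start_url, base_dir="downloads"):
    """Walk the index page, then fetch and parse every linked page."""
    for title, href in list_albums(fetch_soup(start_url)):
        folder = make_folder(base_dir, title)
        page = fetch_soup(urljoin(start_url, href), referer=start_url)
        print(folder, page.title.string if page.title else "")
```

Calling crawl() with the index page URL of the site being scraped would run all six steps. Keeping the parsing in list_albums separate from the download in fetch_soup means the tag selection can be checked against a saved HTML string without touching the network.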