Scraping Pretty-Girl Photos from Zhihu Comments (updated Jan. 28)


This code started from something I saw one day in a Python chat group, where a skilled member had written a scraper for pretty-girl photos under Zhihu comments; the original article is 「有了知乎還要什麼福利?python抓取長腿小姐姐」. It looked quite good, so I used it as a reference and made some changes: with a multiprocessing Pool of 8 processes, it pulled down over 2,000 images in a short while. Roughly 10% of them were meme stickers; the rest were actual photos.

Let's walk through the code piece by piece.

def getpage(i):
    headers = {
        'Cache-Control': 'max-age=0',
        'Host': 'www.zhihu.com',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': ''  # paste your User-Agent here
    }
    cookies = {}  # paste your cookies here
    base_url = 'https://www.zhihu.com/api/v4/questions/29815334/answers?include=data%5B*%5D.is_normal%2Cadmin_closed' \
               '_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis' \
               '_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent&offset=' \
               + str(i) + '&limit=1&sort_by=default'
    response = requests.get(base_url, headers=headers, cookies=cookies)
    html = response.text
    img_json = json.loads(html)
    print('Fetching Zhihu girl pics - comment #%s' % i)
    contentpage(img_json)

This is the request function. Its job is to attach the headers and cookies, build the URL for the offset we loop over, and pull down the whole JSON page (decoded with json.loads) ready for parsing.
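As a side note, requests can assemble that long escaped query string for you via its params argument. A minimal sketch under the same question ID and field list (getpage_params is a hypothetical name, not part of the original script):

import requests

API = 'https://www.zhihu.com/api/v4/questions/29815334/answers'

def getpage_params(i, headers, cookies):
    # requests URL-encodes the params dict itself, so the include list
    # can be written in plain text instead of %2C / %5B escapes.
    params = {
        'include': 'data[*].is_normal,admin_closed_comment,reward_info,'
                   'is_collapsed,annotation_action,annotation_detail,'
                   'collapse_reason,is_sticky,collapsed_by,suggest_edit,'
                   'comment_count,can_comment,content',
        'offset': i,          # which answer to start from
        'limit': 1,           # one answer per request
        'sort_by': 'default',
    }
    response = requests.get(API, headers=headers, cookies=cookies, params=params)
    return response.json()  # requests decodes the JSON body directly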

def contentpage(img_json):
    try:
        data = img_json["data"][0]
        content = data["content"]
        # print(content)
        html = BeautifulSoup(content, 'lxml')
        # Each picture is matched by two <img> tags, so take every other one
        img_page = html.select('img')[::2]
        for i in img_page:
            address = i.get('src')
            # print(address)
            imgpage(address)
    except (KeyError, IndexError):
        print('No images in this comment')

This is the parsing function. The JSON decoded in the request function is handed to BeautifulSoup, which pulls out the <img> tags; each image will later be saved under a name taken from its URL. Comments without images simply fall through the except branch.
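The [::2] slice is worth a second look. According to the comment in the code, each picture is matched by two consecutive <img> tags in the answer HTML, so keeping every other tag deduplicates them. A toy illustration of that assumption (the sample markup is invented, not real Zhihu HTML):

from bs4 import BeautifulSoup

# Pretend answer HTML where each picture appears twice in a row.
sample = '<img src="a.jpg"><img src="a.jpg"><img src="b.jpg"><img src="b.jpg">'
html = BeautifulSoup(sample, 'lxml')
print([tag.get('src') for tag in html.select('img')[::2]])
# prints ['a.jpg', 'b.jpg'] -- one entry per picture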

def imgpage(address):
    # Use the last segment of the image URL as the file name
    fname = address.split('/')[-1]
    response = requests.get(address)
    # response.content is the raw binary data
    html = response.content
    with open('F:/Picture/' + fname, 'wb') as f:
        f.write(html)

This is the save function: it takes the image URL obtained above and writes the file into a folder of your choice.
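If you want the script to be a little more forgiving, two small additions help: create the target folder if it does not exist, and skip failed downloads. A sketch of such a variant (imgpage_safe is a hypothetical name; the folder path is the same one used above):

import os
import requests

SAVE_DIR = 'F:/Picture/'

def imgpage_safe(address):
    os.makedirs(SAVE_DIR, exist_ok=True)  # create the folder on first use
    fname = address.split('/')[-1]
    response = requests.get(address)
    if response.status_code == 200:       # skip broken image links
        with open(os.path.join(SAVE_DIR, fname), 'wb') as f:
            f.write(response.content)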

Finally, the complete code:

import requests
from bs4 import BeautifulSoup
import json
from multiprocessing import Pool

# Request function
def getpage(i):
    headers = {
        'Cache-Control': 'max-age=0',
        'Host': 'www.zhihu.com',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko)'
                      ' Chrome/63.0.3239.132 Safari/537.36'
    }
    cookies = {'_zap': '2bd20b68-aaa8-40a2-81f0-9f505c21570e',
               'd_c0': '"ADDCha-zrgyPTnXj2tBOOofbZknN1b37Ddw=|1510662476"',
               'z_c0': 'Mi4xdW9iU0FnQUFBQUFBTU1LRnI3T3VEQmNBQUFCaEFsVk5DemN1V3dEUlNsLUdlSVBPNm4tMnlfWE1BemlaanN2ekpR|1514203403|1028c015c51f4a729e9c6c9fd1edd8ff03f2fdc7',
               '__utmv': '51854390.100-1|2=registration_date=20160402=1^3=entry_date=20160402=1',
               'q_c1': 'b6b91bb392984be2914e00662a37044f|1515823833000|1509803312000',
               '_xsrf': 'dd03cf5734b1d5feb8ee57a162e12420',
               '__utma': '51854390.242508447.1510662476.1516376775.1516538064.8',
               '__utmz': '51854390.1516538064.8.8.utmcsr=baidu|utmccn=(organic)|utmcmd=organic',
               'aliyungf_tc': 'AQAAAOCwOD4shQYAl9Bb359Zh9Kn4KjR'}
    base_url = 'https://www.zhihu.com/api/v4/questions/29815334/answers?include=data%5B*%5D.is_normal%2Cadmin_closed' \
               '_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis' \
               '_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent&offset=' \
               + str(i) + '&limit=1&sort_by=default'
    response = requests.get(base_url, headers=headers, cookies=cookies)
    html = response.text
    img_json = json.loads(html)
    print('Fetching Zhihu girl pics - comment #%s' % i)
    contentpage(img_json)
# Parse the JSON data
def contentpage(img_json):
    try:
        data = img_json["data"][0]
        content = data["content"]
        # print(content)
        html = BeautifulSoup(content, 'lxml')
        # Each picture is matched by two <img> tags, so take every other one
        img_page = html.select('img')[::2]
        for i in img_page:
            address = i.get('src')
            # print(address)
            imgpage(address)
    except (KeyError, IndexError):
        print('No images in this comment')
# Save function
def imgpage(address):
    # Use the last segment of the image URL as the file name
    fname = address.split('/')[-1]
    response = requests.get(address)
    # response.content is the raw binary data
    html = response.content
    with open('F:/Picture/' + fname, 'wb') as f:
        f.write(html)
if __name__ == '__main__':
    pool = Pool(processes=8)
    pool.map_async(getpage, range(1, 1000))
    pool.close()
    pool.join()
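One caveat on the Pool usage: map_async returns immediately, and any exception raised inside a worker stays hidden unless you ask the returned AsyncResult for its value. If a run seems to finish without downloading anything, calling .get() will surface the error. A minimal variant of the main block:

if __name__ == '__main__':
    pool = Pool(processes=8)
    result = pool.map_async(getpage, range(1, 1000))
    pool.close()
    result.get()   # blocks until done; re-raises the first worker exception, if any
    pool.join()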