Scraping Douban's 2016 Movies / Genres with Python

Description


Well, let's keep it simple this time.
I suddenly felt like watching a movie, so I grabbed Python and scraped Douban's 2016 movie list, and while I was at it tallied up the rating rankings and the genres. Pretty simple, all told.
The 2016 movies are (roughly) all behind this link:

'https://movie.douban.com/j/search_subjects?type=movie&tag=%E7%83%AD%E9%97%A8&sort=time&page_limit=365&page_start=0'

You can actually hit Douban directly with a GET request here, and this link works too. The page_limit parameter controls how many entries come back, so setting it to a reasonably large number returns the whole list of movies.
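As a sketch of what that request looks like (endpoint and parameters are taken straight from the link above; this only builds the query string, it doesn't hit the network):

```python
from urllib.parse import urlencode

# Endpoint and parameters from the link above; page_limit controls how
# many entries come back, page_start is the offset.
base = 'https://movie.douban.com/j/search_subjects'
params = {
    'type': 'movie',
    'tag': '热门',        # the "hot" tag
    'sort': 'time',
    'page_limit': 365,   # large enough to cover a whole year
    'page_start': 0,
}
url = base + '?' + urlencode(params)
print(url)
```

urlencode percent-encodes the Chinese tag as UTF-8, which is exactly what the link above contains.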

The response looks roughly like this (screenshot of the raw JSON omitted).
I thought about using BeautifulSoup, but it won't work here, so it's honest-to-goodness re matching instead.

After scraping, everything is stored in a dict. Sorting by key is the fun part: take the dict's keys as a list, sort that list, and then iterating over the sorted list and looking up each key gives you the values in sorted order.
The code:

d = {}
d['olahiuj'] = 'handsome'
for key in sorted(d.keys()):
    print d[key]

Prefer sorted over sort here: sorted returns a new list and leaves the original one unchanged.
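A quick illustration of the difference (Python 3 here, while the post's code is Python 2):

```python
nums = [3, 1, 2]

# sorted() builds and returns a new list; nums itself is untouched.
result = sorted(nums)
print(result)  # [1, 2, 3]
print(nums)    # still [3, 1, 2]

# list.sort() sorts in place and returns None instead.
other = [3, 1, 2]
returned = other.sort()
print(other)     # [1, 2, 3]
print(returned)  # None
```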

Next, fetch each scraped URL and look up the genres there; no surprises, it's just re matching again. This part is particularly slow, so multithreading helps, but keep the request rate down and try to look like a real human (heh).
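The genre extraction boils down to the same regex the full script below uses; here is a minimal sketch against a hand-made HTML fragment (the fragment itself is made up, but the v:genre span is the markup Douban movie pages use):

```python
import re

# A hand-made fragment mimicking the genre spans on a Douban movie page.
page = ('<span property="v:genre">剧情</span>'
        '<span property="v:genre">爱情</span>')

genres = re.findall(r'<span property="v:genre">(.+?)</span>', page)
print(genres)  # ['剧情', '爱情']
```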
Then we once again use a dict to store each genre and its count, and write the result out to a CSV file.
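For the counting itself, a plain dict works (that is what the script below does); collections.Counter is a shorthand for the same thing. Toy data here:

```python
from collections import Counter

# Genres collected from all pages, flattened into one list (toy data).
all_genres = ['剧情', '爱情', '剧情', '喜剧', '剧情']

counts = Counter(all_genres)
print(counts['剧情'])  # 3

# Equivalent plain-dict version, matching the script's approach:
rec = {}
for g in all_genres:
    if g in rec:
        rec[g] += 1
    else:
        rec[g] = 1
```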
Python ships with a csv module, so just import it:

import csv

The main reason for choosing CSV over other formats is that CSV files can be viewed and edited in Excel.
Writing goes like this:

with open('filename.csv', 'wb') as csvfile:
    blah = csv.writer(csvfile, dialect = 'excel')
    blah.writerow([1, 2, 3])

To make sure every item of the list ends up in its own column, set dialect to 'excel'; also, the argument to writerow has to be a list (I think?).
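One caveat if you run this on Python 3: the file should be opened in text mode with newline='' rather than 'wb'. A sketch, using an in-memory buffer in place of a real file:

```python
import csv
import io

# io.StringIO stands in for open('filename.csv', 'w', newline='').
buf = io.StringIO()
writer = csv.writer(buf, dialect='excel')
writer.writerow([1, 2, 3])        # each list item lands in its own column
writer.writerow(['a', 'b', 'c'])

print(buf.getvalue())
```

The excel dialect terminates rows with \r\n, which Excel is happy with.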

I also meant to visualize the data and build a chart or something, but that can wait until tomorrow. By the way, what's with the same-sex genre having 11 movies, and what's with it ranking first?

Code


# -*- coding: utf-8 -*-
from requests.packages.urllib3.exceptions import InsecureRequestWarning
import threading
import requests
import time
import csv
import os
import re


def getPage(html, url, headers, params = {}, referer = ''):
    # Skip certificate verification for https URLs.
    flags = True
    if url[:5] == 'https':
        flags = False
    headers['Referer'] = referer
    response = html.get(url, headers = headers, params = params, verify = flags)
    page = response.content
    return page


def find(string, page, flags = 0):
    pattern = re.compile(string, flags = flags)
    results = re.findall(pattern, page)
    return results


def work(html, url, headers, cnt):
    # Strip the backslashes that escape '/' in the JSON response.
    tmp = ''
    for q in url:
        if q != '\\':
            tmp = tmp + q
    url = tmp
    page = getPage(html, url, headers)
    types = find(r'<span property="v:genre">(.+?)</span>', page)
    global mutex, rec
    mutex.acquire()
    print cnt
    for item in types:
        if item in rec:
            rec[item] += 1
        else:
            rec[item] = 1
    mutex.release()


def init():
    html = requests.session()
    doubanUrl = 'https://movie.douban.com'
    headers = {'User-Agent': 'Mozilla/5.0 (compatible; MSIE 5.5; Windows NT)'}
    page = getPage(html, 'https://movie.douban.com/j/search_subjects', headers,
                   params = {'type': 'movie', 'tag': '热门', 'sort': 'time',
                             'page_limit': '400', 'page_start': '0'})
    results = find(r'"rate":"(.+?)",.+?"title":"(.+?)","url":"(.+?)"', page)
    urls = [item[2] for item in results]
    rates = [item[0] for item in results]
    titles = [item[1] for item in results]
    # Selection-style sort by rating, descending (rates are strings,
    # so the comparison is lexicographic).
    for i in xrange(len(urls)):
        for j in xrange(i + 1, len(urls)):
            if rates[i] < rates[j]:
                rates[i], rates[j] = rates[j], rates[i]
                urls[i], urls[j] = urls[j], urls[i]
                titles[i], titles[j] = titles[j], titles[i]
    with open('douban.csv', 'wb') as csvfile:
        spamwriter = csv.writer(csvfile, dialect = 'excel')
        for i in xrange(len(rates)):
            spamwriter.writerow([titles[i], urls[i], rates[i]])
    # One thread per movie page; rec accumulates genre counts under mutex.
    global mutex, rec
    mutex = threading.Lock()
    rec = {}
    jobs = []
    cnt = 0
    for i in xrange(len(urls)):
        cnt += 1
        job = threading.Thread(target = work, args = (html, urls[i], headers, cnt))
        job.start()
        jobs.append(job)
    for job in jobs:
        job.join()
    with open('douban_type.csv', 'wb') as csvfile:
        spamwriter = csv.writer(csvfile, dialect = 'excel')
        for key in sorted(rec.keys(), reverse = True):
            spamwriter.writerow([key, rec[key]])


if __name__ == '__main__':
    requests.packages.urllib3.disable_warnings(InsecureRequestWarning)
    init()