记得高中语文课堂上有一次“进哥”发言讲到最差的广告词,“进哥”表示:“三精牌葡萄糖酸钙口服液,一定要买蓝瓶的呦!没有逻辑,广告词平平无奇特别差!”其实恰恰这种白痴的广告词重复听了多次之后,已经在潜移默化中被洗脑。想要买这个葡糖糖酸钙口服液,脑袋中立刻蹦出个蓝屏的。活的明白不容易,不知不觉中有多少次被洗脑?我就中过招。今天偶然发现知乎热榜有一话题:“有哪些销量极高的烂书?烂在哪里?”回答竟然达到3264个。看来有很多“书呆子”逐渐觉醒不呆了。
佛经讲只要身心受五蕴感染,得到终究是空。俗人终究是俗人。今个做个统计,给各位上仙道友大侠留个印象,如遇到本文提到的书名,要小心了!首先,下载目标页面中所有的回答,这里需要模拟浏览器行为,因为知乎回答是动态展示的。
# -*-coding:utf-8 -*-
from urlparse import urlsplit
from os.path import basename
import urllib2
import re
import requests
import os
import json
import urllib
def mkdir(path):
path = path.strip()
isExists = os.path.exists(path)
if not isExists:
print u'新建了名字叫做', path, u'的文件夹'
os.makedirs(path)
return True
else:
print u'名为', path, u'的文件夹已经创建成功'
return False
def downloadText(id, path):
offset = 0
outfile = open("out.txt", "w")
while offset < 3300:
get_url = 'https://www.zhihu.com/api/v4/questions/'+id + \
'/answers?include=data[*].is_normal,admin_closed_comment,reward_info,is_collapsed,annotation_action,annotation_detail,collapse_reason,is_sticky,collapsed_by,suggest_edit,comment_count,can_comment,content,editable_content,voteup_count,reshipment_settings,comment_permission,created_time,updated_time,review_info,relevant_info,question,excerpt,relationship.is_authorized,is_author,voting,is_thanked,is_nothelp;data[*].mark_infos[*].url;data[*].author.follower_count,badge[*].topics&limit=5&offset='+str(
offset)+'&sort_by=default'
header = {
'User-Agent': "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:34.0) Gecko/20100101 Firefox/34.0",
'Host': "www.zhihu.com",
}
req = urllib2.Request(get_url, headers=header)
response = urllib2.urlopen(req).read()
txt = json.loads(response)
if txt.get("paging").get("is_end"):
print "爬取完毕!"
break
offset += 1
content = txt.values()[1][0]['content']
content = str(content.encode('utf-8'))
books = re.findall("《.*?》", content)
if len(books) > 0:
outfile.write("\n".join(books) + "\n")
else:
continue
outfile.close()
if __name__ == '__main__':
path = u'/home/dustin/temp/zhihu'
mkdir(path)
downloadText("25434255", path)
下载好当前页面数据之后,有些回答中书名号带有链接,只需要删除html码,一般是两个尖括号括起来。vim用:%s/<..\{-\}>//g
即可全部删除。然后使用已有做wordcloud脚本做图。
#++++++++++++++++++++++++++++++++++
# rquery.wordcloud() : Word cloud generator
# - http://www.sthda.com
#+++++++++++++++++++++++++++++++++++
# x : character string (plain text, web url, txt file path)
# type : specify whether x is a plain text, a web page url or a file path
# lang : the language of the text
# excludeWords : a vector of words to exclude from the text
# textStemming : reduces words to their root form
# colorPalette : the name of color palette taken from RColorBrewer package,
# or a color name, or a color code
# min.freq : words with frequency below min.freq will not be plotted
# max.words : Maximum number of words to be plotted. least frequent terms dropped
# value returned by the function : a list(tdm, freqTable)
rquery.wordcloud <- function(x, type=c("text", "url", "file"),
lang="english", excludeWords=NULL,
textStemming=FALSE, colorPalette="Dark2",
min.freq=3, max.words=200)
{
library("tm")
library("SnowballC")
library("wordcloud")
library("RColorBrewer")
if(type[1]=="file") text <- readLines(x)
else if(type[1]=="url") text <- html_to_text(x)
else if(type[1]=="text") text <- x
# Load the text as a corpus
docs <- Corpus(VectorSource(text))
# Convert the text to lower case
docs <- tm_map(docs, content_transformer(tolower))
# Remove numbers
docs <- tm_map(docs, removeNumbers)
# Remove stopwords for the language
docs <- tm_map(docs, removeWords, stopwords(lang))
# Remove punctuations
docs <- tm_map(docs, removePunctuation)
# Eliminate extra white spaces
docs <- tm_map(docs, stripWhitespace)
# Remove your own stopwords
if(!is.null(excludeWords))
docs <- tm_map(docs, removeWords, excludeWords)
# Text stemming
if(textStemming) docs <- tm_map(docs, stemDocument)
# Create term-document matrix
tdm <- TermDocumentMatrix(docs)
m <- as.matrix(tdm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
# check the color palette name
if(!colorPalette %in% rownames(brewer.pal.info)) colors = colorPalette
else colors = brewer.pal(8, colorPalette)
# Plot the word cloud
set.seed(1234)
wordcloud(d$word,d$freq, min.freq=min.freq, max.words=max.words,
random.order=FALSE, rot.per=0.35,
use.r.layout=FALSE, colors=colors)
invisible(list(tdm=tdm, freqTable = d))
}
#++++++++++++++++++++++
# Helper function
#++++++++++++++++++++++
# Download and parse webpage
html_to_text<-function(url){
library(RCurl)
library(XML)
# download html
html.doc <- getURL(url)
#convert to plain text
doc = htmlParse(html.doc, asText=TRUE)
# "//text()" returns all text outside of HTML tags.
# We also don’t want text such as style and script codes
text <- xpathSApply(doc, "//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)][not(ancestor::form)]", xmlValue)
# Format text vector into one character string
return(paste(text, collapse = " "))
}
做图命令就一行就可以
res<-rquery.wordcloud("./out.txt", type ="file", lang = "english")
得到烂书云图
果然!什么破《摆渡人》、《天才在左,疯子在右》等等烂到极致。我很幸运地躲过了。但《乌合之众》、《追风筝的人》、《人性的弱点》没躲过。我个人以为这三本书只能说比较浅显,还算不上烂书。图中还有有少数人提到《万历十五年》、《红楼梦》之类,应该是做对比用的。