A Ranking of Bad Books

I remember a high school Chinese class in which "Jin-ge" talked about the worst advertising copy. He cited "Sanjing calcium gluconate oral solution — be sure to buy the blue bottle!" and declared it utterly bland and without logic. Yet it is precisely this kind of idiotic slogan that, repeated enough times, quietly brainwashes you: the moment you want to buy calcium gluconate oral solution, "the blue bottle" pops into your head. Living with a clear head is not easy; how many times have we been brainwashed without noticing? I certainly have. Today I happened upon a question on Zhihu's trending list: "Which books sold extremely well but are actually terrible, and why?" It has drawn no fewer than 3,264 answers. It seems quite a few "bookworms" are gradually waking up.

The Buddhist sutras say that as long as body and mind are tainted by the five aggregates, whatever you gain is ultimately empty. But a layman remains a layman. So today let's run a little tally, to leave an impression on all you immortals, fellow cultivators, and knights-errant: if you come across a title mentioned in this post, be careful! First, download all the answers on the target page. This requires mimicking a browser, because Zhihu renders its answers dynamically.

# -*- coding: utf-8 -*-
import os
import re

import requests


def mkdir(path):
    # Create the target directory if it does not exist yet.
    path = path.strip()
    if not os.path.exists(path):
        print('Created directory:', path)
        os.makedirs(path)
        return True
    else:
        print('Directory already exists:', path)
        return False


def downloadText(id, path):
    # Page through the answers of a Zhihu question via the v4 API and
    # write every book title (text wrapped in 《》) to out.txt.
    offset = 0
    header = {
        'User-Agent': "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:34.0) Gecko/20100101 Firefox/34.0",
        'Host': "www.zhihu.com",
    }
    outfile = open("out.txt", "w", encoding="utf-8")
    while offset < 3300:
        get_url = ('https://www.zhihu.com/api/v4/questions/' + id +
                   '/answers?include=data[*].is_normal,admin_closed_comment,reward_info,'
                   'is_collapsed,annotation_action,annotation_detail,collapse_reason,is_sticky,'
                   'collapsed_by,suggest_edit,comment_count,can_comment,content,editable_content,'
                   'voteup_count,reshipment_settings,comment_permission,created_time,updated_time,'
                   'review_info,relevant_info,question,excerpt,relationship.is_authorized,'
                   'is_author,voting,is_thanked,is_nothelp;data[*].mark_infos[*].url;'
                   'data[*].author.follower_count,badge[*].topics'
                   '&limit=5&offset=' + str(offset) + '&sort_by=default')

        response = requests.get(get_url, headers=header)
        txt = response.json()
        if txt["paging"]["is_end"]:
            print("Scraping finished!")
            break
        # Each page holds up to 5 answers; scan all of them, not just the first.
        for answer in txt["data"]:
            content = answer["content"]
            books = re.findall("《.*?》", content)
            if books:
                outfile.write("\n".join(books) + "\n")
        offset += 5  # advance by the page size (limit=5)

    outfile.close()


if __name__ == '__main__':
    path = '/home/dustin/temp/zhihu'
    mkdir(path)
    downloadText("25434255", path)
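
For reference, each API response the script consumes has roughly this shape (an illustrative sketch keeping only the keys read above; the real payload carries many more fields, and the content string here is a made-up placeholder):

{
    "paging": {"is_end": False},   # becomes True once the last page is reached
    "data": [                      # up to limit=5 answers per page
        {"content": "<p>……《某烂书》……</p>"},   # answer body as HTML
    ],
}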

Once the page data is downloaded, some answers embed hyperlinks inside the book titles, so the HTML markup (anything enclosed between a pair of angle brackets) needs to be deleted. In vim, :%s/<..\{-\}>//g removes it all. Then an existing wordcloud script can be used to draw the figure.
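
If you would rather not round-trip through vim, the same cleanup can be done in Python before the word cloud step. A minimal sketch, assuming out.txt is the file written above (the regex roughly mirrors the vim pattern and simply drops anything enclosed in angle brackets):

import re

with open("out.txt", encoding="utf-8") as f:
    text = f.read()

# Drop every HTML tag, i.e. anything between a pair of angle brackets.
cleaned = re.sub(r"<[^>]*>", "", text)

with open("out.txt", "w", encoding="utf-8") as f:
    f.write(cleaned)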

#++++++++++++++++++++++++++++++++++
# rquery.wordcloud() : Word cloud generator
# - http://www.sthda.com
#+++++++++++++++++++++++++++++++++++
# x : character string (plain text, web url, txt file path)
# type : specify whether x is a plain text, a web page url or a file path
# lang : the language of the text
# excludeWords : a vector of words to exclude from the text
# textStemming : reduces words to their root form
# colorPalette : the name of a color palette from the RColorBrewer package,
#   or a color name, or a color code
# min.freq : words with frequency below min.freq will not be plotted
# max.words : Maximum number of words to be plotted. least frequent terms dropped

# value returned by the function : a list(tdm, freqTable)
rquery.wordcloud <- function(x, type=c("text", "url", "file"), 
                          lang="english", excludeWords=NULL, 
                          textStemming=FALSE,  colorPalette="Dark2",
                          min.freq=3, max.words=200)
{ 
  library("tm")
  library("SnowballC")
  library("wordcloud")
  library("RColorBrewer") 
  
  if(type[1]=="file") text <- readLines(x)
  else if(type[1]=="url") text <- html_to_text(x)
  else if(type[1]=="text") text <- x
  
  # Load the text as a corpus
  docs <- Corpus(VectorSource(text))
  # Convert the text to lower case
  docs <- tm_map(docs, content_transformer(tolower))
  # Remove numbers
  docs <- tm_map(docs, removeNumbers)
  # Remove stopwords for the language 
  docs <- tm_map(docs, removeWords, stopwords(lang))
  # Remove punctuations
  docs <- tm_map(docs, removePunctuation)
  # Eliminate extra white spaces
  docs <- tm_map(docs, stripWhitespace)
  # Remove your own stopwords
  if(!is.null(excludeWords)) 
    docs <- tm_map(docs, removeWords, excludeWords) 
  # Text stemming
  if(textStemming) docs <- tm_map(docs, stemDocument)
  # Create term-document matrix
  tdm <- TermDocumentMatrix(docs)
  m <- as.matrix(tdm)
  v <- sort(rowSums(m),decreasing=TRUE)
  d <- data.frame(word = names(v),freq=v)
  # check the color palette name 
  if(!colorPalette %in% rownames(brewer.pal.info)) colors = colorPalette
  else colors = brewer.pal(8, colorPalette) 
  # Plot the word cloud
  set.seed(1234)
  wordcloud(d$word,d$freq, min.freq=min.freq, max.words=max.words,
            random.order=FALSE, rot.per=0.35, 
            use.r.layout=FALSE, colors=colors)
  
  invisible(list(tdm=tdm, freqTable = d))
}

#++++++++++++++++++++++
# Helper function
#++++++++++++++++++++++
# Download and parse webpage
html_to_text<-function(url){
  library(RCurl)
  library(XML)
  # download html
  html.doc <- getURL(url)  
  #convert to plain text
  doc = htmlParse(html.doc, asText=TRUE)
 # "//text()" returns all text outside of HTML tags.
 # We also don’t want text such as style and script codes
  text <- xpathSApply(doc, "//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)][not(ancestor::form)]", xmlValue)
  # Format text vector into one character string
  return(paste(text, collapse = " "))
}

Drawing the figure takes just one line:

res <- rquery.wordcloud("./out.txt", type = "file", lang = "english")

This produces the bad-book word cloud:

[Figure: word cloud of the bad-book titles]

Sure enough! Junk like 《摆渡人》 (Ferryman) and 《天才在左,疯子在右》 sits at the rotten extreme. I was lucky enough to dodge those, but I did not escape 《乌合之众》 (The Crowd), 《追风筝的人》 (The Kite Runner), or 《人性的弱点》 (How to Win Friends and Influence People). Personally, I would call those three merely shallow rather than truly bad. A few answers also mention 《万历十五年》 (1587, A Year of No Significance) and 《红楼梦》 (Dream of the Red Chamber), presumably for contrast.
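
The word cloud is only a visual summary of a frequency table, so the ranking itself can be checked directly. A minimal sketch in Python, counting how often each title appears in the cleaned out.txt (collections.Counter does the tallying):

import re
from collections import Counter

with open("out.txt", encoding="utf-8") as f:
    titles = re.findall("《.*?》", f.read())

# The ten most frequently named books, i.e. the top of the ranking.
for title, count in Counter(titles).most_common(10):
    print(count, title)

By construction, the titles at the top of this count are the same ones that dominate the cloud.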
