
[Python Data Analysis] Python3 Excel Operations (II): Solving and Optimizing Some Issues

   
Following the previous post, [Python Data Analysis] Python3 Excel Operations: Scraping the Douban Books Top250, a few problems remained unsolved after scraping the Douban Books Top250, so I did some further digging. Along the way I received help and pointers from 一只尼玛, for which many thanks!

    The problems left over from last time:

    1. Writing could not be continued.

    2. Results that printed correctly in Python IDLE came out as mojibake when written to Excel.

    These two problems pushed me to switch Excel-handling modules, because xlwt reportedly supports only formats up to Excel 2003, which could easily cause trouble.

   
    Although 一只尼玛 provided a Validate function, it is for stripping characters that are illegal in Windows file names; it has nothing to do with the mojibake when writing to Excel, so switching modules was still the plan.
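For reference, that function (it also appears in the final script below) just strips those characters with a regex; a minimal sketch:

```python
import re

def validate(title):
    # remove characters that are illegal in Windows file names: / \ : * ? " < > |
    rstr = r"[\/\\\:\*\?\"\<\>\|]"
    return re.sub(rstr, "", title)

print(validate('Catch-22: "A Novel"?'))   # -> Catch-22 A Novel
```

Book titles cleaned this way become safe to use in image paths like `image_dir + imageName + '.jpg'`.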

   

Switching to the xlsxwriter module

   
This time I switched to the xlsxwriter module (https://pypi.python.org/pypi/XlsxWriter). It can likewise be installed with pip3 install xlsxwriter, which downloads and installs it automatically. Some usage samples:

import xlsxwriter

# Create a new Excel file and add a worksheet.
workbook = xlsxwriter.Workbook('demo.xlsx')
worksheet = workbook.add_worksheet()

# Widen the first column to make the text clearer.
worksheet.set_column('A:A', 20)

# Add a bold format to use to highlight cells.
bold = workbook.add_format({'bold': True})

# Write some simple text.
worksheet.write('A1', 'Hello')

# Text with formatting.
worksheet.write('A2', 'World', bold)

# Write some numbers, with row/column notation.
worksheet.write(2, 0, 123)
worksheet.write(3, 0, 123.456)

# Insert an image.
worksheet.insert_image('B5', 'logo.png')

workbook.close()

So I replaced the Excel-writing code without hesitation. The result:

(screenshot: the rewritten output)

Sure enough, everything lines up: what should be a link is a link, and any character writes out fine, since it is all Unicode.

So: choosing the right module matters, choosing the right module matters, choosing the right module matters! (Important things get said three times.)

If the content you are scraping is not all clean, well-formed strings or numbers, I would not use xlwt.

Here is a comparison of four Python modules for writing Excel: http://ju.outofmemory.cn/entry/56671

I took a screenshot of the comparison below; see that article for the full details, it is very thorough!

(screenshot: module comparison table)


Going further: inserting the cover images

Since things are going this smoothly, and the module can even write images, why not give that a try?

Goal: replace the image-link column with the actual cover images!

It is actually very simple: we already saved the images and know their paths from before, so we just insert them.

    the_img = "I:\\douban\\image\\"+bookName+".jpg"
    writelist=[i+j,bookName,nickname,rating,nums,the_img,bookurl,notion,tag]
    for k in range(0,9):
        if k == 5:
            worksheet.insert_image(i+j,k,the_img)
        else:
            worksheet.write(i+j,k,writelist[k])

The output looks like this. Clearly not pretty, so let's adjust the row heights a little and try centering the cells:

(screenshot: images inserted, before formatting)

Consulting the xlsxwriter docs shows that row/column sizes and centering can be set like this (these tweaks can of course be made directly in Excel, probably faster than writing code, but I wanted to exercise the module some more):

format = workbookx.add_format()
format.set_align('justify')
format.set_align('center')
format.set_align('vjustify')
format.set_align('vcenter')
format.set_text_wrap()

worksheet.set_row(0,12,format)
for i in range(1,251):
    worksheet.set_row(i,70)
worksheet.set_column('A:A',3,format)
worksheet.set_column('B:C',17,format)
worksheet.set_column('D:D',4,format)
worksheet.set_column('E:E',7,format)
worksheet.set_column('F:F',10,format)
worksheet.set_column('G:G',19,format)
worksheet.set_column('H:I',40,format)

That completes the Excel writing. The formatting part really is tedious though: you have to keep tweaking spacing and sizes, so doing it inside Excel would be simpler.

Final code:

# -*- coding:utf-8 -*-
import requests
import re
import xlsxwriter
from bs4 import BeautifulSoup
from datetime import datetime
import codecs

now = datetime.now()             # start timing
print(now)

def validate(title):                        # from 一只尼玛
    rstr = r"[\/\\\:\*\?\"\<\>\|]"          # '/\:*?"<>|-'
    new_title = re.sub(rstr, "", title)
    return new_title

txtfile = codecs.open("top2501.txt",'w','utf-8')
url = "http://book.douban.com/top250?"

header = { "User-Agent": "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.13 Safari/537.36",
           "Referer": "http://book.douban.com/"
           }

image_dir = "I:\\douban\\image\\"
# download a cover image
def download_img(imageurl,imageName = "xxx.jpg"):
    rsp = requests.get(imageurl, stream=True)
    image = rsp.content
    path = image_dir + imageName +'.jpg'
    #print(path)
    with open(path,'wb') as file:
        file.write(image)

# create the Excel workbook
workbookx = xlsxwriter.Workbook('I:\\douban\\btop250.xlsx')
worksheet = workbookx.add_worksheet()
format = workbookx.add_format()
format.set_align('justify')
format.set_align('center')
format.set_align('vjustify')
format.set_align('vcenter')
format.set_text_wrap()

worksheet.set_row(0,12,format)
for i in range(1,251):
    worksheet.set_row(i,70)
worksheet.set_column('A:A',3,format)
worksheet.set_column('B:C',17,format)
worksheet.set_column('D:D',4,format)
worksheet.set_column('E:E',7,format)
worksheet.set_column('F:F',10,format)
worksheet.set_column('G:G',19,format)
worksheet.set_column('H:I',40,format)

item = ['书名','别名','评分','评价人数','封面','图书链接','出版信息','标签']
for i in range(1,9):
    worksheet.write(0,i,item[i-1])

s = requests.Session()      # create a session
s.get(url,headers=header)

for i in range(0,250,25):
    geturl = url + "/start=" + str(i)                     # page to fetch
    print("Now to get " + geturl)
    postData = {"start":i}                                # post data
    res = s.post(url,data = postData,headers = header)    # post
    soup = BeautifulSoup(res.content.decode(),"html.parser")       # parse with BeautifulSoup
    table = soup.findAll('table',{"width":"100%"})        # find the tables holding the book info
    sz = len(table)                                       # sz = 25, 25 books per page
    for j in range(1,sz+1):                               # j = 1~25
        sp = BeautifulSoup(str(table[j-1]),"html.parser") # parse one book's info

        imageurl = sp.img['src']                          # cover image link
        bookurl = sp.a['href']                            # book link
        bookName = sp.div.a['title']
        nickname = sp.div.span                            # alternative title
        if(nickname):                                     # store the alternative title if present, else ""
            nickname = nickname.string.strip()
        else:
            nickname = ""

        notion = str(sp.find('p',{"class":"pl"}).string)   # publication info; note .string is not a real str yet
        rating = str(sp.find('span',{"class":"rating_nums"}).string)    # rating
        nums = sp.find('span',{"class":"pl"}).string                    # number of ratings
        nums = nums.replace('(','').replace(')','').replace('\n','').strip()
        nums = re.findall('(\d+)人评价',nums)[0]
        download_img(imageurl,bookName)                     # download the cover
        book = requests.get(bookurl)                        # open the book's own page
        sp3 = BeautifulSoup(book.content,"html.parser")     # parse it
        taglist = sp3.find_all('a',{"class":"  tag"})       # find the tag info
        tag = ""
        lis = []
        for tagurl in taglist:
            sp4 = BeautifulSoup(str(tagurl),"html.parser")  # parse each tag
            lis.append(str(sp4.a.string))

        tag = ','.join(lis)        # join with commas
        the_img = "I:\\douban\\image\\"+bookName+".jpg"
        writelist=[i+j,bookName,nickname,rating,nums,the_img,bookurl,notion,tag]
        for k in range(0,9):
            if k == 5:
                worksheet.insert_image(i+j,k,the_img)
            else:
                worksheet.write(i+j,k,writelist[k])
            txtfile.write(str(writelist[k]))
            txtfile.write('\t')
        txtfile.write(u'\r\n')

end = datetime.now()    # stop timing
print(end)
print("程序耗时: " + str(end-now))
txtfile.close()
workbookx.close()
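The rating-count extraction in the script above is easy to get wrong, so it helps to exercise it in isolation (the count below is made up for illustration; only the shape of the scraped text matters):

```python
import re

# shape of the text scraped from the page (made-up count)
raw = "(\n                123456人评价\n            )"
# same cleanup steps as in the script: drop parentheses and newlines, trim spaces
nums = raw.replace('(', '').replace(')', '').replace('\n', '').strip()
# pull out the digits preceding 人评价
count = re.findall('(\d+)人评价', nums)[0]
print(count)   # -> 123456
```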

The run output:

2016-03-28 11:40:50.525635
Now to get http://book.douban.com/top250?/start=0
Now to get http://book.douban.com/top250?/start=25
Now to get http://book.douban.com/top250?/start=50
Now to get http://book.douban.com/top250?/start=75
Now to get http://book.douban.com/top250?/start=100
Now to get http://book.douban.com/top250?/start=125
Now to get http://book.douban.com/top250?/start=150
Now to get http://book.douban.com/top250?/start=175
Now to get http://book.douban.com/top250?/start=200
Now to get http://book.douban.com/top250?/start=225
2016-03-28 11:48:14.946184
程序耗时: 0:07:24.420549

(screenshot: the finished spreadsheet)

All 250 books scraped successfully. Strictly speaking, this scraping effort is now complete!

This run took 7 minutes 24 seconds, which still feels too slow. The next step should be working on efficiency.
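As a rough sketch of that next step (my own idea, not part of the script above): each book's detail page is fetched independently, so those requests could run in a thread pool. Here fetch_book is a hypothetical stand-in for the real requests.get call plus BeautifulSoup parsing:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_book(bookurl):
    # stand-in for requests.get(bookurl) followed by tag parsing
    return "parsed:" + bookurl

bookurls = ["http://book.douban.com/subject/%d/" % i for i in range(5)]

# fetch up to 8 detail pages at a time instead of strictly one by one
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(fetch_book, bookurls))

print(results[0])   # -> parsed:http://book.douban.com/subject/0/
```

pool.map preserves input order, so rows would still come back in rank order; the Excel writing itself should stay in a single thread, since an xlsxwriter workbook is not designed for concurrent writes.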


 