Python Web Scraping Practice


Example 1: Scraping a page fails with the error "'gbk' codec can't encode character '\xa0' in position 6: illegal ...":

 from DrawStu.DrawStu import DrawStu

 # Instantiate the scraper class to get an object
 draw = DrawStu()

 if __name__ == '__main__':
     print('Scraping graduate admission adjustment (调剂) information')
     size = draw.get_page_size()
     print(size)
     for x in range(size):
         # The start offset advances by 50 for each list page
         start = x * 50
         print(start)
         print('https://yz.chsi.com.cn/kyzx/tjxx/?start=' + str(start))

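 """Core scraping module; its only job is to scrape graduate admission adjustment (调剂) information."""
 import urllib.request

 from bs4 import BeautifulSoup


 class DrawStu:
     """Scraper for the graduate admission adjustment list."""

     def __init__(self):
         self.baseurl = 'https://yz.chsi.com.cn/kyzx/tjxx/'

     # Scrape the basic list (not implemented yet)
     def draw_base_list(self, url):
         print('url is:::', url)

     # Get the total number of list pages
     def get_page_size(self):
         requesturl = self.baseurl
         response = urllib.request.urlopen(requesturl)
         html = response.read()  # raw bytes; BeautifulSoup works out the encoding
         print(html)
         # Name the parser explicitly so bs4 does not warn about a missing parser
         doc = BeautifulSoup(html, 'html.parser')
         pcxt = doc.find('div', {'class': 'pageC'}).find_all('span')[0].text
         print(pcxt)
         # The re module could be used here; plain string slicing is used instead
         pagesize = pcxt.strip()
         pagearr = pagesize.split('/')
         pagestr = pagearr[1]
         # Keeps only the first two characters, so this assumes at most a two-digit page count
         return int(pagestr[0:2])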
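The comment in get_page_size mentions regular expressions as an alternative to slicing. The following is only a rough sketch of that idea. The helper name parse_page_count and the sample strings are assumptions made for illustration, because the real text of the pageC span is not shown in this post.

```python
import re


def parse_page_count(text):
    """Return the number after the slash in text such as '1/123'.

    Hypothetical helper: the real page text may look different.
    """
    # \s also matches '\xa0' (non-breaking space) in Python 3 str patterns
    match = re.search(r'/\s*(\d+)', text)
    if match is None:
        raise ValueError('no page count found in: %r' % text)
    return int(match.group(1))


# Sample input is made up; it mimics text padded with non-breaking spaces
print(parse_page_count('\xa01/123\xa0'))
```

Unlike pagestr[0:2], this does not depend on how many digits the page count has.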

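Running the scraper produces the error message quoted above.

I tried many approaches without success. In the end a classmate gave me a general-purpose snippet for this kind of output encoding problem, which I am sharing here.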

import io
import sys

sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='gb18030')

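Adding this snippet at the top of the script solves the problem. GB18030 extends GBK and can encode every Unicode character, including '\xa0' (a non-breaking space), so print no longer fails on it.

With the fix in place, the program runs correctly.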
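To see why switching the output encoding helps, here is a small sketch that can be run without a Windows console. The sample string and the helper can_encode are made up for illustration, and an in-memory buffer stands in for sys.stdout.buffer.

```python
import io

# Made-up sample text containing a non-breaking space ('\xa0')
text = 'page\xa01/25'


def can_encode(s, encoding):
    """Return True if s can be encoded with the given codec."""
    try:
        s.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


print(can_encode(text, 'gbk'))      # gbk has no mapping for '\xa0'
print(can_encode(text, 'gb18030'))  # gb18030 covers all of Unicode

# Wrap a byte buffer the same way the snippet wraps sys.stdout.buffer
buffer = io.BytesIO()
out = io.TextIOWrapper(buffer, encoding='gb18030')
out.write(text)
out.flush()
```

The same wrapper applied to sys.stdout.buffer is what lets print accept the scraped text.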

Example 2: Running pip install XX fails with the following error:

pip._vendor.urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='files.pythonhosted.org', port=443): Read timed out.

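Solution:

Set a longer timeout (the value is in seconds):
pip --default-timeout=100 install -U Pillow (replace Pillow with the name of the package being installed)

It took me a long time to find this fix. Source: https://blog.csdn.net/m0_43432638/article/details/84400474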

Example 3: Writing to a CSV file fails with TypeError: sequence item 0: expected str instance, int found

 number_lst = [1, 2, 3, 4]

 with open('price2017.csv', 'a', encoding='utf8') as f:
     # Convert each list element to a string before joining;
     # the with block closes the file, so no f.close() is needed
     f.write(','.join(str(x) for x in number_lst))

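Solution: str.join() accepts only strings, so passing the list of integers to it directly raises the TypeError. Convert each element to a string first, for example: print(" ".join('%s' % id for id in number_lst))

Source: https://blog.csdn.net/laochu250/article/details/67649210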
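As a rough illustration of Example 3, the sketch below reproduces the error, applies the string conversion, and also shows the standard csv module, which converts non-string values itself. Using csv.writer is a swapped-in alternative, not what the original code does, and an in-memory buffer replaces price2017.csv so the sketch does not touch any file.

```python
import csv
import io

number_lst = [1, 2, 3, 4]

# Joining integers directly reproduces the error from this example
try:
    ','.join(number_lst)
except TypeError as exc:
    error_message = str(exc)
    print(error_message)

# Fix: convert every element to a string before joining
joined = ','.join(str(n) for n in number_lst)
print(joined)

# Alternative: csv.writer turns non-string values into strings on its own
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(number_lst)
csv_line = buffer.getvalue()
```

When writing to a real file with csv.writer, open it with newline='' as the csv documentation recommends.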