Python Web Scraping for Beginners, Part 2: Fetching the 2019 Chinese University Rankings

Published: 2020-12-16 23:58:25 Category: Python Source: 最好大学网

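The site we want to scrape is http://www.zuihaodaxue.cn/zuihaodaxuepaiming2019.html, and the content we need is the table on that page.

In the page's HTML, the whole table sits inside a <tbody> tag, each row is a <tr> tag, and each cell in a row is a <td> tag.

So the rough plan for the program is to find the <tbody> tag for the whole table first, then iterate over all the <tr> tags under it, and finally read the <td> tags inside each <tr>.

We store all the data in a two-dimensional list, where each inner list holds the cell values of one row.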
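The row-by-row idea can be tried without any network access. The following is only a minimal sketch: it uses Python's built-in html.parser rather than BeautifulSoup so that it runs without third-party packages, and the sample markup, the placeholder values, and the class name TableRows are all made up for illustration.

```python
from html.parser import HTMLParser

# Hypothetical sample markup; the names and numbers are placeholders,
# not real ranking data.
SAMPLE_HTML = """
<table>
  <tbody>
    <tr><td>1</td><td>University A</td><td>Province X</td><td>90.0</td></tr>
    <tr><td>2</td><td>University B</td><td>Province Y</td><td>85.5</td></tr>
  </tbody>
</table>
"""


class TableRows(HTMLParser):
    """Collect the text of every <td>, grouped by the <tr> that contains it."""

    def __init__(self):
        super().__init__()
        self.rows = []        # the 2D list: one inner list per <tr>
        self.in_cell = False  # True while the parser is inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])      # start a new row
        elif tag == "td" and self.rows:
            self.in_cell = True
            self.rows[-1].append("")  # start a new, empty cell

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        # Text outside a <td> (such as the newlines between rows) is ignored
        if self.in_cell:
            self.rows[-1][-1] += data


parser = TableRows()
parser.feed(SAMPLE_HTML)
rows = parser.rows
print(rows)
```

Each inner list corresponds to one <tr>, so rows[1][1] is the second cell of the second row. The whitespace between the rows is skipped here for the same reason the BeautifulSoup version has to filter out plain strings among the children of <tbody>.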

The code is as follows:

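import requests
import bs4
from bs4 import BeautifulSoup

def get_html_text(url):
    '''Return the HTML source of the page, or an empty string on failure.'''
    try:
        res = requests.get(url, timeout=6)
        res.raise_for_status()
        # Guess the encoding from the content so Chinese text decodes correctly
        res.encoding = res.apparent_encoding
        return res.text
    except requests.RequestException:
        return ''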

def fill_ulist(ulist, html):
    '''Append the data we need to the list ulist, one inner list per table row.'''
    # Parse the HTML and get the parsed object soup
    soup = BeautifulSoup(html, 'html.parser')
    # Get the first <tbody> tag
    tbody = soup.tbody
    # Nothing to do if the download failed or the page has no table body
    if tbody is None:
        return
    # Iterate over the children of <tbody>: its <tr> tags and any strings between them
    for tr in tbody.children:
        # Skip the strings
        if isinstance(tr, bs4.element.Tag):
            # Use find_all() to get every <td> tag in this <tr>
            u = tr.find_all('td')
            # Store the string content of the first four <td> tags in ulist
            ulist.append([u[0].string, u[1].string, u[2].string, u[3].string])

def display_urank(ulist):
    '''Print the university rankings as an aligned table.'''
    tplt = "{:^5}\t{:{ocp}^12}\t{:{ocp}^5}\t{:^5}"
    # To keep Chinese text aligned, pad with the full-width space chr(12288),
    # which is as wide as a Chinese character
    print(tplt.format("排名", "大学名称", "省市", "总分", ocp=chr(12288)))
    for u in ulist:
        print(tplt.format(u[0], u[1], u[2], u[3], ocp=chr(12288)))

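def write_in_file(ulist, file_path):
    '''Write the university rankings to a file.'''
    tplt = "{:^5}\t{:{ocp}^12}\t{:{ocp}^5}\t{:^5}\n"
    # Write as UTF-8 so the Chinese text is saved correctly on any platform
    with open(file_path, 'w', encoding='utf-8') as file_object:
        file_object.write('软科中国最好大学排名2019版：\n\n')
        file_object.write(tplt.format("排名", "大学名称", "省市", "总分", ocp=chr(12288)))
        for u in ulist:
            file_object.write(tplt.format(u[0], u[1], u[2], u[3], ocp=chr(12288)))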

def main():
    '''Entry point: download the page, parse it, print it, and save it.'''
    ulist = []
    url = 'http://www.zuihaodaxue.cn/zuihaodaxuepaiming2019.html'
    file_path = 'university rankings.txt'
    html = get_html_text(url)
    fill_ulist(ulist, html)
    display_urank(ulist)
    write_in_file(ulist, file_path)

if __name__ == '__main__':
    main()
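The alignment trick with chr(12288) is easier to see in isolation. The snippet below is a small sketch, not part of the original program: the helper name center_cjk is made up here, and the only thing it relies on is that str.format accepts any single character as the fill character.

```python
# chr(12288) is U+3000 IDEOGRAPHIC SPACE, which is as wide as a Chinese character.
FULL_WIDTH_SPACE = chr(12288)


def center_cjk(text, width):
    """Center text in `width` characters, padding with full-width spaces."""
    return "{:{fill}^{width}}".format(text, fill=FULL_WIDTH_SPACE, width=width)


# Default padding uses the narrow ASCII space
ascii_padded = "{:^6}".format("省市")
# Same width, but padded with full-width spaces
cjk_padded = center_cjk("省市", 6)

print(repr(ascii_padded))
print(repr(cjk_padded))
```

Both strings are six characters long, but in most terminal fonts an ASCII space is narrower than a Chinese character, so columns padded with it tend to drift. Padding with the full-width space keeps every character in the column the same width.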

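Running the program prints the ranking table to the console and writes the same table to university rankings.txt.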

(Editor: 李大同)
