
Wikipedia scraping - need help structuring it

Published: 2020-12-17 17:39:49 Category: Python Source: compiled from the web

I am trying to scrape this Wikipedia page.

I have run into a few problems and would really appreciate your help:

  1. Some rows have more than one name or link, and I want all of them assigned to the correct country. Is there any way to do that?

  2. I want to skip the 'Name (native)' column. How can I do that?

  3. If I do scrape the 'Name (native)' column, I get some gibberish. Is there any way to fix the encoding?

import requests
from bs4 import BeautifulSoup
import csv
import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_government_gazettes'
source = requests.get(url).text

soup = BeautifulSoup(source,'lxml')
table = soup.find('table',class_='wikitable').tbody

rows = table.findAll('tr')

columns = [col.text.encode('utf').replace('\xc2\xa0','').replace('\n','') for col in rows[1].find_all('td')]
print(columns)
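
A note on question 3 for readers on Python 3: there, `col.text` is already a Unicode `str`, so calling `.encode()` before cleaning is probably what produces the gibberish (the result is a `bytes` object, and its `.replace()` does not accept `str` arguments). The following is a minimal sketch of cleaning a cell without encoding it. The helper name `clean_cell` and the sample string are illustrative assumptions, not part of the original code.

```python
def clean_cell(text: str) -> str:
    """Tidy the whitespace of one table cell, leaving non-ASCII letters intact."""
    # str.split() with no arguments splits on any Unicode whitespace,
    # which includes the non-breaking space (U+00A0) and newlines.
    return " ".join(text.split())


# Illustrative sample: a native-language name with a non-breaking space
# and a trailing newline, as a scraped cell might contain.
sample = "Fletorja Zyrtare e Republikës\xa0së Shqipërisë\n"
print(clean_cell(sample))
```

In the list comprehension above, this would be used as `clean_cell(col.text)` in place of the `encode(...).replace(...)` chain.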
Best answer
You can use the pandas function read_html and take the second DataFrame from the list of DataFrames it returns:

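import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_government_gazettes'
# read_html returns a list of DataFrames; keep the whole table and only preview the first rows
df = pd.read_html(url)[1]
print(df.head())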
       Country/region                                              Name  
0              Albania       Official Gazette of the Republic of Albania   
1              Algeria                                  Official Gazette   
2              Andorra  Official Bulletin of the Principality of Andorra   
3  Antigua and Barbuda              Antigua and Barbuda Official Gazette   
4            Argentina     Official Gazette of the Republic of Argentina   

                                 Name (native)                    Website  
0  Fletorja Zyrtare E Republikës Së Shqipërisë                 qbz.gov.al  
1                   Journal Officiel d'Algérie              joradp.dz/HAR  
2     Butlletí Oficial del Principat d'Andorra                www.bopa.ad  
3         Antigua and Barbuda Official Gazette    www.legalaffairs.gov.ag  
4    Boletín Oficial de la República Argentina  www.boletinoficial.gob.ar 

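If you check the full output, row 26 has a problem, because the data on the Wiki page itself is also wrong.

The fix is to set the value by row label and column name: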

import numpy as np
df.loc[26, 'Name (native)'] = np.nan
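
Questions 1 and 2 can be approached on the same DataFrame. The following is a rough sketch rather than a definitive solution. It assumes that a country with several gazettes shows up as several rows that repeat the same 'Country/region' value, which should be checked against the real table. The small DataFrame is a made-up stand-in so the example runs offline, and the names `trimmed` and `grouped` are illustrative.

```python
import pandas as pd

# Hypothetical stand-in for the table returned by pd.read_html(url)[1];
# the rows are invented for illustration, not copied from the live page.
df = pd.DataFrame(
    {
        "Country/region": ["Exampleland", "Exampleland", "Otherland"],
        "Name": ["Gazette A", "Gazette B", "Gazette C"],
        "Name (native)": ["Native A", "Native B", "Native C"],
        "Website": ["a.example", "b.example", "c.example"],
    }
)

# Question 2: skip the 'Name (native)' column by dropping it.
trimmed = df.drop(columns=["Name (native)"])

# Question 1: collect every name and website under its country.
# sort=False keeps countries in their order of first appearance.
grouped = (
    trimmed.groupby("Country/region", sort=False)
    .agg({"Name": list, "Website": list})
    .reset_index()
)
print(grouped)
```

Each country then occupies one row, with its names and websites held as lists. If one row per gazette is preferred, `trimmed` already has that shape.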

(Editor: 李大同)
