Navigating with Selenium and scraping with BeautifulSoup in Python
Published: 2020-12-20 11:10:03 | Category: Python | Source: collected from the web
Okay, here's what I'm trying to achieve:

1. Call a URL with a dynamically filtered list of search results
2. Click the first search result (5 per page)
3. Scrape the headline, paragraphs and images, and store them as a JSON object in a separate file, e.g.:

{
  "Title": "Headline element of the individual entry",
  "Content": "paragraphs and images of the individual entry, in DOM order"
}

What I have so far:

```python
# import libraries
from selenium import webdriver
from bs4 import BeautifulSoup

# URL
url = "https://URL.com"

# Create a browser session
driver = webdriver.Chrome("PATH TO chromedriver.exe")
driver.implicitly_wait(30)
driver.get(url)

# click consent btn on destination URL (overlays rest of the content)
python_consentButton = driver.find_element_by_id('acceptAllCookies')
python_consentButton.click()  # click cookie consent btn

# Selenium hands the page source to Beautiful Soup
soup_results_overview = BeautifulSoup(driver.page_source, 'lxml')

for link in soup_results_overview.findAll("a", class_="searchResults__detail"):
    # Selenium visits each Search Result Page
    searchResult = driver.find_element_by_class_name('searchResults__detail')
    searchResult.click()  # click Search Result
    # Ask Selenium to go back to the search results overview page
    driver.back()

# Tell Selenium to click the paginate "next" link
# probably needs to be in a surrounding for loop?
paginate = driver.find_element_by_class_name('pagination-link-next')
paginate.click()  # click paginate next

driver.quit()
```

Problem: this may be a textbook case for a recursive approach, but I'm not sure. Any suggestions on how to tackle this are appreciated.

Solution
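One pitfall in the question's loop: after `click()` and `driver.back()` the page is re-rendered, so previously found elements go stale, and `find_element_by_class_name` always returns the first match. A common workaround is to let BeautifulSoup collect all result hrefs first and then `driver.get()` each one. A minimal sketch with hypothetical markup (the class name comes from the question; the URLs are made up):

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the search results page
html = """
<div class="searchResults">
  <a class="searchResults__detail" href="/article/1">First result</a>
  <a class="searchResults__detail" href="/article/2">Second result</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Collect the hrefs up front; Selenium would then driver.get() each URL
# instead of clicking and going back, avoiding stale-element errors
links = [a["href"] for a in soup.find_all("a", class_="searchResults__detail")]
print(links)  # ['/article/1', '/article/2']
```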
Without Selenium, you can scrape using only requests and BeautifulSoup. It will be faster and consume fewer resources:
```python
import json
import requests
from bs4 import BeautifulSoup

# Get 1000 results
params = {
    "$filter": "TemplateName eq 'Application Article'",
    "$orderby": "ArticleDate desc",
    "$top": "1000",
    "$inlinecount": "allpages",
}
response = requests.get("https://www.cst.com/odata/Articles", params=params).json()

# iterate 1000 results
articles = response["value"]
for article in articles:
    article_json = {}
    article_content = []
    # title of article
    article_title = article["Title"]
    # article url
    article_url = str(article["Url"]).split("|")[1]
    print(article_title)
    # request article page and parse it
    article_page = requests.get(article_url).text
    page = BeautifulSoup(article_page, "html.parser")
    # get header
    header = page.select_one("h1.head--bordered").text
    article_json["Title"] = str(header).strip()
    # get body content with image links and descriptions
    content = page.select("section.content p, section.content img, "
                          "section.content span.imageDescription, "
                          "section.content em")
    # collect content into json format
    for x in content:
        if x.name == "img":
            article_content.append("https://cst.com/solutions/article/" + x.attrs["src"])
        else:
            article_content.append(x.text)
    article_json["Content"] = article_content
    # write to json file
    with open(f"{article_json['Title']}.json", 'w') as to_json_file:
        to_json_file.write(json.dumps(article_json))

print("the end")
```
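The per-article files the loop writes have exactly the shape the question asked for. The serialization step can be sketched with the standard library alone; the sample data below is made up for illustration, not real scrape output:

```python
import json

# Hypothetical sample standing in for one scraped article
article_json = {
    "Title": "Headline element of the individual entry",
    "Content": [
        "First paragraph in DOM order",
        "https://cst.com/solutions/article/images/figure1.png",
    ],
}

# One file per article, named after the title, as the answer's loop does
filename = f"{article_json['Title']}.json"
with open(filename, "w") as to_json_file:
    to_json_file.write(json.dumps(article_json))
```

Note that using the scraped title as a filename will fail if it contains characters like `/`, so sanitizing it first is a sensible extra step.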