python – Why does this pickle hit maximum recursion depth without any recursion?
Here is my code. It contains no recursion, yet it hits the maximum recursion depth on the very first pickle…
Code:

#!/usr/bin/env python
from bs4 import BeautifulSoup
from urllib2 import urlopen
import pickle

# open page and return soup list
def get_page_startups(page_url):
    html = urlopen(page_url).read()
    soup = BeautifulSoup(html,"lxml")
    return soup.find_all("div","startup item")

#
# Get certain text from startup soup
#
def get_name(startup):
    return startup.find("a","profile").string

def get_website(startup):
    return startup.find("a","visit")["href"]

def get_status(startup):
    return startup.find("p","status").strong.string[8:]

def get_twitter(startup):
    return startup.find("a","comment").string

def get_high_concept_pitch(startup):
    return startup.find("div","headline").find_all("em")[1].string

def get_elevator_pitch(startup):
    startup_soup = BeautifulSoup(urlopen("http://startupli.st" + startup.find("a","profile")["href"]).read(),"lxml")
    return startup_soup.find("p","desc").string.rstrip().lstrip()

def get_tags(startup):
    return startup.find("p","tags").string

def get_blog(startup):
    try:
        return startup.find("a","visit blog")["href"]
    except TypeError:
        return None

def get_facebook(startup):
    try:
        return startup.find("a","visit facebook")["href"]
    except TypeError:
        return None

def get_angellist(startup):
    try:
        return startup.find("a","visit angellist")["href"]
    except TypeError:
        return None

def get_linkedin(startup):
    try:
        return startup.find("a","visit linkedin")["href"]
    except TypeError:
        return None

def get_crunchbase(startup):
    try:
        return startup.find("a","visit crunchbase")["href"]
    except TypeError:
        return None

# site to scrape
BASE_URL = "http://startupli.st/startups/latest/"

# scrape all pages
for page_no in xrange(1,142):
    startups = get_page_startups(BASE_URL + str(page_no))

    # search soup and pickle data
    for i,startup in enumerate(startups):
        s = {}
        s['name'] = get_name(startup)
        s['website'] = get_website(startup)
        s['status'] = get_status(startup)
        s['high_concept_pitch'] = get_high_concept_pitch(startup)
        s['elevator_pitch'] = get_elevator_pitch(startup)
        s['tags'] = get_tags(startup)
        s['twitter'] = get_twitter(startup)
        s['facebook'] = get_facebook(startup)
        s['blog'] = get_blog(startup)
        s['angellist'] = get_angellist(startup)
        s['linkedin'] = get_linkedin(startup)
        s['crunchbase'] = get_crunchbase(startup)

        f = open(str(i)+".pkl","wb")
        pickle.dump(s,f)
        f.close()

    print "Done " + str(page_no)

Here is the content of 0.pkl after the exception was raised: http://pastebin.com/DVS1GKzz

It is a thousand lines long! There is HTML from BASE_URL inside the pickle… but I never pickled any HTML strings…
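The symptom can be reproduced in isolation. Below is a minimal sketch (Python 2, matching the code above; exactly where pickling succeeds, fails, or recurses varies by bs4 version, so the size check is wrapped in a try/except):

# Minimal sketch: .string returns a NavigableString, which keeps references
# to its parent tags, so pickle tries to serialize the entire parse tree
# along with the three characters of text.
import pickle
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>Foo</p></div>", "lxml")
text = soup.find("p").string        # a NavigableString, not a plain str
print type(text)                    # <class 'bs4.element.NavigableString'>
print text.parent.name              # "p" -- it still points into the tree

try:
    data = pickle.dumps(text)
    print len(data)                 # typically far more than the 3 chars of u"Foo"
    print "div" in data             # likely True: parent tags came along
except (RuntimeError, TypeError) as e:
    # on a large page this is where "maximum recursion depth exceeded" appears
    print "pickling failed:", e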
Solution

The BeautifulSoup .string attribute is not actually a string:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<div>Foo</div>')
>>> soup.find('div').string
u'Foo'
>>> type(soup.find('div').string)
bs4.element.NavigableString

Try using str(soup.find('div').string) and see if that helps. Also, I don't think pickle is really the best format here; JSON would be much easier in this case.
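Building on that suggestion, here is a minimal sketch of both fixes (the sample HTML and the 0.json filename are made up for illustration): copy the text out of each NavigableString before serializing, and write JSON instead of pickle.

import json
from bs4 import BeautifulSoup

# Made-up sample HTML standing in for one "startup item" div.
soup = BeautifulSoup("<div><a class='profile'>Acme</a></div>", "lxml")

s = {}
# unicode() (or str(), as the answer suggests, for ASCII-only text) copies
# the characters out of the NavigableString, dropping its references back
# into the parse tree.
s['name'] = unicode(soup.find("a", "profile").string)

with open("0.json", "w") as f:
    json.dump(s, f)    # the dict now holds only builtin types

Note that json serializes str subclasses by their character content alone, so even an un-coerced NavigableString would not drag the parse tree into the output the way pickle does.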