python - Error: ValueError("No <form> element found in %s") when trying to log in with Scrapy

Published: 2020-12-20 13:06:47 | Category: Python | Source: compiled from the web


Problem description:

I want to scrape some information from my university's BBS. This is the address: http://bbs.byr.cn
Here is the code of my spider:

from lxml import etree
import scrapy
try:
    from scrapy.spiders import Spider
except ImportError:
    from scrapy.spiders import BaseSpider as Spider
from scrapy.http import Request

class ITJobInfoSpider(scrapy.Spider):
    name = "ITJobInfoSpider"
    start_urls = ["http://bbs.byr.cn/#!login"]

    def parse(self, response):
        # Fill in and submit the login form located by formxpath.
        return scrapy.FormRequest.from_response(
            response,
            formdata={'method': 'post', 'id': 'username', 'passwd': 'password'},
            formxpath='//form[@action="/login"]',
            callback=self.after_login
        )

    def after_login(self, response):
        # Dump the response and check for the failure marker.
        print "######response body: " + response.body + "\n"
        if "authentication failed" in response.body:
            print "#######Login failed#########\n"
        return

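However, with this code I keep getting the error: raise ValueError("No <form> element found in %s" % response)

My investigation:

I found that this error occurs when Scrapy tries to parse the HTML of the URL http://bbs.byr.cn. Scrapy parses the page with lxml; the relevant code is: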

root = LxmlDocument(response,lxml.html.HTMLParser)
forms = root.xpath('//form')
if not forms:
    raise ValueError("No <form> element found in %s" % response)

So I inspected the parsed document with:
print etree.tostring(root)
and found that the HTML element </form> had been parsed as &lt;/form&gt;.
No wonder the code forms = root.xpath('//form') returns an empty list of forms.
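One way this symptom can arise, offered as an assumption rather than a confirmed diagnosis of bbs.byr.cn, is that the form markup exists only inside a <script> block and is meant to be rendered by JavaScript. An HTML parser treats script content as text, so no form element ever reaches the tree. The sketch below uses the standard library's html.parser instead of lxml to keep it dependency-free; both pages are hypothetical.

```python
from html.parser import HTMLParser


class FormCounter(HTMLParser):
    """Counts <form> start tags that the parser sees as real elements."""

    def __init__(self):
        super().__init__()
        self.forms = 0

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.forms += 1


def count_forms(html):
    """Return the number of <form> elements found in the given HTML string."""
    counter = FormCounter()
    counter.feed(html)
    counter.close()
    return counter.forms


# Hypothetical pages for illustration; not the real markup of bbs.byr.cn.
STATIC_PAGE = '<html><body><form action="/login"></form></body></html>'
SCRIPTED_PAGE = (
    "<html><body><script>"
    "var tpl = '<form action=\"/login\"></form>';"
    "</script></body></html>"
)

print(count_forms(STATIC_PAGE))    # form is a real element
print(count_forms(SCRIPTED_PAGE))  # form is only text inside <script>
```

The first page reports one form and the second reports none, which is the situation in which a form lookup such as root.xpath('//form') comes back empty.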

But I don't know why this is happening. Could it be the HTML encoding? (The HTML is encoded as GBK, not UTF-8.)
Thanks in advance to anyone who can help me out. By the way, if anyone wants to write code against the website, I can provide a test account; please leave an email address in the comments.

Thanks a lot, guys!

Solution

It seems that some JavaScript redirection is happening.

Using Splash would be overkill in this case. Just append /index to the start URL: http://bbs.byr.cn → http://bbs.byr.cn/index
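A small sketch of why the original start URL lands on the bare root page: everything after # is a URL fragment, which is handled on the client side and is not sent to the server in the HTTP request. The snippet uses only urllib.parse from the standard library; the variable names are illustrative.

```python
from urllib.parse import urldefrag, urljoin

# Start URL used by the spider in the question.
ORIGINAL_START_URL = "http://bbs.byr.cn/#!login"

# Split off the fragment; only the first part is sent to the server.
requested_url, fragment = urldefrag(ORIGINAL_START_URL)

# Build the start URL suggested in the answer by appending "index".
fixed_start_url = urljoin(requested_url, "index")

print(requested_url)    # what the server actually receives
print(fragment)         # left for client-side JavaScript to interpret
print(fixed_start_url)  # page requested directly instead
```

Since Scrapy does not execute JavaScript, the "#!login" part has no effect on what is downloaded, which is consistent with requesting /index directly.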

This would be the complete working spider:

from scrapy import Spider
from scrapy.http import FormRequest

class ByrSpider(Spider):
    name = 'byr'
    start_urls = ['http://bbs.byr.cn/index']

    def parse(self,response):
        return FormRequest.from_response(
            response,callback=self.after_login)

    def after_login(self,response):
        self.logger.debug(response.text)
        if 'authentication failed' in response.text:
            self.logger.debug('Login failed')

(Editor: 李大同)
