php – How do I fetch dynamic content from a website and save it?

Published: 2020-12-13 17:03:07 | Category: PHP tutorials | Source: collected from the web


For example, I need to get the amount of free storage space shown on http://gmail.com/:

Over <span id=quota>2757.272164</span> megabytes (and counting) of free storage.

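I then want to store these numbers in a MySQL database.
As you can see, the number changes dynamically.

Is there a way to set up a server-side script that grabs the number every time it changes and saves it to the database?

Thanks.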

Solution

Since Gmail does not provide an API for this information, it sounds like you want to do some web scraping.

Web scraping (also called Web
harvesting or Web data extraction) is
a computer software technique of
extracting information from websites

There are many ways to do this, as described in the Wikipedia article linked above:

Human copy-and-paste: Sometimes even
the best Web-scraping technology can
not replace human’s manual examination
and copy-and-paste, and sometimes this
may be the only workable solution when
the websites for scraping explicitly
setup barriers to prevent machine
automation.

Text grepping and regular expression
matching: A simple yet powerful
approach to extract information from
Web pages can be based on the UNIX
grep command or regular expression
matching facilities of programming
languages (for instance Perl or
Python).

HTTP programming: Static and dynamic
Web pages can be retrieved by posting
HTTP requests to the remote Web server
using socket programming.

DOM parsing: By embedding a
full-fledged Web browser, such as the
Internet Explorer or the Mozilla Web
browser control, programs can retrieve
the dynamic contents generated by
client side scripts. These Web browser
controls also parse Web pages into a
DOM tree, based on which programs can
retrieve parts of the Web pages.

HTML parsers: Some semi-structured
data query languages, such as the XML
query language (XQL) and the
hyper-text query language (HTQL), can
be used to parse HTML pages and to
retrieve and transform Web content.

Web-scraping software: There are many
Web-scraping software available that
can be used to customize Web-scraping
solutions. These software may provide
a Web recording interface that removes
the necessity to manually write
Web-scraping codes, or some scripting
functions that can be used to extract
and transform Web content, and
database interfaces that can store the
scraped data in local databases.

Semantic annotation recognizing: The
Web pages may embrace metadata or
semantic markups/annotations which can
be made use of to locate specific data
snippets. If the annotations are
embedded in the pages, as Microformat
does, this technique can be viewed as
a special case of DOM parsing. In
another case, the annotations,
organized into a semantic layer,
are stored and managed separated to
the Web pages, so the Web scrapers can
retrieve data schema and instructions
from this layer before scraping the
pages.
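
To make two of the approaches above concrete, here is a minimal sketch that pulls the number out of the snippet quoted in the question, once with a regular expression and once with an HTML parser. It is written in Python using only the standard library; the names `SAMPLE`, `quota_by_regex`, `QuotaParser` and `quota_by_parser` are made up for illustration, and the sketch assumes the page really contains a `span` whose `id` is `quota`, as in the question. It does not log in to Gmail or fetch anything over the network.

```python
import re
from html.parser import HTMLParser

# The snippet quoted in the question (used here instead of a live page).
SAMPLE = ("Over <span id=quota>2757.272164</span> megabytes "
          "(and counting) of free storage.")

# Regex approach: match a span whose id is "quota" (quoted or unquoted)
# and capture the decimal number inside it.
QUOTA_RE = re.compile(
    r"<span\s+id=[\"']?quota[\"']?\s*>\s*([0-9]+(?:\.[0-9]+)?)\s*</span>",
    re.IGNORECASE,
)


def quota_by_regex(html):
    """Return the quota as a float, or None if the pattern is absent."""
    match = QUOTA_RE.search(html)
    return float(match.group(1)) if match else None


class QuotaParser(HTMLParser):
    """Collect the text of the first element whose id is "quota"."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self.text = None

    def handle_starttag(self, tag, attrs):
        if self.text is None and dict(attrs).get("id") == "quota":
            self._inside = True

    def handle_data(self, data):
        if self._inside and self.text is None:
            self.text = data.strip()

    def handle_endtag(self, tag):
        self._inside = False


def quota_by_parser(html):
    """Return the quota as a float, or None if no id="quota" element exists."""
    parser = QuotaParser()
    parser.feed(html)
    parser.close()
    return float(parser.text) if parser.text else None


if __name__ == "__main__":
    print(quota_by_regex(SAMPLE))
    print(quota_by_parser(SAMPLE))
```

The regular expression is shorter but breaks as soon as the markup changes shape (extra attributes, nested tags); the parser version tolerates more variation at the cost of a little more code.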

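Before I go on, keep in mind the legal implications of all this. I don't know whether it complies with Gmail's terms, and I recommend checking them before you proceed. You could end up blacklisted or run into other problems.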

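All that said, in your case you need some kind of spider and a DOM parser to log in to Gmail and find the data you want. The choice of tool depends on your technology stack.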

As a Ruby dev, I like using Mechanize and Nokogiri. With PHP, you could look at solutions like Sphider.
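
Whatever tool does the scraping, the "save it each time it changes" part of the question could be sketched roughly as follows. A remote page cannot notify your server when a number changes, so the usual pattern is to poll on a schedule and insert a row only when the value differs from the last one stored. This sketch is in Python and uses the standard-library `sqlite3` module purely as a stand-in for MySQL so that it runs without a database server; the table name `quota_log` and the functions `open_store` and `save_if_changed` are hypothetical.

```python
import sqlite3
from datetime import datetime, timezone


def open_store(path=":memory:"):
    """Open the database (SQLite here, standing in for MySQL) and create the table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS quota_log ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "megabytes REAL NOT NULL, "
        "fetched_at TEXT NOT NULL)"
    )
    return conn


def save_if_changed(conn, megabytes):
    """Insert a row only when the value differs from the latest stored one.

    Returns True if a row was written, False if the value was unchanged.
    """
    row = conn.execute(
        "SELECT megabytes FROM quota_log ORDER BY id DESC LIMIT 1"
    ).fetchone()
    if row is not None and row[0] == megabytes:
        return False
    conn.execute(
        "INSERT INTO quota_log (megabytes, fetched_at) VALUES (?, ?)",
        (megabytes, datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()
    return True


if __name__ == "__main__":
    conn = open_store()
    # Simulated successive readings; the second repeats the first.
    for value in (2757.272164, 2757.272164, 2757.272201):
        print(value, save_if_changed(conn, value))
```

In practice a scheduler such as cron would run the fetch-and-save step at whatever interval is acceptable, with the scraped value passed to `save_if_changed`.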

(Editor: 李大同)
