
bash – The logic behind wget's download order

Published: 2020-12-15 17:01:09 | Category: Security | Source: collected from the web
This is a more general question, but it has wider implications for a data-mining project I'm running. I've been using wget to mirror archived web pages for analysis. It's a large amount of data, and my current mirroring process has been running for almost a week, which has given me plenty of time to watch the readout.

How does wget determine the order in which it downloads pages? I can't seem to discern any consistent logic in its decision-making (it doesn't go alphabetically, by the original site's creation dates, or by file type). Understanding this would be a big help once I start working with the data.

FWIW, this is the command I'm using (it requires cookies, and while the site's TOS allows "access" by any means, I don't want to take any chances) – where SITE = URL:

wget -m --cookies=on --keep-session-cookies --load-cookies=cookie3.txt --save-cookies=cookie4.txt --referer=SITE --random-wait --wait=1 --limit-rate=30K --user-agent="Mozilla 4.0" SITE

Edited to add: in a comment on Chown's helpful answer I refined my question a bit, so here it is. For larger sites – for example epe.lac-bac.gc.ca/100/205/301/ic/cdc/E/Alphabet.asp – I found that wget initially creates a directory structure and some index.html/default.html pages, but then comes back in several distinct passes, grabbing a few images and sub-pages on each pass.

From the gnu.org wget manual, Recursive Download:
  • Recursive Download

GNU Wget is capable of traversing parts of the Web (or a single http
or ftp server), following links and directory structure. We refer to
this as to recursive retrieval, or recursion.

With http urls, Wget retrieves and parses the html or css from the
given url, retrieving the files the document refers to, through markup
like href or src, or css uri values specified using the ‘url()’
functional notation. If the freshly downloaded file is also of type
text/html, application/xhtml+xml, or text/css, it will be parsed and
followed further.

Recursive retrieval of http and html/css content is breadth-first.
This means that Wget first downloads the requested document, then the
documents linked from that document, then the documents linked by
them, and so on. In other words, Wget first downloads the documents at
depth 1, then those at depth 2, and so on until the specified maximum
depth.

The maximum depth to which the retrieval may descend is specified with
the ‘-l’ option. The default maximum depth is five layers.

When retrieving an ftp url recursively, Wget will retrieve all the
data from the given directory tree (including the subdirectories up to
the specified depth) on the remote server, creating its mirror image
locally. ftp retrieval is also limited by the depth parameter. Unlike
http recursion, ftp recursion is performed depth-first.

By default, Wget will create a local directory tree, corresponding to
the one found on the remote server.

…. snip ….

Recursive retrieval should be used with care. Don’t say you were not
warned.
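The breadth-first behaviour the manual describes can be sketched as a plain queue walk over a toy link graph. The page names below (index, a, b, c, a1, c1) are made up for illustration; the point is only that every depth-1 link is fetched before any depth-2 link:

```shell
# Hypothetical link graph: "page:link" pairs, one per edge.
edges="index:c index:a index:b a:a1 c:c1"

# Breadth-first walk, as for HTTP recursion: the positional
# parameters act as a queue; links found on each page join the back.
set -- index
order=""
while [ $# -gt 0 ]; do
  page=$1; shift                # "fetch" the page at the front of the queue
  order="$order $page"
  for e in $edges; do           # enqueue every page this one links to
    [ "${e%%:*}" = "$page" ] && set -- "$@" "${e#*:}"
  done
done
echo "${order# }"               # depth 1 (c a b) before depth 2 (c1 a1)
```

Running this prints `index c a b c1 a1`: the whole of depth 1 comes before anything at depth 2, which is why a mirror of a large site appears to make repeated "passes" over the tree rather than finishing one directory at a time.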

From my own basic testing, when the structure is only one level deep, it goes in order of appearance, from the top of the page to the bottom:

[ 16:28 root@host /var/www/html ]# cat index.html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html lang="en-US">
    <head>
        <link rel="stylesheet" type="text/css" href="style.css">
    </head>
    <body>
        <div style="text-align:center;">
            <h2>Mobile Test Page</h2>
        </div>
        <a href="/c.htm">c</a>
        <a href="/a.htm">a</a>
        <a href="/b.htm">b</a>
    </body>
</html>



[ 16:28 jon@host ~ ]$wget -m http://98.164.214.224:8000
--2011-10-15 16:28:51--  http://98.164.214.224:8000/
Connecting to 98.164.214.224:8000... connected.
HTTP request sent, awaiting response... 200 OK
Length: 556 [text/html]
Saving to: "98.164.214.224:8000/index.html"

100%[====================================================================================================================================================================================================>] 556         --.-K/s   in 0s

2011-10-15 16:28:51 (19.7 MB/s) - "98.164.214.224:8000/index.html" saved [556/556]

--2011-10-15 16:28:51--  http://98.164.214.224:8000/style.css
Connecting to 98.164.214.224:8000... connected.
HTTP request sent, awaiting response... 200 OK
Length: 221 [text/css]
Saving to: "98.164.214.224:8000/style.css"

100%[====================================================================================================================================================================================================>] 221         --.-K/s   in 0s

2011-10-15 16:28:51 (777 KB/s) - "98.164.214.224:8000/style.css" saved [221/221]

--2011-10-15 16:28:51--  http://98.164.214.224:8000/c.htm
Connecting to 98.164.214.224:8000... connected.
HTTP request sent, awaiting response... 200 OK
Length: 0 [text/html]
Saving to: "98.164.214.224:8000/c.htm"

    [ <=>                                                                                                                                                                                                 ] 0           --.-K/s   in 0s

2011-10-15 16:28:51 (0.00 B/s) - "98.164.214.224:8000/c.htm" saved [0/0]

--2011-10-15 16:28:51--  http://98.164.214.224:8000/a.htm
Connecting to 98.164.214.224:8000... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2 [text/html]
Saving to: "98.164.214.224:8000/a.htm"

100%[====================================================================================================================================================================================================>] 2           --.-K/s   in 0s

2011-10-15 16:28:51 (102 KB/s) - "98.164.214.224:8000/a.htm" saved [2/2]

--2011-10-15 16:28:51--  http://98.164.214.224:8000/b.htm
Connecting to 98.164.214.224:8000... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2 [text/html]
Saving to: "98.164.214.224:8000/b.htm"

100%[====================================================================================================================================================================================================>] 2           --.-K/s   in 0s

2011-10-15 16:28:51 (85.8 KB/s) - "98.164.214.224:8000/b.htm" saved [2/2]

FINISHED --2011-10-15 16:28:51--
Downloaded: 5 files, 781 in 0s (2.15 MB/s)
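The log matches the markup: style.css first (from the `<link>` in the head), then c, a, b — exactly the order the href references appear from the top of the page down. A minimal sketch of that ordering, using grep rather than wget so no server is needed (the file path and the trimmed-down page are for illustration only):

```shell
# Rebuild just the href-bearing lines of the test page, in the
# same order they appear in the original markup.
cat > /tmp/order-test.html <<'EOF'
<link rel="stylesheet" type="text/css" href="style.css">
<a href="/c.htm">c</a>
<a href="/a.htm">a</a>
<a href="/b.htm">b</a>
EOF

# List every href in document order, as wget's parser encounters them.
refs=$(grep -o 'href="[^"]*"' /tmp/order-test.html | tr '\n' ' ')
echo "$refs"
```

This prints `href="style.css" href="/c.htm" href="/a.htm" href="/b.htm"` — the same sequence the wget log shows, consistent with a breadth-first crawl that queues links in the order it parses them.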

(Editor: 李大同)

