bash – The logic behind wget's download order
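This is a fairly general question, but it has wider implications for a data mining project I'm running. I have been using wget to mirror archived web pages for analysis. It is a large amount of data, and my current mirroring process has been running for almost a week, which has given me plenty of time to watch the readout.

How does wget determine the order in which it downloads pages? I can't discern any consistent logic in its decisions (it does not proceed alphabetically, by original site creation date, or by file type). Understanding this would be very helpful once I start working with the data.

FWIW, here is the command I'm using (the site requires cookies, and although its TOS allows access "by any means", I don't want to take any chances), where SITE = URL:

    wget -m --cookies=on --keep-session-cookies --load-cookies=cookie3.txt --save-cookies=cookie4.txt --referer=SITE --random-wait --wait=1 --limit-rate=30K --user-agent="Mozilla 4.0" SITE

Edited to add: in the comments on Chown's helpful answer I refined my question a bit, so here it is. With larger sites, such as epe.lac-bac.gc.ca/100/205/301/ic/cdc/E/Alphabet.asp, I find that wget first creates a directory structure and some index.html / default.html pages, but then goes back through the different sites several more times (grabbing, for example, a few more images and sub-pages on each pass).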
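From the gnu.org wget manual, Recursive Download section: recursive retrieval of HTTP and HTML/CSS content is breadth-first. wget downloads the requested document first, then the documents linked from it (depth 1), then the documents linked from those (depth 2), and so on up to the maximum depth.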
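The breadth-first order described in the manual's Recursive Download section can be illustrated with a small sketch. This is not wget's actual implementation: the `PAGES` dictionary, the `example.test` host and the extra `/c/deep.htm` page are hypothetical stand-ins for a real site, and the sketch ignores depth limits, robots.txt and CSS `url()` references. It only shows why links are fetched in page order within one depth level, and why deeper pages wait until the current level is finished.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect href/src attribute values in the order they appear."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def breadth_first_order(pages, start):
    """Return the order in which a breadth-first crawl would fetch URLs.

    `pages` maps absolute URLs to their HTML source (a hypothetical,
    in-memory stand-in for the network).
    """
    queue = deque([start])
    seen = {start}
    order = []
    while queue:
        url = queue.popleft()
        if url not in pages:
            continue  # a real crawler would get an error response here
        order.append(url)
        collector = LinkCollector()
        collector.feed(pages[url])
        for link in collector.links:
            target = urljoin(url, link)
            if target not in seen:
                seen.add(target)
                queue.append(target)  # queued behind everything at this depth
    return order


# Hypothetical site: the index links to style.css, c, a, b (in that order),
# and c.htm links one level deeper.
PAGES = {
    "http://example.test/": (
        '<link rel="stylesheet" href="style.css">'
        '<a href="/c.htm">c</a><a href="/a.htm">a</a><a href="/b.htm">b</a>'
    ),
    "http://example.test/style.css": "",
    "http://example.test/c.htm": '<a href="/c/deep.htm">deep</a>',
    "http://example.test/a.htm": "",
    "http://example.test/b.htm": "",
    "http://example.test/c/deep.htm": "",
}

if __name__ == "__main__":
    for fetched in breadth_first_order(PAGES, "http://example.test/"):
        print(fetched)
```

In this model `/c/deep.htm` is fetched last even though `c.htm` is the first page link, because the queue finishes depth 1 before moving on. On a large site the same effect looks like wget returning to a directory several times to pick up a few more files on each pass.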
From my own basic testing, when the structure is one level deep, wget downloads files in the order their links appear in the page, from top to bottom:

    [ 16:28 root@host /var/www/html ]# cat index.html
    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
    <html lang="en-US">
    <head>
    <link rel="stylesheet" type="text/css" href="style.css">
    </head>
    <body>
    <div style="text-align:center;">
    <h2>Mobile Test Page</h2>
    </div>
    <a href="/c.htm">c</a>
    <a href="/a.htm">a</a>
    <a href="/b.htm">b</a>
    </body>
    </html>

    [ 16:28 jon@host ~ ]$ wget -m http://98.164.214.224:8000
    --2011-10-15 16:28:51--  http://98.164.214.224:8000/
    Connecting to 98.164.214.224:8000... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: 556 [text/html]
    Saving to: "98.164.214.224:8000/index.html"

    100%[====================================================================================================================================================================================================>] 556 --.-K/s in 0s

    2011-10-15 16:28:51 (19.7 MB/s) - "98.164.214.224:8000/index.html" saved [556/556]

    --2011-10-15 16:28:51--  http://98.164.214.224:8000/style.css
    Connecting to 98.164.214.224:8000... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: 221 [text/css]
    Saving to: "98.164.214.224:8000/style.css"

    100%[====================================================================================================================================================================================================>] 221 --.-K/s in 0s

    2011-10-15 16:28:51 (777 KB/s) - "98.164.214.224:8000/style.css" saved [221/221]

    --2011-10-15 16:28:51--  http://98.164.214.224:8000/c.htm
    Connecting to 98.164.214.224:8000... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: 0 [text/html]
    Saving to: "98.164.214.224:8000/c.htm"

    [ <=> ] 0 --.-K/s in 0s

    2011-10-15 16:28:51 (0.00 B/s) - "98.164.214.224:8000/c.htm" saved [0/0]

    --2011-10-15 16:28:51--  http://98.164.214.224:8000/a.htm
    Connecting to 98.164.214.224:8000... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: 2 [text/html]
    Saving to: "98.164.214.224:8000/a.htm"

    100%[====================================================================================================================================================================================================>] 2 --.-K/s in 0s

    2011-10-15 16:28:51 (102 KB/s) - "98.164.214.224:8000/a.htm" saved [2/2]

    --2011-10-15 16:28:51--  http://98.164.214.224:8000/b.htm
    Connecting to 98.164.214.224:8000... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: 2 [text/html]
    Saving to: "98.164.214.224:8000/b.htm"

    100%[====================================================================================================================================================================================================>] 2 --.-K/s in 0s

    2011-10-15 16:28:51 (85.8 KB/s) - "98.164.214.224:8000/b.htm" saved [2/2]

    FINISHED --2011-10-15 16:28:51--
    Downloaded: 5 files, 781 in 0s (2.15 MB/s)

The files arrive as index.html, style.css, c.htm, a.htm, b.htm, which matches the order of the links in index.html rather than alphabetical order.
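For later analysis, the actual download order of a long mirror run can be recovered from wget's own log instead of being inferred. The following is a minimal sketch that assumes the log uses the request-line format seen in the transcript above (`--YYYY-MM-DD HH:MM:SS--` followed by the URL); other wget versions or locales may format these lines differently. The function name `requested_urls` and the shortened `SAMPLE_LOG` are illustrative only.

```python
import re

# Matches request lines such as:
#   --2011-10-15 16:28:51--  http://98.164.214.224:8000/style.css
# The trailing "FINISHED --...--" line is skipped because it does not
# start with "--" and has no URL after the timestamp.
REQUEST_LINE = re.compile(
    r"^--\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}--\s+(\S+)",
    re.MULTILINE,
)


def requested_urls(log_text):
    """Return the requested URLs in the order they appear in the log."""
    return REQUEST_LINE.findall(log_text)


# Shortened excerpt of the transcript above, used only as sample input.
SAMPLE_LOG = """\
--2011-10-15 16:28:51--  http://98.164.214.224:8000/
Connecting to 98.164.214.224:8000... connected.
HTTP request sent, awaiting response... 200 OK
--2011-10-15 16:28:51--  http://98.164.214.224:8000/style.css
Connecting to 98.164.214.224:8000... connected.
--2011-10-15 16:28:51--  http://98.164.214.224:8000/c.htm
FINISHED --2011-10-15 16:28:51--
"""

if __name__ == "__main__":
    for position, url in enumerate(requested_urls(SAMPLE_LOG), start=1):
        print(position, url)
```

Note that a redirected request may appear as more than one request line, so the list reflects requests made rather than files saved.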