
Multithreaded Web Crawling in Perl

Published: 2020-12-15 21:00:19 | Category: Big Data | Source: compiled from the web

Perl is very capable at fetching web pages, so I tried crawling pages with multiple threads.


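#!/usr/bin/perl
use strict;
use warnings;
use threads;
use threads::shared;
use LWP::UserAgent;
use LWP::ConnCache;
use URI;
use HTML::TreeBuilder;

# Pending URLs and the set of URLs already seen, shared by all threads.
my @urls :shared;
my %uniq_url :shared;

my $starturl = $ARGV[0] or die "Usage: $0 <start-url>\n";
push @urls, $starturl;
$uniq_url{$starturl} = 1;

# Each thread works on its own copy of this user agent, cloned when the thread is created.
my $browser = LWP::UserAgent->new();
$browser->timeout(10);
# Only http and gopher are allowed; requests for other schemes (such as https) are refused.
$browser->protocols_allowed(['http', 'gopher']);
$browser->conn_cache(LWP::ConnCache->new());
my $num = 1;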

while (scalar(@urls) > 0)
{
    # A single pending URL is fetched directly in the main thread.
    if (scalar(@urls) == 1)
    {
        my $url = shift @urls;
        &parse($url, 'old');
    }
    # With two or more pending URLs, fetch a batch of at most 20 in parallel.
    if (scalar(@urls) >= 2)
    {
        $num = scalar(@urls) >= 20 ? 20 : scalar(@urls);
        my @thread;
        for (my $j = 0; $j < $num; $j++)
        {
            my $url = shift @urls;
            # Pass a reference to parse; the new thread calls it with ($url, "thread$j").
            $thread[$j] = threads->create(\&parse, $url, "thread$j");
        }
        # Wait for the whole batch to finish before starting the next one.
        for (my $j = 0; $j < $num; $j++)
        {
            $thread[$j]->join();
        }
    }
}
sub parse
{
    my $url  = shift;
    my $type = shift;
    my $response = $browser->get($url);
    unless ($response->is_success)
    {
        print "can't access $url (" . $response->status_line . ")\n";
        return 0;
    }
    my $html = $response->content;
    # Collect new links only while the queue is short, to cap its growth.
    if (scalar(@urls) < 300)
    {
        while ($html =~ /href="(.*?)"/ig)
        {
            my $new_url = URI->new_abs($1, $response->base);
            # Hold the lock so the check and the insert are atomic across threads.
            lock(%uniq_url);
            if (!exists($uniq_url{$new_url}))
            {
                push @urls, "$new_url";
                $uniq_url{$new_url} = 1;
            }
        }
    }
    $| = 1;    # flush each result line as soon as it is printed
    my $root  = HTML::TreeBuilder->new_from_content($html);
    my $title = $root->find_by_tag_name('title');
    if ($title)
    {
        my $str_title = $title->as_text();
        print "$url\t$str_title\t$type\n";
    }
    else
    {
        print "$url\tno title\t$type\n";
    }
    $root->delete;    # free the parse tree
    return 1;
}
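
The script starts a fresh batch of up to 20 threads on every pass and joins all of them before continuing, so one slow page holds up the whole batch. A common alternative is a fixed pool of workers fed from a queue. The following is a minimal sketch of that pattern using the core Thread::Queue module; it is not the method used above. `fetch_title`, `worker`, `crawl_with_pool`, the worker count and the sample URLs are all made up for illustration, and the stub does not touch the network (a real version would call `$browser->get` as `parse` does). It also assumes a Thread::Queue version that provides `end()`.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use threads;
use Thread::Queue;

# Hypothetical stand-in for the real fetch-and-parse step (no network access).
sub fetch_title {
    my ($url) = @_;
    return "title of $url";
}

# Worker body: take URLs from $todo until it is drained, report to $done.
sub worker {
    my ($todo, $done) = @_;
    while (defined(my $url = $todo->dequeue())) {
        $done->enqueue("$url\t" . fetch_title($url));
    }
    return;
}

# Run $workers threads over @urls and return a hash reference
# mapping each URL to the value fetch_title() produced for it.
sub crawl_with_pool {
    my ($workers, @urls) = @_;
    my $todo = Thread::Queue->new(@urls);
    my $done = Thread::Queue->new();
    $todo->end();    # nothing more will be added, so dequeue() returns undef once empty

    my @pool;
    push @pool, threads->create(\&worker, $todo, $done) for 1 .. $workers;
    $_->join() for @pool;

    my %result;
    while (defined(my $line = $done->dequeue_nb())) {
        my ($url, $title) = split /\t/, $line, 2;
        $result{$url} = $title;
    }
    return \%result;
}

my $titles = crawl_with_pool(4, map { "http://example.com/page$_" } 1 .. 10);
print "$_\t$titles->{$_}\n" for sort keys %$titles;
```

Because the sketch closes the queue before the workers start, it only handles a fixed list of URLs. A crawler that discovers links as it goes would have to keep the queue open and decide separately when the crawl is finished.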

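It seems to work fairly well for crawling Baidu Music. The links and song titles could then be stored in MySQL, with a CGI script running fuzzy SQL queries on top, to get a simple search feature. It is all quite rough, so please bear with me.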
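
The storage and fuzzy-search idea could look roughly like this with DBI. This is only a sketch under assumptions: the table `songs`, its columns `url` and `title`, and the helper names `like_pattern`, `save_song` and `search_songs` are invented here, and `$dbh` is assumed to be a database handle that was opened elsewhere with `DBI->connect`.

```perl
use strict;
use warnings;

# Turn a search term into a LIKE pattern. The wildcards % and _ and the
# escape character ! are prefixed with ! so that they match literally.
sub like_pattern {
    my ($term) = @_;
    $term =~ s/([!%_])/!$1/g;
    return "%$term%";
}

# Store one crawled link. $dbh is a connected DBI handle (assumed);
# the songs table and its columns are hypothetical.
sub save_song {
    my ($dbh, $url, $title) = @_;
    return $dbh->do(
        'INSERT INTO songs (url, title) VALUES (?, ?)',
        undef, $url, $title,
    );
}

# Return an array reference of [url, title] rows whose title contains $term.
sub search_songs {
    my ($dbh, $term) = @_;
    return $dbh->selectall_arrayref(
        q{SELECT url, title FROM songs WHERE title LIKE ? ESCAPE '!'},
        undef, like_pattern($term),
    );
}
```

Placeholders keep the user's search term out of the SQL text, which matters once the query is driven by a CGI form. A CGI script would call `search_songs` with the submitted term and print the returned rows.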

(Editor: 李大同)
