加入收藏 | 设为首页 | 会员中心 | 我要投稿 李大同 (https://www.lidatong.com.cn/)- 科技、建站、经验、云计算、5G、大数据,站长网!
当前位置: 首页 > 大数据 > 正文

Perl 实现简单的html 标签筛选

发布时间:2020-12-15 23:40:45 所属栏目:大数据 来源:网络整理
导读:此程序提供简单的获取html 页面代码并筛选出以下标签和一些基本属性: script : 属性 src, type a : 属性 href img: 属性 src 后续会添加一些更有用的功能,并逐步完善命令行接口。 使用方法: perl filter_html.pl? URL #!/usr/bin/perl# ----------------

此程序提供简单的获取html 页面代码并筛选出以下标签和一些基本属性:

<script> : 属性 src, type

<a> : 属性 href

<img>: 属性 src


后续会添加一些更有用的功能,并逐步完善命令行接口。

使用方法:

perl filter_html.pl? <URL>


#!/usr/bin/perl

# --------------------------
# author : ez
# date : 2015/8/23
# describe : this script send http request for a http url and filter some special 
#        tag you input
# --------------------------

use strict;
use warnings;
use LWP::UserAgent;
use Data::Dumper;
use HTML::TreeBuilder;
# use HTML::Parser;

our $VERSION = 1.0;

my %disp_func = ( 
	a => sub {
		my $em = shift;
		return if ! defined ($em) and $em -> tag () ne 'a';
		my $href = $em -> attr ('href');
		print "a url = "
			. ($href ? $href : 'none') . "n";
	},script => sub {
		my $em = shift;
		return if ! defined ($em) and $em -> tag () ne 'script';
		my $type = $em -> attr ('type');
		my $src = $em -> attr ('src');
		print "script type = "
			. ($type ? $type : 'none') . ",src = "
			. ($src ? $src : 'none') . "n";
	},img => sub {
		my $em = shift;
		return if ! defined ($em) and $em -> tag () ne 'img';
		my $src = $em -> attr ('src');
		print "img src = "
			. ($src ? $src : 'none') . "n";
	}
);


&_usage () if @ARGV < 1; 
my $url = shift @ARGV;

my @tags = qw(a script form img);
@tags = @ARGV if @ARGV >= 1;

my $useragent = LWP::UserAgent -> new;
my $request = HTTP::Request -> new ('GET' => $url);

$request -> content_type ('application/x-www-form-urlencoded');
$request -> header ('Accept-Language' => 'zh-cn,zh;q=0.8,en-us;q=0.5,en;q=0.3');

print "[-] sending request to $url ...n";
my $html = $useragent -> request ($request);
print "[-] get response !n";

my $tree = new HTML::TreeBuilder;
$tree -> parse ($html -> content ());
$tree -> eof ();
my $html_tag = $tree -> elementify ();
# my @decendants = $html_tag -> descendants ();

# maybe the parameter could be more exciting  :-)
my @find_tags = $html_tag -> find_by_tag_name ('a','script','img');

foreach (@find_tags) {
	next if !defined ($_) and $_ -> tag () eq '';
	&{$disp_func {$_ -> tag ()}} ($_);
}

$tree -> delete ();

sub get_tags {
	my ($tag,$node) = @_;
	return if ! $tag;
}

sub _usage {
	print "usage: filter_html.pl <webpage_url>";
	exit;
}

# debug
# my $tag = $val -> tag (); # get 'html'

# TODO : parse start
# my $items = $tree -> findnodes ('/html/body//a');
# for my $item ($items -> get_nodelist ()) {
# 	my $str = $item -> content -> [0];
# 	print "$strn";
# }


# print $html -> as_string ();
# print $html -> content ();

# my @line =  $html -> content ();


# /<(S*?)[^>]>.*?</1>|<.*?/>/
# foreach (@line) {
# 	print "$_n";
# }

__END__

注:perl中可能没有需要的HTML::TreeBuilder模块和Data::Dumper模块,可以CPAN自己下载安装。

运行环境: Linux 3.18.0-kail3-amd64 #1 SMP Debian x86_64 GNU / Linux?

Perl: v5.14.2 built for x86_64-linux-gnu-thread-multi

(编辑:李大同)

【声明】本站内容均来自网络,其相关言论仅代表作者个人观点,不代表本站立场。若无意侵犯到您的权利,请及时与联系站长删除相关内容!

    推荐文章
      热点阅读