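A common problem when scraping content is inconsistent formatting. I spent some time writing a function that uses regular expressions to strip HTML tags and inline styling, and I am sharing it here.

The code is as follows:

```php
/**
 * Normalize the formatting of scraped HTML.
 *
 * @param string $content HTML to clean; ideally already UTF-8 encoded
 * @return string
 *
 * Requires the tidy extension.
 */
function removeFormat($content) {
    // The tags inside a few patterns and replacement strings did not survive
    // on the published page. The <h3>, &nbsp; and <p>/</p> values below are
    // restored from the pairing and the comments, so treat them as assumptions.
    $replaces = array(
        "/<font.*?>/i" => '',
        "/<\/font>/i" => '',
        "/<h3>/i" => '',
        "/<\/h3>/i" => '',
        "/<span.*?>/i" => '',
        "/<\/span>/i" => '',
        "/<div.*?>/i" => "<p>",
        "/<\/div>/i" => "</p>",
        "/<!--<.*?>-->/i" => '',
        /* Do not enable these when the content contains tables.
        "/<table.*?>/i" => '',
        "/<\/table>/i" => '',
        "/<tbody.*?>/i" => '',
        "/<\/tbody>/i" => '',
        "/<tr.*?>/i" => '<p>',
        "/<\/tr>/i" => '</p>',
        "/<td.*?>/i" => '', */
        // Inside a character class "|" is a literal, so ['\"] is enough.
        "/style=.+?['\"]/i" => '',
        "/class=.+?['\"]/i" => '',
        "/id=.+?['\"]/i" => '',
        "/lang=.+?['\"]/i" => '',
        //"/width=.+?['\"]/i" => '', // hard to control, so commented out
        //"/height=.+?['\"]/i" => '',
        "/border=.+?['\"]/i" => '',
        "/face=.+?['\"]/i" => '',
        "/<br.*?>[ ]*/i" => "</p><p>",
        "/<iframe.*?>.*<\/iframe>/i" => '',
        "/&nbsp;/i" => ' ', // replace non-breaking spaces
        // Strip half-width spaces, full-width spaces and line breaks after <p>.
        // The original comment mentions following the tag with &nbsp; to avoid
        // encoding problems when writing to the database; the exact string was
        // lost, so only the tag is restored here.
        "/<p.*?>[ \x{3000}\r\n]*/ui" => '<p>',
    );
    $config = array(
        //'indent' => TRUE,        // whether to indent the output
        'output-html' => TRUE,     // emit HTML (use 'output-xhtml' for XHTML)
        'show-body-only' => TRUE,  // return only the contents of <body>
        'wrap' => 0
    );
    // Repair the markup with PHP's tidy extension first; otherwise the
    // replacements below tend to produce all kinds of odd results.
    $content = tidy_repair_string($content, $config, 'utf8');
    $content = trim($content);
    foreach ($replaces as $k => $v) {
        $content = preg_replace($k, $v, $content);
    }
    // Some content may be missing the opening <p> tag at the start.
    if (strpos($content, '<p>') > 6) {
        $content = '<p>' . $content;
    }
    // Repair once more, which also removes empty HTML tags. The second
    // argument must be the config array; the encoding is the third.
    $content = tidy_repair_string($content, $config, 'utf8');
    $content = trim($content);
    return $content;
}
```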
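The pattern-to-replacement table at the heart of `removeFormat()` can be tried on its own, without the tidy extension. The following is only a minimal sketch: `applyReplacements()` and the sample markup are hypothetical names introduced for illustration, and the table is a shortened variant rather than the author's full list. It relies on `preg_replace()` accepting parallel arrays of patterns and replacements, which it applies in order.

```php
<?php
/**
 * Hypothetical helper (not part of the original post): applies a
 * pattern => replacement table in order and returns the result.
 */
function applyReplacements(string $html, array $replaces): string
{
    // Each pattern runs against the output of the previous one.
    $result = preg_replace(array_keys($replaces), array_values($replaces), $html);
    if ($result === null) {
        throw new RuntimeException('preg_replace failed, code ' . preg_last_error());
    }
    return $result;
}

// Shortened, assumed table: drop <font>, strip style attributes,
// and turn <div> blocks into paragraphs.
$replaces = [
    '/<font.*?>/i'            => '',
    '/<\/font>/i'             => '',
    '/\sstyle=(["\']).*?\1/i' => '',
    '/<div.*?>/i'             => '<p>',
    '/<\/div>/i'              => '</p>',
];

$sample = '<div class="x"><font color="red">Hello</font> '
        . '<span style="color:blue">world</span></div>';

echo applyReplacements($sample, $replaces), "\n";
// prints: <p>Hello <span>world</span></p>
```

The style pattern here uses a backreference (`\1`) so the attribute ends at the same quote character that opened it, which is a slightly stricter variant of the `['"]` form used in the function. Because the table is applied in order, rules that rewrite a tag (such as `<div>` to `<p>`) should come after rules that depend on the original tag name.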
(Editor: 李大同)