
How to define a regular expression to remove text-masked spam links from a Java string (“s

Published: 2020-12-14 06:07:49  Category: Encyclopedia  Source: compiled from the web
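I have a list of sites that represent spam links:

List<String> bannedSites = Arrays.asList("spam1.com","spam2.com","spam3.com");

Is there a regular expression that removes links matching these banned sites from the following text?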

Dear Arezzo,
Please check out my website at spam1.com or http://www.spam1.com
or http://spam1.com or spam1 dot com to win millions of dollars in prizes.
Thank you.
Big Spammer

Note that the links can appear in several URL formats, which aioobe's solution recognizes well:

    String input = "Dear Arezzo,\n"
            + "Please check out my website at spam1.com or http://www.spam1.com "
            + "or http://spam1.com or spam1 dot com to win millions of dollars in prizes.\n"
            + "Thank you.";

    List<String> bannedSites = Arrays.asList("spam1.com","spam3.com");

    // Build one alternation: per site, either an http:// URL or the bare domain.
    StringBuilder re = new StringBuilder();
    for (String bannedSite : bannedSites) {
        if (re.length() > 0)
            re.append("|");
        // %s and %1$s both refer to the same quoted site name.
        re.append(String.format("http://(www\\.)?%s\\S*|%1$s",Pattern.quote(bannedSite)));
    }

    System.out.println(input.replaceAll(re.toString(),"LINK REMOVED"));

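However, while the code above handles the URL formats spam1.com, http://www.spam1.com, and http://spam1.com, it misses several text-masked formats.

How can the regular expression be modified to catch the following formats?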

spam1 dot com
spam1[.com]
spam1 .com
spam1 . com

The idea is to produce a result like this:

Dear Arezzo,
Please check out my website at [LINK REMOVED] or [LINK REMOVED]
or [LINK REMOVED] or [LINK REMOVED] to win millions of dollars in prizes.
Thank you.
Big Spammer

As I said in the comments below, I may not need to ban the whole string spam1 dot com. If I can eliminate just the spam1 part, so that it becomes [LINK REMOVED] dot com, that will do the job.
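The relaxed requirement above (replacing only the site name and leaving "dot com" in place) could be sketched roughly as follows. This is a minimal illustration, not part of the original question: the class name `BannedNameScrubber` and the method `scrubNames` are hypothetical, and the list is assumed to hold bare names such as `spam1` rather than full domains.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class BannedNameScrubber {

    // Replaces every whole-word occurrence of a banned name, whatever follows it.
    // Assumption: bannedNames holds bare names ("spam1"), not domains ("spam1.com").
    static String scrubNames(String text, List<String> bannedNames) {
        String alternation = bannedNames.stream()
                .map(Pattern::quote)              // treat each name literally
                .collect(Collectors.joining("|"));
        // \b on both sides keeps "spam10" or "myspam1" from matching.
        Pattern pattern = Pattern.compile("\\b(?:" + alternation + ")\\b",
                Pattern.CASE_INSENSITIVE);
        return pattern.matcher(text)
                .replaceAll(Matcher.quoteReplacement("[LINK REMOVED]"));
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("spam1", "spam2", "spam3");
        System.out.println(scrubNames("Visit spam1 dot com or http://www.spam1.com", names));
    }
}
```

The trade-off is that fragments such as "dot com" or "http://www." stay in the text, which the question states is acceptable.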

Solution

Here's a start.

import java.util.*;
import java.util.regex.Pattern;

class Test {
    public static void main(String[] args) {

        String input = "Dear Arezzo,\n"
            + "Please check out my website at spam1.com "
            + "or http://www.spam1.com or http://spam1.com or "
            + "spam1 dot com to win millions of dollars in prizes.\n"
            + "Thank you.";

        // Bare site names; the TLD is matched separately by the regex below.
        List<String> bannedSites = Arrays.asList("spam1","spam2","spam3");

        StringBuilder re = new StringBuilder();
        for (String bannedSite : bannedSites) {
            if (re.length() > 0)
                re.append("|");
            String quotedSite = Pattern.quote(bannedSite);
            // Full URLs: http:// or https://, optional "www.", then the rest of the token.
            re.append("https?://(www\\.)?" + quotedSite + "\\S*");
            // Bare or masked domains: "spam1.com", "spam1 .com", "spam1 dot com", ...
            re.append("|" + quotedSite + "\\s*(dot|\\.)?\\s*(com|net|org)");
            //re.append("|" ... your variation here);
        }

        System.out.println(input.replaceAll(re.toString(),"LINK REMOVED"));
    }
}

Output:

Dear Arezzo,
Please check out my website at LINK REMOVED or LINK REMOVED or LINK REMOVED or LINK REMOVED to win millions of dollars in prizes.
Thank you.

Extend the regular expression as needed.
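One possible extension is sketched below; it is an illustration under stated assumptions, not the answerer's code. The class name `SpamLinkFilter`, the helpers `buildPattern` and `scrub`, the bracketed `[LINK REMOVED]` replacement, and the fixed TLD list are all hypothetical choices. The sketch allows an optional `[` before the separator and an optional `]` after the TLD, so that `spam1[.com]` is caught along with `spam1 .com`, `spam1 . com`, and `spam1 dot com`, and it adds word boundaries so that text such as `spam1 . Come` is left alone.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class SpamLinkFilter {

    // Assumption: only these TLDs matter, mirroring the answer above.
    private static final String TLD = "(?:com|net|org)";

    static Pattern buildPattern(List<String> bannedNames) {
        // One literal alternation of all banned names, e.g. (?:\Qspam1\E|\Qspam2\E)
        String names = bannedNames.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|", "(?:", ")"));

        // Full URLs: scheme, optional "www.", the name, then the rest of the token.
        String url = "https?://(?:www\\.)?" + names + "\\S*";

        // Masked forms: name, optional spaces, optional "[", then "." or the
        // word "dot", optional spaces, a TLD, and an optional closing "]".
        String masked = "\\b" + names
                + "\\s*\\[?\\s*(?:\\.|\\bdot\\b)\\s*" + TLD + "\\b\\]?";

        return Pattern.compile(url + "|" + masked, Pattern.CASE_INSENSITIVE);
    }

    static String scrub(String text, List<String> bannedNames) {
        return buildPattern(bannedNames)
                .matcher(text)
                .replaceAll(Matcher.quoteReplacement("[LINK REMOVED]"));
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("spam1", "spam2", "spam3");
        String input = "Try spam1 dot com, spam1[.com], spam1 .com or spam1 . com today.";
        System.out.println(scrub(input, names));
    }
}
```

Compiling the pattern once and reusing it avoids rebuilding the regex for every message; variants such as `spam1(dot)com` would still need their own alternatives.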

