How to define a regular expression to remove text-masked spam links from a Java string (“s
Published: 2020-12-14 06:07:49 Category: Encyclopedia Source: compiled from the web
I have a list of sites that represent spam links:

    List<String> bannedSites = Arrays.asList("spam1.com", "spam2.com", "spam3.com");

Is there a regular expression that removes links matching these banned sites from the following text?

    Dear Arezzo,
    Please check out my website at spam1.com or http://www.spam1.com or http://spam1.com or spam1 dot com to win millions of dollars in prizes.
    Thank you.
    Big Spammer

Note that the links can appear in several URL formats, which aioobe's solution recognizes well:

    String input = "Dear Arezzo,\n"
            + "Please check out my website at spam1.com or http://www.spam1.com "
            + "or http://spam1.com or spam1 dot com to win millions of dollars in prizes.\n"
            + "Thank you.";

    List<String> bannedSites = Arrays.asList("spam1.com", "spam2.com", "spam3.com");

    StringBuilder re = new StringBuilder();
    for (String bannedSite : bannedSites) {
        if (re.length() > 0)
            re.append("|");
        // Match either a full http:// URL or the bare site name.
        re.append(String.format("http://(www\\.)?%s\\S*|%1$s", Pattern.quote(bannedSite)));
    }

    System.out.println(input.replaceAll(re.toString(), "LINK REMOVED"));

However, while the code above works for the URL formats spam1.com, http://www.spam1.com and http://spam1.com, it misses several text-masked formats:

    spam1 dot com
    spam1[.com]
    spam1 .com
    spam1 . com

How can I modify the regular expression to target these formats as well? The idea is to produce a result like this:

    Dear Arezzo,
    Please check out my website at [LINK REMOVED] or [LINK REMOVED] or [LINK REMOVED] or [LINK REMOVED] to win millions of dollars in prizes.
    Thank you.
    Big Spammer

As I said in a comment below, I may not need to remove the entire string "spam1 dot com". If I can eliminate just the spam1 part, so that it becomes "[LINK REMOVED] dot com", that would do the job.

Solution
Here is a start.

    import java.util.*;
    import java.util.regex.Pattern;

    class Test {
        public static void main(String[] args) {
            String input = "Dear Arezzo,\n"
                    + "Please check out my website at spam1.com "
                    + "or http://www.spam1.com or http://spam1.com or "
                    + "spam1 dot com to win millions of dollars in prizes.\n"
                    + "Thank you.";

            // Ban the site names only; the TLD is matched by the regex itself.
            List<String> bannedSites = Arrays.asList("spam1", "spam2", "spam3");

            StringBuilder re = new StringBuilder();
            for (String bannedSite : bannedSites) {
                if (re.length() > 0)
                    re.append("|");
                String quotedSite = Pattern.quote(bannedSite);
                // Full URLs such as http://www.spam1.com/...
                re.append("https?://(www\\.)?" + quotedSite + "\\S*");
                // Bare or masked forms such as spam1.com, spam1 dot com, spam1 . com
                re.append("|" + quotedSite + "\\s*(dot|\\.)?\\s*(com|net|org)");
                //re.append("|" ... your variation here);
            }

            System.out.println(input.replaceAll(re.toString(), "LINK REMOVED"));
        }
    }

Output:

    Dear Arezzo,
    Please check out my website at LINK REMOVED or LINK REMOVED or LINK REMOVED or LINK REMOVED to win millions of dollars in prizes.
    Thank you.
Extend the regular expression as needed.
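As one possible extension, the sketch below also covers the bracketed form spam1[.com] from the question and matches case-insensitively. The class name SpamLinkScrubber and the helpers buildPattern and scrub are hypothetical names chosen for illustration, and the TLD list (com|net|org) is an assumption carried over from the answer above. Treat it as a starting point rather than a complete spam filter.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class SpamLinkScrubber {

    // Hypothetical helper: builds one case-insensitive pattern for all banned site names.
    static Pattern buildPattern(List<String> bannedNames) {
        StringBuilder re = new StringBuilder();
        for (String name : bannedNames) {
            if (re.length() > 0) {
                re.append("|");
            }
            String quoted = Pattern.quote(name);
            // Full URLs: http://spam1.com/path, https://www.spam1.com
            re.append("https?://(?:www\\.)?").append(quoted).append("\\S*");
            // Masked forms: spam1.com, spam1 dot com, spam1[.com], spam1 .com, spam1 . com
            // The \b after the TLD keeps "spam1.community" from matching.
            re.append("|").append(quoted)
              .append("\\s*\\[?\\s*(?:dot|\\.)\\s*(?:com|net|org)\\b\\]?");
        }
        return Pattern.compile(re.toString(), Pattern.CASE_INSENSITIVE);
    }

    // Hypothetical helper: replaces every match with the marker used in the question.
    static String scrub(String input, List<String> bannedNames) {
        return buildPattern(bannedNames).matcher(input).replaceAll("[LINK REMOVED]");
    }

    public static void main(String[] args) {
        List<String> banned = Arrays.asList("spam1", "spam2", "spam3");
        String input = "Try spam1[.com], spam1 .com, spam1 . com or SPAM1 DOT COM today.";
        System.out.println(scrub(input, banned));
    }
}
```

If the pattern is applied to many messages, it is probably worth calling buildPattern once and reusing the compiled Pattern instead of rebuilding it on every call.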