Regular expressions - how to match a string with diacritics in Perl?
For example, match "Nation" in "???ér???????????????" without extra modules. Is this possible in newer Perl versions (5.14, 5.15, etc.)?
Right solution, matching with the UCA (thanks to tchrist, http://stackoverflow.com/users/471272/tchrist):

    # find start/end offsets of the matched UTF-8 substring (without intersections)
    use 5.014;
    use strict;
    use warnings;
    use utf8;
    use Unicode::Collate;

    binmode STDOUT, ':encoding(UTF-8)';

    my $str  = "???ér???????????????" x 2;
    my $look = "Nation";

    # level => 1 compares base letters only, ignoring accents and case;
    # normalization must be switched off for the search methods to work
    my $Collator = Unicode::Collate->new( normalization => undef, level => 1 );

    my @match = $Collator->match($str, $look);
    if (@match) {
        my $found = $match[0];
        my $f_len = length($found);
        say "match result: $found (length is $f_len)";

        my $offset = 0;
        while ((my $start = index($str, $found, $offset)) != -1) {
            my $end = $start + $f_len;
            say sprintf("found at: %s,%s", $start, $end);
            $offset = $end;    # continue right after this match so an adjacent match is not skipped
        }
    }

Wrong (but working) solution, from http://www.perlmonks.org/?node_id=485681:
    $str = Unicode::Normalize::NFD($str);
    $str =~ s/\pM//g;
Complete example:

    use 5.014;
    use utf8;
    use Unicode::Normalize;

    binmode STDOUT, ':encoding(UTF-8)';

    my $str  = "???ér???????????????";
    my $look = "Nation";

    say "before: $str\n";
    $str = NFD($str);
    # \pM is a short alias for \p{Mark} (http://perldoc.perl.org/perluniprops.html)
    $str =~ s/\pM//g;    # remove "marks"
    say "after: $str";
    say "is_match: ", $str =~ /$look/i || 0;
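To make the limits of the decompose-and-strip approach concrete, here is a minimal sketch. The helper name `strip_marks` and the sample strings are assumptions made for illustration; they do not come from the question. It shows that a letter such as U+00F8 (ø), which has no canonical decomposition, passes through unchanged, which is one reason this approach only works for some characters.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper: decompose the text, then delete every combining mark.
sub strip_marks {
    my ($text) = @_;
    my $decomposed = NFD($text);
    $decomposed =~ s/\p{Mark}//g;
    return $decomposed;
}

# U+00E9 (e with acute) and U+00EF (i with diaeresis) decompose into
# a base letter plus a combining mark, so the marks can be stripped.
my $plain = strip_marks("caf\x{e9} na\x{ef}ve");
print "matched\n" if $plain =~ /naive/i;

# U+00F8 (o with stroke) has no canonical decomposition,
# so NFD leaves it alone and nothing is stripped.
my $unchanged = strip_marks("\x{f8}");
print "o-with-stroke survived\n" if $unchanged eq "\x{f8}";
```

Escapes such as `\x{e9}` are used instead of literal characters so the sketch does not depend on the encoding the script file is saved in.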
What do you mean by "without extra modules"?
Here is a solution using Unicode::Normalize (see it on perldoc). I removed the "?" and "?" from your string; my Eclipse didn't want to save the script with them.

    use strict;
    use warnings;
    use utf8;
    use Unicode::Normalize;

    my $str = "??tér??t????l???t???";

    for ($str) {    # the variable we work on
        ## convert to Unicode first
        ## if your data comes in Latin-1, then uncomment (and add "use Encode;"):
        #$_ = Encode::decode( 'iso-8859-1', $_ );

        $_ = NFD($_);       ## decompose
        s/\pM//g;           ## strip combining characters
        s/[^\0-\x80]//g;    ## clear everything else
    }

    if ($str =~ /nation/) {
        print $str . "\n";
    }

The output is:
The "?" is removed from the string; it seems not to be a composed character.

The code in the for loop is taken from How to remove diacritic marks from characters.

Another interesting read is Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).

Update:

As @tchrist pointed out, there is an algorithm for this called the UCA (Unicode Collation Algorithm). @nordicdyno has already provided an implementation in his question.

The algorithm is described in Unicode Technical Standard #10, Unicode Collation Algorithm.

The Perl module is described on perldoc.perl.org.
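As a sketch of how the UCA approach can report positions directly, the collator's own `index` method can be called in a loop instead of the built-in `index`. The helper name `find_all` and the sample strings are assumptions made for illustration, not part of the original thread. Unlike searching for the first matched substring verbatim, this should also find occurrences that carry different accents or case.

```perl
use strict;
use warnings;
use Unicode::Collate;

# Hypothetical helper: return [start, length] pairs for every
# non-overlapping match of $needle in $haystack, comparing base
# letters only (level 1 ignores accents and case).
sub find_all {
    my ($haystack, $needle) = @_;

    # Normalization must be off: the search methods croak otherwise,
    # because positions would no longer refer to the original string.
    my $collator = Unicode::Collate->new( normalization => undef, level => 1 );

    my @hits;
    my $pos = 0;
    while ($pos <= length $haystack) {
        # In list context index() returns (position, length),
        # or an empty list when nothing matches.
        my @found = $collator->index($haystack, $needle, $pos);
        last unless @found;
        my ($start, $len) = @found;
        push @hits, [ $start, $len ];
        $pos = $start + ($len || 1);    # always move forward
    }
    return @hits;
}

# U+0101 is "a with macron", U+014C is "O with macron".
my @hits = find_all("n\x{101}tion and NATI\x{14c}N", "Nation");
print "found at: $_->[0], length $_->[1]\n" for @hits;
```

The returned length is taken from the collator rather than from `length($needle)`, since a match in the haystack may include combining characters that the needle does not have.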