Regular expressions – How to match a string with diacritics in Perl?

Published: 2020-12-14 06:35:05 | Category: Encyclopedia | Source: compiled from the web

For example, match "Nation" in "???ér???????????????" without extra modules. Is that possible in newer Perl versions (5.14, 5.15, etc.)?

I found an answer! Thanks to tchrist

The right solution is to match with the UCA (thanks to http://stackoverflow.com/users/471272/tchrist).

# find start/end offsets of each matched substring (non-overlapping)
use 5.014;
use strict;
use warnings;
use utf8;
use Unicode::Collate;
binmode STDOUT, ':encoding(UTF-8)';
my $str  = "???ér???????????????" x 2;
my $look = "Nation";
# level => 1 compares base letters only, ignoring accents and case;
# the search methods (match, index, ...) need normalization switched off
my $Collator = Unicode::Collate->new(
    normalization => undef,
    level         => 1,
);

my @match = $Collator->match($str, $look);
if (@match) {
    my $found = $match[0];
    my $f_len = length($found);
    say "match result: $found (length is $f_len)";
    my $offset = 0;
    # core index() only finds exact repeats of the first matched substring
    while ((my $start = index($str, $found, $offset)) != -1) {
        my $end = $start + $f_len;    # exclusive end offset
        say sprintf("found at: %s,%s", $start, $end);
        $offset = $end;               # resume right after this match
    }
}
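
The loop above re-searches with the core index(), so it only finds later occurrences that carry exactly the same accents as the first one. A minimal sketch of an alternative, assuming the index method of Unicode::Collate (which returns position and length in list context), is shown below. The helper name find_all and the sample string written with \x{...} escapes are illustrative assumptions, not part of the original answer.

```perl
use strict;
use warnings;
use Unicode::Collate;

# Same collator settings as in the answer: primary level, no normalization.
my $collator = Unicode::Collate->new(normalization => undef, level => 1);

# Hypothetical helper: returns [start, length] pairs for every
# non-overlapping primary-level match of $needle in $haystack.
sub find_all {
    my ($haystack, $needle) = @_;
    my @hits;
    my $pos = 0;
    # In list context index() returns (position, length), or an empty list.
    while (my ($start, $len) = $collator->index($haystack, $needle, $pos)) {
        push @hits, [$start, $len];
        $pos = $start + ($len || 1);    # always move forward
    }
    return @hits;
}

# Assumed sample text, spelled with escapes so the file encoding does not matter.
my $text = "Int\x{E9}r\x{148}\x{E5}ti\x{F4}n\x{E0}liz\x{E5}ti\x{F4}n";
for my $hit (find_all($text, "Nation")) {
    my ($start, $len) = @$hit;
    printf "found at: %d,%d\n", $start, $start + $len;
}
```

Because each hit comes from the collator itself, occurrences with different accents are reported as well.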

The wrong (but working) solution, from http://www.perlmonks.org/?node_id=485681

The magic piece of code is:

$str = Unicode::Normalize::NFD($str); $str =~ s/\pM//g;

Code example:

use 5.014;
use utf8;
use Unicode::Normalize;

binmode STDOUT, ':encoding(UTF-8)';
my $str  = "???ér???????????????";
my $look = "Nation";
say "before: $str\n";
$str = NFD($str);
# \pM is short for \p{Mark} (http://perldoc.perl.org/perluniprops.html)
$str =~ s/\pM//g; # remove "marks"
say "after: $str";
say "is_match: ", $str =~ /$look/i || 0;
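
The same idea can be wrapped in two small functions. This is only a sketch: the names strip_marks and contains_ignoring_marks are made up for illustration, and the sample string is an assumed stand-in written with \x{...} escapes.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper: decompose, then delete every combining mark.
sub strip_marks {
    my ($s) = @_;
    my $decomposed = NFD($s);
    $decomposed =~ s/\p{Mark}//g;
    return $decomposed;
}

# Hypothetical helper: case-insensitive substring test on the stripped forms.
sub contains_ignoring_marks {
    my ($haystack, $needle) = @_;
    my $plain = strip_marks($needle);
    return strip_marks($haystack) =~ /\Q$plain\E/i ? 1 : 0;
}

# Assumed sample text: "Internation" with accents on several letters.
my $sample = "Int\x{E9}r\x{148}\x{E5}ti\x{F4}n";
print contains_ignoring_marks($sample, "Nation") ? "match\n" : "no match\n";
```

The \Q...\E quoting keeps any regex metacharacters in the search word literal.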

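What do you mean by "without extra modules"?

Here is a solution using Unicode::Normalize (see the Perl documentation).

I removed the "?" and "?" from your string; my Eclipse refused to save the script with them in it.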

use strict;
use warnings;
use utf8;
use Unicode::Normalize;

my $str = "??tér??t????l???t???";

for ( $str ) {  # the variable we work on
   ##  convert to Unicode first
   ##  if your data comes in Latin-1, then uncomment
   ##  (this also needs "use Encode;"):
   #$_ = Encode::decode( 'iso-8859-1', $_ );
   $_ = NFD( $_ );   ##  decompose
   s/\pM//g;         ##  strip combining characters
   s/[^\0-\x80]//g;  ##  clear everything else
 }

if ($str =~ /nation/) {
  print $str . "\n";
}

The output is

Internationaliation

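The "?" was removed from the string. It does not seem to be a composed character, that is, one that decomposes into a base letter plus a combining mark, so the last substitution deletes it entirely.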
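
A short check makes this behaviour visible. The sketch below tests whether NFD changes a character at all. The helper name decomposes and the four code points are assumptions chosen for illustration (the character lost from the original string is not known): é and ż have canonical decompositions, while the stroke letters ø and ł do not.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper: true if NFD splits the character into several code points.
sub decomposes {
    my ($char) = @_;
    return NFD($char) ne $char ? 1 : 0;
}

# Assumed examples: U+00E9 (e acute), U+017C (z dot above),
# U+00F8 (o with stroke), U+0142 (l with stroke).
for my $cp (0xE9, 0x17C, 0xF8, 0x142) {
    printf "U+%04X %s\n", $cp,
        decomposes(chr $cp) ? "decomposes" : "has no canonical decomposition";
}
```

A character without a canonical decomposition survives s/\pM//g unchanged and is then deleted by s/[^\0-\x80]//g, which is how a letter can disappear from the output.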

The code in the for loop comes from "How to remove diacritic marks from characters".

Another interesting read is Joel Spolsky's "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)".

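Update:

As @tchrist pointed out, there is an algorithm called the UCA (Unicode Collation Algorithm). @nordicdyno has already provided an implementation in his question.

The algorithm is described in Unicode Technical Standard #10, Unicode Collation Algorithm.

The Perl module is described at perldoc.perl.org.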
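
To illustrate what the collation level means in practice, here is a minimal sketch using the eq method of Unicode::Collate. The helper name same_base_letters and the sample words are assumptions for illustration. At level 1 only the base letters are compared, whereas the default collator also takes accents and case into account.

```perl
use strict;
use warnings;
use Unicode::Collate;

my $primary = Unicode::Collate->new(level => 1);   # base letters only
my $full    = Unicode::Collate->new;               # default: all levels

# Hypothetical helper: true if two strings differ only by accents or case.
sub same_base_letters {
    my ($left, $right) = @_;
    return $primary->eq($left, $right) ? 1 : 0;
}

# Assumed sample: "Nation" with a caron, a ring and a circumflex.
my $accented = "\x{147}\x{E5}ti\x{F4}n";
print "level 1: ", same_base_letters("nation", $accented), "\n";
print "default: ", ($full->eq("nation", $accented) ? 1 : 0), "\n";
```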

(Editor: 李大同)
