Perl 前缀树实现

发布时间：2020-12-16 00:20:49 所属栏目：大数据来源：网络整理

导读：前缀树，用来处理大量字符串的查找、排序，也称为字典树，可以代替hash table。 http://en.wikipedia.org/wiki/Trie 以下翻译自Wikipedia： The following are the main advantages of tries over?binary search trees?(BSTs): Looking up keys is faster. L

前缀树，用来处理大量字符串的查找、排序，也称为字典树，可以代替hash table。

http://en.wikipedia.org/wiki/Trie

以下翻译自Wikipedia：

The following are the main advantages of tries over?binary search trees?(BSTs):

Looking up keys is faster. Looking up a key of length?m?takes worst case?O(m) time. A BST performs O(log(n)) comparisons of keys,where?n?is the number of elements in the tree,because lookups depend on the depth of the tree,which is logarithmic in the number of keys if the tree is balanced. Hence in the worst case,a BST takes O(m?log?n) time. Moreover,in the worst case log(n) will approach?m. Also,the simple operations tries use during lookup,such as array indexing using a character,are fast on real machines.
Tries are more space-efficient when they contain a large number of short keys,since nodes are shared between keys with common initial subsequences.
Tries facilitate?longest-prefix matching.
The number of internal nodes from root to leaf equals the length of the key. Balancing the tree is therefore of no concern.

相对二叉搜索树的优点：
1. 搜索更快。搜索一个长度为m 的字串最坏情况下需要O(m)，而二叉树取决与树的深度，最坏情况需要O(m log n)的时间。
2. Trie 树在存储空间上更加高效。由于相同的前缀，他们会共享前缀节点.
3. Trie 可以用来完成最大前缀字符串匹配。
4. 从跟节点到叶节点的节点数与搜索字符串长度相同。

The following are the main advantages of tries over?hash tables:

Tries support ordered iteration,whereas iteration over a hash table will result in a?pseudorandom?order given by the hash function (and further affected by the order of hash collisions,which is determined by the implementation).
Tries facilitate?longest-prefix matching,but hashing does not,as a consequence of the above. Performing such a "closest fit" find can,depending on implementation,be as quick as an exact find.
Tries tend to be faster on average at insertion than hash tables because hash tables must rebuild their index when it becomes full - a very expensive operation. Tries therefore have much better bounded worst-case time costs,which is important for latency-sensitive programs.
Since no hash function is used,tries are generally faster than hash tables for small keys.

相对hash table 的优点：
1. Trie 的迭代是已经排序的，而hash table 则是根据hash 函数的结果排序。
2. Trie 可以用来完成最大前缀字符串匹配，hash table不支持。
3. Trie 插入性能好于hash table。当hash table分配已满并继续增加key，需要重建index，这种情况下非常耗费资源。因而在边界情况下Trie有更好的的插入性能。
4. Trie 不需要计算hash值，因而对于短字符串有更加快速。

As replacement of other data structures

As mentioned,a trie has a number of advantages over binary search trees.^[4]?A trie can also be used to replace a?hash table,over which it has the following advantages:

Looking up data in a trie is faster in the worst case,O(m) time,compared to an imperfect hash table. An imperfect hash table can have key collisions. A key collision is the hash function mapping of different keys to the same position in a hash table. The worst-case lookup speed in an imperfect hash table is?O(N)?time,but far more typically is O(1),with O(m) time spent evaluating the hash.
There are no collisions of different keys in a trie.
Buckets in a trie which are analogous to hash table buckets that store key collisions are necessary only if a single key is associated with more than one value.
There is no need to provide a hash function or to change hash functions as more keys are added to a trie.
A trie can provide an alphabetical ordering of the entries by key.

Trie可以用来代替hash table，有如下优点：
1. 最坏情况下查找更加高效，需要O(m)时间。不完美的hash table会产生key碰撞。
2. Trie 没有key碰撞。
3. Trie 只有在允许一个key存储多个值的情况下才有类似hash碰撞发生。
4. 当更多的值插入时不需要hash 函数或者调整hash。
5. Trie 提供经过字典排序的key。

Tries do have some drawbacks as well:

Tries can be slower in some cases than hash tables for looking up data,especially if the data is directly accessed on a hard disk drive or some other secondary storage device where the random-access time is high compared to main memory.^[5]
Some keys,such as floating point numbers,can lead to long chains and prefixes that are not particularly meaningful. Nevertheless a bitwise trie can handle standard IEEE single and double format floating point numbers.

Perl代码：

use Data::Dumper;

# 默认只支持字符串的输入
sub trie_tree{
    my $root = {};
    for(@_){
        insert($root,$_);
    }
    return $root;
}

# 方便起见，每次只处理一个连续字符串
sub insert{
    my ($tree,$str) = @_;
    # 不包含字符串情况下返回
    return unless $str =~ /[a-zA-Z]/;
    
    # 所有节点都存小写字符
    $str =~ /([a-zA-Z]+)/;
    my @tmp = split //,lc $1;

    my $root = $tree;
    for(@tmp){
        if($root->{$_}){
            $root = $root->{$_};
        }
        else{
            # count用来标记一个完整的字符串到此结束如不为0，count值表示字符串出现的次数
            $root->{$_} = {count=>0,};
            $root = $root->{$_};
        }
    }
    $root->{count} ++;    
}

# 返回值
# undef 完整字符串未出现
# 0 查找字符串作为前缀字符串出现
# N 字符串出现N次
sub search{
    my ($tree,$str) = @_;

    return if $str =~ /[^a-zA-Z]/;
    
    my @tmp = split //,lc $str;
    my $root = $tree;
    for(@tmp){
        if($root->{$_}){
            $root = $root->{$_};
        }
        else{
            return
        }
    }
    return $root->{count};        
}

# 用三个array 来标记中间信息，例子两个字符串a，bc
# [root]
# [[a,b]] <- 判断循环条件
# []
# 第一次取a节点
# [root,a]
# [[b],[]]
# [a]
# 第二次取空节点，出栈
# [root]
# [[b]] <- 判断循环条件
# []
# 第三次取b节点
# [root,b]
# [[],[c]] <- 判断循环条件
# [b]
# 第四次取c节点
# [root,b,c]
# [[],[]] <- 判断循环条件
# [b,c]
# 第五次取空节点
# [root,b]
# [[]] <- 判断循环条件
# [b]
# 第六次取空节点
# [root]
# [] <- 判断循环条件为空，结束
# []
sub travel{
    my ($tree) = @_;
    # 储存当前节点指针
    my @node = ($tree);
    # 储存当前节点下的字符序列
    my @node_chars = ([grep { $_ ne 'count' } sort keys %$tree ]);
    # 储存字符串
    my @chars = ();
    
    while(@node_chars){
        if(! @{$node_chars[-1]}){
            pop @node;
            pop @node_chars;
            pop @chars;
            next;
        }
        else{
            # 存储信息压栈顺序要注意，否则会出错
            my $char = shift @{$node_chars[-1]};
            push @chars,$char;
            
            if($node[-1]->{$char}->{count} > 0){
                print join('',@chars),' => ',$node[-1]->{$char}->{count},"n";
            }
            
            push @node_chars,[grep { $_ ne 'count' } sort keys %{ $node[-1]->{$char} } ];
            push @node,$node[-1]->{$char};
            
        }
    }
}

my $tree = trie_tree(qw(this is just a testing));

insert($tree,'and');
insert($tree,'testing');

#~ print Dumper $tree;

#~ print search($tree,'testing');
#~ print search($tree,'and');

travel($tree);

输出：

a => 1
and => 1
is => 1
just => 1
testing => 2
this => 1

（编辑：李大同）

【声明】本站内容均来自网络，其相关言论仅代表作者个人观点，不代表本站立场。若无意侵犯到您的权利，请及时与联系站长删除相关内容!