Why is it faster to count letters with String#count than with String#chars in Ruby?
Using the following benchmark:
```ruby
def create_genome
  "gattaca" * 100
end

def count_frequency_using_chars(sequence)
  100000.times do
    sequence.chars.group_by{|x| x}.map{|letter, array| [letter, array.count]}
  end
end

def count_frequency_using_count(sequence)
  100000.times do
    ["a","c","g","t"].map{|letter| sequence.count(letter)}
  end
end

sequence = create_genome

count_frequency_using_chars(sequence)
count_frequency_using_count(sequence)
```

I found that in C-based Ruby, for both 1.8 and 1.9.2, using String#count(letter) is approximately 50 times faster than grouping and counting with Enumerable#group_by and Array#count. I was slightly surprised by this, because the String#count approach reads through the string four times on each iteration, whereas the latter only reads through it once.

I tried running the code under ruby-prof and perftools.rb, and both simply indicated that String#chars took 90% of the time, with no breakdown of where that 90% was spent.

If I had to guess why there's a difference, I'd say that creating 70 million single-character strings would be expensive, but how would I be able to know? (Update: String#chars isn't the culprit — see the benchmark for mainly_execute_a_trivial_block.)

Edit: current benchmark, using 1.9.2 patchlevel 180:

```ruby
require 'pp'
require 'benchmark'

def create_genome
  "gattaca" * 100
end

ZILLION = 100000

def count_frequency_using_count(sequence)
  ZILLION.times do
    ["a","c","g","t"].map{|letter| sequence.count(letter)}
  end
end

def count_frequency_using_chars(sequence)
  ZILLION.times do
    sequence.chars.group_by{|x| x}.map{|letter, array| [letter, array.count]}
  end
end

def count_frequency_using_inject_hash(sequence)
  ZILLION.times do
    sequence.chars.inject(Hash.new(0)) { |h, e| h[e] += 1; h }
  end
end

def count_frequency_using_each_with_object(sequence)
  ZILLION.times do
    sequence.chars.each_with_object(Hash.new(0)) { |char, hash| hash[char] += 1 }
  end
end

def just_group_by(sequence)
  ZILLION.times do
    sequence.chars.group_by{|x| x}
  end
end

def just_chars_and_trivial_block(sequence)
  ZILLION.times do
    sequence.chars() {}
  end
end

def mainly_execute_a_trivial_block(sequence)
  ZILLION.times do
    sequence.length.times() {}
  end
end

def execute_an_empty_loop_instead(sequence)
  ZILLION.times do
    i = 0
    max = sequence.length
    until i == max
      i += 1
    end
  end
end

sequence = create_genome

puts RUBY_VERSION

Benchmark.bm do |benchmark|
  benchmark.report do
    count_frequency_using_count(sequence)
  end
  benchmark.report do
    count_frequency_using_chars(sequence)
  end
  benchmark.report do
    count_frequency_using_inject_hash(sequence)
  end
  benchmark.report do
    count_frequency_using_each_with_object(sequence)
  end
  benchmark.report do
    just_group_by(sequence)
  end
  benchmark.report do
    just_chars_and_trivial_block(sequence)
  end
  benchmark.report do
    mainly_execute_a_trivial_block(sequence)
  end
  benchmark.report do
    execute_an_empty_loop_instead(sequence)
  end
end
```

Results:

```
      user     system      total        real
  0.510000   0.000000   0.510000 (  0.508499) # count_frequency_using_count
 23.470000   0.030000  23.500000 ( 23.575487) # count_frequency_using_chars
 32.770000   0.050000  32.820000 ( 32.874634) # count_frequency_using_inject_hash
 31.880000   0.040000  31.920000 ( 31.942437) # count_frequency_using_each_with_object
 22.950000   0.030000  22.980000 ( 22.970500) # just_group_by
 13.300000   0.020000  13.320000 ( 13.314466) # just_chars_and_trivial_block
  5.660000   0.000000   5.660000 (  5.661516) # mainly_execute_a_trivial_block
  1.930000   0.010000   1.940000 (  1.934861) # execute_an_empty_loop_instead
```

Solution
This has nothing to do with Ruby internals. You are comparing apples and oranges.
In your first example, you are grouping the 700-character string 100,000 times and then finding the counts. So the problem is in your logic, not in the counting: in the second approach you are only counting. Make both approaches do only the counting by changing the first example like this:

```ruby
def count_frequency_using_chars(sequence)
  grouped = sequence.chars.group_by{|x| x}
  100000.times do
    grouped.map{|letter, array| [letter, array.count]}
  end
end
```

and it is as fast as your second one.

Edit

This approach is 3x faster than count_frequency_using_count; check the benchmarks:

```ruby
def count_frequency_using_chars_with_single_group(sequence)
  grouped = sequence.chars.group_by{|x| x}
  100000.times do
    grouped.map{|letter, array| [letter, array.count]}
  end
end

def count_frequency_using_count(sequence)
  100000.times do
    ["a","c","g","t"].map{|letter| sequence.count(letter)}
  end
end

Benchmark.bm do |benchmark|
  benchmark.report do
    pp count_frequency_using_chars_with_single_group(sequence)
  end
  benchmark.report do
    pp count_frequency_using_count(sequence)
  end
end
```

```
      user     system      total        real
  0.410000   0.000000   0.410000 (  0.419100)
  1.330000   0.000000   1.330000 (  1.324431)
```

Regarding Andrew's comment — that this measures the character composition of 100,000 sequences once each, rather than the character composition of one sequence 100,000 times, and that the count approach is still much slower than the group_by approach — I benchmarked large strings, as you said:

```ruby
seq = "gattaca" * 10000       # seq length is 70000
arr_seq = (1..10).map { |x| seq }  # 10 seq items
```

and changed the methods to handle multiple sequences:

```ruby
def count_frequency_using_chars_with_single_group(sequences)
  sequences.each do |sequence|
    grouped = sequence.chars.group_by{|x| x}
    100000.times do
      grouped.map{|letter, array| [letter, array.count]}
    end
  end
end

def count_frequency_using_count(sequences)
  sequences.each do |sequence|
    100000.times do
      ["a","c","g","t"].map{|letter| sequence.count(letter)}
    end
  end
end

Benchmark.bm do |benchmark|
  benchmark.report do
    pp count_frequency_using_chars_with_single_group(arr_seq)
  end
  benchmark.report do
    pp count_frequency_using_count(arr_seq)
  end
end
```

For processing 10 sequences of length 70,000 each, 100,000 times:

```
        user     system       total         real
    3.710000   0.040000    3.750000 (  23.452889) # modified group_by approach
 1039.180000   6.920000 1046.100000 (1071.374730) # simple char count approach
```

For high-volume strings, your simple character-count approach is roughly 47 times slower than the modified group_by approach, and I ran the above benchmark with only 10 sequences, each of length 70,000. Suppose there were 100 or 1000 sequences — the simple count would never be an option. Right?
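A note on the question's "how would I be able to know?": one way to test the hypothesis that the slowdown comes from allocating millions of single-character strings is to count object allocations directly around each approach. A minimal sketch, assuming Ruby 2.1+ for the `GC.stat(:total_allocated_objects)` counter; the `allocations` helper here is ours, not part of the original post:

```ruby
# Measure how many objects a block allocates, using the cumulative
# allocation counter exposed by GC.stat (available since Ruby 2.1).
def allocations
  GC.disable                                    # avoid GC noise during measurement
  before = GC.stat(:total_allocated_objects)
  yield
  GC.stat(:total_allocated_objects) - before
ensure
  GC.enable
end

sequence = "gattaca" * 100   # 700 characters, as in the question

# chars/group_by: allocates one String per character, plus the
# intermediate array, the grouping hash, and the result pairs.
chars_allocs = allocations do
  sequence.chars.group_by { |x| x }.map { |letter, array| [letter, array.count] }
end

# String#count: scans the string in C and returns Integers,
# so only a handful of objects are created.
count_allocs = allocations do
  ["a", "c", "g", "t"].map { |letter| sequence.count(letter) }
end

puts "chars/group_by allocations: #{chars_allocs}"   # at least 700 (one String per char)
puts "String#count allocations:   #{count_allocs}"   # a small constant number
```

This makes the cost visible without a profiler: the group_by pipeline allocates proportionally to the string length on every iteration, while String#count allocates almost nothing.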
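A side note for readers on current Ruby (not part of the original answer): since Ruby 2.7, Enumerable#tally does the grouping and counting in a single pass, avoiding both the intermediate arrays that group_by builds and the repeated string scans of String#count:

```ruby
# Enumerable#tally (Ruby 2.7+) builds the frequency hash in one pass
# over the characters, without group_by's intermediate arrays.
counts = "gattaca".each_char.tally
# => {"g"=>1, "a"=>3, "t"=>2, "c"=>1}
```

Using `each_char` (an enumerator) rather than `chars` (an array) also avoids materializing the full character array before counting.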