Haskell regular expressions

Published: 2020-12-14 06:07:37 | Category: Encyclopedia | Source: compiled from the web


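I have been looking at the existing options for regular expressions in Haskell, and I would like to understand where the performance gap comes from when comparing the options with each other, and especially with a simple call to grep.

I have a relatively small trace file (about 110 MB, compared with the several tens of GB that are usual in most of my use cases):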

$ du radixtracefile
113120 radixtracefile
$ wc -l radixtracefile
1051565 radixtracefile

> I first tried to find how many matches of the (arbitrary) pattern .*504.*ll were in there, using grep:

$ time grep -nE ".*504.*ll" radixtracefile | wc -l
309

real   0m0.211s
user   0m0.202s
sys    0m0.010s

> I then looked at Text.Regex.TDFA (version 1.2.1) with Data.ByteString:

import Text.Regex.TDFA
import qualified Data.ByteString as B

main :: IO ()
main = do
    f <- B.readFile "radixtracefile"
    -- Match against the whole file at once; each result is the full
    -- match followed by its (here nonexistent) subgroups.
    matches :: [[B.ByteString]] <- f =~~ ".*504.*ll"
    mapM_ (print . head) matches

Build and run:

$ ghc -O2 test-TDFA.hs -XScopedTypeVariables
[1 of 1] Compiling Main             ( test-TDFA.hs, test-TDFA.o )
Linking test-TDFA ...
$ time ./test-TDFA | wc -l
309

real   0m4.463s
user   0m4.431s
sys    0m0.036s

> Then I looked at Data.Text.ICU.Regex (version 0.7.0.1), which has Unicode support:

import Control.Monad.Loops (whileM_)
import qualified Data.Text as T
import qualified Data.Text.IO as TIO
import Data.Text.ICU.Regex

main :: IO ()
main = do
    re <- regex [] $ T.pack ".*504.*ll"
    f <- TIO.readFile "radixtracefile"
    setText re f
    -- Step through the matches one at a time, printing each start offset.
    whileM_ (findNext re) $ do
        a <- start re 0
        putStrLn $ "last match at :" ++ (show a)

Build and run:

$ ghc -O2 test-ICU.hs
[1 of 1] Compiling Main             ( test-ICU.hs, test-ICU.o )
Linking test-ICU ...
$ time ./test-ICU | wc -l
309

real   1m36.407s
user   1m36.090s
sys    0m0.169s

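I am using GHC version 7.6.3. I have not yet had a chance to test the other Haskell regex options. I knew I would not get the performance I get with grep, and I was perfectly fine with that, but TDFA with ByteString being roughly 20 times slower is rather scary. I cannot really understand why, since I naively thought this was a wrapper around a native backend. Am I somehow not using the module correctly?

(And let's not mention the ICU + Text combination, which goes through the roof.)

Is there an option I have not tested yet that would make me happier?

EDIT:

> Text.Regex.PCRE (version 0.94.4) with Data.ByteString: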

import Text.Regex.PCRE
import qualified Data.ByteString as B

main :: IO ()
main = do
    f <- B.readFile "radixtracefile"
    -- Same whole-file match as the TDFA version, with the PCRE backend.
    matches :: [[B.ByteString]] <- f =~~ ".*504.*ll"
    mapM_ (print . head) matches

Build and run:

$ ghc -O2 test-PCRE.hs -XScopedTypeVariables
[1 of 1] Compiling Main             ( test-PCRE.hs, test-PCRE.o )
Linking test-PCRE ...
$ time ./test-PCRE | wc -l
309

real   0m1.442s
user   0m1.412s
sys    0m0.031s

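Better, but still a factor of about 7 slower than grep…

Solution

So, after looking at other libraries for a while, I ended up trying PCRE.Light (version 0.4.0.4):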

import Control.Monad
import Text.Regex.PCRE.Light
import qualified Data.ByteString.Char8 as B

main = do
    f <- B.readFile "radixtracefile"
    -- Split the file into lines by hand, then match each line separately.
    let lines = B.split '\n' f
    let re = compile (B.pack ".*504.*ll") []
    forM_ lines $ \l -> maybe (return ()) print $ match re l []

Here is what I get out of that:

$ ghc -O2 test-PCRELight.hs -XScopedTypeVariables
[1 of 1] Compiling Main             ( test-PCRELight.hs, test-PCRELight.o )
Linking test-PCRELight ...
$ time ./test-PCRELight | wc -l
309

real   0m0.832s
user   0m0.803s
sys    0m0.027s

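I think this is decent enough for my purposes. I might try to see what happens with the other libraries when I do the line splitting manually, as I did here, although I doubt it will make a big difference.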
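For reference, the manual line-splitting experiment could look roughly like the sketch below for Text.Regex.TDFA. It has not been timed against the trace file, so it makes no claim about speed. It assumes the regex-tdfa and bytestring packages are installed, and matchingLines is a hypothetical helper name introduced only for this illustration.

```haskell
import qualified Data.ByteString.Char8 as B
import Text.Regex.TDFA (Regex, makeRegex, matchTest)

-- Hypothetical helper: keep only the lines that the compiled regex matches.
-- The regex is compiled once by the caller and reused for every line.
matchingLines :: Regex -> B.ByteString -> [B.ByteString]
matchingLines re = filter (matchTest re) . B.lines

main :: IO ()
main = do
    f <- B.readFile "radixtracefile"
    -- Compile the pattern a single time, outside the per-line loop.
    let re = makeRegex (B.pack ".*504.*ll") :: Regex
    mapM_ B.putStrLn (matchingLines re f)
```

It would be built the same way as the other tests (ghc -O2) and piped through wc -l to check that the match count agrees with grep.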

(Editor: 李大同)
