How can I parse a large XML file in Haskell with limited resources?
I want to extract information from a large XML file (about 20 GB) in Haskell. Because the file is so large, I used the SAX parsing functions from hexpat.
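Here is a simple piece of code that I tested:

    import qualified Data.ByteString.Lazy as L
    import Data.List (foldl')        -- needed for foldl'
    import Data.Text (Text)          -- needed for the SAXEvent Text Text annotation
    import Text.XML.Expat.SAX as Sax

    -- processEvent (String -> SAXEvent Text Text -> String) is not shown here
    parse :: FilePath -> IO ()
    parse path = do
        inputText <- L.readFile path
        let saxEvents = Sax.parse defaultParseOptions inputText :: [SAXEvent Text Text]
        let txt = foldl' processEvent "" saxEvents
        putStrLn txt

After enabling profiling in Cabal, the report says that parse.saxEvents accounts for 85% of the allocated memory. I also tried foldr, with the same result.

If processEvent becomes complex enough, the program crashes with a stack space overflow error.

What am I doing wrong?

Solution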
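You don't say what processEvent looks like. In principle, a strict left fold over lazily generated input read from a lazy ByteString should be unproblematic, so I'm not sure what is going wrong in your case. But when dealing with gigantic files, one ought to use types designed for streaming!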
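Since processEvent is not shown, the following is only a guess at one common cause of such overflows: foldl' forces the accumulator to weak head normal form only, so an accumulator with lazy fields can still pile up thunks. The sketch below uses a hypothetical Event type as a stand-in for hexpat's SAXEvent (so that it needs nothing beyond base) and a hypothetical Stats accumulator with strict fields; none of these names come from the question.

```haskell
import Data.List (foldl')

-- Hypothetical stand-in for hexpat's SAXEvent, so the sketch needs only base.
data Event
  = Start String [(String, String)]  -- tag name and attributes
  | End String                       -- tag name
  | Chars String                     -- character data

-- Strict fields: forcing a Stats value to WHNF also forces both counters,
-- so foldl' cannot leave a chain of unevaluated (+ 1) thunks behind.
data Stats = Stats
  { startTags :: !Int
  , textChars :: !Int
  } deriving (Eq, Show)

-- One step of the fold: update the counters for a single event.
step :: Stats -> Event -> Stats
step (Stats s c) ev = case ev of
  Start _ _ -> Stats (s + 1) c
  Chars t   -> Stats s (c + length t)
  End _     -> Stats s c

-- Strict left fold over the (possibly lazily produced) event list.
summarize :: [Event] -> Stats
summarize = foldl' step (Stats 0 0)

-- A tiny made-up event stream for illustration.
sampleEvents :: [Event]
sampleEvents =
  [ Start "FacilitySite" [("registryId", "110007915364")]
  , Chars "abc"
  , End "FacilitySite"
  , Start "FacilitySite" []
  , End "FacilitySite"
  ]
```

Here summarize sampleEvents evaluates to Stats {startTags = 2, textChars = 3}. With a plain String or lazy tuple accumulator, the same fold may still build up unevaluated work even though foldl' is used.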
In fact, hexpat does have a 'streaming' interface (just like xml-conduit). It uses the not-too-well-known List package and the List class that package defines (Data.List.Class). The small PipesSax.hs module at the end wraps Pipes.ListT so that this interface can be exported as ordinary Pipes.Producer functions such as parseProducer.

Once we have parseProducer, we can convert a ByteString Producer into a Producer of SAXEvents with Text or ByteString components. Here are some simple operations. I was using a 238M "input.xml"; the program never needs more than 6 MB of memory, to judge from top.

Sax.hs: most of the IO actions use the registryIds pipe defined at the bottom, which is tailored to a giant bit of XML of which this is a valid 1000 fragment: http://sprunge.us/WaQK

    {-# LANGUAGE OverloadedStrings #-}
    import PipesSax (parseProducer)
    import Data.ByteString (ByteString)
    import Text.XML.Expat.SAX
    import Pipes  -- cabal install pipes pipes-bytestring
    import Pipes.ByteString (toHandle, fromHandle, stdin, stdout)
    import qualified Pipes.Prelude as P
    import qualified System.IO as IO
    import qualified Data.ByteString.Char8 as Char8

    sax :: MonadIO m
        => Producer ByteString m ()
        -> Producer (SAXEvent ByteString ByteString) m ()
    sax = parseProducer defaultParseOptions

    -- stream xml from stdin, yielding hexpat tagstream to stdout
    main0 :: IO ()
    main0 = runEffect $ sax stdin >-> P.print

    -- stream the extracted 'IDs' from stdin to stdout
    main1 :: IO ()
    main1 = runEffect $ sax stdin >-> registryIds >-> stdout

    -- write all IDs to a file
    main2 =
        IO.withFile "input.xml" IO.ReadMode $ \inp ->
        IO.withFile "output.txt" IO.WriteMode $ \out ->
            runEffect $ sax (fromHandle inp) >-> registryIds >-> toHandle out

    -- folds:
    -- print number of IDs
    main3 = IO.withFile "input.xml" IO.ReadMode $ \inp -> do
        n <- P.length $ sax (fromHandle inp) >-> registryIds
        print n

    -- sum the meaningful part of the IDs - a dumb fold for illustration
    main4 = IO.withFile "input.xml" IO.ReadMode $ \inp -> do
        let pipeline = sax (fromHandle inp) >-> registryIds >-> P.map readIntId
        n <- P.fold (+) 0 id pipeline
        print n
      where
        readIntId :: ByteString -> Integer
        readIntId = maybe 0 (fromIntegral . fst) . Char8.readInt . Char8.drop 2

    -- my xml has tags with attributes that appear via hexpat thus:
    --   StartElement "FacilitySite" [("registryId","110007915364")]
    -- and the like. This is just an arbitrary demo stream manipulation.
    registryIds :: Monad m => Pipe (SAXEvent ByteString ByteString) ByteString m ()
    registryIds = do
        e <- await  -- we look for a 'SAXEvent'
        case e of   -- if it matches, we yield, else we go to the next event
            StartElement "FacilitySite" [("registryId", a)] -> do
                yield a
                yield "\n"
                registryIds
            _ -> registryIds

The 'library', PipesSax.hs: this just newtypes Pipes.ListT to get the appropriate instances. We don't export anything to do with List or ListT; we just use the standard Pipes.Producer concept.

    {-# LANGUAGE TypeFamilies, GeneralizedNewtypeDeriving #-}
    module PipesSax (parseProducerLocations, parseProducer) where

    import Data.ByteString (ByteString)
    import Text.XML.Expat.SAX
    import Data.List.Class
    import Control.Monad
    import Control.Applicative
    import Pipes
    import qualified Pipes.Internal as I

    parseProducer
        :: (Monad m, GenericXMLString tag, GenericXMLString text)
        => ParseOptions tag text
        -> Producer ByteString m ()
        -> Producer (SAXEvent tag text) m ()
    parseProducer opt = enumerate . enumerate_ . parseG opt . Select_ . Select

    parseProducerLocations
        :: (Monad m, GenericXMLString tag, GenericXMLString text)
        => ParseOptions tag text
        -> Producer ByteString m ()
        -> Producer (SAXEvent tag text, XMLParseLocation) m ()
    parseProducerLocations opt =
        enumerate . enumerate_ . parseLocationsG opt . Select_ . Select

    -- wrapper around Pipes.ListT that can carry the List instance hexpat needs
    newtype ListT_ m a = Select_ { enumerate_ :: ListT m a }
        deriving (Functor, Monad, MonadPlus, MonadIO,
                  Applicative, Alternative, Monoid, MonadTrans)

    instance Monad m => List (ListT_ m) where
        type ItemM (ListT_ m) = m
        joinL = Select_ . Select . I.M . liftM (enumerate . enumerate_)
        runList = liftM emend . next . enumerate . enumerate_
          where
            emend (Right (a, q)) = Cons a (Select_ (Select q))
            emend _              = Nil