C#: Is there a fast way to parse a large file with regular expressions?

Published: 2020-12-15 06:31:53 | Category: Encyclopedia | Source: compiled from the web

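Question:
I have a very, very large file that I need to parse line by line to extract 3 values from each line. Everything works, but parsing the whole file takes a long time. Is it possible to do this in a few seconds? It typically takes 1 to 2 minutes.

The sample file size is 148,208 KB.

I am using a regular expression to parse each line.

Here is my C# code: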

private static void ReadTheLines(int max, Responder rp, string inputFile)
{
    List<int> rate = new List<int>();
    double counter = 1;
    try
    {
        using (var sr = new StreamReader(inputFile, Encoding.UTF8, true, 1024))
        {
            string line;
            Console.WriteLine("Reading....");
            while ((line = sr.ReadLine()) != null)
            {
                if (counter <= max)
                {
                    counter++;
                    rate = rp.GetRateLine(line);
                }
                else if (max == 0)
                {
                    counter++;
                    rate = rp.GetRateLine(line);
                }
            }
            rp.GetRate(rate);
            Console.ReadLine();
        }
    }
    catch (Exception e)
    {
        Console.WriteLine("The file could not be read:");
        Console.WriteLine(e.Message);
    }
}

Here is my regular expression:

public List<int> GetRateLine(string justALine)
{
    const string reg = @"^\d{1,}.+\[(.*)\s[-]\d{1,}\].+GET.*HTTP.*\d{3}[\s](\d{1,})[\s](\d{1,})$";
    Match match = Regex.Match(justALine, reg, RegexOptions.IgnoreCase);

    // Check whether the line matched the pattern.
    if (match.Success)
    {
        // Take the third captured group and store it as an integer.

        string theRate = match.Groups[3].Value;
        Ratestorage.Add(Convert.ToInt32(theRate));
    }
    else
    {
        Ratestorage.Add(0);
    }
    return Ratestorage;
}

Here is an example of a line to parse; a file usually has about 200,000 lines:

10.10.10.10 - - [27/Nov/2002:16:46:20 -0500] "GET /solr/ HTTP/1.1" 200 4926 789
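
To make concrete what the pattern captures, here is a minimal sketch that applies it to the sample line. It assumes the pattern is meant to be written with its backslash escapes (`\d`, `\s`, `\[`, `\]`) and that the sample line uses plain ASCII hyphens and quotes. The class name `SampleLineCheck` and the method `Capture` are hypothetical, introduced only for illustration; building the `Regex` once in a static field is a choice of this sketch, not something the question's code does.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SampleLineCheck
{
    // The pattern from the question, with the escapes written out.
    // Created once and reused for every call (an assumption of this sketch).
    private static readonly Regex LineRegex = new Regex(
        @"^\d{1,}.+\[(.*)\s[-]\d{1,}\].+GET.*HTTP.*\d{3}[\s](\d{1,})[\s](\d{1,})$",
        RegexOptions.IgnoreCase);

    // Returns the three captured values, or null when the line does not match.
    public static string[] Capture(string line)
    {
        Match m = LineRegex.Match(line);
        if (!m.Success)
        {
            return null;
        }
        return new[] { m.Groups[1].Value, m.Groups[2].Value, m.Groups[3].Value };
    }
}
```

Called on the sample line, `Capture` should return the timestamp `27/Nov/2002:16:46:20` and the two trailing numbers `4926` and `789`; the question's `GetRateLine` keeps only the last of these.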

Solution

Memory-mapped files (MMF) and the Task Parallel Library (TPL) can help here.

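> Create a persisted MMF with multiple random-access views, each view corresponding to a particular part of the file.
> Define a parsing method that takes a parameter such as IEnumerable<string>, essentially abstracting a set of unparsed lines.
> Create and start one TPL task per MMF view, using Parse(IEnumerable<string>) as the task action.
> Each worker task adds its parsed data to a shared queue of type BlockingCollection.
> Another task listens on the BlockingCollection (GetConsumingEnumerable()) and processes all the data already parsed by the worker tasks.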
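
The steps above can be sketched roughly as follows. This is only one possible reading of the answer, not its author's code: the names `ParallelLogParser`, `FindChunks`, `ReadLines`, `Parse`, and `ParseFile` are hypothetical, the file is assumed to be UTF-8 with `\n`-terminated lines, `RegexOptions.Compiled` is an addition of this sketch, and the pattern is assumed to carry its backslash escapes. Chunk boundaries are moved forward to the next newline so that no line is split across two views.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.IO.MemoryMappedFiles;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

public static class ParallelLogParser
{
    // Pattern from the question; compiled once and shared by all workers (assumption).
    private static readonly Regex LineRegex = new Regex(
        @"^\d{1,}.+\[(.*)\s[-]\d{1,}\].+GET.*HTTP.*\d{3}[\s](\d{1,})[\s](\d{1,})$",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Splits the file into up to `parts` non-empty (offset, length) ranges,
    // each ending just after a '\n' or at the end of the file.
    private static List<Tuple<long, long>> FindChunks(string path, int parts)
    {
        var chunks = new List<Tuple<long, long>>();
        using (FileStream fs = File.OpenRead(path))
        {
            long length = fs.Length, start = 0;
            for (int i = 1; i <= parts && start < length; i++)
            {
                long end = (i == parts) ? length : Math.Max(start, length * i / parts);
                if (end < length)
                {
                    fs.Position = end;
                    int b;
                    while ((b = fs.ReadByte()) != -1 && b != '\n') { }
                    end = fs.Position; // just past the newline, or end of file
                }
                chunks.Add(Tuple.Create(start, end - start));
                start = end;
            }
        }
        return chunks;
    }

    private static IEnumerable<string> ReadLines(TextReader reader)
    {
        string line;
        while ((line = reader.ReadLine()) != null) yield return line;
    }

    // Stand-in for the question's GetRateLine: group 3, or 0 when a line does not match.
    private static IEnumerable<int> Parse(IEnumerable<string> lines)
    {
        foreach (string line in lines)
        {
            Match m = LineRegex.Match(line);
            yield return m.Success ? int.Parse(m.Groups[3].Value) : 0;
        }
    }

    public static List<int> ParseFile(string path, int workers)
    {
        List<Tuple<long, long>> chunks = FindChunks(path, workers);
        var results = new List<int>();
        if (chunks.Count == 0) return results; // empty file: nothing to map

        using (var queue = new BlockingCollection<int>())
        using (MemoryMappedFile mmf = MemoryMappedFile.CreateFromFile(
            path, FileMode.Open, null, 0, MemoryMappedFileAccess.Read))
        {
            // Consumer: the only task that touches `results`.
            Task consumer = Task.Factory.StartNew(() =>
            {
                foreach (int rate in queue.GetConsumingEnumerable()) results.Add(rate);
            });

            // One view and one worker task per chunk; views are created on this thread.
            Task[] producers = chunks.Select(c =>
            {
                Stream view = mmf.CreateViewStream(c.Item1, c.Item2, MemoryMappedFileAccess.Read);
                return Task.Factory.StartNew(() =>
                {
                    using (var reader = new StreamReader(view, Encoding.UTF8))
                        foreach (int rate in Parse(ReadLines(reader))) queue.Add(rate);
                });
            }).ToArray();

            try { Task.WaitAll(producers); }
            finally { queue.CompleteAdding(); consumer.Wait(); }
        }
        return results;
    }
}
```

A call such as `ParallelLogParser.ParseFile(inputFile, Environment.ProcessorCount)` would return one value per line. Because several workers feed the same queue, the values do not come back in file order; if order matters, each result would need to carry its chunk or line index. Whether this actually beats a single-threaded loop depends on the file and the machine, so it is worth measuring.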

See the Pipelines pattern on MSDN.

Note that this solution requires .NET Framework 4 or later.

(Editor: 李大同)