Extracting complete hyphenated words from a .pdf using acrobat.tlb in .NET VB or C#
Published: 2020-12-17 00:23:08 | Category: Big Data | Source: compiled from the web
I'm parsing a .pdf using the acrobat.tlb library.

Hyphenated strings are being split into separate words, with the hyphens removed. For example, ABC-123-XXX-987 parses as:

    ABC
    123
    XXX
    987

If I parse the text using iTextSharp, it returns the whole string as displayed in the file, which is the behaviour I want. However, I need to highlight these strings (serial numbers) in the .pdf, and iTextSharp isn't placing the highlight in the correct location... hence acrobat.tlb.

I'm using this code, from here: http://www.vbforums.com/showthread.php?561501-RESOLVED-2003-How-to-highlight-text-in-pdf

    ' Late-bound COM calls: compile with Option Strict Off.
    Dim filey As String = "*your full file name including directory here*"
    Dim AcroExchApp As Object = CreateObject("AcroExch.App")
    Dim AcroExchAVDoc As Object = CreateObject("AcroExch.AVDoc")

    ' Open the [strfiley] pdf file
    AcroExchAVDoc.Open(filey, "")

    ' Get the PDDoc associated with the open AVDoc
    Dim AcroExchPDDoc As Object = AcroExchAVDoc.GetPDDoc

    Dim sustext As String = "accessorizes"
    Dim suktext As String = "accessorises"

    ' Get the JavaScript object
    ' Note: jso is related to the PDDoc of a PDF
    Dim jso As Object = AcroExchPDDoc.GetJSObject

    ' Counters and flags
    Dim nCount As Integer = 0
    Dim nCount1 As Integer = 0
    Dim iUSCnt As Integer = 0
    Dim iUKCnt As Integer = 0
    Dim gbStop As Boolean = False
    Dim bUSCnt As Boolean = False
    Dim bUKCnt As Boolean = False

    ' Search for the text
    If Not jso Is Nothing Then
        ' Total number of pages
        Dim nPages As Integer = jso.numpages

        ' Go through the pages
        For i = 0 To nPages - 1
            ' Check each word in a page
            Dim nWords As Integer = jso.getPageNumWords(i)
            For j = 0 To nWords - 1
                ' Get a word
                Dim word As String = Trim(CStr(jso.getPageNthWord(i, j)))
                If word <> "" Then
                    ' Compare the word with what the user wants
                    If Trim(sustext) <> "" Then
                        ' StrComp returns 0 when the strings are the same
                        If StrComp(word, sustext, vbTextCompare) = 0 Then
                            nCount = nCount + 1
                            If bUSCnt = False Then
                                iUSCnt = iUSCnt + 1
                                bUSCnt = True
                            End If
                        End If
                    End If
                    If suktext <> "" Then
                        If StrComp(word, suktext, vbTextCompare) = 0 Then
                            nCount1 = nCount1 + 1
                            If bUKCnt = False Then
                                iUKCnt = iUKCnt + 1
                                bUKCnt = True
                            End If
                        End If
                    End If
                End If
            Next j
        Next i
        jso = Nothing
    End If

The code does the job of highlighting text, but the For loop that fills the 'word' variable splits hyphenated strings into their component parts:

    For i = 0 To nPages - 1
        ' Check each word in a page
        nWords = jso.getPageNumWords(i)
        For j = 0 To nWords - 1
            ' Get a word
            word = Trim(CStr(jso.getPageNthWord(i, j)))

Does anyone know how to keep the whole string intact using acrobat.tlb? My fairly extensive searching has drawn a blank.
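One thing that may be worth trying with the loop above: the Acrobat JavaScript reference documents an optional third argument to getPageNthWord, bStrip, which defaults to true and strips punctuation and whitespace from the returned word. Passing False should leave the trailing hyphen on each fragment, so the pieces can be stitched back together. The sketch below shows only the stitching step. JoinHyphenated is a hypothetical helper name, and the sample tokens are an assumption about what Acrobat returns, so check them against your own file.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text

Module HyphenJoin
    ' Hypothetical helper: merges consecutive raw tokens into one string
    ' whenever a token ends with a hyphen.
    ' Assumes tokens were read with getPageNthWord(page, index, False),
    ' so trailing punctuation is still attached (e.g. "ABC-").
    Function JoinHyphenated(rawWords As IEnumerable(Of String)) As List(Of String)
        Dim joined As New List(Of String)
        Dim current As New StringBuilder()

        For Each raw As String In rawWords
            If raw Is Nothing Then Continue For
            Dim token As String = raw.Trim()
            If token = "" Then Continue For

            current.Append(token)

            ' A trailing hyphen means the next token belongs to the same string.
            If Not token.EndsWith("-", StringComparison.Ordinal) Then
                joined.Add(current.ToString())
                current.Clear()
            End If
        Next

        ' Flush a dangling fragment if the input ended on a hyphen.
        If current.Length > 0 Then joined.Add(current.ToString())
        Return joined
    End Function
End Module
```

In the loop, that would mean collecting CStr(jso.getPageNthWord(i, j, False)) for every j on a page and passing the list to JoinHyphenated. If you also need to highlight the result, you would have to keep track of which word indices went into each joined string, which this sketch does not do.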
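I can understand that highlighting text with iTextSharp is a hassle, because you have to draw a rectangle and it gets complicated, but the acrobat.tlb solution has its drawbacks too: it isn't free, and few people will use it. A better solution for the rest of us is Spire.Pdf, which is free and easy to use. You can get it as a NuGet package. The code does the following: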
Code:

    Imports System.Drawing
    Imports System.Linq
    Imports System.Text.RegularExpressions
    Imports Spire.Pdf
    Imports Spire.Pdf.General.Find

    Dim pdf As PdfDocument = New PdfDocument("Path")
    ' Three letters or digits, a hyphen, three letters or digits.
    ' (No comma inside the brackets: it would match a literal comma.)
    Dim pattern As String = "([A-Z0-9]{3}[-][A-Z0-9]{3})"
    Dim matches As MatchCollection
    Dim result As PdfTextFind() = Nothing
    Dim matchList As New List(Of String)

    For Each page As PdfPageBase In pdf.Pages
        ' Get the text of the current page only, so earlier pages
        ' are not searched again on every iteration
        Dim content As String = page.ExtractText()

        ' Find matches
        matches = Regex.Matches(content, pattern, RegexOptions.None)
        matchList.Clear()

        ' Assign each match to a string list
        For Each match As Match In matches
            matchList.Add(match.Value)
        Next

        ' Eliminate duplicates
        matchList = matchList.Distinct.ToList

        ' For each string in the list
        For i = 0 To matchList.Count - 1
            ' Find all occurrences of matchList(i) in the page and highlight them
            result = page.FindText(matchList(i)).Finds
            For Each find As PdfTextFind In result
                find.ApplyHighLight(Color.BlueViolet) ' you can set your color preference
            Next
        Next 'matchList
    Next 'page

    pdf.SaveToFile("New Path")
    pdf.Close()
    pdf.Dispose()

I'm not very good with regular expressions, so you may want to substitute your own pattern. Anyway, that's my approach.

(Editor: 李大同)
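Since the answer invites a different regular expression, here is a rough sketch of one that matches a whole serial number such as ABC-123-XXX-987 rather than a single pair of groups. It assumes every serial is made of three-character groups of uppercase letters or digits joined by hyphens, which is only a guess based on the example in the question. FindSerials is a hypothetical helper name, and it uses only System.Text.RegularExpressions, so it is independent of the PDF library.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module SerialPattern
    ' Assumed format: two or more groups of three uppercase letters or
    ' digits, joined by single hyphens, delimited by word boundaries.
    Private ReadOnly SerialRegex As New Regex("\b[A-Z0-9]{3}(?:-[A-Z0-9]{3})+\b")

    ' Hypothetical helper: returns each distinct serial found in the text,
    ' in order of first appearance.
    Function FindSerials(pageText As String) As List(Of String)
        Dim found As New List(Of String)
        For Each m As Match In SerialRegex.Matches(pageText)
            If Not found.Contains(m.Value) Then found.Add(m.Value)
        Next
        Return found
    End Function
End Module
```

The strings returned for a page's extracted text could then be fed to the highlighting loop in place of matchList. If your serial numbers use a different group length or allow lowercase characters, the pattern would need adjusting.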