使用.NET VB或C#中的acrobat.tlb从.pdf中提取完整的带连字符的单

发布时间：2020-12-17 00:23:08 所属栏目：大数据来源：网络整理

导读：我正在使用acrobat.tlb库解析.pdf 在连续删除连字符的新行中,连字符被分开. 例如 ABC-123-XXX-987 解析为： ABC 123 XXX 987 如果我使用iTextSharp解析文本,它会解析文件中显示的整个字符串,这是我想要的行为.但是,我需要在.pdf和iTextSharp中突出显示这些字

我正在使用acrobat.tlb库解析.pdf

在连续删除连字符的新行中,连字符被分开.

例如
ABC-123-XXX-987

解析为：
ABC
123
XXX
987

如果我使用iTextSharp解析文本,它会解析文件中显示的整个字符串,这是我想要的行为.但是,我需要在.pdf和iTextSharp中突出显示这些字符串(序列号),而不是将突出显示放在正确的位置…因此acrobat.tlb

我正在使用此代码,从这里：http://www.vbforums.com/showthread.php?561501-RESOLVED-2003-How-to-highlight-text-in-pdf

' filey = "*your full file name including directory here*"
        AcroExchApp = CreateObject("AcroExch.App")
        AcroExchAVDoc = CreateObject("AcroExch.AVDoc")
        ' Open the [strfiley] pdf file
        AcroExchAVDoc.Open(filey,"")       

        ' Get the PDDoc associated with the open AVDoc
        AcroExchPDDoc = AcroExchAVDoc.GetPDDoc
        sustext = "accessorizes"
        suktext = "accessorises" 
        ' get JavaScript Object
        ' note jso is related to PDDoc of a PDF,jso = AcroExchPDDoc.GetJSObject
        ' count
        nCount = 0
        nCount1 = 0
        gbStop = False
        bUSCnt = False
        bUKCnt = False
        ' search for the text
        If Not jso Is Nothing Then
            ' total number of pages
            nPages = jso.numpages           

                ' Go through pages
                For i = 0 To nPages - 1
                    ' check each word in a page
                    nWords = jso.getPageNumWords(i)
                    For j = 0 To nWords - 1
                        ' get a word

                        word = Trim(CStr(jso.getPageNthWord(i,j)))
                        'If VarType(word) = VariantType.String Then
                        If word <> "" Then
                            ' compare the word with what the user wants
                            If Trim(sustext) <> "" Then
                                result = StrComp(word,sustext,vbTextCompare)
                                ' if same
                                If result = 0 Then
                                    nCount = nCount + 1
                                    If bUSCnt = False Then
                                        iUSCnt = iUSCnt + 1
                                        bUSCnt = True
                                    End If
                                End If
                            End If
                            If suktext<> "" Then
                                result1 = StrComp(word,suktext,vbTextCompare)
                                ' if same
                                If result1 = 0 Then
                                    nCount1 = nCount1 + 1
                                    If bUKCnt = False Then
                                        iUKCnt = iUKCnt + 1
                                        bUKCnt = True
                                    End If
                                End If
                            End If
                        End If
                    Next j
                Next i
jso = Nothing
        End If

代码执行突出显示文本的工作,但带有’word’变量的FOR循环将带连字符的字符串拆分为组件部分.

For i = 0 To nPages - 1
                        ' check each word in a page
                        nWords = jso.getPageNumWords(i)
                        For j = 0 To nWords - 1
                            ' get a word

                            word = Trim(CStr(jso.getPageNthWord(i,j)))

有谁知道如何使用acrobat.tlb维护整个字符串？我的相当广泛的搜索空白.

我可以理解iTextSharp在突出显示文本时很麻烦,因为你必须绘制一个矩形并变得复杂,但acrobat.tlb的解决方案也有它的缺点.它不是免费的,很少有人会使用它.对我们其他人来说更好的解决方案是免费且易于使用的Spire.Pdf.你可以从NuGet包中获得它.代码执行以下操作：

Opens .pdf

Read each text page

using regular expression find matches

save them to a list of strings eliminating duplicates

for each string in this list search page and highlight the word

码：

Dim pdf As PdfDocument = New PdfDocument("Path")
Dim pattern As String = "([A-Z,0-9]{3}[-][A-Z,0-9]{3})"
Dim matches As MatchCollection

Dim result As PdfTextFind() = Nothing
Dim content As New StringBuilder()
Dim matchList As New List(Of String)

For Each page As PdfPageBase In pdf.Pages
    'get text from current page
    content.Append(page.ExtractText())

    'find matches
    matches = Regex.Matches(content.ToString,pattern,RegexOptions.None)

    matchList.Clear()

    'Assign each match to a string list.
    For Each match As Match In matches
        matchList.Add(match.Value)
    Next

    'Eliminate duplicates.
    matchList = matchList.Distinct.ToList

    'for each string in list
    For i = 0 To matchList.Count - 1
        'find all occurances of matchList(i) string in page and highlight it
        result = page.FindText(matchList(i)).Finds

        For Each find As PdfTextFind In result
            find.ApplyHighLight(Color.BlueViolet) 'you can set your color preference
        Next

    Next 'matchList

Next 'page

pdf.SaveToFile("New Path")

pdf.Close()
pdf.Dispose()

我在正则表达方面不太好,所以你可以实现你的.无论如何,那是我的方法.

（编辑：李大同）

【声明】本站内容均来自网络，其相关言论仅代表作者个人观点，不代表本站立场。若无意侵犯到您的权利，请及时与联系站长删除相关内容!