python – PDFMiner – Iterating through pages and converting them to text

Published: 2020-12-20 13:39:39 | Category: Python | Source: compiled from the web


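So I'm trying to get some specific text out of some PDFs, and I'm using Python and PDFMiner, but I'm having some trouble because of the API changes that happened in November 2013. Basically, to get the part of the text I want out of the PDF, I currently have to convert the entire file to text and then use string functions to pull out the part I want. What I want to do is loop through each page of the PDF and convert each page to text individually. Then, once I've found the part I want, I would stop reading the PDF.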

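I'll post the code that's sitting in my text editor at the moment, but it's not a working version; it's more of a halfway-to-an-efficient-solution version :P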

#!/usr/bin/env python
# -*- coding: utf-8 -*- 

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice
from pdfminer.converter import LTChar,TextConverter
from pdfminer.layout import LAParams
from subprocess import call
from cStringIO import StringIO
import re
import sys
import os

argNum = len(sys.argv)
pdfLoc = str(sys.argv[1]) #CLI arguments

def convert_pdf_to_txt(path): #converts pdf to raw text (not my function)
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr,retstr,codec=codec,laparams=laparams)
    fp = file(path,'rb')
    interpreter = PDFPageInterpreter(rsrcmgr,device)
    password = ""
    maxpages = 0
    caching = True
    pagenos=set()
    for page in PDFPage.get_pages(fp,pagenos,maxpages=maxpages,password=password,caching=caching,check_extractable=True):
        interpreter.process_page(page)

    fp.close()
    device.close()
    str = retstr.getvalue()
    retstr.close()
    return str

if (pdfLoc[-4:] == ".pdf"):
    contents = ""
    try: # Get the outlines (contents) of the document
        fp = open(pdfLoc,'rb') #open a pdf document for reading
        parser = PDFParser(fp)
        document = PDFDocument(parser)
        outlines = document.get_outlines()
        for (level,title,dest,a,se) in outlines:
            title = re.sub(r".*\s","",title) #get raw titles,stripped of formatting
            contents += title + "\n"
    except: #if pdfMiner can't get contents then manually get contents from text conversion
        #contents = convert_pdf_to_txt(pdfLoc)
        #startToCpos = contents.find("TABLE OF CONTENTS")
        #endToCpos = contents.rfind(". . .")
        #contents = contents[startToCpos:endToCpos+8]

        fp = open(pdfLoc,'rb') #open a pdf document for reading
        parser = PDFParser(fp)
        document = PDFDocument(parser)
        pages = PDFPage(document,3,{'Resources':'thing','MediaBox':'Thing'}) #God knows what's going on here
        for pageNumber,page in enumerate(pages.get_pages(PDFDocument,fp)): #The hell is the first argument?
            if pageNumber == 42:
                print "Hello"

        #for line in s:
        #   print line
        #   if (re.search("(\.\s){2,}",line) and not re.search("NOTES|SCOPE",line)):
        #       line = re.sub("(\.\s){2,}","",line)
        #       line = re.sub("(\s?)*[0-9]*\n","\n",line)
        #       line = re.sub("^\s","",line)
        #       print line,
        #contents = contents.lower()
        #contents = re.sub("“","\"",contents)
        #contents = re.sub("”","\"",contents)
        #contents = re.sub("?","f",contents)
        #contents = re.sub(r"(TABLE OF CONTENTS|LIST OF TABLES|SCOPE|REFERENCED DOCUMENTS|Identification|System (o|O)verview|Document (o|O)verview|Title|Page|Table|Tab)(\n)?|\.\s?|Section|[0-9]","",contents)
        #contents = re.sub(r"This document contains proprietary information and may not be reproduced in any form whatsoever,nor may be used by or its contents divulged to third\nparties without written permission from the ownerAll rights reservedNumber:  STP SMEDate: -Jul-Issue: A  of CMC STPNHIndustriesCLASSIFICATION\nNATO UNCLASSIFIED                  AGUSTAEUROCOPTEREUROCOPTER DEUTSCHLAND                 FOKKER","",contents)
        #contents = re.sub(r"(\r?\n){2,}","",contents)
        #contents = contents.lstrip()
        #contents = contents.rstrip()
    #print contents
else:
    print "Not a valid PDF file"

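This is the old way of doing it (or at least it gives an idea of how the old way was done; that thread wasn't really that useful for me, to be honest). However, now I have to use PDFPage.get_pages rather than PDFDocument.get_pages, and the methods and their parameters are completely different.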

Currently, I'm trying to work out what the 'klass' variable is that I'm supposed to pass to PDFPage's get_pages method.

If anyone could shed some light on this part of the API, or even provide a working example, I'd really appreciate it.
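On the 'klass' question: as far as I can tell from the PDFMiner source, get_pages is declared as a classmethod, and klass is simply the name its author gave to the implicit first parameter that is more commonly spelled cls. If that reading is right, Python fills it in when you call PDFPage.get_pages(fp, ...) on the class, and you never pass it yourself. A toy illustration of the mechanism follows; the Page class and its count argument are made up for this sketch and are not PDFMiner's.

```python
class Page(object):
    """Stand-in class for illustration only; not PDFMiner's PDFPage."""

    def __init__(self, number):
        self.number = number

    @classmethod
    def get_pages(klass, count):
        # `klass` is bound automatically to the class the method is
        # called on (here: Page). The caller only supplies `count`.
        for n in range(count):
            yield klass(n)


# Called on the class itself, with no instance and no explicit `klass`.
pages = list(Page.get_pages(3))
numbers = [p.number for p in pages]
```

So in the snippet above, building a PDFPage by hand and then calling get_pages on that instance should not be necessary; calling the method directly on the PDFPage class with the open file object is the intended shape.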

Solution

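Try using PyPDF2. It's much simpler to use and isn't as unnecessarily feature-rich as PDFMiner (which is a good thing in your case). Here is what you asked for, and it's very simple to implement.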

from PyPDF2 import PdfFileReader  # PyPDF2 1.x API names

pdf_fp = 'document.pdf'  # placeholder: path to your PDF
PDF = PdfFileReader(open(pdf_fp,'rb'))

if PDF.isEncrypted:
    decrypt = PDF.decrypt('')  # try the empty password
    if decrypt == 0:
        print("Password Protected PDF: " + pdf_fp)
        raise Exception("Nope")
    elif decrypt == 1 or decrypt == 2:
        print("Successfully Decrypted PDF")

for page in PDF.pages:
    # page.extractText() is the unicode string of the contents of the page.
    # I am assuming you know how to play with a string and use regex.
    text = page.extractText()
    print(text)
    # If you find what you want, just break, like so:
    # if some_condition:
    #     break
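If you would rather stay with PDFMiner, the same page-at-a-time idea can be sketched roughly as below. This is only a sketch under some assumptions: Python 3 with the pdfminer.six fork installed, and default layout parameters. The helper names iter_page_texts and find_first_page are made up here. The point is that PDFPage.get_pages is called on the class with the open file, each page is rendered into its own buffer, and the search stops pulling pages as soon as a match is found.

```python
from io import StringIO


def iter_page_texts(path):
    """Yield (page_index, text) for one page at a time (hypothetical helper)."""
    # Imported lazily so this module still loads where pdfminer is absent.
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    rsrcmgr = PDFResourceManager()
    with open(path, "rb") as fp:
        # get_pages is called on the PDFPage class itself; no `klass` argument.
        for index, page in enumerate(PDFPage.get_pages(fp)):
            buf = StringIO()  # fresh buffer, so each page's text stays separate
            device = TextConverter(rsrcmgr, buf, laparams=LAParams())
            PDFPageInterpreter(rsrcmgr, device).process_page(page)
            device.close()
            yield index, buf.getvalue()


def find_first_page(page_texts, needle):
    """Return (page_index, text) of the first page containing needle, else None."""
    for index, text in page_texts:
        if needle in text:
            return index, text  # later pages are never converted
    return None
```

Something like find_first_page(iter_page_texts(pdfLoc), "TABLE OF CONTENTS") would then convert pages only up to the first hit. Because iter_page_texts is a generator, the remaining pages are simply never processed.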

