加入收藏 | 设为首页 | 会员中心 | 我要投稿 李大同 (https://www.lidatong.com.cn/)- 科技、建站、经验、云计算、5G、大数据,站长网!
当前位置: 首页 > 编程开发 > Python > 正文

如何从PDF文件中提取文本和文本坐标?

发布时间:2020-12-20 10:32:38 所属栏目:Python 来源:网络整理
导读:我想用PDFMiner从PDF文件中提取所有文本框和文本框坐标. 许多其他Stack Overflow帖子解决了如何以有序方式提取所有文本,但是如何进行获取文本和文本位置的中间步骤? 给定一个PDF文件,输出应该类似于: 489,41,"Signature" 500,52,"b" 630,202,"a_g_i_r" 解
我想用PDFMiner从PDF文件中提取所有文本框和文本框坐标.

许多其他Stack Overflow帖子解决了如何以有序方式提取所有文本,但是如何进行获取文本和文本位置的中间步骤?

给定一个PDF文件,输出应该类似于:

489,41,"Signature"
   500,52,"b"
   630,202,"a_g_i_r"

解决方法

新行在最终输出中转换为下划线.这是我发现的最小工作解决方案.

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfpage import PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice
from pdfminer.layout import LAParams
from pdfminer.converter import PDFPageAggregator
import pdfminer

# Open a PDF file.
fp = open('/Users/me/Downloads/test.pdf','rb')

# Create a PDF parser object associated with the file object.
parser = PDFParser(fp)

# Create a PDF document object that stores the document structure.
# Password for initialization as 2nd parameter
document = PDFDocument(parser)

# Check if the document allows text extraction. If not,abort.
if not document.is_extractable:
    raise PDFTextExtractionNotAllowed

# Create a PDF resource manager object that stores shared resources.
rsrcmgr = PDFResourceManager()

# Create a PDF device object.
device = PDFDevice(rsrcmgr)

# BEGIN LAYOUT ANALYSIS
# Set parameters for analysis.
laparams = LAParams()

# Create a PDF page aggregator object.
device = PDFPageAggregator(rsrcmgr,laparams=laparams)

# Create a PDF interpreter object.
interpreter = PDFPageInterpreter(rsrcmgr,device)

def parse_obj(lt_objs):

    # loop over the object list
    for obj in lt_objs:

        # if it's a textbox,print text and location
        if isinstance(obj,pdfminer.layout.LTTextBoxHorizontal):
            print "%6d,%6d,%s" % (obj.bbox[0],obj.bbox[1],obj.get_text().replace('n','_'))

        # if it's a container,recurse
        elif isinstance(obj,pdfminer.layout.LTFigure):
            parse_obj(obj._objs)

# loop over all pages in the document
for page in PDFPage.create_pages(document):

    # read the page into a layout object
    interpreter.process_page(page)
    layout = device.get_result()

    # extract text from this object
    parse_obj(layout._objs)

(编辑:李大同)

【声明】本站内容均来自网络,其相关言论仅代表作者个人观点,不代表本站立场。若无意侵犯到您的权利,请及时与联系站长删除相关内容!

    推荐文章
      热点阅读