python – Extracting specific data from a .pdf and saving it in an Excel file
Published: 2020-12-20 11:47:13 | Category: Python | Source: compiled from the web
Every month I need to extract some data from .pdf files to create an Excel spreadsheet.
I can convert the .pdf file to text, but I'm not sure how to extract and save the specific information I want. This is the code I have so far:

    # Python 2 code (cStringIO, file() and the print statement)
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfpage import PDFPage
    from cStringIO import StringIO

    def convert_pdf_to_txt(path):
        rsrcmgr = PDFResourceManager()
        retstr = StringIO()
        codec = 'utf-8'
        laparams = LAParams()
        device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
        fp = file(path, 'rb')
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        password = ""
        maxpages = 0
        caching = True
        pagenos = set()
        fstr = ''
        for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages,
                                      password=password, caching=caching,
                                      check_extractable=True):
            interpreter.process_page(page)

        # retstr accumulates the text of every page, so read it once after the loop
        str = retstr.getvalue()
        fstr += str

        fp.close()
        device.close()
        retstr.close()
        return fstr

    print convert_pdf_to_txt("FA20150518.pdf")

And this is the result:

    >>>
    AVILA 72, VALLDOREIX
    08197 SANT CUGAT DEL VALLES
    (BARCELONA)
    TELF: 935441851
    NIF: B65512725
    EMAIL: buendialogistica@gmail.com
    JOSE LUIS MARTINEZ LOPEZ
    AVDA. DEL ESLA, 33-D
    24240 SANTA MARIA DEL PARAMO
    LEON
    TELF: 600871170
    FECHA
    17/06/15
    FACTURA
      20150518
    CLIENTE
    43000335
    N.I.F.
    71548163 B
    PáG.
    1
    No VIAJE
    RUTA
    DESTINATARIO / REFERENCIA
    KG
    BULTOS
    IMPORTE
    2015064210-08/06/15
    CERDANYOLA DEL VALLES -> VINAROS
    FERRER ALIMENTACION - VINAROZ
    2,000.0
    1
             150,00
    TOTAL IMP.
    %
    IMPORTE
    BASE
             150,00
             150,00
    %
     21,00
    IVA
    %
    REC.
    TOTAL FRA.
    (
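One possible way to pull specific values out of text like this is to search for each label and capture what follows it. The block below is a minimal sketch under stated assumptions, not a definitive solution. It is written for Python 3, unlike the Python 2 code in the question. The column names, the `PATTERNS` table, and the helpers `extract_fields` and `write_rows` are all hypothetical. It writes CSV, which Excel can open, using only the standard library, and it assumes every invoice uses the same labels (FECHA, FACTURA, CLIENTE, N.I.F.) as the sample.

```python
import csv
import re
import sys

# Sample text shaped like the pdfminer output in the question. Whitespace
# between tokens may be spaces or newlines; the patterns accept either.
SAMPLE_TEXT = (
    "TELF: 600871170 FECHA 17/06/15 FACTURA   20150518 "
    "CLIENTE 43000335 N.I.F. 71548163 B PáG. 1"
)

# Hypothetical column names mapped to patterns. Each pattern captures the
# value that follows one label in the extracted text.
PATTERNS = {
    "date": re.compile(r"FECHA\s+(\d{2}/\d{2}/\d{2})"),
    "invoice": re.compile(r"FACTURA\s+(\d+)"),
    "customer": re.compile(r"CLIENTE\s+(\d+)"),
    "tax_id": re.compile(r"N\.I\.F\.\s+(\d+\s?[A-Z])"),
}


def extract_fields(text):
    """Return a dict of column name -> captured value (None if not found)."""
    fields = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        fields[name] = match.group(1) if match else None
    return fields


def write_rows(rows, fileobj):
    """Write a header row, then one CSV row per invoice dict."""
    writer = csv.DictWriter(fileobj, fieldnames=list(PATTERNS))
    writer.writeheader()
    writer.writerows(rows)


if __name__ == "__main__":
    # Demo: print the CSV for the sample text to standard output.
    write_rows([extract_fields(SAMPLE_TEXT)], sys.stdout)
```

To save a file instead of printing, pass `write_rows` a file opened with `open("invoices.csv", "w", newline="", encoding="utf-8-sig")`; the file name is only an example, and the `utf-8-sig` encoding adds a byte order mark that helps Excel detect UTF-8. If a native .xlsx file is required, a third-party library such as openpyxl could write the same rows.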