Hello,
I wanted to extract data from a PDF.
My attempts:
import os
from PyPDF2 import PdfReader
directory = r'E:\CODE\pdf_attachments'  # Path to the directory with the PDFs
for filename in os.listdir(directory):
    if filename.endswith('.pdf'):
        txt_path = os.path.join(directory, filename[:-4] + '.txt')
        # Write the text of every page into a .txt file next to the PDF
        with open(os.path.join(directory, filename), 'rb') as pdf_file, \
                open(txt_path, 'w', encoding='utf-8') as txt_file:
            pdf_reader = PdfReader(pdf_file)
            for page in pdf_reader.pages:
                txt_file.write(page.extract_text())
The extracted data is displayed poorly and the layout falls apart.
With pandas it is better, but it still does not look good:
import os
import pandas as pd
from PyPDF2 import PdfReader
directory = r'E:\CODE\pdf_attachments'  # Path to the directory with the PDFs
for filename in os.listdir(directory):
    if filename.endswith('.pdf'):
        txt_path = os.path.join(directory, filename[:-4] + '.txt')
        # Write the text of every page into a .txt file next to the PDF
        with open(os.path.join(directory, filename), 'rb') as pdf_file, \
                open(txt_path, 'w', encoding='utf-8') as txt_file:
            pdf_reader = PdfReader(pdf_file)
            for page in pdf_reader.pages:
                txt_file.write(page.extract_text())
        # Read the text back, one DataFrame row per line
        with open(txt_path, 'r', encoding='utf-8') as txt_file:
            lines = txt_file.readlines()
        df = pd.DataFrame(lines)
        # Note: printing a Styler shows only the object's repr; the styles apply to HTML output
        print(df.style.set_properties(**{'white-space': 'pre-wrap'}))
        print(df)
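Since each text line ends up in a single DataFrame cell above, one thing that might help is splitting every line into columns first. The following is only a minimal sketch: it assumes the extracted text separates columns with runs of two or more spaces, which may not hold for these PDFs. The names `split_columns` and `text_to_rows` and the sample text are made up for illustration.

```python
import re

# Assumption: columns in the extracted text are separated by 2+ whitespace characters.
COLUMN_GAP = re.compile(r"\s{2,}")

def split_columns(line):
    """Split one line of extracted text into cells on wide whitespace gaps."""
    return [cell for cell in COLUMN_GAP.split(line.strip()) if cell]

def text_to_rows(text):
    """Turn extracted text into a list of rows, skipping blank lines."""
    return [split_columns(line) for line in text.splitlines() if line.strip()]

# Hypothetical sample standing in for the output of extract_text()
sample = "Name      Qty   Price\nWidget A    2    9.99\n"
rows = text_to_rows(sample)
```

The resulting `rows` could then be passed to `pd.DataFrame(rows[1:], columns=rows[0])`, provided every row has the same number of cells.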
I found a library called camelot:
import camelot
tables = camelot.read_pdf('E:\CODE\pdf_attachments\example.pdf')
tables
tables.export('foo.csv', f='csv', compress=True) # json, excel, html, markdown, sqlite
tables[0]
tables[0].parsing_report
{
    'accuracy': 99.02,
    'whitespace': 12.24,
    'order': 1,
    'page': 1
}
tables[0].to_csv('foo.csv') # to_json, to_excel, to_html, to_markdown, to_sqlite
tables[0].df # get a pandas DataFrame!
print(tables)
And I get:
Traceback (most recent call last):
File "e:\CODE\pdf-txt-v6.py", line 3, in <module>
tables = camelot.read_pdf('E:\CODE\pdf_attachments\example.pdf')
^^^^^^^^^^^^^^^^
AttributeError: module 'camelot' has no attribute 'read_pdf'
Please help.
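A quick check that may narrow the error down: on PyPI the PDF table library is published as `camelot-py`, while the name `camelot` belongs to an unrelated project, and a local file called `camelot.py` would also shadow the library. The sketch below only reports which file Python actually imports for a module name and whether it exposes a given attribute. The helper name `inspect_module` is hypothetical, and whether either cause applies here is an assumption.

```python
import importlib
import importlib.util

def inspect_module(name, attribute):
    """Report where a top-level module is loaded from and whether it has an attribute."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        # Nothing importable under this name
        return {"found": False, "origin": None, "has_attribute": False}
    module = importlib.import_module(name)
    return {
        "found": True,
        "origin": spec.origin,  # path of the file that gets imported
        "has_attribute": hasattr(module, attribute),
    }
```

Running `print(inspect_module("camelot", "read_pdf"))` in the same environment as the script would show the imported file's path. If `has_attribute` comes back `False`, uninstalling `camelot` and installing `camelot-py` is worth trying.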