I am trying to extract text from a PDF file that I regularly have to deal with at work, so that I can automate the task. PyPDF2 works for my CV, for instance, but not for my work document. The problem is that the extracted text comes out like this: "Helloworldthisisthetext". I then tried to use .join(" "), but this does not work. I read that this is a known problem with PyPDF2, and that it seems to depend on the way the PDF was built. Does anyone know another approach to extract the text so that I can use it for further steps? Thank you in advance.
asked Dec 2, 2019 at 20:08

Hi and welcome to SO! Please consider adding some more details to help others help you. How is the PDF generated? Do you get any error messages? How is your own CV generated? Et cetera.
Commented Dec 2, 2019 at 20:33

I suggest trying another tool: pdfreader. You can extract both plain strings and "PDF markdown" (decoded text strings plus operators). "PDF markdown" can be parsed like regular text (with regular expressions, for example).
Below is a code sample that walks through the pages and extracts the PDF content for further parsing.
```python
from pdfreader import SimplePDFViewer, PageDoesNotExist


def my_text_parser(text):
    """Code your parser here."""


plain_text = ""

# your_pdf_file_name is a placeholder for the path to your PDF
with open(your_pdf_file_name, "rb") as fd:
    viewer = SimplePDFViewer(fd)
    try:
        while True:
            viewer.render()
            # Decoded strings together with the PDF text operators
            pdf_markdown = viewer.canvas.text_content
            result = my_text_parser(pdf_markdown)
            # The one below will probably be the same as what PyPDF2 returns
            plain_text += "".join(viewer.canvas.strings)
            viewer.next()
    except PageDoesNotExist:
        # Raised once we step past the last page
        pass
```
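The pdf_markdown variable contains all the text, including the PDF commands (positioning, display): every string comes in parentheses, followed by a Tj or TJ operator (for TJ, the strings sit inside a square-bracket array together with spacing adjustments). For more on PDF text operators, see PDF 1.7 sec. 9.4 Text Objects.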
You can parse it with regular expressions for example.
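As a rough illustration of that regex approach, here is a minimal sketch. It assumes the text content looks like the hand-written SAMPLE below, which is hypothetical; real output depends on how the PDF was produced. The function name extract_text and the gap_threshold value are my own assumptions, not part of pdfreader. The idea is that a large negative number inside a TJ array moves the next string to the right, so it often stands in for a missing space.

```python
import re

# Hypothetical sample of a page's text content; real content will differ.
SAMPLE = "BT /F1 12 Tf 72 720 Td [(Hello) -300 (wor) -20 (ld)] TJ (!) Tj ET"

# A literal string in parentheses; backslash escapes are allowed inside,
# but unescaped nested parentheses are not handled by this sketch.
STRING = r"\((?:\\.|[^\\()])*\)"
NUMBER = r"-?\d*\.?\d+"

# Either "(string) Tj" or "[(string) number (string) ...] TJ".
TOKEN = re.compile(
    rf"(?P<tj>{STRING})\s*Tj|\[(?P<arr>(?:\s*(?:{STRING}|{NUMBER}))*\s*)\]\s*TJ"
)
ITEM = re.compile(rf"(?P<s>{STRING})|(?P<n>{NUMBER})")


def unescape(literal):
    """Strip the outer parentheses and resolve \\( \\) and \\\\ escapes only."""
    return re.sub(r"\\([()\\])", r"\1", literal[1:-1])


def extract_text(content, gap_threshold=200):
    """Join the strings shown by Tj/TJ, guessing spaces from TJ adjustments.

    gap_threshold is an assumed heuristic (thousandths of a text-space unit);
    tune it for your documents.
    """
    parts = []
    for match in TOKEN.finditer(content):
        if match.group("tj") is not None:
            parts.append(unescape(match.group("tj")))
            continue
        for item in ITEM.finditer(match.group("arr")):
            if item.group("s") is not None:
                parts.append(unescape(item.group("s")))
            elif float(item.group("n")) <= -gap_threshold:
                # A large negative adjustment widens the gap: treat it as a space.
                parts.append(" ")
    return "".join(parts)


print(extract_text(SAMPLE))  # Hello world!
```

In the loop above you could pass viewer.canvas.text_content to such a function in place of my_text_parser. Words that are placed with separate positioning operators (Td, Tm and so on) rather than TJ adjustments would need additional handling that this sketch does not attempt.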