Extract elements from a PDF using Python¶

The high level functions can be used to achieve common tasks. In this case, we can use extract_pages :

from pdfminer.high_level import extract_pages for page_layout in extract_pages("test.pdf"): for element in page_layout: print(element) 

Each element will be an LTTextBox , LTFigure , LTLine , LTRect or an LTImage . Some of these can be iterated further, for example iterating though an LTTextBox will give you an LTTextLine , and these in turn can be iterated through to get an LTChar . See the diagram here: Layout analysis algorithm .

Let’s say we want to extract all of the text. We could do:

from pdfminer.high_level import extract_pages from pdfminer.layout import LTTextContainer for page_layout in extract_pages("test.pdf"): for element in page_layout: if isinstance(element, LTTextContainer): print(element.get_text()) 

Or, we could extract the fontname or size of each individual character:

from pdfminer.high_level import extract_pages from pdfminer.layout import LTTextContainer, LTChar for page_layout in extract_pages("test.pdf"): for element in page_layout: if isinstance(element, LTTextContainer): for text_line in element: for character in text_line: if isinstance(character, LTChar): print(character.fontname) print(character.size) 

pdfminer.six

Navigation