pdfbox-users mailing list archives

From Josh <torqu...@ufl.edu>
Subject Best way to perform text extraction
Date Thu, 08 Dec 2011 07:49:15 GMT
Hello,

I am trying to extract text from a PDF file using the PHP JavaBridge so 
that I can insert financial figures into a database.
After getting the raw text, I parse it with PHP regular expressions, 
which is really kludgy because the regexes rely on PDFBox extracting 
every PDF the same way.
I seem to have been lucky so far with PDFs that use the same font 
throughout the document: everything is extracted more or less in the 
order it appears on the page, left to right, so the regexes that capture 
the data values can key off the header, the "header" being "price", for 
instance, with the value being "$9.99".

The problem I'm having is with documents that use one font for the 
header and a different font for the value.  PDFBox first extracts the 
text in one font and then the text in the other, apparently for 
performance reasons.

Below is an example of what the extracted text looks like using the 
PDFTextStripper class ("\n" represents a Linux linefeed character):

( this is the top of the page )
PIECES: SUBTOTAL\n
DISCOUNT\n
SALES TAX\n
SHIPPING/HANDLING\n
TOTAL\n
\n
\n
( middle of the page with more text is here )
\n
( bottom of the page is below )
\n
                                                     1.0                 
          8.28\n
\n
\n

                                                                      
0.00\n
\n
\n
\n
                                                                         
    0.00\n
\n
\n
\n
                                                                         
     0.00\n
\n
\n
\n
                                                                         
          8.28\n

As you can see, using regex to extract those cost values is almost 
impossible as it's like grasping at straws.


To preserve the left-to-right order of the PDF, I passed a boolean true 
to the following method:

setSortByPosition
public void setSortByPosition(boolean newSortByPosition)

The order of the text tokens in a PDF file may not be in the same as 
they appear visually on the screen. For example, a PDF writer may write 
out all text by font, so all bold or larger text, then make a second 
pass and write out the normal text.
The default is to not sort by position.

A PDF writer could choose to write each character in a different order. 
By default PDFBox does not sort the text tokens before processing them 
due to performance reasons.

Parameters:
newSortByPosition - Tell PDFBox to sort the text positions.

--------------------------------------------------------------------------------

This was better; however, the issue now is that the headers themselves 
come out with extra whitespace between their characters.
 
Here is what the extracted text looked like after adding the line 
"$pdf_to_text->setSortByPosition(true);" in PHP JavaBridge:


PIECES: SUBTOTAL\n
                                                     1.0                 
          8.28\n
\n
\n
\n
DISCOUNT                         0.00\n
\n
S A L E S T A X            0.00\n
\n
\n

S H I PP I N G /H A N D LI N G         0.00\n

TOTAL\n
                                                                         
          8.28\n




As you can see, the order is now left to right, which is good, but the 
header characters are spread out, showing "S A L E S T A X" instead of 
"SALES TAX" and "S H I PP I N G /H A N D LI N G" instead of 
"SHIPPING/HANDLING".  This makes the regex unreliable: how can I 
guarantee that PDFBox will extract the text like this every time for a 
new PDF?
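
One possible stopgap for the spread-out headers, sketched in Java since 
that is the language PDFBox itself is written in: build each regex from 
the expected label so that any amount of whitespace is tolerated between 
its characters. The class name SpacedLabelParser, the LABELS list and 
the amount pattern are made up for this illustration and are not part of 
PDFBox. The sketch only handles lines where the label and the amount 
end up on the same line.

```java
import java.math.BigDecimal;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SpacedLabelParser {
    // Hypothetical label list, taken from the sample output above.
    static final List<String> LABELS =
        Arrays.asList("DISCOUNT", "SALES TAX", "SHIPPING/HANDLING");

    // Builds a pattern that ignores the whitespace in the label itself and
    // allows any run of whitespace between its remaining characters,
    // followed by a decimal amount at the end of the line.
    static Pattern tolerantPattern(String label) {
        StringBuilder sb = new StringBuilder("\\s*");
        for (char c : label.toCharArray()) {
            if (Character.isWhitespace(c)) {
                continue;
            }
            sb.append(Pattern.quote(String.valueOf(c))).append("\\s*");
        }
        sb.append("(-?\\d+\\.\\d+)\\s*");
        return Pattern.compile(sb.toString());
    }

    // Scans the extracted text line by line and records the amount for
    // every line that consists of a known label followed by a number.
    static Map<String, BigDecimal> parse(String text) {
        Map<String, BigDecimal> amounts = new LinkedHashMap<>();
        for (String line : text.split("\\R")) {
            for (String label : LABELS) {
                Matcher m = tolerantPattern(label).matcher(line);
                if (m.matches()) {
                    amounts.put(label, new BigDecimal(m.group(1)));
                }
            }
        }
        return amounts;
    }
}
```

With this approach "S A L E S T A X            0.00" and "SALES TAX 0.00" 
are treated the same way. It does not cover "PIECES: SUBTOTAL" or 
"TOTAL", whose values land on a following line in the sample above, and 
it is no guarantee against other layout changes between PDFs.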

Is there a way to force a single font for the entire PDF prior to text 
extraction, to see whether the header names then come out normally?

If not, are there better alternatives, such as PDF-to-HTML or 
PDF-to-XML conversion, that use PDFBox?

Thanks for the help!

Sincerely,

Josh
