pdfbox-users mailing list archives

From jason mazzotta <jazzd...@gmail.com>
Subject Reading fractions in text
Date Wed, 31 Jan 2018 01:51:28 GMT
Hello,
     I know next to nothing about the PDF document format.  I am using
PDFBox to extract the text from PDF files that contain recipes.  The
PDFs are created with a Fujitsu ScanSnap S1300i document scanner, and
the software that creates the PDF files is ABBYY FineReader.  The PDF
files themselves are readable, but the output is sometimes wrong when
I use the following code to extract text:


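 import java.io.IOException;
 import javax.swing.JOptionPane;
 import org.apache.pdfbox.pdmodel.PDDocument;
 import org.apache.pdfbox.pdmodel.encryption.InvalidPasswordException;
 import org.apache.pdfbox.text.PDFTextStripper;

 // "file" is a java.io.File pointing at the scanned PDF
 try (PDDocument document = PDDocument.load(file))
 {
     // Instantiate the PDFTextStripper class
     PDFTextStripper pdfStripper = new PDFTextStripper();

     // Retrieve the text from the PDF document
     String text = pdfStripper.getText(document);
     System.out.println(text);
 }
 catch (InvalidPasswordException ipe)
 {
     // The document is encrypted and no valid password was supplied
     JOptionPane.showMessageDialog(null, ipe.toString(),
             "Invalid Password", JOptionPane.INFORMATION_MESSAGE);
 }
 catch (IOException ioe)
 {
     // The file could not be read or parsed
     JOptionPane.showMessageDialog(null, ioe.toString(),
             "IO Error", JOptionPane.INFORMATION_MESSAGE);
 }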

Often fractions like:

1/2 teaspoon ground red pepper

end up being parsed as:

V2 teaspoon ground red pepper
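
As a stopgap, and assuming the "V" really is stored in the PDF's text
layer in place of "1/" (which has not been confirmed for these files),
the extracted string could be patched after the fact. The class name
FractionRepair and its pattern are hypothetical, only a rough sketch of
the idea:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FractionRepair {
    // Hypothetical heuristic: a stand-alone token made of a capital V
    // followed by one digit from 2 to 9 is treated as a misread "1/".
    // This will also rewrite legitimate tokens such as "V8", so it is
    // only safe where such tokens are not expected.
    private static final Pattern MISREAD_FRACTION =
            Pattern.compile("\\bV([2-9])\\b");

    public static String repair(String extractedText) {
        Matcher matcher = MISREAD_FRACTION.matcher(extractedText);
        // $1 is the digit captured after the V
        return matcher.replaceAll("1/$1");
    }

    public static void main(String[] args) {
        // prints "1/2 teaspoon ground red pepper"
        System.out.println(repair("V2 teaspoon ground red pepper"));
    }
}
```

This only hides the symptom. It does not explain why the wrong
characters come out in the first place.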

I've read a brief description of what a PDF document should look like:

https://blog.idrsolutions.com/2010/04/understanding-the-pdf-file-format-text-streams/

When I search through the PDF file, I can see Tj sequences, but the
values before them are not surrounded by parentheses.
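
One possible explanation, which is only an assumption since I have not
confirmed it for these files: PDF allows a string operand to be written
either as a literal string in parentheses or as a hexadecimal string in
angle brackets, for example <48656C6C6F> Tj. The sketch below decodes
such a token into raw bytes. The class name HexStringOperand is
hypothetical and this is not PDFBox code:

```java
import java.nio.charset.StandardCharsets;

public class HexStringOperand {
    // Decodes a PDF hexadecimal string token such as "<48656C6C6F>"
    // into its raw bytes.
    public static byte[] decode(String token) {
        String hex = token.trim();
        if (!hex.startsWith("<") || !hex.endsWith(">")) {
            throw new IllegalArgumentException("Not a hex string: " + token);
        }
        // White space inside the angle brackets is ignored
        hex = hex.substring(1, hex.length() - 1).replaceAll("\\s+", "");
        // With an odd number of digits, the final digit is taken to be 0
        if (hex.length() % 2 != 0) {
            hex += "0";
        }
        byte[] bytes = new byte[hex.length() / 2];
        for (int i = 0; i < bytes.length; i++) {
            bytes[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return bytes;
    }

    public static void main(String[] args) {
        byte[] bytes = decode("<48656C6C6F>");
        // prints "Hello", but only because these bytes happen to be Latin-1 codes
        System.out.println(new String(bytes, StandardCharsets.ISO_8859_1));
    }
}
```

The decoded bytes are character codes in the font's own encoding, not
necessarily readable text. With composite fonts each code can be two
bytes long and has to be mapped to Unicode through the font, so the
raw bytes alone may not show which characters were intended.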

Can someone suggest either

1)  What the problem might be
2)  What steps I can take to get closer to understanding the problem

Thanks for your help.

Best Regards,

Jason Mazzotta
