pdfbox-users mailing list archives

From CM Reddy <mas...@netisoftware.com>
Subject Read jumbled text from multi-line highlighted text - PDFBox 1.8.14
Date Mon, 23 Jul 2018 17:08:45 GMT
Hi All,

We are using PDFBox 1.8.14 to manage PDF documents in our application.
We implemented the algorithm described at
<https://stackoverflow.com/questions/33253757/java-apache-pdfbox-extract-highlighted-text/51446785#51446785>
to read highlighted text from PDF documents. While testing the code, we
noticed that text read from multi-line highlights comes out jumbled.
Please find attached a document with three highlights.

 1. The first highlight is a single-line highlight - it works fine.
      * Extracted text : "Only a resident of Michigan may be issued a
        Michigan driver's license"

 2. The second and third are multi-line highlights - the text is jumbled.
      * Extracted text for 2nd highlight is:
          o You ask whether, in light of OAG, 1995-1996, No 6883, p 120
            (December 14, 1995) (OAG No 6883), the Michigan Secretary of
            State is
            No 68
            alien1
            required to issue a driver's license to an illegal
            living in Michigan

      * Extracted text for 3rd highlight is:
          o iad circumstances, including cashing a check,
            At one time, the federal government assigned social
            closing on a loan, gaining employment, and securing access
            to a commercial airplane. At one
            security numbers for certain valid nonwork purposes,
            including for the purpose of obtaining
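
One possible cause, offered only as a guess: a multi-line highlight
stores one quadrilateral per line in its QuadPoints array (8 numbers per
quadrilateral), and if the text is pulled from one combined area the line
order can get mixed up. A minimal sketch of splitting the array into
per-line rectangles is below. The class and method names
(QuadPointRegions, toLineRegions) are hypothetical and not part of PDFBox;
the code uses only the JDK and assumes the float array comes from the
highlight annotation's quad points.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not a PDFBox class.
public class QuadPointRegions {

    /**
     * Splits a QuadPoints array (8 numbers per quadrilateral) into one
     * bounding rectangle per quadrilateral. The result is ordered top to
     * bottom in PDF user space (larger y first), then left to right.
     */
    public static List<Rectangle2D.Float> toLineRegions(float[] quadPoints) {
        if (quadPoints == null || quadPoints.length % 8 != 0) {
            throw new IllegalArgumentException(
                    "QuadPoints length must be a multiple of 8");
        }
        List<Rectangle2D.Float> regions = new ArrayList<Rectangle2D.Float>();
        for (int i = 0; i < quadPoints.length; i += 8) {
            float minX = Float.MAX_VALUE;
            float minY = Float.MAX_VALUE;
            float maxX = -Float.MAX_VALUE;
            float maxY = -Float.MAX_VALUE;
            // Each quadrilateral is four (x, y) pairs; take their bounding box
            // so the corner order used by the PDF producer does not matter.
            for (int j = 0; j < 8; j += 2) {
                float x = quadPoints[i + j];
                float y = quadPoints[i + j + 1];
                minX = Math.min(minX, x);
                maxX = Math.max(maxX, x);
                minY = Math.min(minY, y);
                maxY = Math.max(maxY, y);
            }
            regions.add(new Rectangle2D.Float(minX, minY, maxX - minX, maxY - minY));
        }
        // The PDF origin is at the bottom-left, so a larger y is higher on the page.
        Collections.sort(regions, new Comparator<Rectangle2D.Float>() {
            public int compare(Rectangle2D.Float a, Rectangle2D.Float b) {
                int byY = Float.compare(b.y, a.y);
                return byY != 0 ? byY : Float.compare(a.x, b.x);
            }
        });
        return regions;
    }
}
```

Each rectangle could then be registered as its own region with
PDFTextStripperByArea (after flipping y to that class's top-left origin)
and the per-region text joined in list order. Whether this removes the
jumbling in the attached document is untested.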

Please help us resolve the above issues.

- Thanks in advance.


