pdfbox-users mailing list archives

From "Allison, Timothy B." <talli...@mitre.org>
Subject RE: associating text with a PDActionURI?
Date Thu, 07 Jul 2016 19:49:12 GMT

> Hmm, because it's you, I'll try it myself :-)

Thank you, Tilman!  

> You can't really know for sure with the classic text extraction, but you could use the
> extractTextByArea example with the rect coordinates.

Based on your example, though, I think this could work: cache each link's rectangle
(Rectangle2D) before processing the page, then in
writeString(String text, List<TextPosition> textPositions) test whether a cached rectangle
contains each TextPosition. I still have to implement this to test the idea...
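A minimal sketch of that caching idea, using only java.awt.geom so it runs without PDFBox. The class LinkRegionCache and its methods are hypothetical names, not PDFBox API. It assumes the rectangles have already been converted to the same top-origin coordinates as the text, and that the point tested for each TextPosition is its xDirAdj/yDirAdj pair.

```java
import java.awt.geom.Rectangle2D;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: link targets mapped to cached, already-flipped rectangles. */
final class LinkRegionCache
{
    private final Map<String, Rectangle2D> regions = new LinkedHashMap<>();

    /** Cache one link rectangle (y = 0 at the top) before text extraction starts. */
    void add( String uri, Rectangle2D rect )
    {
        regions.put( uri, rect );
    }

    /** Return the URI of the first cached rectangle containing the point, or null if none does. */
    String findLink( double x, double y )
    {
        for( Map.Entry<String, Rectangle2D> entry : regions.entrySet() )
        {
            if( entry.getValue().contains( x, y ) )
            {
                return entry.getKey();
            }
        }
        return null;
    }
}
```

Inside a text stripper subclass, findLink would presumably be called once per TextPosition from writeString; a point test is the simplest choice, and testing the whole glyph box instead would be stricter near the rectangle edges.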

The key from your example is to subtract the rectangle's y values from the height of the page.

PDAnnotationLink's rectangle, with its y values subtracted from the page height, is:
 (lowerleftx) 69.75 : (lly) 440.83 : (upperrightx) 153.45 : (upry) 415.38

TEXT: This is a hyperlink 
x: 72.024 xDirAdj: 72.024 y: 425.93 yDirAdj: 425.93 height: 5.52 heightDir: 5.52
x: 77.40048 xDirAdj: 77.40048 y: 425.93 yDirAdj: 425.93 height: 5.52 heightDir: 5.52
....
So the "This is a hyperlink" text goes from 425.93 -> 431.45 (y + height) on the y axis, which
now fits within the rectangle's y range of 415.38 to 440.83.
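The arithmetic above can be checked with a small sketch that needs only java.awt.geom. The names LinkRectFlip, flip and coversYRange are made up for illustration. The unflipped y values 351.17 and 376.62 are taken from the quoted code below, and the page height of 792 is an assumption inferred from the numbers (415.38 + 376.62), not something read from the document.

```java
import java.awt.geom.Rectangle2D;

/** Hypothetical helper, not part of PDFBox. */
final class LinkRectFlip
{
    private LinkRectFlip()
    {
        // static helpers only
    }

    /**
     * Convert a rectangle given in PDF coordinates (y = 0 at the bottom)
     * to one whose y = 0 is at the top of the page.
     */
    static Rectangle2D.Float flip( float llx, float lly, float urx, float ury, float pageHeight )
    {
        return new Rectangle2D.Float( llx, pageHeight - ury, urx - llx, ury - lly );
    }

    /** True if the vertical span from top to bottom lies inside the rectangle's y range. */
    static boolean coversYRange( Rectangle2D rect, double top, double bottom )
    {
        return top >= rect.getMinY() && bottom <= rect.getMaxY();
    }
}
```

With the assumed page height, flip( 69.75f, 351.17f, 153.45f, 376.62f, 792f ) gives a y range of roughly 415.38 to 440.83, and coversYRange accepts the text span starting at 425.93 with height 5.52.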

Does this seem right, or did I happen to get this right on this one doc?


-----Original Message-----
From: Tilman Hausherr [mailto:THausherr@t-online.de] 
Sent: Thursday, July 7, 2016 12:55 PM
To: users@pdfbox.apache.org
Subject: Re: associating text with a PDActionURI?

Here's code that works. For some reason I can't take the rectangle as it is; I have to flip
the y coordinates. I wonder if this is documented.
The coordinates in the PDF are PDF coordinates (bottom is y = 0), but the coordinates I had
to use have y = 0 at the top.

Tilman

package org.apache.pdfbox.examples.util;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.text.PDFTextStripperByArea;

import java.awt.geom.Rectangle2D;
import java.io.File;
import java.io.IOException;

/**
  * This is an example on how to extract text from a specific area on the PDF document.
  *
  * @author Ben Litchfield
  */
public final class ExtractTextByArea
{
     private ExtractTextByArea()
     {
         //utility class and should not be constructed.
     }


     /**
      * This will print the document's text in a certain area.
      *
      * @param args The command line arguments.
      *
      * @throws IOException If there is an error parsing the document.
      */
     public static void main( String[] args ) throws IOException
     {
         if( args.length != 1 )
         {
             usage();
         }
         else
         {
             PDDocument document = null;
             try
             {
                 document = PDDocument.load( new File(args[0]) );
                 PDFTextStripperByArea stripper = new PDFTextStripperByArea();
                 stripper.setSortByPosition( true );
                 float pageHeight = document.getPage(0).getCropBox().getHeight();
                 // Flip the y axis: PDF coordinates have y = 0 at the bottom,
                 // but the region rectangle needs y = 0 at the top.
                 Rectangle2D rect = new Rectangle2D.Float( 69.75f, pageHeight - 376.62f,
                         153.45f - 69.75f, 376.62f - 351.17f );
                 stripper.addRegion( "class1", rect );
                 PDPage firstPage = document.getPage(0);
                 stripper.extractRegions( firstPage );
                 System.out.println( "Text in the area:" + rect );
                 System.out.println( stripper.getTextForRegion( "class1" ) );
             }
             finally
             {
                 if( document != null )
                 {
                     document.close();
                 }
             }
         }
     }

     /**
      * This will print the usage for this document.
      */
     private static void usage()
     {
         System.err.println( "Usage: java " + ExtractTextByArea.class.getName() + " <input-pdf>" );
     }

}


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org

