pdfbox-users mailing list archives

From Tilman Hausherr <THaush...@t-online.de>
Subject Re: Extracting page "correctly"
Date Wed, 07 Nov 2018 18:24:35 GMT
On 06.11.2018 at 22:30, jorgeeflorez wrote:
> Thanks a lot Tilman for your help.
>
> It seems to me that, regarding text extraction from a page, some
> improvements can be made (I used PDFBox 2.0.11). The idea, I think,
> is that one could just invoke a method and get the text of the page,
> just as you would get it if you selected the text on the page using
> Adobe Reader.
>
> Looking at the code of LegacyPDFStreamEngine, ancestor of 
> PDFTextStripper, I found on several occasions the expression "THIS CODE 
> IS DELIBERATELY INCORRECT" (I don't know if this affects what I am 
> trying to do). Anyway, I made a subclass of PDFStreamEngine and tried 
> to get the text of the page (I am not familiar with the pdf 
> specification, operators, fonts and all that stuff). I just took some 
> code from the examples, that I think I understood, and added a couple 
> lines.
>
> I could extract the text of the file I used to test, regardless of the
> page rotation. I also used the pdf file from PDFBOX-4368 and it seems
> it got the text correctly. In a third file I used, it took the text,
> but without spaces between words (I guess spaces were not stored in the pdf).
>
> I attached the test files and the class I created. I know it doesn't
> cover all the cases, but maybe it can be helpful.
>
> By the way, text extraction was part of a bigger problem. I needed
> the text of the page and also to group the text into words and store
> the coordinates (x, y, width, height) of each word. I could do the
> grouping part (more or less), but the first part was giving me trouble :)


I don't know what is incorrect there except the height. I think John 
wanted to do something about it but it didn't happen.

Your attachment didn't get through. Please upload it somewhere.

Yes, most PDFs don't contain spaces. The PDFTextStripper class uses 
heuristics to make them up. If you are working on your own algorithm, then 
use a test like TestTextStripper.java and maybe some or all of the files 
that are part of the test. You can then compare your extraction with the 
current code, or just keep it to retest your own code as your algorithm 
evolves.
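
To give an idea of what such a heuristic can look like, here is a minimal sketch. It is not what PDFTextStripper actually does. The Glyph class, the joinLine method and the 0.3 threshold are made up for illustration, and it assumes the glyphs of one line are already sorted from left to right.

```java
import java.util.ArrayList;
import java.util.List;

final class GapSpaceHeuristic
{
    /** Hypothetical glyph record: its text, start x and width in page units. */
    static final class Glyph
    {
        final String text;
        final double x;
        final double width;

        Glyph(String text, double x, double width)
        {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    // Assumed threshold: a gap wider than this fraction of the previous
    // glyph's width is treated as a word break.
    static final double GAP_FACTOR = 0.3;

    /** Joins the glyphs of one line (sorted by x), adding a space at each wide gap. */
    static String joinLine(List<Glyph> glyphs)
    {
        StringBuilder sb = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : glyphs)
        {
            if (prev != null)
            {
                double gap = g.x - (prev.x + prev.width);
                if (gap > GAP_FACTOR * prev.width)
                {
                    sb.append(' ');
                }
            }
            sb.append(g.text);
            prev = g;
        }
        return sb.toString();
    }

    public static void main(String[] args)
    {
        List<Glyph> line = new ArrayList<>();
        line.add(new Glyph("H", 0, 10));
        line.add(new Glyph("i", 10, 4));
        line.add(new Glyph("y", 20, 8));   // gap of 6 after "i" (width 4): word break
        line.add(new Glyph("o", 28.5, 8)); // gap of 0.5: same word
        line.add(new Glyph("u", 36.5, 8));
        System.out.println(joinLine(line)); // prints "Hi you"
    }
}
```

A fixed threshold like this will fail on justified text or tight kerning, which is why it is worth running any such rule against the test files mentioned above.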

Btw, here's some updated code. The last code had several bugs: it didn't 
work on multiple pages and didn't work on pages with a /Rotate entry.

Tilman


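import java.io.File;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.pdfbox.cos.COSArray;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.PDPageContentStream.AppendMode;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;
import org.apache.pdfbox.util.Matrix;

public class ExtractAngledText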
{
     /**
      * Prints the text of each page, with one extraction pass per text angle.
      *
      * @param args The command line arguments.
      *
      * @throws IOException If there is an error parsing the document.
      */
     public static void main(String[] args) throws IOException
     {
         if (args.length != 1)
         {
             usage();
         }
         else
         {
             try (PDDocument doc = PDDocument.load(new File(args[0])))
             {
                 for (int p = 1; p <= doc.getNumberOfPages(); ++p)
                 {
                     System.out.printf("Page: %3d\n", p);
                     System.out.println("----------");

                     AngleCollector angleCollector = new AngleCollector(); // alternatively, reset angles
                     angleCollector.setStartPage(p);
                     angleCollector.setEndPage(p);
                     angleCollector.getText(doc);
                     System.out.println("Collected angles: " + angleCollector.getAngles());
                     System.out.println();

                     PDPage page = doc.getPage(p - 1);
                     int rotation = page.getRotation();
                     page.setRotation(0);
                     PDFTextStripper filteredTextStripper = new FilteredTextStripper();
                     for (int angle : angleCollector.getAngles())
                     {
                         filteredTextStripper.setStartPage(p);
                         filteredTextStripper.setEndPage(p);

                         System.out.printf("Angle: %3d\n", angle);
                         System.out.println("----------");
                         String text;
                         if (angle == 0)
                         {
                             text = filteredTextStripper.getText(doc);
                         }
                         else
                         {
                             // prepend a transformation
                             try (PDPageContentStream cs = new PDPageContentStream(doc, page,
                                     AppendMode.PREPEND, false))
                             {
                                 cs.transform(Matrix.getRotateInstance(-Math.toRadians(angle), 0, 0));
                             }

                             text = filteredTextStripper.getText(doc);

                             // remove transformation
                             COSArray contents = (COSArray) page.getCOSObject().getItem(COSName.CONTENTS);
                             contents.remove(0);
                         }
                         System.out.println(text);
                     }
                     page.setRotation(rotation);
                 }
             }
         }
     }

     /**
      * Prints the usage message.
      */
     private static void usage()
     {
         System.err.println("Usage: java " + ExtractAngledText.class.getName() + " <input-pdf>");
     }
}

class AngleCollector extends PDFTextStripper
{
     Set<Integer> angles = new HashSet<>();

     public Set<Integer> getAngles()
     {
         return angles;
     }

     /**
      * Instantiate a new PDFTextStripper object.
      *
      * @throws IOException If there is an error loading the properties.
      */
     AngleCollector() throws IOException
     {
     }

     @Override
     protected void processTextPosition(TextPosition text)
     {
         Matrix m = text.getTextMatrix();
         int angle = (int) Math.round(Math.toDegrees(Math.atan2(m.getShearY(), m.getScaleY())));
         angle = (angle + 360) % 360;
         angles.add(angle);
     }
}

class FilteredTextStripper extends PDFTextStripper
{
     FilteredTextStripper() throws IOException
     {
     }

     @Override
     protected void processTextPosition(TextPosition text)
     {
         Matrix m = text.getTextMatrix();
         int angle = (int) Math.round(Math.toDegrees(Math.atan2(m.getShearY(), m.getScaleY())));
         if (angle == 0)
         {
             super.processTextPosition(text);
         }
     }
}
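
If you want to check the angle calculation on its own, without a PDF, something like the following should do. It is only a sketch: it uses java.awt.geom.AffineTransform in place of the PDFBox Matrix (assuming the shear/scale components mean the same thing, as they do after createAffineTransform()), and the TextAngle class and angleOf method are names made up for this example.

```java
import java.awt.geom.AffineTransform;

final class TextAngle
{
    /** Rotation of the transform in whole degrees, normalized to the range 0..359. */
    static int angleOf(AffineTransform at)
    {
        // Same formula as in processTextPosition(): atan2(shearY, scaleY)
        int angle = (int) Math.round(Math.toDegrees(Math.atan2(at.getShearY(), at.getScaleY())));
        return (angle + 360) % 360;
    }

    public static void main(String[] args)
    {
        for (int degrees : new int[] { 0, 90, 180, 270, 45 })
        {
            AffineTransform at = AffineTransform.getRotateInstance(Math.toRadians(degrees));
            // A uniform scale (e.g. the font size) does not change the angle
            at.scale(12, 12);
            System.out.println(degrees + " -> " + angleOf(at));
        }
    }
}
```

Each input angle should come back unchanged, and a negative rotation such as -90 should be reported as 270 because of the normalization.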






>
> Thanks.
> Best Regards.
> Jorge Eduardo Flórez
>
>
>
>     I've been thinking about similar strategies for the same problem for
>     some time but never worked on it.
>     So yes, we could try all 4 rotations and then see what extract makes
>     more sense.
>     Another idea that I just came up with: take the
>     DrawPrintTextLocations.java example from the source code download,
>     then find this line
>     AffineTransform at = text.getTextMatrix().createAffineTransform();
>     below that, add this line:
>     System.out.println("Angle: " + Math.toDegrees(Math.atan2(at.getShearY(), at.getScaleY())));
>     Then look at the output....
>     This gets the rotation angle, which will hopefully be one of 0, 90,
>     180, 270.
>     Now run text extraction by preparing each page with
>     page.setRotation(page.getRotation()-angle);
>     However this won't work with fine rotations, e.g. the file from
>     PDFBOX-4368.
>     That would need something different, e.g. collecting all rotations,
>     and then somehow run a filtered extract for each one.
>     Tilman