pdfbox-users mailing list archives

From Andreas Lehmkuehler <andr...@lehmi.de>
Subject Re: PDFTextStripperByArea coordinates
Date Sun, 15 Jan 2012 18:48:37 GMT
Hi,

On 04.01.2012 12:53, Ilija Pavlic wrote:
> I am having issues with coordinates. The PDFTextStripperByArea region
> seems to be pushed too high.
>
> Consider the following example snippet:
>
> ...
>      PDPage page = (PDPage) allPages.get(0);
>      PDFTextStripperByArea stripper = new PDFTextStripperByArea();
>
>      // define region for extraction -- the coordinates and dimensions are x, y, width, height
>      Rectangle region = new Rectangle((int) x, (int)y, (int)width, (int)height);
>      stripper.addRegion("test region", region);
>
>      // overlay the region with a cyan rectangle to check if I got the coordinates and dimensions right
>      PDPageContentStream contentStream = new PDPageContentStream(document, page, true, true);
>      contentStream.setNonStrokingColor( Color.CYAN );
>      contentStream.fillRect( (int)x, (int)y, (int)width, (int)height );
>      contentStream.close();
>
>      // extract the text from the defined region
>      stripper.extractRegions(page);
>      String content = stripper.getTextForRegion("test region");
> ...
>      document.save(...);
> ...
>
> The cyan rectangle overlays the desired region nicely. On the other
> hand, the stripper misses a couple of lines at the bottom of the rectangle
> and includes a couple of lines above the rectangle. What is going on?
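Maybe an issue with the current transformation matrix? If the page's existing
content leaves a transformation in place, the appended stream inherits it, so
the rectangle may be drawn in a different coordinate system than the one the
stripper uses. You should probably use the PDPageContentStream constructor that
takes 5 parameters, setting the last one to "true". See [1] for further
information.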
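To illustrate the idea, here is a minimal sketch that uses only java.awt.geom and no PDFBox at all. It assumes, purely for illustration, that the existing page content left a translation of 40 units in the current transformation matrix. The class name, the region values and the 40-unit offset are all hypothetical; the real matrix of the page in question is not known.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Rectangle2D;

class CtmShiftSketch {

    /** Maps a rectangle through a transformation matrix and returns its bounds. */
    static Rectangle2D apply(AffineTransform ctm, Rectangle2D rect) {
        return ctm.createTransformedShape(rect).getBounds2D();
    }

    public static void main(String[] args) {
        // Hypothetical region: x, y, width, height
        Rectangle2D region = new Rectangle2D.Double(50, 100, 200, 80);

        // Assumption: the page's existing content left this translation active.
        AffineTransform leftover = AffineTransform.getTranslateInstance(0, 40);

        // After a context reset the appended content starts from the identity matrix.
        AffineTransform reset = new AffineTransform();

        System.out.println("appended without reset: " + apply(leftover, region));
        System.out.println("appended after reset:   " + apply(reset, region));
    }
}
```

Under that assumption the same rectangle ends up 40 units away from where its numbers say it should be, which would explain an overlay and an extraction region that do not match. With the constructor mentioned above, the call from the snippet would presumably become new PDPageContentStream(document, page, true, true, true).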

> Thank you,
> Ilija.

BR
Andreas Lehmkühler

[1] https://issues.apache.org/jira/browse/PDFBOX-854
