pdfbox-users mailing list archives

From Paul Monday <paul.mon...@docsforce.com>
Subject Re: How to define regions in PDFTextStripperByArea?
Date Tue, 29 Apr 2014 15:29:08 GMT

On Apr 29, 2014, at 8:59 AM, Qingchao Kong <kqingchao@gmail.com> wrote:

> Paul,
> Could you explain to me why you use "stripper.setSortByPosition(true);"
> and what it actually does?
I copied this from the JavaDoc for you:

The order of the text tokens in a PDF file may not be in the same order as they appear visually
on the screen. For example, a PDF writer may write out all text by font, so all bold or larger
text, then make a second pass and write out the normal text.

The default is to not sort by position.

A PDF writer could choose to write each character in a different order. By default PDFBox
does not sort the text tokens before processing them due to performance reasons.
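
To make the JavaDoc text above more concrete, here is a minimal sketch of what sorting by position means. It does not use PDFBox: the Token class, the sample coordinates, and the plain "top to bottom, then left to right" ordering are all assumptions made for illustration, and PDFBox's own comparison logic is more involved than this.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SortByPositionSketch {

    /** Hypothetical stand-in for a positioned piece of text; not a PDFBox class. */
    static final class Token {
        final String text;
        final float x; // distance from the left edge of the page
        final float y; // distance from the top edge of the page

        Token(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Joins the token texts with spaces, in the order given. */
    static String join(List<Token> tokens) {
        StringBuilder sb = new StringBuilder();
        for (Token t : tokens) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(t.text);
        }
        return sb.toString();
    }

    /** Returns a copy ordered top to bottom, then left to right. */
    static List<Token> sortedByPosition(List<Token> tokens) {
        List<Token> copy = new ArrayList<>(tokens);
        copy.sort(Comparator.comparingDouble((Token t) -> t.y)
                .thenComparingDouble(t -> t.x));
        return copy;
    }

    /** Tokens in "file order": both headings first, then the body text. */
    static List<Token> sample() {
        List<Token> tokens = new ArrayList<>();
        tokens.add(new Token("Title", 50f, 50f));
        tokens.add(new Token("Summary", 50f, 100f));
        tokens.add(new Token("intro", 50f, 70f));
        tokens.add(new Token("details", 50f, 120f));
        return tokens;
    }

    public static void main(String[] args) {
        List<Token> fileOrder = sample();
        System.out.println(join(fileOrder));                   // Title Summary intro details
        System.out.println(join(sortedByPosition(fileOrder))); // Title intro Summary details
    }
}
```

The first line printed is what you get when the text is taken in the order the writer emitted it; the second is the reading order after sorting.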

> When I use "stripper.setSortByPosition(true);", I got the following errors:
> Exception in thread "main" java.lang.IllegalArgumentException:
> Comparison method violates its general contract!
> at java.util.TimSort.mergeLo(TimSort.java:747)
> at java.util.TimSort.mergeAt(TimSort.java:483)
> at java.util.TimSort.mergeCollapse(TimSort.java:408)
> at java.util.TimSort.sort(TimSort.java:214)
> at java.util.TimSort.sort(TimSort.java:173)
> at java.util.Arrays.sort(Arrays.java:659)
> at java.util.Collections.sort(Collections.java:217)
> at org.apache.pdfbox.util.PDFTextStripper.writePage(PDFTextStripper.java:565)
> at org.apache.pdfbox.util.PDFTextStripperByArea.writePage(PDFTextStripperByArea.java:190)
> at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:457)
> at org.apache.pdfbox.util.PDFTextStripperByArea.extractRegions(PDFTextStripperByArea.java:153)
> Do you know why?

I don't know why you would get that.  Perhaps you have a different version of PDFBox than
I'm using.  I don't have time to debug your PDF and I'm not sure what your program is doing
from the stack trace.  You may not have adapted the code I gave you to your particular cropbox
size if you went with the manual rectangle setup, or perhaps your PDF is funny.  Try using
the mediabox or bleed box dimensions perhaps.

I am rather new to this approach as well.
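
On the "Comparison method violates its general contract!" message quoted above: in general, TimSort throws it when it detects that a comparator's answers are inconsistent with each other. One common way that happens is a comparator that treats values within some tolerance as equal. Whether that is the cause for this particular PDF is not confirmed; the sketch below is only an illustration, and the tolerance value and names in it are made up.

```java
import java.util.Comparator;

public class TolerantComparatorSketch {

    // Hypothetical tolerance: positions closer than this count as "the same line".
    static final float TOLERANCE = 1.0f;

    /** Compares two positions, treating values within TOLERANCE as equal. */
    static final Comparator<Float> WITHIN_TOLERANCE = (a, b) -> {
        if (Math.abs(a - b) <= TOLERANCE) {
            return 0;
        }
        return Float.compare(a, b);
    };

    /**
     * Checks one consequence of the Comparator contract for three values:
     * if a equals b and b equals c, then a must also equal c.
     */
    static boolean isConsistentFor(float a, float b, float c) {
        boolean abEqual = WITHIN_TOLERANCE.compare(a, b) == 0;
        boolean bcEqual = WITHIN_TOLERANCE.compare(b, c) == 0;
        boolean acEqual = WITHIN_TOLERANCE.compare(a, c) == 0;
        return !(abEqual && bcEqual) || acEqual;
    }

    public static void main(String[] args) {
        // Well separated values: nothing compares as equal, so no conflict.
        System.out.println(isConsistentFor(10.0f, 20.0f, 30.0f)); // true
        // 10.0 ~ 10.8 and 10.8 ~ 11.6, but 10.0 and 11.6 differ: contract broken.
        System.out.println(isConsistentFor(10.0f, 10.8f, 11.6f)); // false
    }
}
```

On Java 7 and later, setting the system property java.util.Arrays.useLegacyMergeSort to true switches back to the older merge sort, which does not perform this check. I have not tested whether that is an appropriate workaround here.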

> PS: The PDF file I use is attached.
> On Tue, Apr 29, 2014 at 9:00 PM, Paul Monday <paul.monday@docsforce.com> wrote:
>> It's not really PDFBox that mixed the main content up.  It's just a basic algorithm
>> for extracting text.  You run into this quite often when interpreting PDF files.  I've been
>> playing with this all week so I actually have some code.
>> There are two things you can try.  You could get the rectangle that the cropbox
>> defines and have the text stripper attempt to sort by position.  Depending on how your headers
>> and footers were inserted, this may sort it out.  Here is where I did that on a per page basis:
>>     for (PDPage page : pages) {
>>         // Use the page's crop box as the extraction region.
>>         PDRectangle pdr = page.getCropBox();
>>         Rectangle rec = new Rectangle();
>>         rec.setBounds(
>>                 Math.round(pdr.getLowerLeftX()),
>>                 Math.round(pdr.getLowerLeftY()),
>>                 Math.round(pdr.getWidth()),
>>                 Math.round(pdr.getHeight()));
>>         System.out.println("Cropbox: " + rec);
>>         PDFTextStripperByArea stripper = new PDFTextStripperByArea();
>>         stripper.addRegion("cropbox", rec);
>>         // Sort the text by position on the page rather than by file order.
>>         stripper.setSortByPosition(true);
>>         stripper.extractRegions(page);
>>         List<String> regions = stripper.getRegions();
>>         for (String region : regions) {
>>             String text = stripper.getTextForRegion(region);
>>             // ... use the extracted text here
>>         }
>>     }
>> This may sort your strings in the order you want.

Paul Monday
