pdfbox-users mailing list archives

From Qingchao Kong <kqingc...@gmail.com>
Subject Re: How to define regions in PDFTextStripperByArea?
Date Tue, 29 Apr 2014 14:59:34 GMT
Paul,
Could you explain why you use "stripper.setSortByPosition(true);"
and what it actually does?

When I use "stripper.setSortByPosition(true);", I get the following error:
Exception in thread "main" java.lang.IllegalArgumentException:
Comparison method violates its general contract!
at java.util.TimSort.mergeLo(TimSort.java:747)
at java.util.TimSort.mergeAt(TimSort.java:483)
at java.util.TimSort.mergeCollapse(TimSort.java:408)
at java.util.TimSort.sort(TimSort.java:214)
at java.util.TimSort.sort(TimSort.java:173)
at java.util.Arrays.sort(Arrays.java:659)
at java.util.Collections.sort(Collections.java:217)
at org.apache.pdfbox.util.PDFTextStripper.writePage(PDFTextStripper.java:565)
at org.apache.pdfbox.util.PDFTextStripperByArea.writePage(PDFTextStripperByArea.java:190)
at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:457)
at org.apache.pdfbox.util.PDFTextStripperByArea.extractRegions(PDFTextStripperByArea.java:153)

Do you know why?
PS: The PDF file I use is attached.
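
A note on the exception above: in general, TimSort throws "Comparison method violates its general contract!" when the Comparator passed to Collections.sort is inconsistent, for example when it is not transitive. Whether that is what happens inside PDFTextStripper for this file is an assumption, not something the trace proves. The sketch below uses a hypothetical comparator (fuzzy, not a PDFBox class) that treats values within a tolerance as equal, and shows how such a comparator breaks transitivity:

```java
import java.util.Comparator;

class ToleranceComparatorDemo {
    // Hypothetical comparator for illustration only (not PDFBox code):
    // two values count as "equal" when they differ by at most 'tolerance'.
    static Comparator<Float> fuzzy(final float tolerance) {
        return new Comparator<Float>() {
            @Override
            public int compare(Float a, Float b) {
                if (Math.abs(a - b) <= tolerance) {
                    return 0;
                }
                return a < b ? -1 : 1;
            }
        };
    }

    public static void main(String[] args) {
        Comparator<Float> c = fuzzy(1.0f);
        float a = 0.0f, b = 0.8f, d = 1.6f;
        // a "equals" b and b "equals" d, yet a is "less than" d.
        // Equality is not transitive here, so the comparator contract is broken.
        System.out.println("a vs b: " + c.compare(a, b));
        System.out.println("b vs d: " + c.compare(b, d));
        System.out.println("a vs d: " + c.compare(a, d));
    }
}
```

Sorting with such a comparator may or may not throw, depending on the data. If the cause is similar here, one workaround on Java 7 that may help is starting the JVM with -Djava.util.Arrays.useLegacyMergeSort=true, which switches Arrays.sort back to the older merge sort that does not perform this check.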


On Tue, Apr 29, 2014 at 9:00 PM, Paul Monday <paul.monday@docsforce.com> wrote:
> It's not really PDFBox that mixed the main content up.  It's just a basic algorithm for
> extracting text.  You run into this quite often when interpreting PDF files.  I've been playing
> with this all week so I actually have some code.
>
> There are two things you can try.  You could get the rectangle that the cropbox defines
> and have the text stripper attempt to sort by position.  Depending on how your headers and
> footers were inserted, this may sort it out.  Here is where I did that on a per page basis:
>
>                 for (PDPage page : pages) {
>                         PDRectangle pdr = page.getCropBox();
>                         Rectangle rec = new Rectangle();
>                         rec.setBounds(
>                                         Math.round(pdr.getLowerLeftX())
>                                         , Math.round(pdr.getLowerLeftY())
>                                         , Math.round(pdr.getWidth())
>                                         , Math.round(pdr.getHeight()));
>                         System.out.println("Cropbox: " + rec);
>                         PDFTextStripperByArea stripper = new PDFTextStripperByArea();
>                         stripper.addRegion("cropbox", rec);
>                         stripper.setSortByPosition(true);
>                         stripper.extractRegions(page);
>                         List<String> regions = stripper.getRegions();
>                         for (String region : regions) {
>                                 String text = stripper.getTextForRegion(region);
>                                 // ... use text here ...
>                         }
>                 }
>
> This may sort your strings in the order you want.
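
As a rough illustration of what sorting by position means: text in a PDF content stream is not necessarily stored in reading order, and sorting by position reorders the fragments by where they sit on the page. The sketch below is not PDFBox code. The Fragment class, the sample coordinates, and the convention that y grows downward from the top of the page are all assumptions made for the example:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

class SortByPositionSketch {
    // Hypothetical stand-in for a positioned piece of text.
    static final class Fragment {
        final String text;
        final float x;
        final float y; // assumed: distance from the top of the page
        Fragment(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Fragments in the order they might appear in the content stream:
    // a footer written first, then the body text.
    static List<Fragment> sample() {
        return Arrays.asList(
                new Fragment("Page 1", 280f, 700f),
                new Fragment("Hello", 10f, 100f),
                new Fragment("world", 60f, 100f));
    }

    // Returns a copy ordered top to bottom, then left to right.
    static List<Fragment> sortedByPosition(List<Fragment> fragments) {
        List<Fragment> out = new ArrayList<Fragment>(fragments);
        Collections.sort(out, new Comparator<Fragment>() {
            @Override
            public int compare(Fragment a, Fragment b) {
                int byY = Float.compare(a.y, b.y);
                return byY != 0 ? byY : Float.compare(a.x, b.x);
            }
        });
        return out;
    }

    static String join(List<Fragment> fragments) {
        StringBuilder sb = new StringBuilder();
        for (Fragment f : fragments) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(f.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("stream order:   " + join(sample()));
        System.out.println("position order: " + join(sortedByPosition(sample())));
    }
}
```

The comparator here uses exact comparisons, so it is a consistent ordering. A real text extractor likely needs some tolerance to group characters onto the same line, which is harder to do while keeping the comparator consistent.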
