pdfbox-users mailing list archives

From Paul Monday <paul.mon...@docsforce.com>
Subject Re: How to define regions in PDFTextStripperByArea?
Date Wed, 30 Apr 2014 14:19:30 GMT

On Apr 30, 2014, at 12:57 AM, Qingchao Kong <kqingchao@gmail.com> wrote:

> Paul,
>> 
>>                int width = 612;
>>                int height = 792;
>> 
>>                int hX = 320, tX = 340, cX = 100;
>>                int hY = 0, tY = 580, cY = 200;
>>                int hW = width - hX, tW = width - tX, cW = 100;
>>                int hH = 80, tH = height - tY, cH = 60;
>> 
>>                Rectangle header = new Rectangle();
>>                header.setBounds(hX, hY, hW, hH);
>>                Rectangle totals = new Rectangle();
>>                totals.setBounds(tX, tY, tW, tH);
>>                Rectangle customer = new Rectangle();
>>                customer.setBounds(cX, cY, cW, cH);
>> 
>>                PDFTextStripperByArea stripper = new PDFTextStripperByArea();
>>                stripper.addRegion("header", header);
>>                stripper.addRegion("totals", totals);
>>                stripper.addRegion("customer", customer);
>>                stripper.setSortByPosition(true);
>> 
> 
> So it means that you have set the bounds empirically, like header,
> totals and customer, is that correct? The problem is PDF files may be
> of various sizes and you only know the header/footer are at the
> front/end of a PDF page, you would never know the exact locations.

The document I'm working with puts a lot of its information inside drawn rectangles, so I was
able to look at the rectangles drawn in the document, study where they sit, and then
determine the boundaries I wanted.  I don't know whether that works for the document you are
looking at, but here is how to get all of the existing rectangles on a page:

a) get the tokens on the page
b) for each token that is an "re" (the rectangle operator):
b1) get the previous 4 tokens (relative to the "re" token, position -4 is x, -3 is y, -2 is w, -1 is h)
b2) store the rectangle (I actually wrote a routine to check whether the rectangle is part of
another rectangle or intersects one; if it intersects, I store the union of the two rectangles
and remove the two originals)
c) sort the rectangles by the y coordinate (I wrote a comparator to make this easy)
d) stare at the output, compare it to the page, and determine your regions

That only works if your PDF is drawn with rectangles though.
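
Steps a) through c) can be sketched roughly as follows. This is only an illustration, not the code I actually used: it assumes the page tokens have already been pulled into a plain List where numbers are java.lang.Number and operators are Strings (with PDFBox you would test for its own number and operator token classes instead). The class and method names are made up for the example, and skipping a rectangle that is already covered by a stored one is my assumption about the "part of another rectangle" case.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper: collects the rectangles drawn with the "re" operator.
public class RectangleFinder {

    // Assumption: numbers are java.lang.Number, operators are plain Strings.
    public static List<Rectangle2D> findRectangles(List<Object> tokens) {
        List<Rectangle2D> found = new ArrayList<>();
        for (int i = 4; i < tokens.size(); i++) {
            if (!"re".equals(tokens.get(i))) {
                continue;
            }
            // The four operands precede the operator: x y w h re
            double[] v = new double[4];
            boolean numeric = true;
            for (int k = 0; k < 4; k++) {
                Object operand = tokens.get(i - 4 + k);
                if (operand instanceof Number) {
                    v[k] = ((Number) operand).doubleValue();
                } else {
                    numeric = false;
                }
            }
            if (!numeric) {
                continue; // malformed operands, skip this "re"
            }
            // Width or height may be negative in a PDF; normalize so that
            // Rectangle2D does not treat the rectangle as empty.
            double x = Math.min(v[0], v[0] + v[2]);
            double y = Math.min(v[1], v[1] + v[3]);
            addMerged(found, new Rectangle2D.Double(x, y, Math.abs(v[2]), Math.abs(v[3])));
        }
        // Step c): sort by the y coordinate.
        found.sort(Comparator.comparingDouble(Rectangle2D::getY));
        return found;
    }

    // Step b2): skip a rectangle already covered by a stored one (assumption);
    // replace intersecting rectangles with their union.
    static void addMerged(List<Rectangle2D> found, Rectangle2D candidate) {
        boolean merged = true;
        while (merged) {
            merged = false;
            for (int i = 0; i < found.size(); i++) {
                Rectangle2D other = found.get(i);
                if (other.contains(candidate)) {
                    return;
                }
                if (other.intersects(candidate)) {
                    candidate = candidate.createUnion(other);
                    found.remove(i);
                    merged = true; // the union may now touch other rectangles
                    break;
                }
            }
        }
        found.add(candidate);
    }

    public static void main(String[] args) {
        // Made-up token list: two overlapping rectangles and one separate one.
        List<Object> tokens = Arrays.<Object>asList(
                10, 700, 100, 50, "re",
                50, 720, 100, 50, "re",
                300, 100, 20, 20, "re", "f");
        for (Rectangle2D r : findRectangles(tokens)) {
            System.out.println(r);
        }
    }
}
```

Note that the sketch takes the operands as written and ignores any transformation in effect when the rectangle is drawn, so the coordinates may still need adjusting before they line up with the regions you pass to addRegion.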

I still believe the FIRST way I showed you is the better approach to your problem, though,
since it sorts your tokens, and token order was your original complaint.

There is no magical "one size fits all" for parsing a PDF.  You need to do the hard work of
understanding the PDF specification, how PDFBox interprets that specification, and how the
AUTHOR of the PDF assembled it in the first place.  This takes time and experience.


> Btw, which version of PDFBox do you use? You never encounter the
> "Exception in thread "main" java.lang.IllegalArgumentException:" ?

Really, an IllegalArgumentException most likely has to do with your own code.  Post your code
here and maybe the problem will be obvious.  The exception and stack trace alone are sort of
irrelevant, since this is most likely a simple coding error.

Paul Monday
paul.monday@docsforce.com



