pdfbox-users mailing list archives

From Tilman Hausherr <THaush...@t-online.de>
Subject Re: Extraction problems with PDFTextStripperByArea
Date Fri, 24 Jul 2015 15:03:07 GMT
On 24.07.2015 at 10:38, Pierre Dubillot wrote:
> Hi,
> I've got great news. It was exactly what you said: the
> LowerLeftY was not set to 0, so the media box was broken.
> I spent around a week trying to solve this problem; your help is really
> appreciated!

Glad it works! Although I'm wondering why you extract from the split file
instead of from the original file (the two are identical except for the
different media box). Did you really need both, i.e. both the "split into
7 pages" thing and the text extraction? I hope that I / we didn't
somehow accidentally make you do unneeded work.

Tilman

>
> Best regards,
> Pierre
>
> 2015-07-23 21:38 GMT+02:00 Tilman Hausherr <THausherr@t-online.de>:
>
>> I ran the ExtractText command-line utility. In the original PDF, CAVANNA
>> appears once on each day, so 7 times in total. In the "new" file, when
>> extracting all, it appears 49 times.
>>
>> This suggests that the text extraction logic doesn't bother about the
>> cropbox / mediabox / whatever. Hard to tell whether this is OK or not.
>>
>> It would be nice if you could upload the extract code.
>>
>> Can you try to change your extract code so that it uses the changing "y"
>> value (probably getLowerLeftY() ) of the media box (PDPage.getMediaBox())
>> in each page?
>>
>>
>> Tilman
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>> For additional commands, e-mail: users-help@pdfbox.apache.org
>>
>>
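The suggestion above, to take the changing "y" of each page's media box into account, can be sketched as plain geometry. This is only an illustration, not the code used in this thread: MediaBoxRegions and toUserSpace are hypothetical names, not part of PDFBox. The sketch assumes the region is measured from the top-left corner of the visible page with y growing downward, and that lowerLeftX, lowerLeftY and pageHeight would come from PDPage.getMediaBox() (getLowerLeftX(), getLowerLeftY(), getHeight()). How PDFTextStripperByArea itself interprets region coordinates is not settled in this thread and may differ between versions, so the result should be checked against the actual output.

```java
import java.awt.geom.Rectangle2D;

/** Hypothetical helper for illustration only; not part of PDFBox. */
final class MediaBoxRegions {
    private MediaBoxRegions() {}

    /**
     * Converts a region measured from the top-left corner of the visible page
     * (y grows downward) into PDF user space (y grows upward), taking a media
     * box with a non-zero origin into account.
     *
     * @param region     region relative to the visible page's top-left corner
     * @param lowerLeftX lower-left x of the media box
     * @param lowerLeftY lower-left y of the media box
     * @param pageHeight height of the media box
     */
    static Rectangle2D.Double toUserSpace(Rectangle2D region,
                                          double lowerLeftX,
                                          double lowerLeftY,
                                          double pageHeight) {
        // Shift horizontally by the media box origin.
        double x = lowerLeftX + region.getX();
        // The top edge of the page sits at lowerLeftY + pageHeight in user
        // space; walk down by the region's offset and height to find the
        // region's bottom edge.
        double y = lowerLeftY + pageHeight - region.getY() - region.getHeight();
        return new Rectangle2D.Double(x, y, region.getWidth(), region.getHeight());
    }
}
```

With a media box that starts at y = 0 the lowerLeftY term drops out, which is why code written for such pages can silently pick the wrong area once the box is shifted.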



