pdfbox-users mailing list archives

From Lachezar Dobrev <l.dob...@gmail.com>
Subject Re: Detecting if PDF contains only/mostly images.
Date Tue, 31 Oct 2017 10:18:48 GMT
  Ahh... You mean use the tool as a *ahm* tool?
  I'm so used to seeing these as parts of the command-line tools that
I've totally forgotten that their inner elements are suitable for use
in code. Thanks.

  I think I'm going to create a Writer implementation that throws an
exception if non-whitespace is written to it, and pass it to
writeText(PDDocument,Writer) to quickly cancel processing as soon as
non-whitespace is found.
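
  A minimal sketch of that idea, using only java.io. The class name
WhitespaceOnlyWriter and the exception name TextFoundException are
hypothetical, chosen here for illustration. Treating U+00A0 and other
space separators as whitespace is also an assumption about what should
count as "no text".

```java
import java.io.IOException;
import java.io.Writer;

/**
 * Writer that discards everything written to it and aborts on the first
 * non-whitespace character. Class and exception names are hypothetical.
 */
class WhitespaceOnlyWriter extends Writer {

    /** Thrown when visible text is written; extends IOException so callers need no extra handling. */
    static class TextFoundException extends IOException {
        TextFoundException(char c) {
            super("Non-whitespace character found: '" + c + "'");
        }
    }

    @Override
    public void write(char[] cbuf, int off, int len) throws IOException {
        for (int i = off; i < off + len; i++) {
            char c = cbuf[i];
            // Assumption: non-breaking spaces and other space separators count as whitespace.
            if (!Character.isWhitespace(c) && !Character.isSpaceChar(c)) {
                throw new TextFoundException(c);
            }
        }
    }

    @Override
    public void flush() {
        // Nothing is buffered.
    }

    @Override
    public void close() {
        // No resources to release.
    }
}
```

  The intended use would be to hand an instance to
writeText(PDDocument,Writer) on a PDFTextStripper and catch
TextFoundException: if it is thrown, the document has text; if
writeText returns normally, only whitespace (or nothing) was extracted.
Note that this only detects extractable text, so vector graphics or
text rendered as paths would still go unnoticed.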

2017-10-30 19:54 GMT+02:00 Tilman Hausherr <THausherr@t-online.de>:
> On 30.10.2017 at 16:52, Lachezar Dobrev wrote:
>>
>>    I have been looking at it. I am actually using a similar approach
>> to read embedded bar-codes, but there I can test all images.
>>    The best I can see in ExtractImages is a way to check if there is
>> only one image. However, I cannot check if there is additional text or
>> other content, so that I do not mistakenly skip a page that has a
>> single logo (for instance) and lots of other text information.
>>    I tried looking at PDFTextStripper, but that is hard to follow.
>
>
> That one is easy... just create the object, set start and end page, and then
> call getText().
>
> Tilman
>
>
>>
>>    Is there any sure(-ish) sign that there is text on a page that I can
>> use? Can I check for the existence of something that would tell me
>> that there is additional content on the page other than the single
>> image?
>>
>> 2017-10-30 15:53 GMT+02:00 Tilman Hausherr <THausherr@t-online.de>:
>>>
>>> On 30.10.2017 at 14:04, Lachezar Dobrev wrote:
>>>>
>>>>     I have to process PDF files that (supposedly) contain one big image
>>>> per page, produced by a document scanner. I'd like to avoid
>>>> performing PDF-To-Image in these cases, and use the underlying image
>>>> instead.
>>>>     I am not well-versed in all things PDF and have no idea how to
>>>> detect if a page has content other than a single image.
>>>>     Please advise.
>>>
>>>
>>> Please have a look at the ExtractImages.java source code. You can change
>>> that one to your needs.
>>>
>>> Tilman
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>>> For additional commands, e-mail: users-help@pdfbox.apache.org
>>>

