pdfbox-users mailing list archives

From Lachezar Dobrev <l.dob...@gmail.com>
Subject Re: Detecting if PDF contains only/mostly images.
Date Mon, 06 Nov 2017 15:12:39 GMT
  Well… It worked (mostly) as expected.
  What I did not expect is that a fraction of the scanners in use
turned out to be "smart"-ish: they attempt to perform OCR on the
scanned documents/images. They are actually doing a somewhat decent job
(I was impressed). The process, however, seems to result in weird PDFs
that contain multiple layers of images stacked on top of each other,
plus text (where it was detected) stacked on top of the graphics. The
text is *transparent* with a *transparent* background (as far as I
understand), so it is invisible but can still be selected, copied and
pasted, which is really nice.
  However, that makes my job that much harder, since now bits and
pieces of the image are in different layers, and there *is* text
content.
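
  For context, the whitespace check proposed in the quoted thread below
(a Writer that throws as soon as non-white space is written, passed to
writeText(PDDocument,Writer)) could look roughly like the following. This
is a minimal sketch, not the code actually used: the class name
WhitespaceOnlyWriter, the exception name and the isBlank helper are all
hypothetical, and only the JDK is used so that it runs on its own.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.Writer;

/** Writer that discards white space and aborts on the first other character. */
final class WhitespaceOnlyWriter extends Writer {

    /** Thrown when non-white space is written (hypothetical name). */
    static final class NonWhitespaceException extends IOException {
        NonWhitespaceException(char c) {
            super("Non-whitespace character written: '" + c + "'");
        }
    }

    @Override
    public void write(char[] buf, int off, int len) throws IOException {
        for (int i = off; i < off + len; i++) {
            // Note: Character.isWhitespace does not treat U+00A0 as white space.
            if (!Character.isWhitespace(buf[i])) {
                throw new NonWhitespaceException(buf[i]);
            }
        }
    }

    @Override
    public void flush() {
        // Nothing is buffered.
    }

    @Override
    public void close() {
        // Nothing to release.
    }

    /** Stand-alone demonstration: true if the text is empty or only white space. */
    static boolean isBlank(String text) {
        try (Writer writer = new WhitespaceOnlyWriter()) {
            writer.write(text);
            return true;
        } catch (NonWhitespaceException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

  With PDFBox the same writer would be handed to writeText instead of
isBlank, with NonWhitespaceException caught by the caller. As described
above, OCR-producing scanners defeat this check, because their invisible
text is still reported as text.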

  For the time being I am handling these by rendering the page to a
BufferedImage and then using ImageIO by hand to write the page out as a
JPEG. The process seems to be very inefficient: a 124 KByte PDF file
ends up being converted to a 927 KByte JPEG image (Java Image IO @ 90%
quality). I have asked my colleagues to scan a test page that is
suitable for sharing (limited personal information); I'm open to
suggestions on how to share it.
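
  The "manual ImageIO" step can be sketched as below, assuming the page
has already been rendered to a BufferedImage. The class and method names
(JpegEncoder, encode) are made up for illustration; the javax.imageio
calls are standard JDK API. The quality value is passed explicitly, so
0.9f corresponds to the 90% setting mentioned above.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import javax.imageio.IIOImage;
import javax.imageio.ImageIO;
import javax.imageio.ImageWriteParam;
import javax.imageio.ImageWriter;
import javax.imageio.stream.ImageOutputStream;
import javax.imageio.stream.MemoryCacheImageOutputStream;

/** Encodes a rendered page as JPEG with an explicit quality setting. */
final class JpegEncoder {

    /**
     * @param image   rendered page, without an alpha channel (e.g. TYPE_INT_RGB)
     * @param quality compression quality between 0.0 (smallest) and 1.0 (best)
     */
    static byte[] encode(BufferedImage image, float quality) {
        ImageWriter writer = ImageIO.getImageWritersByFormatName("jpeg").next();
        ImageWriteParam param = writer.getDefaultWriteParam();
        // The quality can only be set once the mode is explicit.
        param.setCompressionMode(ImageWriteParam.MODE_EXPLICIT);
        param.setCompressionQuality(quality);

        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ImageOutputStream out = new MemoryCacheImageOutputStream(bytes)) {
            writer.setOutput(out);
            writer.write(null, new IIOImage(image, null, null), param);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            writer.dispose();
        }
        return bytes.toByteArray();
    }
}
```

  Lowering the quality or the render resolution are the obvious knobs for
reducing the output size; which values remain acceptable for scanned
documents has to be judged on real samples.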

  So I'm looking for ways to improve. Is there any way I can:
  * Detect and skip text when it's transparent (PDFTextStripper)
  * Render the page to a BufferedImage, but detect the density from
the images in the page without the need to guess (currently guess-set
to 3*72 = 216 ppi).
  * Detect and possibly use colour space from the embedded images (to
skip colour for black-grey-white images)
  * (please suggest other items I may have overlooked)
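
  Two of the items above reduce to simple arithmetic once the inputs are
known, and can be sketched with the JDK alone. The density of an embedded
image is its pixel width divided by the width it is drawn at in inches
(PDF user space has 72 points per inch by default). The names below
(ScanHeuristics, estimatePpi, isGrayscale, toGray) are hypothetical, and
how to obtain the drawn width from PDFBox is left open here; presumably it
would come from the transformation matrix in effect when the image is
drawn, but that is an assumption.

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

/** Helpers for picking a render density and colour mode for scanned pages. */
final class ScanHeuristics {

    static final double POINTS_PER_INCH = 72.0;

    /** Pixels per inch of an image that is pixelWidth wide and drawn drawnWidthPoints wide. */
    static double estimatePpi(int pixelWidth, double drawnWidthPoints) {
        if (pixelWidth <= 0 || drawnWidthPoints <= 0) {
            throw new IllegalArgumentException("Sizes must be positive");
        }
        return pixelWidth * POINTS_PER_INCH / drawnWidthPoints;
    }

    /** True if every pixel's R, G and B values differ by at most tolerance. */
    static boolean isGrayscale(BufferedImage image, int tolerance) {
        for (int y = 0; y < image.getHeight(); y++) {
            for (int x = 0; x < image.getWidth(); x++) {
                int rgb = image.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                int max = Math.max(r, Math.max(g, b));
                int min = Math.min(r, Math.min(g, b));
                if (max - min > tolerance) {
                    return false;
                }
            }
        }
        return true;
    }

    /** Copies the image into an 8-bit grey image, dropping colour information. */
    static BufferedImage toGray(BufferedImage image) {
        BufferedImage gray = new BufferedImage(
                image.getWidth(), image.getHeight(), BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = gray.createGraphics();
        try {
            g.drawImage(image, 0, 0, null);
        } finally {
            g.dispose();
        }
        return gray;
    }
}
```

  For example, an image 1700 pixels wide drawn across a 612 point wide
page works out to 1700 * 72 / 612 = 200 ppi, which could replace the
guessed 216 ppi. A small non-zero tolerance in isGrayscale allows for the
slight colour noise that JPEG-compressed grey scans tend to carry.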


2017-10-31 12:23 GMT+02:00 Tilman Hausherr <THausherr@t-online.de>:
> Heh heh... It's rather the opposite... it's a java library and the command
> line tools are for convenience :-)
>
> Tilman
>
>
> Am 31.10.2017 um 11:18 schrieb Lachezar Dobrev:
>>
>>    Ahh... You mean use the tool as a *ahm* tool?
>>    I'm so used to seeing these as parts of the command-line tools that
>> I've totally forgotten that their inner elements are suitable for use
>> in code. Thanks.
>>
>>    I think I'm going to create a Writer implementation that throws an
>> exception if non-white space is written to it, and use
>> writeText(PDDocument,Writer) to quickly cancel processing when
>> non-white space is found.
>>
>> 2017-10-30 19:54 GMT+02:00 Tilman Hausherr <THausherr@t-online.de>:
>>>
>>> Am 30.10.2017 um 16:52 schrieb Lachezar Dobrev:
>>>>
>>>>     I have been looking at it. I am actually using a similar approach
>>>> to read embedded bar-codes, but there I can test all images.
>>>>     The best I can see in ExtractImages is a way to check if there is
>>>> only one image. However, I cannot check if there is additional text or
>>>> other content, so that I do not mistakenly skip a page that has a
>>>> single logo (for instance) and lots of other text information.
>>>>     I tried looking at PDFTextStripper, but that is hard to follow.
>>>
>>>
>>> That one is easy... just create the object, set start and end page, and
>>> then call getText().
>>>
>>> Tilman
>>>
>>>
>>>>     Is there any sure(-ish) sign that there is text on a page that I can
>>>> use? Can I check for the existence of something that would tell me
>>>> that there is additional content on the page other than the single
>>>> image?
>>>>
>>>> 2017-10-30 15:53 GMT+02:00 Tilman Hausherr <THausherr@t-online.de>:
>>>>>
>>>>> Am 30.10.2017 um 14:04 schrieb Lachezar Dobrev:
>>>>>>
>>>>>>      I have to process PDF files that (supposedly) contain one big
>>>>>> image per page, which is the result of a document scanner. I'd like
>>>>>> to avoid performing PDF-to-image in these cases, and use the
>>>>>> underlying image instead.
>>>>>>      I am not well-versed in all things PDF and have no idea how to
>>>>>> detect if a page has content other than a single image.
>>>>>>      Please advise.
>>>>>
>>>>>
>>>>> Please have a look at the ExtractImages.java source code. You can
>>>>> change that one to your needs.
>>>>>
>>>>> Tilman
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>>>>> For additional commands, e-mail: users-help@pdfbox.apache.org
>>>>>
>>>>
>>>
>>>
>>
>
>
>
