pdfbox-users mailing list archives

From Andreas Lehmkuehler <andr...@lehmi.de>
Subject Re: Handling Graphics from Scanned PDF
Date Sun, 09 Dec 2012 15:37:31 GMT
Hi,

On 09.12.2012 15:36, Eliot Kimber wrote:
> Yes, I believe this is a masked image. I did a close reading of the PDF 1.7
> spec and I think that's what I have.
>
> The sample I'm testing with can be found here:
>
> https://dl.dropbox.com/u/20078596/pdfScannedPageWithMaskedImage.pdf
>
> Here are the dictionary entries for the three XObjects in the document:
>
> 9 0 obj
> <</BitsPerComponent
> 8/ColorSpace/DeviceGray/Filter[/FlateDecode/DCTDecode]/Height 1100/Length
> 19570/Name/image_bg0/Subtype/Image/Type/XObject/Width 850>>
>
> 10 0 obj
> <</BitsPerComponent
> 8/ColorSpace/DeviceGray/Filter[/FlateDecode/DCTDecode]/Height 1100/Length
> 8521/Mask 11 0 R/Name/image_fg0/Subtype/Image/Type/XObject/Width 850>>
>
> 11 0 obj
> <</BitsPerComponent 1/DecodeParms<</Columns 2550/K
> -1>>/Filter/CCITTFaxDecode/Height 3300/ImageMask true/Length
> 10266/Name/image_sel/Subtype/Image/Type/XObject/Width 2550>>
>
> So if I understand what this is saying, object 11 is the image mask applied
> to object 10.
Correct. FYI: have you tried the PDFDebugger that comes with PDFBox? It's a
tool for inspecting the content of a PDF in a hierarchical tree view.

> In my test code I made a little StreamEngine that simply reports on all
> XObjects and writes any PDXObjectImage objects to the file system. This is
> the output I get on this test document:
>
> processOperator(): objectName="image_bg0"
> processOperator(): object type="PDJpeg"
> processOperator(): image class=PDJpeg
> processOperator(): imageWidth="850"
> processOperator(): imageHeight="1100"
> Creating file
> /var/folders/_r/zht66_tx2lzcz4k18rzbxc240000gp/T/TestPdfUtils/image_bg0_0.jpg
> processOperator(): objectName="image_fg0"
> processOperator(): object type="PDJpeg"
> processOperator(): image class=PDJpeg
> processOperator(): imageWidth="850"
> processOperator(): imageHeight="1100"
> Creating file
> /var/folders/_r/zht66_tx2lzcz4k18rzbxc240000gp/T/TestPdfUtils/image_fg0_1.jpg
>
> Where the objectName="image_bg0" line will be emitted for any XObject of any
> type.
>
> So it looks like the ImageMask object is not being reported as an XObject.
That's correct too. The mask is not a "standalone" image XObject; it is part of
the image_fg0 image and acts as its alpha channel.
image_bg0 is painted first, then image_fg0 is painted on top of it. Because of
the mask, most of image_fg0 is treated as transparent and doesn't overwrite
anything.
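
To illustrate the painting order described above, here is a minimal sketch
using only java.awt. It is not how PDFBox does it internally. It assumes the
three images have already been decoded into BufferedImage objects, and that a
black (0) mask sample means "paint the foreground here", which you should
verify against your file. The class and method names are made up for this
example. Everything is scaled to the mask size, since the mask in the sample
(2550x3300) has a higher resolution than the two JPEGs (850x1100).

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

// Hypothetical helper, not part of PDFBox.
final class MaskedImageSketch {

    // Scales src to w x h with nearest-neighbour sampling.
    static BufferedImage scale(BufferedImage src, int w, int h) {
        BufferedImage out = new BufferedImage(w, h, BufferedImage.TYPE_INT_ARGB);
        Graphics2D g = out.createGraphics();
        g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                RenderingHints.VALUE_INTERPOLATION_NEAREST_NEIGHBOR);
        g.drawImage(src, 0, 0, w, h, null);
        g.dispose();
        return out;
    }

    // Returns an ARGB copy of fg at the mask's resolution. Pixels are opaque
    // where the mask is black and fully transparent everywhere else.
    // Assumption: black (0) in the mask means "paint".
    static BufferedImage applyMask(BufferedImage fg, BufferedImage mask) {
        int w = mask.getWidth();
        int h = mask.getHeight();
        BufferedImage fgScaled = scale(fg, w, h);
        BufferedImage out = new BufferedImage(w, h, BufferedImage.TYPE_INT_ARGB);
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                boolean paint = (mask.getRGB(x, y) & 0xFFFFFF) == 0;
                int rgb = fgScaled.getRGB(x, y) & 0xFFFFFF;
                out.setRGB(x, y, paint ? (0xFF000000 | rgb) : 0);
            }
        }
        return out;
    }

    // Paints bg first, then the masked fg on top of it.
    static BufferedImage compose(BufferedImage bg, BufferedImage fg,
                                 BufferedImage mask) {
        int w = mask.getWidth();
        int h = mask.getHeight();
        BufferedImage page = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = page.createGraphics();
        g.drawImage(bg, 0, 0, w, h, null);
        g.drawImage(applyMask(fg, mask), 0, 0, null);
        g.dispose();
        return page;
    }
}
```

Calling MaskedImageSketch.compose(bg, fg, mask) should give one page-sized
image that could be passed on to a QR code reader. If the result looks
inverted, the mask polarity assumption is wrong for that file and the test in
applyMask needs to be flipped.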

I don't have a clue why the scanned picture is split into two parts. At least
the most recent trunk version of PDFBox is able to handle this after
improving the mask handling; see [1] for further details.

Maybe you should just use the combined image generated by PDFToImage.

> Thanks,
>
> Eliot
>
> On 12/9/12 6:58 AM, "Andreas Lehmkuehler" <andreas@lehmi.de> wrote:
>
>> Hi,
>>
>> On 06.12.2012 18:48, Eliot Kimber wrote:
>>> I am trying to find QR codes on PDFs that are scanned page images. My code
>>> works fine for scans produced by my OfficeJet and for page images produced
>>> out of Acrobat but scans produced by my client's eCopy ShareScan device
>>> (according to the PDF metadata) are not usable.
>>>
>>> Looking into the PDF data stream, each page is represented by two images, a
>>> "bg" image that is what I would expect for the page image, but very faint
>>> grey, and a "fg" image that reflects the page content but with lots of grey
>>> and ghosting.
>> Sounds like masked images, but that's just a guess.
>>
>>> The PDF renderer must be combining these two images in some way to provide
>>> the clear image I see in Acrobat.
>>>
>>> Is there something I can find in the PDF data stream that will tell me how
>>> these images are combined and, if so, can anyone point me in the right
>>> direction for processing these images? I am pretty new to Java image
>>> processing so I'm not sure where to look or what to look for.
>>>
>>> The images themselves are reported by PDFBox as PDJpeg objects.
>>>
>>> I can provide a sample PDF page if it's needed.
>> Due to some restrictions you can't attach it to a posting. Please post a
>> download link referring to a public location or create an issue on jira [1]
>>
>>>
>>> Thanks,
>>>
>>> Eliot
>>>
>>
>>
>> BR
>> Andreas Lehmkühler
>>
>> [1] https://issues.apache.org/jira/browse/PDFBOX
>

BR
Andreas Lehmkühler

[1] https://issues.apache.org/jira/browse/PDFBOX-1445

