Subject: Re: Merging pages of two PDF documents
To: users@pdfbox.apache.org
From: Tilman Hausherr
Date: Tue, 27 Oct 2015 17:53:47 +0100

On 27.10.2015 at 10:12, Timm Friedholz wrote:
>> On 27.10.2015, at 06:51, Maruan Sahyoun wrote:
>>
>> Hi,
>>
>>> On 26.10.2015 at 22:46, Timm Friedholz wrote:
>>>
>>> Hello,
>>>
>>> I have some PDF documents in which the glyph-Unicode character mapping is destroyed so that it's not possible to search and copy the text. In an attempt to remove this restriction I've converted the PDFs to TIFF images and performed OCR on them using tesseract. Tesseract exports the recognized text as PDF files in which the text is positioned transparently on top of the images such that the text is searchable and selectable.
>>>
>>> The problem is that the images in the PDF that tesseract outputs are gray-scaled, large and high contrast versions of the original PDFs and I would like to have the quality and file size of the original PDFs instead.
>>> Thus my idea is to copy the text objects of the OCR output to the original PDFs. To avoid interference with the old text, I've converted the original PDFs to vector paths using Ghostscript:
>>>
>>> gs -o out.pdf -dNoOutputFonts -sDEVICE=pdfwrite in.pdf
>>>
>>> Now the problem is that I'm not sure how to approach this programmatically. Can I simply iterate over the pages and copy the text objects from each page of one document to the corresponding page of the other document? Which operators do I need to copy if I parse it token by token? Should I do it directly via the PDFStreamParser class, or are there abstractions in PDFBox that will make this easier?
>>
>> The easiest might be to
>> a) remove the images from the OCR'ed document
>> b) overlay the pages from the OCR'ed document over the original PDF using org.apache.pdfbox.multipdf.Overlay
>>
>> BR
>> Maruan
>
> Hello Maruan,
>
> Thanks for your reply. How would I go about removing the images exactly? I think this is the line that defines the images in tesseract's PDF renderer:
>
> https://github.com/tesseract-ocr/tesseract/blob/dd8c12997385cf7f5961093bcd44f0396b08f96f/api/pdfrenderer.cpp#L755
>
> Would I be able to access the image objects if I run the PDFStreamParser on the contents per page as it’s done in some of the examples, or are they stored somewhere else in the PDF file? Which operators mark the beginning and end of it?
>
> Timm

The easiest would be to remove the invoke calls from the content stream. You can still remove the actual images later (so that your project goes forward). An invoke call looks somewhat like this:

/Im1 Do

To see this better, use PDFDebugger:
https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/debugger-app/

If possible, upload such a PDF somewhere.

The line you mention is where the object is defined, but the name of the object is defined in the resources dictionary. Just removing it would only create a mess.

You can get the token list and rewrite the tokens with ContentStreamWriter.writeTokens.

Tilman
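
A minimal sketch of the filtering step described above, written against plain Java collections rather than PDFBox itself. It assumes the token list has the usual content stream shape, where operands come first and the operator follows them (as in `/Im1 Do`). The `Op` class and the `removeDoCalls` method are hypothetical stand-ins for illustration; in a real program the operator tokens would come from PDFStreamParser, and the filtered list would be written back with ContentStreamWriter.writeTokens.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RemoveDoCalls {

    /** Hypothetical stand-in for a parsed content stream operator such as "Do" or "Tj". */
    static final class Op {
        final String name;
        Op(String name) { this.name = name; }
        @Override public boolean equals(Object o) {
            return o instanceof Op && ((Op) o).name.equals(name);
        }
        @Override public int hashCode() { return name.hashCode(); }
        @Override public String toString() { return name; }
    }

    /**
     * Returns a copy of the token list without any "Do" operator and
     * without the operands that precede it (the XObject name).
     */
    static List<Object> removeDoCalls(List<Object> tokens) {
        List<Object> kept = new ArrayList<>();
        List<Object> operands = new ArrayList<>();   // operands seen since the last operator
        for (Object token : tokens) {
            if (token instanceof Op) {
                if (!"Do".equals(((Op) token).name)) {
                    kept.addAll(operands);           // keep the operands and their operator
                    kept.add(token);
                }
                operands.clear();                    // for "Do", the operands are dropped too
            } else {
                operands.add(token);
            }
        }
        kept.addAll(operands);                       // trailing operands, if any, stay untouched
        return kept;
    }

    public static void main(String[] args) {
        // Models: q  200 0 0 300 0 0 cm  /Im1 Do  Q  BT /F1 12 Tf (Hello) Tj ET
        List<Object> tokens = Arrays.<Object>asList(
                new Op("q"),
                200, 0, 0, 300, 0, 0, new Op("cm"),
                "/Im1", new Op("Do"),
                new Op("Q"),
                new Op("BT"), "/F1", 12, new Op("Tf"), "(Hello)", new Op("Tj"), new Op("ET"));
        System.out.println(removeDoCalls(tokens));
    }
}
```

Note that `Do` paints any XObject, not only images, so this filter would also drop form XObject invocations if a page had them. The image objects and their entries in the resources dictionary are left in place here, which matches the suggestion to remove them in a later step.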