Subject: Re: Merging pages of two PDF documents
From: Maruan Sahyoun
To: users@pdfbox.apache.org
Date: Tue, 27 Oct 2015 06:51:18 +0100

Hi,

> On 26.10.2015 at 22:46, Timm Friedholz wrote:
>
> Hello,
>
> I have some PDF documents in which the glyph-to-Unicode character mapping is destroyed, so that it's not possible to search and copy the text. In an attempt to remove this restriction, I've converted the PDFs to TIFF images and performed OCR on them using tesseract. Tesseract exports the recognized text as PDF files in which the text is positioned transparently on top of the images, such that the text is searchable and selectable.
>
> The problem is that the images in the PDFs that tesseract outputs are gray-scaled, large, high-contrast versions of the original PDFs, and I would like to have the quality and file size of the original PDFs instead. Thus my idea is to copy the text objects of the OCR output to the original PDFs. To avoid interference with the old text, I've converted the original PDFs to vector paths using Ghostscript:
>
>     gs -o out.pdf -dNoOutputFonts -sDEVICE=pdfwrite in.pdf
>
> Now the problem is that I'm not sure how to approach this programmatically.
> Can I simply iterate over the pages and copy the text objects from each page of one document to the corresponding page of the other document? Which operators do I need to copy if I parse it token by token? Should I actually do it directly via the PDFStreamParser class, or are there abstractions in PDFBox that will make this easier?

The easiest might be to

a) remove the images from the OCR'ed document
b) overlay the pages from the OCR'ed document over the original PDF using org.apache.pdfbox.multipdf.Overlay

BR
Maruan

>
> The code for the PDF export by tesseract is here:
>
> https://github.com/tesseract-ocr/tesseract/blob/dd8c12997385cf7f5961093bcd44f0396b08f96f/api/pdfrenderer.cpp#L1
>
> Here is the line that specifies the text objects:
>
> https://github.com/tesseract-ocr/tesseract/blob/dd8c12997385cf7f5961093bcd44f0396b08f96f/api/pdfrenderer.cpp#L317
>
> Thanks.
>
> Timm
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
> For additional commands, e-mail: users-help@pdfbox.apache.org
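To make step a) a bit more concrete, here is a rough sketch of the idea only. It uses no PDFBox API at all: it assumes the page content stream has already been split into tokens (in practice a real parser such as the PDFStreamParser mentioned above would be needed, since PDF strings can contain whitespace), and it drops every "name Do" pair, which is how an image XObject is painted. The class name, method name and sample tokens are made up for illustration. Note that Do also paints form XObjects and that inline images (BI ... ID ... EI) are not handled, so a real implementation would have to look at the page resources as well.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of PDFBox: works on pre-split tokens only.
public class ImagePaintFilter {

    /** Returns a copy of the token list without any "<name> Do" pairs. */
    public static List<String> dropXObjectPaints(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            if (token.equals("Do")) {
                // The operand of Do is the XObject name emitted just before it.
                if (!out.isEmpty() && out.get(out.size() - 1).startsWith("/")) {
                    out.remove(out.size() - 1);
                }
                continue; // skip the Do operator itself
            }
            out.add(token);
        }
        return out;
    }

    public static void main(String[] args) {
        // Illustrative tokens: an image painted full-page, then a text object.
        List<String> page = Arrays.asList(
            "q", "612", "0", "0", "792", "0", "0", "cm", "/Im1", "Do", "Q",
            "BT", "3", "Tr", "/f-0-0", "12", "Tf", "ET");
        System.out.println(String.join(" ", dropXObjectPaints(page)));
        // prints: q 612 0 0 792 0 0 cm Q BT 3 Tr /f-0-0 12 Tf ET
    }
}
```

The leftover "q ... cm Q" sequence paints nothing, so it can stay; the BT ... ET text object is passed through untouched, which is the part that step b) then overlays onto the original pages.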