Subject: Re: Fwd: Trouble reading IEEE pdf
To: users@pdfbox.apache.org
From: Tilman Hausherr
Date: Thu, 2 Feb 2017 17:03:07 +0100

On 02.02.2017 at 16:10, Pulkit Kapur wrote:
> Hi
>
> I have uploaded the pdf here:
> https://www.scribd.com/document/338221804/0024-iros-2016

Hello Pulkit,

This site requires registration. This is a "don't" from the list:
https://pdfbox.apache.org/support.html

I don't want to register. Please find a sharehoster that doesn't require registration to download.

If the XObject that Karl Heinz Kremer mentioned is a form, then text extraction should work, especially if it was possible to extract the text with Adobe Reader. If it is an image, then it won't. Apache Tika might help.

Please mention what you did to get the text with PDFBox, and what version you were using. You wrote "using readText function from the pdfbox library". There is no "readText" method in PDFBox. Could it be that you used a different product?

Tilman

>
> I did some more diagnosis last night and it seems that there are two layers
> on the pdf.
> One which is the content and the other with headers and
> footers. PDFBox is only reading the headers and footers.
> I suspect this must be common with all conference proceedings.
>
> Thanks,
>
> Pulkit
>
> On Thu, Feb 2, 2017 at 1:21 AM, Tilman Hausherr
> wrote:
>
>> On 02.02.2017 at 05:55, Pulkit Kapur wrote:
>>
>>> Hi
>>>
>>> I am trying to read some past years' IEEE conference proceedings I have.
>>> I can read the pdf using Acrobat and select the text.
>>>
>>> But when I try to read the text using readText function from the pdfbox
>>> library, I only get the headers and footers in the pdf.
>>>
>>> I did check that the document is not encrypted.
>>> Also, my code works on other pdf documents, but all IEEE proceedings that
>>> are downloaded from IEEE fail to work.
>>>
>>> I have attached the pdf document with this message.
>>>
>> Please upload the pdf somewhere, PDF attachments are not allowed here.
>>
>> Tilman
>>

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org