From: Pulkit Kapur
Date: Thu, 2 Feb 2017 10:33:43 -0500
Subject: Re: Fwd: Trouble reading IEEE pdf
To: users@pdfbox.apache.org

Thanks, Karl, for the reply. That's helpful.

What confuses me is this part: "very likely because usually such an XObject
would just be an image". I am able to select the underlying text in the
XObject using Acrobat and copy/paste it. That's why I am confused about why
PDFBox cannot access the XObject. Perhaps it is more nuanced than how I am
phrasing it.

Thanks,
Pulkit

On Thu, Feb 2, 2017 at 10:27 AM, Karl Heinz Kremer wrote:

> The document does not contain layers (or optional content groups, as they
> are called in PDF). The problem seems to be that the actual text of
> the document is in an XObject, something that is completely legal in a PDF
> file. I suspect that the text was created in one application, and then a
> second application was used to create a new page, then placed the header on
> it as "normal" text, and in a second step placed the original content into
> this XObject and then placed it on the page. This is oftentimes what e.g.
> an imposition application would do. Without having checked in the sources,
> I would assume that when you extract text, PDFBox will just process the
> Contents structure on the page, but will not recurse into XObjects that are
> encountered, very likely because usually such an XObject would just be an
> image.
>
>
> Karl Heinz Kremer
> PDF Acrobatics Without a Net
> PDF Software Development, Training and More...
>
> khk@khk.net
> http://www.khkonsulting.com
>
>
> On Thu, Feb 2, 2017 at 10:10 AM, Pulkit Kapur wrote:
>
> > Hi
> >
> > I have uploaded the PDF here:
> > https://www.scribd.com/document/338221804/0024-iros-2016
> >
> > I did some more diagnosis last night and it seems that there are two
> > layers in the PDF: one with the content and the other with the headers
> > and footers. PDFBox is only reading the headers and footers.
> > I suspect this must be common with all conference proceedings.
> >
> > Thanks,
> >
> > Pulkit
> >
> > On Thu, Feb 2, 2017 at 1:21 AM, Tilman Hausherr wrote:
> >
> > > On 02.02.2017 at 05:55, Pulkit Kapur wrote:
> > >
> > >> Hi
> > >>
> > >> I am trying to read some past years' IEEE conference proceedings I
> > >> have. I can read the PDF using Acrobat and select the text.
> > >>
> > >> But when I try to read the text using the readText function from the
> > >> PDFBox library, I only get the headers and footers in the PDF.
> > >>
> > >> I did check that the document is not encrypted.
> > >> Also, my code works on other PDF documents, but all IEEE proceedings
> > >> that are downloaded from IEEE fail to work.
> > >>
> > >> I have attached the PDF document with this message.
> > >>
> > >
> > > Please upload the PDF somewhere; PDF attachments are not allowed here.
> > >
> > > Tilman
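
Karl's guess in the quoted reply, that a text extractor might walk only the page's Contents stream and never descend into XObjects, can be illustrated with a toy model. The sketch below is not PDFBox code and does not parse real PDFs. The operator tuples, the names Fm0 and Im0, and the extract_text function are all hypothetical, chosen only to show why a non-recursive walk would return just the header and footer while Acrobat-style selection still finds the body text.

```python
# Toy model only: this is NOT PDFBox code and does not parse real PDFs.
# A "content stream" is modelled as a list of (operator, operand) pairs:
#   ("Tj", text)  shows a text string
#   ("Do", name)  paints the named XObject from the resources dictionary


def extract_text(stream, xobjects, recurse):
    """Collect text from a content stream, optionally descending into form XObjects."""
    out = []
    for op, operand in stream:
        if op == "Tj":
            out.append(operand)
        elif op == "Do" and recurse:
            kind, payload = xobjects[operand]
            if kind == "form":
                # A form XObject carries its own content stream, so walk it too.
                out.extend(extract_text(payload, xobjects, recurse))
            # Image XObjects hold pixels, so there is no text to collect.
    return out


# Hypothetical page: header and footer drawn directly on the page,
# body text placed through a form XObject, plus one image.
xobjects = {
    "Fm0": ("form", [("Tj", "Abstract"), ("Tj", "Body text of the paper")]),
    "Im0": ("image", None),
}
page = [
    ("Tj", "Conference header"),
    ("Do", "Fm0"),
    ("Do", "Im0"),
    ("Tj", "Footer"),
]

shallow = extract_text(page, xobjects, recurse=False)
deep = extract_text(page, xobjects, recurse=True)
print(shallow)
print(deep)
```

In this model the shallow walk sees only the strings drawn directly on the page, which matches the symptom reported in the thread. Whether PDFBox actually behaves this way is exactly what Karl says he has not checked in the sources.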