From: Karl Heinz Kremer
Date: Thu, 2 Feb 2017 10:36:15 -0500
Subject: Re: Fwd: Trouble reading IEEE pdf
To: users@pdfbox.apache.org

Pulkit,

I did not say that the XObjects in your document are images; I said that
XObjects usually are just images. If you analyze 100 random PDF documents,
chances are that most of them use the XObject construct only for images
and vector graphics, not for elements that contain text. Your documents
are an exception.

Karl Heinz Kremer
PDF Acrobatics Without a Net
PDF Software Development, Training and More...

khk@khk.net
http://www.khkonsulting.com

On Thu, Feb 2, 2017 at 10:33 AM, Pulkit Kapur wrote:

> Thanks Karl for the reply.
> That's helpful.
>
> What confuses me is this: "very likely because usually such an XObject
> would just be an image"
> -> I am able to select the underlying text in the XObject using Acrobat
> and copy/paste it.
> That's why I am confused about why PDFBox cannot access the XObject.
>
> Perhaps it is more nuanced than how I am phrasing it.
>
> Thanks,
>
> Pulkit
>
> On Thu, Feb 2, 2017 at 10:27 AM, Karl Heinz Kremer wrote:
>
> > The document does not contain layers (or optional content groups, as
> > they are called in PDF). The problem seems to be that the actual text
> > of the document is in an XObject, something that is completely legal
> > in a PDF file. I suspect that the text was created in one application,
> > and that a second application was then used to create a new page,
> > place the header on it as "normal" text, and in a second step put the
> > original content into this XObject and place it on the page. This is
> > oftentimes what e.g. an imposition application would do.
> > Without having checked in the sources, I would assume that when you
> > extract text, PDFBox will just process the Contents structure on the
> > page, but will not recurse into XObjects that are encountered, very
> > likely because usually such an XObject would just be an image.
> >
> > Karl Heinz Kremer
> > PDF Acrobatics Without a Net
> > PDF Software Development, Training and More...
> >
> > khk@khk.net
> > http://www.khkonsulting.com
> >
> > On Thu, Feb 2, 2017 at 10:10 AM, Pulkit Kapur wrote:
> >
> > > Hi
> > >
> > > I have uploaded the PDF here:
> > > https://www.scribd.com/document/338221804/0024-iros-2016
> > >
> > > I did some more diagnosis last night and it seems that there are
> > > two layers in the PDF: one with the content and the other with
> > > headers and footers. PDFBox is only reading the headers and footers.
> > > I suspect this must be common with all conference proceedings.
> > >
> > > Thanks,
> > >
> > > Pulkit
> > >
> > > On Thu, Feb 2, 2017 at 1:21 AM, Tilman Hausherr wrote:
> > >
> > > > On 02.02.2017 at 05:55, Pulkit Kapur wrote:
> > > >
> > > >> Hi
> > > >>
> > > >> I am trying to read some past years' IEEE conference proceedings
> > > >> I have. I can read the PDF using Acrobat and select the text.
> > > >>
> > > >> But when I try to read the text using the readText function from
> > > >> the PDFBox library, I only get the headers and footers in the PDF.
> > > >>
> > > >> I did check that the document is not encrypted.
> > > >> Also, my code works on other PDF documents, but all IEEE
> > > >> proceedings that are downloaded from IEEE fail to work.
> > > >>
> > > >> I have attached the PDF document with this message.
> > > >
> > > > Please upload the PDF somewhere; PDF attachments are not allowed
> > > > here.
> > > >
> > > > Tilman
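
To make the hypothesis discussed in this thread concrete, here is a toy model in plain Java of why an extractor that reads only a page's own content stream could return the header but miss body text placed in a form XObject. It does not use PDFBox and says nothing about what PDFBox actually does. `ContentStream`, `showText`, `invoke`, `extractText` and `samplePage` are hypothetical names invented for this sketch; only `Tj` (show text) and `Do` (invoke a named XObject) are real PDF operators.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class XObjectTextDemo {

    /** Toy content stream: a list of operations plus the XObjects it can refer to by name. */
    static final class ContentStream {
        // Each operation is {operator, operand}: {"Tj", text} or {"Do", xobjectName}.
        final List<String[]> ops = new ArrayList<>();
        // A null value models an image XObject, which has no content stream of its own.
        final Map<String, ContentStream> xobjects = new HashMap<>();

        ContentStream showText(String text) {
            ops.add(new String[] {"Tj", text});
            return this;
        }

        ContentStream invoke(String name) {
            ops.add(new String[] {"Do", name});
            return this;
        }
    }

    /** Collects text in drawing order; descends into form XObjects only when recurse is true. */
    static List<String> extractText(ContentStream stream, boolean recurse) {
        List<String> out = new ArrayList<>();
        for (String[] op : stream.ops) {
            if (op[0].equals("Tj")) {
                out.add(op[1]);
            } else if (recurse && op[0].equals("Do")) {
                ContentStream form = stream.xobjects.get(op[1]);
                if (form != null) { // skip image XObjects
                    out.addAll(extractText(form, true));
                }
            }
        }
        return out;
    }

    /** A page shaped like the one described above: header as page text, body inside a form XObject. */
    static ContentStream samplePage() {
        ContentStream body = new ContentStream().showText("Body text of the paper");
        ContentStream page = new ContentStream()
                .showText("Conference header")
                .invoke("Fm0")
                .invoke("Im0");
        page.xobjects.put("Fm0", body); // form XObject containing text
        page.xobjects.put("Im0", null); // image XObject
        return page;
    }

    public static void main(String[] args) {
        ContentStream page = samplePage();
        System.out.println(extractText(page, false)); // [Conference header]
        System.out.println(extractText(page, true));  // [Conference header, Body text of the paper]
    }
}
```

With `recurse` set to false the sketch yields only the header, which matches the symptom reported here; with `recurse` set to true the body text inside the form XObject is found as well, and the image XObject is skipped either way.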