Subject: Re: More questions about page iteration
From: Tilman Hausherr
To: users@pdfbox.apache.org
Date: Tue, 16 May 2017 15:26:43 +0200

Sadly for you, that one has nothing to do with page labels. It's really
just a footer on the page. And there is no concept of "footer" in PDF.
It's just text at the bottom.

Tilman

On 16.05.2017 at 15:21, David Patterson wrote:
> They show up when I print the PDF or open it to read it. I want to extract
> the Table of Contents from each of > 100 PDFs so I can make a super-Table
> of Contents and allow users to search for the document they need to read.
> (The file name of the desired contents is not obvious, and so with a
> consolidated Table of Contents, a more novice user can find the content
> they want to read and open the correct document to see the text. These are
> Standard Operating Procedures for a 24x7 production facility and the
> operators might need to review what to do in case of a problem.)
>
> I was hoping that in the transition from Word (where the documents are
> authored), through saving as a PDF and combining them into Portfolios, some
> part of the process would have identified it as a page label, but I guess
> that did not happen.
>
> I'm able to find the text of that string since it only occurs in the footer
> of the page.
>
> Thanks.
>
> Dave Patterson
>
> On Tue, May 16, 2017 at 8:42 AM, Tilman Hausherr wrote:
>
>> On 16.05.2017 at 14:35, David Patterson wrote:
>>
>>> Tilman,
>>>
>>> The code I tried is:
>>>
>>> byte[] bytes = // content of file as a byte array
>>> PDDocument pdDocument = PDDocument.load( bytes );
>>> PDDocumentCatalog cat2 = pdDocument.getDocumentCatalog();
>>> PDPageLabels pageLabels = cat2.getPageLabels();
>>> if ( pageLabels == null ) {
>>>     System.out.println( "Page labels missing " );
>>> }
>>>
>>> I'm getting "Page labels missing" on each document.
>>>
>> Then let's go back to the beginning. You mentioned "I've got page numbers
>> like "TOC-1", "TOC-2", "Page 1"". Where did these show up?
>>
>> Tilman
>>
>>> I have no idea of, or control over, the process used to convert a Word file
>>> into a PDF. I just inherited a bunch of PDFs that I'm trying to interpret.
>>>
>>> Dave Patterson
>>>
>>> On Mon, May 15, 2017 at 1:57 PM, Tilman Hausherr wrote:
>>>
>>>> On 15.05.2017 at 19:11, David Patterson wrote:
>>>>
>>>>> Alas, after testing with my documents, the PageLabels is null. :-(
>>>>
>>>> But you said it has "TOC-1". This sounds like page labels. You can also try
>>>> with PDFDebugger, it will show the labels if there are some.
>>>>
>>>> Tilman
>>>>
>>>>> Thank you for the help and encouragement.
>>>>>
>>>>> Dave Patterson
>>>>>
>>>>> On Mon, May 15, 2017 at 12:34 PM, Tilman Hausherr <THausherr@t-online.de>
>>>>> wrote:
>>>>>
>>>>>> On 15.05.2017 at 18:30, David Patterson wrote:
>>>>>>
>>>>>>> Tilman,
>>>>>>>
>>>>>>> Thank you very much. (I feel bad asking some of the questions, but the
>>>>>>> data is stored in "out of the way" corners that are hard to find.)
>>>>>>
>>>>>> Don't :-)
>>>>>>
>>>>>>> Is there any documentation that explains how the linkages work? Would
>>>>>>> it help to have the PDF Standard Document?
>>>>>>
>>>>>> Yes. I read it all the time. The PDFBox API closely follows the PDF
>>>>>> specification. So here it's linked from the document catalog, so the
>>>>>> methods used are in the PDDocumentCatalog class. But asking was a good
>>>>>> decision as this got you that convenience method (that is in
>>>>>> PDFDebugger).
>>>>>>
>>>>>> Tilman
>>>>>>
>>>>>>> Thanks.
>>>>>>>
>>>>>>> Dave Patterson
>>>>>>>
>>>>>>> On Mon, May 15, 2017 at 12:13 PM, Tilman Hausherr <THausherr@t-online.de>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> On 15.05.2017 at 15:20, David Patterson wrote:
>>>>>>>>
>>>>>>>>> I've now got my code working to iterate through a PDDocument and
>>>>>>>>> process it page by page.
>>>>>>>>>
>>>>>>>>> Next hurdle: Is there a way to get the page number as printed? I've
>>>>>>>>> got page numbers like "TOC-1", "TOC-2", "Page 1", ...
>>>>>>>>>
>>>>>>>>> How much work is it to get the "TOC-1"?
>>>>>>>>>
>>>>>>>>> Thanks.
>>>>>>>>>
>>>>>>>>> Dave Patterson
>>>>>>>>
>>>>>>>> /**
>>>>>>>>  * Convenience method to get the page label if available.
>>>>>>>>  *
>>>>>>>>  * @param document
>>>>>>>>  * @param pageIndex 0-based page number.
>>>>>>>>  * @return a page label or null if not available.
>>>>>>>>  */
>>>>>>>> public static String getPageLabel(PDDocument document, int pageIndex)
>>>>>>>> {
>>>>>>>>     PDPageLabels pageLabels;
>>>>>>>>     try
>>>>>>>>     {
>>>>>>>>         pageLabels = document.getDocumentCatalog().getPageLabels();
>>>>>>>>     }
>>>>>>>>     catch (IOException ex)
>>>>>>>>     {
>>>>>>>>         return ex.getMessage();
>>>>>>>>     }
>>>>>>>>     if (pageLabels != null)
>>>>>>>>     {
>>>>>>>>         String[] labels = pageLabels.getLabelsByPageIndices();
>>>>>>>>         if (labels[pageIndex] != null)
>>>>>>>>         {
>>>>>>>>             return labels[pageIndex];
>>>>>>>>         }
>>>>>>>>     }
>>>>>>>>     return null;
>>>>>>>> }

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org
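
Since the printed page number here is only text at the bottom of the page and not a page label, one way to recover it is to search the extracted text of each page for the footer string. The following is a minimal sketch, not PDFBox API: it assumes the text of a single page is already available as a String (for example from PDFTextStripper restricted to one page), and that the footers look like the "TOC-1" and "Page 1" examples mentioned in this thread. The class name FooterPageNumber, the method find and the pattern are hypothetical and would need adjusting to the real documents.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: finds a printed page number such as "TOC-1" or "Page 1" in page text. */
final class FooterPageNumber {

    // Assumed footer formats, taken from the examples in this thread: "TOC-<n>" and "Page <n>".
    private static final Pattern PRINTED_NUMBER =
            Pattern.compile("\\b(TOC-\\d+|Page \\d+)\\b");

    private FooterPageNumber() {
    }

    /**
     * Returns the last match in the page text, on the assumption that the
     * footer is the last matching text extracted from the page.
     */
    static Optional<String> find(String pageText) {
        Matcher matcher = PRINTED_NUMBER.matcher(pageText);
        String last = null;
        while (matcher.find()) {
            last = matcher.group(1);
        }
        return Optional.ofNullable(last);
    }

    public static void main(String[] args) {
        // Stand-in for the text of one page; real input would come from text extraction.
        String page = "Standard Operating Procedure\nTable of Contents\nTOC-2\n";
        System.out.println(find(page).orElse("(no printed page number found)"));
    }
}
```

Taking the last match is a design choice based on the footer being the final text on the page; if the same pattern can also appear in body text, or the extraction order differs, the search would have to be limited to the bottom region of the page instead.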