Mailing-List: contact users-help@pdfbox.apache.org; run by ezmlm
Precedence: bulk
Reply-To: users@pdfbox.apache.org
Subject: Re: More questions about page iteration
To: users@pdfbox.apache.org
References: <CAH5ZNOrkB+Ju1Xzn5uvpkDysLGaftUnLmVV4pzU_9P-GTpZ0fg@mail.gmail.com>
 <86b065db-76a9-aadc-d43d-a51a772a3eb3@t-online.de>
 <CAH5ZNOq5NqdKRts+rd-kuLcWuJ_r6KOcs2YF50R+5SGiaNBUxA@mail.gmail.com>
 <4c20759e-a5cb-c817-565c-4a25b4cedead@t-online.de>
 <CAH5ZNOo+cNm7C=UX_F1W0vtEi7mmwVuA2rmxb1GWrFnqOBUTNA@mail.gmail.com>
 <9d45c7d5-1ae3-7083-92bf-debc7e9be030@t-online.de>
 <CAH5ZNOo5_1H5B5fu9xd0xRkW6R+K6b=dv=Bhv3nXYjw=cKgz=w@mail.gmail.com>
 <252850ef-8116-bb76-687d-3eba6b0f25be@t-online.de>
 <CAH5ZNOr3M6DqCzzrKzzu8u_VZDiyT7vq0+DFwpQdty2TvYGVDw@mail.gmail.com>
From: Tilman Hausherr <THausherr@t-online.de>
Message-ID: <985c8e28-24be-508f-7001-176cad84f488@t-online.de>
Date: Tue, 16 May 2017 15:26:43 +0200
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101
 Thunderbird/52.1.1
MIME-Version: 1.0
In-Reply-To: <CAH5ZNOr3M6DqCzzrKzzu8u_VZDiyT7vq0+DFwpQdty2TvYGVDw@mail.gmail.com>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Content-Language: en-US
archived-at: Tue, 16 May 2017 13:26:40 -0000

Sadly for you, that one has nothing to do with page labels. It's really 
just a footer on the page. And there is no concept of "footer" in PDF. 
It's just text at the bottom.
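
Since the footer is ordinary page text, one possible approach is to extract
each page's text (for example with PDFTextStripper, setting setStartPage and
setEndPage to the same page number) and search it for the label pattern,
starting from the bottom. Below is a minimal sketch in plain Java, assuming
the footers look like the "TOC-1" and "Page 1" examples from this thread.
FooterLabels and findFooterLabel are hypothetical names, not PDFBox API, and
the pattern would need adjusting for other footer formats.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of PDFBox. */
final class FooterLabels {

    // Assumed footer formats, taken from the examples in this thread:
    // "TOC-<number>" or "Page <number>".
    private static final Pattern LABEL =
            Pattern.compile("\\b(TOC-\\d+|Page \\d+)\\b");

    private FooterLabels() {
    }

    /**
     * Scans the text of one page from the last line upwards and returns the
     * first label-like match, or an empty Optional if no line matches.
     *
     * @param pageText text of a single page, e.g. as returned by a text stripper
     */
    static Optional<String> findFooterLabel(String pageText) {
        if (pageText == null) {
            return Optional.empty();
        }
        // \R matches any line break (\n, \r\n, ...)
        String[] lines = pageText.split("\\R");
        for (int i = lines.length - 1; i >= 0; i--) {
            Matcher m = LABEL.matcher(lines[i]);
            if (m.find()) {
                return Optional.of(m.group(1));
            }
        }
        return Optional.empty();
    }
}
```

Scanning from the bottom is only a heuristic: it assumes the footer comes
last in the extracted text, which may require position-sorted extraction,
and body text that happens to match the pattern can still produce a false
hit on pages without a footer.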

Tilman

On 16.05.2017 at 15:21, David Patterson wrote:
> They show up when I print the PDF or open it to read it. I want to extract
> the Table of Contents from each of more than 100 PDFs so I can make a
> super-Table of Contents and allow users to search for the document they
> need to read. (The file name of the desired contents is not obvious, so
> with a consolidated Table of Contents, a more novice user can find the
> content they want to read and open the correct document to see the text.)
> These are Standard Operating Procedures for a 24x7 production facility, and
> the operators might need to review what to do in case of a problem.
>
> I was hoping that in the transition from Word (where the documents are
> authored), through saving as a PDF and combining them into Portfolios, some
> part of the process would have identified it as a page label, but I guess
> that did not happen.
>
> I'm able to find the text of that string since it only occurs in the footer
> of the page.
>
> Thanks.
>
> Dave Patterson
>
> On Tue, May 16, 2017 at 8:42 AM, Tilman Hausherr <THausherr@t-online.de>
> wrote:
>
>> On 16.05.2017 at 14:35, David Patterson wrote:
>>
>>> Tilman,
>>>
>>> The code I tried is:
>>>
>>> byte[] bytes = // content of file as a byte array
>>> PDDocument pdDocument = PDDocument.load( bytes );
>>> PDDocumentCatalog cat2 = pdDocument.getDocumentCatalog();
>>> PDPageLabels pageLabels = cat2.getPageLabels();
>>> if ( pageLabels == null ) {
>>> System.out.println( "Page labels missing " );
>>> }
>>>
>>>
>>> I'm getting "Page labels missing" on each document.
>>>
>> Then let's go back to the beginning. You mentioned "I've got page numbers
>> like "TOC-1", "TOC-2", "Page 1"". Where did these show up?
>>
>> Tilman
>>
>>
>>
>>
>>> I have no idea of, or control over, the process used to convert a Word
>>> file into a PDF. I just inherited a bunch of PDFs that I'm trying to
>>> interpret.
>>>
>>> Dave Patterson
>>>
>>> On Mon, May 15, 2017 at 1:57 PM, Tilman Hausherr <THausherr@t-online.de>
>>> wrote:
>>>
>>>> On 15.05.2017 at 19:11, David Patterson wrote:
>>>>> Alas, after testing with my documents, the PageLabels is null. :-(
>>>> But you said it has "TOC-1". This sounds like page labels. You can also
>>>> try with PDFDebugger, it will show the labels if there are some.
>>>>
>>>> Tilman
>>>>
>>>>
>>>>
>>>>> Thank you for the help and encouragement.
>>>>> Dave Patterson
>>>>>
>>>>> On Mon, May 15, 2017 at 12:34 PM, Tilman Hausherr <
>>>>> THausherr@t-online.de>
>>>>> wrote:
>>>>>
>>>>>> On 15.05.2017 at 18:30, David Patterson wrote:
>>>>>>
>>>>>>> Tilman,
>>>>>>>
>>>>>>> Thank you very much. (I feel bad asking some of the questions, but the
>>>>>>> data is stored in "out of the way" corners that are hard to find.)
>>>>>>>
>>>>>> Don't :-)
>>>>>>
>>>>>>> Is there any documentation that explains how the linkages work? Would
>>>>>>> it help to have the PDF Standard Document?
>>>>>>>
>>>>>> Yes. I read it all the time. The PDFBox API closely follows the PDF
>>>>>> specification. Here, the page labels are linked from the document
>>>>>> catalog, so the methods used are in the PDDocumentCatalog class. But
>>>>>> asking was a good decision, as this got you that convenience method
>>>>>> (which is in PDFDebugger).
>>>>>>
>>>>>> Tilman
>>>>>>
>>>>>>> Thanks.
>>>>>>>
>>>>>>> Dave Patterson
>>>>>>>
>>>>>>> On Mon, May 15, 2017 at 12:13 PM, Tilman Hausherr <
>>>>>>> THausherr@t-online.de>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> On 15.05.2017 at 15:20, David Patterson wrote:
>>>>>>>>
>>>>>>>>> I've now got my code working to iterate through a PDDocument and
>>>>>>>>> process it page by page.
>>>>>>>>>
>>>>>>>>> Next hurdle: Is there a way to get the page number as printed? I've
>>>>>>>>> got page numbers like "TOC-1", "TOC-2", "Page 1", ...
>>>>>>>>>
>>>>>>>>> How much work is it to get the "TOC-1"?
>>>>>>>>>
>>>>>>>>> Thanks.
>>>>>>>>>
>>>>>>>>> Dave Patterson
>>>>>>>>>
>>>>>>>>         /**
>>>>>>>>          * Convenience method to get the page label if available.
>>>>>>>>          *
>>>>>>>>          * @param document
>>>>>>>>          * @param pageIndex 0-based page number.
>>>>>>>>          * @return a page label or null if not available.
>>>>>>>>          */
>>>>>>>>         public static String getPageLabel(PDDocument document, int pageIndex)
>>>>>>>>         {
>>>>>>>>             PDPageLabels pageLabels;
>>>>>>>>             try
>>>>>>>>             {
>>>>>>>>                 pageLabels = document.getDocumentCatalog().getPageLabels();
>>>>>>>>             }
>>>>>>>>             catch (IOException ex)
>>>>>>>>             {
>>>>>>>>                 return ex.getMessage();
>>>>>>>>             }
>>>>>>>>             if (pageLabels != null)
>>>>>>>>             {
>>>>>>>>                 String[] labels = pageLabels.getLabelsByPageIndices();
>>>>>>>>                 if (labels[pageIndex] != null)
>>>>>>>>                 {
>>>>>>>>                     return labels[pageIndex];
>>>>>>>>                 }
>>>>>>>>             }
>>>>>>>>             return null;
>>>>>>>>         }
>>>>>>>>
>>>>>>>>
>>>>>>>> ---------------------------------------------------------------------
>>>>>>>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>>>>>>>> For additional commands, e-mail: users-help@pdfbox.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org