pdfbox-users mailing list archives

From Tilman Hausherr <THaush...@t-online.de>
Subject Re: More questions about page iteration
Date Tue, 16 May 2017 13:26:43 GMT
Sadly for you, that one has nothing to do with page labels. It's really 
just a footer on the page. And there is no concept of "footer" in PDF. 
It's just text at the bottom.

Tilman
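
Since the "TOC-1" / "Page 1" strings are ordinary page text rather than page labels, one option is to pick them out of the text already extracted for each page. The following is only a minimal sketch, assuming the page text is available as a String (for example from PDFTextStripper) and that the printed number sits in the last non-blank line in one of the two formats mentioned in this thread. The class name FooterLabelFinder, the method findFooterLabel, the regular expression, and the sample text in main are illustrative assumptions and not part of PDFBox.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FooterLabelFinder
{
    // Assumed footer formats, taken from the examples in this thread: "TOC-1", "Page 1".
    private static final Pattern LABEL = Pattern.compile("\\b(TOC-\\d+|Page \\d+)\\b");

    /**
     * Looks for a printed page number in the last non-blank line of a page's text.
     *
     * @param pageText text extracted from a single page.
     * @return the label if the last non-blank line contains one, otherwise empty.
     */
    public static Optional<String> findFooterLabel(String pageText)
    {
        String[] lines = pageText.split("\\R");
        for (int i = lines.length - 1; i >= 0; i--)
        {
            if (lines[i].trim().isEmpty())
            {
                continue; // skip trailing blank lines
            }
            // Only the last non-blank line is treated as the footer.
            Matcher m = LABEL.matcher(lines[i]);
            return m.find() ? Optional.of(m.group(1)) : Optional.empty();
        }
        return Optional.empty();
    }

    public static void main(String[] args)
    {
        // Made-up sample page text; real input would come from text extraction.
        String page = "Table of Contents\nStartup procedure\nSOP-17    TOC-1\n\n";
        System.out.println(findFooterLabel(page).orElse("(no footer label found)"));
    }
}
```

Restricting the search to the last non-blank line is a deliberate simplification: it avoids matching a "Page 3" that merely appears in running text, but it would miss footers that span several lines or are not extracted last.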

On 16.05.2017 at 15:21, David Patterson wrote:
> They show up when I print the PDF or open it to read it. I want to extract
> the Table of Contents from each of more than 100 PDFs so I can build a
> super-Table of Contents and let users search for the document they need.
> The file name of the desired content is not obvious, so with a consolidated
> Table of Contents a novice user can find the content they want and open the
> correct document to see the text. These are Standard Operating Procedures
> for a 24x7 production facility, and the operators might need to review what
> to do in case of a problem.
>
> I was hoping that somewhere in the transition from Word (where the documents
> are authored), through saving as PDF and combining them into Portfolios,
> some part of the process would have identified it as a page label, but I
> guess that did not happen.
>
> I'm able to find the text of that string since it only occurs in the footer
> of the page.
>
> Thanks.
>
> Dave Patterson
>
> On Tue, May 16, 2017 at 8:42 AM, Tilman Hausherr <THausherr@t-online.de>
> wrote:
>
>> On 16.05.2017 at 14:35, David Patterson wrote:
>>
>>> Tilman,
>>>
>>> The code I tried is:
>>>
>>> byte[] bytes = // content of file as a byte array
>>> PDDocument pdDocument = PDDocument.load( bytes );
>>> PDDocumentCatalog cat2 = pdDocument.getDocumentCatalog();
>>> PDPageLabels pageLabels = cat2.getPageLabels();
>>> if ( pageLabels == null ) {
>>>     System.out.println( "Page labels missing " );
>>> }
>>>
>>>
>>> I'm getting "Page labels missing" on each document.
>>>
>> Then let's go back to the beginning. You mentioned "I've got page numbers
>> like "TOC-1", "TOC-2", "Page 1"". Where did these show up?
>>
>> Tilman
>>
>>
>>
>>
>>> I have no knowledge of, or control over, the process used to convert a Word
>>> file into a PDF. I just inherited a bunch of PDFs that I'm trying to interpret.
>>>
>>> Dave Patterson
>>>
>>> On Mon, May 15, 2017 at 1:57 PM, Tilman Hausherr <THausherr@t-online.de>
>>> wrote:
>>>
>>>> On 15.05.2017 at 19:11, David Patterson wrote:
>>>>
>>>>> Alas, after testing with my documents, the PageLabels is null. :-(
>>>>>
>>>> But you said it has "TOC-1". This sounds like page labels. You can also
>>>> try with PDFDebugger; it will show the labels if there are any.
>>>>
>>>> Tilman
>>>>
>>>>
>>>>> Thank you for the help and encouragement.
>>>>>
>>>>> Dave Patterson
>>>>>
>>>>> On Mon, May 15, 2017 at 12:34 PM, Tilman Hausherr <
>>>>> THausherr@t-online.de>
>>>>> wrote:
>>>>>
>>>>>> On 15.05.2017 at 18:30, David Patterson wrote:
>>>>>>
>>>>>>> Tilman,
>>>>>>>
>>>>>>> Thank you very much. (I feel bad asking some of the questions, but the
>>>>>>> data is stored in "out of the way" corners that are hard to find.)
>>>>>>>
>>>>>> Don't :-)
>>>>>>
>>>>>>> Is there any documentation that explains how the linkages work? Would it
>>>>>>> help to have the PDF Standard Document?
>>>>>>>
>>>>>> Yes. I read it all the time. The PDFBox API closely follows the PDF
>>>>>> specification. Here the page labels are linked from the document catalog,
>>>>>> so the methods used are in the PDDocumentCatalog class. But asking was a
>>>>>> good decision, as this got you that convenience method (that is in
>>>>>> PDFDebugger).
>>>>>>
>>>>>> Tilman
>>>>>>
>>>>>>
>>>>>>> Thanks.
>>>>>>>
>>>>>>> Dave Patterson
>>>>>>>
>>>>>>> On Mon, May 15, 2017 at 12:13 PM, Tilman Hausherr <
>>>>>>> THausherr@t-online.de>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> On 15.05.2017 at 15:20, David Patterson wrote:
>>>>>>>>
>>>>>>>>> I've now got my code working to iterate through a PDDocument and
>>>>>>>>> process it page by page.
>>>>>>>>>
>>>>>>>>> Next hurdle: Is there a way to get the page number as printed? I've
>>>>>>>>> got page numbers like "TOC-1", "TOC-2", "Page 1", ...
>>>>>>>>>
>>>>>>>>> How much work is it to get the "TOC-1"?
>>>>>>>>>
>>>>>>>>> Thanks.
>>>>>>>>>
>>>>>>>>> Dave Patterson
>>>>>>>>>
>>>>>>>>
>>>>>>>>         /**
>>>>>>>>          * Convenience method to get the page label if available.
>>>>>>>>          *
>>>>>>>>          * @param document
>>>>>>>>          * @param pageIndex 0-based page number.
>>>>>>>>          * @return a page label or null if not available.
>>>>>>>>          */
>>>>>>>>         public static String getPageLabel(PDDocument document, int pageIndex)
>>>>>>>>         {
>>>>>>>>             PDPageLabels pageLabels;
>>>>>>>>             try
>>>>>>>>             {
>>>>>>>>                 pageLabels = document.getDocumentCatalog().getPageLabels();
>>>>>>>>             }
>>>>>>>>             catch (IOException ex)
>>>>>>>>             {
>>>>>>>>                 return ex.getMessage();
>>>>>>>>             }
>>>>>>>>             if (pageLabels != null)
>>>>>>>>             {
>>>>>>>>                 String[] labels = pageLabels.getLabelsByPageIndices();
>>>>>>>>                 if (labels[pageIndex] != null)
>>>>>>>>                 {
>>>>>>>>                     return labels[pageIndex];
>>>>>>>>                 }
>>>>>>>>             }
>>>>>>>>             return null;
>>>>>>>>         }
>>>>>>>>
>>>>>>>>
>>>>>>>> ---------------------------------------------------------------------
>>>>>>>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>>>>>>>> For additional commands, e-mail: users-help@pdfbox.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org

