pdfbox-users mailing list archives

From Duff Johnson <duff.john...@pdfa.org>
Subject Re: Unable to mark document as tagged
Date Fri, 13 Jun 2014 17:59:09 GMT
Colette,

It might be a good idea to take a look at section 14.8 of ISO 32000-1, which defines Tagged PDF.

You can download it for free:

http://wwwimages.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf
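
To make the gap concrete, here is a minimal Java sketch (plain JDK, no PDFBox) that compares the two catalog dictionaries quoted further down in this thread. It assumes that a tagged PDF needs a /StructTreeRoot entry in the catalog in addition to /MarkInfo with /Marked true, as described in 14.8. The class and method names (TaggedCatalogCheck, hasKey, looksTagged) are hypothetical, and the naive text scan is only an illustration, not a PDF parser.

```java
import java.util.regex.Pattern;

class TaggedCatalogCheck {

    // Hypothetical helper: true if the catalog text contains the given name key.
    // The lookahead keeps "/MarkInfo" from matching a longer name such as "/MarkInfoX".
    static boolean hasKey(String catalogText, String key) {
        Pattern p = Pattern.compile(Pattern.quote(key) + "(?![A-Za-z0-9])");
        return p.matcher(catalogText).find();
    }

    // Assumption: both entries must be present for a viewer to treat the file as tagged.
    static boolean looksTagged(String catalogText) {
        return hasKey(catalogText, "/MarkInfo") && hasKey(catalogText, "/StructTreeRoot");
    }

    public static void main(String[] args) {
        // Catalog of the original file, as quoted in the thread (truncated there).
        String original =
            "/Type/Catalog/Pages 2 0 R/Lang(en-CA) /StructTreeRoot 10 0 R/MarkInfo<</Marked true";
        // Catalog of the modified file, as quoted in the thread.
        String modified =
            "/Type /Catalog /Version /1.4 /Pages 2 0 R /MarkInfo 3 0 R";

        System.out.println("original: " + looksTagged(original));
        System.out.println("modified: " + looksTagged(modified));
    }
}
```

With the two strings above, this prints "original: true" and "modified: false": the new catalog carries /MarkInfo but no /StructTreeRoot. In PDFBox itself, the comparable check would presumably be whether newDoc.getDocumentCatalog().getStructureTreeRoot() returns null. Carrying the structure tree over is likely more than copying one entry, since its elements refer to pages and marked content in the original document.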

Duff.


On Jun 13, 2014, at 1:52 PM, Maruan Sahyoun <sahyoun@fileaffairs.de> wrote:

> Colette,
> 
> You are not corrupting the PDF document, but the structure information needed for tagged PDF is missing.
> 
> Maruan Sahyoun
> 
>> On 13.06.2014 at 19:41, Colette Joubarne <cjoubarne@privacyanalytics.ca> wrote:
>> 
>> Maruan,
>> 
>> I use the parser to tokenize, and then loop through the tokens. If a token is a TJ or Tj operator, I grab the text, in certain cases replace some of it (letter by letter, maintaining the existing structure), and add these tokens to a new token list. If it is not a TJ or Tj operator, I just copy the token to the new token list. I then write the token list to the doc and save.
>> 
>> If I am corrupting the structure, how is it that the document displays correctly?
>> 
>> Colette
>> 
>> -----Original Message-----
>> From: Maruan Sahyoun [mailto:sahyoun@fileaffairs.de] 
>> Sent: June-13-14 12:54 PM
>> To: users@pdfbox.apache.org
>> Subject: Re: Unable to mark document as tagged
>> 
>> Hi Colette,
>> 
>> The modified version does not contain the structure information needed for tagged PDFs. How do you create the modified version from the first one?
>> 
>> BR
>> Maruan
>> 
>>> On 13.06.2014 at 17:48, Colette Joubarne <cjoubarne@privacyanalytics.ca> wrote:
>>> 
>>> Maruan,
>>> 
>>> I am copying the entire structure from a tagged document and just replacing some of the text, so I would think that the structure is unchanged. Then again, who knows what I might have messed up.
>>> 
>>> James.pdf is the original file:
>>> https://dl.dropboxusercontent.com/u/7689859/James.pdf
>>> 
>>> James-mod.pdf is the modified file:
>>> https://dl.dropboxusercontent.com/u/7689859/James-mod.pdf
>>> 
>>> Colette
>>> 
>>> -----Original Message-----
>>> From: Maruan Sahyoun [mailto:sahyoun@fileaffairs.de] 
>>> Sent: June-13-14 10:45 AM
>>> To: users@pdfbox.apache.org
>>> Subject: Re: Unable to mark document as tagged
>>> 
>>> Hi Colette,
>>> 
>>> This information alone doesn't make a document a tagged PDF! You might not have the structure information needed within your PDF. Do you have a working / non-working pair of samples that you could upload to a public location? Attachments are not allowed on the mailing list.
>>> 
>>> BR
>>> Maruan
>>> 
>>>> On 13.06.2014 at 15:44, Colette Joubarne <cjoubarne@privacyanalytics.ca> wrote:
>>>> 
>>>> Maruan,
>>>> 
>>>> Yes, you are right. However, why is it that when I look at the properties in Adobe Reader, it indicates that the document is not tagged?
>>>> 
>>>> 3 0 obj
>>>> <<
>>>> /Marked true
>>>> 
>>>> Colette
>>>> -----Original Message-----
>>>> From: Maruan Sahyoun [mailto:sahyoun@fileaffairs.de] 
>>>> Sent: June-13-14 9:19 AM
>>>> To: users@pdfbox.apache.org
>>>> Subject: Re: Unable to mark document as tagged
>>>> 
>>>> Dear Colette,
>>>> 
>>>> /MarkInfo 3 0 R indicates that the information you are looking for is referenced and should be available in 3 0 obj. Could you verify that?
>>>> 
>>>> With kind regards
>>>> 
>>>> Maruan
>>>> 
>>>>> On 13.06.2014 at 14:21, Colette Joubarne <cjoubarne@privacyanalytics.ca> wrote:
>>>>> 
>>>>> I have a tagged pdf doc with the following header:
>>>>> 
>>>>>        /Type/Catalog/Pages 2 0 R/Lang(en-CA) /StructTreeRoot 10 0 R/MarkInfo<</Marked true
>>>>> 
>>>>> I read in the contents, replace some of the text and create a new doc. I copy the document information from the original doc and set marked to true.
>>>>> 
>>>>>        newDoc = new PDDocument();
>>>>>        newDoc.setDocumentInformation(PTConstants.pdfDoc.getDocumentInformation());
>>>>> 
>>>>>        PDMarkInfo markinfo = new PDMarkInfo();
>>>>>        markinfo.setMarked(true);
>>>>>        newDoc.getDocumentCatalog().setMarkInfo(markinfo);
>>>>> 
>>>>> and when I check that it was set, it returns true:
>>>>> 
>>>>>  // Query the new document, where the flag was just set (not the original one)
>>>>>  PDMarkInfo markInfo = newDoc.getDocumentCatalog().getMarkInfo();
>>>>>  if ((markInfo != null) && (markInfo.isMarked())) System.out.println("true");
>>>>> 
>>>>> But, while the resulting document displays correctly, the header indicates that it is not tagged:
>>>>> 
>>>>> /Type /Catalog
>>>>> /Version /1.4
>>>>> /Pages 2 0 R
>>>>> /MarkInfo 3 0 R
>>>>> 
>>>>> Any idea what is going on?
>>>>> 
>>>>> Colette
>> 

