pdfbox-users mailing list archives

From Tilman Hausherr <THaush...@t-online.de>
Subject Re: Regarding retrieving COSName.getPDFName(PreflightConstants.DICTIONARY_KEY_LINEARIZED
Date Tue, 25 Jul 2017 14:40:17 GMT
But I remember that you complained about performance, and I explained a
way to do this that would require some work.

Alternatively, try the code below, which gets the dictionary without
using the preflight library by extending the parser and using old-style
document loading. The code loads "orphan" objects, which the main parser
doesn't do.

// Imports for PDFBox 2.0.x
import java.io.File;
import java.io.IOException;
import java.util.List;
import java.util.Map;

import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.cos.COSDictionary;
import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.cos.COSObject;
import org.apache.pdfbox.cos.COSObjectKey;
import org.apache.pdfbox.io.RandomAccessBufferedFileInputStream;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.encryption.InvalidPasswordException;

public class LinearizedCheck
{
    public static void main(String[] args) throws IOException
    {
        RandomAccessBufferedFileInputStream raf =
                new RandomAccessBufferedFileInputStream(new File("test_medium.pdf"));
        PDFParser parser = new MyPDFParser(raf);
        parser.parse();
        PDDocument document = parser.getPDDocument();
        System.out.println(getLinearizedDictionary(document));
        document.close();
    }

    // from preflight: returns the first dictionary that has a /Linearized key,
    // or null if there is none
    static COSDictionary getLinearizedDictionary(PDDocument document)
    {
        // ---- Get Ref to obj
        COSDocument cDoc = document.getDocument();
        List<?> lObj = cDoc.getObjects();
        for (Object object : lObj)
        {
            COSBase curObj = ((COSObject) object).getObject();
            if (curObj instanceof COSDictionary
                    && ((COSDictionary) curObj).containsKey(COSName.getPDFName("Linearized")))
            {
                return (COSDictionary) curObj;
            }
        }
        return null;
    }

    private static class MyPDFParser extends PDFParser
    {
        MyPDFParser(RandomAccessBufferedFileInputStream raf) throws IOException
        {
            super(raf);
        }

        // from preflight
        @Override
        protected void initialParse() throws InvalidPasswordException, IOException
        {
            super.initialParse();
            // For each ObjectKey, we check if the object has been loaded
            // useful for linearized PDFs
            Map<COSObjectKey, Long> xrefTable = document.getXrefTable();
            for (Map.Entry<COSObjectKey, Long> entry : xrefTable.entrySet())
            {
                COSObject co = document.getObjectFromPool(entry.getKey());
                if (co.getObject() == null)
                {
                    // object isn't loaded - parse the object to load its content
                    parseObjectDynamically(co, true);
                }
            }
        }
    }
}


According to the PDF specification: "The linearization parameter 
dictionary shall be entirely contained within the first 1024 bytes of 
the PDF file. This limits the amount of data a conforming reader must 
read before deciding whether the file is linearized."

So if I were you, I'd download the first 1024 bytes and search them for
"/Linearized" with the Knuth-Morris-Pratt algorithm:
https://stackoverflow.com/questions/1507780/searching-for-a-sequence-of-bytes-in-a-binary-file-with-java
Only if you get a hit, use the code above.
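
A minimal sketch of that pre-check, assuming the PDF is available as an
InputStream. It reads at most the first 1024 bytes and runs a
Knuth-Morris-Pratt search for "/Linearized". The class and method names
(LinearizedPrefilter, readHead, indexOf, looksLinearized) are made up for
this illustration and are not part of PDFBox. A hit only means the file is
probably linearized, so the parser code is still needed to get the actual
dictionary.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper, not part of PDFBox.
public class LinearizedPrefilter
{
    private static final byte[] MARKER =
            "/Linearized".getBytes(StandardCharsets.US_ASCII);
    // The spec requires the linearization dictionary within the first 1024 bytes.
    private static final int HEADER_LIMIT = 1024;

    // Reads up to 'limit' bytes from the start of the stream.
    static byte[] readHead(InputStream in, int limit) throws IOException
    {
        byte[] buf = new byte[limit];
        int total = 0;
        while (total < limit)
        {
            int n = in.read(buf, total, limit - total);
            if (n < 0)
            {
                break; // end of stream before the limit
            }
            total += n;
        }
        return Arrays.copyOf(buf, total);
    }

    // KMP failure table: fail[i] is the length of the longest proper prefix
    // of pattern[0..i] that is also a suffix of it.
    static int[] failureTable(byte[] pattern)
    {
        int[] fail = new int[pattern.length];
        int k = 0;
        for (int i = 1; i < pattern.length; i++)
        {
            while (k > 0 && pattern[i] != pattern[k])
            {
                k = fail[k - 1];
            }
            if (pattern[i] == pattern[k])
            {
                k++;
            }
            fail[i] = k;
        }
        return fail;
    }

    // Returns the index of the first occurrence of pattern in data, or -1.
    static int indexOf(byte[] data, byte[] pattern)
    {
        if (pattern.length == 0)
        {
            return 0;
        }
        int[] fail = failureTable(pattern);
        int k = 0; // number of pattern bytes matched so far
        for (int i = 0; i < data.length; i++)
        {
            while (k > 0 && data[i] != pattern[k])
            {
                k = fail[k - 1];
            }
            if (data[i] == pattern[k])
            {
                k++;
            }
            if (k == pattern.length)
            {
                return i - pattern.length + 1;
            }
        }
        return -1;
    }

    // True if "/Linearized" occurs within the first 1024 bytes of the stream.
    static boolean looksLinearized(InputStream in) throws IOException
    {
        return indexOf(readHead(in, HEADER_LIMIT), MARKER) >= 0;
    }

    public static void main(String[] args) throws IOException
    {
        if (args.length != 1)
        {
            System.err.println("usage: LinearizedPrefilter <file.pdf>");
            return;
        }
        try (InputStream in = Files.newInputStream(Paths.get(args[0])))
        {
            System.out.println(looksLinearized(in));
        }
    }
}
```

For a remote file, the same looksLinearized call could be fed from a
stream that fetches only the first 1024 bytes, so the full download and
parse happen only after a hit.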

Tilman



On 25.07.2017 at 10:34, karthick g wrote:
> Hi team,
>
> Based on my analysis, I have found one thing regarding linearized PDFs in
> PDFBox 2.0 and above.
>
> COSDocument cDoc = pdDoc.getDocument();
> List<COSObject> lObj = cDoc.getObjects();
> for (COSObject object : lObj)
> {
>     System.out.println(object.getObjectNumber());
> }
>
> This code retrieves the COSObject numbers of the PDF document and prints
> them sequentially. PDFBox 1.8.2 and 2.0.6 behave the same, except that in
> 2.0.6 the COSObject pointing to the Linearized dictionary is not added.
>
> 748 0 obj <</Linearized 1/L 1829691/O 752/E 171783/N 9/T 1814683/H [ 3196
> 824]>> endobj
>
> The object 748 0, which is present in 1.8.2, is not present in 2.0.6. Is
> this finding correct, and can you guide me on how to fix it?
> If it is fixed, I will be able to retrieve the Linearized dictionary
> without using the preflight jar.
>
> PDFBox 1.8.2
> ===========
> COSObject{1, 0}
> ---------------------
> ------------------------------
> ---------------------------
> COSObject{747, 0}
> COSObject{748, 0}
> COSObject{749, 0}
> ---------
>
> PDFBox 2.0.6
> ===========
> COSObject{1, 0}
> ---------------------
> ------------------------------
> ---------------------------
> COSObject{747, 0}
> COSObject{749, 0}
> ---------
>
> Regards,
> Karthick G
>
> On Thu, Jul 13, 2017 at 12:02 PM, karthick g <ikarthick2002@gmail.com>
> wrote:
>
>> Hi Team,
>>
>> In our project we want to read the Linearized dictionary. Before the 2.0
>> versions, we were able to get that dictionary with ordinary workarounds,
>> without loading the preflight document. Since 2.0 we have to load the
>> preflight document to get the linearized property, which requires an
>> additional workaround and costs performance. Will there be a way in the
>> next release to retrieve the linearized property without loading the
>> preflight document?
>>
>> Regards,
>> Karthick G
>>

