pdfbox-users mailing list archives

From Andreas Lehmkuehler <andr...@lehmi.de>
Subject Re: Xref parsing performance
Date Sat, 28 Feb 2015 16:53:49 GMT
On 28.02.2015 at 17:49, Maruan Sahyoun wrote:
> Hi,
>
> On 28.02.2015 at 17:32, Andreas Lehmkuehler <andreas@lehmi.de> wrote:
>
>> Hi
>>
>> On 28.02.2015 at 16:47, Tilman Hausherr wrote:
>>> Hi Andrea,
>>>
>>> While a speed improvement in parsing of large files would be much appreciated
>>> (especially by the TIKA users), there are several problems with your change:
>> +1
>>
>>> - don't make changes that need JDK7 or higher, even if they are cool. We
>>> currently use JDK6.
>>>
>>> - regressions:
>>>
>>> Error converting file PDFBOX-2250-110264-xref-zeronumber.pdf
>>> java.io.IOException: XREF for 3:0 points to wrong object: 1:0
>>>      at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:696)
>>>      at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:639)
>>>      at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:600)
>>>      at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:346)
>>>      at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:373)
>>>      at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:811)
>>>      at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:757)
>>>      at org.apache.pdfbox.util.TestPDFToImage.doTestFile(TestPDFToImage.java:201)
>>>      at org.apache.pdfbox.util.TestPDFToImage.testRenderImage(TestPDFToImage.java:343)
>>>      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>      at java.lang.reflect.Method.invoke(Method.java:606)
>>>      at junit.framework.TestCase.runTest(TestCase.java:176)
>>>      at junit.framework.TestCase.runBare(TestCase.java:141)
>>>      at junit.framework.TestResult$1.protect(TestResult.java:122)
>>>      at junit.framework.TestResult.runProtected(TestResult.java:142)
>>>      at junit.framework.TestResult.run(TestResult.java:125)
>>>      at junit.framework.TestCase.run(TestCase.java:129)
>>>      at junit.framework.TestSuite.runTest(TestSuite.java:255)
>>>      at junit.framework.TestSuite.run(TestSuite.java:250)
>>>      at junit.textui.TestRunner.doRun(TestRunner.java:116)
>>>      at junit.textui.TestRunner.start(TestRunner.java:183)
>>>      at junit.textui.TestRunner.main(TestRunner.java:137)
>>>      at org.apache.pdfbox.util.TestPDFToImage.main(TestPDFToImage.java:393)
>>>
>>>
>>> Error converting file PDFBOX-2599.pdf
>>> java.io.IOException: XREF for 2:0 points to wrong object: 1:0
>>>      at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:696)
>>>      at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:639)
>>>      at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:600)
>>>      at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:346)
>>>      at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:373)
>>>      at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:811)
>>>      at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:757)
>>>      at org.apache.pdfbox.util.TestPDFToImage.doTestFile(TestPDFToImage.java:201)
>>>      at org.apache.pdfbox.util.TestPDFToImage.testRenderImage(TestPDFToImage.java:343)
>>>      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>      at java.lang.reflect.Method.invoke(Method.java:606)
>>>      at junit.framework.TestCase.runTest(TestCase.java:176)
>>>      at junit.framework.TestCase.runBare(TestCase.java:141)
>>>      at junit.framework.TestResult$1.protect(TestResult.java:122)
>>>      at junit.framework.TestResult.runProtected(TestResult.java:142)
>>>      at junit.framework.TestResult.run(TestResult.java:125)
>>>      at junit.framework.TestCase.run(TestCase.java:129)
>>>      at junit.framework.TestSuite.runTest(TestSuite.java:255)
>>>      at junit.framework.TestSuite.run(TestSuite.java:250)
>>>      at junit.textui.TestRunner.doRun(TestRunner.java:116)
>>>      at junit.textui.TestRunner.start(TestRunner.java:183)
>>>      at junit.textui.TestRunner.main(TestRunner.java:137)
>>>      at org.apache.pdfbox.util.TestPDFToImage.main(TestPDFToImage.java:393)
>>>
>>>
>>> - why change only one of the members of the COSObjectKey class to int?
>>> According to the spec, both are integers. Maybe there's a good reason, but I'd
>>> like to know.
>> AFAIK there is no good reason not to change both to int.
>
> As the offset is a 10-digit number, is that really covered by an int?
It's about the object number, not the offset. We are using a long for the offset.
The spec is quite clear about those numbers: they have to be integers, and the
max value for an integer within a PDF is 2^31-1, because the assumed default
platform for a conforming reader is 32-bit.
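
To make the difference concrete, here is a minimal sketch (plain JDK6-compatible
Java, not PDFBox code). It assumes the largest offset in question is the biggest
value ten decimal digits can express; the class and method names are made up for
illustration only.

```java
class XrefNumberRanges {
    // Largest value a 10-digit decimal offset field can hold (assumption for this sketch).
    static final long MAX_TEN_DIGIT_OFFSET = 9999999999L;

    // True if the value can be stored in a Java int (at most 2^31-1) without loss.
    static boolean fitsInInt(long value) {
        return value >= Integer.MIN_VALUE && value <= Integer.MAX_VALUE;
    }

    public static void main(String[] args) {
        // A 10-digit offset does not fit into an int, so it stays a long.
        System.out.println("offset fits in int: " + fitsInInt(MAX_TEN_DIGIT_OFFSET));
        // Object and generation numbers are capped at 2^31-1, so an int is enough.
        System.out.println("max object number fits in int: " + fitsInInt(Integer.MAX_VALUE));
    }
}
```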

BTW, I've changed the object/generation number to int.
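
Regarding the repeated full scans in XrefTrailerResolver.getContainedObjectNumbers
described in the original report quoted further down: one possible way to avoid
them is to group the compressed objects by their containing object stream once,
and then answer each lookup from that map. The following is only a sketch of
that idea under my own assumptions. ObjectStreamIndex and its methods are
hypothetical names, not the PDFBox API and not necessarily what the proposed
patch does.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: built once while the xref entries are read.
class ObjectStreamIndex {
    // object stream number -> numbers of the objects stored inside that stream
    private final Map<Integer, List<Integer>> byStream =
            new HashMap<Integer, List<Integer>>();

    // Record that object objNr is compressed inside object stream streamObjNr.
    void add(int objNr, int streamObjNr) {
        List<Integer> contained = byStream.get(streamObjNr);
        if (contained == null) {
            contained = new ArrayList<Integer>();
            byStream.put(streamObjNr, contained);
        }
        contained.add(objNr);
    }

    // Single map lookup instead of a scan over every xref entry.
    List<Integer> containedObjectNumbers(int streamObjNr) {
        List<Integer> contained = byStream.get(streamObjNr);
        return contained == null ? new ArrayList<Integer>() : contained;
    }
}
```

The index costs one pass over the xref entries up front; every later lookup per
object stream is then independent of the total number of entries.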

>
> BR
> Maruan
>
>>
>>> - even if you get rid of the regressions, a remaining problem is that
>>>     - Andreas L. is currently working on some parser stuff in PDFBOX-2527
>> That's not a problem. For now I'm focused on the parsing process itself and
>> am working on one last piece, the rebuild mechanism.
>>
>>>     - your change is too big to evaluate (I'm speaking only for myself there).
>>> It would be better to first submit only small refactorings in PDFBOX-2576, and
>>
>> I agree. We should try to break up the patch into smaller pieces if possible.
>> Let's start with the long -> int change.
>>
>>> then the optimization you mention (or the other way around). The parser is
>>> indeed a tricky part of the code (And SonarQube and Software Diagnostics have
>>> also flagged it as too complex). I did some refactorings a few weeks ago there
>>> (splitting methods), but stopped because I couldn't come up with names for the
>>> new methods. I just didn't understand what they were doing.
>>>
>>> Tilman
>>
>> BR
>> Andreas Lehmkühler
>>
>>>
>>> On 27.02.2015 at 16:34, Andrea Vacondio wrote:
>>>> Hi,
>>>> A few days ago I was profiling PDFBox when loading medium/large size
>>>> documents and I think I found something.
>>>> If you try loading the document
>>>> http://www.adobe.com/devnet/acrobat/pdfs/pdf_reference_1-7.pdf  you'll see
>>>> it takes quite some time and that's mostly spent in the
>>>> XrefTrailerResolver.getContainedObjectNumbers. The issue is that every time
>>>> an object contained in an unparsed object stream is found, the
>>>> XrefTrailerResolver performs a full scan of the xref entries found in the
>>>> document, in this case hundreds of thousands. If the object streams are
>>>> many (like in the given doc), it performs many full scans resulting in poor
>>>> performance.
>>>> I'm trying to get familiar with the PDFBox code and I decided to try and
>>>> fix this here: https://github.com/torakiki/sambox/tree/xref
>>>> As you can see, I refactored a bit, extracting some classes, and covered the
>>>> expected behaviour with unit tests. I tested it with a few random docs, loading
>>>> and saving them back, and the output is exactly the same with or without my
>>>> changes. The pdf_reference_1-7.pdf doc loads in half the time, and so does
>>>> http://wwwimages.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf
>>>> Other kinds of docs load in a comparable amount of time, and memory usage
>>>> under profiling also seems comparable, if not a little lower.
>>>> Maybe someone wants to take a look?
>>>>
>>>> I understand my changes look a bit invasive and the issue could probably be
>>>> fixed differently. On the other hand, the BaseParser+COSParser pair looks
>>>> like a big intimidating monster to a newcomer like me, and it's quite
>>>> difficult to follow the expected behaviour, so I thought this might be a
>>>> chance to start breaking them down into smaller, distilled classes...
>>>> something a little more manageable and testable... anyway, grab what you
>>>> like, leave what you don't  :)
>>>>
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>>> For additional commands, e-mail: users-help@pdfbox.apache.org
>>>