From Andreas Lehmkuehler <andr...@lehmi.de>
Subject Re: Xref parsing performance
Date Sat, 28 Feb 2015 18:54:01 GMT
On 28.02.2015 at 18:34, Maruan Sahyoun wrote:
>
> On 28.02.2015 at 18:18, Andreas Lehmkuehler <andreas@lehmi.de> wrote:
>
>> On 28.02.2015 at 18:07, Maruan Sahyoun wrote:
>>> Hi,
>>>
>>> On 28.02.2015 at 17:53, Andreas Lehmkuehler <andreas@lehmi.de> wrote:
>>>
>>>> On 28.02.2015 at 17:49, Maruan Sahyoun wrote:
>>>>> Hi,
>>>>>
>>>>> On 28.02.2015 at 17:32, Andreas Lehmkuehler <andreas@lehmi.de> wrote:
>>>>>
>>>>>> Hi
>>>>>>
>>>>>> On 28.02.2015 at 16:47, Tilman Hausherr wrote:
>>>>>>> Hi Andrea,
>>>>>>>
>>>>>>> While a speed improvement in parsing of large files would be much appreciated
>>>>>>> (especially by the TIKA users), there are several problems with your change:
>>>>>> +1
>>>>>>
>>>>>>> - don't make changes that need JDK7 or higher, even if they are cool. We
>>>>>>> currently use JDK6.
>>>>>>>
>>>>>>> - regressions:
>>>>>>>
>>>>>>> Error converting file PDFBOX-2250-110264-xref-zeronumber.pdf
>>>>>>> java.io.IOException: XREF for 3:0 points to wrong object: 1:0
>>>>>>>      at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:696)
>>>>>>>      at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:639)
>>>>>>>      at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:600)
>>>>>>>      at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:346)
>>>>>>>      at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:373)
>>>>>>>      at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:811)
>>>>>>>      at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:757)
>>>>>>>      at org.apache.pdfbox.util.TestPDFToImage.doTestFile(TestPDFToImage.java:201)
>>>>>>>      at org.apache.pdfbox.util.TestPDFToImage.testRenderImage(TestPDFToImage.java:343)
>>>>>>>      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>>>      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>>>>>      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>>>      at java.lang.reflect.Method.invoke(Method.java:606)
>>>>>>>      at junit.framework.TestCase.runTest(TestCase.java:176)
>>>>>>>      at junit.framework.TestCase.runBare(TestCase.java:141)
>>>>>>>      at junit.framework.TestResult$1.protect(TestResult.java:122)
>>>>>>>      at junit.framework.TestResult.runProtected(TestResult.java:142)
>>>>>>>      at junit.framework.TestResult.run(TestResult.java:125)
>>>>>>>      at junit.framework.TestCase.run(TestCase.java:129)
>>>>>>>      at junit.framework.TestSuite.runTest(TestSuite.java:255)
>>>>>>>      at junit.framework.TestSuite.run(TestSuite.java:250)
>>>>>>>      at junit.textui.TestRunner.doRun(TestRunner.java:116)
>>>>>>>      at junit.textui.TestRunner.start(TestRunner.java:183)
>>>>>>>      at junit.textui.TestRunner.main(TestRunner.java:137)
>>>>>>>      at org.apache.pdfbox.util.TestPDFToImage.main(TestPDFToImage.java:393)
>>>>>>>
>>>>>>>
>>>>>>> Error converting file PDFBOX-2599.pdf
>>>>>>> java.io.IOException: XREF for 2:0 points to wrong object: 1:0
>>>>>>>      at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:696)
>>>>>>>      at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:639)
>>>>>>>      at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:600)
>>>>>>>      at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:346)
>>>>>>>      at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:373)
>>>>>>>      at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:811)
>>>>>>>      at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:757)
>>>>>>>      at org.apache.pdfbox.util.TestPDFToImage.doTestFile(TestPDFToImage.java:201)
>>>>>>>      at org.apache.pdfbox.util.TestPDFToImage.testRenderImage(TestPDFToImage.java:343)
>>>>>>>      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>>>      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>>>>>      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>>>      at java.lang.reflect.Method.invoke(Method.java:606)
>>>>>>>      at junit.framework.TestCase.runTest(TestCase.java:176)
>>>>>>>      at junit.framework.TestCase.runBare(TestCase.java:141)
>>>>>>>      at junit.framework.TestResult$1.protect(TestResult.java:122)
>>>>>>>      at junit.framework.TestResult.runProtected(TestResult.java:142)
>>>>>>>      at junit.framework.TestResult.run(TestResult.java:125)
>>>>>>>      at junit.framework.TestCase.run(TestCase.java:129)
>>>>>>>      at junit.framework.TestSuite.runTest(TestSuite.java:255)
>>>>>>>      at junit.framework.TestSuite.run(TestSuite.java:250)
>>>>>>>      at junit.textui.TestRunner.doRun(TestRunner.java:116)
>>>>>>>      at junit.textui.TestRunner.start(TestRunner.java:183)
>>>>>>>      at junit.textui.TestRunner.main(TestRunner.java:137)
>>>>>>>      at org.apache.pdfbox.util.TestPDFToImage.main(TestPDFToImage.java:393)
>>>>>>>
>>>>>>>
>>>>>>> - why change only one of the members of the COSObjectKey class to int?
>>>>>>> According to the spec, both are integers. Maybe there's a good reason, but I'd
>>>>>>> like to know.
>>>>>> AFAIK there is no good reason not to change both to int.
>>>>>
>>>>> As the offset is a 10-digit number, is that really covered by an int?
>>>> It's about the object number, not the offset. We are using a long for the offset.
>>>> The spec is quite clear about those numbers. They have to be integers, and the max
>>>> value for an integer within a PDF is 2^31-1 because the assumed default platform
>>>> for a conforming reader should be 32-bit.
>>>>
>>>> BTW, I've changed the object/generation number to int.
>>>
>>> Yes, but that's a "should" in the spec and not a "shall", so it's recommended but
>>> might not be followed.
>> Hmm, those values shall be integers and integers should be 32 bit. So, do we really
>> have to be afraid that someone might exceed that limit?
>
> I've yet to come across such a file, but Annex C talks about minimum architectural
> limits. So as we are changing from long to int, we might be taking on a risk that we
> haven't had before.
> BTW, one of my customers is producing PDFs in the (low) GB size range :-)
OK, I'm going to revert the change for the object number to be on the safe side ...

Andreas
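
To make the int/long trade-off above concrete: 2^31-1 is exactly Java's
Integer.MAX_VALUE, so an int object number covers everything the quoted limit
allows, while a long also tolerates files that ignore that limit. The following
is only a minimal sketch under stated assumptions. ObjectKey is a hypothetical
class for illustration (it is not PDFBox's COSObjectKey), and it avoids JDK7
features because the project currently targets JDK6.

```java
/** Hypothetical key class for illustration only; not PDFBox's COSObjectKey. */
final class ObjectKey implements Comparable<ObjectKey>
{
    // Assumption: the object number stays a long "to be on the safe side".
    private final long number;
    private final int generation;

    ObjectKey(long number, int generation)
    {
        this.number = number;
        this.generation = generation;
    }

    long getNumber()
    {
        return number;
    }

    int getGeneration()
    {
        return generation;
    }

    @Override
    public boolean equals(Object other)
    {
        if (!(other instanceof ObjectKey))
        {
            return false;
        }
        ObjectKey key = (ObjectKey) other;
        return number == key.number && generation == key.generation;
    }

    @Override
    public int hashCode()
    {
        // Fold the 64-bit number into 32 bits, then mix in the generation.
        return (int) (number ^ (number >>> 32)) * 31 + generation;
    }

    // Orders by object number first, then by generation.
    // Written without Long.compare, which only exists from JDK7 on.
    public int compareTo(ObjectKey other)
    {
        if (number != other.number)
        {
            return number < other.number ? -1 : 1;
        }
        if (generation != other.generation)
        {
            return generation < other.generation ? -1 : 1;
        }
        return 0;
    }
}
```

With this sketch, new ObjectKey(2147483648L, 0) keeps its number intact, whereas
narrowing that number to int would silently wrap to a negative value.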
>
>>
>>>
>>>
>>>>
>>>>>
>>>>> BR
>>>>> Maruan
>>>>>
>>>>>>
>>>>>>> - even if you get rid of the regressions, a remaining problem is that
>>>>>>>     - Andreas L. is currently working on some parser stuff in PDFBOX-2527
>>>>>> That's not a problem. For now I'm focused on the parsing process itself and am
>>>>>> working on one last piece, the rebuild mechanism.
>>>>>>
>>>>>>>     - your change is too big to evaluate (I'm speaking only for myself there).
>>>>>>> It would be better to first submit only small refactorings in PDFBOX-2576, and
>>>>>>
>>>>>> I agree. We should try to break up the patch into smaller pieces if possible.
>>>>>> Let's start with the long -> int change
>>>>>>
>>>>>>> then the optimization you mention (or the other way around). The parser is
>>>>>>> indeed a tricky part of the code (And SonarQube and Software Diagnostics have
>>>>>>> also flagged it as too complex). I did some refactorings a few weeks ago there
>>>>>>> (splitting methods), but stopped because I couldn't come up with names for the
>>>>>>> new methods. I just didn't understand what they were doing.
>>>>>>>
>>>>>>> Tilman
>>>>>>
>>>>>> BR
>>>>>> Andreas Lehmkühler
>>>>>>
>>>>>>>
>>>>>>> On 27.02.2015 at 16:34, Andrea Vacondio wrote:
>>>>>>>> Hi,
>>>>>>>> A few days ago I was profiling PDFBox when loading medium/large size
>>>>>>>> documents and I think I found something.
>>>>>>>> If you try loading the document
>>>>>>>> http://www.adobe.com/devnet/acrobat/pdfs/pdf_reference_1-7.pdf you'll see
>>>>>>>> it takes quite some time and that's mostly spent in
>>>>>>>> XrefTrailerResolver.getContainedObjectNumbers. The issue is that every time
>>>>>>>> an object contained in an unparsed object stream is found, the
>>>>>>>> XrefTrailerResolver performs a full scan of the xref entries found in the
>>>>>>>> document, in this case hundreds of thousands. If there are many object
>>>>>>>> streams (as in the given doc), it performs many full scans, resulting in poor
>>>>>>>> performance.
>>>>>>>> I'm trying to get familiar with the PDFBox code and I decided to try and
>>>>>>>> fix this here: https://github.com/torakiki/sambox/tree/xref
>>>>>>>> As you can see, I refactored a bit, extracting some classes, and covered the
>>>>>>>> expected behaviour with unit tests. I tested it with a few random docs, loading
>>>>>>>> and saving them back, and the output is exactly the same with or without my
>>>>>>>> changes. The pdf_reference_1-7.pdf doc loads in half the time, and so does
>>>>>>>> http://wwwimages.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf
>>>>>>>> Other kinds of docs load in a comparable amount of time, and memory usage
>>>>>>>> under profiling seems comparable too, if not a little less.
>>>>>>>> Maybe someone wants to take a look?
>>>>>>>>
>>>>>>>> I understand my changes look a bit invasive and the issue could probably be
>>>>>>>> fixed differently. On the other hand, the BaseParser+COSParser pair looks
>>>>>>>> like a big intimidating monster to a newcomer like me and it's quite
>>>>>>>> difficult to follow the expected behaviour, so I thought this might be a
>>>>>>>> chance to start breaking them down into smaller, distilled classes...
>>>>>>>> something a little more manageable and testable... anyway, grab what you
>>>>>>>> like, leave what you don't  :)
>>>>>>>>
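
The full-scan problem described in the message above can be illustrated with a
reverse index that is built once and then queried per object stream. This is
only a sketch under stated assumptions, not the code from the linked branch and
not PDFBox API. ObjectStreamIndex and its input map (compressed object number
-> number of the containing object stream) are hypothetical, and the syntax
stays JDK6-compatible.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration only; not part of PDFBox. */
class ObjectStreamIndex
{
    // object stream number -> numbers of the objects stored inside that stream
    private final Map<Long, List<Long>> containedByStream =
            new HashMap<Long, List<Long>>();

    /**
     * Builds the index with a single pass over the entries.
     *
     * @param streamByObject assumed input: for each compressed object number,
     *                       the number of the object stream that contains it
     */
    ObjectStreamIndex(Map<Long, Long> streamByObject)
    {
        for (Map.Entry<Long, Long> entry : streamByObject.entrySet())
        {
            List<Long> contained = containedByStream.get(entry.getValue());
            if (contained == null)
            {
                contained = new ArrayList<Long>();
                containedByStream.put(entry.getValue(), contained);
            }
            contained.add(entry.getKey());
        }
    }

    /** One hash lookup per object stream instead of one full scan. */
    List<Long> getContainedObjectNumbers(long objectStreamNumber)
    {
        List<Long> contained = containedByStream.get(objectStreamNumber);
        if (contained == null)
        {
            return Collections.<Long>emptyList();
        }
        return Collections.unmodifiableList(contained);
    }
}
```

With n xref entries and m object streams, scanning per stream costs on the
order of n * m steps, while this index costs one pass over the n entries plus
one lookup per stream. The order of the returned numbers follows the map's
iteration order, so callers that need a stable order should sort the result.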
>>>>>>>
>>>>>>>
>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>>>>>>> For additional commands, e-mail: users-help@pdfbox.apache.org