pdfbox-users mailing list archives

From Andreas Lehmkuehler <andr...@lehmi.de>
Subject Re: Xref parsing performance
Date Sun, 01 Mar 2015 17:25:39 GMT
On 28.02.2015 at 19:54, Andreas Lehmkuehler wrote:
> On 28.02.2015 at 18:34, Maruan Sahyoun wrote:
>>
>> On 28.02.2015 at 18:18, Andreas Lehmkuehler <andreas@lehmi.de> wrote:
>>
>>> On 28.02.2015 at 18:07, Maruan Sahyoun wrote:
>>>> Hi,
>>>>
>>>> On 28.02.2015 at 17:53, Andreas Lehmkuehler <andreas@lehmi.de> wrote:
>>>>
>>>>> On 28.02.2015 at 17:49, Maruan Sahyoun wrote:
>>>>>> Hi,
>>>>>>
>>>>>> On 28.02.2015 at 17:32, Andreas Lehmkuehler <andreas@lehmi.de> wrote:
>>>>>>
>>>>>>> Hi
>>>>>>>
>>>>>>> On 28.02.2015 at 16:47, Tilman Hausherr wrote:
>>>>>>>> Hi Andrea,
>>>>>>>>
>>>>>>>> While a speed improvement in parsing of large files would be much
>>>>>>>> appreciated (especially by the TIKA users), there are several problems
>>>>>>>> with your change:
>>>>>>> +1
>>>>>>>
>>>>>>>> - don't do changes that need JDK7 or higher even if they are cool. We
>>>>>>>> use JDK6 currently.
>>>>>>>>
>>>>>>>> - regressions:
>>>>>>>>
>>>>>>>> Error converting file PDFBOX-2250-110264-xref-zeronumber.pdf
>>>>>>>> java.io.IOException: XREF for 3:0 points to wrong object: 1:0
>>>>>>>>      at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:696)
>>>>>>>>      at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:639)
>>>>>>>>      at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:600)
>>>>>>>>      at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:346)
>>>>>>>>      at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:373)
>>>>>>>>      at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:811)
>>>>>>>>      at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:757)
>>>>>>>>      at org.apache.pdfbox.util.TestPDFToImage.doTestFile(TestPDFToImage.java:201)
>>>>>>>>      at org.apache.pdfbox.util.TestPDFToImage.testRenderImage(TestPDFToImage.java:343)
>>>>>>>>      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>>>>      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>>>>>>      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>>>>      at java.lang.reflect.Method.invoke(Method.java:606)
>>>>>>>>      at junit.framework.TestCase.runTest(TestCase.java:176)
>>>>>>>>      at junit.framework.TestCase.runBare(TestCase.java:141)
>>>>>>>>      at junit.framework.TestResult$1.protect(TestResult.java:122)
>>>>>>>>      at junit.framework.TestResult.runProtected(TestResult.java:142)
>>>>>>>>      at junit.framework.TestResult.run(TestResult.java:125)
>>>>>>>>      at junit.framework.TestCase.run(TestCase.java:129)
>>>>>>>>      at junit.framework.TestSuite.runTest(TestSuite.java:255)
>>>>>>>>      at junit.framework.TestSuite.run(TestSuite.java:250)
>>>>>>>>      at junit.textui.TestRunner.doRun(TestRunner.java:116)
>>>>>>>>      at junit.textui.TestRunner.start(TestRunner.java:183)
>>>>>>>>      at junit.textui.TestRunner.main(TestRunner.java:137)
>>>>>>>>      at org.apache.pdfbox.util.TestPDFToImage.main(TestPDFToImage.java:393)
>>>>>>>>
>>>>>>>>
>>>>>>>> Error converting file PDFBOX-2599.pdf
>>>>>>>> java.io.IOException: XREF for 2:0 points to wrong object: 1:0
>>>>>>>>      at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:696)
>>>>>>>>      at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:639)
>>>>>>>>      at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:600)
>>>>>>>>      at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:346)
>>>>>>>>      at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:373)
>>>>>>>>      at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:811)
>>>>>>>>      at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:757)
>>>>>>>>      at org.apache.pdfbox.util.TestPDFToImage.doTestFile(TestPDFToImage.java:201)
>>>>>>>>      at org.apache.pdfbox.util.TestPDFToImage.testRenderImage(TestPDFToImage.java:343)
>>>>>>>>      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>>>>      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>>>>>>      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>>>>      at java.lang.reflect.Method.invoke(Method.java:606)
>>>>>>>>      at junit.framework.TestCase.runTest(TestCase.java:176)
>>>>>>>>      at junit.framework.TestCase.runBare(TestCase.java:141)
>>>>>>>>      at junit.framework.TestResult$1.protect(TestResult.java:122)
>>>>>>>>      at junit.framework.TestResult.runProtected(TestResult.java:142)
>>>>>>>>      at junit.framework.TestResult.run(TestResult.java:125)
>>>>>>>>      at junit.framework.TestCase.run(TestCase.java:129)
>>>>>>>>      at junit.framework.TestSuite.runTest(TestSuite.java:255)
>>>>>>>>      at junit.framework.TestSuite.run(TestSuite.java:250)
>>>>>>>>      at junit.textui.TestRunner.doRun(TestRunner.java:116)
>>>>>>>>      at junit.textui.TestRunner.start(TestRunner.java:183)
>>>>>>>>      at junit.textui.TestRunner.main(TestRunner.java:137)
>>>>>>>>      at org.apache.pdfbox.util.TestPDFToImage.main(TestPDFToImage.java:393)
>>>>>>>>
>>>>>>>>
>>>>>>>> - why change only one of the members of the COSObjectKey class to int?
>>>>>>>> According to the spec, both are integers. Maybe there's a good reason,
>>>>>>>> but I'd like to know.
>>>>>>> AFAIK there is no good reason not to change both to int.
>>>>>>
>>>>>> As the offset is a 10-digit number, is that really covered by an int?
>>>>> It's about the object number, not the offset. We are using a long for the
>>>>> offset. The spec is quite clear about those numbers. They have to be
>>>>> integers, and the max value for an integer within a PDF is 2^31-1 because
>>>>> the assumed default platform for a conforming reader should be 32-bit.
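To make the range question above concrete, here is a minimal Java sketch. It uses only the two figures mentioned in the thread (a 10-digit offset and a maximum integer value of 2^31-1); the class and method names are hypothetical and not part of PDFBox. It sticks to APIs available on JDK6.

```java
// Hypothetical helper, not a PDFBox class: checks which of the values
// discussed in the thread fit into a Java int.
class XrefValueRanges {

    /** Returns true if the value can be stored in a Java int without loss. */
    static boolean fitsInInt(long value) {
        return value >= Integer.MIN_VALUE && value <= Integer.MAX_VALUE;
    }

    public static void main(String[] args) {
        long largestTenDigitOffset = 9999999999L; // ten decimal digits
        long maxPdfInteger = 2147483647L;         // 2^31 - 1

        // A 10-digit offset can exceed the int range, hence the long.
        System.out.println("offset fits in int: " + fitsInInt(largestTenDigitOffset));
        // A value at the 2^31 - 1 limit still fits into an int.
        System.out.println("max integer fits in int: " + fitsInInt(maxPdfInteger));
        // A writer that ignores the limit would need a wider type.
        System.out.println("2^31 fits in int: " + fitsInInt(maxPdfInteger + 1));
    }
}
```

The last case is the one the discussion turns on: whether a file that ignores the limit has to be supported.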
>>>>>
>>>>> BTW, I've changed the object/generation number to int.
>>>>
>>>> Yes, but that's a "should" in the spec and not a "shall", so it's
>>>> recommended but might not be followed.
>>> Hmm, those values shall be integers and integers should be 32-bit. So, do we
>>> really have to be afraid that someone will exceed that limit?
>>
>> I've yet to come across such a file, but Annex C talks about minimum
>> architectural limits. So as we are changing from long to int, we might be
>> taking a risk that we haven't had before. BTW, one of my customers is
>> producing PDFs in the (low) GB size range :-)
> OK, I'm going to revert the change for the object number to be on the safe side ...
Done. I've refactored the handling of the object and generation number within
COSObject as well.

BR
Andreas
>
> Andreas
>>
>>>
>>>>
>>>>
>>>>>
>>>>>>
>>>>>> BR
>>>>>> Maruan
>>>>>>
>>>>>>>
>>>>>>>> - even if you get rid of the regressions, a remaining problem is that
>>>>>>>>     - Andreas L. is currently working on some parser stuff in PDFBOX-2527
>>>>>>> That's not a problem. For now I'm focused on the parsing process itself
>>>>>>> and am working on one last piece, the rebuild mechanism.
>>>>>>>
>>>>>>>>     - your change is too big to evaluate (I'm speaking only for myself
>>>>>>>> there). It would be better to first submit only small refactorings in
>>>>>>>> PDFBOX-2576, and
>>>>>>>
>>>>>>> I agree. We should try to break up the patch into smaller pieces if
>>>>>>> possible. Let's start with the long -> int change.
>>>>>>>
>>>>>>>> then the optimization you mention (or the other way around). The parser
>>>>>>>> is indeed a tricky part of the code (and SonarQube and Software
>>>>>>>> Diagnostics have also flagged it as too complex). I did some refactorings
>>>>>>>> there a few weeks ago (splitting methods), but stopped because I couldn't
>>>>>>>> come up with names for the new methods. I just didn't understand what
>>>>>>>> they were doing.
>>>>>>>>
>>>>>>>> Tilman
>>>>>>>
>>>>>>> BR
>>>>>>> Andreas Lehmkühler
>>>>>>>
>>>>>>>>
>>>>>>>> On 27.02.2015 at 16:34, Andrea Vacondio wrote:
>>>>>>>>> Hi,
>>>>>>>>> a few days ago I was profiling PDFBox when loading medium/large size
>>>>>>>>> documents and I think I found something.
>>>>>>>>> If you try loading the document
>>>>>>>>> http://www.adobe.com/devnet/acrobat/pdfs/pdf_reference_1-7.pdf
>>>>>>>>> you'll see it takes quite some time, and that's mostly spent in
>>>>>>>>> XrefTrailerResolver.getContainedObjectNumbers. The issue is that every
>>>>>>>>> time an object contained in an unparsed object stream is found, the
>>>>>>>>> XrefTrailerResolver performs a full scan of the xref entries found in
>>>>>>>>> the document, in this case hundreds of thousands. If there are many
>>>>>>>>> object streams (as in the given doc), it performs many full scans,
>>>>>>>>> resulting in poor performance.
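The repeated full scans described above could be avoided by grouping the xref entries once, keyed by the object stream that contains them. The following is a minimal sketch of that idea only, under the assumption that the xref data is available as a map from object number to containing object stream number; it is not necessarily what the linked branch does, and `ObjectStreamIndex` and its methods are hypothetical names, not PDFBox API. It sticks to JDK6-compatible syntax.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for illustration; not PDFBox's XrefTrailerResolver.
class ObjectStreamIndex {

    // object stream number -> numbers of the objects stored inside it
    private final Map<Long, List<Long>> membersByStream =
            new HashMap<Long, List<Long>>();

    /**
     * Builds the index in a single pass over the xref entries.
     * Assumed input: object number -> number of its containing object stream.
     */
    ObjectStreamIndex(Map<Long, Long> containingStreamByObject) {
        for (Map.Entry<Long, Long> entry : containingStreamByObject.entrySet()) {
            List<Long> members = membersByStream.get(entry.getValue());
            if (members == null) {
                members = new ArrayList<Long>();
                membersByStream.put(entry.getValue(), members);
            }
            members.add(entry.getKey());
        }
    }

    /** Hash lookup per object stream instead of a scan over all entries. */
    List<Long> containedObjectNumbers(long streamNumber) {
        List<Long> members = membersByStream.get(streamNumber);
        return members == null ? Collections.<Long>emptyList() : members;
    }
}
```

With S object streams and N xref entries, scanning per stream costs on the order of S x N entry visits, while the index is built in one pass over N entries and each later lookup is a hash access. The trade-off is the extra memory held by the index.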
>>>>>>>>> I'm trying to get familiar with the PDFBox code and I decided to try and
>>>>>>>>> fix this here: https://github.com/torakiki/sambox/tree/xref
>>>>>>>>> As you can see, I refactored a bit, extracting some classes, and covered
>>>>>>>>> the expected behaviour with unit tests. I tested it with a few random
>>>>>>>>> docs, loading and saving them back, and the output is exactly the same
>>>>>>>>> with or without my changes. The pdf_reference_1-7.pdf doc loads in half
>>>>>>>>> the time, and so does
>>>>>>>>> http://wwwimages.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf
>>>>>>>>>
>>>>>>>>> Other kinds of docs load in a comparable amount of time, and memory
>>>>>>>>> usage under profiling seems comparable, if not a little lower.
>>>>>>>>> Maybe someone wants to take a look?
>>>>>>>>>
>>>>>>>>> I understand my changes look a bit invasive and the issue could probably
>>>>>>>>> be fixed differently. On the other hand, the BaseParser+COSParser pair
>>>>>>>>> looks like a big intimidating monster to a newcomer like me, and it's
>>>>>>>>> quite difficult to follow the expected behaviour, so I thought this
>>>>>>>>> might be a chance to start breaking them down into smaller, distilled
>>>>>>>>> classes... something a little more manageable and testable... anyway,
>>>>>>>>> grab what you like, leave what you don't  :)
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> ---------------------------------------------------------------------
>>>>>>>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>>>>>>>> For additional commands, e-mail: users-help@pdfbox.apache.org

