pdfbox-users mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Maruan Sahyoun <sahy...@fileaffairs.de>
Subject Re: Xref parsing performance
Date Sat, 28 Feb 2015 17:07:34 GMT
Hi,

Am 28.02.2015 um 17:53 schrieb Andreas Lehmkuehler <andreas@lehmi.de>:

> Am 28.02.2015 um 17:49 schrieb Maruan Sahyoun:
>> Hi,
>> 
>> Am 28.02.2015 um 17:32 schrieb Andreas Lehmkuehler <andreas@lehmi.de>:
>> 
>>> Hi
>>> 
>>> Am 28.02.2015 um 16:47 schrieb Tilman Hausherr:
>>>> Hi Andrea,
>>>> 
>>>> While a speed improvement in parsing of large files would be much appreciated
>>>> (especially by the TIKA users), there are several problems with your change:
>>> +1
>>> 
>>>> - don't do changes that need JDK7 or higher even if they are cool. We use
JDK6
>>>> currently.
>>>> 
>>>> - regressions:
>>>> 
>>>> Error converting file PDFBOX-2250-110264-xref-zeronumber.pdf
>>>> java.io.IOException: XREF for 3:0 points to wrong object: 1:0
>>>>     at
>>>> org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:696)
>>>>     at
>>>> org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:639)
>>>>     at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:600)
>>>>     at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:346)
>>>>     at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:373)
>>>>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:811)
>>>>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:757)
>>>>     at org.apache.pdfbox.util.TestPDFToImage.doTestFile(TestPDFToImage.java:201)
>>>>     at
>>>> org.apache.pdfbox.util.TestPDFToImage.testRenderImage(TestPDFToImage.java:343)
>>>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>     at
>>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>>     at
>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>> 
>>>>     at java.lang.reflect.Method.invoke(Method.java:606)
>>>>     at junit.framework.TestCase.runTest(TestCase.java:176)
>>>>     at junit.framework.TestCase.runBare(TestCase.java:141)
>>>>     at junit.framework.TestResult$1.protect(TestResult.java:122)
>>>>     at junit.framework.TestResult.runProtected(TestResult.java:142)
>>>>     at junit.framework.TestResult.run(TestResult.java:125)
>>>>     at junit.framework.TestCase.run(TestCase.java:129)
>>>>     at junit.framework.TestSuite.runTest(TestSuite.java:255)
>>>>     at junit.framework.TestSuite.run(TestSuite.java:250)
>>>>     at junit.textui.TestRunner.doRun(TestRunner.java:116)
>>>>     at junit.textui.TestRunner.start(TestRunner.java:183)
>>>>     at junit.textui.TestRunner.main(TestRunner.java:137)
>>>>     at org.apache.pdfbox.util.TestPDFToImage.main(TestPDFToImage.java:393)
>>>> 
>>>> 
>>>> Error converting file PDFBOX-2599.pdf
>>>> java.io.IOException: XREF for 2:0 points to wrong object: 1:0
>>>>     at
>>>> org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:696)
>>>>     at
>>>> org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:639)
>>>>     at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:600)
>>>>     at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:346)
>>>>     at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:373)
>>>>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:811)
>>>>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:757)
>>>>     at org.apache.pdfbox.util.TestPDFToImage.doTestFile(TestPDFToImage.java:201)
>>>>     at
>>>> org.apache.pdfbox.util.TestPDFToImage.testRenderImage(TestPDFToImage.java:343)
>>>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>     at
>>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>>     at
>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>> 
>>>>     at java.lang.reflect.Method.invoke(Method.java:606)
>>>>     at junit.framework.TestCase.runTest(TestCase.java:176)
>>>>     at junit.framework.TestCase.runBare(TestCase.java:141)
>>>>     at junit.framework.TestResult$1.protect(TestResult.java:122)
>>>>     at junit.framework.TestResult.runProtected(TestResult.java:142)
>>>>     at junit.framework.TestResult.run(TestResult.java:125)
>>>>     at junit.framework.TestCase.run(TestCase.java:129)
>>>>     at junit.framework.TestSuite.runTest(TestSuite.java:255)
>>>>     at junit.framework.TestSuite.run(TestSuite.java:250)
>>>>     at junit.textui.TestRunner.doRun(TestRunner.java:116)
>>>>     at junit.textui.TestRunner.start(TestRunner.java:183)
>>>>     at junit.textui.TestRunner.main(TestRunner.java:137)
>>>>     at org.apache.pdfbox.util.TestPDFToImage.main(TestPDFToImage.java:393)
>>>> 
>>>> 
>>>> - why change only one of the members of that cosobjectkey class to int?
>>>> According to the spec, both are integers. Maybe there's a good reason, but
I'd
>>>> like to know.
>>> ASFAIK there is no good reason not to change both to int.
>> 
>> as the offset is a 10 digit number is that really covered being an int?
> It's about the object number not the offset. We are using a long for the offset. The
spec is quite clear about those numbers. They have to be integers and the max value for an
integer within a pdf is 2^31-1 due to the fact that the assumed default platform for a conforming
reader should be 32-bit.
> 
> BTW, I've changed the object/generation number to int.

Yes, but that's a should in the spec and not a shall so it's recommended but might not be
followed.


> 
>> 
>> BR
>> Maruan
>> 
>>> 
>>>> - even if you get rid of the regressions, a remaining problem is that
>>>>    - Andreas L. is currently working on some parser stuff in PDFBOX-2527
>>> That's not a problem. For now I'm focused on the parsing process itself and am
working on one last piece, the rebuild mechanism.
>>> 
>>>>    - your change is too big to evaluate (I'm speaking only for myself there).
>>>> It would be better to first submit only small refactorings in PDFBOX-2576,
and
>>> 
>>> I agree. We should try to break up the patch into smaller pieces if possible.
Let's start with the long -> int change
>>> 
>>>> then the optimization you mention (or the other way around). The parser is
>>>> indeed a tricky part of the code (And SonarQube and Software Diagnostics
have
>>>> also flagged it as too complex). I did some refactorings a few weeks ago
there
>>>> (splitting methods), but stopped because I couldn't come up with names for
the
>>>> new methods. I just didn't understand what they were doing.
>>>> 
>>>> Tilman
>>> 
>>> BR
>>> Andreas Lehmkühler
>>> 
>>>> 
>>>> Am 27.02.2015 um 16:34 schrieb Andrea Vacondio:
>>>>> Hi,
>>>>> few days ago I was profiling PDFBox when loading medium/large size
>>>>> documents and I think I found something.
>>>>> If you try loading the document
>>>>> http://www.adobe.com/devnet/acrobat/pdfs/pdf_reference_1-7.pdf  you'll
see
>>>>> it takes quite some time and that's mostly spent in the
>>>>> XrefTrailerResolver.getContainedObjectNumbers. The issue is that every
time
>>>>> an object contained in an unparsed object stream is found, the
>>>>> XrefTrailerResolver performs a full scan of the xref entries found in
the
>>>>> document, in this case hundreds of thousands. If the object streams are
>>>>> many (like in the given doc), it performs many full scans resulting in
poor
>>>>> performance.
>>>>> I'm trying to get familiar with the PDFBox code and I decided to try
and
>>>>> fix this herehttps://github.com/torakiki/sambox/tree/xref
>>>>> As you can see I refactored a bit extracting some classes and covered
the
>>>>> expect behaviour with unit tests. I tested it with few random docs, loading
>>>>> and saving them back and the output is exactly the same with or without
my
>>>>> changes. The pdf_reference_1-7.pdf doc loads in half of the time, same
as
>>>>> this
>>>>> http://wwwimages.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf
>>>>> it takes half the time. Other kind of docs loads in a comparable amount
of
>>>>> time and even profiling memory usage it seems comparable if not a little
>>>>> less.
>>>>> Maybe someone wants to take a look?
>>>>> 
>>>>> I understand my changes look a bit invasive and the issue could probably
be
>>>>> fixed differently, on the other hand the couple BaseParser+COSParser
looks
>>>>> like a big intimidating monster to a newcomer like me and it's quite
>>>>> difficult to follow the expected behaviour so I thought this might be
a
>>>>> chance to start breaking them down in smaller, distilled classes...
>>>>> something a little more manageable and testable... anyway, grab what
you
>>>>> like, leave what you don't  :)
>>>>> 
>>>> 
>>>> 
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>>>> For additional commands, e-mail: users-help@pdfbox.apache.org
>>>> 
>>> 
>>> 
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>>> For additional commands, e-mail: users-help@pdfbox.apache.org
>>> 
>> 
>> 
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>> For additional commands, e-mail: users-help@pdfbox.apache.org
>> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
> For additional commands, e-mail: users-help@pdfbox.apache.org
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org


Mime
View raw message