pdfbox-users mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Maruan Sahyoun <sahy...@fileaffairs.de>
Subject Re: Xref parsing performance
Date Sat, 28 Feb 2015 16:49:49 GMT
Hi,

Am 28.02.2015 um 17:32 schrieb Andreas Lehmkuehler <andreas@lehmi.de>:

> Hi
> 
> Am 28.02.2015 um 16:47 schrieb Tilman Hausherr:
>> Hi Andrea,
>> 
>> While a speed improvement in parsing of large files would be much appreciated
>> (especially by the TIKA users), there are several problems with your change:
> +1
> 
>> - don't do changes that need JDK7 or higher even if they are cool. We use JDK6
>> currently.
>> 
>> - regressions:
>> 
>> Error converting file PDFBOX-2250-110264-xref-zeronumber.pdf
>> java.io.IOException: XREF for 3:0 points to wrong object: 1:0
>>     at
>> org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:696)
>>     at
>> org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:639)
>>     at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:600)
>>     at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:346)
>>     at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:373)
>>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:811)
>>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:757)
>>     at org.apache.pdfbox.util.TestPDFToImage.doTestFile(TestPDFToImage.java:201)
>>     at
>> org.apache.pdfbox.util.TestPDFToImage.testRenderImage(TestPDFToImage.java:343)
>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>     at
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>     at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> 
>>     at java.lang.reflect.Method.invoke(Method.java:606)
>>     at junit.framework.TestCase.runTest(TestCase.java:176)
>>     at junit.framework.TestCase.runBare(TestCase.java:141)
>>     at junit.framework.TestResult$1.protect(TestResult.java:122)
>>     at junit.framework.TestResult.runProtected(TestResult.java:142)
>>     at junit.framework.TestResult.run(TestResult.java:125)
>>     at junit.framework.TestCase.run(TestCase.java:129)
>>     at junit.framework.TestSuite.runTest(TestSuite.java:255)
>>     at junit.framework.TestSuite.run(TestSuite.java:250)
>>     at junit.textui.TestRunner.doRun(TestRunner.java:116)
>>     at junit.textui.TestRunner.start(TestRunner.java:183)
>>     at junit.textui.TestRunner.main(TestRunner.java:137)
>>     at org.apache.pdfbox.util.TestPDFToImage.main(TestPDFToImage.java:393)
>> 
>> 
>> Error converting file PDFBOX-2599.pdf
>> java.io.IOException: XREF for 2:0 points to wrong object: 1:0
>>     at
>> org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:696)
>>     at
>> org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:639)
>>     at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:600)
>>     at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:346)
>>     at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:373)
>>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:811)
>>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:757)
>>     at org.apache.pdfbox.util.TestPDFToImage.doTestFile(TestPDFToImage.java:201)
>>     at
>> org.apache.pdfbox.util.TestPDFToImage.testRenderImage(TestPDFToImage.java:343)
>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>     at
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>     at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> 
>>     at java.lang.reflect.Method.invoke(Method.java:606)
>>     at junit.framework.TestCase.runTest(TestCase.java:176)
>>     at junit.framework.TestCase.runBare(TestCase.java:141)
>>     at junit.framework.TestResult$1.protect(TestResult.java:122)
>>     at junit.framework.TestResult.runProtected(TestResult.java:142)
>>     at junit.framework.TestResult.run(TestResult.java:125)
>>     at junit.framework.TestCase.run(TestCase.java:129)
>>     at junit.framework.TestSuite.runTest(TestSuite.java:255)
>>     at junit.framework.TestSuite.run(TestSuite.java:250)
>>     at junit.textui.TestRunner.doRun(TestRunner.java:116)
>>     at junit.textui.TestRunner.start(TestRunner.java:183)
>>     at junit.textui.TestRunner.main(TestRunner.java:137)
>>     at org.apache.pdfbox.util.TestPDFToImage.main(TestPDFToImage.java:393)
>> 
>> 
>> - why change only one of the members of that cosobjectkey class to int?
>> According to the spec, both are integers. Maybe there's a good reason, but I'd
>> like to know.
> ASFAIK there is no good reason not to change both to int.

as the offset is a 10 digit number is that really covered being an int?

BR
Maruan

> 
>> - even if you get rid of the regressions, a remaining problem is that
>>    - Andreas L. is currently working on some parser stuff in PDFBOX-2527
> That's not a problem. For now I'm focused on the parsing process itself and am working
on one last piece, the rebuild mechanism.
> 
>>    - your change is too big to evaluate (I'm speaking only for myself there).
>> It would be better to first submit only small refactorings in PDFBOX-2576, and
> 
> I agree. We should try to break up the patch into smaller pieces if possible. Let's start
with the long -> int change
> 
>> then the optimization you mention (or the other way around). The parser is
>> indeed a tricky part of the code (And SonarQube and Software Diagnostics have
>> also flagged it as too complex). I did some refactorings a few weeks ago there
>> (splitting methods), but stopped because I couldn't come up with names for the
>> new methods. I just didn't understand what they were doing.
>> 
>> Tilman
> 
> BR
> Andreas Lehmk├╝hler
> 
>> 
>> Am 27.02.2015 um 16:34 schrieb Andrea Vacondio:
>>> Hi,
>>> few days ago I was profiling PDFBox when loading medium/large size
>>> documents and I think I found something.
>>> If you try loading the document
>>> http://www.adobe.com/devnet/acrobat/pdfs/pdf_reference_1-7.pdf  you'll see
>>> it takes quite some time and that's mostly spent in the
>>> XrefTrailerResolver.getContainedObjectNumbers. The issue is that every time
>>> an object contained in an unparsed object stream is found, the
>>> XrefTrailerResolver performs a full scan of the xref entries found in the
>>> document, in this case hundreds of thousands. If the object streams are
>>> many (like in the given doc), it performs many full scans resulting in poor
>>> performance.
>>> I'm trying to get familiar with the PDFBox code and I decided to try and
>>> fix this herehttps://github.com/torakiki/sambox/tree/xref
>>> As you can see I refactored a bit extracting some classes and covered the
>>> expect behaviour with unit tests. I tested it with few random docs, loading
>>> and saving them back and the output is exactly the same with or without my
>>> changes. The pdf_reference_1-7.pdf doc loads in half of the time, same as
>>> this
>>> http://wwwimages.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf
>>> it takes half the time. Other kind of docs loads in a comparable amount of
>>> time and even profiling memory usage it seems comparable if not a little
>>> less.
>>> Maybe someone wants to take a look?
>>> 
>>> I understand my changes look a bit invasive and the issue could probably be
>>> fixed differently, on the other hand the couple BaseParser+COSParser looks
>>> like a big intimidating monster to a newcomer like me and it's quite
>>> difficult to follow the expected behaviour so I thought this might be a
>>> chance to start breaking them down in smaller, distilled classes...
>>> something a little more manageable and testable... anyway, grab what you
>>> like, leave what you don't  :)
>>> 
>> 
>> 
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>> For additional commands, e-mail: users-help@pdfbox.apache.org
>> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
> For additional commands, e-mail: users-help@pdfbox.apache.org
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org


Mime
View raw message