lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ben Litchfield <...@csh.rit.edu>
Subject Re: pdfbox performance.
Date Wed, 28 Jul 2004 23:40:28 GMT


Different PDFs will exhibit different extraction speeds because of the way
that PDF documents are structured.

I assume you are using the latest version 0.6.6, could you give 0.6.5 a
try and see if you notice faster speeds.

Ben

On Thu, 29 Jul 2004, Miroslaw Milewski wrote:

> Paul Smith wrote:
>
>   > The first thing that I would do is wrap the FileInputStream with a
>   > BufferedInputStream.
>   > Change:
>   > > FileInputStream reader = new FileInputStream(file);
>   > To:
>   > InputStream reader = new BufferedInputStream(new
>   > FileInputStream(file));
>   > You get a significant boost reading in from a buffer, particularly as
>   > the size of the file grows. Try that first, and then rebenchmark.
>
>   I tested both, here is the code:
>
> File file = new File("test.pdf");
> InputStream reader = null;
>
> for(int i=1; i<=6; i++) {
>
>    long step01 = Calendar.getInstance().getTimeInMillis();
>    String stream = null;
>
>    if(i%2 == 0) {
>      reader = new BufferedInputStream(new FileInputStream(file));
>        stream = "buffered";
>    }
>    else {
>      reader = new FileInputStream(file);
>      stream = "no buffer";
>    }
>
>    PDFParser parser = null;
>    PDDocument pdDoc = null;
>
>    parser = new PDFParser(reader);
>    parser.parse();
>    pdDoc = parser.getPDDocument();
>
>    long step02 = Calendar.getInstance().getTimeInMillis();
>
>    PDFTextStripper stripper = new PDFTextStripper();
>    tring pdftext = stripper.getText(pdDoc);
>
>    long step03 = Calendar.getInstance().getTimeInMillis();
>
>    pdDoc.close();
>
>    long end = Calendar.getInstance().getTimeInMillis();
>
>    System.out.println("iteration: " + i + " - " + stream);
>    System.out.println("start: " + start);
>    System.out.println("step01: " + (step01-start));
>    System.out.println("step02: " + (step02-start));
>    System.out.println("step03: " + (step03-start));
>    System.out.println("end: " + (end-start));
> }
>
>   And below are the benchmarks for buffered and unbuffered readers. The
> difference is not stunning. It seems to get better with time, but this
> is prably due to some VM optimisation. And I'll extract the text only
> once :-).
>
> file: 9kB, text only;
>
> iteration: 1 - no buffer
> step01: 0; step02: 1492; step03: 13850; end: 13880
>
> iteration: 2 - buffered
> step01: 0; step02: 912; step03: 10245; end: 10265
>
> iteration: 3 - no buffer
> step01: 0; step02: 951 ;step03: 9924; end: 9944
>
> iteration: 4 - buffered
> step01: 0; step02: 842; step03: 10075; end: 10105
>
> iteration: 5 - no buffer
> step01: 0; step02: 831; step03: 9934; end: 9954
>
> iteration: 6 - buffered
> step01: 0; step02: 932; step03: 9944; end: 9965
>
>
> file: 74 kB; text only
>
> iteration: 1 - no buffer
> step01: 0; step02: 4918; step03: 33959; end: 33989
>
> iteration: 2 - buffered
> step01: 0; step02: 4367; step03: 32367; end: 32407
>
> iteration: 3 - no buffer
> step01: 0; step02: 4306; step03: 30995; end: 31025
>
> iteration: 4 - buffered
> step01: 0; step02: 4296; step03: 30734; end: 30764
>
> iteration: 5 - no buffer
> step01: 0; step02: 4266; step03: 30754; end: 30784
>
> iteration: 6 - buffered
> step01: 0; step02: 4256; step03: 30634; end: 30664
>
>
> file: 270 kB, text only
>
> iteration: 1 - no buffer
> step01: 0; step02: 30634; step03: 142225; end: 142265
>
> iteration: 2 - buffered
> step01: 0; step02: 29893; step03: 135354; end: 135394
>
> iteration: 3 - no buffer
> step01: 0; step02: 29553; step03: 134654; end: 134694
>
> iteration: 4 - buffered
> step01: 0; step02: 29613; step03: 134944; end: 134984
>
> iteration: 5 - no buffer
> step01: 0; step02: 29543; step03: 139070; end: 139110
>
> iteration: 6 - buffered
> step01: 0; step02: 32427; step03: 150457; end: 150487
>
>   Anyway, I suppose I made a wrong assumption while designing my app. I
> don't think I can get a performance boost of 90% or so. Thus the
> documents (at least the .pdfs) won't be extracted and indexed at the
> time of adding them to the knowledge base.
>   Since I also have a db involved, I can keep the basic data there, and
> extract and index in the meantime - most likely using a different thread.
>
>   thx,
> --
> 	Miroslaw Milewski
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message