lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Miroslaw Milewski <m...@redshrimp.com>
Subject Re: pdfbox performance.
Date Wed, 28 Jul 2004 23:08:57 GMT
Paul Smith wrote:

  > The first thing that I would do is wrap the FileInputStream with a
  > BufferedInputStream.
  > Change:
  > > FileInputStream reader = new FileInputStream(file);
  > To:
  > InputStream reader = new BufferedInputStream(new
  > FileInputStream(file));
  > You get a significant boost reading in from a buffer, particularly as
  > the size of the file grows. Try that first, and then rebenchmark.

  I tested both, here is the code:

File file = new File("test.pdf");
InputStream reader = null;

for(int i=1; i<=6; i++) {

   long step01 = Calendar.getInstance().getTimeInMillis();
   String stream = null;
	
   if(i%2 == 0) {
     reader = new BufferedInputStream(new FileInputStream(file));
       stream = "buffered";
   }
   else {
     reader = new FileInputStream(file);
     stream = "no buffer";
   }

   PDFParser parser = null;
   PDDocument pdDoc = null;

   parser = new PDFParser(reader);
   parser.parse();
   pdDoc = parser.getPDDocument();

   long step02 = Calendar.getInstance().getTimeInMillis();

   PDFTextStripper stripper = new PDFTextStripper();
   tring pdftext = stripper.getText(pdDoc);

   long step03 = Calendar.getInstance().getTimeInMillis();

   pdDoc.close();

   long end = Calendar.getInstance().getTimeInMillis();

   System.out.println("iteration: " + i + " - " + stream);
   System.out.println("start: " + start);
   System.out.println("step01: " + (step01-start));
   System.out.println("step02: " + (step02-start));
   System.out.println("step03: " + (step03-start));
   System.out.println("end: " + (end-start));
}

  And below are the benchmarks for buffered and unbuffered readers. The 
difference is not stunning. It seems to get better with time, but this 
is prably due to some VM optimisation. And I'll extract the text only 
once :-).

file: 9kB, text only;

iteration: 1 - no buffer
step01: 0; step02: 1492; step03: 13850; end: 13880

iteration: 2 - buffered
step01: 0; step02: 912; step03: 10245; end: 10265

iteration: 3 - no buffer
step01: 0; step02: 951 ;step03: 9924; end: 9944

iteration: 4 - buffered
step01: 0; step02: 842; step03: 10075; end: 10105

iteration: 5 - no buffer
step01: 0; step02: 831; step03: 9934; end: 9954

iteration: 6 - buffered
step01: 0; step02: 932; step03: 9944; end: 9965


file: 74 kB; text only

iteration: 1 - no buffer
step01: 0; step02: 4918; step03: 33959; end: 33989

iteration: 2 - buffered
step01: 0; step02: 4367; step03: 32367; end: 32407

iteration: 3 - no buffer
step01: 0; step02: 4306; step03: 30995; end: 31025

iteration: 4 - buffered
step01: 0; step02: 4296; step03: 30734; end: 30764

iteration: 5 - no buffer
step01: 0; step02: 4266; step03: 30754; end: 30784

iteration: 6 - buffered
step01: 0; step02: 4256; step03: 30634; end: 30664


file: 270 kB, text only

iteration: 1 - no buffer
step01: 0; step02: 30634; step03: 142225; end: 142265

iteration: 2 - buffered
step01: 0; step02: 29893; step03: 135354; end: 135394

iteration: 3 - no buffer
step01: 0; step02: 29553; step03: 134654; end: 134694

iteration: 4 - buffered
step01: 0; step02: 29613; step03: 134944; end: 134984

iteration: 5 - no buffer
step01: 0; step02: 29543; step03: 139070; end: 139110

iteration: 6 - buffered
step01: 0; step02: 32427; step03: 150457; end: 150487

  Anyway, I suppose I made a wrong assumption while designing my app. I 
don't think I can get a performance boost of 90% or so. Thus the 
documents (at least the .pdfs) won't be extracted and indexed at the 
time of adding them to the knowledge base.
  Since I also have a db involved, I can keep the basic data there, and 
extract and index in the meantime - most likely using a different thread.

  thx,
-- 
	Miroslaw Milewski


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message