lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Tatu Saloranta <t...@hypermall.net>
Subject Re: pdfbox performance.
Date Wed, 28 Jul 2004 23:39:03 GMT
On Wednesday 28 July 2004 15:44, Paul Smith wrote:
> The first thing that I would do is wrap the FileInputStream with a
> BufferedInputStream.
....
> You get a significant boost reading in from a buffer, particularly as the
> size of the file grows.

Benchmarking is good; whether there's any significant performance difference 
depends on how the app reads data from the stream. Most high-performance apps 
read straight to a local buffer, in which case BufferedInputStream offers 
nothing more than the buffer overhead... :-)
You do get nice performance improvement only if most reads are done using 
single byte read methods (read()), though, so there's always a chance I 
guess.

-+ Tatu +-

> Try that first, and then rebenchmark.
> Cheers
> Paul Smith
>
> > -----Original Message-----
> > From: Miroslaw Milewski [mailto:miro@redshrimp.com]
> > Sent: Thursday, July 29, 2004 7:24 AM
> > To: lucene-user@jakarta.apache.org
> > Subject: pdfbox performance.
> >
> >
> >   Hi,
> >
> >   I have a serious performance problem while extracting text from pdf.
> >
> >   Here is the code (w/o try/catch blocks):
> >
> >   File file = new File("test.pdf");
> >   FileInputStream reader = new FileInputStream(file);
> >
> >   PDFParser parser = new PDFParser(reader);
> >   parser.parse();
> >   PDDocument pdDoc = parser.getPDDocument();
> >
> >   PDFTextStripper stripper = new PDFTextStripper();
> >   String pdftext = stripper.getText(pdDoc);
> >
> >   pdDoc.close();
> >
> >   Now, the whole process takes:
> >   - 37,4 sec w. a 74 kB file (parsing took 5,3 sec.)
> >   - 156,7 sec w. a 150 kB file (parsing: 11,0 sec.)
> >   - 157,8 sec w. a 270 kB file (parsing: 34,3 sec.)
> >   - 313,3 sec w. a 151 kB file (parsing: 5,9 sec.)
> >
> >   Now, I can't really get the point here. Is this performance standard
> > for pdfbox? Or is it my system (win2k, PIII 700, 512 RAM), or the code,
> > or maybe the pdf docs (text only, the last one with some UML diags.)
> >
> >   I am writing a knowledge base system at the moment, and planned to do
> > real-time text extraction and indexing (using Lucene.) But this is not
> > realistic, considering the extraction thime.
> >   Then maybe it is a better idea to run the extraction and indexing once
> > every 24 h, processing all the documents added during that period.
> >
> >   TIA for any comments/suggestions.
> >
> > --
> > 	Miroslaw Milewski
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> > For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message