lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Dror Matalon <d...@zapatec.com>
Subject Re: Peculiar (?) Indexing Performance
Date Tue, 13 Jan 2004 21:55:12 GMT

Hi Terry,

It's usually useful to give some information about your environment. How
many documents you are indexing, what is the average size of a document,
etc. But I'll answer anyway :-).

For details about indexing see Otis' article at
http://www.onjava.com/pub/a/onjava/2003/03/05/lucene.html

On Tue, Jan 13, 2004 at 04:35:43PM -0500, Terry Steichen wrote:
> I just aborted a re-indexing operation (because it was taking too much time - will run
it overnight instead).  But I was surprised by what I found in the index directory, which
contained a total of 1,402 index files!  It started out with 36 files with the name of "_I9a.*",
followed by groups of 72 files with names like "_17si.*" and so forth.
> 
> Is this normal?

Yes. As it indexes more documents, Lucene creates more files.

> 
> Also, I noticed that during the indexing it would chug along, indexing at a pretty decent
rate, and then, every so often (I would estimate every several hundred added files) it would
stop for perhaps 10 - 30 seconds (occasionally longer), doing a bunch of disk activity.  Then
it would resume again - almost like it was optimizing.  (I'm doing this on a notebook, so
the disk IO is probably fairly slow.)
> 

Probably at this point, Lucene is merging some of the files. At this
point, if you look you'll probably see that the number of files has been
reduced.

> Is this normal?

Yes. Once the indexing is done you can optimize the index, which will
result in one index instead of several hundred.

Read the article, it's good stuff.

Regards, 

Dror

> 
> Regards,
> 
> Terry
> 
> PS: The code I'm using to do the indexing is below:
> 
> import npg1.search.WebExecAnalyzer;
> import org.apache.lucene.index.IndexWriter;
> import npg1.search.WESimilarity2;
> import npg1.search.WPDocument2a;
> 
> import java.io.File;
> import java.util.Date;
> 
> class IndexWPFiles2a {
>   public static void main(String[] args) {
>  
>  //args[0] = location of target directory to be indexed
>  //args[1] = location of index directory (in which to create index files)
>  
>  System.out.println("starting"); 
>  try {
>   Date start = new Date();
>   
>   String target = "c:/master_db/master_xml";
>   if(args[0] != null) {
>    target = args[0];
>   }  
>   String index = "c:/master_db/master_index";
>   if(args[1] != null) {
>    index = args[1];
>   }
>         
>   IndexWriter writer = null;
>   if(args.length < 3) {
>    writer = new IndexWriter(index, new WebExecAnalyzer(), true);
>    writer.mergeFactor = 50;
>    writer.setSimilarity(new WESimilarity2());
>    indexDocs(writer, new File(target));
>   } else {
>    writer = new IndexWriter(index, new WebExecAnalyzer(), false);
>    writer.setSimilarity(new WESimilarity2());
>   }
>   writer.optimize();
>   writer.close();
>   
>   Date end = new Date();
>   
>   System.out.print(end.getTime() - start.getTime());
>   System.out.println(" total milliseconds");
>  
>  } catch (Exception e) {
>   System.out.println(" caught a " + e.getClass() +
>     "\n with message: " + e.getMessage());
>  }
>   }
> 
>   public static void indexDocs(IndexWriter writer, File file)
>        throws Exception {
>  //System.out.println("starting indexing with internal path"); 
>  if (file.isDirectory()) {
>   String[] files = file.list();
>   for (int i = 0; i < files.length; i++){
>    //System.out.println("recursive call");
>    indexDocs(writer, new File(file, files[i]));
>   }
>  } else {
>   try {
>    System.out.println("adding " + file);
>    writer.addDocument(WPDocument2a.Document(file));
>   } catch (Exception e) {
>    System.out.println("error adding "+file+" -  Exception: "+e.getMessage());
>   }
>  }
>   }
> }

-- 
Dror Matalon
Zapatec Inc 
1700 MLK Way
Berkeley, CA 94709
http://www.fastbuzz.com
http://www.zapatec.com

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message