lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Otis Gospodnetic <otis_gospodne...@yahoo.com>
Subject Re: Indexing a large number of DB records
Date Wed, 15 Dec 2004 16:36:13 GMT
Hello Homam,

The batches I was referring to were batches of DB rows.
Instead of SELECT * FROM table... do SELECT * FROM table ... OFFSET=X
LIMIT=Y.

Don't close IndexWriter - use the single instance.

There is no MakeStable()-like method in Lucene, but you can control the
number of in-memory Documents, the frequence of segment merges, and
maximal size of an index segments with 3 IndexWriter parameters,
described fairly verbosely in the javadocs.

Since you are using the .Net version, you should really consult
dotLucene guy(s).  Running under the profiler should also tell you
where the time and memory go.

Otis

--- "Homam S.A." <homam_sa@yahoo.com> wrote:

> Thanks Otis!
> 
> What do you mean by building it in batches? Does it
> mean I should close the IndexWriter every 1000 rows
> and reopen it? Does that releases references to the
> document objects so that they can be
> garbage-collected?
> 
> I'm calling optimize() only at the end.
> 
> I agree that 1500 documents is very small. I'm
> building the index on a PC with 512 megs, and the
> indexing process is quickly gobbling up around 400
> megs when I index around 1800 documents and the whole
> machine is grinding to a virtual halt. I'm using the
> latest DotLucene .NET port, so may be there's a memory
> leak in it.
> 
> I have experience with AltaVista search (acquired by
> FastSearch), and I used to call MakeStable() every
> 20,000 documents to flush memory structures to disk.
> There doesn't seem to be an equivalent in Lucene.
> 
> -- Homam
> 
> 
> 
> 
> 
> 
> --- Otis Gospodnetic <otis_gospodnetic@yahoo.com>
> wrote:
> 
> > Hello,
> > 
> > There are a few things you can do:
> > 
> > 1) Don't just pull all rows from the DB at once.  Do
> > that in batches.
> > 
> > 2) If you can get a Reader from your SqlDataReader,
> > consider this:
> >
>
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/document/Field.html#Text(java.lang.String,%20java.io.Reader)
> > 
> > 3) Give the JVM more memory to play with by using
> > -Xms and -Xmx JVM
> > parameters
> > 
> > 4) See IndexWriter's minMergeDocs parameter.
> > 
> > 5) Are you calling optimize() at some point by any
> > chance?  Leave that
> > call for the end.
> > 
> > 1500 documents with 30 columns of short
> > String/number values is not a
> > lot.  You may be doing something else not Lucene
> > related that's slowing
> > things down.
> > 
> > Otis
> > 
> > 
> > --- "Homam S.A." <homam_sa@yahoo.com> wrote:
> > 
> > > I'm trying to index a large number of records from
> > the
> > > DB (a few millions). Each record will be stored as
> > a
> > > document with about 30 fields, most of them are
> > > UnStored and represent small strings or numbers.
> > No
> > > huge DB Text fields.
> > > 
> > > But I'm running out of memory very fast, and the
> > > indexing is slowing down to a crawl once I hit
> > around
> > > 1500 records. The problem is each document is
> > holding
> > > references to the string objects returned from
> > > ToString() on the DB field, and the IndexWriter is
> > > holding references to all these document objects
> > in
> > > memory, so the garbage collector is getting a
> > chance
> > > to clean these up.
> > > 
> > > How do you guys go about indexing a large DB
> > table?
> > > Here's a snippet of my code (this method is called
> > for
> > > each record in the DB):
> > > 
> > > private void IndexRow(SqlDataReader rdr,
> > IndexWriter
> > > iw) {
> > > 	Document doc = new Document();
> > > 	for (int i = 0; i < BrowseFieldNames.Length; i++)
> > {
> > > 		doc.Add(Field.UnStored(BrowseFieldNames[i],
> > > rdr.GetValue(i).ToString()));
> > > 	}
> > > 	iw.AddDocument(doc);
> > > }
> > > 
> > > 
> > > 
> > > 
> > > 		
> > > __________________________________ 
> > > Do you Yahoo!? 
> > > Yahoo! Mail - Find what you need with new enhanced
> > search.
> > > http://info.mail.yahoo.com/mail_250
> > > 
> > >
> >
> ---------------------------------------------------------------------
> > > To unsubscribe, e-mail:
> > lucene-user-unsubscribe@jakarta.apache.org
> > > For additional commands, e-mail:
> > lucene-user-help@jakarta.apache.org
> > > 
> > > 
> > 
> > 
> >
> ---------------------------------------------------------------------
> > To unsubscribe, e-mail:
> > lucene-user-unsubscribe@jakarta.apache.org
> > For additional commands, e-mail:
> > lucene-user-help@jakarta.apache.org
> > 
> > 
> 
> 
> 
> 		
> __________________________________ 
> Do you Yahoo!? 
> Take Yahoo! Mail with you! Get it on your mobile phone. 
> http://mobile.yahoo.com/maildemo 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> 
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message