Thanks Otis!
What do you mean by building it in batches? Does it
mean I should close the IndexWriter every 1000 rows
and reopen it? Does that release the references to the
document objects so that they can be
garbage-collected?
I'm calling optimize() only at the end.
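To make sure I understand the batching idea, here is a rough sketch of what I think you mean. It is not my real code: IRowWriter is a made-up stand-in for the IndexWriter calls I use, CountingWriter only counts calls so the sketch runs without Lucene, and openWriter and the batch size of 1000 are placeholders. Whether closing the writer really releases the documents is exactly what I'm unsure about.

```csharp
using System;
using System.Collections.Generic;

// Made-up stand-in for the IndexWriter calls I use; not a DotLucene type.
public interface IRowWriter
{
    void Add(string[] row);  // plays the role of AddDocument
    void Optimize();
    void Close();
}

public static class BatchIndexer
{
    // Adds every row, closing and reopening the writer after each full batch.
    // Optimize() runs once, on the last writer. Returns how many writers were opened.
    public static int IndexInBatches(IEnumerable<string[]> rows, int batchSize,
                                     Func<IRowWriter> openWriter)
    {
        if (batchSize <= 0) throw new ArgumentOutOfRangeException("batchSize");
        IRowWriter writer = null;
        int opened = 0, inBatch = 0;
        try
        {
            foreach (string[] row in rows)
            {
                if (writer != null && inBatch == batchSize)
                {
                    IRowWriter full = writer;
                    writer = null;   // so the finally block does not close it twice
                    full.Close();
                }
                if (writer == null) { writer = openWriter(); opened++; inBatch = 0; }
                writer.Add(row);
                inBatch++;
            }
            if (writer != null) writer.Optimize();
        }
        finally
        {
            if (writer != null) writer.Close();
        }
        return opened;
    }
}

// In-memory writer that only counts calls, so the sketch runs without Lucene.
public class CountingWriter : IRowWriter
{
    public int Added, Optimized, Closed;
    public void Add(string[] row) { Added++; }
    public void Optimize() { Optimized++; }
    public void Close() { Closed++; }
}

public static class Demo
{
    public static void Main()
    {
        var rows = new List<string[]>();
        for (int i = 0; i < 2500; i++) rows.Add(new[] { i.ToString() });
        int opened = BatchIndexer.IndexInBatches(rows, 1000, () => new CountingWriter());
        Console.WriteLine(opened); // 3 writers: 1000 + 1000 + 500 rows
    }
}
```

With the real IndexWriter, I assume openWriter would have to create the index on the first call and open the existing one on every call after that.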
I agree that 1500 documents is very small. I'm
building the index on a PC with 512 megs, and the
indexing process is quickly gobbling up around 400
megs when I index around 1800 documents and the whole
machine is grinding to a virtual halt. I'm using the
latest DotLucene .NET port, so maybe there's a memory
leak in it.
I have experience with AltaVista search (acquired by
FastSearch), and I used to call MakeStable() every
20,000 documents to flush memory structures to disk.
There doesn't seem to be an equivalent in Lucene.
-- Homam
--- Otis Gospodnetic <otis_gospodnetic@yahoo.com> wrote:
> Hello,
>
> There are a few things you can do:
>
> 1) Don't just pull all rows from the DB at once. Do
> that in batches.
>
> 2) If you can get a Reader from your SqlDataReader,
> consider this:
>
> http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/document/Field.html#Text(java.lang.String,%20java.io.Reader)
>
> 3) Give the JVM more memory to play with by using
> -Xms and -Xmx JVM
> parameters
>
> 4) See IndexWriter's minMergeDocs parameter.
>
> 5) Are you calling optimize() at some point by any
> chance? Leave that
> call for the end.
>
> 1500 documents with 30 columns of short
> String/number values is not a
> lot. You may be doing something else not Lucene
> related that's slowing
> things down.
>
> Otis
>
>
> --- "Homam S.A." <homam_sa@yahoo.com> wrote:
>
> > I'm trying to index a large number of records from the
> > DB (a few million). Each record will be stored as a
> > document with about 30 fields; most of them are
> > UnStored and hold small strings or numbers. No
> > huge DB Text fields.
> >
> > But I'm running out of memory very fast, and the
> > indexing slows to a crawl once I hit around
> > 1500 records. The problem is that each document is holding
> > references to the string objects returned from
> > ToString() on the DB field, and the IndexWriter is
> > holding references to all these document objects in
> > memory, so the garbage collector isn't getting a chance
> > to clean these up.
> >
> > How do you guys go about indexing a large DB table?
> > Here's a snippet of my code (this method is called for
> > each record in the DB):
> >
> > // Called once per row: builds one Lucene document per DB record.
> > private void IndexRow(SqlDataReader rdr, IndexWriter iw) {
> >     Document doc = new Document();
> >     for (int i = 0; i < BrowseFieldNames.Length; i++) {
> >         // Index the column value without storing it in the index.
> >         doc.Add(Field.UnStored(BrowseFieldNames[i],
> >                                rdr.GetValue(i).ToString()));
> >     }
> >     iw.AddDocument(doc);
> > }
> >
---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org