lucene-java-user mailing list archives

From: Otis Gospodnetic <otis_gospodne...@yahoo.com>
Subject: RE: Indexing a large number of DB records
Date: Wed, 15 Dec 2004 18:42:40 GMT

Note that this approach includes some unnecessary extra steps.
You don't need a temp index.  Add everything to a single index using a
single IndexWriter instance.  There is no need to call addIndexes, and
optimize only needs to be called once at the end.  Adding Documents to
an index takes a roughly constant amount of time, regardless of the
index size, because new segments are created as documents are added,
and existing segments are rewritten only when merges happen.  Again,
I'd run your app under a profiler to see where the time and memory are
going.
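
As a rough illustration of the segment behaviour described above, here
is a toy Java model.  It is not Lucene's actual merge code: the class
name SegmentModel, the merge rule, and the merge factor of 10 are all
assumptions made for this sketch.  It counts how many document copies
the merges perform, to show that each add only appends a small segment
and that older segments are touched only when a merge happens.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of segment-based indexing; NOT Lucene's real implementation. */
class SegmentModel {
    // Hypothetical merge factor chosen for illustration only.
    static final int MERGE_FACTOR = 10;

    // Each entry is the document count of one segment, oldest first.
    final List<Integer> segments = new ArrayList<>();

    // Total number of documents rewritten by merges so far.
    long docsCopiedByMerges = 0;

    /** Adding a document creates a new one-document segment. */
    void addDocument() {
        segments.add(1);
        maybeMerge();
    }

    /** Merge the newest MERGE_FACTOR segments whenever they are all the same size. */
    private void maybeMerge() {
        while (segments.size() >= MERGE_FACTOR) {
            int last = segments.size() - 1;
            int size = segments.get(last);
            boolean allSameSize = true;
            for (int i = last - MERGE_FACTOR + 1; i <= last; i++) {
                if (segments.get(i) != size) {
                    allSameSize = false;
                    break;
                }
            }
            if (!allSameSize) {
                return;
            }
            // Replace the newest MERGE_FACTOR segments with one merged segment.
            for (int i = 0; i < MERGE_FACTOR; i++) {
                segments.remove(segments.size() - 1);
            }
            int mergedSize = size * MERGE_FACTOR;
            segments.add(mergedSize);
            docsCopiedByMerges += mergedSize;
        }
    }

    public static void main(String[] args) {
        SegmentModel model = new SegmentModel();
        int total = 100_000;
        for (int i = 0; i < total; i++) {
            model.addDocument();
        }
        System.out.println("segments: " + model.segments);
        System.out.println("copies per document: "
                + (double) model.docsCopiedByMerges / total);
    }
}
```

In this model the number of times a document is rewritten grows with
the logarithm of the index size, not with the index size itself, which
is why the cost of an add stays close to constant.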

Otis

--- Garrett Heaver <garrett.heaver@researchandmarkets.com> wrote:

> Hi Homan
> 
> I had a similar problem to yours, in that I was indexing A LOT of data.
> 
> Essentially, how I got round it was to batch the indexing.
> 
> What I was doing was to add 10,000 documents to a temporary index,
> use addIndexes() to merge the temporary index into the live index
> (which also optimizes the live index), then delete the temporary
> index. On the next loop I'd only query rows from the db above the id
> in the maxdoc of the live index, and set the max rows of the query to
> 10,000, i.e.
> 
> SELECT TOP 10000 [fields] FROM [tables]
> WHERE [id_field] > {ID from Index.MaxDoc()}
> ORDER BY [id_field] ASC
> 
> As long as the documents go into the index sequentially, your
> problem is solved, and memory usage on mine (dotlucene 1.3) is low.
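
The batching query above amounts to keyset pagination: each pass
fetches the next rows whose id is greater than the last id indexed.  A
minimal Java sketch of that loop follows, with an in-memory sorted list
standing in for the table.  KeysetBatcher, fetchBatch and indexAll are
hypothetical names; a real version would run the SQL query and add each
row to the index instead.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of id-based batching; the "table" and "index" are plain lists. */
class KeysetBatcher {

    /**
     * Stand-in for: SELECT TOP limit ... WHERE id > lastId ORDER BY id ASC.
     * Assumes sortedIds is in ascending order.
     */
    static List<Integer> fetchBatch(List<Integer> sortedIds, int lastId, int limit) {
        List<Integer> batch = new ArrayList<>();
        for (int id : sortedIds) {
            if (id > lastId) {
                batch.add(id);
                if (batch.size() == limit) {
                    break;
                }
            }
        }
        return batch;
    }

    /** Pulls batches until the table is exhausted, remembering the highest id seen. */
    static List<Integer> indexAll(List<Integer> sortedIds, int batchSize) {
        List<Integer> indexed = new ArrayList<>();
        int lastId = Integer.MIN_VALUE;
        while (true) {
            List<Integer> batch = fetchBatch(sortedIds, lastId, batchSize);
            if (batch.isEmpty()) {
                break;
            }
            // Stand-in for adding one document per row to the index.
            indexed.addAll(batch);
            lastId = batch.get(batch.size() - 1);
        }
        return indexed;
    }

    public static void main(String[] args) {
        List<Integer> ids = new ArrayList<>();
        for (int i = 1; i <= 25; i++) {
            ids.add(i * 3);
        }
        System.out.println(indexAll(ids, 10).size() + " rows indexed");
    }
}
```

Because every batch starts from the last id seen, an interrupted run
can resume without re-reading rows that are already indexed.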
> 
> Regards
> Garrett
> 
> -----Original Message-----
> From: Homam S.A. [mailto:homam_sa@yahoo.com] 
> Sent: 15 December 2004 02:43
> To: Lucene Users List
> Subject: Indexing a large number of DB records
> 
> I'm trying to index a large number of records from the
> DB (a few million). Each record will be stored as a
> document with about 30 fields, most of which are
> UnStored and represent small strings or numbers. No
> huge DB Text fields.
> 
> But I'm running out of memory very fast, and the
> indexing slows down to a crawl once I hit around
> 1500 records. The problem is that each document is holding
> references to the string objects returned from
> ToString() on the DB field, and the IndexWriter is
> holding references to all these document objects in
> memory, so the garbage collector isn't getting a chance
> to clean these up.
> 
> How do you guys go about indexing a large DB table?
> Here's a snippet of my code (this method is called for
> each record in the DB):
> 
> private void IndexRow(SqlDataReader rdr, IndexWriter iw) {
> 	// Build one Document per row; each column becomes an UnStored field.
> 	Document doc = new Document();
> 	for (int i = 0; i < BrowseFieldNames.Length; i++) {
> 		doc.Add(Field.UnStored(BrowseFieldNames[i], rdr.GetValue(i).ToString()));
> 	}
> 	iw.AddDocument(doc);
> }
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org

