lucene-java-user mailing list archives

From "Garrett Heaver" <>
Subject RE: Indexing a large number of DB records
Date Thu, 16 Dec 2004 09:46:36 GMT
There were other reasons for my choice of going with a temp index - namely, I
was getting terrible write times to my live index because it was stored on a
different server. Also, while I was writing to the live index, people were
trying to search on it and were getting "file not found" exceptions. Rather
than spend hours or days trying to fix that, I took the easiest route: build a
temp index on the server running the application, then merge it into the live
index on the other server. This greatly increased my indexing speed.
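Roughly, the merge step looks like this (paths are placeholders and the
dotlucene calls are assumed to mirror the Java Lucene API - treat it as a
sketch rather than working code):

	// build the current batch into a temp index on the local disk
	IndexWriter temp = new IndexWriter(@"C:\TempIndex", new StandardAnalyzer(), true);
	// ... AddDocument() calls for the batch go here ...
	temp.Close();

	// one merge of the finished temp index into the live index on the other server
	IndexWriter live = new IndexWriter(@"\\LiveServer\LiveIndex", new StandardAnalyzer(), false);
	live.AddIndexes(new Directory[] { FSDirectory.GetDirectory(@"C:\TempIndex", false) });
	live.Close();
	// delete C:\TempIndex before starting the next batch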

Best of luck

-----Original Message-----
From: Otis Gospodnetic [] 
Sent: 15 December 2004 18:43
To: Lucene Users List
Subject: RE: Indexing a large number of DB records

Note that this approach really includes some extra steps.
You don't need a temp index.  Add everything to a single index using a
single IndexWriter instance.  There is no need to call addIndexes at all,
and no need to optimize until the end.  Adding Documents to an index takes
a constant amount of time, regardless of the index size, because new
segments are created as documents are added and existing segments don't
need to be updated (they are only touched when merges happen).  Again, I'd
run your app under a profiler to see where the time and memory are going.
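In other words, something along these lines (this is only a sketch, reusing
the IndexRow method and dotlucene-style names from the original post below):

	IndexWriter iw = new IndexWriter(@"C:\LiveIndex", new StandardAnalyzer(), true);
	while (rdr.Read())        // a single reader over the whole result set
		IndexRow(rdr, iw);    // IndexRow as defined in the original post
	iw.Optimize();            // only once, at the very end
	iw.Close();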


--- Garrett Heaver <> wrote:

> Hi Homan
> I had a similar problem as you in that I was indexing A LOT of data
> Essentially how I got round it was to batch the index.
> What I was doing was to add 10,000 documents to a temporary index,
> use
> addIndexes() to merge to temporary index into the live index (which
> also
> optimizes the live index) then delete the temporary index. On the
> next loop
> I'd only query rows from the db above the id in the maxdoc of the
> live index
> and set the max rows of the query to to 10,000
> i.e
> SELECT TOP 10000 [fields] FROM [tables] WHERE [id_field] > {ID from
> Index.MaxDoc()} ORDER BY [id_field] ASC
> Ensuring that the documents go into the index sequentially your
> problem is
> solved and memory usage on mine (dotlucene 1.3) is low
> Regards
> Garrett
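(For reference, the batch loop Garrett describes would look roughly like the
sketch below - table, column, and path names are placeholders, and
GetMaxIndexedId() is a hypothetical helper - but as noted above you can skip
the temp index entirely and just keep one IndexWriter open.)

	int lastId = GetMaxIndexedId(livePath);   // hypothetical: highest [id_field] already indexed
	while (true) {
		SqlCommand cmd = new SqlCommand(
			"SELECT TOP 10000 * FROM MyTable WHERE Id > @lastId ORDER BY Id ASC", conn);
		cmd.Parameters.Add("@lastId", lastId);
		SqlDataReader rdr = cmd.ExecuteReader();

		// index this batch into a fresh temp index
		IndexWriter temp = new IndexWriter(tempPath, new StandardAnalyzer(), true);
		int rows = 0;
		while (rdr.Read()) {
			IndexRow(rdr, temp);          // IndexRow as in the snippet further down
			lastId = rdr.GetInt32(0);     // assumes the id is the first column selected
			rows++;
		}
		rdr.Close();
		temp.Close();
		if (rows == 0) break;             // nothing left to index

		// merge the temp index into the live one (addIndexes also optimizes it)
		IndexWriter live = new IndexWriter(livePath, new StandardAnalyzer(), false);
		live.AddIndexes(new Directory[] { FSDirectory.GetDirectory(tempPath, false) });
		live.Close();
		// delete the files under tempPath before the next pass
	}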
> -----Original Message-----
> From: Homam S.A. [] 
> Sent: 15 December 2004 02:43
> To: Lucene Users List
> Subject: Indexing a large number of DB records
> I'm trying to index a large number of records from the DB (a few
> million). Each record will be stored as a document with about 30 fields,
> most of them UnStored, representing small strings or numbers. No huge DB
> Text fields.
> But I'm running out of memory very fast, and the indexing is slowing
> down to a crawl once I hit around 1500 records. The problem is that each
> document is holding references to the string objects returned from
> ToString() on the DB fields, and the IndexWriter is holding references to
> all these document objects in memory, so the garbage collector isn't
> getting a chance to clean these up.
> How do you guys go about indexing a large DB table?
> Here's a snippet of my code (this method is called for each record in
> the DB):
> private void IndexRow(SqlDataReader rdr, IndexWriter iw) {
> 	Document doc = new Document();
> 	for (int i = 0; i < BrowseFieldNames.Length; i++) {
> 		doc.Add(Field.UnStored(BrowseFieldNames[i], rdr.GetValue(i).ToString()));
> 	}
> 	iw.AddDocument(doc);
> }
