lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Doug Cutting <cutt...@lucene.com>
Subject Re: Indexing Tips and Hints
Date Tue, 25 Feb 2003 17:18:28 GMT
I doubt this will make Lucene much faster, since Lucene already 
implements buffering in its InputStream and OutputStream classes.  So 
Lucene already has this optimization built-in.

Doug

Andrzej Bialecki wrote:
> Hello,
> 
> Since you are trying this anyway, and looking for ways to improve 
> indexing times... Could you perhaps try to replace use of 
> java.io.RandomAccessFile in FSDirectory implementation, with the 
> attached implementation? It supposedly increases I/O throughput by 
> orders of magnitude, by using partial buffering.
> 
> Terry Steichen wrote:
> 
>> Mike,
>>
>> By way of comparison, I've got a collection of about 50,000 XML files, 
>> each
>> of which averages about 8K.  It takes about 1.25 hours to index (on a 
>> 1.8Ghz
>> machine).  I use basically the standard configuration (mergeFactor, etc.)
>> and I've got about 30 fields per document.  I add about 200 new ones per
>> day.  I don't recall how long that it takes to index the 200 (I do it
>> through a background task), but it takes a couple of minutes to merge the
>> new 200 document index with the master index.
>>
>> HTH,
>>
>> Terry
>>
>> ----- Original Message -----
>> From: "Michael Barry" <mbarry@cos.com>
>> To: "Lucene Users List" <lucene-user@jakarta.apache.org>
>> Sent: Monday, February 24, 2003 2:00 PM
>> Subject: Indexing Tips and Hints
>>
>>
>>
>>> All,
>>>   I'm in need of some pointers, hints or tips on indexing large
>>
>>
>> collections
>>
>>> of data. I know I saw some tips on this list before but when I tried
>>> searching
>>> the list, I came up blank.
>>>   I have a large collection of XML files (336000 files around 5K
>>> apiece) that I'm
>>> indexing and its taking quite a bit of time (27 hours). I've played
>>> around with the
>>> mergeFactor, RAMDirectories and multiple threads (X number of threads
>>> indexing
>>> a subset of the data and then merging the indexes at the end) but I
>>> cannot seem
>>> to bring the time down. I'm probably not doing these things properly but
>>> from
>>> what I read I believe I am.  Maybe this is the best I can do with this
>>> data but I
>>> would be really grateful to hear how others have tackled this same 
>>> issue.
>>>   As always pointers to places in the mailing list archive or other
>>> places would be
>>> appreciated.
>>>
>>> Thanks, Mike.
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>>> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>>>
>>>
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>>
>>
>>
> 
> 
> 
> ------------------------------------------------------------------------
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message