lucene-dev mailing list archives

From Grant Ingersoll <grant.ingers...@gmail.com>
Subject Re: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents
Date Fri, 22 Jun 2007 19:37:44 GMT
Hi Michael,

I know you've got your hands full, but I was wondering if you could
either post your benchmark code or, better yet, hook it into the
benchmarker contrib (it is quite easy).

Let me know if I can help,
Grant

On Jun 21, 2007, at 10:01 AM, Michael McCandless (JIRA) wrote:

>
>     [ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12506907 ]
>
> Michael McCandless commented on LUCENE-843:
> -------------------------------------------
>
> OK I ran tests comparing analyzer performance.
>
> It's the same test framework as above, using the ~5,500 byte Europarl
> docs with autoCommit=true, 32 MB RAM buffer, no stored fields nor
> vectors, and CFS=false, indexing 200,000 documents.
>
> The SimpleSpaceAnalyzer is my own whitespace analyzer that minimizes
> GC cost by not allocating a Term or String for every token in every
> document.
>
> Each run is best time of 2 runs:
>
>   ANALYZER            PATCH (sec) TRUNK (sec)  SPEEDUP
>   SimpleSpaceAnalyzer  79.0       326.5        4.1 X
>   StandardAnalyzer    449.0       674.1        1.5 X
>   WhitespaceAnalyzer  104.0       338.9        3.3 X
>   SimpleAnalyzer      104.7       328.0        3.1 X
>
> StandardAnalyzer is definitely rather time consuming!
>
>
>> improve how IndexWriter uses RAM to buffer added documents
>> ----------------------------------------------------------
>>
>>                 Key: LUCENE-843
>>                 URL: https://issues.apache.org/jira/browse/LUCENE-843
>>             Project: Lucene - Java
>>          Issue Type: Improvement
>>          Components: Index
>>    Affects Versions: 2.2
>>            Reporter: Michael McCandless
>>            Assignee: Michael McCandless
>>            Priority: Minor
>>         Attachments: index.presharedstores.cfs.zip,  
>> index.presharedstores.nocfs.zip, LUCENE-843.patch,  
>> LUCENE-843.take2.patch, LUCENE-843.take3.patch,  
>> LUCENE-843.take4.patch, LUCENE-843.take5.patch,  
>> LUCENE-843.take6.patch, LUCENE-843.take7.patch,  
>> LUCENE-843.take8.patch, LUCENE-843.take9.patch
>>
>>
>> I'm working on a new class (MultiDocumentWriter) that writes more than
>> one document directly into a single Lucene segment, more efficiently
>> than the current approach.
>> This only affects the creation of an initial segment from added
>> documents.  I haven't changed anything after that, eg how segments are
>> merged.
>> The basic ideas are:
>>   * Write stored fields and term vectors directly to disk (don't
>>     use up RAM for these).
>>   * Gather posting lists & term infos in RAM, but periodically do
>>     in-RAM merges.  Once RAM is full, flush buffers to disk (and
>>     merge them later when it's time to make a real segment).
>>   * Recycle objects/buffers to reduce time/stress in GC.
>>   * Other various optimizations.
>> Some of these changes are similar to how KinoSearch builds a segment.
>> But, I haven't made any changes to Lucene's file format nor added
>> requirements for a global fields schema.
>> So far the only externally visible change is a new method
>> "setRAMBufferSize" in IndexWriter (and setMaxBufferedDocs is
>> deprecated) so that it flushes according to RAM usage and not a fixed
>> number of documents added.
>
> -- 
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
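
For anyone re-running the numbers: the SPEEDUP column in the quoted table appears to be trunk time divided by patch time, rounded to one decimal place. The sketch below recomputes it from the timings copied out of the table. The class and method names (SpeedupTable, speedup) are made up for illustration and are not part of the patch or of the benchmark contrib.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class SpeedupTable {
    // Assumption: speedup = trunk seconds / patch seconds.
    static double speedup(double patchSec, double trunkSec) {
        return trunkSec / patchSec;
    }

    // Timings copied from the quoted table: {patch seconds, trunk seconds}.
    static Map<String, double[]> timings() {
        Map<String, double[]> runs = new LinkedHashMap<>();
        runs.put("SimpleSpaceAnalyzer", new double[] {79.0, 326.5});
        runs.put("StandardAnalyzer", new double[] {449.0, 674.1});
        runs.put("WhitespaceAnalyzer", new double[] {104.0, 338.9});
        runs.put("SimpleAnalyzer", new double[] {104.7, 328.0});
        return runs;
    }

    // Formats one row the way the table does, e.g. "4.1 X".
    static String format(double patchSec, double trunkSec) {
        return String.format(Locale.ROOT, "%.1f X", speedup(patchSec, trunkSec));
    }

    public static void main(String[] args) {
        for (Map.Entry<String, double[]> e : timings().entrySet()) {
            double[] t = e.getValue();
            System.out.printf(Locale.ROOT, "%-20s %s%n", e.getKey(), format(t[0], t[1]));
        }
    }
}
```

Under that assumption the recomputed values match the four figures reported in the table.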

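The issue description says the writer should flush according to RAM usage rather than a fixed number of added documents. A minimal sketch of that policy follows, assuming that the UTF-8 byte length of a document is a good enough stand-in for its buffered size. RamBudgetBuffer and all of its members are hypothetical names for illustration only; this is not the MultiDocumentWriter code from the patch and it does not touch any Lucene API.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class RamBudgetBuffer {
    private final long ramBudgetBytes;               // flush threshold in bytes
    private final List<byte[]> buffered = new ArrayList<>();
    private long bytesUsed = 0;
    private int flushCount = 0;

    public RamBudgetBuffer(long ramBudgetBytes) {
        this.ramBudgetBytes = ramBudgetBytes;
    }

    // Buffer one document; flush once the byte budget is reached,
    // regardless of how many documents that took.
    public void add(String doc) {
        byte[] bytes = doc.getBytes(StandardCharsets.UTF_8);
        buffered.add(bytes);
        bytesUsed += bytes.length;
        if (bytesUsed >= ramBudgetBytes) {
            flush();
        }
    }

    // Placeholder: a real writer would write a segment to disk here.
    public void flush() {
        buffered.clear();
        bytesUsed = 0;
        flushCount++;
    }

    public int flushCount() {
        return flushCount;
    }

    public int bufferedDocs() {
        return buffered.size();
    }

    public static void main(String[] args) {
        RamBudgetBuffer buf = new RamBudgetBuffer(10);
        buf.add("aaaa");   // 4 bytes buffered
        buf.add("bbbb");   // 8 bytes buffered
        buf.add("cccc");   // 12 bytes >= 10, triggers a flush
        System.out.println(buf.flushCount() + " flush, " + buf.bufferedDocs() + " docs buffered");
    }
}
```

With a byte budget, many small documents or a few large ones trigger a flush at roughly the same memory footprint, which a fixed document count cannot guarantee.
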
------------------------------------------------------
Grant Ingersoll
http://www.grantingersoll.com/
http://lucene.grantingersoll.com
http://www.paperoftheweek.com/
