From: "Michael D. Curtin" <mike@curtin.com>
Date: Mon, 12 Jun 2006 12:03:48 -0400
To: java-user@lucene.apache.org
Subject: Re: Does more memory help Lucene?

Nadav Har'El wrote:
> Otis Gospodnetic wrote on 12/06/2006 04:36:45 PM:
>
>> Nadav,
>>
>> Look up one of my onjava.com Lucene articles, where I talk about
>> this. You may also want to tell Lucene to merge segments on disk
>> less frequently, which is what mergeFactor does.
>
> Thanks. Can you please point me to the appropriate article? (I found one
> from March 2003, but I'm not sure if it's the one you meant.)
>
> About mergeFactor() - thanks for the hint, I'll try changing it too (I've
> used 20 so far) and see if it helps performance.
>
> Still, there is one thing about mergeFactor(), and the merge process, that
> I don't understand: does having more memory help this process at all? Does
> having a large mergeFactor() actually require more memory? The reason I'm
> asking is that I'm still trying to figure out whether a machine with huge
> RAM actually helps Lucene or not.

I'm using 1.4.3, so I don't know if things are the same in 2.0. Anyhow, I
found a significant performance benefit from changing minMergeDocs and
mergeFactor from their defaults of 10 and 10 to 1,000 and 70, respectively.
The improvement seems to come from a reduction in the number of merges as
the index is created. Each merge involves reading and writing a bunch of
data that has already been indexed, sometimes everything indexed so far, so
it's easy to see how reducing the number of merges reduces the overall
indexing time. I can't remember why, but I also saw little benefit from
increasing minMergeDocs beyond 1,000.
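For reference, in the 1.4.x API both of those knobs are public fields on
IndexWriter (later releases moved them behind setters such as
setMergeFactor()). A minimal sketch, with the index path and field name
made up for illustration:

    import java.io.IOException;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class TuneMergeKnobs {
        public static void main(String[] args) throws IOException {
            // Create a new index and raise the merge knobs up front.
            IndexWriter writer = new IndexWriter("/path/to/index",
                                                 new StandardAnalyzer(), true);
            writer.minMergeDocs = 1000; // docs buffered in RAM before a disk segment is written
            writer.mergeFactor = 70;    // segments merged at a time (default is 10)

            Document doc = new Document();
            doc.add(Field.Text("body", "some text to index"));
            writer.addDocument(doc);

            writer.optimize();
            writer.close();
        }
    }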
A lot of time was being spent in the first merge, which takes a bunch of
one-document "segments" in a RAMDirectory and merges them into the
first-level segments on disk. I hacked the code to make this first merge
(and ONLY the first merge) operate on minMergeDocs * mergeFactor documents
instead, which greatly increased the RAM consumption but reduced the
indexing time. In detail, what I started with was:

  a. read minMergeDocs docs, creating one-doc segments in RAM
  b. read those one-doc RAM segments and merge them
  c. write the merged result to a disk segment
  ...
  i. read mergeFactor first-level disk segments and merge them
  j. write second-level segments to disk
  ...
  p. normal disk-based merging thereafter, as necessary

And what I ended up with was:

  A. read minMergeDocs * mergeFactor docs, and remember them in RAM
  B. write a segment from all the remembered RAM docs (a modified merge)
  ...
  F. normal disk-based merging thereafter, as necessary

In essence, I eliminated that first-level merge, the one that involved lots
and lots of teeny-weeny I/O operations that were very inefficient. In my
case, steps A and B worked on 70,000 documents instead of 1,000.
Remembering all those docs required a lot of RAM (almost 2GB), but it
almost tripled indexing performance. Later I had to knock the 70 down to 35
(maybe because my docs got a lot bigger, but I don't remember now), but you
get the idea. I couldn't simply use a mergeFactor of 70,000 because that
would take way more file descriptors than I could have without recompiling
the kernel: I seem to remember my limit being 1,024, and each segment took
14 file descriptors, so merging 70,000 segments at once would have needed
nearly a million of them. See the P.S. for a sketch of the same idea using
only public APIs.

Hope it helps.

--MDC
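P.S. A similar effect is possible without patching IndexWriter internals,
by batching documents into a RAMDirectory and folding each batch into the
disk index with addIndexes(). This is just a sketch against the 1.4-era
API, not the hack I actually ran; indexPath, docs, and batchSize are
placeholders, and the on-disk index is assumed to already exist:

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.RAMDirectory;

    public class BatchIndexer {
        // Index up to batchSize docs into a RAMDirectory, then fold the
        // whole batch into the on-disk index with one addIndexes() call.
        static void indexBatch(String indexPath, Iterator docs, int batchSize)
                throws IOException {
            RAMDirectory ramDir = new RAMDirectory();
            IndexWriter ramWriter = new IndexWriter(ramDir,
                                                    new StandardAnalyzer(), true);
            for (int i = 0; i < batchSize && docs.hasNext(); i++) {
                ramWriter.addDocument((Document) docs.next());
            }
            ramWriter.close();

            // One large, mostly sequential merge replaces thousands of tiny
            // first-level merges of one-doc segments.
            IndexWriter diskWriter = new IndexWriter(indexPath,
                                                     new StandardAnalyzer(), false);
            diskWriter.addIndexes(new Directory[] { ramDir });
            diskWriter.close();
        }
    }

With batchSize around 70,000 this gives roughly the trade-off described
above: a couple of GB of RAM per batch in exchange for far fewer merges.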