Mailing-List: contact lucene-user-help@jakarta.apache.org; run by ezmlm
Message-ID: <010a01c381d9$71119590$9e0010ac@thopny.triplehop.com>
From: "hui" <hui@triplehop.com>
To: "Lucene Users List" <lucene-user@jakarta.apache.org>
References: <5A4393E2-EB42-11D7-83C0-000393A564E6@ehatchersolutions.com>
 <02b801c3810f$07d46b40$9e0010ac@thopny.triplehop.com>
 <00f301c381b6$82d0f870$680010ac@teck>
Subject: Re: Proposition :adding minMergeDoc to IndexWriter
Date: Tue, 23 Sep 2003 09:48:59 -0400
MIME-Version: 1.0
Content-Type: text/plain;
	charset="iso-8859-1"
Content-Transfer-Encoding: 7bit

It is a great. Julien. Thanks.
Next time I am going to post the requests to the developer groups.

Regards,
Hui
----- Original Message ----- 
From: "Julien Nioche" <Julien.Nioche@lingway.com>
To: "Lucene Users List" <lucene-user@jakarta.apache.org>
Sent: Tuesday, September 23, 2003 5:38 AM
Subject: Proposition :adding minMergeDoc to IndexWriter


> Hui,
>
> Concerning an other point of your request list I proposed a patch this
week
> end on the lucene-dev list and i totally forgot that this feature was
> requested on the user list.
>
> This new feature should help you to set a number of Documents to be merged
> in memory independently of the mergeFactor.
>
> Any comments would be appreciated
>
> Best regards
>
> Julien Nioche
> http://www.lingway.com
>
> ---------- Debut du message initial -----------
>
> De     : "fp235-5" <julien.nioche@lingway.com>
> A      : "lucene-dev" <lucene-dev@jakarta.apache.org>
> Copies :
> Date   : Sat, 20 Sep 2003 16:06:06 +0200
> Sujet  : [PATCH] IndexWriter : controling the number of Docs merged
>
> Hello,
>
> Someone made a suggestion yesterday about adding a variable to IndexWriter
> in
> order to control the number of Documents merged in RAMDirectory
> independently of
> the mergeFactor. (I'm sorry I don't remember who exactly and the mail
> arrived at
> my office).
> I'm proposing a tiny modification of IndexWriter to add this
functionality.
> A
> variable minMergeDocs specifies the number of Documents to be merged in
> memory
> before starting a new Segment. The mergeFactor still control the number of
> Segments created in the Directory and thus it's possible to avoid the file
> number limitation problem.
>
> The diff file is attached.
>
> As noticed by Dmitry and Erik there are no true JUnit tests. I'd be OK to
> write
> a JUnit test for this feature. The problem is that the SegmentInfos field
is
> private in IndexWriter and can't be used to check the number and size of
the
> Segments. I ran a test using the infoStream variable of IndexWriter -
> everything
> seems to be OK.
>
> Any comments / suggestions are welcome.
>
> Regards
>
> Julien
>
>
>
>
>
>
>
>
>
> ----- Original Message -----
> From: "hui" <hui@triplehop.com>
> To: "Lucene Users List" <lucene-user@jakarta.apache.org>
> Sent: Monday, September 22, 2003 3:40 PM
> Subject: Re: per-field Analyzer (was Re: some requests)
>
>
> > Good work, Erik.
> >
> > Hui
> >
> > ----- Original Message -----
> > From: "Erik Hatcher" <erik@ehatchersolutions.com>
> > To: "Lucene Users List" <lucene-user@jakarta.apache.org>
> > Sent: Saturday, September 20, 2003 4:13 AM
> > Subject: per-field Analyzer (was Re: some requests)
> >
> >
> > > On Friday, September 19, 2003, at 07:45  PM, Erik Hatcher wrote:
> > > > On Friday, September 19, 2003, at 11:15  AM, hui wrote:
> > > >> 1. Move the Analyzer down to field level from document level so
some
> > > >> fields
> > > >> could be applied a specail analyzer.Other fields still use the
> default
> > > >> analyzer from the document level.
> > > >> For example, I do not need to index the number for the "content"
> > > >> field. It
> > > >> helps me reduce the index size a lot when I have some excel files.
> > > >> But I
> > > >> always need the "created_date" to be indexed though it is a number
> > > >> field.
> > > >>
> > > >> I know there are some workarounds put in the group, but I think it
> > > >> should be
> > > >> a good feature to have.
> > > >
> > > > The "workaround" is to write a custom analyzer and and have it do
the
> > > > desired thing per-field.
> > > >
> > > > Hmmm.... just thinking out loud here without knowing if this is
> > > > possible, but could a generic "wrapper" Analyzer be written that
> > > > allows other analyzers to be used under the covers based on a field
> > > > name/analyzer mapping?   If so, that would be quite cool and save
> > > > folks from having to write custom analyzers as much to handle this
> > > > pretty typical use-case.  I'll look into this more in the very near
> > > > future personally, but feel free to have a look at this yourself and
> > > > see what you can come up with.
> > >
> > > What about something like this?
> > >
> > > public class PerFieldWrapperAnalyzer extends Analyzer {
> > >    private Analyzer defaultAnalyzer;
> > >    private Map analyzerMap = new HashMap();
> > >
> > >
> > >    public PerFieldWrapperAnalyzer(Analyzer defaultAnalyzer) {
> > >      this.defaultAnalyzer = defaultAnalyzer;
> > >    }
> > >
> > >    public void addAnalyzer(String fieldName, Analyzer analyzer) {
> > >      analyzerMap.put(fieldName, analyzer);
> > >    }
> > >
> > >    public TokenStream tokenStream(String fieldName, Reader reader) {
> > >      Analyzer analyzer = (Analyzer) analyzerMap.get(fieldName);
> > >      if (analyzer == null) {
> > >        analyzer = defaultAnalyzer;
> > >      }
> > >
> > >      return analyzer.tokenStream(fieldName, reader);
> > >    }
> > > }
> > >
> > > This would allow you to construct a single analyzer out of others, on
a
> > > per-field basis, including a default one for any fields that do not
> > > have a special one.  Whether the constructor should take the map or
the
> > > addAnalyzer method is implemented is debatable, but I prefer the
> > > addAnalyzer way.  Maybe addAnalyzer could return 'this' so you could
> > > chain: new PerFieldWrapperAnalyzer(new
> > > StandardAnalyzer).addAnalyzer("field1", new
> > > WhitespaceAnalyzer()).addAnalyzer(.....).  And I'm more inclined to
> > > call this thing PerFieldAnalyzerWrapper instead.  Any naming
> > > suggestions?
> > >
> > > This simple little class would seem to be the answer to a very common
> > > question asked.
> > >
> > > Thoughts?  Should this be made part of the core?
> > >
> > > Erik
> > >
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> > > For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> > >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> > For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> >
> >
>


----------------------------------------------------------------------------
----


> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org