lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Julien Nioche" <Julien.Nio...@lingway.com>
Subject Proposition :adding minMergeDoc to IndexWriter
Date Tue, 23 Sep 2003 09:38:55 GMT
Hui,

Concerning an other point of your request list I proposed a patch this week
end on the lucene-dev list and i totally forgot that this feature was
requested on the user list.

This new feature should help you to set a number of Documents to be merged
in memory independently of the mergeFactor.

Any comments would be appreciated

Best regards

Julien Nioche
http://www.lingway.com

---------- Debut du message initial -----------

De     : "fp235-5" <julien.nioche@lingway.com>
A      : "lucene-dev" <lucene-dev@jakarta.apache.org>
Copies :
Date   : Sat, 20 Sep 2003 16:06:06 +0200
Sujet  : [PATCH] IndexWriter : controling the number of Docs merged

Hello,

Someone made a suggestion yesterday about adding a variable to IndexWriter
in
order to control the number of Documents merged in RAMDirectory
independently of
the mergeFactor. (I'm sorry I don't remember who exactly and the mail
arrived at
my office).
I'm proposing a tiny modification of IndexWriter to add this functionality.
A
variable minMergeDocs specifies the number of Documents to be merged in
memory
before starting a new Segment. The mergeFactor still control the number of
Segments created in the Directory and thus it's possible to avoid the file
number limitation problem.

The diff file is attached.

As noticed by Dmitry and Erik there are no true JUnit tests. I'd be OK to
write
a JUnit test for this feature. The problem is that the SegmentInfos field is
private in IndexWriter and can't be used to check the number and size of the
Segments. I ran a test using the infoStream variable of IndexWriter -
everything
seems to be OK.

Any comments / suggestions are welcome.

Regards

Julien









----- Original Message -----
From: "hui" <hui@triplehop.com>
To: "Lucene Users List" <lucene-user@jakarta.apache.org>
Sent: Monday, September 22, 2003 3:40 PM
Subject: Re: per-field Analyzer (was Re: some requests)


> Good work, Erik.
>
> Hui
>
> ----- Original Message -----
> From: "Erik Hatcher" <erik@ehatchersolutions.com>
> To: "Lucene Users List" <lucene-user@jakarta.apache.org>
> Sent: Saturday, September 20, 2003 4:13 AM
> Subject: per-field Analyzer (was Re: some requests)
>
>
> > On Friday, September 19, 2003, at 07:45  PM, Erik Hatcher wrote:
> > > On Friday, September 19, 2003, at 11:15  AM, hui wrote:
> > >> 1. Move the Analyzer down to field level from document level so some
> > >> fields
> > >> could be applied a specail analyzer.Other fields still use the
default
> > >> analyzer from the document level.
> > >> For example, I do not need to index the number for the "content"
> > >> field. It
> > >> helps me reduce the index size a lot when I have some excel files.
> > >> But I
> > >> always need the "created_date" to be indexed though it is a number
> > >> field.
> > >>
> > >> I know there are some workarounds put in the group, but I think it
> > >> should be
> > >> a good feature to have.
> > >
> > > The "workaround" is to write a custom analyzer and and have it do the
> > > desired thing per-field.
> > >
> > > Hmmm.... just thinking out loud here without knowing if this is
> > > possible, but could a generic "wrapper" Analyzer be written that
> > > allows other analyzers to be used under the covers based on a field
> > > name/analyzer mapping?   If so, that would be quite cool and save
> > > folks from having to write custom analyzers as much to handle this
> > > pretty typical use-case.  I'll look into this more in the very near
> > > future personally, but feel free to have a look at this yourself and
> > > see what you can come up with.
> >
> > What about something like this?
> >
> > public class PerFieldWrapperAnalyzer extends Analyzer {
> >    private Analyzer defaultAnalyzer;
> >    private Map analyzerMap = new HashMap();
> >
> >
> >    public PerFieldWrapperAnalyzer(Analyzer defaultAnalyzer) {
> >      this.defaultAnalyzer = defaultAnalyzer;
> >    }
> >
> >    public void addAnalyzer(String fieldName, Analyzer analyzer) {
> >      analyzerMap.put(fieldName, analyzer);
> >    }
> >
> >    public TokenStream tokenStream(String fieldName, Reader reader) {
> >      Analyzer analyzer = (Analyzer) analyzerMap.get(fieldName);
> >      if (analyzer == null) {
> >        analyzer = defaultAnalyzer;
> >      }
> >
> >      return analyzer.tokenStream(fieldName, reader);
> >    }
> > }
> >
> > This would allow you to construct a single analyzer out of others, on a
> > per-field basis, including a default one for any fields that do not
> > have a special one.  Whether the constructor should take the map or the
> > addAnalyzer method is implemented is debatable, but I prefer the
> > addAnalyzer way.  Maybe addAnalyzer could return 'this' so you could
> > chain: new PerFieldWrapperAnalyzer(new
> > StandardAnalyzer).addAnalyzer("field1", new
> > WhitespaceAnalyzer()).addAnalyzer(.....).  And I'm more inclined to
> > call this thing PerFieldAnalyzerWrapper instead.  Any naming
> > suggestions?
> >
> > This simple little class would seem to be the answer to a very common
> > question asked.
> >
> > Thoughts?  Should this be made part of the core?
> >
> > Erik
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> > For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>
>

Mime
View raw message