lucene-java-user mailing list archives

From "John Wang" <john.w...@gmail.com>
Subject Re: indexing api wrt Analyzer
Date Fri, 14 Mar 2008 06:09:05 GMT
Excellent!
Exactly what I was looking for!

Thanks Grant!

-John

On Thu, Mar 13, 2008 at 5:39 PM, Grant Ingersoll <gsingers@apache.org>
wrote:

> There is an addDocument method that takes an Analyzer and overrides
> the one used at construction of the IndexWriter.  See
>
> http://lucene.apache.org/java/2_3_1/api/core/org/apache/lucene/index/IndexWriter.html#addDocument(org.apache.lucene.document.Document,%20org.apache.lucene.analysis.Analyzer)
> .
>
>
>
> On Mar 13, 2008, at 4:12 PM, John Wang wrote:
>
> > Hi Grant:
> >
> >    For our corpus, we don't rely much on idf in the scoring
> > calculation, so I don't see that being much of a problem.
> >
> >    About performance: compare instantiating one IndexWriter for a
> > batch of, say, 1000 docs (iterate over the 1000 docs and call
> > addDocument on each) with instantiating and closing 1000
> > IndexWriters, each doing one addDocument. Are you saying the
> > expected performance is the same? I thought that when you call
> > addDocument, it adds to memory and flushes when a segment needs to
> > be merged or the writer closes.
> >
> >    Maybe I am missing something.
> >
> > Thanks
> >
> > -john
> >
> > On Thu, Mar 13, 2008 at 11:37 AM, Grant Ingersoll
> > <gsingers@apache.org>
> > wrote:
> >
> >>
> >> On Mar 13, 2008, at 11:03 AM, John Wang wrote:
> >>
> >>> Yes, but usually it's a good idea to add documents in batch and
> >>> not have to reinstantiate the writer for every document and then
> >>> close it.
> >>>
> >>> It would be nice if one could specify to the writer which analyzer
> >>> to use.
> >>>
> >>> PerFieldAnalyzerWrapper wouldn't work because different analyzers
> >>> may apply on the same field depending on the doc, e.g.
> >>>
> >>
> >> Also, I don't know that it is wise to put different langs in the same
> >> field.  I can't prove it definitively, but it seems to me your corpus
> >> statistics could be skewed by terms that are spelled the same but
> >> have different meanings across languages.
> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: java-user-help@lucene.apache.org
> >>
> >>
>
> --------------------------
> Grant Ingersoll
> http://www.lucenebootcamp.com
> Next Training: April 7, 2008 at ApacheCon Europe in Amsterdam
>
> Lucene Helpful Hints:
> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> http://wiki.apache.org/lucene-java/LuceneFAQ
>
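
The pattern discussed in this thread (open one writer for a whole batch, and pass an analyzer to addDocument when a document needs something other than the writer's default) can be sketched in plain Java. This is only a dependency-free model of the idea: SimpleAnalyzer, SimpleWriter, and the two sample analyzers are hypothetical stand-ins, not Lucene classes. With Lucene itself, the corresponding call is the IndexWriter.addDocument(Document, Analyzer) overload linked above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class PerDocumentAnalyzerSketch {

    /** Hypothetical stand-in for an analyzer: turns text into tokens. */
    interface SimpleAnalyzer {
        List<String> tokenize(String text);
    }

    /** Hypothetical stand-in for a writer that is opened once per batch. */
    static class SimpleWriter {
        private final SimpleAnalyzer defaultAnalyzer;
        private final List<List<String>> indexed = new ArrayList<>();

        SimpleWriter(SimpleAnalyzer defaultAnalyzer) {
            this.defaultAnalyzer = defaultAnalyzer;
        }

        /** Uses the analyzer supplied at construction time. */
        void addDocument(String text) {
            addDocument(text, defaultAnalyzer);
        }

        /** Overrides the default analyzer for this one document only. */
        void addDocument(String text, SimpleAnalyzer analyzer) {
            indexed.add(analyzer.tokenize(text));
        }

        List<List<String>> indexed() {
            return indexed;
        }
    }

    // Two toy analyzers; a real setup would pick one per document language.
    static final SimpleAnalyzer WHITESPACE =
            text -> Arrays.asList(text.trim().split("\\s+"));
    static final SimpleAnalyzer LOWERCASE =
            text -> Arrays.asList(text.trim().toLowerCase(Locale.ROOT).split("\\s+"));

    /** Indexes a small batch with a single writer instance. */
    static List<List<String>> indexBatch() {
        SimpleWriter writer = new SimpleWriter(WHITESPACE);
        writer.addDocument("Hello World");          // default analyzer
        writer.addDocument("Guten Tag", LOWERCASE); // per-document override
        return writer.indexed();
    }

    public static void main(String[] args) {
        System.out.println(indexBatch()); // prints [[Hello, World], [guten, tag]]
    }
}
```

The writer is created once and reused for every document in the batch; only the analyzer varies per call, which is the property John was asking for.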
