lucene-java-user mailing list archives

From Robert Muir <rcm...@gmail.com>
Subject Re: hit highlighting in lucene ?
Date Thu, 21 May 2009 14:33:46 GMT
it's definitely an area in Lucene that could use some improvement.

my recommendation for multilingual text is to apply the Unicode "default"
algorithms:

Tokenize text according to UAX #29: Unicode Text Segmentation
Apply full case folding (Unicode ch. 3.13) with FC_NFKC closure
Apply UAX #15: Unicode Normalization Forms

for now you will have to write code to do this, but i'm looking forward to
contributing my implementation soon.
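
A minimal sketch of the three steps above, using only the JDK and no
Lucene API. The class and method names are made up for illustration, and
two parts are assumptions rather than the real algorithms:
java.text.BreakIterator only approximates UAX #29 word segmentation, and
upper-casing followed by lower-casing only approximates full case
folding. The FC_NFKC closure is not implemented here.

```java
import java.text.BreakIterator;
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Lucene.
final class UnicodeAnalysisSketch {

    /** Splits text at word boundaries, then folds and normalizes each token. */
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        // Assumption: the JDK word iterator stands in for UAX #29 segmentation.
        BreakIterator words = BreakIterator.getWordInstance(Locale.ROOT);
        words.setText(text);
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE;
                start = end, end = words.next()) {
            String segment = text.substring(start, end);
            // Skip segments that are only whitespace or punctuation.
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                tokens.add(fold(segment));
            }
        }
        return tokens;
    }

    /** Approximate case folding plus NFKC normalization (UAX #15). */
    static String fold(String token) {
        String nfkc = Normalizer.normalize(token, Normalizer.Form.NFKC);
        // Assumption: upper-then-lower approximates full case folding
        // (it expands U+00DF to "ss", for example). It is not identical to it.
        String folded = nfkc.toUpperCase(Locale.ROOT).toLowerCase(Locale.ROOT);
        // Normalize again because case mapping can denormalize the string.
        return Normalizer.normalize(folded, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        System.out.println(analyze("Stra\u00DFe, \uFB01nal HELLO"));
    }
}
```

The same analyze() call would have to be applied to both documents and
queries for the folded tokens to match.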

i definitely feel your pain.

On Thu, May 21, 2009 at 9:12 AM, Joel Halbert <joel@su3analytics.com> wrote:

>
> > If I index English pages
> > with the same indexer, it will not take care of stemming and stop word
> > removal?
>
> correct
>
>
> > Can't we have a single indexer that handles non-English and English in
> > equally good ways?
>
> You can have a single indexer. But if you want to use one Analyzer for
> English documents (with stemming and stop words) and another Analyzer
> for documents in other languages, then you need to know, at both
> *indexing* and *querying* time, what language the document and the
> query are in.
>
> This assumes that when a query is in English you only want to
> search English-language docs, and vice versa.
> You would also have to mark up your documents with a language identifier
> (e.g. 0=English, 1=Other Languages) so that your query can include a
> condition on the language.
>
>
>
> I've not had to deal with multi-language documents though - so I'm sure
> others will be better placed to offer their experience.
>
>
>
> -----Original Message-----
> From: KK <dioxide.software@gmail.com>
> Reply-To: java-user@lucene.apache.org
> To: java-user@lucene.apache.org
> Subject: Re: hit highlighting in lucene ?
> Date: Thu, 21 May 2009 18:31:44 +0530
>
> Initially I was using StandardAnalyzer, but I switched to SimpleAnalyzer,
> which I guess does not do more than tokenizing [and maybe lowercasing].
> I think it does not do stemming, which I don't/can't do anyway because
> I have no stemmer for the languages I'm indexing.
> For indexing and querying I'm using the same SimpleAnalyzer. So, as you
> say, I can go for the standard highlighter API which I mentioned in my
> last mail, and this will handle highlighting for any language. I should
> start using this one, right?
>
> One more thing. I have a single indexer and searcher that I'm using for
> indexing pages in many different non-English languages and, as I
> mentioned earlier, I'm using SimpleAnalyzer. Does that mean that if I
> index English pages with the same indexer, it will not take care of
> stemming and stop word removal? But I don't want to have multiple
> language-specific indexers. Can't we have a single indexer that handles
> non-English and English in equally good ways? Or any other ideas on
> this?
>
> Thanks,
> KK.
>
> On Thu, May 21, 2009 at 6:18 PM, Joel Halbert <joel@su3analytics.com>
> wrote:
>
> > The highlighter should be language independent, so long as you are
> > consistent in your use of Analyzer across indexing, querying, and
> > highlighting.
> >
> > As for the most appropriate Analyzer to use for your local language,
> > this is a separate question - especially if you are using stop word and
> > stemming filters.
> >
> > The StandardAnalyzer is designed for English, since it uses a
> > StopFilter with English stop words only.
> >
> >
> > -----Original Message-----
> > From: KK <dioxide.software@gmail.com>
> > Reply-To: java-user@lucene.apache.org
> > To: java-user@lucene.apache.org
> > Subject: hit highlighting in lucene ?
> > Date: Thu, 21 May 2009 17:51:13 +0530
> >
> > Hi All,
> > I was looking for various ways of implementing hit highlighting in Lucene
> > and found some standard classes that support highlighting, like this:
> > http://lucene.apache.org/java/2_2_0/api/org/apache/lucene/search/highlight/package-summary.html
> >
> > OK, but what I believe is that this is only for English. Or does it
> > support other languages? I actually want to support highlighting for
> > some non-English languages which I'm able to index and fetch using UTF-8
> > encoding. So this means that if I want highlighting then I have to
> > take the UTF-8 query, look for it in the result, and add appropriate
> > tags wherever required; it essentially boils down to reimplementing the
> > standard highlighter. I think the standard highlighter also supports
> > other languages.
> > Correct me if I'm wrong.
> >
> > Due to my requirement constraints I'm using just SimpleAnalyzer, and we
> > don't have tokenizers for these regional languages. Any other ideas for
> > doing the same would be helpful as well.
> >
> > Thanks,
> > KK.
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >
>
>
>
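
On the quoted point that highlighting is language independent as long as
analysis is consistent: the sketch below shows the idea with only the
JDK. It is not Lucene's Highlighter, and the class and method names are
hypothetical. It assumes a simple analysis (JDK word boundaries plus
lowercasing) and applies that same analysis to the query and to the
text, wrapping matching tokens in <b> tags.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration, not Lucene's Highlighter.
final class HighlightSketch {

    /** The one normalization shared by query terms and text tokens. */
    static String normalize(String token) {
        return token.toLowerCase(Locale.ROOT);
    }

    /** Returns {start, end} offsets of every word-like segment in s. */
    static List<int[]> wordSpans(String s) {
        List<int[]> spans = new ArrayList<>();
        BreakIterator words = BreakIterator.getWordInstance(Locale.ROOT);
        words.setText(s);
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE;
                start = end, end = words.next()) {
            // Keep only segments that contain a letter or digit.
            if (s.substring(start, end).codePoints()
                    .anyMatch(Character::isLetterOrDigit)) {
                spans.add(new int[] {start, end});
            }
        }
        return spans;
    }

    /** Wraps each token of text that matches a query term in <b>...</b>. */
    static String highlight(String text, String query) {
        Set<String> terms = new HashSet<>();
        for (int[] span : wordSpans(query)) {
            terms.add(normalize(query.substring(span[0], span[1])));
        }
        StringBuilder out = new StringBuilder();
        int copied = 0; // everything before this offset is already in out
        for (int[] span : wordSpans(text)) {
            String token = text.substring(span[0], span[1]);
            if (terms.contains(normalize(token))) {
                out.append(text, copied, span[0])
                   .append("<b>").append(token).append("</b>");
                copied = span[1];
            }
        }
        return out.append(text, copied, text.length()).toString();
    }

    public static void main(String[] args) {
        System.out.println(
            highlight("Lucene supports hit highlighting.", "LUCENE highlighting"));
    }
}
```

Because the original text is copied by offset, the highlighted output
keeps the original casing and punctuation; only the comparison uses the
normalized form.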


-- 
Robert Muir
rcmuir@gmail.com
