lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Joel Halbert <j...@su3analytics.com>
Subject Re: hit highlighting in lucene ?
Date Thu, 21 May 2009 13:12:33 GMT

> If I index english pages
> with the same indexer, it will not take care of stemming and stop word
> removal? 

correct


> Cant we have a single indexer that handles non-eng and eng in
> equally good ways?

You can have a single indexer, but, if you wanted to use one Analyzer for English documents
(with stemming/stops) and another analyzer for other language documents
then you would need to know, at the point of both *indexing* and *querying* what language
your indexed document and your query were in.

This makes the assumption that when a query is in English you only want to query English lang
docs, and vica versa. 
You would also have to mark up your documents with a language identifier (i.e. 0=English,
1=Other Languages) so that when you query you have a conditional on the language.



I've not had to deal with multi-language documents though - so I'm sure others will be better
placed to offer their experience.



-----Original Message-----
From: KK <dioxide.software@gmail.com>
Reply-To: java-user@lucene.apache.org
To: java-user@lucene.apache.org
Subject: Re: hit highlighting in lucene ?
Date: Thu, 21 May 2009 18:31:44 +0530

Initially I was using standardAnalyzer but I switched to simpleAnalyzer
which I guess doesnot do more that tokenizing[and may be tokenizing] and I
think this does not do stemming which I dont/cant do because I've no stemmer
for the languages I'm indexing.
For indexing and querring I'm using the same SimpelAnalyzer. So as you say I
can go for the standard highlighter api which I mentioned in my last mail,
and this will handle any language for highlighting support. I should start
using this one, right?

One more thing. I've a single indexer and searcher that I'm usign for
indexing pages of many different non-english languages and as I mentioned
earier I'm using simpleAnalyzer, does that mean If I index english pages
with the same indexer, it will not take care of stemming and stop word
removal? But I dont want to have multiple indexer that is specific to
languages. Cant we have a single indexer that handles non-eng and eng in
equally good ways? Or any other ideas on the same ?

Thanks,
KK.

On Thu, May 21, 2009 at 6:18 PM, Joel Halbert <joel@su3analytics.com> wrote:

> The highlighter should be language independent. So long as you are
> consistent with your use of Analyzer between
> indexing/query/highlighting.
>
> As for the most appropriate Analyzer to use for your local language,
> this is a seperate question - especially if you are using stop word and
> stemming filters.
>
> The StandardAnalyzer is designed for English since it used the
> StopFilter (English words only).
>
>
> -----Original Message-----
> From: KK <dioxide.software@gmail.com>
> Reply-To: java-user@lucene.apache.org
> To: java-user@lucene.apache.org
> Subject: hit highlighting in lucene ?
> Date: Thu, 21 May 2009 17:51:13 +0530
>
> Hi All,
> I was looking for various ways of implementing hit highlighting in Lucene
> and found some standard classes that does support highlighting like this
> *lucene*.apache.org/java/2_2_0/api/org/apache/*lucene*/search/*highlight*
> /package-summary.html<http://apache.org/java/2_2_0/api/org/apache/*lucene*/search/*highlight*%0A/package-summary.html>
>
> ik but what i believe is that this is only for english or does it support
> other languages. I actually wanted to support highlighting for some
> non-english languages which I'm able to index and fetch using utf-8
> encoding. So  this means that if I want to have highlighting then I've to
> get the utf-8 query and look for the same in the result and add apt tags
> whereever required, it essentially boils down to implementing the standard
> highlighter. I think the standard highlighter also supports other
> languages.
> Correct me if i'm wrong.
>
> Due to my requirement constraints I'm using just simpleAnalyzer and we dont
> have tokenizers for these regional languages. Any other ideas of doing the
> same would be helpful as well.
>
> Thanks,
> KK.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message