lucene-java-user mailing list archives

From Robert Muir <rcm...@gmail.com>
Subject Re: hit highlighting in lucene ?
Date Fri, 22 May 2009 12:50:14 GMT
Hello,
I think if you analyze the text correctly, your highlighting will work too.
The real problem is that you need an analyzer that tokenizes your text
correctly; once you have that, I think everything else will work!

Here's a short intro with some links. You can get code that applies these
algorithms from the ICU project:
http://site.icu-project.org/

None of it is too complex unless you need high performance; then it gets a
bit tricky, which is why my code is not ready yet :(

segmentation (tokenization):
Basically, each character in Unicode has default word-break properties
defined. Segmenting with those properties will break your Hindi words
correctly. SimpleAnalyzer and StandardAnalyzer incorrectly break words
around non-spacing marks, such as your Hindi dependent vowels and the nukta
dot, because isLetter(x) happens to be false for those characters.

As UAX #29 itself says: "It is not possible to provide a uniform set of
rules that resolves all issues across languages or that handles all
ambiguous situations within a given language. The goal for the specification
presented in this annex is to provide a workable default; tailored
implementations can be more sophisticated."
http://www.unicode.org/reports/tr29/tr29-13.html
This is what you get if you apply BreakIterator.

For a "demo", put some text into Windows Notepad and start double-clicking.
The way words are selected by each double-click is basically what we are
talking about here.
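
To make the difference concrete, here is a minimal sketch that uses only the JDK's java.text.BreakIterator (not ICU's, and not anything from Lucene). The class and method names (WordBreakSketch, words, letterRuns) are made up for illustration, and letterRuns is only a rough stand-in for a letter-only tokenizer, not the actual SimpleAnalyzer code. I have not checked how closely the JDK rules track UAX #29 for every script, so treat it as a starting point:

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class WordBreakSketch {

    // Split text into words using the JDK's rule-based word BreakIterator.
    static List<String> words(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String piece = text.substring(start, end);
            // Keep segments that contain a letter or digit; drop spaces and punctuation.
            if (piece.codePoints().anyMatch(Character::isLetterOrDigit)) {
                out.add(piece);
            }
        }
        return out;
    }

    // Hypothetical letter-only splitting: any char for which Character.isLetter
    // is false ends the current token, so combining marks cut words apart.
    static List<String> letterRuns(String text) {
        List<String> out = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                cur.append(c);
            } else if (cur.length() > 0) {
                out.add(cur.toString());
                cur.setLength(0);
            }
        }
        if (cur.length() > 0) {
            out.add(cur.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        // A Hindi word spelled with PHA + NUKTA, a dependent vowel and a virama.
        String hindi = "\u092B\u093C\u093F\u0932\u094D\u092E";
        System.out.println("isLetter(nukta) = " + Character.isLetter('\u093C'));
        System.out.println("letter runs:   " + letterRuns(hindi));
        System.out.println("BreakIterator: " + words(hindi));
    }
}
```

Running main on your own text and comparing the two lists should show where the letter-only approach cuts words apart at the marks.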

normalization:
For round-trip compatibility with existing standards, Unicode has encoded
many entities that are really variants of the same abstract character.
Normalization is the step that will ensure your PHA + NUKTA and FA are
treated the same.
http://www.unicode.org/reports/tr15/tr15-29.html
This is what you get if you apply Normalizer.
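
A minimal sketch of that step, assuming the JDK's java.text.Normalizer rather than ICU's (the class name NormalizeSketch and the helper nfc are made up for illustration). It only shows that the two spellings compare equal after normalization; which normalization form suits an index is a separate decision:

```java
import java.text.Normalizer;

class NormalizeSketch {

    // Normalize to NFC so canonically equivalent spellings compare equal.
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String phaNukta = "\u092B\u093C"; // DEVANAGARI LETTER PHA + DEVANAGARI SIGN NUKTA
        String fa = "\u095E";             // DEVANAGARI LETTER FA (precomposed)

        // The raw strings are different code point sequences.
        System.out.println("raw equal:        " + phaNukta.equals(fa));
        // After normalization both spellings are the same string.
        System.out.println("normalized equal: " + nfc(phaNukta).equals(nfc(fa)));
    }
}
```

The same normalization has to be applied at index time and at query time, otherwise the two sides still will not match.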

case folding:
Case folding is a special mapping which, if applied, erases case
differences.
This is different from lowercasing: for example, 'ß' folds to 'ss', whereas
lowercasing leaves it unchanged.
http://www.unicode.org/versions/Unicode5.0.0/ch03.pdf (page 61)
This is what you get if you apply UCharacter.foldCase.
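
To see why lowercasing alone is not enough, here is a small JDK-only sketch. Note the assumption: the JDK has no case-folding call, so approxFold below (a made-up helper) upper-cases and then lower-cases as a rough approximation. It is NOT the Unicode case folding mapping; for the real thing you would call ICU4J's UCharacter.foldCase, which this sketch does not use so that it runs without extra jars:

```java
import java.util.Locale;

class FoldSketch {

    // Rough stand-in for full case folding: upper-case, then lower-case.
    // This only approximates the Unicode mapping; it is not a replacement for it.
    static String approxFold(String s) {
        return s.toUpperCase(Locale.ROOT).toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String a = "Stra\u00DFe"; // written with sharp s
        String b = "STRASSE";     // the same word in capitals

        // Lowercasing keeps the sharp s, so the two spellings still differ.
        boolean lowerMatch = a.toLowerCase(Locale.ROOT).equals(b.toLowerCase(Locale.ROOT));
        // The approximate fold expands the sharp s, so the spellings match.
        boolean foldMatch = approxFold(a).equals(approxFold(b));

        System.out.println("match after lowercasing: " + lowerMatch);
        System.out.println("match after folding:     " + foldMatch);
    }
}
```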


On Fri, May 22, 2009 at 12:38 AM, KK <dioxide.software@gmail.com> wrote:

> Thank you all.
> @Muir
> Thanks for sharing your views. I'd like some more details on the process
> you mentioned, as I have absolutely no idea about this highlighting stuff
> and could not make much out of your mail. Can you point me to some
> tutorials or good write-ups on the topic? If you have written something
> yourself, please share the pointers; it'll help me a lot.
> Pointers to the Unicode default algorithms mentioned in your mail will be
> equally helpful.
>
> Thanks,
> KK.
>
> On Thu, May 21, 2009 at 8:03 PM, Robert Muir <rcmuir@gmail.com> wrote:
>
> > its definitely an area in lucene that could use some improvement.
> >
> > my recommendation for multilingual text is to apply the unicode "default"
> > algorithms:
> >
> > Tokenize text according to UAX #29: unicode text segmentation
> > Apply full case-folding (unicode ch. 3.13) with FC_NFKC closure
> > Apply UAX #15: unicode normalization
> >
> > for now you will have to write code to do this, but i'm looking forward to
> > contributing my implementation soon.
> >
> > i definitely feel your pain.
> >
> > On Thu, May 21, 2009 at 9:12 AM, Joel Halbert <joel@su3analytics.com>
> > wrote:
> >
> > >
> > > > > If I index english pages
> > > > > with the same indexer, it will not take care of stemming and stop word
> > > > > removal?
> > >
> > > correct
> > >
> > >
> > > > Cant we have a single indexer that handles non-eng and eng in
> > > > equally good ways?
> > >
> > > > You can have a single indexer, but if you wanted to use one Analyzer for
> > > > English documents (with stemming/stops) and another Analyzer for
> > > > documents in other languages, then you would need to know, at the point
> > > > of both *indexing* and *querying*, what language your indexed document
> > > > and your query were in.
> > > >
> > > > This makes the assumption that when a query is in English you only want
> > > > to query English-language docs, and vice versa.
> > > > You would also have to mark up your documents with a language identifier
> > > > (e.g. 0=English, 1=Other Languages) so that when you query you have a
> > > > conditional on the language.
> > >
> > >
> > >
> > > I've not had to deal with multi-language documents though - so I'm sure
> > > others will be better placed to offer their experience.
> > >
> > >
> > >
> > > -----Original Message-----
> > > From: KK <dioxide.software@gmail.com>
> > > Reply-To: java-user@lucene.apache.org
> > > To: java-user@lucene.apache.org
> > > Subject: Re: hit highlighting in lucene ?
> > > Date: Thu, 21 May 2009 18:31:44 +0530
> > >
> > > > Initially I was using StandardAnalyzer, but I switched to SimpleAnalyzer,
> > > > which I guess does not do more than tokenizing [and may be tokenizing],
> > > > and I think it does not do stemming, which I don't/can't do because I
> > > > have no stemmer for the languages I'm indexing.
> > > > For indexing and querying I'm using the same SimpleAnalyzer. So, as you
> > > > say, I can go for the standard highlighter API which I mentioned in my
> > > > last mail, and this will handle any language for highlighting support. I
> > > > should start using this one, right?
> > >
> > > > One more thing. I have a single indexer and searcher that I'm using for
> > > > indexing pages in many different non-English languages and, as I
> > > > mentioned earlier, I'm using SimpleAnalyzer. Does that mean if I index
> > > > english pages with the same indexer, it will not take care of stemming
> > > > and stop word removal? But I don't want to have multiple indexers that
> > > > are specific to languages. Cant we have a single indexer that handles
> > > > non-eng and eng in equally good ways? Or any other ideas on the same?
> > >
> > > Thanks,
> > > KK.
> > >
> > > On Thu, May 21, 2009 at 6:18 PM, Joel Halbert <joel@su3analytics.com>
> > > wrote:
> > >
> > > > > The highlighter should be language independent, so long as you are
> > > > > consistent with your use of Analyzer between
> > > > > indexing/query/highlighting.
> > > > >
> > > > > As for the most appropriate Analyzer to use for your local language,
> > > > > this is a separate question - especially if you are using stop word and
> > > > > stemming filters.
> > > > >
> > > > > The StandardAnalyzer is designed for English since it uses the
> > > > > StopFilter (English words only).
> > > >
> > > >
> > > > -----Original Message-----
> > > > From: KK <dioxide.software@gmail.com>
> > > > Reply-To: java-user@lucene.apache.org
> > > > To: java-user@lucene.apache.org
> > > > Subject: hit highlighting in lucene ?
> > > > Date: Thu, 21 May 2009 17:51:13 +0530
> > > >
> > > > > Hi All,
> > > > > I was looking for various ways of implementing hit highlighting in Lucene
> > > > > and found some standard classes that support highlighting, like this:
> > > > > http://lucene.apache.org/java/2_2_0/api/org/apache/lucene/search/highlight/package-summary.html
> > > > >
> > > > > ik but what i believe is that this is only for english, or does it
> > > > > support other languages? I actually wanted to support highlighting for
> > > > > some non-english languages which I'm able to index and fetch using utf-8
> > > > > encoding. So this means that if I want to have highlighting then I've to
> > > > > get the utf-8 query and look for the same in the result and add apt tags
> > > > > wherever required; it essentially boils down to implementing the
> > > > > standard highlighter. I think the standard highlighter also supports
> > > > > other languages.
> > > > > Correct me if i'm wrong.
> > > > >
> > > > > Due to my requirement constraints I'm using just simpleAnalyzer and we
> > > > > dont have tokenizers for these regional languages. Any other ideas of
> > > > > doing the same would be helpful as well.
> > > >
> > > > Thanks,
> > > > KK.
> > > >
> > > >
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > >
> > > >
> > >
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > >
> > >
> >
> >
> > --
> > Robert Muir
> > rcmuir@gmail.com
> >
>



-- 
Robert Muir
rcmuir@gmail.com
