lucene-general mailing list archives

From Steven Rowe <>
Subject Re: text highlighting problem
Date Mon, 12 Mar 2007 14:35:58 GMT
Hi Jan,

It sounds like your "canonizer" reduces the number of Java characters in
the input.  Have you performed character decomposition, so that
each diacritic is a separate Java character, before the canonizer does
its work?  If so, the character offsets recorded by StandardTokenizer
will refer to the canonized text and will not match the original input
text.
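
To illustrate the kind of mismatch I mean, here is a minimal sketch using
only the JDK (java.text.Normalizer, available from Java 6).  The class
name and the stripMarks() helper are hypothetical stand-ins for your
canonizer, which I have not seen; the sample text is the Polish
"żółw pływa".

```java
import java.text.Normalizer;

public class OffsetShiftDemo {

    // Hypothetical stand-in for the canonizer: drops combining marks only.
    public static String stripMarks(String s) {
        return s.replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        // Precomposed sample text: z-dot-above, o-acute, l-stroke, w, ...
        String composed = "\u017C\u00F3\u0142w p\u0142ywa";

        // After decomposition each diacritic is its own Java char.
        // (l-stroke has no canonical decomposition, so it stays one char.)
        String decomposed = Normalizer.normalize(composed, Normalizer.Form.NFD);

        // What the tokenizer would see after the marks are stripped.
        String stripped = stripMarks(decomposed);

        System.out.println(decomposed.length());      // 12
        System.out.println(stripped.length());        // 10
        System.out.println(decomposed.indexOf('p'));  // 7
        System.out.println(stripped.indexOf('p'));    // 5
    }
}
```

The second word starts at offset 5 in the text the tokenizer sees, but at
offset 7 in the decomposed input, so a highlighter applying the
tokenizer's offsets to the input would place its tags two characters too
early.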

If the above is true:

1. You could implement character decomposition[1], as well as diacritic
stripping (which I assume is part of its job) in your canonizer, instead
of in some other pre-processing step.  This way, the character counts
will remain the same before and after the canonizer does its work.
(This may require you to add a character composition step to your

2. If the input text cannot be changed (that is, it must remain
decomposed), then you could have your canonizer put one space per
diacritic after each word containing one, instead of just stripping
diacritics.  In this way, word boundaries will remain in the same positions.
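
Both options can be sketched with the JDK alone.  This is only an
illustration under assumptions: the class and method names are
hypothetical, stripComposed() assumes Latin-script text such as Polish
where each precomposed character has a single base character, and
stripAndPad() treats any character that is not a letter or digit as the
end of a word.

```java
import java.text.Normalizer;

public class LengthPreservingStrip {

    // Option 1: input is composed; replace each char by its base char,
    // so the output has exactly as many chars as the input.
    public static String stripComposed(String composed) {
        StringBuilder out = new StringBuilder(composed.length());
        for (int i = 0; i < composed.length(); i++) {
            String nfd = Normalizer.normalize(
                    String.valueOf(composed.charAt(i)), Normalizer.Form.NFD);
            out.append(nfd.charAt(0)); // base char comes first in NFD
        }
        return out.toString();
    }

    // Option 2: input is decomposed; drop the marks, then append one
    // space per dropped mark at the end of the word it belonged to.
    public static String stripAndPad(String decomposed) {
        StringBuilder out = new StringBuilder(decomposed.length());
        int pending = 0; // marks dropped from the current word
        for (int i = 0; i < decomposed.length(); i++) {
            char c = decomposed.charAt(i);
            if (isMark(c)) {
                pending++;
                continue;
            }
            if (!Character.isLetterOrDigit(c)) {
                // Word ended: emit the padding before the delimiter.
                for (; pending > 0; pending--) out.append(' ');
            }
            out.append(c);
        }
        for (; pending > 0; pending--) out.append(' ');
        return out.toString();
    }

    private static boolean isMark(char c) {
        int type = Character.getType(c);
        return type == Character.NON_SPACING_MARK
                || type == Character.COMBINING_SPACING_MARK
                || type == Character.ENCLOSING_MARK;
    }

    public static void main(String[] args) {
        String composed = "\u017C\u00F3\u0142w p\u0142ywa";
        String decomposed = Normalizer.normalize(composed, Normalizer.Form.NFD);
        System.out.println("[" + stripComposed(composed) + "]");
        System.out.println("[" + stripAndPad(decomposed) + "]");
    }
}
```

With the padding approach the total length and the start offset of every
word match the decomposed input.  The end offset of a word that lost
marks will still be short by the number of marks removed from it, so the
highlighter may need to extend the match to the end of the padded word.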

Hope it helps,


JanK wrote:
> Hi !
>  I've got some problems with text highlighting. I'm using a special
> library that handles text reduction/canonization (the language is
> Polish). The reduced words have no inflection (which is a useful feature
> in the context of searching). The problem arises when I try to perform
> text highlighting, because the highlighting tags are not in the correct
> position (they are shifted), for example:
> this is s<br>ome tex</br> t to highlight. The canonization library is
> integrated with the analyzer in the following manner:
> public final class MyAnalyzer extends Analyzer
> {
>   public TokenStream tokenStream(String fieldName, Reader reader)
>   {
>     MyCanonizer textCanonizer = new MyCanonizer();
>     TokenStream ts =
>       new StandardTokenizer(textCanonizer.peformCanonization(reader));
>     return ts;
>   }
> }
> Could anybody say why the highlights are shifted and/or how to solve the
> problem ?
> Thanks,
> JanK
