lucene-solr-user mailing list archives

From Peter Wolanin <peter.wola...@acquia.com>
Subject Re: Error with highlighter and UTF-8 chars?
Date Mon, 23 Feb 2009 13:24:14 GMT
We are using Solr trunk (1.4), currently the build labeled "nightly
exported - yonik - 2009-02-05 08:06:00".

-Peter

On Mon, Feb 23, 2009 at 8:07 AM, Koji Sekiguchi <koji@r.email.ne.jp> wrote:
> Jacob,
>
> What Solr version are you using? There is a bug in the SolrHighlighter
> of Solr 1.3; you may want to look at:
>
> https://issues.apache.org/jira/browse/SOLR-925
> https://issues.apache.org/jira/browse/LUCENE-1500
>
> regards,
>
> Koji
>
>
> Jacob Singh wrote:
>>
>> Hi,
>>
>> We ran into a weird one today.  We have a document which is written in
>> German and every time we make a query which matches it, we get the
>> following:
>>
>> java.lang.StringIndexOutOfBoundsException: String index out of range: 2822
>>        at java.lang.String.substring(String.java:1935)
>>        at org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:274)
>>
>>
>> From source diving, it looks like Lucene's highlighter is calling
>> substring with an offset that is outside the bounds of the body field
>> it is highlighting.  Running an fq against the ID of the
>> document returns it fine (because no highlighting is done). I took
>> the body and tried to cut the first 2822 chars, and while that is near
>> the end of the body, it is still in range.
>>
>> Here is the related code:
>>
>> startOffset = tokenGroup.matchStartOffset;
>> endOffset = tokenGroup.matchEndOffset;
>> tokenText = text.substring(startOffset, endOffset);
>>
>>
>> This leads me to believe there is some mismatch between multibyte
>> string encoding and the way Lucene counts offsets.
>>
>> Any ideas here?  Tomcat is configured with UTF-8 btw.
>>
>> Best,
>> Jacob
>>
>>
>>
>
>
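
For illustration only, here is a minimal sketch of the kind of mismatch
Jacob suspects. It is not the confirmed cause of this exception (see the
two JIRA issues above for that). It assumes an offset was counted in
UTF-8 bytes and then passed to String.substring, which counts Java chars.
The class name, the sample text, and the safeSubstring helper are all
hypothetical and are not part of Lucene or Solr.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Lucene or Solr.
class OffsetMismatchDemo {

    // German sample text, written with Unicode escapes so the source file
    // encoding does not matter: "Groesse ueber Strasse" with umlauts and sharp s.
    static final String TEXT = "Gr\u00f6\u00dfe \u00fcber Stra\u00dfe";

    // Hypothetical guard around the substring call quoted above: returns null
    // instead of throwing when the offsets do not fit the text.
    static String safeSubstring(String text, int startOffset, int endOffset) {
        if (startOffset < 0 || startOffset > endOffset || endOffset > text.length()) {
            return null;
        }
        return text.substring(startOffset, endOffset);
    }

    public static void main(String[] args) {
        int charLength = TEXT.length();
        int byteLength = TEXT.getBytes(StandardCharsets.UTF_8).length;

        // Each of the four non-ASCII letters takes two bytes in UTF-8,
        // so the byte length is larger than the char length.
        System.out.println("chars: " + charLength + ", UTF-8 bytes: " + byteLength);

        // An end offset counted in bytes points past the end of the String.
        System.out.println("in range: " + safeSubstring(TEXT, 11, charLength));
        System.out.println("byte-based end offset: " + safeSubstring(TEXT, 11, byteLength));

        try {
            TEXT.substring(11, byteLength);
        } catch (StringIndexOutOfBoundsException e) {
            System.out.println("unguarded substring threw: " + e.getMessage());
        }
    }
}
```

If offsets and text really disagree like this, comparing body.length()
with the largest token end offset for the failing document would show it
directly.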



-- 
Peter M. Wolanin, Ph.D.
Momentum Specialist, Acquia, Inc.
peter.wolanin@acquia.com
