lucene-java-user mailing list archives

From Mark Miller <markrmil...@gmail.com>
Subject Re: Bug in Lucene 2.2.0 code? Simple code included (StringIndexOutOfBoundsException).
Date Mon, 30 Jul 2007 12:25:55 GMT
Hey Lukas,

I was being simplistic when I said that the text and TokenStream must be 
exactly the same. It's difficult to think of a reason why you would not 
want them to be the same, though. Each Token records the offsets where it 
can be found in the original text -- that is how the Highlighter knows 
where to highlight in the original text with only the Tokens to 
inspect. So if a Token is scored >0, then the offsets for that Token 
must be valid indexes into the text String (in the case of the 
SimpleHTMLFormatter, which only marks Tokens that score >0).

Now an issue I see you having:

The TokenStream for "example long text" is:
(term,startoffset,endoffset)

(example,0,7)
(long,8,12)
(text,13,17)

So for the query "example long" the Highlighter will highlight offsets 
0-7 and 8-12 in the source text. In your example, with the text only 
being "example" (7 characters), the attempt to highlight the Token "long" 
will index into the source text at offset 8, past the end of the String, 
and cause a StringIndexOutOfBoundsException.
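
To make the failure concrete, here is a minimal sketch in plain Java with no 
Lucene dependency. The Tok class and the highlight method are hypothetical 
stand-ins, not the Highlighter's actual implementation; they only mimic the 
idea of using Token offsets as substring indexes into the text.

```java
import java.util.Arrays;
import java.util.List;

public class OffsetDemo {
    /** Hypothetical stand-in for a Lucene Token: a term plus its offsets. */
    static final class Tok {
        final String term;
        final int start;
        final int end;

        Tok(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    /** Wraps each token's span of the text in <B> tags, using the offsets as indexes. */
    static String highlight(String text, List<Tok> hits) {
        StringBuilder out = new StringBuilder();
        int last = 0;
        for (Tok t : hits) {
            // substring throws if an offset lies beyond the end of the text
            out.append(text.substring(last, t.start));
            out.append("<B>").append(text.substring(t.start, t.end)).append("</B>");
            last = t.end;
        }
        out.append(text.substring(last));
        return out.toString();
    }

    public static void main(String[] args) {
        List<Tok> hits = Arrays.asList(new Tok("example", 0, 7), new Tok("long", 8, 12));

        // Text and offsets agree: highlighting works.
        System.out.println(highlight("example long text", hits));

        // Text is only "example": the offsets for "long" point past the end.
        try {
            highlight("example", hits);
        } catch (StringIndexOutOfBoundsException e) {
            System.out.println("out of bounds: " + e.getMessage());
        }
    }
}
```

The second call fails for the same reason as your program: the offsets 
describe a longer text than the one handed to the highlighter.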

In your case you are even worse off because you are building the 
TokenStream from a field that was added more than once. This gives you 
seemingly wrong offsets of:

(example,0,7)
(long,14,18)
(text,22,26)

Each word has its space accounted for twice. Maybe there is a reason for 
this, but it looks wrong. I have not investigated enough to know if 
TokenSources is responsible for this, or if core Lucene is the culprit. 
Even if it were done differently, though, there could still be problems 
with the spacing between words, since you are adding the words one at a 
time, with no separators, to the same field.
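
A quick way to see which offsets are usable is to check each one against the 
text before highlighting. The sketch below is plain Java; the matches helper 
is hypothetical, and it assumes the text being highlighted is the 
space-separated string "example long text".

```java
public class OffsetCheck {
    /** True if [start, end) lies inside the text and covers exactly the term. */
    static boolean matches(String text, String term, int start, int end) {
        return start >= 0 && start <= end && end <= text.length()
                && text.substring(start, end).equals(term);
    }

    public static void main(String[] args) {
        String text = "example long text";
        String[] terms = {"example", "long", "text"};
        int[][] expected = {{0, 7}, {8, 12}, {13, 17}};   // offsets for the full text
        int[][] observed = {{0, 7}, {14, 18}, {22, 26}};  // offsets from the repeated field

        for (int i = 0; i < terms.length; i++) {
            boolean e = matches(text, terms[i], expected[i][0], expected[i][1]);
            boolean o = matches(text, terms[i], observed[i][0], observed[i][1]);
            System.out.println(terms[i] + ": expected ok=" + e + ", observed ok=" + o);
        }
    }
}
```

Only the first observed offset pair lines up with the text; the other two 
point past the end of the 17-character String.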

Looking at your original email though, you may be trying to do something 
that is best done without the Highlighter.

In summary, you should use Document.getFields (more efficient if you 
are getting more than one field anyway) and so avoid the offset issues 
above.
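
One possible reading of that suggestion, as a rough sketch: treat each stored 
value of "small" separately and mark the ones that match a query term, with 
no offsets involved at all. This is plain Java with no Lucene calls; the 
values array stands in for whatever you read back from the stored fields, and 
the mark helper and the lowercase exact match are assumptions that only make 
sense because the field is UN_TOKENIZED.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class PerValueHighlight {
    /** Wraps every value that equals a query term (ignoring case) in <B> tags. */
    static String[] mark(String[] values, Set<String> queryTerms) {
        String[] out = new String[values.length];
        for (int i = 0; i < values.length; i++) {
            boolean hit = queryTerms.contains(values[i].toLowerCase(Locale.ROOT));
            out[i] = hit ? "<B>" + values[i] + "</B>" : values[i];
        }
        return out;
    }

    public static void main(String[] args) {
        // Stand-in for the three stored values of the "small" field.
        String[] values = {"example", "long", "text"};
        // Stand-in for the terms of the query "example text".
        Set<String> queryTerms = new HashSet<String>(Arrays.asList("example", "text"));

        System.out.println(Arrays.toString(mark(values, queryTerms)));
    }
}
```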

- Mark

Lukas Vlcek wrote:
> Mark,
> thank you for this. I will wait for your other responses.
> This will keep me going on :-)
>
> I didn't know that there is a design restriction in Lucene that the text and
> TokenStream must be exactly the same (still this seems redundant, I will
> dive into Lucene API more).
>
> BR
> Lukas
>
> On 7/29/07, Mark Miller <markrmiller@gmail.com> wrote:
>   
>> I am going to try to write up some more info for you tomorrow, but
>> just to point out: I do think there is a bug in the way offsets are
>> being handled. I don't think this is causing your current problem (what
>> I mentioned is) but it will probably cause you problems down the road. I
>> will look into this further.
>>
>> - Mark
>>
>> Lukas Vlcek wrote:
>>     
>>> Hi Lucene experts,
>>>
>>> The following is simple Lucene code which generates a
>>> StringIndexOutOfBoundsException. I am using the official Lucene 2.2.0
>>> release. Can anyone tell me what is wrong with this code? Is this a bug or
>>> a feature of Lucene? Any comments/hints highly welcomed!
>>>
>>> In a nutshell I have a document with two (or four) fields:
>>> 1) all
>>> 2-4) small
>>>
>>> I use [all] for searching and [small] for highlighting.
>>>
>>> [package and imports truncated...]
>>>
>>> public class MemoryIndexCase {
>>>     static public void main(String[] arg) {
>>>
>>>         Document doc = new Document();
>>>
>>>         doc.add(new Field("all","example long text",
>>>                 Field.Store.NO, Field.Index.TOKENIZED));
>>>         doc.add(new Field("small","example",
>>>                 Field.Store.YES, Field.Index.UN_TOKENIZED,
>>>                 Field.TermVector.WITH_POSITIONS_OFFSETS));
>>>         doc.add(new Field("small","long",
>>>                 Field.Store.YES, Field.Index.UN_TOKENIZED,
>>>                 Field.TermVector.WITH_POSITIONS_OFFSETS));
>>>         doc.add(new Field("small","text",
>>>                 Field.Store.YES, Field.Index.UN_TOKENIZED,
>>>                 Field.TermVector.WITH_POSITIONS_OFFSETS));
>>>
>>>         try {
>>>             Directory idx = new RAMDirectory();
>>>             IndexWriter writer = new IndexWriter(idx,
>>>                     new StandardAnalyzer(), true);
>>>
>>>             writer.addDocument(doc);
>>>             writer.optimize();
>>>             writer.close();
>>>
>>>             Searcher searcher = new IndexSearcher(idx);
>>>
>>>             QueryParser qp = new QueryParser("all",
>>>                     new StandardAnalyzer());
>>>             Query query = qp.parse("example text");
>>>             Hits hits = searcher.search(query);
>>>
>>>             Highlighter highlighter =
>>>                     new Highlighter(new QueryScorer(query));
>>>
>>>             IndexReader ir = IndexReader.open(idx);
>>>             for (int i = 0; i < hits.length(); i++) {
>>>
>>>                 String text = hits.doc(i).get("small");
>>>
>>>                 TermFreqVector tfv =
>>>                     ir.getTermFreqVector(hits.id(i), "small");
>>>                 TokenStream tokenStream =
>>>                     TokenSources.getTokenStream((TermPositionVector) tfv);
>>>
>>>                 String result =
>>>                     highlighter.getBestFragment(tokenStream,text);
>>>                 System.out.println(result);
>>>             }
>>>
>>>         } catch (Throwable e) {
>>>             e.printStackTrace();
>>>         }
>>>     }
>>> }
>>>
>>> The exception is:
>>> java.lang.StringIndexOutOfBoundsException: String index out of range: 11
>>>     at java.lang.String.substring(String.java:1935)
>>>     at org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:235)
>>>     at org.apache.lucene.search.highlight.Highlighter.getBestFragments(Highlighter.java:175)
>>>     at org.apache.lucene.search.highlight.Highlighter.getBestFragment(Highlighter.java:101)
>>>     at org.lucenetest.MemoryIndexCase.main(MemoryIndexCase.java:70)
>>>
>>> Best regards,
>>> Lukas
>>>
>>>
>>>       
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>>     
>
>   
