lucene-dev mailing list archives

From Mark Miller <markrmil...@gmail.com>
Subject Re: [jira] Updated: (LUCENE-1448) add getFinalOffset() to TokenStream
Date Tue, 11 Nov 2008 20:53:56 GMT
Michael McCandless wrote:
> This stuff is confusing!  I think your numbers are not right.  Let's 
> try reformatting with CHAR=POS.
>
> Here's your example without the +1:
>
> 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
> a=0 b=1 c=2 d=3  =4 t=5 h=6 e=7 c=8 r=9 u=10 n=11 c=12 h=13  =14 m=15 
> a=16 n=17
>
>   abcd 0-4
> crunch 8-14
>    man 15-18
>
> This is not how Lucene works today.  Lucene adds the +1 ("virtual
> space character"):
>
> 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
> a=0 b=1 c=2 d=3  =4 t=5 h=6 e=7  =8 c=9 r=10 u=11 n=12 c=13 h=14  =15 
> m=16 a=17 n=18
>
>   abcd 0-4
> crunch 9-15
>    man 16-19
>
> I think?
>
Man, I'm sorry. I just reran my stuff and it didn't jibe with my earlier 
results. It looks like we should keep the +1 off. Don't know what 
happened... I have Java code and Lucene calculating for me :)

At least that jibes with earlier reports of people saying that they have to 
insert that space to get things highlighted. Here are the results I get now:

Old:
a=0 b=1 c=2 d=3  =4 t=5 h=6 e=7 c=8 r=9 u=10 n=11 c=12 h=13  =14 m=15 
a=16 n=17
term:abcd s:0 e:4
term:crunch s:5 e:11
term:man s:12 e:15

New Without +1:

a=0 b=1 c=2 d=3  =4 t=5 h=6 e=7 c=8 r=9 u=10 n=11 c=12 h=13  =14 m=15 
a=16 n=17
term:abcd s:0 e:4
term:crunch s:8 e:14
term:man s:15 e:18

New With +1:

a=0 b=1 c=2 d=3  =4 t=5 h=6 e=7 c=8 r=9 u=10 n=11 c=12 h=13  =14 m=15 
a=16 n=17
term:abcd s:0 e:4
term:crunch s:9 e:15
term:man s:16 e:19
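
For anyone following the arithmetic, here is a rough sketch in plain Java (no Lucene classes) that reproduces the three result sets above. It rests on assumptions read off the character map, not on anything stated in the thread: the field has two values, "abcd the" and "crunch man"; "the" is dropped as a stop word; "Old" advances the offset base by the end offset of the last emitted token plus 1; and "New" advances it by the full length of the value plus an optional gap. OffsetSketch and offsets() are made-up names for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not Lucene code.
class OffsetSketch {

    // Returns lines like "term:abcd s:0 e:4" for a multi-valued field.
    // useFinalOffset=false: base advances by the last emitted token's end offset + gap.
    // useFinalOffset=true:  base advances by the whole value's length + gap.
    static List<String> offsets(String[] values, Set<String> stopWords,
                                boolean useFinalOffset, int gap) {
        List<String> out = new ArrayList<>();
        int base = 0; // offset contributed by the earlier values
        for (String value : values) {
            int lastEnd = 0; // end offset of the last token actually emitted
            int i = 0;
            int n = value.length();
            while (i < n) {
                while (i < n && value.charAt(i) == ' ') i++; // skip spaces
                int start = i;
                while (i < n && value.charAt(i) != ' ') i++; // scan one word
                if (start < i) {
                    String term = value.substring(start, i);
                    if (!stopWords.contains(term)) {
                        out.add("term:" + term + " s:" + (base + start)
                                + " e:" + (base + i));
                        lastEnd = i;
                    }
                }
            }
            base += (useFinalOffset ? n : lastEnd) + gap;
        }
        return out;
    }

    public static void main(String[] args) {
        String[] values = {"abcd the", "crunch man"}; // assumed field values
        Set<String> stop = Set.of("the");             // assumed stop word
        System.out.println("Old:            " + offsets(values, stop, false, 1));
        System.out.println("New without +1: " + offsets(values, stop, true, 0));
        System.out.println("New with +1:    " + offsets(values, stop, true, 1));
    }
}
```

Under those assumptions, the only thing that changes between the three runs is how far the base moves after the first value: 5 (old), 8 (new without +1), or 9 (new with +1).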


We are on the same page and I'm sorry for taking you down that path - 
except now you might be more sure it doesn't belong ;)

I see some of my initial and continued confusion was caused by that char 
tokenizer bug...your original tests now look right (second abcd starting 
at 8 rather than 7).


