lucene-dev mailing list archives

From Hiroaki Kawai <ka...@apache.org>
Subject Re: NGrams and positions
Date Fri, 16 May 2008 15:13:52 GMT
I think LUCENE-1224 is more complex than LUCENE-1225.

First, I want to solve LUCENE-1225. It might be simpler
to understand.

For LUCENE-1224, I ran into the same issue. My current
understanding is that it comes from a mismatch between TokenFilter
and position. I apologize that the patch is confusing. I'm aware that
the patch still has another issue.

When we tokenize a stream with two TokenFilters,
the first will tokenize "abc fgh" into 'abc', 'fgh'.
The second will then tokenize those into 'ab' 'bc' 'fg' 'gh'.
The question is: what about the "position"?
The first TokenFilter will create an alignment, and the second
TokenFilter will create a subalignment within each token.
'abc', 'fgh' => position 0, 1
'ab' 'bc' 'fg' 'gh' => position 0-0, 0-1, 1-0, 1-1
Current Lucene can't handle the subalignment, and I think supporting
subalignment would be overkill.
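
To make the alignment/subalignment idea concrete, here is a minimal
sketch in plain Python. It does not use the Lucene API; the function
names are hypothetical and the "position-subposition" pair is only the
notation from the example above, not something Lucene stores.

```python
def whitespace_tokens(text):
    """First stage (assumed): split on whitespace, return (position, token) pairs."""
    return list(enumerate(text.split()))


def ngrams_with_subpositions(text, n=2):
    """Second stage (assumed): emit n-grams of each token.

    Each result is (gram, position, subposition), where position is the
    index of the parent token and subposition is the gram's offset inside it.
    """
    out = []
    for pos, tok in whitespace_tokens(text):
        for sub in range(len(tok) - n + 1):
            out.append((tok[sub:sub + n], pos, sub))
    return out


for gram, pos, sub in ngrams_with_subpositions("abc fgh"):
    print(f"{gram!r} => position {pos}-{sub}")
```

The point of the sketch is that each gram needs two numbers to be
located exactly, while a position increment carries only one.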

By the way, I want to revise the patch for LUCENE-1224.
Aligning 'ab' and 'bc' at the same position was bad, sorry!
LUCENE-1225 is good, I think.


Grant Ingersoll <gsingers@apache.org> wrote:
> See https://issues.apache.org/jira/browse/LUCENE-1224
> 
> Do people have an opinion on what positions ngrams should be output  
> at?  For instance, given 1-grams on "abc fgh", these are currently  
> output as: a, b, c, f, g, h all with a position increment of 1.  That  
> seems somewhat reasonable, but it has tradeoffs, namely you can't  
> query for something like: "a f" without some amount of slop, which I  
> think is a reasonable thing to do (but don't have an actual use case  
> for at the moment.)  An alternative way might be to output a, b, c all  
> at the same position, then increment for f and then put g and h at the  
> same position.
> 
> I am _wondering_ whether it makes more sense to add an option to the  
> NGram token streams such that we could have the choice of either  
> outputting the n-grams within a "token" at the same position or at  
> successive positions (to be back-compatible.)  It isn't clear to me  
> which is correct, or if there is even a notion of correctness here, in  
> so much as they are both correct if that is the functionality you want  
> in your application.  As DM Smith noted, if Lucene supported the  
> notion of "sub" positions, one could output 1.a, 1.b, 1.c, 2.a, 2.b  
> and 2.c for the example above, but that capability doesn't exist in  
> Lucene right now, AFAIK.
> 
> -Grant
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
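
The two options Grant describes for 1-grams on "abc fgh" can be
compared with a small sketch in plain Python. It does not use the
Lucene API; the function names and the boolean flag are hypothetical.
It assumes absolute positions are the running sum of position
increments, with the first term landing at position 0.

```python
def unigram_increments(text, same_position_within_token):
    """Return (term, position_increment) pairs for 1-grams of each token.

    same_position_within_token=False: every gram gets increment 1.
    same_position_within_token=True: only the first gram of each token
    gets increment 1; the rest get 0 and share its position.
    """
    out = []
    for tok in text.split():
        for i, ch in enumerate(tok):
            inc = 0 if (same_position_within_token and i > 0) else 1
            out.append((ch, inc))
    return out


def absolute_positions(grams):
    """Turn increments into absolute positions (running sum, starting at -1)."""
    pos = -1
    result = []
    for term, inc in grams:
        pos += inc
        result.append((term, pos))
    return result


successive = absolute_positions(unigram_increments("abc fgh", False))
stacked = absolute_positions(unigram_increments("abc fgh", True))
print(successive)
print(stacked)
```

With successive positions, 'a' and 'f' are three positions apart, so
the exact phrase "a f" does not match without slop. With the grams of
each token at one position, 'f' sits exactly one position after 'a',
so the phrase matches, but the order of 'a', 'b', 'c' inside the token
is no longer recorded.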

