lucene-dev mailing list archives

From "Otis Gospodnetic (JIRA)" <j...@apache.org>
Subject [jira] Issue Comment Edited: (LUCENE-1224) NGramTokenFilter creates bad TokenStream
Date Thu, 15 May 2008 16:07:55 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12597174#action_12597174 ]

otis edited comment on LUCENE-1224 at 5/15/08 9:07 AM:
-------------------------------------------------------------------

Hiroaki:
I agree with Grant about unit tests.  I looked at the unit tests and thought the same thing
as Grant - why is Hiroaki adding indexing/searching into the mix?  Your change is about modifying
the positions of n-grams, and you don't need to index or search for that.  The test will be
a lot simpler if you just test for positions, like Grant suggested.

Also, once you change the unit test this way, it will be a lot easier to play with positions
and figure out what the "right" way to handle positions is.
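
To illustrate what a position-only check could look like, here is a minimal sketch in plain
Java. It does not use Lucene at all: Gram and NGramPositionsSketch are hypothetical stand-ins
(not the classes in the patch), and the generator simply produces the ordering described as
"expected" in the issue so that there is something to assert against.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a Lucene Token: only the term text and its position increment.
final class Gram {
    final String term;
    final int positionIncrement;

    Gram(String term, int positionIncrement) {
        this.term = term;
        this.positionIncrement = positionIncrement;
    }

    @Override
    public String toString() {
        return term + "(+" + positionIncrement + ")";
    }
}

// Hypothetical generator, assumed for illustration; it is not NGramTokenFilter.
final class NGramPositionsSketch {
    // Emits grams grouped by start offset. The first gram at each offset advances
    // the position by 1; longer grams at the same offset are stacked with increment 0.
    static List<Gram> grams(String text, int minGram, int maxGram) {
        List<Gram> out = new ArrayList<>();
        for (int start = 0; start < text.length(); start++) {
            boolean first = true;
            for (int n = minGram; n <= maxGram && start + n <= text.length(); n++) {
                out.add(new Gram(text.substring(start, start + n), first ? 1 : 0));
                first = false;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Position-only check: compare terms and increments, no index or searcher involved.
        List<Gram> actual = grams("abc", 2, 4);
        String[] terms = {"ab", "abc", "bc"};
        int[] increments = {1, 0, 1};
        if (actual.size() != terms.length) {
            throw new AssertionError("unexpected grams: " + actual);
        }
        for (int i = 0; i < terms.length; i++) {
            Gram g = actual.get(i);
            if (!g.term.equals(terms[i]) || g.positionIncrement != increments[i]) {
                throw new AssertionError("mismatch at " + i + ": " + g);
            }
        }
        System.out.println(actual);
    }
}
```

A real test would read the terms and increments from the filter's TokenStream instead, but
the assertions would have the same shape: one expected term and one expected increment per
token.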

Finally, it might turn out that people have different needs or different expectations for
n-gram positions.  Thus, when making changes, perhaps you can think of a mechanism that allows
the caller to instruct the n-gram tokenizer which token positioning approach to take (e.g.
the "incremental" one, or the one based on the position of the originating token, or...).
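
One possible shape for such a mechanism, sketched in plain Java under loud assumptions:
PositionMode, PositionModeSketch, and both mode names are made up for this example and are
not existing Lucene API, and the two modes are only my reading of "incremental" and
"position of the originating token".

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical names throughout; none of this is existing Lucene API.
enum PositionMode {
    INCREMENTAL,        // every gram advances the position by 1
    ORIGINATING_TOKEN   // all grams stay at the position of the token they came from
}

final class PositionModeSketch {
    // Returns "term:increment" strings for the grams of one input token,
    // with the increments chosen according to the caller-supplied mode.
    static List<String> grams(String token, int minGram, int maxGram, PositionMode mode) {
        List<String> out = new ArrayList<>();
        for (int n = minGram; n <= maxGram; n++) {
            for (int start = 0; start + n <= token.length(); start++) {
                int increment;
                if (mode == PositionMode.INCREMENTAL) {
                    increment = 1;
                } else {
                    // Only the first gram moves to the token's position; the rest stack on it.
                    increment = out.isEmpty() ? 1 : 0;
                }
                out.add(token.substring(start, start + n) + ":" + increment);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(grams("abc", 2, 4, PositionMode.INCREMENTAL));
        System.out.println(grams("abc", 2, 4, PositionMode.ORIGINATING_TOKEN));
    }
}
```

With the mode passed in by the caller, further approaches (for example the stacked ordering
from the issue description) could be added as extra enum values without changing the
signature.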


      was (Author: otis):
    Hiroaki:
I agree with Grant about unit tests.  I looked at the unit tests and thought the same thing
as Grant - why is Hiroaki adding indexing/searching into the mix?  Your change is about modifying
the positions of n-grams, and you don't need to index or search for that.  The test will be
a lot simpler if you just test for positions, like Grant suggested.
  
> NGramTokenFilter creates bad TokenStream
> ----------------------------------------
>
>                 Key: LUCENE-1224
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1224
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/*
>            Reporter: Hiroaki Kawai
>            Assignee: Grant Ingersoll
>            Priority: Critical
>         Attachments: LUCENE-1224.patch, NGramTokenFilter.patch, NGramTokenFilter.patch
>
>
> With the current trunk NGramTokenFilter(min=2,max=4), I index the string "abcdef" into an index,
> but I can't query it with "abc". If I query with "ab", I get a hit.
> The reason is that NGramTokenFilter generates a badly ordered TokenStream. A query is
> built from the Token order in the TokenStream; how stemming or a phrase should be analyzed
> is determined by that order (Token.positionIncrement).
> With the current filter, the query string "abc" is tokenized to: ab bc abc
> meaning "query a string that has ab bc abc in this order".
> The expected filter will generate: ab abc(positionIncrement=0) bc
> meaning "query a string that has (ab|abc) bc in this order".
> I'd like to submit a patch for this issue. :-)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

