lucene-dev mailing list archives

From "Hiroaki Kawai (JIRA)" <>
Subject [jira] Commented: (LUCENE-1224) NGramTokenFilter creates bad TokenStream
Date Thu, 15 May 2008 15:15:56 GMT


Hiroaki Kawai commented on LUCENE-1224:

About the test code: I'm not claiming that I'm right. I just wanted to raise the issue
and share what we need to solve. If you don't like the code, please tell me a better way
to do it. I initially put the code there because I thought it was reasonable and
proper, but I'm fine with changing it.

For example, I think it makes sense to search for "th ex" as a phrase query.

For example, I think it makes sense to search for "example" as a phrase query instead.

I want to point out that NGramTokenizer is very useful for languages that are not separated
by white space, for example Japanese. In that case, we won't search for "th ex", because that
query assumes the text is separated by white space. I want to search by a fragment of a text sequence.

I agree that this might be a big problem. IMHO, the issue comes from a conceptual mismatch
between TokenFilter and TermPosition. Should the discussion be moved to the mailing list?

> NGramTokenFilter creates bad TokenStream
> ----------------------------------------
>                 Key: LUCENE-1224
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/*
>            Reporter: Hiroaki Kawai
>            Assignee: Grant Ingersoll
>            Priority: Critical
>         Attachments: LUCENE-1224.patch, NGramTokenFilter.patch, NGramTokenFilter.patch
> With the current trunk NGramTokenFilter(min=2,max=4), I index the string "abcdef" into an index,
> but I can't query it with "abc". If I query with "ab", I get a hit.
> The reason is that NGramTokenFilter generates a badly ordered TokenStream. Queries depend
> on the Token order in the TokenStream: how stemming or phrases should be analyzed
> is based on that order (Token.positionIncrement).
> With the current filter, the query string "abc" is tokenized to: ab bc abc
> meaning "query a string that has ab bc abc in this order".
> The expected filter would generate: ab abc(positionIncrement=0) bc
> meaning "query a string that has (ab|abc) bc in this order".
> I'd like to submit a patch for this issue. :-)
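
The two orderings described in the issue can be illustrated with a small standalone
sketch. It does not use Lucene and is not the code from the attached patches: the class
NGramOrderSketch, the Gram holder, and both method names are hypothetical, and the
"current" ordering is only inferred from the "ab bc abc" example above.

```java
import java.util.ArrayList;
import java.util.List;

public class NGramOrderSketch {

    /** Hypothetical holder: a gram plus its position increment relative to the previous gram. */
    static final class Gram {
        final String term;
        final int positionIncrement;

        Gram(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }

        @Override
        public String toString() {
            return term + "(+" + positionIncrement + ")";
        }
    }

    /** Ordering reported as current: all grams of size min first, then size min + 1, and so on. */
    static List<String> groupedBySize(String text, int min, int max) {
        List<String> out = new ArrayList<>();
        for (int n = min; n <= max; n++) {
            for (int start = 0; start + n <= text.length(); start++) {
                out.add(text.substring(start, start + n));
            }
        }
        return out;
    }

    /**
     * Ordering described as expected: grams grouped by start offset. The first gram at an
     * offset advances the position by 1; longer grams at the same offset get increment 0.
     */
    static List<Gram> groupedByPosition(String text, int min, int max) {
        List<Gram> out = new ArrayList<>();
        for (int start = 0; start + min <= text.length(); start++) {
            int increment = 1;
            for (int n = min; n <= max && start + n <= text.length(); n++) {
                out.add(new Gram(text.substring(start, start + n), increment));
                increment = 0; // later grams at this offset share the same position
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(groupedBySize("abc", 2, 4));     // [ab, bc, abc]
        System.out.println(groupedByPosition("abc", 2, 4)); // [ab(+1), abc(+0), bc(+1)]
    }
}
```

With the size-grouped ordering, "abc" never sits next to "ab" and "bc" in the stream for
"abcdef", which is consistent with the failed query reported above. With the
position-grouped ordering, "ab" and "abc" share one position and "bc" follows at the next.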
