lucene-dev mailing list archives

From "Grant Ingersoll (JIRA)" <>
Subject [jira] Commented: (LUCENE-1224) NGramTokenFilter creates bad TokenStream
Date Thu, 15 May 2008 12:03:55 GMT


Grant Ingersoll commented on LUCENE-1224:

{quote}Umm..., if you don't like indexing and querying in the unit test, where should I place
the join test that uses NGramTokenizer? It might be nice if we could place that join test in
a proper place.{quote}

My point is that I don't think the test needs to do any indexing/querying at all to satisfy
the change.  It adds nothing to the test and only complicates the matter.

{quote}I placed the testIndexAndQuery in the code because other core tests, like
KeywordAnalyzer's, include index-and-query code in their unit tests.{quote}

Just because another test does it doesn't make it right.

If we tokenize "This is an example" with a whitespace tokenizer, the tokens are
"This", "is", "an", "example"
and the positions are 0, 1, 2, 3.

If we tokenize it with a 2-gram tokenizer, the tokens are
"Th" "hi" "is" "s " " i" "is" "s " " a" "an" "n " " e" "ex" "xa" "am" "mp" "pl" "le"
and the positions are 0, 1, 2, 3, 4, ..., 16.
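The current behavior described above, where every character n-gram gets the next sequential position exactly as word tokens would, can be sketched language-agnostically (this is an illustration of the positioning logic, not the Lucene API):

```python
def char_ngrams(text, n):
    """Return (token, position) pairs for all character n-grams of text,
    assigning sequential positions the way the current tokenizer does."""
    return [(text[i:i + n], i) for i in range(len(text) - n + 1)]

tokens = char_ngrams("This is an example", 2)
print(tokens[:4])  # [('Th', 0), ('hi', 1), ('is', 2), ('s ', 3)]
print(len(tokens))  # 17 bigrams, positions 0 through 16
```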

Yes, I understand how it currently works.  My question is more along the lines of: is this
the right way of doing it?  I don't know that it is, but it is a bigger question than you
and me.  I mean, if we are willing to accept that this issue is a bug, then it presents plenty
of other problems in terms of position-related queries.  For example, I think it makes sense
to search for "th ex" as a phrase query, but that is not possible due to the positions (at
least not w/o a lot of slop).
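To make the slop problem concrete, here is a small sketch (an illustration, not Lucene code) of how far apart the grams "Th" and "ex" land under sequential positions, which is why a phrase query for them cannot match as adjacent terms:

```python
def char_ngrams(text, n):
    """(token, position) pairs under the current sequential-position scheme."""
    return [(text[i:i + n], i) for i in range(len(text) - n + 1)]

positions = {}
for tok, pos in char_ngrams("This is an example", 2):
    positions.setdefault(tok, pos)  # record first occurrence of each gram

gap = positions["ex"] - positions["Th"]
print(gap)  # 11 -> a phrase query would need a slop of about 10 to match
```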

> NGramTokenFilter creates bad TokenStream
> ----------------------------------------
>                 Key: LUCENE-1224
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/*
>            Reporter: Hiroaki Kawai
>            Assignee: Grant Ingersoll
>            Priority: Critical
>         Attachments: LUCENE-1224.patch, NGramTokenFilter.patch, NGramTokenFilter.patch
> With the current trunk NGramTokenFilter(min=2, max=4), I index the string "abcdef" into an
index, but I can't query it with "abc". If I query with "ab", I do get a hit.
> The reason is that NGramTokenFilter generates a badly ordered TokenStream. Querying is
based on the Token order in the TokenStream: how stemming or phrases are analyzed depends
on that order (Token.positionIncrement).
> With the current filter, the query string "abc" is tokenized to: ab bc abc
> meaning "query a string that has ab bc abc in this order".
> The expected filter would generate: ab abc(positionIncrement=0) bc
> meaning "query a string that has (ab|abc) bc in this order".
> I'd like to submit a patch for this issue. :-)
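The expected ordering described in the issue, where grams are grouped by start offset and every gram after the first at a given offset gets positionIncrement 0, can be sketched as follows (a hypothetical illustration of the proposed behavior, not the actual patch):

```python
def graded_ngrams(text, min_n, max_n):
    """Return (token, position_increment) pairs, emitting grams grouped by
    start offset; grams sharing a start offset get increment 0."""
    out = []
    for start in range(len(text)):
        first = True
        for n in range(min_n, max_n + 1):
            if start + n <= len(text):
                out.append((text[start:start + n], 1 if first else 0))
                first = False
    return out

print(graded_ngrams("abc", 2, 4))
# [('ab', 1), ('abc', 0), ('bc', 1)] -- i.e. "(ab|abc) bc in this order"
```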

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

