lucene-dev mailing list archives

From "Robert Muir (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-1224) NGramTokenFilter creates bad TokenStream
Date Tue, 16 Feb 2010 21:13:28 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12834471#action_12834471 ]

Robert Muir commented on LUCENE-1224:
-------------------------------------

I too think it's really important that we fix this. I agree with Hiroaki's analysis of the
situation, and the problems can be seen by looking at the code in both the filter/tokenizers
and the tests themselves.

Currently the tokenizers are limited to 1024 characters (LUCENE-1227), which is closely related
to this issue.
Look at the test for 1,3 ngrams of "abcde":
{code}
public void testNgrams() throws Exception {
        NGramTokenizer tokenizer = new NGramTokenizer(input, 1, 3);
        assertTokenStreamContents(tokenizer,
          new String[]{"a","b","c","d","e", "ab","bc","cd","de", "abc","bcd","cde"}, 
{code}

In my opinion the output should instead be: a, ab, ...
With the current order, every 1-gram has to be emitted before the first 2-gram, so the tokenizer
will either always be limited to 1024 chars or must read the entire document into RAM.
The same problem exists for the EdgeNGram variants.
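
To make the ordering point concrete, here is a rough sketch in plain Java. It is not the contrib code and does not use the Lucene API; the class and method names (NGramOrderSketch, lengthMajor, positionMajor) are made up for illustration, and it works on chars rather than code points. lengthMajor reproduces the order in the test above, while positionMajor produces the "a, ab, ..." order while holding at most maxGram characters at a time. The sketch collects grams into a list only so the two orders can be compared; a real tokenizer would presumably emit them one at a time.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of Lucene.
final class NGramOrderSketch {

    // Length-major order, as in the quoted test: all minGram-grams first,
    // then the next length, and so on. The complete text must be available
    // before the first gram of the second length can be produced.
    static List<String> lengthMajor(String text, int minGram, int maxGram) {
        checkArgs(minGram, maxGram);
        List<String> grams = new ArrayList<>();
        for (int n = minGram; n <= maxGram; n++) {
            for (int start = 0; start + n <= text.length(); start++) {
                grams.add(text.substring(start, start + n));
            }
        }
        return grams;
    }

    // Position-major order: all grams starting at offset 0, then offset 1, ...
    // Only a sliding window of at most maxGram characters is buffered.
    static List<String> positionMajor(Reader input, int minGram, int maxGram) {
        checkArgs(minGram, maxGram);
        List<String> grams = new ArrayList<>();
        StringBuilder window = new StringBuilder(maxGram);
        boolean eof = false;
        try {
            while (true) {
                // Top up the window until it is full or the input is exhausted.
                while (!eof && window.length() < maxGram) {
                    int c = input.read();
                    if (c < 0) {
                        eof = true;
                    } else {
                        window.append((char) c);
                    }
                }
                if (window.length() < minGram) {
                    break; // too few characters left to form another gram
                }
                int longest = Math.min(maxGram, window.length());
                for (int n = minGram; n <= longest; n++) {
                    grams.add(window.substring(0, n));
                }
                window.deleteCharAt(0); // slide the window by one character
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return grams;
    }

    private static void checkArgs(int minGram, int maxGram) {
        if (minGram < 1 || minGram > maxGram) {
            throw new IllegalArgumentException("need 1 <= minGram <= maxGram");
        }
    }

    public static void main(String[] args) {
        System.out.println(lengthMajor("abcde", 1, 3));
        System.out.println(positionMajor(new StringReader("abcde"), 1, 3));
    }
}
```

Since positionMajor never looks further ahead than maxGram characters, an approach along these lines would not need a fixed 1024-character buffer.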

I agree with Grant's comment that the positions of the tokens are a somewhat philosophical question;
perhaps we need an option for this (where they are all posInc=1, or the posInc=0 is generated
based on whitespace). I guess we could accommodate both needs by having tokenizer/filter
variants too, but I'm not sure.
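
As a rough illustration of what such an option might look like, here is a plain-Java sketch (again not the Lucene API; NGramPositionSketch, Gram and the stackByStartOffset flag are hypothetical names). It assumes the scheme from the issue description, where grams sharing a start offset are stacked with posInc=0 after the first one; the whitespace-based variant is not covered.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of Lucene.
final class NGramPositionSketch {

    // A gram together with the position increment it would carry.
    static final class Gram {
        final String term;
        final int positionIncrement;

        Gram(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }

        @Override
        public String toString() {
            return term + "(posInc=" + positionIncrement + ")";
        }
    }

    // If stackByStartOffset is true, the shortest gram at each start offset
    // gets posInc=1 and the longer grams at the same offset get posInc=0.
    // If false, every gram gets posInc=1.
    static List<Gram> grams(String text, int minGram, int maxGram, boolean stackByStartOffset) {
        if (minGram < 1 || minGram > maxGram) {
            throw new IllegalArgumentException("need 1 <= minGram <= maxGram");
        }
        List<Gram> out = new ArrayList<>();
        for (int start = 0; start + minGram <= text.length(); start++) {
            int longest = Math.min(maxGram, text.length() - start);
            for (int n = minGram; n <= longest; n++) {
                boolean firstAtThisOffset = (n == minGram);
                int posInc = (stackByStartOffset && !firstAtThisOffset) ? 0 : 1;
                out.add(new Gram(text.substring(start, start + n), posInc));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(grams("abc", 2, 4, true));
        System.out.println(grams("abc", 2, 4, false));
    }
}
```

With stacking enabled, "abc" with min=2, max=4 yields ab, abc(posInc=0), bc, which is the stream the issue description asks for.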

The general problem I have with trying to determine a fix is that it will break backwards
compatibility, and I also know that EdgeNGram is being used for some purposes such as "auto-suggest".
So I don't really have any idea beyond making new filters/tokenizers, as I think there may be
another use case where the old behavior fits.


> NGramTokenFilter creates bad TokenStream
> ----------------------------------------
>
>                 Key: LUCENE-1224
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1224
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/*
>            Reporter: Hiroaki Kawai
>            Assignee: Grant Ingersoll
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-1224.patch, NGramTokenFilter.patch, NGramTokenFilter.patch
>
>
> With current trunk NGramTokenFilter(min=2,max=4), I index the string "abcdef" into an index,
> but I can't query it with "abc". If I query with "ab", I get a hit.
> The reason is that the NGramTokenFilter generates a badly ordered TokenStream. A query is
> based on the Token order in the TokenStream; how stemming or phrases should be analyzed
> is based on that order (Token.positionIncrement).
> With the current filter, the query string "abc" is tokenized to: ab bc abc
> meaning "query a string that has ab bc abc in this order".
> The expected filter will generate: ab abc(positionIncrement=0) bc
> meaning "query a string that has (ab|abc) bc in this order"
> I'd like to submit a patch for this issue. :-)


