lucene-dev mailing list archives

From "Hiroaki Kawai (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-1306) CombinedNGramTokenFilter
Date Wed, 18 Jun 2008 05:06:45 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12605836#action_12605836 ]

Hiroaki Kawai commented on LUCENE-1306:
---------------------------------------

First of all, my comment No.3 was not wrong, sorry. We don't have to insert a $^ token into
the ngram stream.

{quote}
I don't want separate fields for the prefix, inner and suffix grams, I want to use the same
single filter at query time. 
{quote}

I agree with that. :)

Then, let's consider the phrase query.
1. At store time, we want to store a sentence "This is a pen"
2. At query time, we want to query with "This is"

At store time, with WhitespaceTokenizer+CombinedNGramTokenFilter(2,2), we get:
^T Th hi is s$ ^i is s$ ^a a$ ^p pe en n$

At query time, with WhitespaceTokenizer+CombinedNGramTokenFilter(2,2), we get:
^T Th hi is s$ ^i is s$

We can find the stored sequence because it contains the query sequence as a contiguous run.
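
To make the containment check concrete, here is a minimal, self-contained sketch. It is not the CombinedNGramTokenFilter from the attached patch: the class and method names (PerWordGramSketch, wordGrams, textGrams, containsPhrase) are hypothetical, a plain whitespace split stands in for WhitespaceTokenizer, and a phrase match is approximated as a contiguous sub-list lookup.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative sketch only; all names here are hypothetical, not Lucene API.
class PerWordGramSketch {

    // Wrap one word in '^' and '$', then slide a window of size n over it.
    static List<String> wordGrams(String word, int n) {
        String marked = "^" + word + "$";
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= marked.length(); i++) {
            grams.add(marked.substring(i, i + n));
        }
        return grams;
    }

    // Split on whitespace (standing in for WhitespaceTokenizer) and gram each word.
    static List<String> textGrams(String text, int n) {
        List<String> grams = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            grams.addAll(wordGrams(word, n));
        }
        return grams;
    }

    // True if query occurs as a contiguous run inside stored (a phrase-style match).
    static boolean containsPhrase(List<String> stored, List<String> query) {
        return Collections.indexOfSubList(stored, query) >= 0;
    }

    public static void main(String[] args) {
        List<String> stored = textGrams("This is a pen", 2);
        List<String> query = textGrams("This is", 2);
        System.out.println(stored);
        System.out.println(query);
        System.out.println(containsPhrase(stored, query));
    }
}
```

Because every word carries its own markers, the grams of "This is" are exactly the leading grams of "This is a pen", so the lookup succeeds.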

{quote}
If you are creating ngrams over multiple words, say a sentence, then I state that there should
only be a prefix in the start of the sentence and a suffix in the end of the sentence and
that grams will contain whitespace.
{quote}

If so, at query time, with WhitespaceTokenizer+CombinedNGramTokenFilter(2,2), we get:
"^T","Th","hi","is","s "," i","is","s$"

We can't find the stored sequence because it does not contain the query sequence: the query
ends with s$, but the stored text continues past "is" and has no s$ there. An n-gram query
is always a phrase query at the micro level.
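
The mismatch can be reproduced with a small sketch under the same caveats as before: SentenceGramSketch and sentenceGrams are hypothetical names, the whole text is grammed as one string with markers only at its two ends, and a phrase match is approximated as a contiguous sub-list lookup. This is an assumption about how the sentence-level proposal would behave, not code from the patch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative sketch only; all names here are hypothetical, not Lucene API.
class SentenceGramSketch {

    // Mark only the start and end of the whole text; inner whitespace stays in the grams.
    static List<String> sentenceGrams(String text, int n) {
        String marked = "^" + text + "$";
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= marked.length(); i++) {
            grams.add(marked.substring(i, i + n));
        }
        return grams;
    }

    public static void main(String[] args) {
        List<String> stored = sentenceGrams("This is a pen", 2);
        List<String> query = sentenceGrams("This is", 2);
        System.out.println(stored);
        System.out.println(query);
        // The query's final gram "s$" has no counterpart at that position in stored.
        System.out.println(Collections.indexOfSubList(stored, query) >= 0);
    }
}
```

Here the stored grams continue with "s " after the second "is", while the query ends with "s$", so the contiguous lookup fails even though the query text is a prefix of the stored text.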

+1 for prefix and suffix markers in the token.

{quote}
Note, also, that one could use the "flags" to indicate what the token is. I know that's a
little up in the air just yet, but it does exist. 
{quote}

Yes, there are flags. Of course, we can use them. But I can't find a way to use them efficiently
in THIS CASE right now.

{quote}
This would mean that no stripping of special chars is required.
{quote}

Unfortunately, stripping is done outside of the ngram filter by WhitespaceTokenizer.

> CombinedNGramTokenFilter
> ------------------------
>
>                 Key: LUCENE-1306
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1306
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/analyzers
>            Reporter: Karl Wettin
>            Assignee: Karl Wettin
>            Priority: Trivial
>         Attachments: LUCENE-1306.txt
>
>
> Alternative NGram filter that produces tokens with composite prefix and suffix markers.
> {code:java}
> ts = new WhitespaceTokenizer(new StringReader("hello"));
> ts = new CombinedNGramTokenFilter(ts, 2, 2);
> assertNext(ts, "^h");
> assertNext(ts, "he");
> assertNext(ts, "el");
> assertNext(ts, "ll");
> assertNext(ts, "lo");
> assertNext(ts, "o$");
> assertNull(ts.next());
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

