lucene-dev mailing list archives

From "Karl Wettin (JIRA)" <>
Subject [jira] Commented: (LUCENE-1306) CombinedNGramTokenFilter
Date Tue, 17 Jun 2008 20:58:45 GMT


Karl Wettin commented on LUCENE-1306:

I'll refine and document this patch soon. Terribly busy though. Hasty responses:

bq. Should there be a way for the client of this class to specify the prefix and suffix char?

bq. 1. prefix and suffix chars should be configurable. Because user must choose a char that
is not used in the terms.

There are getters and setters, but nothing in the constructor.
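
As a rough illustration of what constructor-level configuration could look like next to the setters, here is a minimal sketch in plain Java. The class `MarkerConfig`, its accessor names and the `isSafeFor` check are hypothetical (the patch's actual accessors are not shown in this message); the `^` and `$` defaults are taken from the example in the issue description.

```java
// Hypothetical holder for the marker characters; not code from the patch.
class MarkerConfig {
    private char prefix = '^'; // default taken from the issue's example
    private char suffix = '$'; // default taken from the issue's example

    MarkerConfig() {}

    // Assumed convenience constructor, so markers can be set up front.
    MarkerConfig(char prefix, char suffix) {
        this.prefix = prefix;
        this.suffix = suffix;
    }

    char getPrefix() { return prefix; }
    char getSuffix() { return suffix; }
    void setPrefix(char prefix) { this.prefix = prefix; }
    void setSuffix(char suffix) { this.suffix = suffix; }

    // True if neither marker occurs in the term, so marked grams stay unambiguous.
    boolean isSafeFor(String term) {
        return term.indexOf(prefix) < 0 && term.indexOf(suffix) < 0;
    }
}
```

A caller indexing terms that may contain `$` could then pick other markers, for example `new MarkerConfig('\u0001', '\u0002')`.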

bq. Is having, for example, "^h" as the first bi-gram token really the right thing to do?
Would "^he" make more sense? I know that makes it 3 characters long, but it's 2 chars from
the input string. Not sure, so I'm asking.

I always considered 'start of word' and 'end of word' as a single character and a part of
n. I might be wrong though. I'll have to take a look at what other people did. It would not
be a very hard thing to include a setting for that.
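
The two readings of n can be compared with a small sketch. It is plain Java and independent of Lucene's TokenFilter API; the class `MarkerGrams` and the `markersCountTowardN` flag are hypothetical names, and the second branch is only one possible reading of the "^he" suggestion (edge grams emitted in addition to the inner grams).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not code from the patch.
class MarkerGrams {
    static List<String> grams(String term, int n, char prefix, char suffix,
                              boolean markersCountTowardN) {
        List<String> out = new ArrayList<>();
        if (markersCountTowardN) {
            // Markers are ordinary characters: slide a window of n over "^term$".
            String padded = prefix + term + suffix;
            for (int i = 0; i + n <= padded.length(); i++) {
                out.add(padded.substring(i, i + n));
            }
        } else {
            // Markers are extra: every gram keeps n characters of the input.
            if (term.length() < n) {
                return out;
            }
            out.add(prefix + term.substring(0, n));
            for (int i = 0; i + n <= term.length(); i++) {
                out.add(term.substring(i, i + n));
            }
            out.add(term.substring(term.length() - n) + suffix);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(grams("hello", 2, '^', '$', true));  // [^h, he, el, ll, lo, o$]
        System.out.println(grams("hello", 2, '^', '$', false)); // [^he, he, el, ll, lo, lo$]
    }
}
```

The first call reproduces the token sequence from the issue description; the second shows what a setting for the alternative might emit.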

bq. Is this primarily to distinguish between the edge and inner n-grams? If so, would it make
more sense to just make use of Token type variable instead?
bq. one could use the "flags" to indicate what the token is. 

I might be missing something in your line of questioning. I don't understand how it would help
to have the flag or token type, as they are not stored in the index.

I don't want separate fields for the prefix, inner and suffix grams, I want to use the same
single filter at query time. I typically pass down the gram boost in the payload, evaluated
on gram size, how far away it is from the prefix and suffix, etc.
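
To make the payload idea concrete, here is a minimal sketch of a boost computed from gram size and distance to the nearest marker, then packed into four bytes. The formula, the class `GramBoost` and its method names are assumptions made for illustration only; the message does not say how the boost is actually evaluated, and the byte encoding uses `java.nio.ByteBuffer` rather than any Lucene payload class.

```java
import java.nio.ByteBuffer;

// Hypothetical heuristic for illustration; not the formula used by the patch.
class GramBoost {
    // start is the gram's offset in the marked term (e.g. "^hello$"),
    // markedLength is the length of that marked term.
    static float boost(int gramSize, int start, int markedLength) {
        int distFromPrefix = start;
        int distFromSuffix = markedLength - (start + gramSize);
        int nearestEdge = Math.min(distFromPrefix, distFromSuffix);
        // Assumed rule: longer grams and grams closer to an edge score higher.
        return gramSize / (1.0f + nearestEdge);
    }

    // Pack the boost into a 4-byte array, the kind of value a payload could carry.
    static byte[] encode(float boost) {
        return ByteBuffer.allocate(4).putFloat(boost).array();
    }

    // Read the boost back from the 4-byte array.
    static float decode(byte[] payload) {
        return ByteBuffer.wrap(payload).getFloat();
    }

    public static void main(String[] args) {
        // "^h" in "^hello$": offset 0, so it touches the prefix edge.
        System.out.println(boost(2, 0, 7)); // 2.0
        // "ll" in "^hello$": offset 3, two characters from the suffix edge.
        System.out.println(decode(encode(boost(2, 3, 7))));
    }
}
```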

bq. 3. If you want to do a phrase query (for example, "This is"), we have to generate $^ token
in the gap to make the positions valid.

If you are creating ngrams over multiple words, say a sentence, then I'd say there should
only be a prefix at the start of the sentence and a suffix at the end of the sentence, and
that grams will contain whitespace. I never did phrase queries using grams but I'd probably
want prefix and suffix around each token. This is another good reason to keep them in the
same field with prefix and suffix markers in the token, isn't it?
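
The difference between the two cases can be sketched in plain Java. `SentenceGrams` and its methods are hypothetical names, the markers are fixed to `^` and `$` as in the issue's example, and no Lucene classes are involved.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not code from the patch.
class SentenceGrams {
    // Slide a window of n characters over an already marked string.
    static List<String> window(String marked, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= marked.length(); i++) {
            out.add(marked.substring(i, i + n));
        }
        return out;
    }

    // One prefix and one suffix for the whole text; grams may contain whitespace.
    static List<String> overSentence(String text, int n) {
        return window("^" + text + "$", n);
    }

    // Prefix and suffix around each whitespace-separated token.
    static List<String> perToken(String text, int n) {
        List<String> out = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // blank input yields no grams
            }
            out.addAll(window("^" + token + "$", n));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(overSentence("this is", 2)); // [^t, th, hi, is, s ,  i, is, s$]
        System.out.println(perToken("this is", 2));     // [^t, th, hi, is, s$, ^i, is, s$]
    }
}
```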

> CombinedNGramTokenFilter
> ------------------------
>                 Key: LUCENE-1306
>                 URL:
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/analyzers
>            Reporter: Karl Wettin
>            Assignee: Karl Wettin
>            Priority: Trivial
>         Attachments: LUCENE-1306.txt
> Alternative NGram filter that produces tokens with composite prefix and suffix markers.
> {code:java}
> ts = new WhitespaceTokenizer(new StringReader("hello"));
> ts = new CombinedNGramTokenFilter(ts, 2, 2);
> assertNext(ts, "^h");
> assertNext(ts, "he");
> assertNext(ts, "el");
> assertNext(ts, "ll");
> assertNext(ts, "lo");
> assertNext(ts, "o$");
> assertNull(ts.next());
> {code}
