lucene-dev mailing list archives

From "Chris Harris (JIRA)" <>
Subject [jira] Commented: (LUCENE-1370) Patch to make ShingleFilter output a unigram if no ngrams can be generated
Date Wed, 09 Sep 2009 18:19:57 GMT


Chris Harris commented on LUCENE-1370:

Any ideas on whether it will ever make sense for this patch to make it into the trunk? Some
random thoughts:

 * The latest patch does make ShingleFilter marginally less clean. (Maybe my least favorite
part is how fillShingleBuffer() is now responsible for setting firstToken; that it should
do so is not obvious from the method name.)
 * This patch's functionality could potentially be implemented without modifying ShingleFilter
itself. For example, instead of patching ShingleFilter we could have a ShingleFilterUnigramWrapper
class that delegates to ShingleFilter except when ShingleFilter fails to produce ngrams.
I'm a little worried that this would require using a CachingTokenFilter, which might not
be ideal from an efficiency perspective.
 * It might be good to rename outputUnigramIfNoNgrams to something like forceUnigramIfNoNgrams.
With the current naming scheme, you end up setting some contradictory-sounding options, e.g.
setting outputUnigrams==false and outputUnigramIfNoNgrams==true. If you look at the code this
might not be confusing, but it'd be nice if it were more straightforward without making you
look at the code.
 * I gather that some people have an interest in making a minShingleSize option. (See
I'm not sure how best to modify this patch should that get implemented. It might depend on
the typical use cases for minShingleSize, and if there's any overlap with my use case here.
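
To make the wrapper idea from the second bullet concrete, here is a rough plain-Java sketch.
It does not use the Lucene API at all: tokens are plain strings, bigramsOnly() is only a
stand-in for ShingleFilter with outputUnigrams==false, and the class and method names are
made up for illustration. The point is just to show where the buffering cost would come from.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: plain strings instead of Lucene tokens, hypothetical names.
class UnigramFallbackSketch {

    // Stand-in for ShingleFilter with outputUnigrams==false (bigrams only).
    static List<String> bigramsOnly(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            out.add(tokens.get(i) + " " + tokens.get(i + 1));
        }
        return out;
    }

    // Hypothetical wrapper: delegate to the shingler, and fall back to the
    // lone input token when the shingler produced nothing.
    static List<String> withUnigramFallback(Iterable<String> input) {
        // The whole input is buffered so the original token is still
        // available for the fallback; this is the caching cost in question.
        List<String> cached = new ArrayList<>();
        for (String token : input) {
            cached.add(token);
        }
        List<String> shingles = bigramsOnly(cached);
        if (shingles.isEmpty() && cached.size() == 1) {
            return cached;
        }
        return shingles;
    }

    public static void main(String[] args) {
        // prints [please]
        System.out.println(withUnigramFallback(Arrays.asList("please")));
        // prints [please divide, divide this]
        System.out.println(withUnigramFallback(Arrays.asList("please", "divide", "this")));
    }
}
```

Whether a real TokenFilter-based wrapper would need to cache the entire stream like this, or
could get away with less, is not something the sketch settles.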

> Patch to make ShingleFilter output a unigram if no ngrams can be generated
> --------------------------------------------------------------------------
>                 Key: LUCENE-1370
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>            Reporter: Chris Harris
>            Assignee: Karl Wettin
>         Attachments: LUCENE-1370.patch, LUCENE-1370.patch, LUCENE-1370.patch, ShingleFilter.patch
> Currently if ShingleFilter.outputUnigrams==false and the underlying token stream is only
one token long, then ShingleFilter won't return any tokens. This patch provides a new
option, outputUnigramIfNoNgrams; if this option is set and the underlying stream is only one
token long, then ShingleFilter will return that token, regardless of the setting of outputUnigrams.
> My use case here is speeding up phrase queries. The technique is as follows:
> First, do index-time analysis using ShingleFilter (using outputUnigrams==true), thereby
expanding things as follows:
> "please divide this sentence into shingles" ->
>  "please", "please divide"
>  "divide", "divide this"
>  "this", "this sentence"
>  "sentence", "sentence into"
>  "into", "into shingles"
>  "shingles"
> Second, do query-time analysis using ShingleFilter (using outputUnigrams==false and outputUnigramIfNoNgrams==true).
If the user enters a phrase query, it will get tokenized in the following manner:
> "please divide this sentence into shingles" ->
>  "please divide"
>  "divide this"
>  "this sentence"
>  "sentence into"
>  "into shingles"
> By doing phrase queries with bigrams like this, I can gain a very considerable speedup.
Without the outputUnigramIfNoNgrams option, a single-word query would tokenize like this:
> "please" ->
>    [no tokens]
> But thanks to outputUnigramIfNoNgrams, single words will now tokenize like this:
> "please" ->
>   "please"
> ****
> The patch also adds a little to the pre-outputUnigramIfNoNgrams option tests.
> ****
> I'm not sure if the patch in this state is useful to anyone else, but I thought I should
throw it up here and try to find out.
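
For anyone who wants to play with the tokenization described above without applying the
patch, here is a minimal plain-Java sketch of the intended behaviour. It is not the patch
and not Lucene's ShingleFilter: it works on a list of strings, handles bigrams only, and
the class and method names are assumptions made for illustration. Only the two flag names
come from the issue.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative model of the two options; bigrams only, hypothetical names.
class ShingleSketch {

    static List<String> shingles(List<String> tokens,
                                 boolean outputUnigrams,
                                 boolean outputUnigramIfNoNgrams) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            if (outputUnigrams) {
                out.add(tokens.get(i));
            }
            if (i + 1 < tokens.size()) {
                out.add(tokens.get(i) + " " + tokens.get(i + 1));
            }
        }
        // Fallback: a one-token stream yields no bigram, so emit the token
        // itself when asked to, unless it was already emitted as a unigram.
        if (out.isEmpty() && outputUnigramIfNoNgrams && tokens.size() == 1) {
            out.add(tokens.get(0));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> phrase =
            Arrays.asList("please divide this sentence into shingles".split(" "));
        List<String> word = Arrays.asList("please");

        System.out.println(shingles(phrase, true, false)); // index-time settings
        System.out.println(shingles(phrase, false, true)); // query-time settings
        System.out.println(shingles(word, false, false));  // prints []
        System.out.println(shingles(word, false, true));   // prints [please]
    }
}
```

In this model the fallback only fires when nothing else was emitted, so setting both
outputUnigrams==true and outputUnigramIfNoNgrams==true does not produce a duplicate token.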

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.