Mailing-List: contact dev-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@lucene.apache.org
Message-ID: <16530113.2601286342370789.JavaMail.jira@thor>
Date: Wed, 6 Oct 2010 01:19:30 -0400 (EDT)
From: "Steven Rowe (JIRA)" <jira@apache.org>
To: dev@lucene.apache.org
Subject: [jira] Updated: (LUCENE-1370) Add ShingleFilter option to output
 unigrams if no shingles can be generated
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


     [ https://issues.apache.org/jira/browse/LUCENE-1370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Rowe updated LUCENE-1370:
--------------------------------

    Attachment: LUCENE-1370.patch

bq. Looks good to me, I tested it and had no problems.

Thanks for testing, Robert.

bq. Maybe we should add the option to ShingleAnalyzerWrapper too?

Yes, I missed that one - I've added support for it in this version of the patch, along with one test in ShingleAnalyzerWrapperTest (for the single-token case).


> Add ShingleFilter option to output unigrams if no shingles can be generated
> ---------------------------------------------------------------------------
>
>                 Key: LUCENE-1370
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1370
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>    Affects Versions: 2.9.3, 3.0.2, 3.1, 4.0
>            Reporter: Chris Harris
>            Assignee: Steven Rowe
>             Fix For: 3.1, 4.0
>
>         Attachments: LUCENE-1370.patch, LUCENE-1370.patch, LUCENE-1370.patch, LUCENE-1370.patch, LUCENE-1370.patch, LUCENE-1370.patch, LUCENE-1370.patch, ShingleFilter.patch
>
>
> Currently if ShingleFilter.outputUnigrams==false and the underlying token stream is only one token long, then ShingleFilter.next() won't return any tokens. This patch provides a new option, outputUnigramIfNoNgrams; if this option is set and the underlying stream is only one token long, then ShingleFilter will return that token, regardless of the setting of outputUnigrams.
> My use case here is speeding up phrase queries. The technique is as follows:
> First, doing index-time analysis using ShingleFilter (using outputUnigrams==true), thereby expanding things as follows:
> "please divide this sentence into shingles" ->
>  "please", "please divide"
>  "divide", "divide this"
>  "this", "this sentence"
>  "sentence", "sentence into"
>  "into", "into shingles"
>  "shingles"
> Second, do query-time analysis using ShingleFilter (using outputUnigrams==false and outputUnigramIfNoNgrams==true). If the user enters a phrase query, it will get tokenized in the following manner:
> "please divide this sentence into shingles" ->
>  "please divide"
>  "divide this"
>  "this sentence"
>  "sentence into"
>  "into shingles"
> By doing phrase queries with bigrams like this, I can gain a very considerable speedup. Without the outputUnigramIfNoNgrams option, then a single word query would tokenize like this:
> "please" ->
>    [no tokens]
> But thanks to outputUnigramIfNoNgrams, single words will now tokenize like this:
> "please" ->
>   "please"
> ****
> The patch also adds a little to the pre-outputUnigramIfNoNgrams option tests.
> ****
> I'm not sure if the patch in this state is useful to anyone else, but I thought I should throw it up here and try to find out.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org