lucene-dev mailing list archives

From "Chris Harris (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-1370) Patch to make ShingleFilter output a unigram if no ngrams can be generated
Date Sun, 31 Aug 2008 19:53:44 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12627337#action_12627337 ]

Chris Harris commented on LUCENE-1370:
--------------------------------------

Karl, that's a good point about my setup being incompatible with nonzero slop. However, the
performance gains I'm seeing with this patch on my data are substantial. When I last tested
(same index, same machine, same number of threads in my testing process, etc.) and went from
analyzing my queries with

outputUnigrams==true

to analyzing with

outputUnigrams==false and outputUnigramIfNoNgrams==true

phrase queries ended up running something like 50x as fast, which is good, because the
initial performance wasn't acceptable.

The performance gains from outputUnigramIfNoNgrams were greater than those I got when I instead
tried moving the index to a solid state drive. (It was a fairly entry-level SSD, but
still.) It would be interesting to compare against moving to a machine with an obscene amount
of RAM. (Not quite sure what would count as "obscene", but my index is 90+GB. Maybe half of
that is taken up by stored fields.)

> Patch to make ShingleFilter output a unigram if no ngrams can be generated
> --------------------------------------------------------------------------
>
>                 Key: LUCENE-1370
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1370
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>            Reporter: Chris Harris
>         Attachments: ShingleFilter.patch
>
>
> Currently if ShingleFilter.outputUnigrams==false and the underlying token stream is only
> one token long, then ShingleFilter.next() won't return any tokens. This patch provides a new
> option, outputUnigramIfNoNgrams; if this option is set and the underlying stream is only one
> token long, then ShingleFilter will return that token, regardless of the setting of outputUnigrams.
> My use case here is speeding up phrase queries. The technique is as follows:
> First, do index-time analysis using ShingleFilter (using outputUnigrams==true), thereby
> expanding things as follows:
> "please divide this sentence into shingles" ->
>  "please", "please divide"
>  "divide", "divide this"
>  "this", "this sentence"
>  "sentence", "sentence into"
>  "into", "into shingles"
>  "shingles"
> Second, do query-time analysis using ShingleFilter (using outputUnigrams==false and outputUnigramIfNoNgrams==true).
> If the user enters a phrase query, it will get tokenized in the following manner:
> "please divide this sentence into shingles" ->
>  "please divide"
>  "divide this"
>  "this sentence"
>  "sentence into"
>  "into shingles"
> By doing phrase queries with bigrams like this, I can gain a very considerable speedup.
> Without the outputUnigramIfNoNgrams option, a single-word query would tokenize like this:
> "please" ->
>    [no tokens]
> But thanks to outputUnigramIfNoNgrams, single words will now tokenize like this:
> "please" ->
>   "please"
> ****
> The patch also adds a little to the pre-outputUnigramIfNoNgrams option tests.
> ****
> I'm not sure if the patch in this state is useful to anyone else, but I thought I should
> throw it up here and try to find out.
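
For anyone who wants to see the tokenization described in the quoted issue without applying the
patch, here is a minimal sketch in plain Java. It does not use Lucene at all: the class
ShingleSketch and its shingles method are hypothetical names made up for illustration, it only
handles bigrams, and the two boolean parameters merely mirror the option names from the issue.
It is not the code in ShingleFilter.patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustration only: mimics bigram shingling. This is not Lucene's ShingleFilter. */
class ShingleSketch {

    /**
     * Builds unigrams and/or bigram shingles from a list of tokens.
     * Hypothetical helper; the flags are named after the options in the issue.
     */
    static List<String> shingles(List<String> tokens,
                                 boolean outputUnigrams,
                                 boolean outputUnigramIfNoNgrams) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            if (outputUnigrams) {
                out.add(tokens.get(i));
            }
            // Emit a bigram only when a following token exists.
            if (i + 1 < tokens.size()) {
                out.add(tokens.get(i) + " " + tokens.get(i + 1));
            }
        }
        // Fallback: a one-token stream produces no bigrams, so emit the lone token.
        if (out.isEmpty() && outputUnigramIfNoNgrams && tokens.size() == 1) {
            out.add(tokens.get(0));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> sentence =
            Arrays.asList("please", "divide", "this", "sentence", "into", "shingles");
        // Index-time settings: unigrams plus bigrams.
        System.out.println(shingles(sentence, true, false));
        // Query-time settings: bigrams only.
        System.out.println(shingles(sentence, false, true));
        // Single-word query without the fallback: no tokens.
        System.out.println(shingles(Arrays.asList("please"), false, false));
        // Single-word query with the fallback: the word itself.
        System.out.println(shingles(Arrays.asList("please"), false, true));
    }
}
```

The fallback only fires when nothing else was emitted, so multi-word queries still come out as
pure bigrams and single-word queries still match the unigrams written at index time.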

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

