lucene-java-user mailing list archives

From Doug Turnbull <dturnb...@opensourceconnections.com>
Subject Re: Poor performances with Shingle and Phrase query
Date Thu, 21 Jan 2016 18:52:35 GMT
In my experience, shingles can hurt query performance because the term
dictionary grows quite a bit: there are far more unique bigrams than
unique words. Lookup time doesn't grow linearly with the number of
terms, but it still grows.
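
As a rough illustration of that growth, here is a minimal sketch in plain
Scala with no Lucene dependency. The object name ShingleVocabulary and its
helpers are hypothetical, and whitespace splitting only stands in for a real
analyzer chain. It counts the distinct unigrams and the distinct two-word
shingles in a toy sentence:

```scala
// Illustrative sketch only: plain Scala, not Lucene's ShingleFilter.
// All names here are hypothetical and chosen for this example.
object ShingleVocabulary {
  // Lowercase and split on whitespace; a stand-in for a real analyzer chain.
  def tokens(text: String): Vector[String] =
    text.toLowerCase.split("\\s+").filter(_.nonEmpty).toVector

  // Two-word shingles, joined with a single space.
  // Inputs with fewer than two tokens produce no shingles.
  def bigrams(ts: Vector[String]): Vector[String] =
    ts.sliding(2).collect { case Vector(a, b) => s"$a $b" }.toVector

  def main(args: Array[String]): Unit = {
    val ts = tokens("the cat sat on the mat and the dog sat on the rug")
    println(s"unique unigrams: ${ts.distinct.size}")
    println(s"unique bigrams:  ${bigrams(ts).distinct.size}")
  }
}
```

Even on a sentence this short the bigram vocabulary is already larger than
the word vocabulary, and the gap tends to widen on a real corpus because
most word pairs are rare.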

I haven't specifically compared the performance of shingles against phrase
queries, but your numbers don't strike me as particularly shocking given the
performance issues I've had in the past with larger term dictionaries.

Hope that helps
-Doug




On Thu, Jan 21, 2016 at 1:23 PM, Bertil Chapuis <bchapuis@gmail.com> wrote:

> Hello,
>
> I'm trying to improve the speed of an index when searching for long
> phrases. I performed some tests with the benchmark module. With a simple
> analyser and PhraseQueries, I get a throughput of 118 rec/sec. My test
> dataset is the latest dump of Wikipedia. Here are the filters I use at
> index and query time:
>
> var filter: TokenFilter = new StandardFilter(tokenizer)
> filter = new LowerCaseFilter(filter)
> filter = new EnglishPossessiveFilter(filter)
> filter = new StopFilter(filter, StopAnalyzer.ENGLISH_STOP_WORDS_SET)
> filter = new SnowballFilter(filter, "English")
>
> In order to improve performance, I tried adding a ShingleFilter and ran
> some benchmarks with PhraseQueries and BooleanQueries (Should, Must). In
> both cases I got a lower throughput (83 rec/sec and 84 rec/sec
> respectively). Here is the filter:
>
> var filter: TokenFilter = new StandardFilter(tokenizer)
> filter = new LowerCaseFilter(filter)
> filter = new EnglishPossessiveFilter(filter)
> filter = new StopFilter(filter, StopAnalyzer.ENGLISH_STOP_WORDS_SET)
> filter = new SnowballFilter(filter, "English")
> val shingleFilter = new ShingleFilter(filter, 2, 2)
> shingleFilter.setOutputUnigrams(false)
> filter = shingleFilter
>
> From what I read, performance should be better, but I'm unable to get
> the desired results. Does anyone have advice on the best way to use
> shingles to improve performance? Should I use some other form of Query?
>
> Thank you in advance for your help.
>
> Regards,
>
> Bertil
>



-- 
Doug Turnbull | Search Relevance Consultant | OpenSource Connections, LLC
<http://opensourceconnections.com> | 240.476.9983
Author: Relevant Search <http://manning.com/turnbull>
