lucene-java-user mailing list archives

From Jack Krupansky <jack.krupan...@gmail.com>
Subject Re: Poor performances with Shingle and Phrase query
Date Thu, 21 Jan 2016 20:08:38 GMT
Be sure to check whether your app is compute-bound or I/O-bound during this
process, that is, whether too little of your index is cached in system memory
so that each query requires I/O, and lots of it.
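
One rough way to run that check from inside the JVM is to compare wall-clock time with the CPU time consumed by the thread that runs the queries. The following is only a sketch under assumptions: `BoundCheck`, `measure`, and `cpuShare` are hypothetical names, the measurement covers the calling thread only (searches dispatched to other threads are not counted), and `getCurrentThreadCpuTime` may be unsupported or disabled on some JVMs. OS tools such as iostat or vmstat give a more direct answer.

```scala
import java.lang.management.ManagementFactory

object BoundCheck {
  /** Runs `work` on the calling thread and returns (wallNanos, cpuNanos). */
  def measure(work: () => Unit): (Long, Long) = {
    val threads = ManagementFactory.getThreadMXBean()
    val cpuStart = threads.getCurrentThreadCpuTime()
    val wallStart = System.nanoTime()
    work()
    val wall = System.nanoTime() - wallStart
    val cpu = threads.getCurrentThreadCpuTime() - cpuStart
    (wall, cpu)
  }

  /** Fraction of the wall-clock time that was spent on the CPU. */
  def cpuShare(wallNanos: Long, cpuNanos: Long): Double =
    if (wallNanos <= 0) 0.0 else cpuNanos.toDouble / wallNanos
}
```

Wrapping the benchmark's query loop in `BoundCheck.measure` and passing the result to `cpuShare` gives a share near 1.0 when the thread is mostly computing. A much lower share suggests the thread spends most of its time waiting, which for a search workload often means disk I/O.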

-- Jack Krupansky

On Thu, Jan 21, 2016 at 1:52 PM, Doug Turnbull <
dturnbull@opensourceconnections.com> wrote:

> In my experience, shingles can hurt query performance because the term
> dictionary grows quite a bit. There are far more unique bigrams than there
> are words. While the lookup time doesn't grow linearly with the number of
> terms, it still grows.
>
> I haven't specifically compared performance numbers for shingles vs. phrase
> queries, but your numbers don't strike me as particularly shocking given the
> performance issues I've had in the past with larger term dictionaries.
>
> Hope that helps
> -Doug
>
>
>
>
> On Thu, Jan 21, 2016 at 1:23 PM, Bertil Chapuis <bchapuis@gmail.com>
> wrote:
>
> > Hello,
> >
> > I'm trying to improve the speed of an index when searching for long
> > phrases. I performed some tests with the benchmark module. With a simple
> > analyser and PhraseQueries, I get a throughput of 118 rec/sec. My test
> > dataset is the latest dump of Wikipedia. Here are the filters I use at
> > indexing and query time:
> >
> > var filter: TokenFilter = new StandardFilter(tokenizer)
> > filter = new LowerCaseFilter(filter)
> > filter = new EnglishPossessiveFilter(filter)
> > filter = new StopFilter(filter, StopAnalyzer.ENGLISH_STOP_WORDS_SET)
> > filter = new SnowballFilter(filter, "English")
> >
> > In order to improve performance, I tried adding a ShingleFilter and ran
> > some benchmarks with PhraseQueries and BooleanQueries (Should, Must). In
> > both cases I got a lower throughput (83 rec/sec and 84 rec/sec,
> > respectively). Here is the filter:
> >
> > var filter: TokenFilter = new StandardFilter(tokenizer)
> > filter = new LowerCaseFilter(filter)
> > filter = new EnglishPossessiveFilter(filter)
> > filter = new StopFilter(filter, StopAnalyzer.ENGLISH_STOP_WORDS_SET)
> > filter = new SnowballFilter(filter, "English")
> > val shingleFilter = new ShingleFilter(filter, 2, 2)
> > shingleFilter.setOutputUnigrams(false)
> > filter = shingleFilter
> >
> > From what I read, performance should be better, but I'm unable to get
> > the desired results. Does anyone have advice on the best way to use
> > shingles in order to improve performance? Should I use some other form of
> > Query?
> >
> > Thank you in advance for your help.
> >
> > Regards,
> >
> > Bertil
> >
>
>
>
> --
> *Doug Turnbull **| *Search Relevance Consultant | OpenSource Connections
> <http://opensourceconnections.com>, LLC | 240.476.9983
> Author: Relevant Search <http://manning.com/turnbull>
> This e-mail and all contents, including attachments, is considered to be
> Company Confidential unless explicitly stated otherwise, regardless
> of whether attachments are marked as such.
>
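
Doug's point that there are more unique bigrams than unique words can be illustrated with a small, Lucene-free sketch. `ShingleVocabulary` and its methods are hypothetical names, and the toy sentence skips the stop-word removal and stemming used in the thread. The sketch only mimics what a 2-word shingle with a single-space separator looks like, then counts distinct terms.

```scala
object ShingleVocabulary {
  /** Adjacent token pairs joined by one space, similar to 2-word shingles. */
  def bigrams(tokens: Seq[String]): Seq[String] =
    if (tokens.length < 2) Seq.empty
    else tokens.sliding(2).map(_.mkString(" ")).toSeq

  /** Returns (distinct unigrams, distinct bigrams) for a token stream. */
  def distinctCounts(tokens: Seq[String]): (Int, Int) =
    (tokens.distinct.size, bigrams(tokens).distinct.size)
}

// Toy input: 10 tokens, 7 distinct words, 8 distinct bigrams.
val tokens = "the cat sat on the mat and the cat ate".split(" ").toSeq
val (unigramCount, bigramCount) = ShingleVocabulary.distinctCounts(tokens)
```

Even in ten tokens the bigram vocabulary is already larger than the word vocabulary. On a corpus the size of Wikipedia the gap would be expected to be far wider, which matches the larger term dictionary described above.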
