lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ian Lea <ian....@gmail.com>
Subject Re: Anyway to not bother scoring less good matches ?
Date Thu, 05 May 2011 10:59:49 GMT
See http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-1
for an excellent article and solution to the problem with common
words.

You could also consider using, and caching and reusing, filters for
the tnum and tracks fields.


--
Ian.


On Thu, May 5, 2011 at 11:31 AM, Paul Taylor <paul_t100@fastmail.fm> wrote:
> On 05/05/2011 11:13, Ahmet Arslan wrote:
>>>
>>> Yes correct, but I have looked and the list of
>>> optimizations before. What was clear from profiling was that
>>> it wasnt the searching part that was slow (a query run on
>>> the same index with only a few matching docs ran super fast)
>>> the slowness only occurs when there are loads of matching
>>> docs, and spends most of its time in scorer that is why I
>>> was trying to remove the poor matches.
>>
>> Okey all clear. Can you give us some example query strings where there are
>> loads of matching?
>>
>> Do you use stop word filter? Could it be case described as
>>
>> "As you approach the upper limits of a single machine,
>> extremely frequent terms (called stop words) can become very
>> expensive in the wrong query. If part of a top level BooleanQuery, a
>> SHOULD clause that appears in every document will cause a match and
>> score for every document in your index."
>>
>>
>> http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Scaling-Lucene-and-Solr
>>
> We used to use the default stop word list but have no stop words because
> Lucene is used to match very short fields  relating to Musicdata such as
> artist or album name, therefore the default stop words really need to be
> included to get good matches, for example how would you match the artist
> 'The The' otherwise, so use of a stop word word list is not an option.
>
> If people construct good queries thee is no problem, but the trouble is that
> many users just OR everything they are looking for because they don't want a
> good match rejected because just one term fails, but the problem is there
> are a number of very popular terms, for example the following query:
>
> tnum:(6) qdur:(189) artist:(tama) track:(ibata) tracks:(10) release:(the
> global rhythm september 2002)
>
> will match any song that is on an album with 10 tracks, any song which is
> trackno 6 on an album, and any release containg the word 'the' , when really
> what they are looking for is the song 'ibata' by artist 'tama',
>
> This matches over a million documents (songs) , but doesn't match any well,
> because the song 'ibata' by 'tama' isnt actually in the index !
>
> So I dont think the query is very good but I cannot force users to submit
> better queries, but I want to protect the server by reducing the time these
> kind of query take (upto 1 second as opposed to the more usual 100
> milliseconds) and I hope that forcing x number of terms to match would do
> that.
>
> Paul
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message