lucene-solr-user mailing list archives

From "Prasanna Josium" <prasanna.jos...@clustr.co.in>
Subject RE: Returned number of result rows as a function of maxScore or numFound.
Date Fri, 10 Jun 2016 04:03:32 GMT
Thanks Erick & Binoy,
I will try out the two-query technique. I guess this will work for the numFound-related issue.

I guess I was not very clear in stating my problem. The problem I'm dealing with is mostly
with maxScore.
I have a collection (~500K docs) in which I look for matches to the query.
Because of the nature of the data in the collection, a few docs get a very high score,
which soon fades to a very low score for the others (from 5 down to 0.5).
For some queries, even within the first 10 docs, 8 have scores between 5 and 3.8, and from
the 9th onwards the score falls to 0.4, 0.3 and so on into a long tail.

The business guys think that docs with a very low score compared to the high-scoring ones should
not be part of the result set
and must be cut off below a threshold defined as a percentage of maxScore. Any thoughts on
how to work with maxScore?
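
A minimal sketch of the cutoff described above, in Python. It assumes the docs come back as dicts with a "score" field (for example, by requesting fl=*,score) and that maxScore is read from the response; the function names are hypothetical. The second helper builds a filter string based on Solr's frange parser applied to query($q); whether that works on a given DSE/Solr 4.10 setup is an assumption to verify. Because the absolute cutoff depends on maxScore, a first request is still needed to learn it.

```python
def filter_by_relative_score(docs, max_score, fraction):
    """Keep only docs whose score is at least `fraction` of max_score.

    This works client-side, so it can only trim the rows already fetched.
    """
    cutoff = max_score * fraction
    return [d for d in docs if d["score"] >= cutoff]


def frange_filter(max_score, fraction):
    """Build an fq value that asks Solr to drop docs scoring below the cutoff.

    Assumption: the main query is available as $q and the frange parser
    is enabled on the server.
    """
    cutoff = max_score * fraction
    return "{!frange l=%.4f}query($q)" % cutoff


if __name__ == "__main__":
    docs = [{"id": "a", "score": 5.0}, {"id": "b", "score": 4.2},
            {"id": "c", "score": 3.8}, {"id": "d", "score": 0.4},
            {"id": "e", "score": 0.3}]
    kept = filter_by_relative_score(docs, max_score=5.0, fraction=0.5)
    print([d["id"] for d in kept])
    print(frange_filter(5.0, 0.5))
```

Scores are only comparable within a single query, so the fraction has to be applied per query rather than as one global absolute threshold.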

Thanks 
Prasanna Josium




-----Original Message-----
From: Erick Erickson [mailto:erickerickson@gmail.com] 
Sent: 09 June 2016 22:43
To: solr-user
Subject: Re: Returned number of result rows as a function of maxScore or numFound.

Why do this at all? I have a hard time understanding what benefit this is to the _user_.

And even returning 5% is risky. I mean, what happens for a query of *:*? For a corpus of 100M
docs that's still 5M documents, which would hurt.

Sure, you say, well I'll cap it at XXX docs. The principle still holds though.
Users usually don't want to deal with very many docs at a time.

If you must do this for some kind of reporting or something, just fire two queries. The first
has rows=0 and the second has rows set to 5% of the numFound returned the first time.
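
The two-query approach can be sketched roughly as follows, in Python. The helper names and the cap value are hypothetical, and the actual HTTP call is left out; the sketch only shows how the rows value for the second request could be derived from the numFound of the first.

```python
import math


def count_query_params(q):
    """Parameters for the first request: no rows, only numFound is needed."""
    return {"q": q, "rows": 0, "wt": "json"}


def rows_for_percent(num_found, percent, cap=1000):
    """Number of rows for the second request: `percent` of num_found.

    The cap (a hypothetical default) guards against queries such as *:*
    that match a huge share of the corpus.
    """
    if num_found <= 0:
        return 0
    wanted = math.ceil(num_found * percent / 100)
    return min(cap, max(1, wanted))


def page_query_params(q, num_found, percent, cap=1000):
    """Parameters for the second request, sized from the first response."""
    return {"q": q, "rows": rows_for_percent(num_found, percent, cap),
            "wt": "json"}


if __name__ == "__main__":
    print(rows_for_percent(100, 5))    # 100 hits at 5%
    print(rows_for_percent(1000, 5))   # 1000 hits at 5%
    print(page_query_params("foo", 100, 5))
```

The index can change between the two requests, so the second result is 5% of a numFound that may already be slightly stale.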

Under the covers, you really can't do this without writing some sort of custom collector.
Solr (well, Lucene) uses the rows parameter as the size of the list where the most relevant
docs are stored, and replaced as "better" docs come along. You can't know how many docs are
going to be found before you score them all.
So how would you know what 5% was when you start? You'd have to write something that would
keep 20X whatever your max was set to and then grow it as necessary... but by that time you
_might_ have already thrown away docs that should be in the expanded list. Or you'd
have to keep _all_ the results, which would usually be very expensive.
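
To illustrate the point, here is a simplified Python sketch of a fixed-size top-N collection. It is not Lucene's actual collector code, only an assumed model of the idea: the list size is fixed up front, weaker docs are discarded as better ones arrive, and the total hit count is only known once every match has been scored.

```python
import heapq


def collect_top_n(scored_docs, n):
    """Collect the n best (doc_id, score) pairs from a stream of matches.

    Returns (total_hits, top_docs). Docs pushed out of the heap are gone
    for good, which is why the list cannot be "expanded" afterwards.
    """
    heap = []   # min-heap of (score, doc_id); heap[0] is the weakest doc kept
    total = 0
    for doc_id, score in scored_docs:
        total += 1
        if len(heap) < n:
            heapq.heappush(heap, (score, doc_id))
        elif score > heap[0][0]:
            # A better doc arrived: evict the current weakest one.
            heapq.heapreplace(heap, (score, doc_id))
    top = sorted(heap, reverse=True)
    return total, [(doc_id, score) for score, doc_id in top]


if __name__ == "__main__":
    matches = [("a", 0.4), ("b", 5.0), ("c", 3.8), ("d", 0.3), ("e", 4.2)]
    total, top = collect_top_n(matches, 2)
    print(total, top)
```

With n fixed at 2, doc "c" is evicted when "e" arrives; if it later turned out that 5% of the hits meant 3 docs, "c" could not be recovered without re-running the query.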

All in all, I think a 2-query solution is much simpler than hacking into your own collector,
not to mention far more efficient in the general case.

Best,
Erick

On Wed, Jun 8, 2016 at 10:26 PM, Binoy Dalal <binoydalal93@gmail.com> wrote:
> I don't think you can do such a thing out of the box with Solr, but this is 
> pretty easy to achieve using a custom search component.
>
> Just write some custom code which will limit your resultset and plug 
> it into your request handler as the last component.
>
> On Thu, 9 Jun 2016, 08:53 Prasanna Josium, 
> <prasanna.josium@clustr.co.in>
> wrote:
>
>> Hi,
>> I use a DSE stack which has Solr 4.10.
>> I want to control the number of rows from result set as a percent of 
>> the max hit 'numFound' or  'maxScore' for a query.
>> e.g.,
>> 1) for a query 'foo', if I get 100 hits and I want the top 
>> 5% (say rows=5%), then I get only 5 rows.
>> For a query 'bar', if I get 1000 hits and I want the top 5% 
>> (rows=5%), then I get the top 50 rows.
>>
>> 2) for a query 'foo', if the maxScore is 4.5, I want to get, say, all 
>> records within 10% of maxScore, i.e. all records whose 
>> score is between
>> 4.5 and roughly 4.0 (this could be any number of records).
>>
>> In other words, the returned set is a percentage of the hits instead of a 
>> static row count.
>> Is there a way to do this readily or via some custom implementation?
>>
>> Thanks
>> Cheers
>> Prasanna Josium
>>
> --
> Regards,
> Binoy Dalal