lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Marcus Falck" <marcus.fa...@observer.se>
Subject SV: Sort problematics
Date Thu, 18 May 2006 20:53:23 GMT
 

________________________________

Från: Yonik Seeley [mailto:yseeley@gmail.com]
Skickat: to 2006-05-18 20:39
Till: java-user@lucene.apache.org
Ämne: Re: Sort problematics



On 5/18/06, Marcus Falck <marcus.falck@observer.se> wrote:
>> I'm well aware of the trade offs. But if you were aware of the large amounts of data
that this system should be able to search you woldn't propose the usage of a database.

>If you have a hard requirement of instantly seeing any update, you
can't use Lucene.  That's more database-like functionallity. That's
why I asked.

Instant in this case isn't really instant. Lets say that the MAXIMUM time that will be accepted
is 5 minutes. Since i have the altert service up and running all users that pays for immediate
alerts will get their hits from it.

>> Since I have an separate alert service for immediatly alerts up and running i may
be able to do trade offs with the data availability timings, and hold the indexsearcher open
for a longer period.

>That's pretty much a requirement for using Lucene to support a decent
query rate.

 I thought i had good query rates when i instantiated the IndexSearcher for every search.
I mean in my example index on 10GB i had response times under a second when queried using
a boolean query containing approximently 200 terms. But I will redesign using a more static
behavior for the IndexSearcher and recreate it on a reqular basis. After this redesign i suppose
the search will benefit from the fieldcache (until i recreate). And as you say 800MB is nothing.
I will without problems have atleast 4GB RAM in each search machine. 


>> But still. The memory is the problem.
>> I mean how much memory would the fieldcache take for 500 Millon newsletter articles?
Probably a lot,
>> ok the system is scaled out over different machines so in reality each machine won't
have 500 Million docs but maybe around 100Million.

>Depends on what you are sorting by... for an int/float 100M*4 or
800MB.  Big, but possible.

>> So i'm still interesting in changing the relevance.
>> Any ideas?

>Depends on what you are sorting by, and how many different ways you
>want to sort.  If it's a single sort criteria, you can use index-time
>boosts.  If you can sort multiple ways, avoiding the fieldcache
>probably won't help you because the time to retrieve the per-doc sort
>info via termvectors or stored fields will take too long.


There is however a problem with boosting the docs since the boost factor is of type float.
A float doesn't have the resolution needed to differ on a second basis. 

I will illustrate my sort/ranking need with an example: 

If i use lucene default implementation of the TermScorer and search for

"you" OR "her"

The term scorer will give higher score on documents containing both terms. This is a problem
(in our application) since in this case want the same score on documents as long as they contain
1 of the terms (since we are dealing with newsletter observation for companies they want to
get the hits ordered by date to get the complete overview).  I tested to rewrite the TermScorer
to give me the same score with success. So my question is. 
If i can modify the score at search time using the Score class why does everybody talk about
the Similarity class?

 

/ 

Marcus

 


 



Mime
View raw message