lucene-java-user mailing list archives

From "Russell M. Allen" <Russell.Al...@aebn.net>
Subject RE: Scoring a document (count?)
Date Fri, 04 Aug 2006 13:45:37 GMT
Doron, thanks for the code offer.  That would be great.  I was able to
get a partial implementation working myself, but I ran into some issues
(most of which are rooted in a lack of understanding of Lucene internals
on my part).  I am sure I can learn a few things from your solution to
this problem.

You may email me directly with the code, or if it's small enough, post
it to the list for posterity.

Thanks again!
-Russell

-----Original Message-----
From: Doron Cohen [mailto:DORONC@il.ibm.com] 
Sent: Thursday, August 03, 2006 6:04 PM
To: java-user@lucene.apache.org
Subject: RE: Scoring a document (count?)

Hi Russell,

I am also interested in the internals of Lucene's ranking and in how one
can/should alter the scoring. So far I have just been learning from the
existing code of Lucene's Scorers and Weights. Your question seemed
interesting, so as an exercise I implemented a quick scorer that returns
the raw tf (term frequency) as the score. It is not a production-quality
implementation, of course, but if you think it will help you I can share
the code. (I would have responded sooner, but my work computer was down
for a few days with a fan error... :-)

Regards,
Doron
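
For readers following the thread: the scorer itself was not posted in this message, so the following is only a minimal plain-Java sketch of what "raw tf as the score" means, not Doron's code and not a Lucene Scorer or Weight. The class name `RawTfSketch`, the helper `rawTf`, and the lowercase, split-on-non-alphanumerics tokenization are all assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only: no Lucene classes are used here.
final class RawTfSketch {

    // Returns how many tokens of `text` equal `term` (case-insensitive).
    // Assumed tokenization: lowercase, split on runs of non-letter/non-digit characters.
    static int rawTf(String text, String term) {
        String needle = term.toLowerCase(Locale.ROOT);
        int count = 0;
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (token.equals(needle)) {
                count++;
            }
        }
        return count;
    }

    // Scores every document by raw term frequency; documents that do not
    // contain the term are left out, as a query would not match them.
    static Map<String, Integer> score(Map<String, String> docs, String term) {
        Map<String, Integer> scores = new LinkedHashMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            int tf = rawTf(doc.getValue(), term);
            if (tf > 0) {
                scores.put(doc.getKey(), tf);
            }
        }
        return scores;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("movie-1", "The star of the show is the star.");
        docs.put("movie-2", "A studio picture with no big names.");
        docs.put("movie-3", "One star, one director.");
        System.out.println(score(docs, "star"));
    }
}
```

The point of the sketch is that the score is a plain occurrence count with no length normalization or idf factor, which is what makes it usable as a count(*)-style value.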

"Russell M. Allen" <Russell.Allen@aebn.net> wrote on 31/07/2006 07:35:50:

> Thank you for the reply, Doron!  You are exactly right about the SQL
> count(*).  I need the equivalent of GROUP BY and count().
>
> We have considered a 'joined' index where we would have a document for
> each permutation.  We discarded it (possibly prematurely) based on the
> rapid explosion in the number of documents.  In our domain, we have
> movies as the main document type, and 5 other satellite document types
> with their own indexes: Star, Studio, Director, Series, and Category
> (genre).  With the exception of Series, a movie has a many-to-many
> relationship with the other indexes.  So, with 60k movies, 20k stars,
> 2k studios, ... the document count quickly shoots through the roof.
>
> Also, the majority of our searching is based on a single domain type,
> such as movie.  It is only a small handful of corner cases where we
> want what amounts to a joined query.  If we merged these indexes, we
> would constantly have to 'roll up' the results into distinct instances
> of a type.  (The equivalent of an SQL 'group by'.)
>
> I find the parallels between the expressiveness of Lucene and SQL
> interesting.  I'm glad to see you compared what I was looking for to
> an SQL count(*) as well.  We have a handful of indexing issues that I
> am attempting to solve/optimize, of which performing a count(*) is
> only one.  I also have the need to perform a JOIN across two indexes.
> I have 'ideas' about how I might go about this, but for now we are
> fortunate enough to have fairly static data, and half of the join is
> static.  As a result we can cache a bitset filter for the results of
> half the join and apply it to the other (dynamic) half of the join
> query.
>
> Anyway, I digress...
>
> I saw your second post regarding creating a scorer.  I'd like to
> continue down that path.  My main issue now is simply understanding
> how Lucene works under the covers well enough to write the TermQuery
> variant.
>
> Thanks for the help,
> Russell.
>
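
The cached-bitset approach Russell describes for the half-static join can be sketched in plain Java as follows. This is an illustration under stated assumptions, not his implementation: it uses only `java.util.BitSet` and no Lucene classes, it assumes both halves of the join are expressed over the same document-id space, and the names `CachedJoinSketch`, `join`, and `bits` are hypothetical.

```java
import java.util.BitSet;

// Hypothetical illustration only: no Lucene classes are used here.
final class CachedJoinSketch {

    // Result of the static half of the join, computed once and reused.
    private final BitSet staticHalf;

    CachedJoinSketch(BitSet staticHalf) {
        // Defensive copy so later changes by the caller cannot corrupt the cache.
        this.staticHalf = (BitSet) staticHalf.clone();
    }

    // Intersects the hits of the dynamic query with the cached static half.
    // Neither input is modified; a new BitSet is returned.
    BitSet join(BitSet dynamicHits) {
        BitSet result = (BitSet) dynamicHits.clone();
        result.and(staticHalf);
        return result;
    }

    // Convenience helper: builds a BitSet from a list of document ids.
    static BitSet bits(int... docIds) {
        BitSet set = new BitSet();
        for (int id : docIds) {
            set.set(id);
        }
        return set;
    }

    public static void main(String[] args) {
        // Assumed data: ids matched by the static side, cached once.
        CachedJoinSketch cache = new CachedJoinSketch(bits(1, 3, 5, 8));

        // Ids matched by the dynamic side, computed per query.
        BitSet joined = cache.join(bits(3, 4, 5, 9));

        System.out.println(joined);               // ids present in both halves
        System.out.println(joined.cardinality()); // count(*)-style total
    }
}
```

Because the intersection is a bitwise AND over cached bits, the static half is paid for once, and `cardinality()` on the result gives a count without iterating over the matching documents.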


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org



