lucene-java-user mailing list archives

From Erick Erickson <>
Subject Re: Federated relevance ranking
Date Fri, 03 Jun 2011 01:20:06 GMT
My gut feel is there isn't really a good solution to intermingling the
results, since
they come from different sources, index different kinds of data, etc.
The irreducible
problem is that a hit in one index is not comparable to a hit in
another, either from
a Lucene scoring perspective or from the user's viewpoint.
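To see why the scores aren't comparable, consider just the idf component of Lucene's classic TF-IDF formula, computed independently per index. A tiny stand-alone sketch (all counts are made up for illustration):

```java
// Illustration: the same term gets very different idf weights in two
// independently built indexes, so raw scores live on different scales.
public class IdfSketch {
    // Lucene's classic (TF-IDF) idf: 1 + ln(numDocs / (docFreq + 1))
    static double idf(long numDocs, long docFreq) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    public static void main(String[] args) {
        // Index A: small, and the term is rare there
        double idfA = idf(10_000, 5);
        // Index B: large, and the same term is common
        double idfB = idf(5_000_000, 400_000);
        System.out.printf("idf in A = %.3f, idf in B = %.3f%n", idfA, idfB);
        // The same query term is weighted far more heavily in A than in B,
        // before tf, norms, or boosts even enter the picture.
    }
}
```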

What people will do is fiddle with the ranking process (omitting
norms, etc), but that's
pretty clumsy all told.

Another approach is to look at it from a presentation perspective. How
can the various
types of documents be presented to a user helpfully? Does it make
sense to present,
say, tabs across the top for each data source? Present the top 3 items
from each source
and let the user follow links for more results from that datasource?
Facet on the
data sources?
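If the "top 3 from each source" presentation appeals, the grouping itself is cheap once you have the hits back. A plain-Java sketch of that presentation layer (the Hit record and its fields are hypothetical, not a Lucene API):

```java
import java.util.*;

// Presentation-layer sketch: group already-fetched hits by source and keep
// the top N per source. Scores are only ever compared *within* one source,
// where they come from the same index and are meaningful.
public class TopPerSource {
    record Hit(String source, String title, float score) {}

    static Map<String, List<Hit>> topPerSource(List<Hit> hits, int n) {
        Map<String, List<Hit>> bySource = new LinkedHashMap<>();
        for (Hit h : hits) {
            bySource.computeIfAbsent(h.source(), k -> new ArrayList<>()).add(h);
        }
        for (List<Hit> group : bySource.values()) {
            // Sort within a single source only: same index, comparable scores.
            group.sort(Comparator.comparingDouble(Hit::score).reversed());
            if (group.size() > n) group.subList(n, group.size()).clear();
        }
        return bySource;
    }

    public static void main(String[] args) {
        List<Hit> hits = List.of(
            new Hit("videos", "v1", 0.4f), new Hit("journals", "j1", 7.2f),
            new Hit("videos", "v2", 0.9f), new Hit("journals", "j2", 3.1f));
        System.out.println(topPerSource(hits, 3));
    }
}
```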

If you go this route, you can put it all in a single index and just
facet, group (4.x), filter,
etc. (Oops, got confused: grouping and faceting are Solr constructs
last I knew, but you
may want to check them out.)

In some ways this is more a UI design issue. You have different data
sets, what serves
the user best?

I guess my challenge for your product manager folks is to define what
a good user
experience is. Provide concrete use cases etc. Because I'll also guess
that you can
make Lucene do what's necessary to support the use-cases if they're
something other
than "mystically do the right thing, and I get to define the right
thing dynamically" <G>...

I remember when "indexes" became acceptable, and I still think it's wrong...


On Thu, Jun 2, 2011 at 7:49 PM, Clint Gilbert
<> wrote:
> Thank you very much for your reply.  Yeah, our indexes (indices?)
> contain different types and amounts of data. :( The data being indexed
> is all the same format - RDF - but it describes different numbers and
> kinds of things.
> What is your gut feeling on whether or not it's a good idea for us to
> roll our own?  Katta is a contender, but we already have a fairly
> complex system, and adding anything Hadoop-related feels like it might
> push us over a tipping point into the realm of unwieldy overcomplexity.
>  But, this is a hard problem after all, so some amount of complexity is
> inevitable.
> On 06/02/2011 07:05 PM, Erick Erickson wrote:
>> As you've found out, raw scores certainly aren't comparable across
>> different indexes
>> #unless# the documents are fairly distributed. You're talking large
>> indexes here,
>> so if the documents are balanced across all your indexes, the results should be
>> pretty comparable. This pre-supposes that the indexes share a common schema
>> and that the distributions of terms are "close enough to identical" to be truly
>> comparable. And it supposes that your indexes are similar in
>> character. It wouldn't
>> work if one of your indexes had, say, meta-data from videos and another had
>> scholarly journal articles.
>> Otherwise, there's work going on in Solr that might help, although I
>> don't know when
>> that'll be available.
>> Other than that, I don't know what to suggest. It's not an easy
>> problem or Solr/Lucene
>> would already have solved it... siiiggggh.
>> Best
>> Erick
>> On Thu, Jun 2, 2011 at 3:51 PM, Clint Gilbert
>> <> wrote:
>> Hi everyone,
>> I searched the list archives, but couldn't find a question that closely
>> matches mine.
>> The project I'm working on is designed to allow searching a distributed
>> collection of data repositories.  Currently, we index each repository to
>> build a central Lucene index.  This works ok, but for practical (the
>> central index is getting very large) and architectural (decentralization
>> is a design goal) reasons, we'd like to distribute the index.
>> In the past, we had a basic federation system in place: when a user
>> submitted a query, the query was broadcast to each data repository,
>> which had its own independent Lucene index.  Results from each repo were
>> aggregated in reverse order.
>> The problem was, of course, that since each index was constructed
>> independently of all the others, and documents are distributed in the
>> repos unevenly, it was impossible to rank the results from all the
>> indices in a meaningful way.  We basically punted and interleaved
>> results, which gave a bad user experience, hence the temporary
>> switch to a central index.
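The round-robin interleaving described above takes only a few lines of plain Java (no Lucene dependency; names are illustrative), which also shows why it ignores relevance entirely: position in each list is all that matters.

```java
import java.util.*;

// Round-robin interleave: take the i-th result from each repository's list
// in turn. Relevance plays no role across repos, only rank within each.
public class Interleave {
    static <T> List<T> interleave(List<List<T>> perRepo) {
        List<T> merged = new ArrayList<>();
        int i = 0;
        boolean more = true;
        while (more) {
            more = false;
            for (List<T> repo : perRepo) {
                if (i < repo.size()) {
                    merged.add(repo.get(i));
                    more = true;
                }
            }
            i++;
        }
        return merged;
    }

    public static void main(String[] args) {
        System.out.println(interleave(
            List.of(List.of("a1", "a2"), List.of("b1"))));
    }
}
```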
>> So, what options exist for searching distributed collections of Lucene
>> indices and ranking results meaningfully?
>> Katta seems promising, but I don't know enough about it yet.  It also
>> seems to want to open its own ports for RPC.  I'd prefer something that
>> could tunnel over HTTP to minimize firewall drama.  (We will have 10s
>> and then 100s of data repos running in separate locations.)
>> We're also considering a home-grown scheme involving normalizing the
>> denominators of all the index components in all our indices, based on
>> the sums of counts obtained from all the indices.  This feels like
>> re-inventing the wheel, and it's not clear to me yet that the low-level
>> manipulation of indices that we'd need to do is even possible.
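One reading of the scheme sketched above: each repo reports its per-term document counts, a collection-wide idf is derived from the sums, and each hit's raw score is rescaled by the ratio of global to local idf. The sketch below assumes Lucene's classic TF-IDF, where idf enters the score roughly squared (once via the query weight, once via the field weight); all class and field names are hypothetical, not an existing API.

```java
import java.util.*;

// Sketch of normalizing scores with collection-wide term statistics.
public class GlobalIdf {
    record ShardStats(long numDocs, long docFreq) {}

    // Classic TF-IDF idf over the union of all shards.
    static double globalIdf(List<ShardStats> shards) {
        long numDocs = 0, docFreq = 0;
        for (ShardStats s : shards) {
            numDocs += s.numDocs();
            docFreq += s.docFreq();
        }
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    // Rescale a hit's raw score: divide out the local idf contribution and
    // apply the global one. The squared ratio reflects the assumption that
    // idf appears twice in the classic scoring formula.
    static double rescore(double rawScore, double localIdf, double gIdf) {
        double ratio = gIdf / localIdf;
        return rawScore * ratio * ratio;
    }
}
```

Even this sketch only handles a single query term; multi-term queries, norms, and boosts would each need the same treatment, which is why it starts to feel like re-implementing the scorer.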
>> Any suggestions for distributing indices while ranking results well are
>> very welcome!
