lucene-java-user mailing list archives

From Clint Gilbert <>
Subject Re: Federated relevance ranking
Date Thu, 02 Jun 2011 23:49:41 GMT

Thank you very much for your reply.  Yeah, our indexes (indices?)
contain different types and amounts of data. :( The data being indexed
is all the same format - RDF - but it describes different numbers and
kinds of things.

What is your gut feeling on whether or not it's a good idea for us to
roll our own?  Katta is a contender, but we already have a fairly
complex system, and adding anything Hadoop-related feels like it might
push us over a tipping point into the realm of unwieldy overcomplexity.
 But, this is a hard problem after all, so some amount of complexity is

On 06/02/2011 07:05 PM, Erick Erickson wrote:
> As you've found out, raw scores certainly aren't comparable across
> different indexes *unless* the documents are distributed fairly evenly.
> You're talking large indexes here, so if the documents are balanced
> across all your indexes, the results should be pretty comparable. This
> presupposes that the indexes share a common schema and that the
> distributions of terms are "close enough to identical" to be truly
> comparable. And it supposes that your indexes are similar in character.
> It wouldn't work if one of your indexes had, say, meta-data from videos
> and another had scholarly journal articles.
> Otherwise, there's work going on in Solr that might help, although I
> don't know when that'll be available.
> Other than that, I don't know what to suggest. It's not an easy problem
> or Solr/Lucene would already have solved it.. siiiggggh.
> Best
> Erick
> On Thu, Jun 2, 2011 at 3:51 PM, Clint Gilbert
> <> wrote:
> Hi everyone,
> I searched the list archives, but couldn't find a question that closely
> matches mine.
> The project I'm working on is designed to allow searching a distributed
> collection of data repositories.  Currently, we index each repository to
> build a central Lucene index.  This works ok, but for practical (the
> central index is getting very large) and architectural (decentralization
> is a design goal) reasons, we'd like to distribute the index.
> In the past, we had a basic federation system in place: when a user
> submitted a query, the query was broadcast to each data repository,
> which had its own independent Lucene index.  Results from each repo were
> aggregated in reverse order.
> The problem was, of course, that since each index was constructed
> independently of all the others, and documents are distributed in the
> repos unevenly, it was impossible to rank the results from all the
> indices in a meaningful way.  We basically punted and interleaved
> results, which gave a bad user experience, hence the temporary
> switch to a central index.
> So, what options exist for searching distributed collections of Lucene
> indices and ranking results meaningfully?
> Katta seems promising, but I don't know enough about it yet.  It also
> seems to want to open its own ports for RPC.  I'd prefer something that
> could tunnel over HTTP to minimize firewall drama.  (We will have 10s
> and then 100s of data repos running in separate locations.)
> We're also considering a home-grown scheme involving normalizing the
> denominators of all the index components in all our indices, based on
> the sums of counts obtained from all the indices.  This feels like
> re-inventing the wheel, and it's not clear to me yet that the low-level
> manipulation of indices that we'd need to do is even possible.
> Any suggestions for distributing indices while ranking results well are
> very welcome!
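
To make the "sums of counts obtained from all the indices" idea from the quoted question concrete, here is a minimal sketch. It assumes a deliberately simplified TF-IDF score, not Lucene's actual scoring formula, and every name in it (local_stats, merge_stats, score, the two toy repositories) is hypothetical. Each repository reports its document count and per-term document frequencies; the sums are then used by every repository when scoring.

```python
import math
from collections import Counter

def local_stats(docs):
    """Return (doc_count, per-term document frequencies) for one repository."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.split()))  # count each term once per document
    return len(docs), df

def merge_stats(stats):
    """Sum document counts and document frequencies across repositories."""
    total_docs = sum(n for n, _ in stats)
    total_df = Counter()
    for _, df in stats:
        total_df.update(df)  # Counter.update adds counts
    return total_docs, total_df

def score(query, doc, num_docs, df):
    """Simplified TF-IDF (an assumption for illustration, not Lucene's formula)."""
    tf = Counter(doc.split())
    total = 0.0
    for term in query.split():
        if term in tf:
            idf = 1.0 + math.log(num_docs / (df[term] + 1))
            total += math.sqrt(tf[term]) * idf
    return total

# Two toy repositories with very different term distributions.
repo_a = ["gene protein", "gene disease", "gene pathway"]
repo_b = ["gene protein", "drug trial", "drug dose", "drug safety"]

stats_a = local_stats(repo_a)
stats_b = local_stats(repo_b)

# The same document scored with each repository's own statistics.
local_a = score("gene", "gene protein", *stats_a)
local_b = score("gene", "gene protein", *stats_b)

# The same document scored with the summed, collection-wide statistics.
global_stats = merge_stats([stats_a, stats_b])
global_a = score("gene", "gene protein", *global_stats)
global_b = score("gene", "gene protein", *global_stats)

print(local_a, local_b)    # differ: "gene" is common in A, rare in B
print(global_a, global_b)  # identical: both use the same global counts
```

With local statistics the identical document gets a different score in each repository, which is why merged result lists rank poorly. With the summed statistics the scores agree, at the cost of an extra round of exchanging counts between repositories. Whether those global counts can be fed into a real Lucene index at query time is a separate question that this sketch does not answer.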
---------------------------------------------------------------------
To unsubscribe, e-mail:
For additional commands, e-mail:
